Apache Oozie

by Mohammad Kamrul Islam, Aravind Srinivasan

May 2015

Beginner to intermediate

272 pages

7h 22m

English

O'Reilly Media, Inc.

Contents of This Book; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
Big Data Processing; A Recurrent Problem; A Common Solution: Oozie; A Simple Oozie Job; Oozie Releases; Some Oozie Usage Numbers
Oozie Applications; Oozie Workflows; Oozie Coordinators; Oozie Bundles; Parameters, Variables, and Functions; Application Deployment Model; Oozie Architecture
Oozie Deployment; Basic Installations; Requirements; Build Oozie; Install Oozie Server; Hadoop Cluster; Start and Verify the Oozie Server; Advanced Oozie Installations; Configuring Kerberos Security; DB Setup; Shared Library Installation; Oozie Client Installations
Workflow; Actions; Action Execution Model; Action Definition; Action Types; MapReduce Action; Java Action; Pig Action; FS Action; Sub-Workflow Action; Hive Action; DistCp Action; Email Action; Shell Action; SSH Action; Sqoop Action; Synchronous Versus Asynchronous Actions
Outline of a Basic Workflow; Control Nodes; <start> and <end>; <fork> and <join>; <decision>; <kill>; <OK> and <ERROR>; Job Configuration; Global Configuration; Job XML; Inline Configuration; Launcher Configuration; Parameterization; EL Variables; EL Functions; EL Expressions; The job.properties File; Command-Line Option; The config-default.xml File; The <parameters> Section; Configuration and Parameterization Examples; Lifecycle of a Workflow; Action States
Coordinator Concept; Triggering Mechanism; Time Trigger; Data Availability Trigger; Coordinator Application and Job; Coordinator Action; Our First Coordinator Job; Coordinator Submission; Oozie Web Interface for Coordinator Jobs; Coordinator Job Lifecycle; Coordinator Action Lifecycle; Parameterization of the Coordinator; EL Functions for Frequency; Day-Based Frequency; Month-Based Frequency; Execution Controls; An Improved Coordinator
Expressing Data Dependency; Dataset; Example: Rollup; Parameterization of Dataset Instances; current(n); latest(n); Parameter Passing to Workflow; dataIn(eventName); dataOut(eventName); nominalTime(); actualTime(); dateOffset(baseTimeStamp, skipInstance, timeUnit); formatTime(timeStamp, formatString); A Complete Coordinator Application
Bundle Basics; Bundle Definition; Why Do We Need Bundles?; Bundle Specification; Execution Controls; Bundle State Transitions
Managing Libraries in Oozie; Origin of JARs in Oozie; Design Challenges; Managing Action JARs; Supporting the User’s JAR; JAR Precedence in classpath; Oozie Security; Oozie Security Overview; Oozie to Hadoop; Oozie Client to Server; Supporting Custom Credentials; Supporting New API in MapReduce Action; Supporting Uber JAR; Cron Scheduling; A Simple Cron-Based Coordinator; Oozie Cron Specification; Emulate Asynchronous Data Processing; HCatalog-Based Data Dependency
Developing Custom EL Functions; Requirements for a New EL Function; Implementing a New EL Function; Supporting Custom Action Types; Creating a Custom Synchronous Action; Overriding an Asynchronous Action Type; Implementing the New ActionMain Class; Testing the New Main Class; Creating a New Asynchronous Action; Writing an Asynchronous Action Executor; Writing the ActionMain Class; Writing Action’s Schema; Deploying the New Action Type; Using the New Action Type
Oozie CLI Tool; CLI Subcommands; Useful CLI Commands; Oozie REST API; Oozie Java Client; The oozie-site.xml File; The Oozie Purge Service; Job Monitoring; JMS-Based Monitoring; Oozie Instrumentation and Metrics; Reprocessing; Workflow Reprocessing; Coordinator Reprocessing; Bundle Reprocessing; Server Tuning; JVM Tuning; Service Settings; Oozie High Availability; Debugging in Oozie; Oozie Logs; Developing and Testing Oozie Applications; Application Deployment Tips; Common Errors and Debugging; MiniOozie and LocalOozie; The Competition

Content preview from Apache Oozie

Chapter 1. Introduction to Oozie

In this chapter, we cover some of the background and motivations that led to the creation of Oozie, explaining the challenges developers faced as they started building complex applications running on Hadoop.¹ We also introduce you to a simple Oozie application. The chapter wraps up by covering the different Oozie releases, their main features, their timeline, compatibility considerations, and some interesting statistics from large Oozie deployments.

Big Data Processing

Within a very short period of time, Apache Hadoop, an open source implementation of Google’s MapReduce paper and Google File System, has become the de facto platform for processing and storing big data.

Higher-level domain-specific languages (DSLs) implemented on top of Hadoop’s MapReduce, such as Pig² and Hive, quickly followed, making it simpler to write applications running on Hadoop.

A Recurrent Problem

Hadoop, Pig, Hive, and many other projects provide the foundation for storing and processing large amounts of data in an efficient way. Most of the time, it is not possible to perform all required processing with a single MapReduce, Pig, or Hive job. Multiple MapReduce, Pig, or Hive jobs often need to be chained together, producing and consuming intermediate data and coordinating their flow of execution.
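
To make the chaining problem concrete, here is a minimal Python sketch of the kind of hand-rolled driver a team might write without a workflow scheduler. It is not Oozie's API, and nothing in it comes from the book: the job names, the `run_job` stand-in, and the `/tmp/pipeline` paths are all hypothetical. Each job lists the jobs whose output it consumes, and the driver runs them in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each job maps to the set of jobs whose output it consumes.
dependencies = {
    "clean_logs": set(),
    "sessionize": {"clean_logs"},
    "daily_report": {"sessionize"},
}


def run_job(name, inputs):
    """Stand-in for submitting a MapReduce, Pig, or Hive job.

    A real driver would launch the job and wait for it to finish;
    here we only return the (hypothetical) path of its output.
    """
    return f"/tmp/pipeline/{name}"


def run_pipeline(deps):
    """Run every job after the jobs it depends on, passing along intermediate data."""
    outputs = {}
    order = []
    for job in TopologicalSorter(deps).static_order():
        inputs = [outputs[d] for d in sorted(deps[job])]
        outputs[job] = run_job(job, inputs)
        order.append(job)
    return order, outputs


order, outputs = run_pipeline(dependencies)
print(order)
```

Even this toy version has to track ordering and intermediate outputs by hand, and it says nothing about retries, failure handling, or scheduling, which is where the coordination burden grows.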

Tip

Throughout the book, when referring to a MapReduce, Pig, Hive, or any other type of job that runs one or more MapReduce jobs on a Hadoop cluster, we refer to it as a Hadoop job. We ...