Chapter 1. Introduction to Oozie
In this chapter, we cover some of the background and motivations that led to the creation of Oozie, explaining the challenges developers faced as they started building complex applications running on Hadoop. We also introduce you to a simple Oozie application. The chapter wraps up by covering the different Oozie releases, their main features, their timeline, compatibility considerations, and some interesting statistics from large Oozie deployments.
Big Data Processing
Within a very short period of time, Apache Hadoop, an open source implementation of the systems described in Google’s MapReduce and Google File System papers, has become the de facto platform for processing and storing big data.
Higher-level domain-specific languages (DSLs) implemented on top of Hadoop’s MapReduce, such as Pig and Hive, quickly followed, making it simpler to write applications running on Hadoop.
A Recurrent Problem
Hadoop, Pig, Hive, and many other projects provide the foundation for storing and processing large amounts of data in an efficient way. Most of the time, it is not possible to perform all required processing with a single MapReduce, Pig, or Hive job. Multiple MapReduce, Pig, or Hive jobs often need to be chained together, producing and consuming intermediate data and coordinating their flow of execution.
Tip
Throughout the book, we use the term Hadoop job for a MapReduce, Pig, Hive, or any other type of job that runs one or more MapReduce jobs on a Hadoop cluster. We ...