Goals
- JIRA: CDAP-3980: Multiple input datasets for MapReduce.
Checklist
- User stories documented (Ali)
- User stories reviewed (Nitin)
- Design documented (Ali)
- Design reviewed (Albert/Terence/Andreas)
- Feature merged (Ali)
- Examples and guides (Ali)
- Integration tests (Ali)
- Documentation for feature (Ali)
- Blog post
Misc Section
Use Cases
- TODO...
User Stories
- The user wants to process data from multiple datasets in one MapReduce job.
- The user wants to load data from a 'users' table that contains a user id and various attributes such as age, gender, and email. The user also wants to load data from a 'purchases' table that contains a user id, item id, purchase time, and purchase price. The user then wants to join both tables on user id and run a collaborative filtering algorithm to generate a model that can be used to recommend shopping items to users. (from Cask Hydrator++)
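The join in the second user story is essentially a reduce-side join on user id. The following plain-Java sketch (class and field names are hypothetical, not part of any CDAP API) simulates what the MapReduce job would compute: records from both inputs are matched on user id before being handed to the model-training step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulates a reduce-side join of 'users' and 'purchases' on user id.
// Table layouts here are illustrative only.
public class JoinSketch {
  public static Map<String, List<String>> join(
      Map<String, String> users,     // userId -> user attributes (e.g. "age,gender")
      List<String[]> purchases) {    // each entry: {userId, itemId}
    Map<String, List<String>> joined = new HashMap<>();
    for (String[] purchase : purchases) {
      String attrs = users.get(purchase[0]);
      if (attrs != null) {           // inner join: keep only matching user ids
        joined.computeIfAbsent(purchase[0], k -> new ArrayList<>())
              .add(attrs + "," + purchase[1]);
      }
    }
    return joined;
  }
}
```

In the real MapReduce job, each input would have its own mapper tagging records by source, and the grouping on user id would happen in the reducer; this sketch collapses that into a single in-memory pass.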
Approach for CDAP-3980
Hadoop already provides a MultipleInputs class, which supports MapReduce jobs that read from multiple input paths, with a different InputFormat and Mapper for each path.
This approach has two downsides:
- If a user relies on this functionality, their mapper class can no longer implement ProgramLifeCycle<MapReduceTaskContext> and expect its initialize/destroy methods to be called.
- Datasets cannot be used as input with this implementation.
The APIs exposed by Hadoop's MultipleInputs class:
public static void addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass);
public static void addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass, Class<? extends Mapper> mapperClass);
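For reference, this is how a job driver uses the Hadoop API today; the paths and mapper classes below are hypothetical (this is a configuration fragment, not a complete driver).

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Each path gets its own InputFormat and Mapper; UserMapper and
// PurchaseMapper are illustrative user-defined classes.
Job job = Job.getInstance(conf, "user-purchase-join");
MultipleInputs.addInputPath(job, new Path("/data/users"),
                            TextInputFormat.class, UserMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/purchases"),
                            TextInputFormat.class, PurchaseMapper.class);
```

Note that both arguments are filesystem paths, which is exactly why datasets cannot participate in this mechanism.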
There are several 'setInput' methods on MapReduceContext. These will be deprecated and replaced by corresponding addInput methods.
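One possible shape for the replacement API is sketched below. The signatures are an assumption for discussion, not the final design; the Input type and its factory methods are hypothetical names.

```java
// Hypothetical sketch of the proposed additions to MapReduceContext.
// An Input abstraction would let datasets (not just paths) be inputs,
// and an optional per-input mapper class mirrors Hadoop's MultipleInputs.
public interface MapReduceContext {
  // uses the job's default mapper for this input
  void addInput(Input input);

  // associates a specific mapper class with this input
  void addInput(Input input, Class<?> mapperClass);
}
```

Because the mapper class is registered through CDAP rather than directly with Hadoop, the framework could retain the ability to call ProgramLifeCycle initialize/destroy on each mapper, addressing the first downside above.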
Work in progress...