Goals
- JIRA: CDAP-3980: Multiple input datasets for MapReduce.
Checklist
- User stories documented (Ali)
- User stories reviewed (Nitin)
- Design documented (Ali)
- Design reviewed (Albert/Terence/Andreas)
- Feature merged (Ali)
- Examples and guides (Ali)
- Integration tests (Ali)
- Documentation for feature (Ali)
- Blog post
Misc Section
Use Cases
- TODO...
User Stories
- The user wants to process data from multiple datasets in one MapReduce job.
- The user wants to load data from a 'users' table that contains a user id and various attributes such as age, gender, and email. The user also wants to load data from a 'purchases' table that contains a user id, item id, purchase time, and purchase price. The user then wants to join both tables on user id and run a collaborative filtering algorithm to generate a model that can be used to recommend shopping items to users. (from Cask Hydrator++)
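The join in the second user story is essentially a reduce-side join on user id. The following plain-Java sketch (class and field names are hypothetical, not part of any CDAP API) simulates what the MapReduce job would compute: records from both inputs are matched on user id before being handed to the model-training step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulates a reduce-side join of 'users' and 'purchases' on user id.
// Table layouts here are illustrative only.
public class JoinSketch {
  public static Map<String, List<String>> join(
      Map<String, String> users,     // userId -> user attributes (e.g. "age,gender")
      List<String[]> purchases) {    // each entry: {userId, itemId}
    Map<String, List<String>> joined = new HashMap<>();
    for (String[] purchase : purchases) {
      String attrs = users.get(purchase[0]);
      if (attrs != null) {           // inner join: keep only matching user ids
        joined.computeIfAbsent(purchase[0], k -> new ArrayList<>())
              .add(attrs + "," + purchase[1]);
      }
    }
    return joined;
  }
}
```

In the real MapReduce job, each input would have its own mapper tagging records by source, and the grouping on user id would happen in the reducer; this sketch collapses that into a single in-memory pass.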
Approach for CDAP-3980
Hadoop already provides a MultipleInputs class, which supports MapReduce jobs that read from multiple input paths, with a different InputFormat and Mapper for each path.
This approach has two downsides:
- If a user relies on this functionality, their mapper class can no longer implement ProgramLifeCycle<MapReduceTaskContext> and expect its initialize/destroy methods to be called.
- Datasets cannot be used as input with this implementation.
The APIs exposed by Hadoop's MultipleInputs class:
public static void addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass);
public static void addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass, Class<? extends Mapper> mapperClass);
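For reference, this is how a job driver uses the Hadoop API today; the paths and mapper classes below are hypothetical (this is a configuration fragment, not a complete driver).

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Each path gets its own InputFormat and Mapper; UserMapper and
// PurchaseMapper are illustrative user-defined classes.
Job job = Job.getInstance(conf, "user-purchase-join");
MultipleInputs.addInputPath(job, new Path("/data/users"),
                            TextInputFormat.class, UserMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/purchases"),
                            TextInputFormat.class, PurchaseMapper.class);
```

Note that both arguments are filesystem paths, which is exactly why datasets cannot participate in this mechanism.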
There are several 'setInput' methods on MapReduceContext. These will be deprecated and replaced by corresponding addInput methods.
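One possible shape for the replacement API is sketched below. The signatures are an assumption for discussion, not the final design; the Input type and its factory methods are hypothetical names.

```java
// Hypothetical sketch of the proposed additions to MapReduceContext.
// An Input abstraction would let datasets (not just paths) be inputs,
// and an optional per-input mapper class mirrors Hadoop's MultipleInputs.
public interface MapReduceContext {
  // uses the job's default mapper for this input
  void addInput(Input input);

  // associates a specific mapper class with this input
  void addInput(Input input, Class<?> mapperClass);
}
```

Because the mapper class is registered through CDAP rather than directly with Hadoop, the framework could retain the ability to call ProgramLifeCycle initialize/destroy on each mapper, addressing the first downside above.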
Work in progress...