...

  1. A developer should be able to set multiple datasets as input to one MapReduce job.
    1. All datasets have the same type.
    2. The datasets have different types (this requires a different Mapper class per dataset). Note the restriction that all of the Mappers must have the same output key/value types, since a single Reducer class consumes them; see the sketch after this list.
  2. A developer should be able to read from different partitions of a PartitionedFileSet (for example, multiple time ranges of a TimePartitionedFileSet).
  3. A developer should be able to determine, within their Mapper/Reducer, which input the data currently being processed came from.
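
To make requirement 1.b concrete, here is a minimal Hadoop-style sketch (illustrative class and type names, not the CDAP API): two Mapper classes read records of different types but emit the same output key/value types (Text, IntWritable), so a single Reducer class can consume the combined output.

  import java.io.IOException;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Reads plain text lines (LongWritable offset, Text line); emits (word, 1).
  class TextEventMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String word : value.toString().split("\\s+")) {
        context.write(new Text(word), ONE);
      }
    }
  }

  // Reads a different record type (Text key, Text value, e.g. from a SequenceFile),
  // but emits the same output types as TextEventMapper.
  class SequenceEventMapper extends Mapper<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(value.toString()), ONE);
    }
  }

  // A single Reducer works for both inputs because the Mappers agree on their
  // output key/value types.
  class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }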

Approach for CDAP-3980 (wip)

Hadoop already provides a MultipleInputs class, which supports MapReduce jobs that read from multiple input paths, with a different InputFormat and Mapper for each path.
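
For reference, a minimal driver sketch showing how MultipleInputs is wired up (the input/output paths are illustrative, and the Mapper/Reducer classes are the ones from the sketch above):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
  import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class MultiInputDriver {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "multi-input example");
      job.setJarByClass(MultiInputDriver.class);

      // Each input path gets its own InputFormat and Mapper class.
      MultipleInputs.addInputPath(job, new Path("/data/events/text"),
                                  TextInputFormat.class, TextEventMapper.class);
      MultipleInputs.addInputPath(job, new Path("/data/events/sequence"),
                                  SequenceFileInputFormat.class, SequenceEventMapper.class);

      // Single Reducer; both Mappers emit (Text, IntWritable).
      job.setReducerClass(CountReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileOutputFormat.setOutputPath(job, new Path("/data/output"));

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }
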
Two downsides to this approach are:

...