...

  • User stories documented (Ali)
  • User stories reviewed (Nitin)
  • Design documented (Ali)
  • Design reviewed (Albert/Terence/Andreas)
  • Feature merged (Ali)
  • Examples and guides (Ali)
  • Integration tests (Ali)
  • Documentation for feature (Ali)
  • Blog post

Use Cases

  1. A developer wants to compute an aggregation (for instance, word count) across data that is stored in multiple datasets.
  2. Joining in MapReduce:
    A developer wants to load data from a 'customers' dataset which has the customer's details. The developer then wants to load a 'transactions' dataset which holds less information about the customer, but more about a particular transaction. The developer should be able to join the data of these two datasets. (see Use Case #2 on Cask Hydrator++).
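Use Case #2 is a classic reduce-side join. Independent of whatever API CDAP ultimately exposes, the core idea can be sketched in plain Java: each record is tagged with the dataset it came from, records are grouped by the join key, and the join happens within each group, mimicking what a single Reducer sees for one key. All names here (`Tagged`, `join`, the sample data) are illustrative, not part of any proposed API.

```java
import java.util.*;
import java.util.stream.*;

public class ReduceSideJoinSketch {
    // Tagged record: which source dataset it came from, the join key, and the payload.
    record Tagged(String source, String key, String value) {}

    // Group records from both "datasets" by customer id, then join within each group,
    // the way a single Reducer would for each key.
    static Map<String, List<String>> join(List<Tagged> records) {
        Map<String, List<Tagged>> byKey = records.stream()
                .collect(Collectors.groupingBy(Tagged::key));
        Map<String, List<String>> out = new TreeMap<>();
        for (var e : byKey.entrySet()) {
            List<String> customers = e.getValue().stream()
                    .filter(t -> t.source().equals("customers"))
                    .map(Tagged::value).toList();
            List<String> txns = e.getValue().stream()
                    .filter(t -> t.source().equals("transactions"))
                    .map(Tagged::value).toList();
            // Emit the cross product of customer details and transactions for this key.
            List<String> joined = new ArrayList<>();
            for (String c : customers)
                for (String t : txns)
                    joined.add(c + "|" + t);
            out.put(e.getKey(), joined);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tagged> records = List.of(
                new Tagged("customers", "c1", "Alice"),
                new Tagged("transactions", "c1", "txn-100"),
                new Tagged("transactions", "c1", "txn-101"),
                new Tagged("customers", "c2", "Bob"));
        // "c1" joins to two transactions; "c2" has no transactions, so it joins to nothing.
        System.out.println(join(records));
    }
}
```

In a real MapReduce job the tagging happens in the Mappers (one per dataset) and the grouping is done by the shuffle; the sketch only shows the data flow.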

User Stories

  1. A developer should be able to set multiple datasets as input to one MapReduce job.
    1. All input datasets have the same type.
    2. The input datasets have different types (this requires a different Mapper class per dataset). Note that the restriction here is that all of the Mappers must have the same output type, since a single Reducer class processes their combined output.
  2. Performing a join on a particular key, from two different datasets (see Use Case #2 on Cask Hydrator++).
  3. A developer should be able to read from different partitions of a PartitionedFileSet (for example, multiple time ranges of a TimePartitionedFileSet).
  4. A developer should be able to know which input they are processing data from, in their Mapper/Reducer.
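User story 1(b) hinges on one constraint: every Mapper may consume a different input type, but all of them must agree on a single intermediate key/value type so that one Reducer can process the merged output. A framework-free sketch of that shape (the class and method names are illustrative, not the proposed CDAP API):

```java
import java.util.*;
import java.util.function.BiConsumer;

public class MultiInputSketch {
    // Both "mappers" below read different input types but emit the same
    // intermediate type (String key, Integer value), so one reducer works.
    interface Emitter extends BiConsumer<String, Integer> {}

    // Mapper over a line-oriented dataset: emit (word, 1) per word.
    static void textMapper(String line, Emitter emit) {
        for (String w : line.split("\\s+")) emit.accept(w, 1);
    }

    // Mapper over a structured dataset: emit (word, count) directly.
    static void countMapper(Map.Entry<String, Integer> row, Emitter emit) {
        emit.accept(row.getKey(), row.getValue());
    }

    // Single "reducer": sum values per key, regardless of which mapper emitted them.
    static Map<String, Integer> run(List<String> lines, Map<String, Integer> table) {
        Map<String, Integer> totals = new TreeMap<>();
        Emitter emit = (k, v) -> totals.merge(k, v, Integer::sum);
        lines.forEach(l -> textMapper(l, emit));
        table.entrySet().forEach(r -> countMapper(r, emit));
        return totals;
    }

    public static void main(String[] args) {
        // "a" gets 2 from the text input plus 5 from the table input.
        System.out.println(run(List.of("a b a"), Map.of("a", 5, "c", 2)));
    }
}
```

If the two mappers emitted different types, no single reducer signature could accept both, which is exactly the restriction noted in story 1(b).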

...

Approach for CDAP-3980

There already exists a MultipleInputs class in Hadoop, which supports MapReduce jobs that have multiple input paths, with a different InputFormat and Mapper for each path.
Two downsides to this are:

...