...

Code Block
languagejava
// adds a Dataset to the set of output Datasets for the MapReduce job:
context.addOutput(String datasetName);
context.addOutput(String datasetName, Dataset dataset); 

 

New APIs (in our custom MultipleOutputs, to be used from the mapper/reducer). Note that this requires custom mapper, reducer, and context classes that override the Hadoop classes and add the ability to write to multiple outputs:

Code Block
languagejava
// specifies which Dataset to write to and handles the delegation to the appropriate OutputFormat:
mos.write(String datasetName, KEY key, VALUE value);

...

Code Block
languagejava
// adds a Dataset to the set of output Datasets for the Adapter job:
context.addOutput(String datasetName);
context.addOutput(String datasetName, Dataset dataset); 

Example Usage:

Code Block
languagejava
public void beforeSubmit(MapReduceContext context) throws Exception {
  context.addOutput("cleanCounts");
  context.addOutput("invalidCounts");
  // ...
}

public static class Counter extends Reducer<Text, IntWritable, byte[], Long> {
  // set by the ReducerWrapper before reduce() is called
  private MultipleOutputs mos;

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    // do computation (here, for illustration: sum the counts for this key)
    long val = 0;
    for (IntWritable value : values) {
      val += value.get();
    }
    // output to the desired dataset
    if ( ... ) {
      mos.write("cleanCounts", key.getBytes(), val);
    } else {
      mos.write("invalidCounts", key.getBytes(), val);
    }
  }
}

Approach:

Take an approach similar to org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.
The Datasets to be written to must be defined in advance, in the beforeSubmit of the MapReduce job.
In the mapper/reducer, the user specifies the name of the output Dataset, and our helper class (MultipleOutputs) determines the appropriate OutputFormat and configuration for writing.
The MapperWrapper and ReducerWrapper will be responsible for instantiating the MultipleOutputs class and setting it on the user's mapper/reducer, in the same way that Metrics are set. They will also be responsible for closing the MultipleOutputs object once the task finishes.
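
The delegation described above can be sketched in plain Java, without any Hadoop dependency. This is only an illustration under stated assumptions, not the actual implementation: DatasetWriter, MultipleOutputsSketch, and MultipleOutputsDemo are hypothetical names, and DatasetWriter stands in for whatever OutputFormat/RecordWriter the real helper would resolve for each Dataset. The sketch shows the three responsibilities from the approach: outputs are registered up front, write() routes by Dataset name and rejects names that were never registered, and close() releases every writer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical driver that plays the role of beforeSubmit plus a reducer. */
class MultipleOutputsDemo {
  public static void main(String[] args) {
    List<String> clean = new ArrayList<>();
    List<String> invalid = new ArrayList<>();
    // try-with-resources mirrors the wrapper closing the helper at task end
    try (MultipleOutputsSketch<String, Long> mos = new MultipleOutputsSketch<>()) {
      mos.addOutput("cleanCounts", (k, v) -> clean.add(k + "=" + v));
      mos.addOutput("invalidCounts", (k, v) -> invalid.add(k + "=" + v));
      mos.write("cleanCounts", "hadoop", 3L);
      mos.write("invalidCounts", "???", 1L);
    }
    System.out.println(clean + " " + invalid);
  }
}

/** Hypothetical stand-in for the component that writes records to one Dataset. */
interface DatasetWriter<K, V> extends AutoCloseable {
  void write(K key, V value);

  @Override
  default void close() {
    // nothing to release by default
  }
}

/** Sketch of the helper: routes each record to the writer registered under a Dataset name. */
final class MultipleOutputsSketch<K, V> implements AutoCloseable {
  private final Map<String, DatasetWriter<K, V>> writers = new LinkedHashMap<>();

  /** Registers an output ahead of time, as beforeSubmit would. */
  void addOutput(String datasetName, DatasetWriter<K, V> writer) {
    if (writers.putIfAbsent(datasetName, writer) != null) {
      throw new IllegalArgumentException("Output already added: " + datasetName);
    }
  }

  /** Delegates the record to the writer for the named Dataset. */
  void write(String datasetName, K key, V value) {
    DatasetWriter<K, V> writer = writers.get(datasetName);
    if (writer == null) {
      throw new IllegalArgumentException("Output was not added before submit: " + datasetName);
    }
    writer.write(key, value);
  }

  /** Closes every registered writer. */
  @Override
  public void close() {
    for (DatasetWriter<K, V> writer : writers.values()) {
      writer.close();
    }
  }
}
```

Failing fast on an unregistered name is one possible design choice here; it surfaces a missing addOutput call at the first write rather than silently dropping records.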

...