...

Code Block

language	java

// specifies which Dataset to write to and handles the delegation to the appropriate OutputFormat:
mos.write(String datasetName, KEY key, VALUE value);
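
The routing behavior described in the comment above can be illustrated with a minimal in-memory sketch. Everything in it is an assumption made for illustration: SketchMultipleOutputs, addOutput, and recordsFor are hypothetical names, and a real implementation would hand records to each Dataset's OutputFormat rather than to a list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the proposed MultipleOutputs class (not the actual API).
class SketchMultipleOutputs<K, V> {
  // One in-memory "sink" per registered dataset name.
  private final Map<String, List<String>> written = new HashMap<>();

  // Registers a dataset as a valid output, mirroring context.addOutput(datasetName).
  void addOutput(String datasetName) {
    written.putIfAbsent(datasetName, new ArrayList<>());
  }

  // Routes the record to the sink registered under datasetName.
  void write(String datasetName, K key, V value) {
    List<String> sink = written.get(datasetName);
    if (sink == null) {
      throw new IllegalArgumentException("Dataset not registered as an output: " + datasetName);
    }
    sink.add(key + "=" + value);
  }

  // Returns the records written so far to the given dataset (empty if none).
  List<String> recordsFor(String datasetName) {
    return written.getOrDefault(datasetName, new ArrayList<>());
  }
}

class RoutingSketch {
  public static void main(String[] args) {
    SketchMultipleOutputs<String, Long> mos = new SketchMultipleOutputs<>();
    mos.addOutput("cleanCounts");
    mos.addOutput("invalidCounts");
    mos.write("cleanCounts", "apple", 3L);
    mos.write("invalidCounts", "???", 1L);
    System.out.println(mos.recordsFor("cleanCounts"));
    System.out.println(mos.recordsFor("invalidCounts"));
  }
}
```

Rejecting writes to an unregistered name is one possible design choice; it surfaces a misspelled dataset name at the first write instead of silently dropping records.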

New APIs (in BatchSinkContext, used in prepareRun of the BatchSink):

Code Block

language	java

// adds a Dataset to the set of output Datasets for the Adapter job:
context.addOutput(String datasetName);
context.addOutput(String datasetName, Dataset dataset);

Example Usage:

Code Block

language	java

public void beforeSubmit(MapReduceContext context) throws Exception {
  context.addOutput("cleanCounts");
  context.addOutput("invalidCounts");
  // ...
}

public static class Counter extends Reducer<Text, IntWritable, byte[], Long> {
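  // assumed to be initialized before reduce() is called (e.g. in setup()); construction is not specified here
  private MultipleOutputs mos;

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // do computation (assumed here: sum the counts for the key) and output to the desired dataset
    long val = 0;
    for (IntWritable value : values) {
      val += value.get();
    }
    if ( ... ) {
      // copyBytes() returns only the valid bytes; getBytes() may include padding from the backing array
      mos.write("cleanCounts", key.copyBytes(), val);
    } else {
      mos.write("invalidCounts", key.copyBytes(), val);
    }
  }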
}

...

Deprecate setting the output dataset in the configure method, since it offers nothing over setting it in beforeSubmit.

The new APIs in BatchSinkContext will simply delegate to MapReduceContext's new APIs for multiple output Datasets.
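
The delegation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: OutputRegistry, SketchMapReduceContext, and SketchBatchSinkContext are hypothetical stand-ins for the real context types, and only the single-argument addOutput(String datasetName) is modeled.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical interface capturing the addOutput call shared by both contexts.
interface OutputRegistry {
  void addOutput(String datasetName);
}

// Hypothetical MapReduce-side context that records the output Datasets of the job.
class SketchMapReduceContext implements OutputRegistry {
  private final Set<String> outputs = new LinkedHashSet<>();

  @Override
  public void addOutput(String datasetName) {
    outputs.add(datasetName);
  }

  Set<String> getOutputs() {
    return outputs;
  }
}

// Hypothetical sink-side context: it keeps no output state and forwards every call.
class SketchBatchSinkContext implements OutputRegistry {
  private final OutputRegistry delegate;

  SketchBatchSinkContext(OutputRegistry delegate) {
    this.delegate = delegate;
  }

  @Override
  public void addOutput(String datasetName) {
    delegate.addOutput(datasetName);
  }
}

class DelegationSketch {
  public static void main(String[] args) {
    SketchMapReduceContext mrContext = new SketchMapReduceContext();
    SketchBatchSinkContext sinkContext = new SketchBatchSinkContext(mrContext);
    // Calls made in the sink's prepareRun end up on the MapReduce context.
    sinkContext.addOutput("cleanCounts");
    sinkContext.addOutput("invalidCounts");
    System.out.println(mrContext.getOutputs());
  }
}
```

Keeping the sink context stateless means the MapReduce context remains the single place where the job's output Datasets are tracked.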

Questions:

The name of the MultipleOutputs class that we expose is open to change.
Should we allow the user to write to non-Dataset files from our MultipleOutputs class? I suggest no, for simplicity. This would rule out writing to both a Dataset and non-Dataset files from the same MapReduce job.
Should we prevent users from calling context.write(k, v) directly once multiple Datasets have been set as the output?

...

Versions Compared

Old Version 11

New Version 12

