...

With the transaction support mentioned above, we can have the SparkExecutionContext return a DatasetContext that is serializable, so that it can be used inside a closure function. The remaining challenge is when to flush the dataset (mainly the Table dataset). We can use an approach similar to the MapReduce case, where we do periodic flushes. We also need to add a TaskCompletionListener so that we can do a final flush when the task execution completes; the listener can be added when DatasetContext.getDataset is called from the closure function (see the sketch below). In addition, we need to modify the Table implementation hierarchy to make the effect of BufferingTable optional. Turning off buffering will impact direct writes through the dataset (not writes done by saving an RDD to a dataset). However, this should be acceptable, because it is never a good idea to write to a dataset from a Spark function: functions are expected to have no side effects, and most of those writes can be done by saving an RDD to a dataset.
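
Below is a minimal Scala sketch of how the final flush could be hooked in when a dataset is acquired from inside a closure. Only TaskContext and TaskCompletionListener are actual Spark APIs; SerializableDatasetContext, Flushable, and instantiateDataset are hypothetical names used for illustration, not the real CDAP classes.

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.TaskContext
import org.apache.spark.util.TaskCompletionListener

// Hypothetical trait standing in for datasets that buffer writes (e.g. Table).
trait Flushable {
  def flush(): Unit
}

// Hypothetical serializable stand-in for the DatasetContext handed to closures.
class SerializableDatasetContext extends Serializable {

  // Datasets acquired by the current task; transient so the buffer is
  // recreated on the executor rather than serialized with the closure.
  @transient private lazy val acquired = ArrayBuffer[Flushable]()

  def getDataset[T <: Flushable](name: String): T = {
    val dataset = instantiateDataset[T](name)
    // Register a final flush the first time a dataset is acquired in this
    // task, so buffered writes are persisted when the task completes.
    if (acquired.isEmpty) {
      TaskContext.get().addTaskCompletionListener(new TaskCompletionListener {
        override def onTaskCompletion(context: TaskContext): Unit =
          acquired.foreach(_.flush())
      })
    }
    acquired += dataset
    dataset
  }

  // Placeholder: the real implementation would look the dataset up through the
  // transactional dataset framework and also schedule periodic flushes, as in
  // the MapReduce case.
  private def instantiateDataset[T](name: String): T = ???
}
```

Registering the listener lazily, on the first getDataset call from the closure, ensures the flush runs on the executor that actually acquired the datasets and avoids registering listeners for tasks that never touch a dataset.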