...
With the transaction support mentioned above, we can have the SparkExecutionContext
return a DatasetContext
that is serializable, so that it can be used inside a closure function. The only challenge left is when to flush the dataset (mainly Table datasets). We can use a similar approach as in the MapReduce case, where we do periodic flushes. In addition, we need to add a TaskCompletionListener
so that we can do a final flush when the task execution completes. The listener can be added when DatasetContext.getDataset
is called from the closure function, as sketched below.
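For illustration, here is a minimal sketch of how a serializable DatasetContext could be used from a closure, with a TaskCompletionListener performing the final flush. The ClosureDatasetContext trait, the getDatasetWithFinalFlush helper, and the flush interval are assumptions made up for this example, not the actual API.

```scala
import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.TaskCompletionListener

// Hypothetical serializable view of DatasetContext; names and signatures are assumptions.
trait ClosureDatasetContext extends Serializable {
  def getDataset[T](name: String): T
  def flush(): Unit
}

// Sketch of how getDataset could hook a final flush into the running Spark task.
// A real implementation would register the listener only once per task.
def getDatasetWithFinalFlush[T](ctx: ClosureDatasetContext, name: String): T = {
  TaskContext.get().addTaskCompletionListener(new TaskCompletionListener {
    override def onTaskCompletion(taskContext: TaskContext): Unit = ctx.flush()
  })
  ctx.getDataset[T](name)
}

// Usage inside a closure: periodic flushes plus the final flush from the listener.
def writeFromClosure(rdd: RDD[String], ctx: ClosureDatasetContext): Unit = {
  rdd.foreachPartition { records =>
    val table = getDatasetWithFinalFlush[AnyRef](ctx, "myTable")
    records.zipWithIndex.foreach { case (record, i) =>
      // ... put `record` into `table` ...
      if (i % 1000 == 999) ctx.flush()   // periodic flush, as in the MapReduce case
    }
  }
}
```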
We also need to modify the Table
implementation hierarchy to make the buffering behavior of BufferingTable
optional. Turning off buffering will impact direct writes through the dataset (not the writes done by saving an RDD to a dataset); however, this should be acceptable, because it is never a good idea to write to a dataset from a Spark function, since functions are expected to have no side effects, and most of those writes can be done by saving an RDD to the dataset. A rough sketch of this optional buffering follows.
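The class name and the bufferingEnabled flag below are hypothetical and only illustrate the intent of making buffering optional; this is not the actual BufferingTable implementation.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical write path with buffering made optional; not the actual CDAP classes.
class OptionallyBufferingTable(bufferingEnabled: Boolean) {
  private val buffer = ArrayBuffer.empty[(Array[Byte], Array[Byte], Array[Byte])]

  def put(row: Array[Byte], column: Array[Byte], value: Array[Byte]): Unit = {
    if (bufferingEnabled) {
      buffer += ((row, column, value))     // buffered until the next flush
    } else {
      persist(Seq((row, column, value)))   // write through immediately
    }
  }

  def flush(): Unit = {
    if (buffer.nonEmpty) {
      persist(buffer)
      buffer.clear()
    }
  }

  // Actual write to the underlying storage; elided in this sketch.
  protected def persist(updates: Seq[(Array[Byte], Array[Byte], Array[Byte])]): Unit = ()
}
```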