Note: Additional details and implementation specifics will be added over time.
Andreas: This needs more details use cases. I know a few:
- a worker that looks for new partitions continuously
- a workflow that runs periodically and consumes all new partitions each time
- a workflow that catches up with old partitions while new partitions are created
- a scheduler that triggers a workflow every time there is a new partition
A requirement/use-case is to continuously process partitions of a PartitionedFileSetDataset from within a periodically running MapReduce job, scheduled by a Workflow. This means that each time the Workflow runs, it should process all of the partitions created since the last one it processed, so that it has processed all existing partitions after running. In order to do this, we will be adding an index on creation time of the partitions, as they are added to the dataset, in order to avoid inefficiently scanning all of the partitions each run. Then, one possibility is that the workflow will be responsible for maintaining the end timestamp used when requesting the partitions by time (lastProcessedTimestamp). In the next run of the workflow, it can then simply request from the dataset the partitions that have been created after lastProcessedTimestamp up until the then current time.
However, there is an issue with this idea - a partition could have had a creation time smaller/earlier than lastProcessedTimestamp, but not have been committed/visible at the time the workflow requested partitions up until lastProcessedTimestamp. So, even though the partition became visible after lastProcessedTimestamp, the next scan from lastProcessedTimestamp onwards will not include this partition because the partition's creation time will not match the time range. While this approach is simple, it has the possibility to omit the processing of partitions, and so it is not good enough.
...