Please read HTTP Batch Source first to understand the core principles of pagination, format parsing, etc.

...

For such cases I propose loading pages for at most batchInterval seconds before returning them to the RDD. The next HttpInputDStream#compute call then continues from where the previous one stopped.

When the interval expires, page loading should stop only after all records from the current page have been processed, so that the same records are not put into the RDD more than once.
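The time-bounded loading described above could look roughly like the sketch below. It is only an illustration under stated assumptions: `BoundedPageLoader`, `loadBatch`, and the injected clock are hypothetical names, pages are modelled as plain lists of strings rather than BasePage objects, and the Spark wiring is left out. The deadline is checked only between pages, so a page is never split across two batches, and the iterator keeps its position so the next call resumes where the previous one stopped.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical helper: loads whole pages until a time budget is used up. */
class BoundedPageLoader {
  // Stand-in for the paginated HTTP source (assumption: one list of records per page).
  private final Iterator<List<String>> pages;
  // Injectable clock in milliseconds, e.g. System::currentTimeMillis.
  private final LongSupplier clockMillis;

  BoundedPageLoader(Iterator<List<String>> pages, LongSupplier clockMillis) {
    this.pages = pages;
    this.clockMillis = clockMillis;
  }

  /**
   * Loads pages until the budget expires or the source is exhausted.
   * The deadline is only checked between pages, so every page that is
   * started is added completely and no record is emitted twice.
   */
  List<String> loadBatch(long budgetMillis) {
    long deadline = clockMillis.getAsLong() + budgetMillis;
    List<String> records = new ArrayList<>();
    while (pages.hasNext() && clockMillis.getAsLong() < deadline) {
      records.addAll(pages.next());
    }
    return records;
  }
}
```

In the proposed design, something like `loadBatch(batchInterval)` would presumably be called from HttpInputDStream#compute, with the returned records wrapped into the RDD for that batch. Note that if the budget is zero or negative, this sketch returns an empty batch instead of forcing at least one page.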

InputDStream return value: pages or structured records? (InputDStream<StructuredRecord> vs InputDStream<BasePage>)

...