
Please read HTTP Batch Source first to grasp the core principles of pagination, format parsing, etc.

...

For such cases I propose loading pages for at most batchInterval seconds before returning them as an RDD. The next HttpInputDStream#compute call then continues from the place where the previous one stopped.

The interval should interrupt page loading only at a page boundary, that is, once all the records from the current page have been processed, so that the same records are not put into an RDD more than once.
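
To make the proposal concrete, here is a minimal sketch of the time-bounded loading loop in plain Java, with no Spark dependency. The class name BoundedPageLoader, the use of an Iterator of record lists as the page source, and the injected clock are all assumptions made for illustration; they are not part of the actual HttpInputDStream code. Each nextBatch() call stands in for one compute() call.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of the real plugin: loads whole pages until
// the batch interval has elapsed, then resumes from the same position on the
// next call.
class BoundedPageLoader {
    // Stand-in for the paginated HTTP source; each element is one page of records.
    private final Iterator<List<String>> pages;
    private final long batchIntervalMillis;
    // Injected clock (for example System::currentTimeMillis) so the loop is testable.
    private final LongSupplier clock;

    BoundedPageLoader(Iterator<List<String>> pages, long batchIntervalMillis, LongSupplier clock) {
        this.pages = pages;
        this.batchIntervalMillis = batchIntervalMillis;
        this.clock = clock;
    }

    // Corresponds to one compute() call: returns the records for one batch.
    List<String> nextBatch() {
        long deadline = clock.getAsLong() + batchIntervalMillis;
        List<String> records = new ArrayList<>();
        while (pages.hasNext()) {
            // A page is always consumed in full, so no page is split across
            // two batches and no record is emitted twice.
            records.addAll(pages.next());
            // The deadline is checked only at a page boundary.
            if (clock.getAsLong() >= deadline) {
                break;
            }
        }
        // The iterator keeps its position, so the next call continues from here.
        return records;
    }
}
```

In this sketch a call always loads at least one page when one is available, even if that single page takes longer than batchInterval, which is one possible reading of the rule above.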

InputDStream return value: pages or structured records? (InputDStream<StructuredRecord> vs InputDStream<BasePage>)

...
