Please read HTTP Batch Source first to grasp the core principles of pagination, format parsing, etc.
...
So for such cases I propose loading pages for at most batchInterval seconds before returning them as an RDD. The next HttpInputDStream#compute call then continues from the place where the previous one stopped.
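
The idea can be illustrated with a minimal sketch in plain Python, without Spark. It is not the actual HttpInputDStream implementation: the class name `TimeBoundedPageLoader`, the `fetch_page` callback, the integer cursor, and the injectable `clock` are all assumptions made for illustration. Each `compute` call loads pages until the time budget is spent or the pages run out, and keeps the cursor so the next call resumes from there.

```python
import time
from typing import Callable, List, Optional, Tuple

# Hypothetical callback: takes a page cursor and returns (records, next_cursor).
# next_cursor is None when there are no more pages to load.
FetchPage = Callable[[int], Tuple[List[str], Optional[int]]]


class TimeBoundedPageLoader:
    """Loads pages for at most `batch_interval` seconds per compute() call."""

    def __init__(
        self,
        fetch_page: FetchPage,
        batch_interval: float,
        start_cursor: int = 0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_page = fetch_page
        self._batch_interval = batch_interval
        self._cursor: Optional[int] = start_cursor  # survives between calls
        self._clock = clock

    def compute(self) -> List[str]:
        """Return the records loaded within this batch's time budget."""
        deadline = self._clock() + self._batch_interval
        records: List[str] = []
        # The deadline is only checked before a page is requested, so one
        # slow page can still run past the deadline.
        while self._cursor is not None and self._clock() < deadline:
            page, self._cursor = self._fetch_page(self._cursor)
            records.extend(page)
        return records
```

In a real input stream, the list returned by `compute` would presumably be wrapped in an RDD for the current batch. Because the deadline is checked only between page requests, a per-request timeout would also be needed to keep a single slow page from exceeding the batch interval.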
...