Problem

The streaming pipeline is slow to process data, and errors like the following appear in the logs:

2020-05-01 15:47:41,841 - WARN [task-result-getter-2:o.a.s.s.TaskSetManager@66] - Lost task 0.0 in stage 561443.0 (TID 185879, cdap-hl7v2tofh-f483df52-8955-11ea-a9da-e29cdef61699-w-1.c.ccf-cdw-df2-prod.internal, executor 4662): java.lang.Exception: Could not compute split, block input-0-1588186863359 of RDD 329677 not found
2020-05-01 15:47:41,843 - WARN [task-result-getter-3:o.a.s.s.TaskSetManager@66] - Lost task 2.0 in stage 561443.0 (TID 185881, cdap-hl7v2tofh-f483df52-8955-11ea-a9da-e29cdef61699-w-4.c.ccf-cdw-df2-prod.internal, executor 4660): java.lang.Exception: Could not compute split, block input-0-1588186863361 of RDD 329677 not found

The lag causes data to buffer in memory; eventually the pipeline runs out of memory and crashes.

Symptom(s)

  • The streaming pipeline keeps running for a long time after the ‘Stop’ button is clicked.

  • Logs show task failures because RDD input blocks cannot be found (as in the errors above).

  • The Spark Streaming UI shows many active batches, and the batch processing time exceeds the configured batch interval.

    • To open the Spark Streaming UI: Dataproc → Clusters → Web Interfaces → YARN ResourceManager → ApplicationMaster (under Tracking UI) → Streaming tab

Solution(s)

Increase batch interval

Set the batch interval to a value greater than the batch processing time. The batch processing time is shown in the Spark Streaming UI, as described above.

Clone the pipeline before making configuration changes.

Editing the configuration of an already-deployed pipeline has no effect: on restart, Spark Streaming reads its configuration from the checkpoint, so any changes made in the UI are ignored. The sketch below illustrates why.
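
For context, here is a minimal Spark Streaming sketch in Java (the app name and checkpoint path are hypothetical; a CDAP pipeline manages these for you). The batch interval is fixed when the StreamingContext is created, and on restart getOrCreate restores the previous context, including its original interval, from the checkpoint, so the creation function, and any new settings, are never used.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BatchIntervalSketch {

  // Hypothetical checkpoint location; a CDAP pipeline manages its own checkpoint directory.
  private static final String CHECKPOINT_DIR = "gs://some-bucket/checkpoints/my-pipeline";

  public static void main(String[] args) throws InterruptedException {
    // If a checkpoint already exists, getOrCreate restores the previous StreamingContext,
    // including its original batch interval, and the factory below is never called.
    // This is why editing the config of a deployed pipeline has no effect.
    JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, () -> {
      SparkConf conf = new SparkConf().setAppName("batch-interval-sketch");
      // The batch interval is fixed here, at context creation time. It should be
      // larger than the observed batch processing time so batches do not queue up.
      JavaStreamingContext fresh = new JavaStreamingContext(conf, Durations.seconds(60));
      fresh.checkpoint(CHECKPOINT_DIR);
      // ... define sources, transforms, and sinks here ...
      return fresh;
    });

    ssc.start();
    ssc.awaitTermination();
  }
}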

Use Pub/Sub instead of GCS sink

Writing to GCS is slow and can push the batch processing time above the batch interval. Consider writing to a Pub/Sub sink instead; a sketch of the publish path follows.
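
For reference, a minimal sketch of the publish path using the Pub/Sub Java client (project and topic names are hypothetical; in a pipeline the Pub/Sub sink plugin handles the publishing). It only illustrates that publishing is an asynchronous, internally batched operation, which keeps per-record overhead low compared with writing objects to GCS every batch.

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PubSubPublishSketch {

  public static void main(String[] args) throws Exception {
    // Hypothetical project and topic; replace with real values.
    TopicName topic = TopicName.of("my-project", "my-output-topic");
    Publisher publisher = Publisher.newBuilder(topic).build();
    try {
      PubsubMessage message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8("{\"record\":\"...\"}"))
          .build();
      // publish() is asynchronous and the client batches messages internally,
      // so per-record overhead is small compared to writing objects to GCS.
      String messageId = publisher.publish(message).get();
      System.out.println("Published message " + messageId);
    } finally {
      // Release client resources.
      publisher.shutdown();
    }
  }
}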
