Fine-tuning real-time pipelines
If real-time pipelines aren’t configured properly, they are likely to fail: YARN kills the containers once RDD caching reaches the allocated memory limit.
Instructions
Apply the configurations below when running a pipeline that contains a GCS sink.
Also make sure the number of executors is set correctly. By default, it is set to 1.
Set the engine config spark.streaming.blockInterval to 30000 (30 seconds). This configuration has to be applied when a real-time pipeline has a GCS sink. It reduces the number of part files created in the GCS sink.
Set the runtime argument system.resources.reserved.memory.override to 1024 to reserve 1 GB of memory overhead for the Spark process, so that YARN does not kill the container.
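To see why a larger block interval reduces the number of part files, the sketch below estimates how many blocks a single receiver produces per batch. In Spark Streaming, each block becomes one partition of the batch RDD, and the count is roughly the batch interval divided by the block interval. The 60-second batch interval and the helper name `blocks_per_batch` are assumptions chosen for illustration. The 200 ms value is Spark's documented default for spark.streaming.blockInterval. Whether each partition maps to exactly one part file depends on the sink, so treat the numbers as an approximation.

```python
# The two settings from the instructions above, written as plain key/value pairs.
engine_config = {"spark.streaming.blockInterval": "30000"}  # milliseconds
runtime_args = {"system.resources.reserved.memory.override": "1024"}  # MB


def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Approximate number of blocks one receiver produces in a single batch."""
    if block_interval_ms <= 0:
        raise ValueError("block interval must be positive")
    # At least one block is produced whenever data arrives in the batch.
    return max(1, batch_interval_ms // block_interval_ms)


BATCH_INTERVAL_MS = 60_000  # assumption: a 60-second batch interval
SPARK_DEFAULT_BLOCK_INTERVAL_MS = 200  # Spark's default block interval
tuned_block_interval_ms = int(engine_config["spark.streaming.blockInterval"])

default_blocks = blocks_per_batch(BATCH_INTERVAL_MS, SPARK_DEFAULT_BLOCK_INTERVAL_MS)
tuned_blocks = blocks_per_batch(BATCH_INTERVAL_MS, tuned_block_interval_ms)

print(f"blocks per batch at {SPARK_DEFAULT_BLOCK_INTERVAL_MS} ms: {default_blocks}")
print(f"blocks per batch at {tuned_block_interval_ms} ms: {tuned_blocks}")
```

Under these assumptions, the estimate drops from 300 blocks per batch to 2, which is the effect the block interval setting is meant to achieve for the GCS sink.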