A realtime pipeline that is not configured properly can fail because YARN kills its containers once RDD caching reaches the allocated memory limit.
Instructions
Apply the following configurations when running a realtime pipeline that contains a GCS sink.
Also make sure the number of executors is set correctly. By default it is set to 1.
Set the engine config spark.streaming.blockInterval to 30000 (30 seconds). This configuration has to be applied when a realtime pipeline has a GCS sink. It reduces the number of part files created in the GCS sink.
Set a runtime argument system.resources.reserved.memory.override to 1024 to reserve 1 GB of memory overhead for the Spark process, so that YARN does not kill its containers.
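The two settings above can be collected in one place, together with a rough estimate of why a larger block interval reduces the number of part files. This is a minimal sketch, not the pipeline's own tooling. It assumes a receiver-based Spark stream, where each block becomes roughly one task (and so one output part) per batch. The helper `blocks_per_batch` and the 60-second batch interval are hypothetical and used for illustration only; 200 ms is Spark's documented default for spark.streaming.blockInterval.

```python
import json

# Values taken from the instructions above.
ENGINE_CONFIG = {"spark.streaming.blockInterval": "30000"}  # milliseconds
RUNTIME_ARGS = {"system.resources.reserved.memory.override": "1024"}  # MB (1 GB)

# Spark's documented default block interval, in milliseconds.
SPARK_DEFAULT_BLOCK_INTERVAL_MS = 200


def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Approximate number of blocks per receiver per batch (hypothetical helper).

    Assumes each block maps to about one task, and so about one part file.
    """
    if batch_interval_ms <= 0 or block_interval_ms <= 0:
        raise ValueError("intervals must be positive")
    return max(1, batch_interval_ms // block_interval_ms)


# Hypothetical batch interval of 60 seconds, chosen only for illustration.
BATCH_INTERVAL_MS = 60_000

default_blocks = blocks_per_batch(BATCH_INTERVAL_MS, SPARK_DEFAULT_BLOCK_INTERVAL_MS)
tuned_blocks = blocks_per_batch(
    BATCH_INTERVAL_MS, int(ENGINE_CONFIG["spark.streaming.blockInterval"])
)

print(json.dumps({"engineConfig": ENGINE_CONFIG, "runtimeArgs": RUNTIME_ARGS}, indent=2))
print(f"blocks per batch with default interval: {default_blocks}")
print(f"blocks per batch with 30 s interval: {tuned_blocks}")
```

Under these assumptions, the estimate drops from hundreds of blocks per batch to a handful. The actual part-file count also depends on the number of receivers and executors, so treat the numbers as an order-of-magnitude guide.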