Fine-tuning real-time pipelines

If real-time pipelines aren’t configured properly, they are likely to fail: as cached RDDs grow, the container reaches its allocated memory limit and YARN kills it.

Instructions

Apply the configurations below when running a real-time pipeline that contains a GCS sink.

Also make sure the number of executors is set correctly. By default, it’s set to 1.

  1. Set the engine config spark.streaming.blockInterval to 30000 (the value is in milliseconds, so 30 seconds). This reduces the number of part files created in the GCS sink.

  2. Set the runtime argument system.resources.reserved.memory.override to 1024 to reserve 1 GB of memory overhead for the Spark process, so that YARN doesn’t kill the container.
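
To see why the larger block interval reduces the number of part files, here is a rough sketch of the arithmetic. In Spark Streaming, each receiver produces roughly one block per block interval, and each block becomes one partition of the batch. The sketch assumes that the sink writes one part file per partition, and it uses a hypothetical 60-second batch interval; neither is stated above, so substitute your pipeline's own values. The 200 ms figure is Spark's documented default for spark.streaming.blockInterval.

```python
import math

# Spark's documented default for spark.streaming.blockInterval, in milliseconds.
DEFAULT_BLOCK_INTERVAL_MS = 200
# Value recommended in step 1, in milliseconds.
TUNED_BLOCK_INTERVAL_MS = 30_000
# ASSUMPTION: a 60-second batch interval, chosen only for illustration.
ASSUMED_BATCH_INTERVAL_MS = 60_000


def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Approximate number of blocks (partitions) one receiver produces per batch."""
    return math.ceil(batch_interval_ms / block_interval_ms)


# The two settings from the steps above, written as plain key/value pairs.
# Where they are entered (engine config vs. runtime arguments) follows the steps.
engine_config = {"spark.streaming.blockInterval": str(TUNED_BLOCK_INTERVAL_MS)}
runtime_arguments = {"system.resources.reserved.memory.override": "1024"}

before = blocks_per_batch(ASSUMED_BATCH_INTERVAL_MS, DEFAULT_BLOCK_INTERVAL_MS)
after = blocks_per_batch(ASSUMED_BATCH_INTERVAL_MS, TUNED_BLOCK_INTERVAL_MS)
print(f"blocks per batch per receiver: {before} with the default, {after} after tuning")
```

Under these assumptions, one receiver goes from about 300 blocks per batch to 2, which is why far fewer part files are written to the sink.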
