If realtime pipelines aren’t configured properly, they’ll likely fail because YARN kills their containers once RDD caching reaches the allocated memory limit.
Instructions
Below are the configurations that need to be applied when running a pipeline that contains a GCS sink.
...
Set the engine config spark.streaming.blockInterval to 30000 (30 seconds). This configuration has to be applied when a realtime pipeline has a GCS sink. It reduces the number of part files created in the GCS sink.
Set the runtime argument system.resources.reserved.memory.override to 1024 to reserve 1 GB of memory overhead for the Spark process, so that YARN does not kill the container.
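The effect of the block interval on part files can be illustrated with a small calculation. The following is a minimal sketch, not part of the pipeline tooling: it assumes a receiver-based source, where Spark Streaming creates roughly one task per received block, so the number of tasks per receiver per batch is about the batch interval divided by the block interval. The 60-second batch interval and the helper names `blocks_per_batch` and `missing_settings` are hypothetical and chosen only for illustration. The 200 ms value is Spark's default for spark.streaming.blockInterval.

```python
# Values taken from the instructions above.
RECOMMENDED_ENGINE_CONFIG = {"spark.streaming.blockInterval": "30000"}  # milliseconds
RECOMMENDED_RUNTIME_ARGS = {"system.resources.reserved.memory.override": "1024"}  # MB

# Spark's default block interval when the setting is not overridden.
DEFAULT_BLOCK_INTERVAL_MS = 200


def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Approximate number of blocks (and tasks) per receiver per batch."""
    if batch_interval_ms <= 0 or block_interval_ms <= 0:
        raise ValueError("intervals must be positive")
    return max(1, batch_interval_ms // block_interval_ms)


def missing_settings(engine_config: dict, runtime_args: dict) -> list:
    """Hypothetical check: list recommended keys that are absent or set differently."""
    missing = []
    for key, value in RECOMMENDED_ENGINE_CONFIG.items():
        if str(engine_config.get(key)) != value:
            missing.append(key)
    for key, value in RECOMMENDED_RUNTIME_ARGS.items():
        if str(runtime_args.get(key)) != value:
            missing.append(key)
    return missing


if __name__ == "__main__":
    batch_ms = 60_000  # assumed batch interval, not stated in this article
    print(blocks_per_batch(batch_ms, DEFAULT_BLOCK_INTERVAL_MS))  # 300
    print(blocks_per_batch(batch_ms, 30_000))                     # 2
    print(missing_settings({}, {}))
```

Under these assumptions, each batch would produce about 300 blocks with the default interval and about 2 with the 30-second interval, which is why the sink writes far fewer part files. The actual count depends on the source type and the pipeline's real batch interval.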
...