If realtime pipelines aren’t configured properly, they’ll likely fail due to YARN killing containers as the RDD caching reaches the allocated memory limit.

Instructions

Below are the configurations that need to be applied when running a pipeline that contains a GCS sink.

...

  1. Set the engine config spark.streaming.blockInterval to 30000 (the value is in milliseconds, so 30 seconds). Apply this configuration whenever a realtime pipeline has a GCS sink. It reduces the number of part files created in the GCS sink.

  2. Set the runtime argument system.resources.reserved.memory.override to 1024 to reserve 1 GB of memory overhead for the Spark process, so that YARN does not kill the container.
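
The two settings above can be sketched as plain key/value pairs, together with a rough estimate of why a larger block interval means fewer part files. This is only an illustration, not the product's API. The dictionary names, the `blocks_per_batch` helper, and the 60-second batch interval are hypothetical. The 200 ms default for spark.streaming.blockInterval comes from Spark's documentation, and the estimate uses Spark's documented rule of thumb that each receiver produces roughly (batch interval / block interval) blocks per batch.

```python
import json

# Values taken from the two steps above; keys are the property names as written there.
ENGINE_CONFIG = {"spark.streaming.blockInterval": "30000"}  # milliseconds
RUNTIME_ARGS = {"system.resources.reserved.memory.override": "1024"}  # reserves 1 GB per step 2

# Spark's documented default for spark.streaming.blockInterval.
SPARK_DEFAULT_BLOCK_INTERVAL_MS = 200


def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Approximate number of blocks (and so partitions) per receiver per batch."""
    return max(1, batch_interval_ms // block_interval_ms)


if __name__ == "__main__":
    batch_ms = 60_000  # hypothetical batch interval, not stated in this article
    tuned_ms = int(ENGINE_CONFIG["spark.streaming.blockInterval"])

    # Print the settings in a form that is easy to copy into the pipeline configuration.
    print(json.dumps({"engineConfig": ENGINE_CONFIG, "runtimeArgs": RUNTIME_ARGS}, indent=2))

    print("blocks per batch, default interval:",
          blocks_per_batch(batch_ms, SPARK_DEFAULT_BLOCK_INTERVAL_MS))
    print("blocks per batch, tuned interval:",
          blocks_per_batch(batch_ms, tuned_ms))
```

Under the assumed 60-second batch interval, the estimate drops from about 300 blocks per batch to 2, which is the effect step 1 relies on to cut the number of part files written to the sink.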


...