Not enough space to cache RDD

Problem

Customers see the following warnings when running pipelines:

2020-04-22 17:40:01,982 - WARN  [Executor task launch worker for task 1610:o.a.s.s.BlockManager@66] - Persisting block rdd_7_276 to disk instead.

2020-04-22 17:40:43,018 - WARN  [Executor task launch worker for task 1610:o.a.s.s.m.MemoryStore@66] - Not enough space to cache rdd_7_276 in memory! (computed 1528.4 MB so far)
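These warnings come from Spark's MemoryStore: a cached partition is being materialized but does not fit in the executor's storage memory. The PySpark sketch below reproduces the pattern; the data sizes and settings are illustrative, not taken from the affected pipelines.

    # Illustrative PySpark sketch (hypothetical sizes): caching partitions
    # larger than available storage memory triggers the warnings shown above.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[2]")
             .appName("cache-pressure-demo")
             .getOrCreate())

    # Each partition materializes to roughly 1.5 GB. While unrolling it, the
    # MemoryStore logs "Not enough space to cache rdd_N_P in memory! (computed
    # X MB so far)"; because the storage level includes disk, the BlockManager
    # then logs "Persisting block rdd_N_P to disk instead."
    rdd = (spark.sparkContext
           .parallelize(range(2), numSlices=2)
           .flatMap(lambda _: (b"x" * 1024 for _ in range(1_500_000))))

    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()  # forces computation and the caching attempt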

Symptom(s)

  • Pipelines run for a long time and appear never to finish.

  • Pipeline metrics keep resetting, indicating that the jobs are reprocessing data.

  • Logs indicate that Spark is unable to fit an RDD in memory.

  • A misleading log message claims that the RDD is being persisted to disk.

Solution(s)

Turning off Auto-Caching

By default, pipelines cache intermediate data to prevent Spark from recomputing it. Caching requires a substantial amount of memory, so pipelines that process large amounts of data often need to turn it off.
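In plain Spark terms, the trade-off looks like the sketch below (the file name and transformations are illustrative): with caching, an intermediate RDD that feeds several actions is computed once; without it, Spark recomputes the full lineage for each action.

    # Illustrative sketch of the auto-caching trade-off on an intermediate RDD.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[2]")
             .appName("autocache-tradeoff")
             .getOrCreate())

    parsed = (spark.sparkContext
              .textFile("input.txt")                  # placeholder input path
              .map(lambda line: line.split(",")))

    # What auto-caching does, in effect: compute 'parsed' once and keep the
    # blocks in memory for every downstream action -- fast, but the blocks
    # compete for executor memory, as in the warnings above.
    parsed.cache()
    rows = parsed.count()
    wide = parsed.filter(lambda f: len(f) > 3).count()  # served from the cache

    # What disabling auto-caching means: no cached blocks, so each action
    # recomputes the lineage (re-read and re-parse) -- slower, but without
    # memory pressure from cached RDDs.
    parsed.unpersist()
    rows_again = parsed.count()  # recomputed from the source file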

  1. Navigate to the pipeline detail page.

  2. In the Configure menu, click on Engine config.

  3. Enter 'spark.cdap.pipeline.autocache.enable' as the key and 'false' as the value.
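The same property can likely also be set outside the UI as a preference on the pipeline application, for example through CDAP's preferences REST endpoint. The sketch below is a hedged illustration: the instance URL, namespace, and pipeline name are placeholders, and the endpoint and key should be verified against the documentation for your CDAP version.

    # Hedged sketch: set the engine property programmatically instead of via
    # the UI. All names below are placeholders, not taken from this article.
    import requests

    CDAP_URL = "http://localhost:11015"  # assumption: default CDAP router port
    NAMESPACE = "default"                # assumption: the pipeline's namespace
    PIPELINE = "my-pipeline"             # placeholder application (pipeline) name

    resp = requests.put(
        f"{CDAP_URL}/v3/namespaces/{NAMESPACE}/apps/{PIPELINE}/preferences",
        json={"spark.cdap.pipeline.autocache.enable": "false"},
    )
    resp.raise_for_status()  # the preference applies to subsequent runs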
