Managing Spark Pipeline Memory Settings
This page describes the different configuration settings that can be used to manage the amount of memory a pipeline uses. Memory management is especially important in Spark pipelines that contain aggregations or joins.
Before you begin
Deploy a pipeline that uses the Spark engine.
Setting Executor Memory
Spark pipelines consist of a driver and multiple executors. Executors do most of the work and are usually the processes that require the most memory.
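These UI settings ultimately map onto standard Spark properties. As a point of reference, here is a minimal sketch of how the equivalent values would be set directly on a SparkConf in PySpark; the specific values are examples only, not recommendations.

    from pyspark import SparkConf

    # Example values only; tune these for your workload.
    conf = (
        SparkConf()
        .set("spark.executor.memory", "4g")  # per-executor memory (the Executor setting)
        .set("spark.driver.memory", "2g")    # memory for the driver
        .set("spark.executor.cores", "1")    # cores per executor
    )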
1. Navigate to the pipeline detail page.
2. In the Configure menu, click on Resources.
3. Enter the desired amount of memory under Executor.
4. In the same Configure menu, click on Compute config.
5. Click Customize on the desired compute profile.
6. Ensure that the worker memory is a multiple of the executor memory. For example, if executor memory is set to 4096 MB, worker memory should be 4, 8, 12, or another multiple of 4 GB. Scale the worker cores accordingly. Worker memory does not strictly need to be an exact multiple of executor memory, but if it is not, cluster capacity is more likely to be wasted, as the sketch below illustrates.
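To make the sizing arithmetic concrete, this small sketch computes how many executors fit on a single worker and how much memory is left stranded. It is a simplification: real clusters also reserve part of each worker's memory for YARN and other overhead.

    def executors_per_worker(worker_mem_mb, worker_cores, exec_mem_mb, exec_cores):
        """Return (executors that fit on one worker, stranded memory in MB)."""
        count = min(worker_mem_mb // exec_mem_mb, worker_cores // exec_cores)
        return count, worker_mem_mb - count * exec_mem_mb

    # 8 GB, 2-core worker with 4096 MB, 1-core executors: fits 2, strands nothing.
    print(executors_per_worker(8192, 2, 4096, 1))   # (2, 0)

    # 10 GB worker with the same executors: still fits only 2, strands 2048 MB.
    print(executors_per_worker(10240, 2, 4096, 1))  # (2, 2048)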
Turning off Auto-Caching
By default, pipelines cache intermediate data in order to prevent Spark from re-computing it. Caching requires a substantial amount of memory, so pipelines that process large amounts of data often need to turn it off.
1. Navigate to the pipeline detail page.
2. In the Configure menu, click on Engine config.
3. Enter 'spark.cdap.pipeline.autocache.enable' as the key, and 'false' as the value.
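The same key can also be passed as a runtime argument when starting a pipeline programmatically. The sketch below does this through the CDAP lifecycle REST API, which accepts runtime arguments as a JSON map in the start request; the instance URL, namespace, and pipeline name are placeholders, and it assumes a batch pipeline, whose underlying workflow is named DataPipelineWorkflow.

    import requests

    # Placeholders: point these at your own CDAP instance and pipeline.
    CDAP_URL = "http://localhost:11015"
    NAMESPACE = "default"
    PIPELINE = "MyPipeline"

    # Runtime arguments are sent as a JSON map in the request body.
    runtime_args = {"spark.cdap.pipeline.autocache.enable": "false"}

    resp = requests.post(
        f"{CDAP_URL}/v3/namespaces/{NAMESPACE}/apps/{PIPELINE}"
        "/workflows/DataPipelineWorkflow/start",
        json=runtime_args,
    )
    resp.raise_for_status()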