Overview

This article documents the recommended configurations for running pipelines against a static Dataproc cluster. See also the article on how to run pipelines against existing Dataproc clusters.

General Tips

Set the following configurations while creating a static Dataproc cluster to run pipelines.
- yarn.nodemanager.delete.debug-delay-sec - This configuration sets how long, in seconds, YARN waits after an application finishes before deleting its local files and logs, which retains YARN logs for debugging. Recommended value: 86400 (1 day).
- yarn.nodemanager.pmem-check-enabled - This configuration enables YARN to check the physical memory limit and kill containers that exceed it. Recommended value: false.
- yarn.nodemanager.vmem-check-enabled - This configuration enables YARN to check the virtual memory limit and kill containers that exceed it. Recommended value: false.
- dataproc:dataproc.conscrypt.provider.enable - This configuration enables Conscrypt. Since it causes errors with some plugins, the recommended value is false.
- spark:spark.default.parallelism - This configuration specifies the default number of partitions, and therefore how many tasks the job can run in parallel. If the cluster is fully allocated to the job and properly configured, it is usually equal to the total number of cores (number of workers * cores per worker). Overestimating it by 2x to 3x is fine and is usually better than underestimating it.
- spark:spark.sql.adaptive.coalescePartitions.initialPartitionNum - Set it to 32 times the value of spark:spark.default.parallelism. It applies only to Spark 3, but you can set it on all clusters; Spark 2 should ignore it.
These configurations can be set by clicking Add Cluster Property while creating the cluster from the Cloud console.
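
The arithmetic behind the two Spark values can be sketched in a few lines of Python. This is a minimal illustration, not part of any product API: the function names `dataproc_properties` and `to_properties_flag` are hypothetical, and the `yarn:` prefix on the YARN keys is an assumption based on the `prefix:property` format that Dataproc cluster properties use (the list above gives those keys without a prefix). The output string has the shape expected by the `--properties` flag of `gcloud dataproc clusters create`, as an alternative to entering each property in the console.

```python
def dataproc_properties(num_workers: int, cores_per_worker: int) -> dict[str, str]:
    """Return the recommended cluster properties for a given cluster size.

    Hypothetical helper: the parallelism is the total core count, and the
    initial partition number is 32 times that value, as described above.
    """
    if num_workers <= 0 or cores_per_worker <= 0:
        raise ValueError("num_workers and cores_per_worker must be positive")

    # Total number of cores available to the job.
    parallelism = num_workers * cores_per_worker

    return {
        # Assumption: YARN keys carry the "yarn:" file prefix.
        "yarn:yarn.nodemanager.delete.debug-delay-sec": "86400",
        "yarn:yarn.nodemanager.pmem-check-enabled": "false",
        "yarn:yarn.nodemanager.vmem-check-enabled": "false",
        "dataproc:dataproc.conscrypt.provider.enable": "false",
        "spark:spark.default.parallelism": str(parallelism),
        "spark:spark.sql.adaptive.coalescePartitions.initialPartitionNum": str(
            32 * parallelism
        ),
    }


def to_properties_flag(props: dict[str, str]) -> str:
    """Join the properties into a single comma-separated --properties flag."""
    return "--properties=" + ",".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    # Example: 4 workers with 8 cores each.
    print(to_properties_flag(dataproc_properties(num_workers=4, cores_per_worker=8)))
```

For a cluster of 4 workers with 8 cores each, this yields a parallelism of 32 and an initial partition number of 1024. To follow the advice about overestimating, pass a larger worker or core count than the cluster actually has.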
