Overview

This article documents recommended configurations for running pipelines against a static Dataproc cluster. For instructions on pointing pipelines at such a cluster, see the article Run pipelines against existing Dataproc clusters.

General Tips

  • Set the following configurations while creating a static Dataproc cluster to run pipelines.

    • yarn.nodemanager.delete.debug-delay-sec - This configuration retains YARN logs after an application finishes. Recommended value: 86400 (1 day).

    • yarn.nodemanager.pmem-check-enabled - This configuration enables YARN to enforce the physical memory limit and kill containers that exceed it. Recommended value: false.

    • yarn.nodemanager.vmem-check-enabled - This configuration enables YARN to enforce the virtual memory limit and kill containers that exceed it. Recommended value: false.

    • dataproc:dataproc.conscrypt.provider.enable - This configuration enables the Conscrypt security provider. Because it causes errors with some plugins, the recommended value is false.

    • spark:spark.default.parallelism - This configuration specifies the default number of parallel tasks available to the job. If a cluster is fully allocated to the job and properly configured, it usually equals the total number of cores (number of workers * cores per worker); for example, 10 workers with 4 cores each gives 40. It's fine to overestimate this by 2x-3x; overestimating is usually better than underestimating.

    • spark:spark.sql.adaptive.coalescePartitions.initialPartitionNum - Set this to 32x the previous value. It applies only to Spark 3, but it is safe to set on all clusters; Spark 2 ignores it.

  • These configurations can be set by clicking Add Cluster Property while creating the cluster in the Cloud Console, or programmatically, as in the sketch below.
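
As an illustration, below is a minimal sketch of creating a static cluster with these properties through the google-cloud-dataproc Python client. The project ID, region, cluster name, machine types, and worker count are placeholder assumptions; adjust them to your environment. The sizing comments reuse the 10-worker, 4-core example above.

    from google.cloud import dataproc_v1

    # Placeholder assumptions: substitute your own project, region, and cluster name.
    PROJECT_ID = "my-project"
    REGION = "us-central1"
    CLUSTER_NAME = "static-pipeline-cluster"

    # Recommended properties from this article. Dataproc cluster properties use the
    # "<prefix>:<property>" form, so the YARN settings get the "yarn:" prefix.
    PROPERTIES = {
        "yarn:yarn.nodemanager.delete.debug-delay-sec": "86400",  # retain YARN logs 1 day
        "yarn:yarn.nodemanager.pmem-check-enabled": "false",      # no physical-memory kills
        "yarn:yarn.nodemanager.vmem-check-enabled": "false",      # no virtual-memory kills
        "dataproc:dataproc.conscrypt.provider.enable": "false",   # avoid plugin errors
        # Example sizing for 10 workers x 4 cores = 40 total cores:
        "spark:spark.default.parallelism": "40",
        # 32x the parallelism above; Spark 2 ignores this Spark 3 setting.
        "spark:spark.sql.adaptive.coalescePartitions.initialPartitionNum": "1280",
    }

    def create_cluster() -> None:
        """Create a static Dataproc cluster with the recommended properties."""
        client = dataproc_v1.ClusterControllerClient(
            client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
        )
        cluster = {
            "project_id": PROJECT_ID,
            "cluster_name": CLUSTER_NAME,
            "config": {
                "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
                "worker_config": {"num_instances": 10, "machine_type_uri": "n1-standard-4"},
                "software_config": {"properties": PROPERTIES},
            },
        }
        operation = client.create_cluster(
            request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
        )
        print(f"Cluster created: {operation.result().cluster_name}")

    if __name__ == "__main__":
        create_cluster()

The same key-value pairs can also be passed to the gcloud CLI's --properties flag if you prefer creating the cluster from the command line.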
