Problem
Pipelines fail with the following error in the log:
io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Insufficient 'DISKS_TOTAL_GB' quota. Requested 3000.0, available 2048.0
This error means that provisioning the Dataproc cluster for your pipeline would cause your project to exceed its Compute Engine (GCE) quota for persistent disks. Because the Dataproc cluster cannot be provisioned, the pipeline fails.
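To see how far over quota a request is, the numbers can be read straight out of the error message. The following is a minimal sketch, assuming the message always follows the format shown above; the helper name parse_quota_error is hypothetical and not part of any Google Cloud library.

```python
import re

# Pattern based on the error text shown above. Treating this as the
# fixed format of the message is an assumption.
QUOTA_ERROR = re.compile(
    r"Insufficient '(?P<metric>[A-Z_]+)' quota\. "
    r"Requested (?P<requested>\d+(?:\.\d+)?), "
    r"available (?P<available>\d+(?:\.\d+)?)"
)


def parse_quota_error(message):
    """Return the quota metric, requested and available amounts, and the
    shortfall, or None if the message is not a quota error."""
    match = QUOTA_ERROR.search(message)
    if match is None:
        return None
    requested = float(match.group("requested"))
    available = float(match.group("available"))
    return {
        "metric": match.group("metric"),
        "requested": requested,
        "available": available,
        "shortfall": requested - available,
    }


log_line = (
    "io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Insufficient "
    "'DISKS_TOTAL_GB' quota. Requested 3000.0, available 2048.0"
)
info = parse_quota_error(log_line)
print(info["metric"], info["shortfall"])
```

For the error above, the cluster asks for 952 GB more than is available, so either the quota must grow by at least that much or the cluster's disks must shrink by at least that much.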
Solution(s)
There are two ways to resolve this issue: raise your project quota, or configure Dataproc disk sizes.
Raise your project quota
The quota that must be raised for this error is Persistent disk standard (GB). There are both project-wide and regional quotas. For more information, including the steps to request an increase, see https://cloud.google.com/compute/quotas
Configure Dataproc disk sizes
The disk sizes of the Dataproc cluster can be configured through cluster properties to keep the cluster under quota. The defaults can be overridden by adding runtime arguments to the pipeline, as described in Setting custom Dataproc cluster properties.
In this case, the relevant properties are:
system.profile.properties.masterDiskGB
system.profile.properties.workerDiskGB
Set these to values (in GB) low enough that the resulting Dataproc cluster remains under quota.
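As a rough guide to choosing values, the total disk requested is the sum of the disk sizes across all nodes in the cluster. The sketch below is illustrative only. It assumes a cluster of 1 master and 2 workers (which would match the 3000 GB request in the error if each disk were 1000 GB), assumes the "available" figure from the error is the headroom the cluster must fit within, and ignores any minimum disk size Dataproc may enforce. The function names are hypothetical; only the two property names come from this article.

```python
def total_disk_gb(master_disk_gb, worker_disk_gb, num_masters=1, num_workers=2):
    """Total persistent disk (GB) the cluster would request.

    The node counts are assumptions; use your own profile's values.
    """
    return num_masters * master_disk_gb + num_workers * worker_disk_gb


def runtime_args_within_quota(available_gb, num_masters=1, num_workers=2):
    """Pick one uniform disk size per node that fits within available_gb
    and return it as pipeline runtime arguments."""
    nodes = num_masters + num_workers
    per_node_gb = int(available_gb // nodes)
    if per_node_gb < 1:
        raise ValueError("Not enough quota for even 1 GB per node")
    return {
        "system.profile.properties.masterDiskGB": str(per_node_gb),
        "system.profile.properties.workerDiskGB": str(per_node_gb),
    }


# Using the 2048 GB reported as available in the error above.
args = runtime_args_within_quota(2048.0)
for key, value in args.items():
    print(f"{key}={value}")
```

With these assumptions, each node gets a 682 GB disk, for a total of 2046 GB. In practice you may want to leave extra headroom, since other disks in the same project or region count against the same quota.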