Failed to provision Dataproc cluster due to missing bucket
Problem
Pipeline runs fail within a few seconds, and the log reports that a GCS bucket was not found. Every pipeline run that uses the same compute profile fails with an error naming the same bucket.
For example:
2020-05-14 15:10:41,852 - ERROR [provisioning-service-2:i.c.c.i.p.t.ProvisioningTask@151] - PROVISION task failed in REQUESTING_CREATE state for program run program_run:default.t.-SNAPSHOT.workflow.DataPipelineWorkflow.bc504d34-962f-11ea-b486-000000f78a44.
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Google Cloud Storage bucket does not exist '[bucket redacted]'.
Solution(s)
Manually create the missing bucket, or configure the profile to use a different staging bucket.
If no staging bucket is specified, which is the default for Cloud Data Fusion (CDF) Dataproc profiles, Dataproc tries to reuse the same staging bucket for every cluster creation request. This is why every run on the profile fails on the same bucket once that bucket is missing. See https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/staging-bucket for more information.
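As a rough sketch of the first option, the Python helper below pulls the bucket name out of an error line like the one in the example log and builds a gcloud command that would recreate it. The function names, the example bucket name, and the location are hypothetical and used only for illustration. The script only prints the command and does not run it. Pick the location that matches where your Dataproc clusters are created.

```python
import re
import shlex

# Matches the Dataproc error text shown in the example log above.
_MISSING_BUCKET = re.compile(
    r"Google Cloud Storage bucket does not exist '([^']+)'"
)


def missing_bucket(log_line):
    """Return the bucket named in a 'bucket does not exist' error, or None."""
    match = _MISSING_BUCKET.search(log_line)
    return match.group(1) if match else None


def create_bucket_command(bucket, location):
    """Build a gcloud command string that creates the bucket in a location."""
    return "gcloud storage buckets create {} --location={}".format(
        shlex.quote("gs://" + bucket), shlex.quote(location)
    )


if __name__ == "__main__":
    # Hypothetical log line; the real bucket name comes from your own logs.
    line = (
        "INVALID_ARGUMENT: Google Cloud Storage bucket does not exist "
        "'example-staging-bucket'."
    )
    bucket = missing_bucket(line)
    if bucket is not None:
        # The location is an assumption; replace it with your own region.
        print(create_bucket_command(bucket, "us-central1"))
```

Review the printed command before running it, since bucket names are globally unique and the original name may no longer be available. If it is not, use the second option and set a different staging bucket on the profile.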