Run pipelines against existing Dataproc clusters

This article explains, step by step, how to run pipelines against an existing Dataproc cluster. This feature is available only in the Enterprise edition of Cloud Data Fusion ("Execution environment selection").

Prerequisites:

An existing Dataproc cluster, which can be set up by following this guide.
A Cloud Data Fusion instance and a data pipeline. Learn how to create a new instance by following this guide.

Instructions

SSH Setup on Dataproc Cluster
1. Navigate to the Dataproc console on Google Cloud Platform. Open “Cluster details” by clicking your Dataproc cluster name.
2. Under “VM Instances”, click the “SSH” button to connect to the Dataproc VM.
3. Follow the steps here to create a new SSH key, format the public key file to enforce an expiration time, and add the newly created SSH public key at the project or instance level.
  1. Use the command ssh-keygen -m PEM -t rsa -b 4096 instead of the one in the linked doc, so that the generated SSH key is in a format that CDF can use.
4. If SSH is set up successfully, the SSH key you just added appears in the Metadata section of your Compute Engine console, and in the authorized_keys file on your Dataproc VM.
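The key generation and public-key formatting in step 3 can be sketched in shell as follows. This is a minimal sketch under stated assumptions, not the official procedure: the function names, the key file path, and the user name are hypothetical, and the -f, -C, and -N "" options (output file, comment, empty passphrase) are additions to the command given above. The output line follows the Compute Engine metadata format for SSH keys with an expiration time.

```shell
#!/usr/bin/env bash
# Sketch only: function names, paths, and user names are placeholders.

# Generate a PEM-format RSA key pair with the same core command as step 3.1.
# Assumption: -f (output file), -C (comment), and -N "" (no passphrase) are
# added here so the command runs non-interactively.
generate_pem_key() {
  local key_file="$1" user="$2"
  ssh-keygen -m PEM -t rsa -b 4096 -f "$key_file" -C "$user" -N ""
}

# Build one SSH-key metadata entry that enforces an expiration time.
# Reads the key type and key value from the first line of a .pub file.
# Output: USER:TYPE KEY google-ssh {"userName":"USER","expireOn":"TIME"}
format_metadata_entry() {
  local pub_file="$1" user="$2" expire_on="$3"
  local key_type key_value
  read -r key_type key_value _ < "$pub_file"
  printf '%s:%s %s google-ssh {"userName":"%s","expireOn":"%s"}\n' \
    "$user" "$key_type" "$key_value" "$user" "$expire_on"
}

# Example usage (not executed here):
#   generate_pem_key "$HOME/.ssh/cdf_key" cdfuser
#   format_metadata_entry "$HOME/.ssh/cdf_key.pub" cdfuser 2030-01-01T00:00:00+0000
```

The line printed by format_metadata_entry is what you would paste into the SSH keys metadata at the project or instance level. Keep the private key file (without the .pub suffix); it is needed later when creating the compute profile.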
Create a customized system compute profile for your Data Fusion instance
1. Navigate to your Data Fusion instance console by clicking “View Instance”.
2. Click “System Admin” in the top-right corner.
3. Under the “Configuration” tab, expand “System Compute Profiles”. Click “Create New Profile”, and choose “Remote Hadoop Provisioner” on the next page.
4. Fill out the general information for the profile.
5. Enter the SSH host IP, which you can find on the “VM instance details” page under Compute Engine.
6. Copy the SSH private key created during the SSH setup above, and paste it into the “SSH Private Key” field.
7. Click “Create” to create the profile.
Configure your Data Fusion pipeline to use the customized profile
1. Click on the pipeline.
2. Click on Configure -> Compute config and choose your newly created profile.
Start the pipeline. It now runs against your existing Dataproc cluster!

Troubleshoot

If the pipeline fails with a connection timeout, check that the SSH key and the firewall rules are configured correctly. See the SSH setup section above for the SSH key, and here for firewall rules.
If you get an ‘invalid privatekey’ error while running the pipeline, check whether the first line of your private key is:
'-----BEGIN OPENSSH PRIVATE KEY-----'. If so, generate a new key pair with the command below, then add the new public key to the metadata and update the “SSH Private Key” field in the profile:
- ssh-keygen -m PEM -t rsa -b 4096
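The first-line check described above can be scripted. The following is a small sketch, assuming the private key is stored in a local file; the function name private_key_format is hypothetical. It relies on the fact that an RSA key generated with -m PEM starts with a "BEGIN RSA PRIVATE KEY" header, while the newer OpenSSH format starts with "BEGIN OPENSSH PRIVATE KEY".

```shell
#!/usr/bin/env bash
# Sketch only: private_key_format is a hypothetical helper name.

# Print "pem", "openssh", or "unknown" based on the first line of a
# private key file.
private_key_format() {
  local first_line
  IFS= read -r first_line < "$1"
  case "$first_line" in
    "-----BEGIN RSA PRIVATE KEY-----")     echo "pem" ;;
    "-----BEGIN OPENSSH PRIVATE KEY-----") echo "openssh" ;;
    *)                                     echo "unknown" ;;
  esac
}

# Example usage (not executed here):
#   private_key_format "$HOME/.ssh/cdf_key"
```

If the function prints "openssh", the key is in the format that triggers the error. As an alternative to generating a new pair, recent OpenSSH versions can usually rewrite an existing key in place with ssh-keygen -p -m PEM -f KEY_FILE; treat that as an option to verify on your own system rather than a documented Cloud Data Fusion step.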