This article explains, step by step, how to run pipelines against an existing Dataproc cluster. This feature is available only in the Enterprise edition of Cloud Data Fusion ("Execution environment selection").

...

  • An existing Dataproc cluster, which can be set up by following this guide.

  • A Cloud Data Fusion instance and a data pipeline of your choice. Learn how to create a new instance by following this guide.

Instructions

  1. SSH Setup on Dataproc Cluster.

    1. Navigate to the Dataproc console on Google Cloud Platform. Go to “Cluster details” by clicking on your Dataproc cluster name.

    2. Under “VM Instances”, click the “SSH” button to connect to the Dataproc VM.

    3. To create a new SSH key, follow the steps here, format the public key file to enforce an expiration time, and add the newly created SSH public key at the project or instance level (see the example commands after these sub-steps).

      1. Use the command ssh-keygen -m PEM -t rsa -b 4096 instead of the one in the linked doc to generate an SSH key that is compatible with CDF.

    4. If SSH is set up successfully, you should see the SSH key you just added in the Metadata section of your Compute Engine console, as well as in the authorized_keys file on your Dataproc VM.

    5. Check the GCE VM instance details page to confirm that your public SSH key appears in the “SSH Keys” section. If it does not, edit the page and add your username and public key (the verification commands after these sub-steps show an equivalent check).
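
A minimal sketch of steps 3 and 3.1, assuming a Linux or macOS shell: the key file name ~/.ssh/cdf-dataproc, the username cdf-user, and the expiration timestamp below are placeholder values, not ones prescribed by this article.

# Generate a 4096-bit RSA key pair in PEM format (the format CDF can use).
ssh-keygen -m PEM -t rsa -b 4096 -f ~/.ssh/cdf-dataproc

# When an expiration time is enforced, the public key entry added to Compute Engine metadata
# generally takes this shape (username, key material, and expireOn value are placeholders):
#   cdf-user:ssh-rsa AAAAB3NzaC1yc2E... google-ssh {"userName":"cdf-user","expireOn":"2025-01-01T00:00:00+0000"}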
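
One rough way to verify steps 4 and 5 from a terminal, assuming gcloud is installed and authenticated; the instance name my-cluster-m and zone us-central1-a are placeholders for your own Dataproc VM.

# List project-wide SSH keys stored in Compute Engine metadata.
gcloud compute project-info describe --format="value(commonInstanceMetadata.items)"

# Inspect a specific VM's metadata (instance name and zone are placeholders).
gcloud compute instances describe my-cluster-m --zone us-central1-a --format="value(metadata.items)"

# On the Dataproc VM itself, the added public key should appear in authorized_keys.
cat ~/.ssh/authorized_keys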

...