Note

See the public version of this document. If you update this page, please also update the public page.

This article explains step by step how to run pipelines against existing Dataproc clusters. This feature is available only in the Enterprise edition of Cloud Data Fusion (“Execution environment selection”).

...

  1. SSH Setup on Dataproc Cluster.

    1. Navigate to the Dataproc console on Google Cloud Platform. Go to “Cluster details” by clicking on your Dataproc cluster name.

    2. Under “VM Instances”, click on the “SSH” button to connect to the master Dataproc VM.

    3. To create a new SSH key, use command:

      1. ssh-keygen -m PEM -t rsa -b 4096 -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]

      2. Leave the passphrase empty; when prompted for one, just press Enter.

    4. This will create two key files:

      1. ~/.ssh/[KEY_FILENAME] (Private Key)

      2. ~/.ssh/[KEY_FILENAME].pub (Public Key)

    5. To view them in an easily copyable format, use the commands:

      1. cat ~/.ssh/[KEY_FILENAME].pub

      2. cat ~/.ssh/[KEY_FILENAME]

    6. Navigate to the GCE VM instance details page. Click Metadata > SSH Keys. Click Edit and add the full public key copied in step [1.e.i]. Make sure to delete any newlines introduced by pasting.
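The key-creation steps above can be sketched as one shell session; the key filename (`dataproc_key`) and comment (`cdf-user`) are placeholders, and the empty passphrase follows step [1.c.ii]:

```shell
# Generate a PEM-format RSA key pair with an empty passphrase
# (dataproc_key and cdf-user are placeholder names).
mkdir -p ~/.ssh
ssh-keygen -m PEM -t rsa -b 4096 -f ~/.ssh/dataproc_key -N "" -C cdf-user

# Print both keys for copying: the public key goes into the VM's
# SSH Keys metadata; the private key goes into the compute profile later.
cat ~/.ssh/dataproc_key.pub
cat ~/.ssh/dataproc_key
```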

...

  1. Create a customized system compute profile for your Data Fusion instance

    1. Navigate to your Data Fusion instance console by clicking on “View Instance”.

    2. Click on “System Admin” in the top-right corner.

    3. Under the “Configuration” tab, expand “System Compute Profiles”. Click on “Create New Profile”, and choose “Remote Hadoop Provisioner” on the next page.

    4. Fill out the general information for the profile.

    5. Host: You can find the SSH host IP of the master node on the “VM instance details” page under Compute Engine.
      If the instance is private, use the master's internal IP rather than the external IP.

    6. User: This is the username you specified when creating the keys in step [1.c.i].

    7. SSH private key: Copy the SSH private key created in step [1.e.ii] and paste it into the “SSH Private Key” field.

      1. Include the beginning and ending lines in your copy:
        -----BEGIN RSA PRIVATE KEY-----
        -----END RSA PRIVATE KEY-----

      2. Make sure your key is an RSA private key, not an OPENSSH key (if it is OPENSSH, make sure you used the command in step [1.c.i] with the -m PEM option).

    8. Click “Create” to create the profile.
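A quick way to verify the format before pasting: check the first line of the private key, and if it is in OPENSSH format, ssh-keygen can rewrite it in place as PEM. The path is a placeholder, and this sketch assumes the key has an empty passphrase:

```shell
# Inspect the private key header (path is a placeholder).
head -n 1 ~/.ssh/dataproc_key

# If the header reads "-----BEGIN OPENSSH PRIVATE KEY-----",
# rewrite the key in place in PEM format (empty old/new passphrases):
ssh-keygen -p -m PEM -P "" -N "" -f ~/.ssh/dataproc_key

# The header should now read "-----BEGIN RSA PRIVATE KEY-----".
```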

  2. Configure your Data Fusion pipeline to use the customized profile.

    1. Click on the pipeline.

    2. Click on Configure > Compute config and choose your newly created profile.

  3. Start the pipeline; it will run against your existing Dataproc cluster.

...

  • If the pipeline fails with a connection timeout, check whether the SSH key and the firewall rules are configured correctly. See step 1 for the SSH setup, and here for firewall rules.

  • If you get an ‘invalid privatekey’ error while running the pipeline, check whether the first line of your private key is:
    '-----BEGIN OPENSSH PRIVATE KEY-----'. If so, try generating a key pair with:

    • ssh-keygen -m PEM -t rsa -b 4096

  • If connecting to the VM via SSH from the command line with the private key works, but the same setup results in an “Auth failed” exception from JSch, verify that OS Login is not enabled. From the Compute Engine UI, click “Metadata” in the menu on the left, and then click on the “Metadata” tab. Delete the “osLogin” key or set it to “FALSE”.
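If you prefer the CLI, the same metadata change can be sketched with gcloud; the instance name and zone are placeholders, and this assumes the standard enable-oslogin metadata key that controls OS Login (the console may show the key under that name rather than “osLogin”):

```shell
# Disable OS Login on the master VM so key-based SSH (used by the
# Remote Hadoop Provisioner) works. INSTANCE_NAME and ZONE are placeholders.
gcloud compute instances add-metadata INSTANCE_NAME \
    --zone ZONE \
    --metadata enable-oslogin=FALSE
```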