...

  1. SSH Setup on Dataproc Cluster.

    1. Navigate to the Dataproc console on Google Cloud Platform. Open “Cluster details” by clicking on your Dataproc cluster name.

    2. Under “VM Instances”, click the “SSH” button to connect to the master Dataproc VM.

    3. To create a new SSH key, follow the steps here, format the public key file to enforce an expiration time, and add the newly created SSH public key at the project or instance level. To generate an SSH key that is compatible with CDF, use the following command instead of the one in the doc link:

      1. ssh-keygen -m PEM -t rsa -b 4096 -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]

    4. This will create 2 key files:

      1. ~/.ssh/[KEY_FILENAME] (Private Key)

      2. ~/.ssh/[KEY_FILENAME].pub (Public Key)

    5. To view these in an easily copyable format, use the commands:

      1. cat ~/.ssh/[KEY_FILENAME].pub

      2. cat ~/.ssh/[KEY_FILENAME]

    6. Navigate to the GCE VM instance detail page and go to Metadata > SSH Keys. Check whether your public SSH key is already listed in the ‘SSH Keys’ section. If not, edit the page and add your username and the full public key copied in step [1.e.i]. Make sure to delete any newlines that may have been pasted in.

    7. If SSH is set up successfully, you should be able to see the SSH key you just added in the Metadata section of your Compute Engine console, as well as in the authorized_keys file on your Dataproc VM.
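
The public key entry added under Metadata > SSH Keys has to be a single line that pairs the username with the key. The sketch below is a minimal, hypothetical helper (the function name `format_ssh_key_entry` and the output file name are illustrative, not part of any Google tool) that builds such a line from the .pub file. The optional expiration form is an assumption based on the `google-ssh {"userName":...,"expireOn":...}` layout described in the Compute Engine SSH key documentation; verify it against that documentation before relying on it.

```shell
# Hypothetical helper: print a one-line "USERNAME:KEY" entry for the SSH Keys metadata field.
# Arguments: username, path to the .pub file, optional expiration timestamp.
format_ssh_key_entry() {
  local username="$1" pubkey_file="$2" expire_on="${3:-}"
  local key_type key_body
  # Keep only the key type and the base64 body; the trailing comment is discarded.
  read -r key_type key_body _ < "$pubkey_file" || true
  if [ -n "$expire_on" ]; then
    # Assumed expiring-key layout; check the Compute Engine docs.
    printf '%s:%s %s google-ssh {"userName":"%s","expireOn":"%s"}\n' \
      "$username" "$key_type" "$key_body" "$username" "$expire_on"
  else
    printf '%s:%s %s %s\n' "$username" "$key_type" "$key_body" "$username"
  fi
}

# Example call (placeholders as in the steps above):
#   format_ssh_key_entry "[USERNAME]" ~/.ssh/[KEY_FILENAME].pub > ssh-keys-entry.txt
```

The resulting line can be pasted into the console as-is. It could also be supplied with `gcloud compute instances add-metadata [INSTANCE_NAME] --metadata-from-file ssh-keys=ssh-keys-entry.txt`, but that replaces the instance's existing ssh-keys value, so any keys already present would need to be included in the file.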

...

  1. Create a customized system compute profile for your Data Fusion instance

    1. Navigate to your Data Fusion instance console by clicking on “View Instance”.

    2. Click on “System Admin” in the top right corner.

    3. Under the “Configuration” tab, expand “System Compute Profiles”. Click on “Create New Profile”, and choose “Remote Hadoop Provisioner” on the next page.

    4. Fill out the general information for the profile.

    5. Host: You can find the IP address of the master node on the “VM instance details” page under Compute Engine.
      If the instance is private, use the master's internal IP rather than the external one.

    6. User: This is the username you specified when creating the keys in step [1.c.i].

    7. SSH private key: Copy the SSH private key shown in step [1.e.ii] and paste it into the “SSH Private Key” field.

      1. Include the beginning and ending marker lines in your copy:
        -----BEGIN RSA PRIVATE KEY-----
        -----END RSA PRIVATE KEY-----

      2. Make sure your key is an RSA private key, not an OPENSSH key (if it is OPENSSH, make sure you used the command in step [1.c.i], including -m PEM).

    8. Click “Create” to create the profile.

  2. Configure your Data Fusion pipeline to use the customized profile

    1. Click on the pipeline.

    2. Click on Configure -> Compute config and choose your newly created profile.

  3. Start the pipeline. It will run against your existing Dataproc cluster.
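
Before pasting a private key into the “SSH Private Key” field of the compute profile, it can help to confirm which format the key file is in. The following is a minimal sketch, assuming the format can be told from the first line of the file; `private_key_format` is a hypothetical helper name.

```shell
# Hypothetical helper: report the format of a private key file based on its first line.
# Prints "pem-rsa", "openssh", or "unknown".
private_key_format() {
  local first_line
  # Read the header line and drop any carriage return left by copy and paste.
  first_line=$(head -n 1 "$1" | tr -d '\r')
  case "$first_line" in
    "-----BEGIN RSA PRIVATE KEY-----") echo "pem-rsa" ;;
    "-----BEGIN OPENSSH PRIVATE KEY-----") echo "openssh" ;;
    *) echo "unknown" ;;
  esac
}

# Example call (placeholder as in the steps above):
#   private_key_format ~/.ssh/[KEY_FILENAME]
```

If the helper prints `openssh`, the key was most likely generated without `-m PEM`. Regenerating it with the command in step [1.c.i] is the route described above. As an alternative that is not part of the original steps, recent OpenSSH versions can usually rewrite an existing key in place with `ssh-keygen -p -m PEM -f ~/.ssh/[KEY_FILENAME]`.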

...