This document provides instructions for configuring Wrangler and Pipelines in Cloud Data Fusion (CDF) to read from Amazon S3. Make sure you've met the prerequisites before following the instructions in this document.

Prerequisites

Make sure the following prerequisites are in place before you proceed with the configuration instructions. Perform these steps in the Google Cloud Console.

...

Before you begin

Before you begin the other sections, follow these steps.

  1. In the Google Cloud Console, create a Data Fusion instance. Follow the instructions here, and make sure the following IAM roles are granted to the service account (a sketch after this list shows one way to grant them programmatically):

    • Cloud Data Fusion Admin

    • Cloud Data Fusion API Service Agent

    • Storage Object Viewer

  2. For your existing Amazon S3 bucket, make sure you know which region it's in. Determine whether that region accepts requests in both versions or only version 4 of the Signature protocol. You can find this information on Amazon's website at https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region. (A sketch after this list shows how to look up the bucket's region programmatically.)

  3. If the region accepts version 4 only, note down one of the valid endpoint names.
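
As mentioned in step 1, the roles can also be granted programmatically. The following is a minimal sketch, not taken from this document, that uses the Resource Manager API through the google-api-python-client library; the project ID and service account address are placeholders, and Application Default Credentials are assumed:

    # Grant the three IAM roles to the Data Fusion service account
    # at the project level. PROJECT_ID and SERVICE_ACCOUNT are
    # placeholders for your own values; Application Default
    # Credentials (gcloud auth application-default login) are assumed.
    from googleapiclient import discovery

    PROJECT_ID = "my-project"  # placeholder
    SERVICE_ACCOUNT = "service-account@my-project.iam.gserviceaccount.com"  # placeholder
    ROLES = [
        "roles/datafusion.admin",         # Cloud Data Fusion Admin
        "roles/datafusion.serviceAgent",  # Cloud Data Fusion API Service Agent
        "roles/storage.objectViewer",     # Storage Object Viewer
    ]

    crm = discovery.build("cloudresourcemanager", "v1")

    # Read the current policy, append a binding per role, write it back.
    policy = crm.projects().getIamPolicy(resource=PROJECT_ID, body={}).execute()
    for role in ROLES:
        policy["bindings"].append(
            {"role": role, "members": [f"serviceAccount:{SERVICE_ACCOUNT}"]}
        )
    crm.projects().setIamPolicy(
        resource=PROJECT_ID, body={"policy": policy}
    ).execute()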
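
As mentioned in step 2, you can also look up a bucket's region programmatically. This minimal boto3 sketch assumes your AWS credentials are already configured (for example, in environment variables or ~/.aws/credentials), and the bucket name is a placeholder:

    import boto3

    s3 = boto3.client("s3")

    # GetBucketLocation returns None for us-east-1 and the region
    # name (for example "eu-central-1") for every other region.
    location = s3.get_bucket_location(Bucket="my-example-bucket")
    region = location["LocationConstraint"] or "us-east-1"
    print(f"Bucket region: {region}")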

Instructions

...

Configuring Cloud Data Fusion to read from Amazon S3

  1. In the Cloud Data Fusion UI, go to Wrangler.

...

  2. If this is the first time you are configuring S3 as a source, click the Add Connection button on the Wrangler page.

...

Choose S3.

...


  3. Enter the information for this connection. Note that the credentials in this screenshot are made up and will not work; you need to use your own AWS access key.

...

  6. You will now see the connection in the left-hand panel on the screen.

...

...

Transforming your data

  1. Choose a bucket and a file within that bucket. Note that even though you specified a region when creating the connection, Amazon S3 lists buckets from all regions that you have access to. Choose a bucket that's in the same region as the connection (the sketch after this step shows one way to check).

...
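
As noted in step 1, S3's ListBuckets operation is global, which is why Wrangler shows buckets from every region you have access to. This minimal boto3 sketch, not taken from this document and assuming your AWS credentials are already configured, prints each bucket alongside its region so you can pick one that matches the connection:

    import boto3

    s3 = boto3.client("s3")

    # ListBuckets is global: it returns buckets from every region
    # these credentials can access.
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        # GetBucketLocation returns None for us-east-1.
        region = (
            s3.get_bucket_location(Bucket=name)["LocationConstraint"]
            or "us-east-1"
        )
        print(f"{name}: {region}")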

  2. Transform your data. When you're done, click Create a Pipeline and select Batch.

...

This creates a pipeline and takes you to the Pipeline Studio.

  3. If the S3 bucket is in a region that accepts version 4 of the Signature protocol only, then:

...

    b) In the dialog window that opens, change the scheme of the S3 Path to s3a:.

...

    c) Scroll down and add the following under File System Properties (a generic illustration follows these sub-steps):

...

For example:

...

    d) Close the dialog window.
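
As a generic illustration of what a File System Properties entry can look like (this is an assumption about a typical setup, not the values from this document): Hadoop's S3A connector reads its endpoint from the fs.s3a.endpoint property, so for a Signature Version 4-only region such as eu-central-1 a typical entry is {"fs.s3a.endpoint": "s3.eu-central-1.amazonaws.com"}. Substitute the endpoint name you noted in the Before you begin section.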

  4. Finish, deploy, and run your pipeline.
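
A deployed batch pipeline can also be started programmatically. The sketch below is an illustration under stated assumptions, not a procedure from this document: the instance API endpoint, pipeline name, and access token are placeholders, and the pipeline is assumed to be deployed in the default namespace of the CDAP REST API that Cloud Data Fusion exposes:

    # Start a deployed batch pipeline through the CDAP REST API.
    # INSTANCE_API and PIPELINE_NAME are placeholders; the access
    # token can come from `gcloud auth print-access-token`.
    import requests

    INSTANCE_API = "https://my-instance-endpoint/api"  # placeholder
    PIPELINE_NAME = "S3BatchPipeline"                  # placeholder
    ACCESS_TOKEN = "your-oauth2-access-token"          # placeholder

    resp = requests.post(
        f"{INSTANCE_API}/v3/namespaces/default/apps/{PIPELINE_NAME}"
        "/workflows/DataPipelineWorkflow/start",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    )
    resp.raise_for_status()
    print("Pipeline start requested.")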

Note: Pipeline preview will not currently work with Amazon S3.
