Reading from Amazon S3

This document provides instructions for configuring Wrangler and Pipelines in Cloud Data Fusion (CDF) to read from Amazon S3.

Before you begin

Before following the steps in the other sections, complete the following:

  1. In the Google Cloud Console, create a Cloud Data Fusion instance by following the instructions in the Cloud Data Fusion documentation, and make sure the instance's service account has the following IAM roles:

    • Cloud Data Fusion Admin

    • Cloud Data Fusion API Service Agent

    • Storage Object Viewer

  2. For your existing Amazon S3 bucket, determine which region it's in and whether that region accepts requests signed with both versions of the Signature protocol or with version 4 only. You can find this information on Amazon's website at https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region. (One way to check the bucket's region programmatically is sketched after this list.)

  3. If the region accepts version 4 only, note down one of the valid endpoint names.
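If you have the AWS SDK for Python (boto3) installed, one way to check a bucket's region programmatically is sketched below. This is a minimal sketch, not part of the Cloud Data Fusion setup: the bucket name my-bucket is a placeholder, and the call assumes your AWS credentials are already configured locally.

  # Look up the region an existing S3 bucket lives in.
  import boto3

  s3 = boto3.client("s3")
  response = s3.get_bucket_location(Bucket="my-bucket")  # placeholder bucket name

  # For buckets in us-east-1, LocationConstraint is returned as None.
  region = response["LocationConstraint"] or "us-east-1"
  print(f"Bucket region: {region}")

Once you know the region, use the AWS page linked above to check whether it accepts only version 4 of the Signature protocol.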

Configuring Cloud Data Fusion to read from Amazon S3

  1. In the Cloud Data Fusion UI, go to Wrangler.

  2. If this is the first time you're configuring S3 as a source, on the Wrangler page, click Add Connection and choose S3.

  3. Enter the connection details. Use your own AWS access key ID and secret access key.

  4. Click Test Connection to verify that a connection to Amazon S3 can be established. (To sanity-check the access key outside Cloud Data Fusion first, see the sketch after this list.)

  5. Click Add Connection.

  6. The connection now appears in the left panel.
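If you'd like to sanity-check the AWS access key outside Cloud Data Fusion, as mentioned in step 4, the boto3 sketch below attempts a simple request against a bucket. The key values and bucket name are placeholders; boto3 is assumed to be installed.

  # Verify that an AWS access key can reach a bucket before adding it to CDF.
  import boto3

  s3 = boto3.client(
      "s3",
      aws_access_key_id="YOUR_ACCESS_KEY_ID",          # placeholder
      aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",  # placeholder
  )

  # head_bucket succeeds silently when the key can access the bucket,
  # and raises a ClientError (for example, 403 or 404) otherwise.
  s3.head_bucket(Bucket="my-bucket")  # placeholder bucket name
  print("Credentials can access the bucket.")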

Transforming your data

  1. Choose a bucket and a file within that bucket. Note that even though you specified a region when creating the connection, Amazon S3 lists buckets from all regions that you have access to. Choose a bucket that’s in the same region as the connection.

  2. Transform your data. When you're done, click Create a Pipeline and select Batch.

This creates a pipeline and takes you to the Pipeline Studio.

  3. If the S3 bucket is in a region that accepts only version 4 of the Signature protocol, then:

a) Open the Properties of the created S3 source.

b) In the dialog window that opens, change the scheme of the S3 Path to s3a.

c) Scroll down and add the following under File System Properties (a programmatic check of a version 4-only endpoint is sketched at the end of this section):

{ "fs.s3a.endpoint": "<valid endpoint for the region>" }

For example, for a bucket in the eu-central-1 (Frankfurt) region:

{ "fs.s3a.endpoint": "s3.eu-central-1.amazonaws.com" }

d) Close the pop-up window.

  4. Finish, deploy, and run your pipeline.

Note: Pipeline preview doesn’t currently work with Amazon S3.
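To confirm that a version 4-only endpoint accepts requests before you deploy, as mentioned in step 3c, the following boto3 sketch forces Signature Version 4 signing against an explicit endpoint. The region, endpoint URL, and bucket name are placeholders; substitute the values for your own bucket.

  # List objects through an explicit endpoint using Signature Version 4.
  import boto3
  from botocore.config import Config

  s3 = boto3.client(
      "s3",
      region_name="eu-central-1",                            # placeholder region
      endpoint_url="https://s3.eu-central-1.amazonaws.com",  # placeholder endpoint
      config=Config(signature_version="s3v4"),               # force SigV4 signing
  )

  # An empty bucket returns no Contents key, hence the .get() fallback.
  for obj in s3.list_objects_v2(Bucket="my-bucket").get("Contents", []):
      print(obj["Key"])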