This document explains how to configure Wrangler and Pipelines in Cloud Data Fusion (CDF) to read from Amazon S3. Make sure you have met the prerequisites below before following the instructions.
Prerequisites
Complete the following prerequisites before you proceed with the configuration instructions. The Cloud Data Fusion steps are performed in the Google Cloud Console.
Create a Cloud Data Fusion instance. Follow the instructions here and ensure that the service account has the following IAM roles:
Cloud Data Fusion Admin
Cloud Data Fusion API Service Agent
Storage Object Viewer
For your existing Amazon S3 bucket, make sure you know what region it is in.
Determine whether that region accepts requests signed with both versions (2 and 4) or only version 4 of the Signature protocol. You can find this information on Amazon’s website at https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region.
If the region accepts version 4 only, note down one of the valid endpoint names.
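The endpoint name mentioned in the last prerequisite can be sketched in Python. The helper below is hypothetical (it is not part of CDF or of any AWS SDK) and assumes the standard s3.<region>.amazonaws.com naming used by the commercial AWS regions. Confirm the result against Amazon's endpoint list before relying on it. The region eu-central-1 (Frankfurt), which accepts version 4 only, is used purely as an example.

```python
def s3_regional_endpoint(region: str) -> str:
    """Build the regional S3 endpoint name for a standard commercial AWS region.

    Hypothetical helper for illustration only. It assumes the
    s3.<region>.amazonaws.com pattern and does not cover special
    partitions that use a different domain suffix.
    """
    region = region.strip().lower()
    if not region:
        raise ValueError("region must not be empty")
    return f"s3.{region}.amazonaws.com"


# Example region chosen for illustration; substitute your bucket's region.
print(s3_regional_endpoint("eu-central-1"))  # prints s3.eu-central-1.amazonaws.com
```

Note down the printed value; it is what goes into the fs.s3a.endpoint property later in the instructions.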
Instructions
Configure Cloud Data Fusion
In CDF, go to Wrangler.
3. If this is the first time you are configuring S3 as a source, click the Add Connection button on the Wrangler screen and choose S3.
4. Enter the information for this connection. Note that the credentials in this screenshot are made up and will not work. You need to use your own AWS access key.
5. Click Test Connection to verify that a connection to Amazon S3 can be established.
6. Click Add Connection.
7. The connection now appears in the left-hand panel.
8. Pick a bucket and a file within that bucket to wrangle. Note that even though you specified a region when creating the connection, Amazon S3 will list buckets from all regions that you have access to. It is advisable to choose a bucket in the same region as the connection.
9. Once you are done wrangling, click Create a Pipeline and select Batch:
This creates a pipeline and takes you to the Studio.
10. If the S3 bucket is in a region that accepts only version 4 of the Signature protocol, then:
a) Open the Properties of the created S3 source.
b) In the dialog window that pops up, change the scheme of the S3 Path to s3a, so that the path starts with s3a://.
c) Scroll down and add the following under File System Properties:
{ "fs.s3a.endpoint": "<valid endpoint for the region>" }
For example:
d) Close the pop-up.
11. Finish, deploy and run your pipeline.
Note: Pipeline preview will not currently work with Amazon S3.
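The two changes made in step 10 (switching the path scheme to s3a and adding the endpoint under File System Properties) can be sketched in Python as a way to prepare the values before pasting them into the S3 source properties. Both helper functions, the bucket name, the object key, and the example endpoint are assumptions made for illustration. They are not part of CDF.

```python
import json


def to_s3a_path(path: str) -> str:
    """Rewrite the scheme of an S3 path to s3a, keeping bucket and key unchanged.

    Hypothetical helper: accepts paths that start with s3://, s3n:// or s3a://.
    """
    scheme, sep, rest = path.partition("://")
    if not sep or scheme not in ("s3", "s3n", "s3a"):
        raise ValueError(f"not an S3 path: {path!r}")
    return "s3a://" + rest


def file_system_properties(endpoint: str) -> str:
    """Return the JSON text for the File System Properties field."""
    return json.dumps({"fs.s3a.endpoint": endpoint})


# Example values only; use your own bucket, key, and regional endpoint.
print(to_s3a_path("s3n://example-bucket/data/input.csv"))
# prints s3a://example-bucket/data/input.csv
print(file_system_properties("s3.eu-central-1.amazonaws.com"))
# prints {"fs.s3a.endpoint": "s3.eu-central-1.amazonaws.com"}
```

Generating the JSON with json.dumps rather than typing it by hand avoids unbalanced quotes or brackets in the property value.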