Migrating data from a PostgreSQL database to BigQuery
This document provides a step-by-step guide to building a Cloud Data Fusion pipeline that reads data from PostgreSQL, transforms it, and writes it to BigQuery.
Prerequisites
Before creating a Cloud Data Fusion pipeline that reads data from PostgreSQL and writes to BigQuery, make sure PostgreSQL is set up and accessible from the Cloud Data Fusion instance.
Instructions
Store the PostgreSQL password as an encrypted secure key on the Cloud Data Fusion instance.
1. On any of the pipeline pages in Cloud Data Fusion, click the System Admin tab in the top-right menu.
2. Click the Configuration tab.
3. Click Make HTTP Calls.
4. In the dropdown menu, choose PUT.
5. In the path field, enter namespaces/default/securekeys/pg_password.
6. In the body of the HTTP call, enter:
{"data":"<your_password>"}
Replace <your_password> with your PostgreSQL password.
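The same secure key can also be created programmatically through the instance's CDAP REST API. The following is a sketch, assuming the default namespace and the key name "pg_password"; the instance endpoint and password are placeholders you must supply:

```shell
# Sketch: store the PostgreSQL password as a secure key via the CDAP REST
# API instead of the System Admin UI. CDAP_ENDPOINT is a placeholder for
# your Cloud Data Fusion instance's API endpoint.
CDAP_ENDPOINT="https://<your-instance-api-endpoint>/api"
AUTH_TOKEN="$(gcloud auth print-access-token)"

# PUT the key under the default namespace; the key name "pg_password"
# matches the secure key selected later in the Wrangler connection.
curl -X PUT "${CDAP_ENDPOINT}/v3/namespaces/default/securekeys/pg_password" \
  -H "Authorization: Bearer ${AUTH_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"description": "PostgreSQL password", "data": "<your_password>"}'
```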
Connect to PostgreSQL using Wrangler
1. Navigate to the Wrangler page.
2. Click on Add Connection. Select Database as the source to connect.
3. Choose Google Cloud SQL for PostgreSQL or one of the PostgreSQL databases to connect to, and click on the Upload link.
4. In the Add third-party driver window that opens, upload the JAR file that contains your PostgreSQL driver. Your JAR file must follow the format <name>-<version>.jar. If your JAR file doesn't follow this format, rename it before you upload it.
5. Click Next.
6. In the Driver configuration tab, enter your driver’s name and class name or leave the default values as they are.
7. Click Finish.
8. In the Add connection: Database window, click on the database type you chose in the previous steps. It should now appear with your JAR name underneath it instead of the previous Upload link.
9. Add PostgreSQL connection details and select the “pg_password” secure key that was created earlier. This step will ensure that the password is retrieved using Cloud KMS.
10. Click Test Connection to verify that the connection can successfully be established with the database.
11. Click Add Connection to complete the task.
Once you’ve completed all the steps, you will be able to click on the newly-connected database in the left navigation panel and see the list of tables for that database.
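As a rough sketch of step 9, the connection form maps to values like the following; the host, port, database, and user are placeholders, and field names may vary slightly by plugin version:

```text
Name:       postgres-connection   (any label you choose)
Host:       <postgres-host>
Port:       5432                  (default PostgreSQL port)
Database:   <database-name>
Username:   <postgres-user>
Password:   pg_password           (select the secure key, not the raw password)
```

These values correspond to a JDBC connection string of the form jdbc:postgresql://<postgres-host>:5432/<database-name>.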
Transform data using Wrangler and build your Data Fusion pipeline
This section uses an example to demonstrate how to transform data. We search for a “persons” table and remove the “first_name” column from the table.
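In Wrangler, transformations applied through the UI are recorded as recipe directives. Removing the column in this example corresponds to the drop directive, sketched here assuming a column literally named first_name:

```text
drop first_name
```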
1. Search for the table you want to wrangle. For example, “persons”.
2. Transform the data in the table. For example, remove the “first_name” column from the table.
3. After wrangling your data, click on Create a Pipeline. This creates a batch pipeline.
4. Click on Sink in Pipeline Studio and choose BigQuery sink.
5. Configure the BigQuery sink with the target project, dataset, and table details.
6. Click on Deploy once the pipeline is built.
7. Click on Run to run the pipeline.
Once the above pipeline succeeds, preview the written data in BigQuery.
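One way to preview the result is the bq command-line tool. The project, dataset, and table names below are placeholders for the values configured on the BigQuery sink:

```shell
# Sketch: preview rows written by the pipeline using the bq CLI.
# <project-id>, <dataset>, and <table> are placeholders for the
# values you configured on the BigQuery sink.
bq query --use_legacy_sql=false \
  'SELECT * FROM `<project-id>.<dataset>.<table>` LIMIT 10'
```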
Related articles
JDBC drivers with Cloud Data Fusion