Cloud BigQuery Execute Action

Overview

There is often a need to stitch together a series of actions that allow a pipeline to achieve a specific use case. One of the actions recently requested is a Cloud BigQuery Execute action. This action is responsible for running a Cloud BigQuery SQL query, waiting for it to finish, and, on success, continuing with further processing in the pipeline.

Issue

[CDAP-14539] Add a Google BigQuery Execute Action Plugin. 

Use case

ELT use case - The raw data is loaded into data warehouse staging tables. Once the data is in the staging tables, transformations are applied using SQL. A series of SQL queries needs to execute in order to prepare the data for analytics. The SQL queries that specify the transformations have to be stitched together as part of the pipeline and scheduled either in parallel or serially. Executing the series of SQL queries transforms the data from the staging tables into the final fact tables in the warehouse.

As an example:

  • Periodically fetch Omniture Click Data from SFTP
  • Load the raw data into Cloud BigQuery staging table
  • Apply a series of SQL queries to transform the data
  • Generate or update 5-6 auxiliary tables and update the main fact table

Requirements

  1. User can specify legacy SQL or standard (non-legacy) SQL to execute - default is standard SQL
  2. User can set the priority for the query - default is 1
  3. User can specify the mode - interactive or batch - for the query execution - default is batch
  4. User can specify a non-temporary (permanent) table to which the results are written - default is a transient table
  5. User can specify whether the query cache is used - default is false
  6. User can specify retry options in case of failures - default is 1 retry
  7. User can specify whether to continue in case of failure - default is to fail the pipeline if the query is not successful
  8. User can use the plugin in a non-GCP environment
  9. User is able to set the maximum billing tier for processing - default is 1
  10. User can specify encryption with a KMS key - default is unset
  11. Plugin passes the query and job id to the next node in the DAG
  12. User is able to specify options 1-9 as macros. 
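The requirements above map closely onto the job configuration accepted by the BigQuery v2 REST API (`jobs.insert`). As a rough sketch of how the plugin's configuration could be translated into a request body (the helper function itself is hypothetical; the field names follow the BigQuery v2 REST API):

```python
def build_query_job_config(sql,
                           use_legacy_sql=False,   # requirement 1: default is standard SQL
                           priority="BATCH",       # requirement 3: default is batch
                           use_cache=False,        # requirement 5: default is false
                           max_billing_tier=1,     # requirement 9: default is 1
                           destination=None,       # requirement 4: optional permanent table
                           kms_key_name=None):     # requirement 10: unset by default
    """Build a `jobs.insert` request body for a BigQuery query job.

    Hypothetical helper; field names follow the BigQuery v2 REST API.
    """
    query_config = {
        "query": sql,
        "useLegacySql": use_legacy_sql,
        "priority": priority,                 # "BATCH" or "INTERACTIVE"
        "useQueryCache": use_cache,
        "maximumBillingTier": max_billing_tier,
    }
    if destination is not None:
        # destination is a (project, dataset, table) tuple; results are
        # written to this permanent table instead of a transient one.
        project, dataset, table = destination
        query_config["destinationTable"] = {
            "projectId": project,
            "datasetId": dataset,
            "tableId": table,
        }
    if kms_key_name is not None:
        query_config["destinationEncryptionConfiguration"] = {
            "kmsKeyName": kms_key_name,
        }
    return {"configuration": {"query": query_config}}
```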

Limitations

  1. Records retrieved by the query are not stored or exposed by the plugin

Design

  • Plugin type : Action
  • Configurations
    • Query
      • SQL
        • Specifies the query to be executed. The query should be validated at deployment time through a dry run of the BigQuery job. 
      • Legacy SQL
        • Boolean option specifying whether the SQL above is legacy SQL
    • Execution Options
      • Mode - Batch or Interactive
        • Specifies whether the SQL query should be queued and executed when resources are available (batch) or run immediately (interactive). 
      • Cache Enabled/Disabled
        • Boolean option that specifies whether the query cache can be used. Enabling caching can speed up query execution.
      • KMS Key Name
        • Specifies the name of the KMS key used to encrypt the query results
      • Max Billing Tier
        • Specifies the maximum billing tier to be used when executing the query
      • Non-Temporary Dataset
        • Specifies the name of the permanent dataset in which the results of execution should be stored. 
      • Non-Temporary Table
        • Specifies the name of the permanent table within a dataset in which the results of the execution should be stored. 
    • Project (Optional if running in GCP)
      • Project Id
        • Specifies the GCP project ID. This needs to be set when CDAP is running outside of a GCP environment; within GCP it defaults to 'auto-detect'
      • Service Account File
        • Specifies the path to the service account file used to access Google BigQuery when the CDAP instance is running outside of GCP. Within GCP this defaults to 'auto-detect'
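The deployment-time validation mentioned under SQL, and the retry behavior from the requirements, could be sketched as follows. `submit_job` stands in for the actual call that submits the job configuration to BigQuery and is hypothetical:

```python
import time

def validate_query(submit_job, job_config):
    """Validate a query at deployment time via a BigQuery dry run.

    Setting `configuration.dryRun` makes BigQuery plan the query and
    report errors without executing it or incurring query charges.
    `submit_job` is a hypothetical callable wrapping `jobs.insert`.
    """
    # Shallow-copy two levels so the caller's config is not mutated.
    dry_run_config = dict(job_config)
    dry_run_config["configuration"] = dict(job_config["configuration"])
    dry_run_config["configuration"]["dryRun"] = True
    return submit_job(dry_run_config)

def run_with_retries(submit_job, job_config, retries=1, backoff_seconds=2):
    """Run the query, retrying on failure (requirement 6; default is 1 retry)."""
    attempts = retries + 1
    for attempt in range(attempts):
        try:
            return submit_job(job_config)
        except Exception:
            if attempt == attempts - 1:
                raise  # requirement 7: fail the pipeline by default
            time.sleep(backoff_seconds * (2 ** attempt))
```

Whether a final failure propagates (failing the pipeline) or is swallowed (continuing to the next node) would be governed by the continue-on-failure option in the requirements.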

Example

A sample pipeline that reads data from a URL, parses and cleanses it, and writes to BigQuery staging tables before ELT'ing into separate tables. 
