As a user, I would like to run arbitrary queries synchronously against my datasets in BigQuery and pull those records in BigQuery and pull those records in a hydrator pipeline.

Requirements

User should provide the correct project id which he has access to.
User should provide the SQL query against a dataset inside his project.
User should specify the limit time for the queryingprovide the Json Key File of the project.
User should provide the temporary google cloud storage bucket to store the intermediate query result.
User should specify the limit time for the querying. provide the fully qualified input table in BigQuery. If import query is not provided, this table will be downloaded to
temporary Google Cloud Storage directory and finally to the CDAP source. If import query is given, this table will be a intermediate
storage to cache the query result and then downloaded to the temporary directory in Google Cloud Storage before going to CDAP source.
The temporary Google Cloud Storage directory should be deleted manually to avoid unnecessary charge from Google.

Example

Following is a are two simple example examples showing how BigQuery Source would work.

...

A dataset already exist in Google BigQuery:121

project Id: vernal-seasdf-123456project1

dataset name: baby_names

table name: names_2014

name	count
Emma	100
Oscar	334
Peter	223
Jay	1123
Nicolas	764

...

example1:

...

Code Block

...

language

...

Output schema is as follows:

Schema	Type	Required	Description
name	String	Yes	names of baby born in 2014
count	Integer	Yes	the number of occurrences of the name

User run query agains dataset in BigQuery and pull the records:

Configuration is specified as follows

♦ project Id

♦ vernal-seasdf-123456

♦ query

♦ SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3

Out put is as follows

xml

{
        "name": "BigQuery",
          "properties": {
            "referenceName": "bigquery",
            "projectId": "vernal-project1",
            "tempBuketPath": "gs://bucketName.datasetName/tableName",
            "jsonFilePath": "/path/to/jsonkeyfile",
            "InputTableId": "vernal-project1:babynames.names_2014",
            "outputSchema": "name:string,count:int"
          }
        }
      }

This source will read the vernal-project1:babynames.names_2014 table, download the whole table to gs://bucketName.datasetName/tableName, and then get the data from there.

example2:

Code Block

language	xml

{
        "name": "BigQuery",
          "properties": {
            "referenceName": "bigquery",
            "projectId": "vernal-project1",
            "tempBuketPath": "gs://bucketName.datasetName/tableName",
            "jsonFilePath": "/path/to/jsonkeyfile",
            "importQuery":"SELECT name as babyName, count as nameCount FROM [vernal-project1:babynames.names_2014] ORDER BY count DESC LIMIT 3",
            "InputTableId": "vernal-project1:babynames.blankTable",
            "outputSchema": "babyName:string,babyCount:int"
          }
        }
      }

Before running this source, user should create a blank table with schema : {babyname:string, babyCount:int}. File in r example, make this blank table in vernal-project1:babynames.

and the output of the source is as follows:

babyName	babyCount
Jay	1123
Nicolas	764
Oscar	334

Implementation Tips

Design

Inputs	type	required	default
ProjectId	String	Yes
Credencial	String	Yes
Query	String	Yes
Limit Time	Integer (min)	No	10
Limit Size	Integer (GB)	No	50

...

Versions Compared

Old Version 20

New Version Current

Key

Example

example1:

User run query agains dataset in BigQuery and pull the records:

example2:

Implementation Tips

Page Comparison

Versions Compared

Old Version 20

New Version Current

Key

Example

example1:

User run query agains dataset in BigQuery and pull the records:

example2:

Implementation Tips