Introduction

Google BigQuery runs fast SQL queries against massive, append-only tables using the processing power of Google's infrastructure. Users can load their data into BigQuery and let it handle the hard work.

CDAP now provides an interface for users to work with their datasets in BigQuery.

Use-case

Users want to integrate CDAP with datasets they already store in Google BigQuery.

User Stories

1. As a User, I would like to run arbitrary queries synchronously against my datasets in BigQuery and pull those records into a hydrator pipeline.

2. User can specify a time limit for the query.

3. User can specify a size limit for the dataset to query.

4. User is able to poll for the result.

5. User can list the query result history for a duration of time.

6. The schema is automatically pulled from the table.

7. User can pull the field names from the query.

Example

The following simple example shows how the BigQuery Source would work.

A dataset already exists in Google BigQuery:

Project Id: vernel-ssasie-123456

name	count
Emma	100
Oscar	334
Peter	223
Jay	1123
Nicolas	764
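
To make the example concrete, the sketch below loads the sample rows into an in-memory SQLite database and runs a query against them. SQLite is only a local stand-in for BigQuery here, and the table name `names` and the query itself are assumptions made for illustration. It shows the general idea of pulling records and field names from a query result, not how the plugin talks to BigQuery.

```python
import sqlite3

# Local stand-in for the BigQuery table; "names" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [("Emma", 100), ("Oscar", 334), ("Peter", 223), ("Jay", 1123), ("Nicolas", 764)],
)

# Run an arbitrary query and read the field names from the result itself.
cursor = conn.execute("SELECT name, count FROM names WHERE count > 300 ORDER BY count")
field_names = [column[0] for column in cursor.description]

# Each row becomes one record keyed by field name, as a pipeline source might emit it.
records = [dict(zip(field_names, row)) for row in cursor.fetchall()]
conn.close()

for record in records:
    print(record)
```

With the sample data, this yields three records (Oscar, Nicolas, Jay) with the fields `name` and `count`.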

Design

CDAP provides two types of operations on datasets stored in BigQuery: Query and Poll Results.

Users can use the Query operation to run a SQL query against a specified dataset in BigQuery.

With Poll Results, users can fetch the result of a specific job by its job ID, or fetch a specified number of the latest query results.

Query:

Input	Type	Required	Default
ProjectId	String	Yes
Credential	String	Yes
Query	String	Yes
Limit Time	Integer (min)	No	10
Limit Size	Integer (GB)	No	50
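
As a minimal sketch, the Query inputs above could be modeled as a small configuration object. The class name `QueryConfig`, its field names, and the rule that limits must be positive are assumptions for illustration. Only the required inputs and the defaults (10 minutes, 50 GB) come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryConfig:
    """Hypothetical container for the Query inputs listed in the table."""

    project_id: str
    credential: str
    query: str
    limit_time_min: int = 10  # default from the table, in minutes
    limit_size_gb: int = 50   # default from the table, in GB

    def __post_init__(self):
        # The three required inputs must be non-empty strings.
        for field_name in ("project_id", "credential", "query"):
            value = getattr(self, field_name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{field_name} is required")
        # Assumption: both limits must be positive; the design does not say.
        if self.limit_time_min <= 0 or self.limit_size_gb <= 0:
            raise ValueError("limits must be positive")


# Optional inputs fall back to their defaults when omitted.
config = QueryConfig(
    project_id="vernel-ssasie-123456",
    credential="<credential>",                        # placeholder value
    query="SELECT name, count FROM my_dataset.names",  # hypothetical table
)
print(config.limit_time_min, config.limit_size_gb)
```

Validating at construction time means a missing required input fails before any query is submitted.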

Poll Results:

Using jobId:

Input	Type	Required
ProjectId	String	Yes
JobId	String	Yes

Polling Latest Results:

Input	Type	Required
ProjectId	String	Yes
Poll Number	Integer	Yes
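
The two Poll Results modes can be sketched as one request type that carries either a job ID or a poll number. Everything named here (`PollRequest`, `poll`, the in-memory `history` list and its dictionary layout) is hypothetical and stands in for whatever job history the plugin would read from BigQuery. The sketch also assumes that "latest" means newest first.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class PollRequest:
    """Hypothetical request covering both Poll Results modes."""

    project_id: str
    job_id: Optional[str] = None       # mode 1: fetch by job ID
    poll_number: Optional[int] = None  # mode 2: fetch the latest N results

    def __post_init__(self):
        if not self.project_id:
            raise ValueError("project_id is required")
        # Exactly one of the two modes must be selected.
        if (self.job_id is None) == (self.poll_number is None):
            raise ValueError("specify exactly one of job_id or poll_number")
        if self.poll_number is not None and self.poll_number <= 0:
            raise ValueError("poll_number must be positive")


def poll(request: PollRequest, history: List[Dict]) -> List[Dict]:
    """Return matching jobs from `history`, which is ordered oldest first."""
    if request.job_id is not None:
        return [job for job in history if job["job_id"] == request.job_id]
    # Take the last N entries and reverse them so the newest comes first.
    return history[-request.poll_number:][::-1]


# In-memory stand-in for the query result history.
history = [
    {"job_id": "job-1", "rows": [("Emma", 100)]},
    {"job_id": "job-2", "rows": [("Oscar", 334)]},
    {"job_id": "job-3", "rows": [("Jay", 1123)]},
]

by_id = poll(PollRequest("vernel-ssasie-123456", job_id="job-2"), history)
latest = poll(PollRequest("vernel-ssasie-123456", poll_number=2), history)
print([job["job_id"] for job in by_id])
print([job["job_id"] for job in latest])
```

Rejecting a request that sets both or neither of the two inputs keeps the modes from being mixed silently.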

BigQuery Plugin