Introduction
Google provides BigQuery for querying massive datasets. It enables fast SQL queries against append-only tables using the processing power of Google's infrastructure. Users can move their data into BigQuery and let it handle the hard work.
CDAP now provides an interface for users to work with their datasets in BigQuery.
Use-case
Users want to integrate CDAP with datasets they have already stored in Google BigQuery.
User Stories
1. As a user, I would like to run arbitrary queries synchronously against my datasets in BigQuery and pull those records into a Hydrator pipeline.
2. As a user, I would like to store data from a Hydrator pipeline into a table (dataset) in BigQuery. If the table doesn't exist, it should be created.
Requirements
1. The user should be able to specify a time limit for the query.
2. The user should be able to specify a size limit for the dataset being queried.
3. The schema is pulled automatically from the table.
4. The user can pull the field names from the query.
Example
Following is a simple example showing how BigQuery Source would work.
A dataset already exists in Google BigQuery:
project Id: vernal-seasdf-123456
dataset name: baby_names
name | count |
---|---|
Emma | 100 |
Oscar | 334 |
Peter | 223 |
Jay | 1123 |
Nicolas | 764 |
The user pulls the schema of the dataset:
Inputs | Value |
---|---|
project Id | vernal-seasdf-123456 |
dataset name | baby_names |
Output schema:
Schema | Type | Required | Description |
---|---|---|---|
name | String | Yes | names of babies born in 2014 |
count | Integer | Yes | the number of occurrences of the name |
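
As a rough illustration of how the output schema above could be derived, the sketch below maps field descriptors with `name`, `type`, `mode` and `description` keys (the shape BigQuery uses for table schemas) onto the rows of the output schema table. The `TYPE_NAMES` mapping and the `to_output_schema` helper are hypothetical names chosen for this example; they are not part of CDAP or of any BigQuery client.

```python
from typing import Dict, List

# Assumed mapping from BigQuery column types to the type names used in this document.
TYPE_NAMES = {"STRING": "String", "INTEGER": "Integer"}


def to_output_schema(fields: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Convert BigQuery-style field descriptors into output schema rows."""
    rows = []
    for field in fields:
        rows.append({
            "Schema": field["name"],
            "Type": TYPE_NAMES[field["type"]],
            # A field is required only when its mode is REQUIRED;
            # a missing mode is treated as NULLABLE.
            "Required": "Yes" if field.get("mode", "NULLABLE") == "REQUIRED" else "No",
            "Description": field.get("description", ""),
        })
    return rows


# Field descriptors for the baby_names example table (illustrative input).
baby_names_fields = [
    {"name": "name", "type": "STRING", "mode": "REQUIRED",
     "description": "names of babies born in 2014"},
    {"name": "count", "type": "INTEGER", "mode": "REQUIRED",
     "description": "the number of occurrences of the name"},
]

schema_rows = to_output_schema(baby_names_fields)
for row in schema_rows:
    print(" | ".join(row.values()))
```

Only the two column types that appear in the example are mapped here; a real plugin would need to cover the remaining BigQuery types.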
The user runs a query against the dataset in BigQuery and pulls the records:
Inputs | Value |
---|---|
project Id | vernal-seasdf-123456 |
query | SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3 |
Output:
name | count |
---|---|
Jay | 1123 |
Nicolas | 764 |
Oscar | 334 |
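
The expected result of the example query can be checked locally with a small sketch. It uses an in-memory SQLite database as a stand-in for the BigQuery dataset, which is an assumption made purely for illustration: the real source would submit the query to BigQuery. It also shows one way the field names could be taken from the query result rather than from the table definition.

```python
import sqlite3

# In-memory SQLite database standing in for the BigQuery dataset (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baby_names (name TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO baby_names VALUES (?, ?)",
    [("Emma", 100), ("Oscar", 334), ("Peter", 223), ("Jay", 1123), ("Nicolas", 764)],
)

# The query from the example above, unchanged.
query = "SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3"
cursor = conn.execute(query)

# Field names are read from the result set produced by the query.
field_names = [column[0] for column in cursor.description]
records = cursor.fetchall()

print(field_names)
for record in records:
    print(record)

conn.close()
```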
Design
CDAP provides two types of operations on datasets stored in BigQuery: Query and Poll Results.
Users can use the Query operation to run a SQL query against a specified dataset in BigQuery.
With Poll Results, users can fetch the result of a specific job by its job ID, or fetch a specified number of the latest query results.
Query:
Inputs | Type | Required | Default |
---|---|---|---|
ProjectId | String | Yes | |
Credential | String | Yes | |
Query | String | Yes | |
Limit Time | Integer (min) | No | 10 |
Limit Size | Integer (GB) | No | 50 |
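
The Query inputs above could be modelled as a small configuration object that applies the defaults and rejects missing required values. This is a minimal sketch under stated assumptions: the `QueryConfig` class, its attribute names, and the conversion helpers are hypothetical, and treating 1 GB as 1024³ bytes is a choice made for this example, not something the design specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryConfig:
    """Hypothetical container for the Query inputs listed in the table above."""
    project_id: str
    credential: str
    query: str
    limit_time_min: int = 10  # default from the table: 10 minutes
    limit_size_gb: int = 50   # default from the table: 50 GB

    def __post_init__(self) -> None:
        # Required inputs must be non-empty.
        for name in ("project_id", "credential", "query"):
            if not getattr(self, name):
                raise ValueError(f"{name} is required")
        # Optional limits must still be positive when supplied.
        if self.limit_time_min <= 0 or self.limit_size_gb <= 0:
            raise ValueError("limits must be positive")

    @property
    def timeout_seconds(self) -> int:
        """Time limit converted from minutes to seconds."""
        return self.limit_time_min * 60

    @property
    def max_bytes(self) -> int:
        """Size limit converted to bytes, assuming 1 GB = 1024**3 bytes."""
        return self.limit_size_gb * 1024 ** 3


# Only the required inputs are given, so both limits fall back to their defaults.
config = QueryConfig(
    project_id="vernal-seasdf-123456",
    credential="placeholder-credential",  # placeholder value, not a real credential
    query="SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3",
)
print(config.timeout_seconds, config.max_bytes)
```

Validating at construction time means a pipeline with a missing required input fails when it is configured, before any query is sent.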
Poll Results:
Using jobId:
Inputs | Type | Required |
---|---|---|
ProjectId | String | Yes |
JobId | String | Yes |
Polling Latest Results:
Inputs | Type | Required |
---|---|---|
ProjectId | String | Yes |
Poll Number | Integer | Yes |
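
The two polling modes can be sketched with an in-memory store that stands in for the completed query jobs of one project. Everything here is hypothetical (`JobResultStore`, `poll_by_job_id`, `poll_latest`, and the job IDs), and the sketch reads "latest results" as the results of the most recently recorded jobs, newest first, which is one interpretation of the design rather than a confirmed behaviour.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int]


class JobResultStore:
    """Hypothetical in-memory stand-in for the finished query jobs of one project."""

    def __init__(self, project_id: str) -> None:
        self.project_id = project_id
        # Plain dicts keep insertion order, so jobs stay in the order they were recorded.
        self._results: Dict[str, List[Row]] = {}

    def record(self, job_id: str, rows: List[Row]) -> None:
        """Store the rows produced by a finished job."""
        self._results[job_id] = rows

    def poll_by_job_id(self, job_id: str) -> List[Row]:
        """Fetch the result of one job; raises KeyError for an unknown job ID."""
        return self._results[job_id]

    def poll_latest(self, poll_number: int) -> Dict[str, List[Row]]:
        """Fetch the results of the last `poll_number` jobs, newest first."""
        if poll_number <= 0:
            raise ValueError("poll_number must be positive")
        latest_ids = list(self._results)[-poll_number:]
        return {job_id: self._results[job_id] for job_id in reversed(latest_ids)}


store = JobResultStore("vernal-seasdf-123456")
store.record("job-1", [("Jay", 1123)])
store.record("job-2", [("Nicolas", 764)])
store.record("job-3", [("Oscar", 334)])

by_id = store.poll_by_job_id("job-2")
latest = store.poll_latest(2)
print(by_id)
print(list(latest))
```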