Introduction
Google provides BigQuery for querying massive datasets: it enables super-fast SQL queries against append-only tables using the processing power of Google's infrastructure. Users can move their data into BigQuery and let it handle the hard work.
CDAP now provides an interface for users to work with their datasets in BigQuery.
Use-case
Users want to integrate CDAP with datasets they have already stored in Google BigQuery.
User Stories
As a user, I would like to run arbitrary queries synchronously against my datasets in BigQuery and pull those records into a Hydrator pipeline.
Requirements
- User should provide the correct project ID that they have access to.
- User should provide a SQL query against a dataset inside that project.
- User should specify the time limit for the query.
- User should specify the size limit for the query.
Example
The following is a simple example showing how the BigQuery Source would work.
A dataset already exists in Google BigQuery:
project Id: vernal-seasdf-123456
dataset name: baby_names
name | count |
---|---|
Emma | 100 |
Oscar | 334 |
Peter | 223 |
Jay | 1123 |
Nicolas | 764 |
The user pulls the schema of the dataset:
Inputs | Value |
---|---|
project Id | vernal-seasdf-123456 |
dataset name | baby_names |
Output schema is as follows:
Schema | Type | Nullable | Description |
---|---|---|---|
name | String | No | names of babies born in 2014 |
count | Integer | No | the number of occurrences of the name |
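For illustration only, here is a minimal sketch of how this output schema could be declared with the CDAP Schema API; the record name and the integer type mapping are assumptions, not part of the design.

```java
import co.cask.cdap.api.data.schema.Schema;

public class BabyNamesSchema {
  // Output schema from the table above: two non-nullable fields.
  // INT is assumed here; the plugin could equally map BigQuery's INTEGER to LONG.
  static final Schema OUTPUT_SCHEMA = Schema.recordOf(
      "babyNames",
      Schema.Field.of("name", Schema.of(Schema.Type.STRING)),
      Schema.Field.of("count", Schema.of(Schema.Type.INT)));
}
```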
The user runs a query against the dataset in BigQuery and pulls the records:
Configuration is specified as follows:
Inputs | Value |
---|---|
project Id | vernal-seasdf-123456 |
query | SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3 |
Output is as follows:
name | count |
---|---|
Jay | 1123 |
Nicolas | 764 |
Oscar | 334 |
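Putting the example together, the sketch below shows one way such a query could be issued synchronously, assuming the google-cloud-bigquery Java client and Application Default Credentials; the client library, SQL dialect, and table qualification actually used by the plugin may differ.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class BigQueryQueryExample {
  public static void main(String[] args) throws InterruptedException {
    // Uses Application Default Credentials and the project ID from the example above.
    BigQuery bigquery = BigQueryOptions.newBuilder()
        .setProjectId("vernal-seasdf-123456")
        .build()
        .getService();

    // The query string from the example; depending on the SQL dialect, the table may
    // need to be referenced as <dataset>.<table>.
    QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(
        "SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3").build();

    // Runs the query synchronously and iterates over the returned rows.
    TableResult result = bigquery.query(queryConfig);
    for (FieldValueList row : result.iterateAll()) {
      System.out.println(row.get("name").getStringValue() + " | " + row.get("count").getLongValue());
    }
  }
}
```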
Implementation Tips
- What authorization roles are required by this plugin?
- An application default credential is required; see Google's documentation on Application Default Credentials for how to obtain one.
- I see a few additional config options on the query API. Are those configurable by the user?
- For now, the user needs to configure the project ID, the path to the local private-key credential, the query string, and the time limit.
- Create a simple batch source in the Hydrator plugins with all needed dependencies (see the sketch after this list).
- Add an endpoint to run queries against datasets in BigQuery.
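Below is a rough sketch of what the batch source skeleton could look like, assuming the cdap-etl-api BatchSource base class and a text-based Hadoop InputFormat that delivers one query-result row per record; the class names, generics, and parsing logic are illustrative only, not the final design.

```java
import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.annotation.Plugin;
import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.api.data.schema.Schema;
import co.cask.cdap.api.dataset.lib.KeyValue;
import co.cask.cdap.etl.api.Emitter;
import co.cask.cdap.etl.api.PipelineConfigurer;
import co.cask.cdap.etl.api.batch.BatchSource;
import co.cask.cdap.etl.api.batch.BatchSourceContext;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

@Plugin(type = BatchSource.PLUGIN_TYPE)
@Name("BigQuery")
@Description("Runs a SQL query against a BigQuery dataset and emits the result rows as records.")
public class BigQuerySource extends BatchSource<LongWritable, Text, StructuredRecord> {

  // Schema of the example query result; a real plugin would derive this from the query.
  private static final Schema OUTPUT_SCHEMA = Schema.recordOf(
      "queryResult",
      Schema.Field.of("name", Schema.of(Schema.Type.STRING)),
      Schema.Field.of("count", Schema.of(Schema.Type.LONG)));

  @Override
  public void configurePipeline(PipelineConfigurer pipelineConfigurer) {
    pipelineConfigurer.getStageConfigurer().setOutputSchema(OUTPUT_SCHEMA);
  }

  @Override
  public void prepareRun(BatchSourceContext context) throws Exception {
    // Configure the Hadoop InputFormat that submits the query and reads its results here.
  }

  @Override
  public void transform(KeyValue<LongWritable, Text> input, Emitter<StructuredRecord> emitter) {
    // Parse one result row (assumed here to arrive as delimited text) into a StructuredRecord.
    String[] fields = input.getValue().toString().split(",");
    emitter.emit(StructuredRecord.builder(OUTPUT_SCHEMA)
        .set("name", fields[0])
        .set("count", Long.parseLong(fields[1]))
        .build());
  }
}
```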
Design
Inputs | Type | Required | Default |
---|---|---|---|
ProjectId | String | Yes | |
Credentials | String | Yes | |
Query | String | Yes | |
Limit Time | Integer (minutes) | No | 10 |
Limit Size | Integer (GB) | No | 50 |
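A possible sketch of the plugin configuration class backing this table, using CDAP's PluginConfig; the property names and descriptions are illustrative and would be finalized during implementation.

```java
import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.plugin.PluginConfig;

import javax.annotation.Nullable;

public class BigQuerySourceConfig extends PluginConfig {

  @Name("projectId")
  @Description("The Google Cloud project ID the user has access to.")
  private String projectId;

  @Name("credentials")
  @Description("Path to the local private-key credential used for authorization.")
  private String credentials;

  @Name("query")
  @Description("The SQL query to run against a dataset in the project.")
  private String query;

  @Name("limitTime")
  @Description("Maximum time in minutes the query may run; defaults to 10.")
  @Nullable
  private Integer limitTime;

  @Name("limitSize")
  @Description("Maximum amount of data in GB the query may process; defaults to 50.")
  @Nullable
  private Integer limitSize;
}
```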