...
- The user should provide the correct project ID, one they have access to.
- The user should provide a SQL query against a dataset inside their project.
- The user should provide the JSON key file of the project.
- The user should provide a temporary Google Cloud Storage bucket to store the intermediate query result.
- The user should provide the fully qualified input table in BigQuery. If no import query is provided, this table is downloaded to the temporary Google Cloud Storage directory and finally to the CDAP source. If an import query is given, this table serves as intermediate storage that caches the query result, which is then downloaded to the temporary directory in Google Cloud Storage before going to the CDAP source.
- The temporary Google Cloud Storage directory should be deleted manually to avoid unnecessary charges from Google (see the cleanup sketch after this list).
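Since the plugin does not clean up after itself, something like the following could remove the leftover temporary directory. This is a minimal sketch assuming the google-cloud-storage Python client; the project, bucket, and prefix names are placeholders taken from the examples below.

```python
# Minimal cleanup sketch: delete the temporary objects the source
# left behind in Google Cloud Storage. All names are placeholders.
from google.cloud import storage

client = storage.Client(project="vernal-project1")
# Every object under the temporary directory prefix is deleted.
for blob in client.list_blobs("bucketName.datasetName", prefix="tableName/"):
    blob.delete()
```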
Examples
Following are two simple examples showing how the BigQuery Source would work.
...
A dataset already exists in Google BigQuery:
project Id: vernal-project1
dataset name: babynames
table name: names_2014
name | count |
---|---|
Emma | 100 |
Oscar | 334 |
Peter | 223 |
Jay | 1123 |
Nicolas | 764 |
...
Example 1:
...
...
Output schema is as follows:
Field name | Type | Nullable | Description |
---|---|---|---|
name | String | No | name of a baby born in 2014 |
count | Integer | No | the number of occurrences of the name |
The user runs a query against the dataset in BigQuery and pulls the records. The configuration is specified as follows:

- project Id: vernal-project1
- query: SELECT name, count FROM [vernal-project1:babynames.names_2014] ORDER BY count DESC LIMIT 3

The output is as follows:

name | count |
---|---|
Jay | 1123 |
Nicolas | 764 |
Oscar | 334 |

The plugin configuration for Example 1, which downloads the whole table without an import query, is as follows:
```json
{
  "name": "BigQuery",
  "properties": {
    "referenceName": "bigquery",
    "projectId": "vernal-project1",
    "tempBucketPath": "gs://bucketName.datasetName/tableName",
    "jsonFilePath": "/path/to/jsonkeyfile",
    "inputTableId": "vernal-project1:babynames.names_2014",
    "outputSchema": "name:string,count:int"
  }
}
```
This source will read the vernal-project1:babynames.names_2014 table, download the whole table to gs://bucketName.datasetName/tableName, and then get the data from there.
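To illustrate what this whole-table mode does under the hood, here is a rough equivalent of the staging step using the BigQuery Python client. It is a sketch only, not the plugin's actual implementation; the table and bucket path come from the example config above, and the wildcard file layout is an assumption.

```python
# Sketch of the staging step: export the BigQuery table to the
# temporary GCS location before it is read into the pipeline.
from google.cloud import bigquery

client = bigquery.Client(project="vernal-project1")
extract_job = client.extract_table(
    "vernal-project1.babynames.names_2014",          # source table
    "gs://bucketName.datasetName/tableName/part-*",  # temp GCS path (assumed layout)
)
extract_job.result()  # block until the export job finishes
```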
Example 2:
```json
{
  "name": "BigQuery",
  "properties": {
    "referenceName": "bigquery",
    "projectId": "vernal-project1",
    "tempBucketPath": "gs://bucketName.datasetName/tableName",
    "jsonFilePath": "/path/to/jsonkeyfile",
    "importQuery": "SELECT name as babyName, count as babyCount FROM [vernal-project1:babynames.names_2014] ORDER BY count DESC LIMIT 3",
    "inputTableId": "vernal-project1:babynames.blankTable",
    "outputSchema": "babyName:string,babyCount:int"
  }
}
```
Before running this source, the user should create a blank table with the schema {babyName:string, babyCount:int}. For this example, create this blank table in vernal-project1:babynames, as in the sketch below.
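A minimal way to pre-create that blank table, sketched with the BigQuery Python client (the plugin itself does not do this for you; the client usage here is illustrative):

```python
# Sketch: create the empty intermediate table Example 2 expects,
# with the schema {babyName:string, babyCount:int}.
from google.cloud import bigquery

client = bigquery.Client(project="vernal-project1")
schema = [
    bigquery.SchemaField("babyName", "STRING"),
    bigquery.SchemaField("babyCount", "INTEGER"),
]
table = bigquery.Table("vernal-project1.babynames.blankTable", schema=schema)
client.create_table(table)  # empty table; the import query will fill it
```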
The output of the source is as follows:
babyName | babyCount |
---|---|
Jay | 1123 |
Nicolas | 764 |
Oscar | 334 |
Implementation Tips
- What authorization roles are required by this plugin?
  - An application default credential is required (see Google's documentation on Application Default Credentials for how to obtain one).
- I see a few additional config options on the query API. Are those configurable by the user?
  - For now, the user needs to configure the project ID, the path to the local private-key credential, the query string, and the time limit.
- Create a simple batch source inside the Hydrator plugins repository, with all needed dependencies.
- Add an endpoint to run queries against datasets in BigQuery (a sketch of the query interaction follows this list).
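The query endpoint boils down to submitting a job and iterating the result rows. Below is a minimal sketch with the Python BigQuery client; the plugin itself would be Java-based, so this is illustrative only, and legacy SQL is enabled to match the bracketed table syntax used in the examples.

```python
# Illustrative sketch of the query flow the proposed endpoint exposes.
from google.cloud import bigquery

client = bigquery.Client(project="vernal-project1")
query = (
    "SELECT name, count FROM [vernal-project1:babynames.names_2014] "
    "ORDER BY count DESC LIMIT 3"
)
# Legacy SQL matches the [project:dataset.table] syntax above.
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
for row in client.query(query, job_config=job_config).result():
    print(row["name"], row["count"])
```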
Design
Inputs | Type | Required | Default |
---|---|---|---|
ProjectId | String | Yes | |
Credentials | String | Yes | |
Query | String | Yes | |
Limit Time | Integer (minutes) | No | 10 |
Limit Size | Integer (GB) | No | 50 |
...