Introduction
A batch sink for writing to Google Cloud Storage in Avro format.
Use-case
This sink is used whenever you need to write to Google Cloud Storage in Avro format. For example, you might want to create daily snapshots of a database by reading the entire contents of a table and writing them to this sink, so that other programs can then analyze the contents of the specified file. The output of the run is stored in a user-specified directory inside a specified bucket in Google Cloud Storage.
Properties
As a user, I would like to run arbitrary queries synchronously against my datasets in BigQuery and pull those records into a Hydrator pipeline.
Requirements
- The user should provide the ID of a project they have access to.
- The user should provide the SQL query to run against a dataset inside their project.
- The user should specify a time limit for the query.
Example
The following simple example shows how the BigQuery Source would work.
A dataset already exists in Google BigQuery:
project Id: vernal-seasdf-123456
dataset name: baby_names
name | count |
---|---|
Emma | 100 |
Oscar | 334 |
Peter | 223 |
Jay | 1123 |
Nicolas | 764 |
The user pulls the schema of the dataset:
Inputs | Value |
---|---|
project Id | vernal-seasdf-123456 |
dataset name | baby_names |
Output schema is as follows:
Schema | Type | Nullable | Description |
---|---|---|---|
name | String | No | names of babies born in 2014 |
count | Integer | No | the number of occurrences of the name |
The user runs a query against the dataset in BigQuery and pulls the records:
The configuration is specified as follows:
♦ project Id: vernal-seasdf-123456
♦ query: SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3
The output is as follows:
name | count |
---|---|
Jay | 1123 |
Nicolas | 764 |
Oscar | 334 |
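The expected result above can be checked locally without BigQuery access. The following is a minimal sketch that loads the sample rows into an in-memory SQLite database and runs the same query text. SQLite is only a stand-in here, not what the plugin would use, and the function name `run_example_query` is hypothetical.

```python
import sqlite3

# Sample rows copied from the baby_names table in the example.
ROWS = [
    ("Emma", 100),
    ("Oscar", 334),
    ("Peter", 223),
    ("Jay", 1123),
    ("Nicolas", 764),
]

# Query text taken verbatim from the example configuration.
QUERY = "SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3"


def run_example_query():
    """Run the example query against a local SQLite stand-in (not BigQuery)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE baby_names (name TEXT NOT NULL, count INTEGER NOT NULL)"
        )
        conn.executemany("INSERT INTO baby_names VALUES (?, ?)", ROWS)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for name, count in run_example_query():
        print(f"{name} | {count} |")
```

The three rows returned match the output table: Jay, Nicolas, and Oscar, ordered by descending count.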
Implementation Tips
- What authorization roles are required by this plugin?
- An application default credential is required. Here is where to get such a credential.
- I see a few additional config options on the query API. Are those configurable by the user?
- Currently, the user needs to configure the project ID, the credential path to the local private key, the query string, and the time limit.
- Create a simple batch source inside the Hydrator plugin with all the dependencies it needs.
- Add an endpoint to run queries against datasets in BigQuery.
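One way to honor the user-specified time limit on a synchronous query is to run the blocking call in a worker thread and stop waiting once the limit passes. The sketch below is an illustration in Python, not the plugin's implementation: `run_with_time_limit` and the `run_query` callable are hypothetical names, and `run_query` stands in for whatever client call actually submits the query to BigQuery.

```python
import concurrent.futures


def run_with_time_limit(run_query, query, limit_minutes=10):
    """Call run_query(query) and give up waiting after limit_minutes.

    run_query is a hypothetical blocking callable supplied by the caller.
    Note that the worker thread is not interrupted on timeout; a real
    implementation would presumably also cancel the remote query job.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(run_query, query)
    try:
        return future.result(timeout=limit_minutes * 60)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(
            f"query exceeded the {limit_minutes}-minute time limit"
        ) from None
    finally:
        # Do not block on the still-running worker after a timeout.
        executor.shutdown(wait=False)
```

The default of 10 minutes mirrors the default given for Limit Time in the Design table.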
Design
Inputs | Type | Required | Default |
---|---|---|---|
ProjectId | String | Yes | |
Credentials | String | Yes | |
Query | String | Yes | |
Limit Time | Integer (min) | No | 10 |
Limit Size | Integer (GB) | No | 50 |
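The inputs in the table could be modeled as a small configuration object that checks the required fields and applies the defaults. This is a sketch in Python under stated assumptions: the class name `BigQuerySourceConfig`, the field names, and the validation rules (non-empty strings, positive limits) are all hypothetical, and only the required flags and default values come from the table.

```python
from dataclasses import dataclass


@dataclass
class BigQuerySourceConfig:
    # Required inputs (no defaults in the Design table).
    project_id: str
    credentials: str  # assumed to be the path to the local private key
    query: str
    # Optional inputs with the defaults from the Design table.
    limit_time_min: int = 10  # Limit Time, in minutes
    limit_size_gb: int = 50   # Limit Size, in GB

    def validate(self):
        """Raise ValueError if a required input is blank or a limit is not positive."""
        for field_name in ("project_id", "credentials", "query"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"{field_name} is required")
        if self.limit_time_min <= 0:
            raise ValueError("limit_time_min must be positive")
        if self.limit_size_gb <= 0:
            raise ValueError("limit_size_gb must be positive")
        return self
```

For example, `BigQuerySourceConfig("vernal-seasdf-123456", "/path/to/key.json", "SELECT name, count FROM baby_names").validate()` would pass with a 10-minute time limit and a 50 GB size limit; the key path here is a placeholder.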