Introduction
A batch sink for writing to Google Cloud Storage in Avro format.
Use-case
This sink is used whenever you need to write to Google Cloud Storage in Avro format. For example, you might want to create daily snapshots of a database by reading the entire contents of a table, writing to this sink, and then having other programs analyze the contents of the specified file. The output of each run is stored in a directory with a user-specified name inside a given bucket in Google Cloud Storage.
Properties
- referenceName: This will be used to uniquely identify this sink for lineage, annotating metadata, etc.
- projectID: The Google Cloud project ID that has access to the specified bucket.
- jsonKey: The JSON certificate file of the service account used for GCS access.
- path: The directory inside the bucket where the data is stored. It must be a new directory.
- bucketKey: The bucket inside Google Cloud Storage in which to store the data.
- fileSystemProperties: A JSON string representing a map of properties needed for the distributed file system. The property names needed for GCS (projectID and jsonKeyFile) will be included as 'fs.gs.project.id' and 'google.cloud.auth.service.account.json.keyfile'.
- schema: The Avro schema of the record being written to the sink, as a JSON object; an illustrative example follows.
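For illustration only, a value for the schema property might be an Avro record schema such as the following (the record and field names here are hypothetical, not prescribed by the plugin):

```json
{
  "type": "record",
  "name": "etlSchemaBody",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "body", "type": ["null", "string"] }
  ]
}
```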
Example
This example will write to a Google Cloud Storage output located at gs://bucket/directory. It will write data in Avro format using the given schema. Each time the pipeline runs, the user should specify a new directory name.
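A minimal sketch of how this sink might be configured in a pipeline, assuming a plugin named GCSAvro and the property names listed above; the project ID, key file path, reference name, and schema are placeholders, not values from this document:

```json
{
  "name": "GCSAvro",
  "type": "batchsink",
  "properties": {
    "referenceName": "DailySnapshotSink",
    "projectID": "my-project-id",
    "jsonKey": "/path/to/service-account.json",
    "bucketKey": "bucket",
    "path": "directory",
    "schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"body\",\"type\":[\"null\",\"string\"]}]}"
  }
}
```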
Requirements
- User should provide the correct project ID that they have access to.
- User should provide the path to the JSON key file of the service account that has create permission on the bucket.
- User should specify a bucket inside Google Cloud Storage.
- User should specify the time limit for the query.
Example
The following is a simple example showing how the BigQuery Source would work.
A dataset already exists in Google BigQuery:
project Id: vernal-seasdf-123456
dataset name: baby_names
name | count |
---|---|
Emma | 100 |
Oscar | 334 |
Peter | 223 |
Jay | 1123 |
Nicolas | 764 |
User pulls the schema of the dataset:
Inputs | Value |
---|---|
project Id | vernal-seasdf-123456 |
dataset name | baby_names |
Output schema is as follows:
Field name | Type | Nullable | Description |
---|---|---|---|
name | String | No | names of babies born in 2014 |
count | Integer | No | the number of occurrences of the name |
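Expressed as the JSON schema object a plugin would carry, this output schema might look like the following (a sketch only; the record name is arbitrary, and count is shown here as long since BigQuery integers are 64-bit):

```json
{
  "type": "record",
  "name": "babyNames",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "count", "type": "long" }
  ]
}
```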
User runs a query against the dataset in BigQuery and pulls the records:
Configuration is specified as follows:

Inputs | Value |
---|---|
project Id | vernal-seasdf-123456 |
query | SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3 |
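As a rough sketch, the same configuration expressed as plugin properties might look like this, assuming the input names from the Design section below; the plugin name, property keys, and credential path are assumptions for illustration:

```json
{
  "name": "BigQuery",
  "type": "batchsource",
  "properties": {
    "projectId": "vernal-seasdf-123456",
    "credentials": "/path/to/service-account-key.json",
    "query": "SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3",
    "limitTime": "10",
    "limitSize": "50"
  }
}
```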
Output is as follows:
name | count |
---|---|
Jay | 1123 |
Nicolas | 764 |
Oscar | 334 |
Implementation Tips
- What authorization roles are required by this plugin?
- An application default credential is required; see the Google Cloud documentation on application default credentials for how to obtain one.
- I see a few additional config options on the query API. Are those configurable by the user?
- For now, the user needs to configure the project ID, the path to the local private key credential, the query string, and the time limit.
- Create a simple batch source inside the Hydrator plugins project with all needed dependencies.
- Add an endpoint to run queries against datasets in BigQuery.
Design
Inputs | Type | Required | Default |
---|---|---|---|
ProjectId | String | Yes | |
Credentials | String | Yes | |
Query | String | Yes | |
Limit Time | Integer (min) | No | 10 |
Limit Size | Integer (GB) | No | 50 |