Introduction
A batch sink for writing to Google Cloud Storage in Avro format.
Use-case
This sink is used whenever you need to write to Google Cloud Storage in Avro format. For example, you might want to create daily snapshots of a database by reading the entire contents of a table and writing it to this sink, so that other programs can analyze the contents of the output files. The output of each run is stored in a user-specified directory inside a specified bucket in Google Cloud Storage.
Properties
referenceName: Uniquely identifies this sink for lineage, annotating metadata, etc.
projectID: Google Cloud project ID that has access to the specified bucket.
jsonKey: The path to the JSON certificate file of the service account used for GCS access.
path: The directory inside the bucket where the data is stored. Must be a new directory for each run.
bucketKey: The bucket inside Google Cloud Storage in which to store the data.
fileSystemProperties: A JSON string representing a map of properties needed for the distributed file system. The properties needed for GCS (projectID and jsonKey) will be included as 'fs.gs.project.id' and 'google.cloud.auth.service.account.json.keyfile'.
schema: The Avro schema of the record being written to the sink, as a JSON object.
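Since fileSystemProperties is a JSON map, the projectID and jsonKey values ultimately have to appear under the Hadoop property names listed above. A minimal sketch of building that JSON string (the helper name is hypothetical, not part of the plugin):

```python
import json

def build_fs_properties(project_id, json_key_path):
    # Map the plugin's projectID and jsonKey settings onto the Hadoop
    # property names the GCS connector expects (hypothetical helper).
    props = {
        "fs.gs.project.id": project_id,
        "google.cloud.auth.service.account.json.keyfile": json_key_path,
    }
    return json.dumps(props)

print(build_fs_properties("my-project", "/path/to/key.json"))
```

The resulting string can be supplied directly as the fileSystemProperties value.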
Example
This example writes to a Google Cloud Storage output located at gs://bucket/directory. It writes data in Avro format using the given schema. Every time the pipeline runs, the user should specify a new directory name.
{
    "name": "GCSAvro",
    "plugin": {
        "name": "GCSAvro",
        "type": "batchsink",
        "label": "GCSAvro",
        "artifact": {
            "name": "core-plugins",
            "version": "1.4.0-SNAPSHOT",
            "scope": "SYSTEM"
        },
        "properties": {
            "schema": "{
                \"type\":\"record\",
                \"name\":\"etlSchemaBody\",
                \"fields\":[
                    {\"name\":\"ts\",\"type\":\"long\"},
                    {\"name\":\"body\",\"type\":\"string\"}]}",
            "bucketKey": "bucket",
            "path": "directory",
            "projectID": "projectid",
            "jsonKey": "path_to_jsonKeyFile",
            "referenceName": "name"
        }
    }
}
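Because the sink requires a fresh output directory on every run, one common approach is to derive the path value from the run time. A minimal sketch, assuming a timestamp-based naming scheme (the function name and scheme are hypothetical, not part of the plugin):

```python
from datetime import datetime, timezone

def snapshot_path(base="directory"):
    # Append a UTC timestamp so each pipeline run writes to a fresh
    # directory under the bucket (hypothetical naming scheme).
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
    return f"{base}/{stamp}"

print(snapshot_path())  # e.g. directory/2024-01-01-120000
```

The generated value would then be substituted for the path property before each run.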
Requirements
- User should provide the correct project ID which they have access to.
- User should provide the path to the JSON key file of the service account which has create permission on the bucket.
- User should specify a bucket inside Google Cloud Storage.