
Introduction

A batch sink for writing to Google Cloud Storage in Avro format.

 Use-case

This sink is used whenever you need to write to Google Cloud Storage in Avro format. For example, you might want to create daily snapshots of a database by reading the entire contents of a table and writing it to this sink, so that other programs can analyze the contents of the specified files. The output of each run is stored in a directory with a user-chosen name inside a specified bucket in Google Cloud Storage.

Properties

referenceName: This will be used to uniquely identify this sink for lineage, annotating metadata, etc.

projectID: The Google Cloud project ID that has access to the specified bucket.

jsonKey: The path to the JSON key file of the service account used for GCS access.

path: The directory inside the bucket where the data is stored. This must be a new directory.

bucketKey: The bucket inside Google Cloud Storage in which to store the data.

 

fileSystemProperties: A JSON string representing a map of properties needed for the distributed file system. The properties needed for GCS (the project ID and JSON key file) will be included as 'fs.gs.project.id' and 'google.cloud.auth.service.account.json.keyfile'.
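For example, the value of fileSystemProperties could be a JSON map such as the following; the project ID and key file path shown here are placeholders:

{
    "fs.gs.project.id": "projectid",
    "google.cloud.auth.service.account.json.keyfile": "path_to_jsonKeyFile"
}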

schema: The Avro schema of the record being written to the sink, as a JSON object.

Example

This example will write to a Google Cloud Storage output located at gs://bucket/directory. It will write data in Avro format using the given schema. Every time the pipeline runs, the user should specify a new directory name.

 

{
    "name": "GCSAvro",
    "plugin": {
        "name": "GCSAvro",
        "type": "batchsink",
        "label": "GCSAvro",
        "artifact": {
            "name": "core-plugins",
            "version": "1.4.0-SNAPSHOT",
            "scope": "SYSTEM"
        },
        "properties": {
            "schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"ts\",\"type\":\"long\"},{\"name\":\"body\",\"type\":\"string\"}]}",
            "bucketKey": "bucket",
            "path": "directory",
            "projectID": "projectid",
            "jsonKey": "path_to_jsonKeyFile",
            "referenceName": "name"
        }
    }
}



Requirements

  1. The user should provide the correct project ID to which they have access.
  2. The user should provide the path to the JSON key file of the service account, which must have create permission on the bucket (a sketch of such a key file follows this list).
  3. The user should specify a bucket inside Google Cloud Storage.
  4. The user should specify the time limit for the query.
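The JSON key file referenced above is a standard Google Cloud service-account key. As a rough sketch of its shape (all values here are placeholders and some fields are omitted):

{
    "type": "service_account",
    "project_id": "projectid",
    "private_key_id": "...",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "service-account-name@projectid.iam.gserviceaccount.com",
    "client_id": "..."
}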

Example

The following is a simple example showing how the BigQuery Source would work.

 

A dataset already exists in Google BigQuery:

project Id: vernal-seasdf-123456

dataset name: baby_names

name      count
Emma      100
Oscar     334
Peter     223
Jay       1123
Nicolas   764

 

The user pulls the schema of the dataset:

Inputs        Value
project Id    vernal-seasdf-123456
dataset name  baby_names

 

Output schema is as follows:

Schema  Type     Nullable  Description
name    String   No        names of babies born in 2014
count   Integer  No        the number of occurrences of the name
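For reference, this output schema can be written as a record schema in the same JSON form used by the GCS Avro example above (the record name etlSchemaBody is just the conventional default and is an assumption here):

{
    "type": "record",
    "name": "etlSchemaBody",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "count", "type": "int"}
    ]
}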

 

The user runs a query against the dataset in BigQuery and pulls the records:

Configuration is specified as follows:

  • project Id: vernal-seasdf-123456
  • query: SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3

 

Output is as follows:

name      count
Jay       1123
Nicolas   764
Oscar     334


Implementation Tips

  • What authorization roles are required by this plugin? 
    • An application default credential is required. Here is where to get such a credential.
  • I see a few additional config options on the query API. Are those configurable by the user?
    • For now, the user needs to configure the project Id, the path to the local private-key credential, the query string, and the time limit.
  • Create a simple batch source inside the Hydrator plugins project with all needed dependencies.
  • Add an endpoint to run queries against datasets in BigQuery.

 

Design

Inputs       Type           Required  Default
ProjectId    String         Yes
Credentials  String         Yes
Query        String         Yes
Limit Time   Integer (min)  No        10
Limit Size   Integer (GB)   No        50
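Putting the design inputs together, a plugin configuration for this source could look roughly like the following, in the same style as the GCS Avro sink example above. The plugin and property names used here (BigQuery, projectId, credentials, query, limitTime, limitSize) are assumptions for illustration only, since the final names are not fixed by this design:

{
    "name": "BigQuery",
    "plugin": {
        "name": "BigQuery",
        "type": "batchsource",
        "properties": {
            "projectId": "vernal-seasdf-123456",
            "credentials": "path_to_private_key",
            "query": "SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3",
            "limitTime": "10",
            "limitSize": "50"
        }
    }
}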