Introduction

A batch sink for writing to Google Cloud Storage in Avro format.

Use-case

This sink is used whenever you need to write to Google Cloud Storage in Avro format. For example, you might want to create daily snapshots of a database by reading the entire contents of a table and writing them to this sink, so that other programs can then analyze the contents of the output files. The output of the run is stored in a user-specified directory inside the specified bucket in Google Cloud Storage.

Properties

referenceName: This will be used to uniquely identify this sink for lineage, annotating metadata, etc.

projectID: The Google Cloud Project ID that has access to the specified bucket.

jsonKey: The JSON certificate file of the service account used for GCS access.

path: The directory inside the bucket where the data is stored. It must be a new directory.

bucketKey: The bucket inside Google Cloud Storage in which to store the data.

fileSystemProperties: A JSON string representing a map of properties needed for the distributed file system. The property names needed for GCS (projectID and jsonKeyFile) will be included as 'fs.gs.project.id' and 'google.cloud.auth.service.account.json.keyfile'.

schema: The Avro schema of the record being written to the sink, as a JSON object.
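
As a rough illustration of the fileSystemProperties value described above, the following Python sketch builds the JSON string by hand. The helper name `build_file_system_properties` and its `extra` parameter are hypothetical and exist only for this example; only the two property names are taken from the description. The sink itself is described as including these entries, so a sketch like this is mainly useful for seeing what the resulting map looks like.

```python
import json


def build_file_system_properties(project_id, json_key_file, extra=None):
    """Return a fileSystemProperties JSON string (hypothetical helper).

    Only the two GCS property names below come from the plugin description;
    everything else here is an assumption made for illustration.
    """
    props = dict(extra or {})
    # Property names as stated in the fileSystemProperties description.
    props["fs.gs.project.id"] = project_id
    props["google.cloud.auth.service.account.json.keyfile"] = json_key_file
    # The property is a JSON string representing a map, so serialize the dict.
    return json.dumps(props, sort_keys=True)


fs_props = build_file_system_properties("projectid", "path_to_jsonKeyFile")
print(fs_props)
```

Any additional file system settings could be passed through `extra`; the two GCS entries are always written last so they cannot be overridden by accident.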

Example

This example writes to a Google Cloud Storage output located at gs://bucket/directory. It writes data in Avro format using the given schema. Every time the pipeline runs, the user should specify a new directory name.

{
  "name": "GCSAvro",
  "plugin": {
    "name": "GCSAvro",
    "type": "batchsink",
    "label": "GCSAvro",
    "artifact": {
      "name": "core-plugins",
      "version": "1.4.0-SNAPSHOT",
      "scope": "SYSTEM"
    },
    "properties": {
      "schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"ts\",\"type\":\"long\"},{\"name\":\"body\",\"type\":\"string\"}]}",
      "Bucket_Key": "bucket",
      "path_to_store": "directory",
      "Project_Id": "projectid",
      "Json_Key_File": "path_to_jsonKeyFile",
      "referenceName": "name"
    }
  }
}
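
Because each run needs a new directory name and the schema is passed as an escaped JSON string, it can help to generate the properties map rather than edit it by hand. The Python sketch below is one possible way to do that. The helper `build_sink_properties` and the date-suffix naming scheme are assumptions made for this example, not part of the plugin; the property keys and placeholder values are copied from the configuration above.

```python
import json
from datetime import date


def build_sink_properties(bucket, base_dir, run_date, schema):
    """Assemble the 'properties' map for one pipeline run (hypothetical helper)."""
    # Assumption: a date suffix is enough to make the directory name new per run.
    directory = f"{base_dir}_{run_date.isoformat()}"
    return {
        # The schema is embedded as a JSON string, not as a nested object.
        "schema": json.dumps(schema, separators=(",", ":")),
        "Bucket_Key": bucket,
        "path_to_store": directory,
        "Project_Id": "projectid",
        "Json_Key_File": "path_to_jsonKeyFile",
        "referenceName": "name",
    }


# Same record schema as in the example configuration.
schema = {
    "type": "record",
    "name": "etlSchemaBody",
    "fields": [
        {"name": "ts", "type": "long"},
        {"name": "body", "type": "string"},
    ],
}

props = build_sink_properties("bucket", "directory", date(2016, 5, 1), schema)
print(json.dumps({"properties": props}, indent=2))
```

Letting `json.dumps` handle the quoting avoids hand-escaping the schema, which is easy to get wrong when the schema spans several lines.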

...