
Introduction

A batch source for running queries against datasets in Google BigQuery and pulling the resulting records into a Hydrator pipeline.

Use-case

This source is used whenever you need to pull records from Google BigQuery into a Hydrator pipeline. For example, you might want to create daily snapshots of a BigQuery dataset by running a query against it through this source, so that other programs can analyze the results.

 

User Stories

As a user, I would like to run arbitrary queries synchronously against my datasets in BigQuery and pull those records into a Hydrator pipeline.

 

Requirements

  1. The user should provide the ID of a project they have access to.
  2. The user should provide a SQL query to run against a dataset inside that project.
  3. The user should specify a time limit for the query.
  4. The user should specify a size limit for the query results.

Example

The following is a simple example showing how the BigQuery source would work.

 

A dataset already exists in Google BigQuery:

project ID: vernal-seasdf-123456

dataset name: baby_names

name     count
Emma     100
Oscar    334
Peter    223
Jay      1123
Nicolas  764

 

The user pulls the schema of the dataset:

Input         Value
project ID    vernal-seasdf-123456
dataset name  baby_names
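
For reference, a minimal sketch of how such a schema pull might be done with the google-cloud-bigquery Java client; the table name names_2014 is a hypothetical stand-in, since the example only names the dataset:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.Field;
    import com.google.cloud.bigquery.Schema;
    import com.google.cloud.bigquery.Table;
    import com.google.cloud.bigquery.TableId;

    public class PullSchemaExample {
      public static void main(String[] args) {
        // Connect with application default credentials, scoped to the example project.
        BigQuery bigquery = BigQueryOptions.newBuilder()
            .setProjectId("vernal-seasdf-123456")
            .build()
            .getService();
        // "names_2014" is a hypothetical table inside the baby_names dataset.
        Table table = bigquery.getTable(TableId.of("baby_names", "names_2014"));
        Schema schema = table.getDefinition().getSchema();
        // Print each field name and type, mirroring the output schema table below.
        for (Field field : schema.getFields()) {
          System.out.printf("%s: %s%n", field.getName(), field.getType());
        }
      }
    }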

 

The output schema is as follows:

Field  Type     Nullable  Description
name   String   No        name of a baby born in 2014
count  Integer  No        the number of occurrences of the name
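
In CDAP terms, this output schema could be constructed as follows; a minimal sketch assuming the co.cask.cdap Schema API used by Hydrator plugins, with the record name chosen here rather than given by the example:

    import co.cask.cdap.api.data.schema.Schema;

    public class OutputSchemaExample {
      // Field names and types mirror the schema table above; "babyNameRecord" is an assumption.
      static final Schema OUTPUT_SCHEMA = Schema.recordOf(
          "babyNameRecord",
          Schema.Field.of("name", Schema.of(Schema.Type.STRING)),
          Schema.Field.of("count", Schema.of(Schema.Type.INT)));
    }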

 

The user runs a query against the dataset in BigQuery and pulls the records:

The configuration is specified as follows:

  • project ID: vernal-seasdf-123456
  • query: SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3

 

The output is as follows:

name     count
Jay      1123
Nicolas  764
Oscar    334


Implementation Tips

  • What authorization roles are required by this plugin?
    • An application default credential is required; see the Google Cloud Platform documentation for how to obtain one.
  • I see a few additional config options on the query API. Are those configurable by the user?
    • For now, the user needs to configure only the project ID, the path to the local private-key credential file, the query string, and the time limit.
  • Create a simple batch source inside the Hydrator plugins project, with all needed dependencies.
  • Add an endpoint to run queries against datasets in BigQuery, as sketched below.
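
A minimal sketch of running the example query synchronously, assuming application default credentials and the google-cloud-bigquery Java client; the query string is taken from the example above:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FieldValueList;
    import com.google.cloud.bigquery.QueryJobConfiguration;
    import com.google.cloud.bigquery.TableResult;

    public class RunQueryExample {
      public static void main(String[] args) throws Exception {
        // Picks up the application default credential mentioned in the tips above.
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        QueryJobConfiguration queryConfig = QueryJobConfiguration
            .newBuilder("SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3")
            .build();
        // Runs the query synchronously and iterates over the returned records.
        TableResult result = bigquery.query(queryConfig);
        for (FieldValueList row : result.iterateAll()) {
          System.out.printf("%s %d%n",
              row.get("name").getStringValue(),
              row.get("count").getLongValue());
        }
      }
    }

Run against the example dataset, this would print the three rows shown in the output table above.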

 

Design

Input        Type           Required  Default
Project ID   String         Yes
Credentials  String         Yes
Query        String         Yes
Limit Time   Integer (min)  No        10
Limit Size   Integer (GB)   No        50
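
These inputs might map onto a Hydrator plugin config class along the following lines; a minimal sketch assuming the co.cask.cdap PluginConfig API, with field names chosen here rather than confirmed above:

    import co.cask.cdap.api.annotation.Description;
    import co.cask.cdap.api.plugin.PluginConfig;
    import javax.annotation.Nullable;

    public class BigQuerySourceConfig extends PluginConfig {

      @Description("The ID of the Google Cloud project to query, e.g. vernal-seasdf-123456.")
      private String projectId;

      @Description("Path to the local private-key credential file.")
      private String credentials;

      @Description("The SQL query to run against the dataset.")
      private String query;

      @Nullable
      @Description("Time limit for the query, in minutes. Defaults to 10.")
      private Integer limitTime;

      @Nullable
      @Description("Size limit for the query results, in GB. Defaults to 50.")
      private Integer limitSize;
    }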