Introduction

Google provides BigQuery for querying massive datasets. It runs fast SQL queries against append-only tables using the processing power of Google's infrastructure. Users can move their data into BigQuery and let it handle the hard work.

CDAP now provides an interface for users to work with their datasets in BigQuery.


Use-case

Users want to integrate CDAP with datasets they already store in Google BigQuery.

 

User Stories

As a user, I would like to run arbitrary queries synchronously against my datasets in BigQuery and pull the resulting records into a hydrator pipeline.

 

Requirements

  1. The user should provide the ID of a project they have access to.
  2. The user should provide a SQL query against a dataset inside that project.
  3. The user should specify a time limit for the query.

Example

Following is a simple example showing how BigQuery Source would work.

 

A dataset already exists in Google BigQuery:

project Id: vernal-seasdf-123456

dataset name: baby_names

name       count
Emma       100
Oscar      334
Peter      223
Jay        1123
Nicolas    764

 

The user pulls the schema of the dataset:

Inputs          Value
project Id      vernal-seasdf-123456
dataset name    baby_names

 

Output schema is as follows:

Schema    Type       Nullable    Description
name      String     Yes         names of babies born in 2014
count     Integer    Yes         the number of occurrences of the name
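
As a rough illustration of how the output schema above could be derived, the sketch below converts BigQuery-style field descriptors (dictionaries with `name`, `type`, `mode` and `description` keys) into schema rows. The `OutputField` class, the `to_output_schema` helper and the two-entry type mapping are assumptions made for this example only; they are not part of the plugin.

```python
from dataclasses import dataclass

# Hypothetical type mapping, limited to the two types used in the example.
BQ_TO_PIPELINE_TYPE = {"STRING": "String", "INTEGER": "Integer"}


@dataclass(frozen=True)
class OutputField:
    """One row of the output schema table (illustrative, not a plugin class)."""
    name: str
    type: str
    nullable: bool
    description: str


def to_output_schema(bq_fields):
    """Convert BigQuery-style field descriptors into output schema rows."""
    schema = []
    for field in bq_fields:
        bq_type = field["type"].upper()
        if bq_type not in BQ_TO_PIPELINE_TYPE:
            raise ValueError(f"unsupported BigQuery type: {bq_type}")
        schema.append(OutputField(
            name=field["name"],
            type=BQ_TO_PIPELINE_TYPE[bq_type],
            # Treat a field as nullable only when its mode is NULLABLE,
            # which is also the mode assumed when none is given.
            nullable=field.get("mode", "NULLABLE").upper() == "NULLABLE",
            description=field.get("description", ""),
        ))
    return schema


# Field descriptors matching the baby_names example.
baby_names_fields = [
    {"name": "name", "type": "STRING", "mode": "NULLABLE",
     "description": "names of babies born in 2014"},
    {"name": "count", "type": "INTEGER", "mode": "NULLABLE",
     "description": "the number of occurrences of the name"},
]

schema = to_output_schema(baby_names_fields)
for f in schema:
    print(f.name, f.type, "Yes" if f.nullable else "No", f.description)
```

A real implementation would need to cover the remaining BigQuery types and decide how to handle repeated and nested fields; this sketch rejects anything outside the two types shown.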

 

The user runs a query against the dataset in BigQuery and pulls the records:

The configuration is specified as follows:

  ♦ project Id
      ♦ vernal-seasdf-123456
  ♦ query
      ♦ SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3

 

The output is as follows:

name       count
Jay        1123
Nicolas    764
Oscar      334
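
To check the expected result of the example query without access to BigQuery, the sketch below loads the sample rows into an in-memory SQLite database and runs the same SQL. SQLite is only a local stand-in chosen for this illustration; the plugin itself would submit the query to BigQuery, and the `run_example_query` helper is a hypothetical name.

```python
import sqlite3

# Sample rows from the baby_names example.
SAMPLE_ROWS = [
    ("Emma", 100),
    ("Oscar", 334),
    ("Peter", 223),
    ("Jay", 1123),
    ("Nicolas", 764),
]

QUERY = "SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3"


def run_example_query(sample_rows, sql):
    """Run sql against an in-memory table and return records as dicts."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE baby_names (name TEXT, count INTEGER)")
        conn.executemany("INSERT INTO baby_names VALUES (?, ?)", sample_rows)
        cursor = conn.execute(sql)
        # Column names come from the cursor so each record is self-describing,
        # similar to the records a source would emit into a pipeline.
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        conn.close()


records = run_example_query(SAMPLE_ROWS, QUERY)
for record in records:
    print(record["name"], record["count"])
# prints:
# Jay 1123
# Nicolas 764
# Oscar 334
```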


Implementation Tips

  • What authorization roles are required by this plugin? 
  • I see a few additional config options on the query API. Are those configurable by the user?

 

 

Design

 

 

Inputs         Type             Required    Default
ProjectId      String           Yes
Credentials    String           Yes
Query          String           Yes
Limit Time     Integer (min)    No          10
Limit Size     Integer (GB)     No          50
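
The inputs in the table above could be modelled as a small configuration object with the listed defaults. This is a minimal sketch, not the plugin's actual config class: the name `BigQuerySourceConfig`, the snake_case field names and the validation rules are assumptions, and the credentials value used below is a placeholder, since the page does not say whether it is a file path or key content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BigQuerySourceConfig:
    """Hypothetical container for the plugin inputs listed in the table."""
    project_id: str
    credentials: str          # format not specified on this page
    query: str
    limit_time_min: int = 10  # Limit Time default, in minutes
    limit_size_gb: int = 50   # Limit Size default, in GB

    def validate(self):
        """Raise ValueError if a required input is empty or a limit is not positive."""
        for field_name in ("project_id", "credentials", "query"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"{field_name} is required")
        if self.limit_time_min <= 0:
            raise ValueError("limit_time_min must be positive")
        if self.limit_size_gb <= 0:
            raise ValueError("limit_size_gb must be positive")
        return self


config = BigQuerySourceConfig(
    project_id="vernal-seasdf-123456",
    credentials="<credentials placeholder>",
    query="SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3",
).validate()

print(config.limit_time_min, config.limit_size_gb)  # prints: 10 50
```

Validating at configuration time lets a pipeline fail before any query is submitted, which is one reasonable place to enforce the required inputs.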

 



 

 

 
