Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Now CDAP provides the interface for users to handle their datasets in BigQuery. 

 


Use-case

Users want to integrate CDAP with their already stored dataset in Google BigQuery. 

...

As a user, I would like to run arbitrary queries synchronously against my datasets in BigQuery and pull those records in BigQuery and pull those records in a hydrator pipeline.

...

  1. User should provide the correct project id which he has access to. 
  2.   

1. User should specify the limit time for the querying. 

  2. User is able to specify the limit size of the dataset to query. 

  3. The schema is automatically pulled from the table. 

  4. User can pull the field names from the query.

 

Example

  1. User should provide the SQL query against a dataset inside his project.
  2. User should provide the Json Key File of the project.
  3. User should provide the temporary google cloud storage bucket to store the intermediate query result. 
  4. User should provide the fully qualified input table in BigQuery.  If import query is not provided, this table will be downloaded to 
    temporary Google Cloud Storage directory and finally to the CDAP source. If import query is given, this table will be a intermediate 
    storage to cache the query result and then downloaded to the temporary directory in Google Cloud Storage before going to CDAP source.
  5. The temporary Google Cloud Storage directory should be deleted manually to avoid unnecessary charge from Google.

Example

Following are two simple examples showing how BigQuery Source would work.

...

A dataset already exist in Google BigQuery:121

project Id: vernal-seasdf-123456project1

dataset name: baby_names

table name: names_2014

namecount
Emma100
Oscar334
Peter223
Jay1123
Nicolas764

 

...

example1:

...

Code Block

...

 

Output schema is as follows:

SchemaTypeRequiredDescription
nameStringYesnames of baby born in 2014
countIntegerYesthe number of occurrences of the name

 

User run query agains dataset in BigQuery and pull the records:

Configuration is specified as follows

      ♦ project Id

         ♦ vernal-seasdf-123456

     ♦ query

        ♦  SELECT name, count FROM baby_names ORDER BY count DESC LIMIT 3

 

Out put is as follows

...

language

...

xml
{
        "name": "BigQuery",
          "properties": {
            "referenceName": "bigquery",
            "projectId": "vernal-project1",
            "tempBuketPath": "gs://bucketName.datasetName/tableName",
            "jsonFilePath": "/path/to/jsonkeyfile",
            "InputTableId": "vernal-project1:babynames.names_2014",
            "outputSchema": "name:string,count:int"
          }
        }
      }

 

This source will read the vernal-project1:babynames.names_2014 table, download the whole table to gs://bucketName.datasetName/tableName, and then get the data from there.

 

example2:

Code Block
languagexml
{
        "name": "BigQuery",
          "properties": {
            "referenceName": "bigquery",
            "projectId": "vernal-project1",
            "tempBuketPath": "gs://bucketName.datasetName/tableName",
            "jsonFilePath": "/path/to/jsonkeyfile",
            "importQuery":"SELECT name as babyName, count as nameCount FROM [vernal-project1:babynames.names_2014] ORDER BY count DESC LIMIT 3",
            "InputTableId": "vernal-project1:babynames.blankTable",
            "outputSchema": "babyName:string,babyCount:int"
          }
        }
      }

 

Before running this source, user should create a blank table with schema : {babyname:string, babyCount:int}. File in r example, make this blank table in vernal-project1:babynames.

and the output of the source is as follows:

babyNamebabyCount
Jay1123
Nicolas764
Oscar334

 

Design

CDAP provides two type of operations on the dataset stored in BigQuery: Query and Poll Results. 

Users can use Query operation to do SQL query on specified dataset in BigQuery.

For Poll Results, user can fetch the result using specified job ID or fetch the a specified number of latest query results. 

Query:

 

Inputstyperequireddefault
ProjectIdStringYes

 

CredencialStringYes 
QueryStringYes 
Limit TimeInteger (min)No10
Limit SizeInteger (GB)No50

...