Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Introduction

Google Sheets plugins will allow users to import data from Google Sheets into their pipeline, so that they can transform and enrich with other data sources.

User Storie(s)

  • As a pipeline developer, I want to import data from Google Sheets, so that I can transform and enrich it using CDAP
  • As a pipeline developer, I want to move all sheets from a given Google drive directory to a destination
  • As a pipeline developer, I want to be able to pick certain sheets from a particular sheet to process using CDAP, so that I do not have to process all sheets all the time. I want to be able to specify the sheet using the sheet name or number.
  • As a pipeline developer, I want to treat the first row of a sheet as a header, so that CDAP can automatically treat it as schema
  • As a pipeline developer, I want to be able to specify a section at the top of my Sheet as a header, so that it is extracted as metadata, and not as actual data
  • As a pipeline developer, I want to be able to specify a section at the bottom of my sheet as a footer, so that it is extracted as metadata, and not as actual data

Plugin Type

  •  Batch Source
  •  Batch Sink 
  •  Real-time Source
  •  Real-time Sink
  •  Action
  •  Post-Run Action
  •  Aggregate
  •  Join
  •  Spark Model
  •  Spark Compute

Configurables

Source

This section defines properties that are configurable for this plugin. 

SectionUser Facing NameTypeDescriptionOptionalConstraints
Basic
App IdstringNoAccess tokenstring
No
Directory IdstringDirectory ID is the last part of the URL, such as https://drive.google.com/drive/folders/0B2kqcwp2ycGZanhSR3JmREw5VTVNo
FilteringFilterString
A filter

Filter that can be applied to the files in the selected directory. Filters follow

the 

the Google Drive

Filter Syntax

filters syntax.

Yes
Modification date range
StringIn addition to the filter specified above, also filter files to only pull those that were modified between the date range. Defaults to last year.YesSheets to pullmulti-selectSelect from a list of sheets to pull. Defaults to all.YesAdvancedColumn Names Selection
Radio buttonsChoose between "No column names", "Treat first row as column names", "Custom row as column names" Defaults to "Treat first row as column names".YesCustom row for column namesNumberOnly shown when the header selection is set to "Custom row as column names". Accepts the row number of the row to be treated as a column names. Defaults to 0.YesExtract metadataToggleDetermines if a certain section of the sheet should be treated as metadata. Useful when you want
Select

Filter that narrows set of files by modified date range. User can select either among predefined or custom entered ranges. For Custom selection the dates range can be specified via Start date and End date.

Yes
Start dateString

Start date for custom modification date range. Is shown only when 'Custom' range is selected for 'Modification date range' field. RFC3339 (https://tools.ietf.org/html/rfc3339) format, default timezone is UTC, e.g., 2012-06-04T12:00:00-08:00.



End dateString

End date for custom modification date range. Is shown only when 'Custom' range is selected for 'Modification date range' field. RFC3339 (https://tools.ietf.org/html/rfc3339) format, default timezone is UTC, e.g., 2012-06-04T12:00:00-08:00.



Sheets to pullSelect

Filter for specifying set of sheets to process. For numbers or titles selections user can populate specific values in Sheets identifiers field. Default is all value.


all
Sheets identifiersCSV

Set of sheets' numbers/titles to process. Is shown only when titles or numbers are selected for Sheets to pull field.



Authentication



Authentication typeRadio-group

Defines the authentication type. OAuth2 and Service account types are available.

No
Client IDStringOAuth2 client id. Is shown only when 'OAuth2' auth type is selected for 'Authentication type' property.Yes
Client secretStringOAuth2 client secret. Is shown only when 'OAuth2' auth type is selected for 'Authentication type' property.Yes
Refresh tokenStringOAuth2 refresh token. Is shown only when 'OAuth2' auth type is selected for 'Authentication type' property.Yes
Account file path
String

Path on the local file system of the user/service account key used for authorization. Is shown only when 'Service account' auth type is selected for 'Authentication type' property.
Can be set to 'auto-detect' when running on a Dataproc cluster, then plugin uses value of environment variable "GOOGLE_APPLICATION_CREDENTIALS". 
When running on other clusters, the file must be present on every node in the cluster.
Service account json can be generated on Google Cloud Service Account page

Yesauto-detect
RetryingMax Retry CountNumberMaximum number of retry attempts.Yes10
Max Retry WaitNumberMaximum wait time for attempt in seconds.Yes500
Max Retry Jitter WaitNumberMaximum additional wait time is milliseconds.Yes100
Metadata extractionExtract metadataToggle

Field to enable metadata extraction. Metadata extraction is useful when user wants to specify a header or a footer for a sheet. The rows in headers and footers are not available as data records. Instead, they are available in every record as a field called

"

'metadata', "

,

which is a record of the specified metadata.

No
Metadata record nameStringName of the record with metadata content. It is needed to distinguish metadata record from possible column with the same name.Yesmetadata
First header rowNumber
The row
Row number of the first row to be treated as header
. Defaults to 0
.Yes
Last header rowNumber
The row
Row number of the last row to be treated as header.Yes
First footer rowNumber
The row
Row number of the first row to be treated as footer.Yes
Last footer rowNumber
The row
Row number of the last row to be treated as footer.Yes
Metadata
key
cellsKeyValue
Specifies

Set of the cells

and the corresponding key names for the keys

for key-value pairs to extract as metadata from the specified metadata sections. Only shown if

Capture Metadata

Extract metadata is set to true. The cell numbers should be within the header or footer.

E.g. A1 →

department

B5,

B5

A5

category

C4

Yes
Metadata value cellsKeyValue

Specifies the cells and the corresponding key names of the values to extract as metadata from the specified metadata sections. Only shown if Capture Metadata is set to true. The cell numbers should be within the header or footer.

E.g. department → A2, category → C5

YesCustom footer last rowNumberOnly shown when the footer selection is set to Custom footer. Accepts the row number of the last row to be treated as a header.Yes

AdvancedNumeric formattingRadio buttons

Output format for numeric sheet cells. In Formatted values case the value will contain appropriate format of source cell e.g. '1.23$', '123%'. For Values only only number value will be returned.



Skip empty dataToggle


Column Names Selection
Radio buttonsSource for column names. Choose between "No column names", "Treat first row as column names", "Custom row as column names" Defaults to "Treat first row as column names".Yes
Custom row for column namesNumber

Row number of the row to be treated as a header. Only shown when the Column Names Selection field is set to Custom row as column names header.

Yes1
Last data column
NumberLast column number of the maximal field of plugin work for data.Yes
Last data rowNumberLast row number of the maximal field of plugin work for data.

Note: The data in the specified header and footer rows should not be available as records to the rest of the pipeline. 

Sink

User Facing NameTypeDescriptionOptionalConstraints
App Idstring
No
Access tokenstring
No
Directory IdstringDirectory ID is the last part of the URL, such as https://drive.google.com/drive/folders/0B2kqcwp2ycGZanhSR3JmREw5VTVNo
Sheet namestringName of the sheet. Defaults to Sheet 1Yes
Write first row as headersToggleIf true, the schema is written as the first row of the sheet. Defaults to True.Yes
Format for nested dataselectChoose amongst JSON, CSV. Format to serialize complex (nested) data as. Defaults to JSON.Yes

Note: Incoming records should be written to columns in the sheet

Design / Implementation Tips

  • Tip #1
  • Tip #2

Design

Approach(s)

Properties

Security

Limitation(s)

Future Work

  • Some future work – HYDRATOR-99999
  • Another future work – HYDRATOR-99999

Test Case(s)

  • Test case #1
  • Test case #2

Sample Pipeline

Please attach one or more sample pipeline(s) and associated data. 

Pipeline #1

Pipeline #2



Table of Contents

Table of Contents
stylecircle

Checklist

  •  User stories documented 
  •  User stories reviewed 
  •  Design documented 
  •  Design reviewed 
  •  Feature merged 
  •  Examples and guides 
  •  Integration tests 
  •  Documentation for feature 
  •  Short video demonstrating the feature