Introduction

The Google Drive plugins let users move entire files from a source to a destination. Along the way, users can also run transformations on unstructured data such as images, audio, and video.

User Stories

  • As a pipeline developer, I want to move all files from a Google Drive directory to a different destination.
  • As a pipeline developer, I want to move all files from a Google Drive directory that satisfy a filter to a different destination.
  • As a pipeline developer, I want to pull all images from a Google Drive directory, so that I can process them using image recognition APIs.
  • As a pipeline developer, I want to pull all audio and video files from a Google Drive directory, so that I can process them to extract metadata, generate transcripts, or apply other enrichments.
  • As a pipeline developer, I want to move all files from an FTP source into Google Drive.

Plugin Type

  •  Batch Source
  •  Batch Sink 
  •  Real-time Source
  •  Real-time Sink
  •  Action
  •  Post-Run Action
  •  Aggregate
  •  Join
  •  Spark Model
  •  Spark Compute

Configurables

This section defines properties that are configurable for this plugin. 

Source

| Option level | User Facing Name | Type | Description | Optional | Constraints | Default value |
|---|---|---|---|---|---|---|
| Basic | Directory identifier | String | Identifier of the source folder. | No | | |
| | File metadata properties | Multi-select | Properties to retrieve for each file in the directory. Allowed names are listed in Google Drive API: Files. | Yes | | |
| Filtering | Filter | String | A filter that can be applied to the files in the selected directory. Filters follow the Google Drive Filter Syntax. | Yes | | |
| | Modification date range | Select | In addition to the filter specified above, only pull files that were modified within the date range. | Yes | | select |
| | Start date | Textbox | Only shown when "Modification date range" is set to "Custom". Start of the modification date range, in RFC 3339 format. The default time zone is UTC, e.g., 2012-06-04T12:00:00-08:00. | No | | |
| | End date | Textbox | Only shown when "Modification date range" is set to "Custom". End of the modification date range, in RFC 3339 format. The default time zone is UTC, e.g., 2012-06-04T12:00:00-08:00. | No | | |
| | File types to pull | Multi-select | Types of files that should be pulled from the specified directory. | Yes | | binary |
| Authentication | Authentication type | Radio-group | Defines the authentication type. OAuth2 and Service account types are available. | No | | OAuth2 |
| | Client ID | String | OAuth2 client ID. Shown only when 'OAuth2' is selected for the 'Authentication type' property. | Yes | | |
| | Client secret | String | OAuth2 client secret. Shown only when 'OAuth2' is selected for the 'Authentication type' property. | Yes | | |
| | Refresh token | String | OAuth2 refresh token. Shown only when 'OAuth2' is selected for the 'Authentication type' property. | Yes | | |
| | Account file path | String | Path on the local file system of the user/service account key used for authorization. Shown only when 'Service account' is selected for the 'Authentication type' property. Can be set to 'auto-detect' when running on a Dataproc cluster, in which case the plugin uses the value of the environment variable "GOOGLE_APPLICATION_CREDENTIALS". When running on other clusters, the file must be present on every node in the cluster. The service account JSON can be generated on the Google Cloud Service Account page. | Yes | | auto-detect |
| Advanced | Maximum partition size | Number | Maximum partition size in bytes. The default value 0 means unlimited. | Yes | | 0 |
| | Body output format | Radio-group | Format of the file body. "Bytes" and "String" values are available. | Yes | | bytes |
| Exporting | Google Documents export format | Select | MIME type for Google Documents. Allowed values are listed in Downloading Google Documents. | Yes | | text/plain |
| | Google Spreadsheets export format | Select | MIME type for Google Spreadsheets. | Yes | | text/csv |
| | Google Drawings export format | Select | MIME type for Google Drawings. | Yes | | image/svg+xml |
| | Google Presentations export format | Select | MIME type for Google Presentations. | Yes | | text/plain |
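
To illustrate how the Directory identifier, Filter, Start date and End date settings might fit together, here is a minimal sketch that combines them into a single Google Drive search string (the `q` parameter of a file listing request). The function names and the decision to join the clauses with `and` are assumptions made for illustration; this page does not state how the plugin actually builds its query.

```python
from datetime import datetime, timezone
from typing import Optional


def _rfc3339(moment: datetime) -> str:
    # Naive datetimes are treated as UTC, matching the documented default time zone.
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.isoformat(timespec="seconds")


def build_drive_query(directory_id: str,
                      user_filter: Optional[str] = None,
                      start: Optional[datetime] = None,
                      end: Optional[datetime] = None) -> str:
    """Combine folder, Filter and date-range settings into one search string (hypothetical helper)."""
    # Restrict the search to direct children of the source folder.
    clauses = [f"'{directory_id}' in parents"]
    if user_filter:
        # Parenthesise the user filter so its own and/or operators stay grouped.
        clauses.append(f"({user_filter})")
    if start is not None:
        clauses.append(f"modifiedTime >= '{_rfc3339(start)}'")
    if end is not None:
        clauses.append(f"modifiedTime <= '{_rfc3339(end)}'")
    return " and ".join(clauses)


if __name__ == "__main__":
    print(build_drive_query("folder-id", "name contains 'report'",
                            datetime(2012, 6, 4, 12, 0, 0)))
```

Because the clauses are simply joined, a Filter that itself constrains `modifiedTime` can contradict the date range and match no files.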

Sink

| Option level | User Facing Name | Type | Description | Optional | Constraints | Default value |
|---|---|---|---|---|---|---|
| Basic | File name field | String | Name of the schema field (should be of STRING type) that is used as the file name. If it is not set, files get randomly generated 16-symbol names. | Yes | | |
| | File body field | String | Name of the schema field (should be of BYTES type) that is used as the file body. The minimal input schema should contain only this field. | No | | |
| | File mime field | String | Name of the schema field (should be of STRING type) that is used as the MIME type of the file. All MIME types are supported except Google Drive types. If it is not set, the Google API tries to detect the file's MIME type automatically. | Yes | | |
| | Directory identifier | String | Identifier of the destination folder. | No | | |
| Authentication | Authentication type | Radio-group | Defines the authentication type. OAuth2 and Service account types are available. | No | | OAuth2 |
| | Client ID | String | OAuth2 client ID. Shown only when 'OAuth2' is selected for the 'Authentication type' property. | Yes | | |
| | Client secret | String | OAuth2 client secret. Shown only when 'OAuth2' is selected for the 'Authentication type' property. | Yes | | |
| | Refresh token | String | OAuth2 refresh token. Shown only when 'OAuth2' is selected for the 'Authentication type' property. | Yes | | |
| | Account file path | String | Path on the local file system of the user/service account key used for authorization. Shown only when 'Service account' is selected for the 'Authentication type' property. Can be set to 'auto-detect' when running on a Dataproc cluster, in which case the plugin uses the value of the environment variable "GOOGLE_APPLICATION_CREDENTIALS". When running on other clusters, the file must be present on every node in the cluster. The service account JSON can be generated on the Google Cloud Service Account page. | Yes | | auto-detect |
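
The 'auto-detect' behaviour of the Account file path property can be sketched as follows. This is only an illustration under stated assumptions: the function name `resolve_account_file_path` and the choice to raise an error when the environment variable is missing are hypothetical, not taken from the plugin.

```python
import os
from typing import Mapping, Optional

AUTO_DETECT = "auto-detect"
CREDENTIALS_ENV_VAR = "GOOGLE_APPLICATION_CREDENTIALS"


def resolve_account_file_path(configured: Optional[str],
                              environ: Mapping[str, str] = os.environ) -> str:
    """Return the key file path to use for service account authentication (hypothetical helper)."""
    # An empty or missing value falls back to the documented default, 'auto-detect'.
    value = (configured or "").strip() or AUTO_DETECT
    if value != AUTO_DETECT:
        # An explicit path must exist on every node of the cluster.
        return value
    path = environ.get(CREDENTIALS_ENV_VAR)
    if not path:
        # Assumption: fail early rather than continue without credentials.
        raise ValueError(f"{CREDENTIALS_ENV_VAR} is not set; cannot auto-detect the account file")
    return path


if __name__ == "__main__":
    print(resolve_account_file_path("/etc/keys/sa.json"))
```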

    Design / Implementation Tips

    • Tip #1
    • Tip #2

    Design

    Approach(s)

    Source

    The Google Drive Source plugin reads files from the specified Google Drive folder via the Drive API.

    The output schema is fully defined by the plugin settings provided by the user. Two fields are mandatory: body and offset. All other fields depend on the file properties selected in the File properties property.

    The body format is defined by the Body output format property and is either BYTES or STRING. The offset field holds the byte position in the original file at which the body starts.

    The plugin can limit the maximum body size per partition through the Maximum partition size property. With the default value 0, the plugin processes entire files without partitioning, and the offset field is always 0.
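
The relationship between the maximum partition size and the offset field can be shown with a small sketch. The helper name `split_into_partitions` is hypothetical, and the sketch works on a body that is already in memory, which the plugin may well not do.

```python
from typing import Iterator, Tuple


def split_into_partitions(body: bytes, max_partition_size: int = 0) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, chunk) pairs for one file body (hypothetical helper)."""
    if max_partition_size < 0:
        raise ValueError("maximum partition size must not be negative")
    # 0 means unlimited: the whole file is one partition starting at offset 0.
    if max_partition_size == 0 or len(body) <= max_partition_size:
        yield 0, body
        return
    # Otherwise each chunk starts where the previous one ended.
    for offset in range(0, len(body), max_partition_size):
        yield offset, body[offset:offset + max_partition_size]


if __name__ == "__main__":
    for offset, chunk in split_into_partitions(b"abcdefghij", 4):
        print(offset, chunk)
```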

    Example of the output schema when the user selects the id, name, mimeType, description, createdTime, modifiedTime, size, imageMediaMetadata.width, imageMediaMetadata.height and imageMediaMetadata.rotation fields in the File properties property:

    (Image: example output schema)
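
As a textual stand-in for this schema example, the sketch below lists the fields such a schema would contain, using the types given in the File properties table. The `long` type of the offset field, the flat dotted field names, and the helper name `build_output_schema` are assumptions; the plugin may represent nested metadata differently.

```python
from typing import List, Sequence, Tuple

# Types copied from the File properties table for the properties in the example.
PROPERTY_TYPES = {
    "id": "string",
    "name": "string",
    "mimeType": "string",
    "description": "string",
    "createdTime": "timestamp milliseconds",
    "modifiedTime": "timestamp milliseconds",
    "size": "long",
    "imageMediaMetadata.width": "int",
    "imageMediaMetadata.height": "int",
    "imageMediaMetadata.rotation": "int",
}


def build_output_schema(selected: Sequence[str], body_format: str = "bytes") -> List[Tuple[str, str]]:
    """Return (field name, type) pairs for the source output (hypothetical helper)."""
    if body_format not in ("bytes", "string"):
        raise ValueError("body format must be 'bytes' or 'string'")
    # body and offset are always present; the offset type is an assumption.
    fields = [("body", body_format), ("offset", "long")]
    # Every selected file property adds one field.
    fields.extend((name, PROPERTY_TYPES[name]) for name in selected)
    return fields


if __name__ == "__main__":
    for field in build_output_schema(list(PROPERTY_TYPES)):
        print(field)
```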

    Sink

    The Google Drive Sink plugin writes the received data as files to the specified Google Drive folder via the Drive API.

    The sink writes a separate file for each received partition. The only required input field is the file body. The name of the schema field that holds the file body is specified with the File body field property. Only the BYTES format is supported for the body field.

    There are similar properties for the file name and MIME type. Both require the STRING format. If the name field is not specified, the sink generates random 16-symbol names. If the MIME type field is not set, Google Drive tries to detect it automatically.

    The sink plugin does not support writing partitioned files.
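
A minimal sketch of how one input record could be mapped to a file, including the random 16-symbol fallback name. The alphabet (letters and digits), the rejection of Google Drive MIME types with an error, and the names `random_file_name` and `record_to_file` are assumptions for illustration only.

```python
import secrets
import string
from typing import Any, Dict, Mapping, Optional

NAME_ALPHABET = string.ascii_letters + string.digits  # assumed symbol set
GOOGLE_MIME_PREFIX = "application/vnd.google-apps."


def random_file_name(length: int = 16) -> str:
    """Generate a random name of the given length."""
    return "".join(secrets.choice(NAME_ALPHABET) for _ in range(length))


def record_to_file(record: Mapping[str, Any],
                   body_field: str,
                   name_field: Optional[str] = None,
                   mime_field: Optional[str] = None) -> Dict[str, Any]:
    """Turn one input record into the pieces needed to create a file."""
    body = record[body_field]
    if not isinstance(body, (bytes, bytearray)):
        raise TypeError("the file body field must contain bytes")
    name = record.get(name_field) if name_field else None
    mime = record.get(mime_field) if mime_field else None
    if mime and mime.startswith(GOOGLE_MIME_PREFIX):
        raise ValueError("Google Drive MIME types are not supported by the sink")
    # A missing MIME type stays None so that Google Drive can detect it.
    return {"name": name or random_file_name(), "mimeType": mime, "body": bytes(body)}


if __name__ == "__main__":
    print(record_to_file({"content": b"hello"}, body_field="content"))
```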

    Properties

    Modification date range

    Filters files by last modified date. Available values:

    1. None - files are not filtered.
    2. Last 7 days - from 7 days before the current date and time until the current date and time, for example: from "2019-09-12T19:52:13.456" to "2019-09-19T19:52:13.456".
    3. Last 30 days - from 30 days before the current date and time until the current date and time, for example: from "2019-08-20T19:52:13.456" to "2019-09-19T19:52:13.456".
    4. Previous quarter - from the start of the previous quarter until the end of the previous quarter, for example: from "2019-04-01T00:00:00.000" to "2019-06-30T23:59:59.999".
    5. Current quarter - from the start of the current quarter until the current date and time, for example: from "2019-07-01T00:00:00.000" to "2019-09-19T19:52:13.456".
    6. Last year - from the start of the previous year until the end of the previous year, for example: from "2018-01-01T00:00:00.000" to "2018-12-31T23:59:59.999".
    7. Current year - from the start of the current year until the current date and time, for example: from "2019-01-01T00:00:00.000" to "2019-09-19T19:52:13.456".
    8. Custom - the user enters the start and end dates manually.

    This filter is independent of the Filter property, so the user can set a Modification date range that conflicts with the Filter.
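
The predefined ranges above can be computed from the current date and time as in the following sketch. The function name `modification_date_range` is hypothetical; "Custom" is not handled because its bounds come directly from the Start date and End date properties.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def modification_date_range(option: str, now: datetime) -> Optional[Tuple[datetime, datetime]]:
    """Return (start, end) for a predefined option, or None when no filtering applies."""
    one_ms = timedelta(milliseconds=1)
    year_start = now.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    # Quarters start in months 1, 4, 7 and 10.
    quarter_month = 3 * ((now.month - 1) // 3) + 1
    quarter_start = year_start.replace(month=quarter_month)

    if option == "None":
        return None
    if option == "Last 7 days":
        return now - timedelta(days=7), now
    if option == "Last 30 days":
        return now - timedelta(days=30), now
    if option == "Previous quarter":
        # Step back one quarter; the first quarter wraps to October of the previous year.
        if quarter_month == 1:
            previous_start = quarter_start.replace(year=now.year - 1, month=10)
        else:
            previous_start = quarter_start.replace(month=quarter_month - 3)
        return previous_start, quarter_start - one_ms
    if option == "Current quarter":
        return quarter_start, now
    if option == "Last year":
        return year_start.replace(year=now.year - 1), year_start - one_ms
    if option == "Current year":
        return year_start, now
    raise ValueError(f"unsupported option: {option}")


if __name__ == "__main__":
    now = datetime(2019, 9, 19, 19, 52, 13, 456000)
    start, end = modification_date_range("Previous quarter", now)
    print(start.isoformat(timespec="milliseconds"), end.isoformat(timespec="milliseconds"))
```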


    File properties

    The user can select which file metadata provided by the Google Drive API to include. Not all properties are available yet. The descriptions were taken from the Files overview.

    | Property name | Type | Description |
    |---|---|---|
    | id | string | The ID of the file. |
    | name | string | The name of the file. This is not necessarily unique within a folder. |
    | mimeType | string | The MIME type of the file. Google Drive will attempt to automatically detect an appropriate value from uploaded content if no value is provided. |
    | description | string | A short description of the file. |
    | starred | boolean | Whether the user has starred the file. |
    | trashed | boolean | Whether the file has been trashed, either explicitly or from a trashed parent folder. Only the owner may trash a file, and other users cannot see files in the owner's trash. |
    | explicitlyTrashed | boolean | Whether the file has been explicitly trashed, as opposed to recursively trashed from a parent folder. |
    | trashedTime | timestamp milliseconds | The time that the item was trashed (RFC 3339 date-time). Only populated for items in shared drives. |
    | parents | array of strings | The IDs of the parent folders which contain the file. |
    | properties | record of key-value strings | A collection of arbitrary key-value pairs which are visible to all apps. |
    | spaces | string | The list of spaces which contain the file. The currently supported values are 'drive', 'appDataFolder' and 'photos'. |
    | createdTime | timestamp milliseconds | The time at which the file was created (RFC 3339 date-time). |
    | modifiedTime | timestamp milliseconds | The last time the file was modified by anyone (RFC 3339 date-time). |
    | driveId | | ID of the shared drive the file resides in. Only populated for items in shared drives. |
    | originalFilename | string | The original filename of the uploaded content if available, or else the original value of the name field. This is only available for files with binary content in Google Drive. |
    | fullFileExtension | string | The full file extension extracted from the name field. May contain multiple concatenated extensions, such as "tar.gz". This is only available for files with binary content in Google Drive. |
    | md5Checksum | string | The MD5 checksum for the content of the file. This is only applicable to files with binary content in Google Drive. |
    | size | long | The size of the file's content in bytes. This is only applicable to files with binary content in Google Drive. |
    | imageMediaMetadata.width | int | The width of the image in pixels. |
    | imageMediaMetadata.height | int | The height of the image in pixels. |
    | imageMediaMetadata.rotation | int | The rotation in clockwise degrees from the image's original orientation. |
    | imageMediaMetadata.location.latitude | double | The latitude stored in the image. |
    | imageMediaMetadata.location.longitude | double | The longitude stored in the image. |
    | imageMediaMetadata.location.altitude | double | The altitude stored in the image. |
    | imageMediaMetadata.time | string | The date and time the photo was taken (EXIF DateTime). |
    | imageMediaMetadata.cameraMake | string | The make of the camera used to create the photo. |
    | imageMediaMetadata.cameraModel | string | The model of the camera used to create the photo. |
    | imageMediaMetadata.exposureTime | float | The length of the exposure, in seconds. |
    | imageMediaMetadata.aperture | float | The aperture used to create the photo (f-number). |
    | imageMediaMetadata.flashUsed | boolean | Whether a flash was used to create the photo. |
    | imageMediaMetadata.focalLength | float | The focal length used to create the photo, in millimeters. |
    | imageMediaMetadata.isoSpeed | int | The ISO speed used to create the photo. |
    | imageMediaMetadata.meteringMode | string | The metering mode used to create the photo. |
    | imageMediaMetadata.sensor | string | The type of sensor used to create the photo. |
    | imageMediaMetadata.exposureMode | string | The exposure mode used to create the photo. |
    | imageMediaMetadata.colorSpace | string | The color space of the photo. |
    | imageMediaMetadata.whiteBalance | string | The white balance mode used to create the photo. |
    | imageMediaMetadata.exposureBias | float | The exposure bias of the photo (APEX value). |
    | imageMediaMetadata.maxApertureValue | float | The smallest f-number of the lens at the focal length used to create the photo (APEX value). |
    | imageMediaMetadata.subjectDistance | int | The distance to the subject of the photo, in meters. |
    | imageMediaMetadata.lens | string | The lens used to create the photo. |
    | videoMediaMetadata.width | int | The width of the video in pixels. |
    | videoMediaMetadata.height | int | The height of the video in pixels. |
    | videoMediaMetadata.durationMillis | long | The duration of the video in milliseconds. |


    File types to pull

    All files in Google Drive fall into one of two groups by format: Google formats and all others (binary). Files in Google formats cannot be downloaded directly. Instead, they must first be exported to a binary format, and only if an export option is available for the given format. Binary files can be downloaded directly. The user can specify which formats to download or export:

    • Binary - downloaded directly (text/plain, image/bmp, video/mp4, etc.)
    • Google Documents - exported to the format specified in the Google Documents export format property.
    • Google Spreadsheets - exported to the format specified in the Google Spreadsheets export format property.
    • Google Drawings - exported to the format specified in the Google Drawings export format property.
    • Google Presentations - exported to the format specified in the Google Presentations export format property.
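
The download-or-export decision described above might look like the following sketch. The helper name `plan_transfer`, the returned tuples, and the choice to skip Google formats that have no selectable type are assumptions for illustration.

```python
from typing import Collection, Mapping, Optional, Tuple

GOOGLE_MIME_PREFIX = "application/vnd.google-apps."

# Selectable Google file types keyed by their Drive MIME type.
GOOGLE_TYPE_BY_MIME = {
    "application/vnd.google-apps.document": "Google Documents",
    "application/vnd.google-apps.spreadsheet": "Google Spreadsheets",
    "application/vnd.google-apps.drawing": "Google Drawings",
    "application/vnd.google-apps.presentation": "Google Presentations",
}


def plan_transfer(mime_type: str,
                  types_to_pull: Collection[str],
                  export_formats: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return ("download", mime), ("export", target mime), or None to skip the file."""
    file_type = GOOGLE_TYPE_BY_MIME.get(mime_type)
    if file_type is None:
        if mime_type.startswith(GOOGLE_MIME_PREFIX):
            # Assumption: other Google formats (for example folders) are skipped.
            return None
        file_type = "Binary"
    if file_type not in types_to_pull:
        return None
    if file_type == "Binary":
        return "download", mime_type
    return "export", export_formats[file_type]


if __name__ == "__main__":
    formats = {"Google Documents": "text/plain", "Google Spreadsheets": "text/csv"}
    selected = {"Binary", "Google Spreadsheets"}
    print(plan_transfer("application/vnd.google-apps.spreadsheet", selected, formats))
```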


    Exporting

    This section presents the export map for Google Drive formats. All available export options:

    | Original format name | Original format MIME | Export format name | Export format MIME |
    |---|---|---|---|
    | Google Documents | application/vnd.google-apps.document | HTML | text/html |
    | | | HTML (zipped) | application/zip |
    | | | Plain text | text/plain |
    | | | Rich text | application/rtf |
    | | | Open Office doc | application/vnd.oasis.opendocument.text |
    | | | PDF | application/pdf |
    | | | MS Word document | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
    | | | EPUB | application/epub+zip |
    | Google Spreadsheets | application/vnd.google-apps.spreadsheet | MS Excel | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
    | | | Open Office sheet | application/x-vnd.oasis.opendocument.spreadsheet |
    | | | PDF | application/pdf |
    | | | CSV (first sheet only) | text/csv |
    | | | TSV (first sheet only) | text/tab-separated-values |
    | | | HTML (zipped) | application/zip |
    | Google Drawings | application/vnd.google-apps.drawing | JPEG | image/jpeg |
    | | | PNG | image/png |
    | | | SVG | image/svg+xml |
    | | | PDF | application/pdf |
    | Google Presentations | application/vnd.google-apps.presentation | MS PowerPoint | application/vnd.openxmlformats-officedocument.presentationml.presentation |
    | | | Open Office presentation | application/vnd.oasis.opendocument.presentation |
    | | | PDF | application/pdf |
    | | | Plain text | text/plain |
    | Google Apps Scripts | application/vnd.google-apps.script | JSON | application/vnd.google-apps.script+json |
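
The export map can also be expressed as data and used to validate a configured export format before a pipeline runs. This is a sketch only: the name `validate_export_format` and the decision to raise an error for unsupported combinations are assumptions, and only two of the source formats are spelled out to keep the example short.

```python
from typing import Dict, FrozenSet

# Subset of the export map above: source MIME type -> allowed export MIME types.
ALLOWED_EXPORTS: Dict[str, FrozenSet[str]] = {
    "application/vnd.google-apps.spreadsheet": frozenset({
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "application/x-vnd.oasis.opendocument.spreadsheet",
        "application/pdf",
        "text/csv",
        "text/tab-separated-values",
        "application/zip",
    }),
    "application/vnd.google-apps.drawing": frozenset({
        "image/jpeg",
        "image/png",
        "image/svg+xml",
        "application/pdf",
    }),
}


def validate_export_format(source_mime: str, export_mime: str) -> None:
    """Raise ValueError if the export format is not listed for the source format."""
    allowed = ALLOWED_EXPORTS.get(source_mime)
    if allowed is None:
        raise ValueError(f"no export options known for {source_mime}")
    if export_mime not in allowed:
        raise ValueError(f"{source_mime} cannot be exported as {export_mime}")


if __name__ == "__main__":
    validate_export_format("application/vnd.google-apps.drawing", "image/svg+xml")
    print("export format accepted")
```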


    Security

    Limitation(s)

    Future Work

    • Some future work – HYDRATOR-99999
    • Another future work – HYDRATOR-99999

    Test Case(s)

    • Test case #1
    • Test case #2

    Sample Pipeline

    Please attach one or more sample pipeline(s) and associated data. 

    Pipeline #1

    Pipeline #2





    Checklist

    •  User stories documented 
    •  User stories reviewed 
    •  Design documented 
    •  Design reviewed 
    •  Feature merged 
    •  Examples and guides 
    •  Integration tests 
    •  Documentation for feature 
    •  Short video demonstrating the feature