Introduction

Google Drive plugins help users move entire files from a source to a destination. Along the way, users can also run transformations on unstructured data such as images, audio, and video.

User Stories

  • As a pipeline developer, I want to move all files from a Google Drive directory to a different destination.
  • As a pipeline developer, I want to move all files from a Google Drive directory that satisfy a filter to a different destination.
  • As a pipeline developer, I want to pull all images from a Google Drive directory, so that I can process them using image recognition APIs.
  • As a pipeline developer, I want to pull all audio and video files from a Google Drive directory, so that I can process them to extract metadata, generate transcripts, or apply other enrichments.
  • As a pipeline developer, I want to move all files from an FTP source into Google Drive.

Plugin Type

  •  Batch Source
  •  Batch Sink 
  •  Real-time Source
  •  Real-time Sink
  •  Action
  •  Post-Run Action
  •  Aggregate
  •  Join
  •  Spark Model
  •  Spark Compute

Configurables

This section defines properties that are configurable for this plugin. 

Source

| Option level | User Facing Name | Type | Description | Optional | Constraints | Default value |
|---|---|---|---|---|---|---|
| Basic | Directory identifier | String | Identifier of the source folder. | No | | |
| | Filter | String | A filter that can be applied to the files in the selected directory. Filters follow the Google Drive Filter Syntax. | Yes | | |
| | Modification date range | Select | In addition to the filter specified above, pull only files that were modified within the selected date range. | Yes | | select |
| | Start date | Textbox | Shown only when "Modification date range" is set to "Custom". Start of the modification date range, in RFC 3339 format; the default time zone is UTC. Example: 2012-06-04T12:00:00-08:00. | No | | |
| | End date | Textbox | Shown only when "Modification date range" is set to "Custom". End of the modification date range, in RFC 3339 format; the default time zone is UTC. Example: 2012-06-04T12:00:00-08:00. | No | | |
| | File properties | Multi-select | Properties to retrieve for each file in the directory. Allowed names are listed in Google Drive API: Files. | Yes | | |
| | File types to pull | Multi-select | Types of files to pull from the specified directory. | Yes | | binary |
| Authentication | Client ID | String | OAuth2 client ID. | No | | |
| | Client secret | String | OAuth2 client secret. | No | | |
| | Refresh token | String | OAuth2 refresh token. | No | | |
| | Access token | String | OAuth2 access token. | No | | |
| Advanced | Maximum partition size | Number | Maximum partition size in bytes. The default value of 0 means unlimited. | Yes | | 0 |
| | Body output format | Radio-group | Format of the file body. Available values are "Bytes" and "String". | Yes | | bytes |
| Exporting | Google Documents export format | Select | MIME type for Google Documents. Allowed values are listed in Downloading Google Documents. | Yes | | text/plain |
| | Google Spreadsheets export format | Select | MIME type for Google Spreadsheets. | Yes | | text/csv |
| | Google Drawings export format | Select | MIME type for Google Drawings. | Yes | | image/svg+xml |
| | Google Presentations export format | Select | MIME type for Google Presentations. | Yes | | text/plain |
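
The Directory identifier, Filter, Start date, and End date properties can all be expressed in a single Google Drive search query. The following is a minimal sketch of one way to do that, not the plugin's actual implementation. The function names `normalize_rfc3339` and `build_query` are hypothetical, and joining the clauses with `and` is an assumption; the `in parents` and `modifiedTime` terms come from the Drive search query syntax.

```python
from datetime import datetime, timezone


def normalize_rfc3339(text: str) -> str:
    """Parse a timestamp and assume UTC when it carries no offset."""
    # Replace a trailing "Z" so that older Python versions can parse it.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    moment = datetime.fromisoformat(text)
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.isoformat()


def build_query(directory_id, user_filter=None, start=None, end=None):
    """Combine the source properties into one Drive search query string.

    Assumption: all conditions are joined with "and", so a user filter that
    contradicts the date range simply matches no files.
    """
    clauses = [f"'{directory_id}' in parents"]
    if user_filter:
        # Parenthesize so an "or" inside the filter cannot escape the folder clause.
        clauses.append(f"({user_filter})")
    if start:
        clauses.append(f"modifiedTime >= '{normalize_rfc3339(start)}'")
    if end:
        clauses.append(f"modifiedTime <= '{normalize_rfc3339(end)}'")
    return " and ".join(clauses)


if __name__ == "__main__":
    print(build_query("abc123", "name contains 'report'",
                      "2012-06-04T12:00:00-08:00", "2012-06-05T12:00:00"))
```

The folder identifier `abc123` and the filter text are placeholder values for illustration only.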

Sink

| Option level | User Facing Name | Type | Description | Optional | Constraints |
|---|---|---|---|---|---|
| Basic | File name field | String | Name of the schema field (must be of STRING type) that is used as the file name. If it is not set, files get randomly generated 16-character names. | Yes | |
| | File body field | String | Name of the schema field (must be of BYTES type) that is used as the file body. A minimal input schema contains only this field. | No | |
| | Directory identifier | String | Identifier of the destination folder. | No | |
| Authentication | Client ID | String | OAuth2 client ID. | No | |
| | Client secret | String | OAuth2 client secret. | No | |
| | Refresh token | String | OAuth2 refresh token. | No | |
| | Access token | String | OAuth2 access token. | No | |
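
To illustrate how the sink might use the File name field and File body field properties, here is a small sketch. The helper names `resolve_file_name` and `extract_body`, the dictionary-based record, and the alphanumeric character set for generated names are all assumptions made for this example; the source only states that generated names are 16 characters long.

```python
import secrets
import string

# Assumed character set for generated names; the design does not specify one.
ALPHABET = string.ascii_letters + string.digits


def resolve_file_name(record: dict, name_field=None) -> str:
    """Return the name stored in the record, or a random 16-character name."""
    if name_field:
        return str(record[name_field])
    return "".join(secrets.choice(ALPHABET) for _ in range(16))


def extract_body(record: dict, body_field: str) -> bytes:
    """Return the file body, rejecting values that are not bytes."""
    body = record[body_field]
    if not isinstance(body, (bytes, bytearray)):
        raise TypeError(f"field '{body_field}' must hold bytes")
    return bytes(body)


if __name__ == "__main__":
    record = {"name": "report.txt", "body": b"hello"}
    print(resolve_file_name(record, "name"), extract_body(record, "body"))
    print(resolve_file_name(record))  # random name when no field is configured
```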

Design / Implementation Tips

  • Tip #1
  • Tip #2

Design

Approach(s)

Properties

Modification date range

Filters files by last modified date. Available values:

  1. None - files are not filtered.
  2. Last 7 days - from 7 days before the current date and time until the current date and time, for example: from "2019-09-12T19:52:13.456" to "2019-09-19T19:52:13.456".
  3. Last 30 days - from 30 days before the current date and time until the current date and time, for example: from "2019-08-20T19:52:13.456" to "2019-09-19T19:52:13.456".
  4. Previous quarter - from the start of the previous quarter until the end of the previous quarter, for example: from "2019-04-01T00:00:00.000" to "2019-06-30T23:59:59.999".
  5. Current quarter - from the start of the current quarter until the current date and time, for example: from "2019-07-01T00:00:00.000" to "2019-09-19T19:52:13.456".
  6. Last year - from the start of the previous year until the end of the previous year, for example: from "2018-01-01T00:00:00.000" to "2018-12-31T23:59:59.999".
  7. Current year - from the start of the current year until the current date and time, for example: from "2019-01-01T00:00:00.000" to "2019-09-19T19:52:13.456".
  8. Custom - the user enters the start and end dates manually.

This filter is independent of the Filter property, so the user can set a Modification date range that conflicts with the Filter value.
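
The predefined ranges listed above can be computed from the current date and time. The sketch below is one possible way to do it and reproduces the example values from the list. The function names are hypothetical, and "Custom" is left out because its bounds come from the Start date and End date properties.

```python
from datetime import datetime, timedelta


def quarter_start(moment: datetime) -> datetime:
    """Return midnight on the first day of the quarter containing `moment`."""
    first_month = 3 * ((moment.month - 1) // 3) + 1
    return datetime(moment.year, first_month, 1)


def modification_range(option: str, now: datetime):
    """Map a "Modification date range" option to a (start, end) pair.

    Returns None for "None", which means no date filtering.
    """
    millisecond = timedelta(milliseconds=1)
    if option == "None":
        return None
    if option == "Last 7 days":
        return now - timedelta(days=7), now
    if option == "Last 30 days":
        return now - timedelta(days=30), now
    if option == "Previous quarter":
        current_start = quarter_start(now)
        # Step back one day to land inside the previous quarter.
        previous_start = quarter_start(current_start - timedelta(days=1))
        return previous_start, current_start - millisecond
    if option == "Current quarter":
        return quarter_start(now), now
    if option == "Last year":
        return datetime(now.year - 1, 1, 1), datetime(now.year, 1, 1) - millisecond
    if option == "Current year":
        return datetime(now.year, 1, 1), now
    raise ValueError(f"unsupported option: {option}")


def fmt(moment: datetime) -> str:
    """Format with millisecond precision, as in the examples above."""
    return moment.isoformat(timespec="milliseconds")


if __name__ == "__main__":
    now = datetime(2019, 9, 19, 19, 52, 13, 456000)
    start, end = modification_range("Previous quarter", now)
    print(fmt(start), fmt(end))
```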


File properties

The user can select which file metadata properties, as provided by the Google Drive API, to retrieve. Not all properties are available yet. The descriptions below were taken from the Files overview.

| Property name | Type | Description |
|---|---|---|
| id | string | The ID of the file. |
| name | string | The name of the file. This is not necessarily unique within a folder. |
| mimeType | string | The MIME type of the file. Google Drive will attempt to automatically detect an appropriate value from uploaded content if no value is provided. |
| description | string | A short description of the file. |
| starred | boolean | Whether the user has starred the file. |
| trashed | boolean | Whether the file has been trashed, either explicitly or from a trashed parent folder. Only the owner may trash a file, and other users cannot see files in the owner's trash. |
| explicitlyTrashed | boolean | Whether the file has been explicitly trashed, as opposed to recursively trashed from a parent folder. |
| trashedTime | timestamp milliseconds | The time that the item was trashed (RFC 3339 date-time). Only populated for items in shared drives. |
| parents | array of strings | The IDs of the parent folders which contain the file. |
| properties | record of key-value strings | A collection of arbitrary key-value pairs which are visible to all apps. |
| spaces | string | The list of spaces which contain the file. The currently supported values are 'drive', 'appDataFolder' and 'photos'. |
| createdTime | timestamp milliseconds | The time at which the file was created (RFC 3339 date-time). |
| modifiedTime | timestamp milliseconds | The last time the file was modified by anyone (RFC 3339 date-time). |
| driveId | | ID of the shared drive the file resides in. Only populated for items in shared drives. |
| originalFilename | string | The original filename of the uploaded content if available, or else the original value of the name field. This is only available for files with binary content in Google Drive. |
| fullFileExtension | string | The full file extension extracted from the name field. May contain multiple concatenated extensions, such as "tar.gz". This is only available for files with binary content in Google Drive. |
| md5Checksum | string | The MD5 checksum for the content of the file. This is only applicable to files with binary content in Google Drive. |
| size | long | The size of the file's content in bytes. This is only applicable to files with binary content in Google Drive. |
| imageMediaMetadata.width | int | The width of the image in pixels. |
| imageMediaMetadata.height | int | The height of the image in pixels. |
| imageMediaMetadata.rotation | int | The rotation in clockwise degrees from the image's original orientation. |
| imageMediaMetadata.location.latitude | double | The latitude stored in the image. |
| imageMediaMetadata.location.longitude | double | The longitude stored in the image. |
| imageMediaMetadata.location.altitude | double | The altitude stored in the image. |
| imageMediaMetadata.time | string | The date and time the photo was taken (EXIF DateTime). |
| imageMediaMetadata.cameraMake | string | The make of the camera used to create the photo. |
| imageMediaMetadata.cameraModel | string | The model of the camera used to create the photo. |
| imageMediaMetadata.exposureTime | float | The length of the exposure, in seconds. |
| imageMediaMetadata.aperture | float | The aperture used to create the photo (f-number). |
| imageMediaMetadata.flashUsed | boolean | Whether a flash was used to create the photo. |
| imageMediaMetadata.focalLength | float | The focal length used to create the photo, in millimeters. |
| imageMediaMetadata.isoSpeed | int | The ISO speed used to create the photo. |
| imageMediaMetadata.meteringMode | string | The metering mode used to create the photo. |
| imageMediaMetadata.sensor | string | The type of sensor used to create the photo. |
| imageMediaMetadata.exposureMode | string | The exposure mode used to create the photo. |
| imageMediaMetadata.colorSpace | string | The color space of the photo. |
| imageMediaMetadata.whiteBalance | string | The white balance mode used to create the photo. |
| imageMediaMetadata.exposureBias | float | The exposure bias of the photo (APEX value). |
| imageMediaMetadata.maxApertureValue | float | The smallest f-number of the lens at the focal length used to create the photo (APEX value). |
| imageMediaMetadata.subjectDistance | int | The distance to the subject of the photo, in meters. |
| imageMediaMetadata.lens | string | The lens used to create the photo. |
| videoMediaMetadata.width | int | The width of the video in pixels. |
| videoMediaMetadata.height | int | The height of the video in pixels. |
| videoMediaMetadata.durationMillis | long | The duration of the video in milliseconds. |
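
The dotted property names above map naturally onto the nested field selectors that Google APIs accept in the `fields` request parameter. As a hedged sketch, the hypothetical helper `build_fields_mask` below turns a list of selected properties into such a mask for a file listing request. Whether the plugin builds its request this way is an assumption.

```python
def build_fields_mask(selected):
    """Convert dotted property names into a nested `fields` mask.

    Example: "imageMediaMetadata.location.latitude" becomes
    "imageMediaMetadata(location(latitude))" inside "files(...)".
    """
    # Build a tree of nested dictionaries, one level per dotted segment.
    tree = {}
    for name in selected:
        node = tree
        for part in name.split("."):
            node = node.setdefault(part, {})

    def render(node):
        # Leaves render as plain names, branches as name(sub-selection).
        return ",".join(
            key if not children else f"{key}({render(children)})"
            for key, children in node.items()
        )

    return f"files({render(tree)})"


if __name__ == "__main__":
    print(build_fields_mask([
        "id",
        "name",
        "imageMediaMetadata.width",
        "imageMediaMetadata.location.latitude",
    ]))
```

Properties that share a prefix, such as the `imageMediaMetadata` entries, are merged under a single selector so each parent appears only once in the mask.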


File types to pull

Files in Google Drive fall into two groups by format: Google formats and all others (binary). Files in Google formats cannot be downloaded directly; they must first be exported to a binary format, and only if an export option exists for that format. Binary files can be downloaded directly. The user can specify which types to download or export:

  • Binary - downloaded directly (text/plain, image/bmp, video/mp4, etc.).
  • Google Documents - exported to the format specified in the Google Documents export format property.
  • Google Spreadsheets - exported to the format specified in the Google Spreadsheets export format property.
  • Google Drawings - exported to the format specified in the Google Drawings export format property.
  • Google Presentations - exported to the format specified in the Google Presentations export format property.
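
The download-or-export decision described above could be sketched as follows. The function `plan_fetch` and its return shape are hypothetical. The `application/vnd.google-apps.*` values are the Drive MIME types for the four Google formats, and the default export formats are the ones listed for the Exporting properties of the source. Skipping other Google types (for example folders) is an assumption.

```python
BINARY = "Binary"

# Drive MIME types for the four Google formats named above.
GOOGLE_TYPES = {
    "application/vnd.google-apps.document": "Google Documents",
    "application/vnd.google-apps.spreadsheet": "Google Spreadsheets",
    "application/vnd.google-apps.drawing": "Google Drawings",
    "application/vnd.google-apps.presentation": "Google Presentations",
}

# Default export formats from the Exporting properties of the source.
DEFAULT_EXPORT_FORMATS = {
    "Google Documents": "text/plain",
    "Google Spreadsheets": "text/csv",
    "Google Drawings": "image/svg+xml",
    "Google Presentations": "text/plain",
}


def plan_fetch(mime_type, selected_types, export_formats=DEFAULT_EXPORT_FORMATS):
    """Return (action, target MIME type) for one file.

    The action is "download", "export", or "skip".
    """
    if mime_type.startswith("application/vnd.google-apps."):
        file_type = GOOGLE_TYPES.get(mime_type)
        if file_type in selected_types:
            return ("export", export_formats[file_type])
        # Unselected or unsupported Google types (such as folders) are skipped.
        return ("skip", None)
    if BINARY in selected_types:
        return ("download", mime_type)
    return ("skip", None)


if __name__ == "__main__":
    selected = {"Binary", "Google Spreadsheets"}
    print(plan_fetch("video/mp4", selected))
    print(plan_fetch("application/vnd.google-apps.spreadsheet", selected))
    print(plan_fetch("application/vnd.google-apps.document", selected))
```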


Security

Limitation(s)

Future Work

  • Some future work – HYDRATOR-99999
  • Another future work – HYDRATOR-99999

Test Case(s)

  • Test case #1
  • Test case #2

Sample Pipeline

Please attach one or more sample pipeline(s) and associated data. 

Pipeline #1

Pipeline #2





Checklist

  •  User stories documented 
  •  User stories reviewed 
  •  Design documented 
  •  Design reviewed 
  •  Feature merged 
  •  Examples and guides 
  •  Integration tests 
  •  Documentation for feature 
  •  Short video demonstrating the feature