Introduction
Google Drive plugins help users move entire files from a source to a destination. Along the way, users can also run transformations on unstructured data such as images, audio and video.
User Storie(s)
- As a pipeline developer, I want to move all files from a Google Drive directory to a different destination.
- As a pipeline developer, I want to move all files from a Google Drive directory that satisfy a filter to a different destination.
- As a pipeline developer, I want to pull all images from a Google Drive directory, so that I can process them using image recognition APIs.
- As a pipeline developer, I want to pull all audio and video files from a Google Drive directory, so that I can process them to extract metadata, generate transcripts, or apply other enrichments.
- As a pipeline developer, I want to move all files from an FTP source into Google Drive.
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Configurables
This section defines properties that are configurable for this plugin.
Source
Option level | User Facing Name | Type | Description | Optional | Constraints | Default value |
---|---|---|---|---|---|---|
Basic | Directory identifier | String | Identifier of the source folder. | No | | |
Basic | File metadata properties | Multi-select | Properties to retrieve for each file in the directory. Allowed names are listed in Google Drive API: Files. | Yes | | |
Filtering | Filter | String | A filter that can be applied to the files in the selected directory. Filters follow the Google Drive Filter Syntax. | Yes | | |
Filtering | Modification date range | Select | In addition to the filter specified above, only pull files that were modified within the selected date range. | Yes | | select |
Filtering | Start date | Textbox | Only shown when "Modification date range" is set to "Custom". Start date of the modification date range. RFC3339 format, default timezone is UTC, e.g., 2012-06-04T12:00:00-08:00. | No | | |
Filtering | End date | Textbox | Only shown when "Modification date range" is set to "Custom". End date of the modification date range. RFC3339 format, default timezone is UTC, e.g., 2012-06-04T12:00:00-08:00. | No | | |
Filtering | File types to pull | Multi-select | Types of files to pull from the specified directory. | Yes | | binary |
Authentication | Authentication type | Radio-group | Defines the authentication type. OAuth2 and Service account types are available. | No | | OAuth2 |
Authentication | Client ID | String | OAuth2 client id. Shown only when the 'OAuth2' auth type is selected for the 'Authentication type' property. | Yes | | |
Authentication | Client secret | String | OAuth2 client secret. Shown only when the 'OAuth2' auth type is selected for the 'Authentication type' property. | Yes | | |
Authentication | Refresh token | String | OAuth2 refresh token. Shown only when the 'OAuth2' auth type is selected for the 'Authentication type' property. | Yes | | |
Authentication | Account file path | String | Path on the local file system of the user/service account key used for authorization. Shown only when the 'Service account' auth type is selected for the 'Authentication type' property. | Yes | | auto-detect |
Advanced | Maximum partition size | Number | Maximum partition size in bytes. The default value 0 means unlimited. | Yes | | 0 |
Advanced | Body output format | Radio-group | Format of the file body. "Bytes" and "String" values are available. | Yes | | bytes |
Exporting | Google Documents export format | Select | MIME type for Google Documents. Allowed values are listed in Downloading Google Documents. | Yes | | text/plain |
Exporting | Google Spreadsheets export format | Select | MIME type for Google Spreadsheets. | Yes | | text/csv |
Exporting | Google Drawings export format | Select | MIME type for Google Drawings. | Yes | | image/svg+xml |
Exporting | Google Presentations export format | Select | MIME type for Google Presentations. | Yes | | text/plain |
Sink
Option level | User Facing Name | Type | Description | Optional | Constraints | Default value |
---|---|---|---|---|---|---|
Basic | File name field | String | Name of the schema field (should be STRING type) which will be used as the name of the file. If it is not set, files get randomly generated 16-symbol names. | Yes | | |
Basic | File body field | String | Name of the schema field (should be BYTES type) which will be used as the body of the file. The minimal input schema should contain only this field. | No | | |
Basic | File mime field | String | Name of the schema field (should be STRING type) which will be used as the MIME type of the file. All MIME types are supported except Google Drive types. If it is not set, the Google API will try to recognize the file's MIME type automatically. | Yes | | |
Basic | Directory identifier | String | Identifier of the destination folder. | No | | |
Authentication | Authentication type | Radio-group | Defines the authentication type. OAuth2 and Service account types are available. | No | | OAuth2 |
Authentication | Client ID | String | OAuth2 client id. Shown only when the 'OAuth2' auth type is selected for the 'Authentication type' property. | Yes | | |
Authentication | Client secret | String | OAuth2 client secret. Shown only when the 'OAuth2' auth type is selected for the 'Authentication type' property. | Yes | | |
Authentication | Refresh token | String | OAuth2 refresh token. Shown only when the 'OAuth2' auth type is selected for the 'Authentication type' property. | Yes | | |
Authentication | Account file path | String | Path on the local file system of the user/service account key used for authorization. Shown only when the 'Service account' auth type is selected for the 'Authentication type' property. | Yes | | auto-detect |
Design / Implementation Tips
- Tip #1
- Tip #2
Design
Approach(s)
Source
The Google Drive Source plugin reads files from the specified Google Drive folder via the Drive API.
The output schema is fully defined by the plugin settings provided by the user. Two fields are mandatory: body and offset. All other fields depend on the file properties selected in the File properties property.
The body format is defined by the Body output format property and can be either BYTES or STRING. The offset field holds the byte position in the original file at which the body starts.
The plugin can limit the maximum body size per partition via the Maximum partition size property. With the default value 0, the plugin processes entire files without partitioning and the offset field is always 0.
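The relationship between Maximum partition size and the offset field can be sketched as follows. This is only an illustration under stated assumptions, not the plugin's code: the function name `split_into_partitions` is hypothetical, and each partition is assumed to be described by an (offset, length) pair.

```python
from typing import List, Tuple


def split_into_partitions(file_size: int, max_partition_size: int = 0) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs covering a file of file_size bytes.

    A max_partition_size of 0 means "unlimited": the whole file becomes a
    single partition whose offset is 0.
    """
    if file_size < 0 or max_partition_size < 0:
        raise ValueError("sizes must be non-negative")
    if max_partition_size == 0 or file_size <= max_partition_size:
        return [(0, file_size)]
    # Each partition starts where the previous one ended; the last one may be shorter.
    return [
        (offset, min(max_partition_size, file_size - offset))
        for offset in range(0, file_size, max_partition_size)
    ]
```

For example, a 10-byte file with a maximum partition size of 4 would yield offsets 0, 4 and 8, while the default 0 yields a single partition at offset 0.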
Example of the output schema when the user selects the id, name, mimeType, description, createdTime, modifiedTime, size, imageMediaMetadata.width, imageMediaMetadata.height and imageMediaMetadata.rotation fields in the File properties property:
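A rough sketch of how such a schema could be assembled is shown below. It is not taken from the plugin: the helper `build_output_schema`, the record name `output`, the Avro-style layout, the `timestamp-millis` logical type, and the choice to group dotted properties into a nested record are all assumptions made for illustration. The field types come from the File properties table.

```python
import json

# Assumed Avro-style representation of "timestamp milliseconds".
TIMESTAMP_MILLIS = {"type": "long", "logicalType": "timestamp-millis"}

# Types of the ten properties named above, as listed in the File properties table.
PROPERTY_TYPES = {
    "id": "string",
    "name": "string",
    "mimeType": "string",
    "description": "string",
    "createdTime": TIMESTAMP_MILLIS,
    "modifiedTime": TIMESTAMP_MILLIS,
    "size": "long",
    "imageMediaMetadata.width": "int",
    "imageMediaMetadata.height": "int",
    "imageMediaMetadata.rotation": "int",
}


def build_output_schema(selected, body_format="bytes"):
    """Build a schema with the mandatory body/offset fields plus the selected properties."""
    fields = [{"name": "body", "type": body_format}, {"name": "offset", "type": "long"}]
    nested = {}
    for prop in selected:
        parent, _, child = prop.partition(".")
        entry_type = PROPERTY_TYPES[prop]
        if child:
            # Assumption: dotted names become fields of a nested record.
            nested.setdefault(parent, []).append({"name": child, "type": entry_type})
        else:
            fields.append({"name": prop, "type": entry_type})
    for parent, children in nested.items():
        fields.append({"name": parent, "type": {"type": "record", "name": parent, "fields": children}})
    return {"type": "record", "name": "output", "fields": fields}


if __name__ == "__main__":
    print(json.dumps(build_output_schema(list(PROPERTY_TYPES)), indent=2))
```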
Sink
The Google Drive Sink plugin writes received data as files to the specified Google Drive folder via the Drive API.
The sink writes one separate file per received record. The only required input field in the schema is the file's body. The name of the schema field that holds the file body is specified with the File body field property. Only the BYTES format is supported for the body input field.
There are similar properties for the name and MIME type of the file. Both require the STRING format. If the name field is not specified, the sink generates random 16-symbol names. If the MIME type field is not set, Google Drive will try to detect the type automatically.
The sink plugin does not support writing partitioned files.
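The field handling described above can be sketched as follows. This is an illustration only: `record_to_file` and `random_file_name` are hypothetical helpers, records are assumed to be plain dictionaries, and the alphabet used for random names (ASCII letters and digits) is an assumption, since the document only states that names are 16 symbols long.

```python
import secrets
import string
from typing import Any, Dict, Optional

# Assumed alphabet for generated names; the source does not specify one.
_NAME_ALPHABET = string.ascii_letters + string.digits


def random_file_name(length: int = 16) -> str:
    """Generate a random file name of the given length."""
    return "".join(secrets.choice(_NAME_ALPHABET) for _ in range(length))


def record_to_file(
    record: Dict[str, Any],
    body_field: str,
    name_field: Optional[str] = None,
    mime_field: Optional[str] = None,
) -> Dict[str, Any]:
    """Map one input record to the name, MIME type and body of the file to write."""
    body = record[body_field]
    if not isinstance(body, (bytes, bytearray)):
        raise TypeError("the file body field must be of BYTES type")
    # Without a name field, fall back to a random 16-symbol name.
    name = record[name_field] if name_field else random_file_name()
    # A MIME type of None stands for "let Google Drive detect it".
    mime_type = record[mime_field] if mime_field else None
    return {"name": name, "mimeType": mime_type, "body": bytes(body)}
```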
Properties
Modification date range
Filters files by last modified date. Available values:
- None - files will not be filtered.
- Last 7 days - from 7 days before the current time until the current date and time, for example: from "2019-09-12T19:52:13.456" to "2019-09-19T19:52:13.456".
- Last 30 days - from 30 days before the current time until the current date and time, for example: from "2019-08-20T19:52:13.456" to "2019-09-19T19:52:13.456".
- Previous quarter - from start of previous quarter until end of previous quarter, for example: from "2019-04-01T00:00:00.000" to "2019-06-30T23:59:59.999".
- Current quarter - from start of current quarter until current date and time, for example: from "2019-07-01T00:00:00.000" to "2019-09-19T19:52:13.456".
- Last year - from start of previous year until end of previous year, for example: from "2018-01-01T00:00:00.000" to "2018-12-31T23:59:59.999".
- Current year - from start of current year until current date and time, for example: from "2019-01-01T00:00:00.000" to "2019-09-19T19:52:13.456".
- Custom - the user enters the start and end dates manually.
This filter is independent of the Filter property, so the user can set a Modification date range that conflicts with the Filter value.
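The predefined ranges listed above can be computed as in the following sketch. It is an illustration, not the plugin's code: `modification_range` and `quarter_start` are hypothetical names, and `now` is assumed to be a naive datetime interpreted as UTC. The "Custom" option is not covered because its dates come from the user.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

_ONE_MS = timedelta(milliseconds=1)


def quarter_start(moment: datetime) -> datetime:
    """Return midnight on the first day of the quarter containing moment."""
    first_month = 3 * ((moment.month - 1) // 3) + 1
    return datetime(moment.year, first_month, 1)


def modification_range(option: str, now: datetime) -> Optional[Tuple[datetime, datetime]]:
    """Return (start, end) for a predefined option, or None when nothing is filtered."""
    if option == "None":
        return None
    if option == "Last 7 days":
        return now - timedelta(days=7), now
    if option == "Last 30 days":
        return now - timedelta(days=30), now
    if option == "Previous quarter":
        end = quarter_start(now) - _ONE_MS  # last millisecond of the previous quarter
        return quarter_start(end), end
    if option == "Current quarter":
        return quarter_start(now), now
    if option == "Last year":
        return datetime(now.year - 1, 1, 1), datetime(now.year, 1, 1) - _ONE_MS
    if option == "Current year":
        return datetime(now.year, 1, 1), now
    raise ValueError(f"unsupported option: {option}")
```

Calling `modification_range("Previous quarter", datetime(2019, 9, 19, 19, 52, 13, 456000))` and formatting both values with `isoformat(timespec="milliseconds")` reproduces the example above: "2019-04-01T00:00:00.000" to "2019-06-30T23:59:59.999".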
File properties
The user can select which file metadata provided by the Google Drive API to retrieve. Not all properties are available for now. Descriptions were obtained from the Files overview.
Property name | Type | Description |
---|---|---|
id | string | The ID of the file. |
name | string | The name of the file. This is not necessarily unique within a folder. |
mimeType | string | The MIME type of the file. Google Drive will attempt to automatically detect an appropriate value from uploaded content if no value is provided. |
description | string | A short description of the file. |
starred | boolean | Whether the user has starred the file. |
trashed | boolean | Whether the file has been trashed, either explicitly or from a trashed parent folder. Only the owner may trash a file, and other users cannot see files in the owner's trash. |
explicitlyTrashed | boolean | Whether the file has been explicitly trashed, as opposed to recursively trashed from a parent folder. |
trashedTime | timestamp milliseconds | The time that the item was trashed (RFC 3339 date-time). Only populated for items in shared drives. |
parents | array of strings | The IDs of the parent folders which contain the file. |
properties | record of key-value strings | A collection of arbitrary key-value pairs which are visible to all apps. |
spaces | string | The list of spaces which contain the file. The currently supported values are 'drive', 'appDataFolder' and 'photos'. |
createdTime | timestamp milliseconds | The time at which the file was created (RFC 3339 date-time). |
modifiedTime | timestamp milliseconds | The last time the file was modified by anyone (RFC 3339 date-time). |
driveId | string | ID of the shared drive the file resides in. Only populated for items in shared drives. |
originalFilename | string | The original filename of the uploaded content if available, or else the original value of the name field. This is only available for files with binary content in Google Drive. |
fullFileExtension | string | The full file extension extracted from the name field. May contain multiple concatenated extensions, such as "tar.gz". This is only available for files with binary content in Google Drive. |
md5Checksum | string | The MD5 checksum for the content of the file. This is only applicable to files with binary content in Google Drive. |
size | long | The size of the file's content in bytes. This is only applicable to files with binary content in Google Drive. |
imageMediaMetadata.width | int | The width of the image in pixels. |
imageMediaMetadata.height | int | The height of the image in pixels. |
imageMediaMetadata.rotation | int | The rotation in clockwise degrees from the image's original orientation. |
imageMediaMetadata.location.latitude | double | The latitude stored in the image. |
imageMediaMetadata.location.longitude | double | The longitude stored in the image. |
imageMediaMetadata.location.altitude | double | The altitude stored in the image. |
imageMediaMetadata.time | string | The date and time the photo was taken (EXIF DateTime). |
imageMediaMetadata.cameraMake | string | The make of the camera used to create the photo. |
imageMediaMetadata.cameraModel | string | The model of the camera used to create the photo. |
imageMediaMetadata.exposureTime | float | The length of the exposure, in seconds. |
imageMediaMetadata.aperture | float | The aperture used to create the photo (f-number). |
imageMediaMetadata.flashUsed | boolean | Whether a flash was used to create the photo. |
imageMediaMetadata.focalLength | float | The focal length used to create the photo, in millimeters. |
imageMediaMetadata.isoSpeed | int | The ISO speed used to create the photo. |
imageMediaMetadata.meteringMode | string | The metering mode used to create the photo. |
imageMediaMetadata.sensor | string | The type of sensor used to create the photo. |
imageMediaMetadata.exposureMode | string | The exposure mode used to create the photo. |
imageMediaMetadata.colorSpace | string | The color space of the photo. |
imageMediaMetadata.whiteBalance | string | The white balance mode used to create the photo. |
imageMediaMetadata.exposureBias | float | The exposure bias of the photo (APEX value). |
imageMediaMetadata.maxApertureValue | float | The smallest f-number of the lens at the focal length used to create the photo (APEX value). |
imageMediaMetadata.subjectDistance | int | The distance to the subject of the photo, in meters. |
imageMediaMetadata.lens | string | The lens used to create the photo. |
videoMediaMetadata.width | int | The width of the video in pixels. |
videoMediaMetadata.height | int | The height of the video in pixels. |
videoMediaMetadata.durationMillis | long | The duration of the video in milliseconds. |
File types to pull
Files in Google Drive fall into two groups by format: Google formats and all others (binary). Files in Google formats cannot be downloaded directly; they must first be exported to a binary format, and only if an export option exists for that format. Binary files can be downloaded directly. The user can specify which types to download or export:
- Binary - downloaded directly (text/plain, image/bmp, video/mp4, etc.)
- Google Documents - exported to the format specified in the Google Documents export format property.
- Google Spreadsheets - exported to the format specified in the Google Spreadsheets export format property.
- Google Drawings - exported to the format specified in the Google Drawings export format property.
- Google Presentations - exported to the format specified in the Google Presentations export format property.
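The download-or-export decision described above might look like the sketch below. The function `plan_fetch` and the type identifiers ("binary", "documents", and so on) are hypothetical; the MIME types and default export formats are the ones listed in this document. Other Google formats, such as folders, are assumed to be skipped.

```python
from typing import Dict, Iterable, Optional, Tuple

GOOGLE_MIME_PREFIX = "application/vnd.google-apps."

# Google format MIME type -> hypothetical identifier used in "File types to pull".
FILE_TYPE_BY_MIME = {
    "application/vnd.google-apps.document": "documents",
    "application/vnd.google-apps.spreadsheet": "spreadsheets",
    "application/vnd.google-apps.drawing": "drawings",
    "application/vnd.google-apps.presentation": "presentations",
}

# Default values of the four export format properties.
DEFAULT_EXPORT_FORMATS = {
    "application/vnd.google-apps.document": "text/plain",
    "application/vnd.google-apps.spreadsheet": "text/csv",
    "application/vnd.google-apps.drawing": "image/svg+xml",
    "application/vnd.google-apps.presentation": "text/plain",
}


def plan_fetch(
    mime_type: str,
    types_to_pull: Iterable[str] = ("binary",),
    export_formats: Optional[Dict[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return ("download", None), ("export", target_mime) or ("skip", None)."""
    wanted = set(types_to_pull)
    formats = {**DEFAULT_EXPORT_FORMATS, **(export_formats or {})}
    if mime_type in FILE_TYPE_BY_MIME:
        if FILE_TYPE_BY_MIME[mime_type] in wanted:
            return "export", formats[mime_type]
        return "skip", None
    if mime_type.startswith(GOOGLE_MIME_PREFIX):
        # Assumption: Google formats without a matching type are not pulled.
        return "skip", None
    return ("download", None) if "binary" in wanted else ("skip", None)
```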
Exporting
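This section presents the export map for Google Drive formats. All available export options:
Original format name | Original format mime | Export format name | Export format mime |
---|---|---|---|
Google Documents | application/vnd.google-apps.document | HTML | text/html |
Google Documents | application/vnd.google-apps.document | HTML (zipped) | application/zip |
Google Documents | application/vnd.google-apps.document | Plain text | text/plain |
Google Documents | application/vnd.google-apps.document | Rich text | application/rtf |
Google Documents | application/vnd.google-apps.document | Open Office doc | application/vnd.oasis.opendocument.text |
Google Documents | application/vnd.google-apps.document | PDF | application/pdf |
Google Documents | application/vnd.google-apps.document | MS Word document | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
Google Documents | application/vnd.google-apps.document | EPUB | application/epub+zip |
Google Spreadsheets | application/vnd.google-apps.spreadsheet | MS Excel | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
Google Spreadsheets | application/vnd.google-apps.spreadsheet | Open Office sheet | application/x-vnd.oasis.opendocument.spreadsheet |
Google Spreadsheets | application/vnd.google-apps.spreadsheet | PDF | application/pdf |
Google Spreadsheets | application/vnd.google-apps.spreadsheet | CSV (first sheet only) | text/csv |
Google Spreadsheets | application/vnd.google-apps.spreadsheet | TSV (first sheet only) | text/tab-separated-values |
Google Spreadsheets | application/vnd.google-apps.spreadsheet | HTML (zipped) | application/zip |
Google Drawings | application/vnd.google-apps.drawing | JPEG | image/jpeg |
Google Drawings | application/vnd.google-apps.drawing | PNG | image/png |
Google Drawings | application/vnd.google-apps.drawing | SVG | image/svg+xml |
Google Drawings | application/vnd.google-apps.drawing | PDF | application/pdf |
Google Presentations | application/vnd.google-apps.presentation | MS PowerPoint | application/vnd.openxmlformats-officedocument.presentationml.presentation |
Google Presentations | application/vnd.google-apps.presentation | Open Office presentation | application/vnd.oasis.opendocument.presentation |
Google Presentations | application/vnd.google-apps.presentation | PDF | application/pdf |
Google Presentations | application/vnd.google-apps.presentation | Plain text | text/plain |
Google Apps Scripts | application/vnd.google-apps.script | JSON | application/vnd.google-apps.script+json |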
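A configured export format could be checked against this map before a pipeline runs. The sketch below is one possible way to do that, not the plugin's validation logic; `ALLOWED_EXPORTS` and `check_export_format` are hypothetical names, and the MIME types are copied from the table above.

```python
_PDF = "application/pdf"
_OOXML = "application/vnd.openxmlformats-officedocument."

# Original format MIME type -> export MIME types allowed by the table above.
ALLOWED_EXPORTS = {
    "application/vnd.google-apps.document": {
        "text/html", "application/zip", "text/plain", "application/rtf",
        "application/vnd.oasis.opendocument.text", _PDF,
        _OOXML + "wordprocessingml.document", "application/epub+zip",
    },
    "application/vnd.google-apps.spreadsheet": {
        _OOXML + "spreadsheetml.sheet",
        "application/x-vnd.oasis.opendocument.spreadsheet", _PDF,
        "text/csv", "text/tab-separated-values", "application/zip",
    },
    "application/vnd.google-apps.drawing": {
        "image/jpeg", "image/png", "image/svg+xml", _PDF,
    },
    "application/vnd.google-apps.presentation": {
        _OOXML + "presentationml.presentation",
        "application/vnd.oasis.opendocument.presentation", _PDF, "text/plain",
    },
    "application/vnd.google-apps.script": {
        "application/vnd.google-apps.script+json",
    },
}


def check_export_format(original_mime: str, export_mime: str) -> str:
    """Return export_mime if the table allows it for original_mime, else raise ValueError."""
    allowed = ALLOWED_EXPORTS.get(original_mime)
    if allowed is None:
        raise ValueError(f"no export options listed for {original_mime}")
    if export_mime not in allowed:
        raise ValueError(f"{export_mime} is not an export format for {original_mime}")
    return export_mime
```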
Security
Limitation(s)
Future Work
- Some future work – HYDRATOR-99999
- Another future work – HYDRATOR-99999
Test Case(s)
- Test case #1
- Test case #2
Sample Pipeline
Please attach one or more sample pipeline(s) and associated data.
Pipeline #1
Pipeline #2
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature