Google Cloud Datastore Source and Sink
- Arina Ielchiieva
Introduction
Google Cloud Datastore is a NoSQL document database built for automatic scaling, high performance, and ease of application development offered by Google on the Google Cloud Platform. Cloud Datastore is built upon Google's Bigtable and Megastore technology.
Use case(s)
- Users would like to build a batch data pipeline to read a complete table from a Google Cloud Datastore instance.
- Users would like to build a batch data pipeline to perform inserts / upserts into Google Cloud Datastore tables.
- Users should get relevant information from the tooltip while configuring the Google Cloud Datastore source and Google Cloud Datastore sink.
- The tooltip should describe accurately what each field is used for.
- Users should get field-level lineage for the source and sink that is being used.
- Reference documentation should be available from the source and sink plugins.
User Story(s)
- Source code in data integrations org
- Integration test code
- Relevant documentation in the source repo and reference documentation section in plugin
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Design
Properties
Source
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Project ID | String | Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. | Required. |
JSON key file path | String | The credential JSON key file path. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster. https://cloud.google.com/storage/docs/authentication#generating-a-private-key | Required. |
Namespace | String | A namespace partitions entities into a subset of Datastore. https://cloud.google.com/datastore/docs/concepts/multitenancy | Optional. If not provided, [default] namespace will be used. |
Kind | String | The kind of an entity categorizes it for the purpose of Datastore queries. Equivalent to relational database table notion. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. |
Has Ancestor | String | Ancestor identifies the common root entity in which the entities are grouped. Must be written in Key Literal format: key(kind_1, identifier_1, kind_2, identifier_2, [...]). Example: key(kind_1, 'stringId', kind_2, 100) https://cloud.google.com/datastore/docs/concepts/structuring_for_strong_consistency | Optional. |
Filters | String | List of filter property name and value pairs to which an equality filter will be applied. | Optional. |
Number of Splits | Integer | Desired number of splits into which the query will be sharded during execution. | Required. Min value: 1. Max value: 2147483647. |
Key Type | String | Key is a unique identifier assigned to the entity when it is created. None - key will not be included. Key literal - key will be included in Datastore key literal format, including the complete path with ancestors. URL-safe key - key will be included in an encoded form that can be used as part of a URL. Note: if Key literal or URL-safe key is selected, the default key name (__key__) or its alias must be present in the schema with non-nullable STRING type. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. None by default. |
Key Alias | String | Allows setting a user-friendly name for the key column, whose default name is __key__. Only applicable if Key Type is set to Key literal or URL-safe key. If Key Type is set to None, the property must be empty. | Optional. |
Schema | JSON schema | Schema of the data to read, can be imported or fetched by clicking the Get Schema button. | Required. |
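As an illustration of the Key Literal format used by the Has Ancestor property above, the following sketch parses a literal such as key(kind_1, 'stringId', kind_2, 100) into (kind, identifier) pairs. This is an illustrative Python sketch only; the function name and representation are hypothetical and not part of the plugin:

```python
import re

def parse_key_literal(literal):
    """Parse a Datastore key literal like key(kind_1, 'stringId', kind_2, 100)
    into a list of (kind, identifier) pairs. Identifiers wrapped in single
    quotes are treated as names (str); bare numbers as numeric ids (int)."""
    match = re.fullmatch(r"\s*key\((.*)\)\s*", literal, flags=re.IGNORECASE)
    if not match:
        raise ValueError("Key literal must have the form key(kind, identifier, ...)")
    parts = [p.strip() for p in match.group(1).split(",")]
    if len(parts) == 0 or len(parts) % 2 != 0:
        raise ValueError("Key literal must contain kind/identifier pairs")
    path = []
    for kind, ident in zip(parts[::2], parts[1::2]):
        if ident.startswith("'") and ident.endswith("'"):
            path.append((kind, ident[1:-1]))   # quoted string identifier (name)
        else:
            path.append((kind, int(ident)))    # bare numeric identifier (id)
    return path
```

A validator along these lines could back the Has Ancestor and Key literal configuration checks at pipeline deployment time.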
Sink
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Project ID | String | Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. | Required. |
JSON key file path | String | The credential JSON key file path. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster. https://cloud.google.com/storage/docs/authentication#generating-a-private-key | Required. |
Namespace | String | A namespace partitions entities into a subset of Datastore. https://cloud.google.com/datastore/docs/concepts/multitenancy | Optional. If not provided, [default] namespace will be used. |
Kind | String | The kind of an entity categorizes it for the purpose of Datastore queries. Equivalent to relational database table notion. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. |
Key Type | String | Key is a unique identifier assigned to the entity when it is created. The property defines what type of key will be added to the entity; this is commonly needed to perform upserts to Cloud Datastore. Can be one of four options: Auto-generated key - key will be generated by Datastore as a numeric ID. Custom name - key will be provided by the user. Supported types: non-nullable STRING, INT or LONG. Key literal - key will be provided in Datastore key literal format, including the complete path with ancestors. Supported type: non-nullable STRING in key literal format: key(<kind>, <identifier>, <kind>, <identifier>, [...]). Example: key(kind_name, 'stringId'). URL-safe key - key will be provided in an encoded form that can be used as part of a URL. Supported type: non-nullable STRING in URL-safe key format. Note: if Custom name, Key literal or URL-safe key is selected, the default key name (__key__) or its alias must be present in the schema. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. Auto-generated key by default. |
Key Alias | String | Allows setting a user-friendly name for the key column, whose default name is __key__. | Optional. |
Ancestor | String | Ancestor identifies the common root entity in which the entities are grouped. Must be written in Key Literal format: key(kind_1, identifier_1, kind_2, identifier_2, [...]). Example: key(kind_1, 'stringId', kind_2, 100) https://cloud.google.com/datastore/docs/concepts/structuring_for_strong_consistency | Optional. |
Index Strategy | String | Index strategy defines which fields defined in the schema will be indexed in Cloud Datastore. Can be one of three options: All - all fields will be indexed. None - none of the fields will be indexed. Custom - indexed fields will be provided in Indexed Properties. | Required. All by default. |
Indexed Properties | String | Comma-separated list of property names to be marked as indexed. | Optional. Must be provided if Index Strategy is set to Custom, otherwise must be empty. |
Batch size | Integer | Maximum number of entities that can be passed in one batch to a Commit operation. | Required. Default value 25. Min value 1. Max value 500. |
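Since a single Datastore Commit is limited to 500 mutations, entities written by the sink have to be chunked by the Batch size property. A minimal sketch of that chunking (Python, illustrative only; the function name is hypothetical):

```python
def batch_entities(entities, batch_size=25):
    """Split a sequence of entities into batches no larger than batch_size,
    so that each batch can be passed to a single Datastore Commit operation.
    Datastore allows at most 500 mutations per Commit, hence the upper bound."""
    if not 1 <= batch_size <= 500:
        raise ValueError("Batch size must be between 1 and 500")
    return [entities[i:i + batch_size] for i in range(0, len(entities), batch_size)]
```

With the default Batch size of 25, a run of 60 records would produce commits of 25, 25, and 10 mutations.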
Implementation Tips
Implementation will be done using the Google Cloud Datastore Data API, since it provides mechanisms to split a query into shards during data reads.
https://cloud.google.com/datastore/docs/reference/data/rpc/
https://github.com/GoogleCloudPlatform/google-cloud-datastore
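The Data API's client library provides a query-splitting helper that divides a query into contiguous shards for parallel reads. The sketch below is not the actual splitting algorithm, only a Python illustration of the general idea: partitioning an ordered range into near-equal shards, one per desired split:

```python
def split_into_shards(num_items, num_splits):
    """Divide num_items consecutive items into at most num_splits contiguous
    shards of near-equal size, mimicking how a query can be sharded for
    parallel reads. Returns (start, end) index pairs, end exclusive."""
    num_splits = min(num_splits, num_items) if num_items else 0
    shards = []
    base, extra = divmod(num_items, num_splits) if num_splits else (0, 0)
    start = 0
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        shards.append((start, start + size))
        start += size
    return shards
```

Note that fewer shards than requested may be produced when the data does not support the desired number of splits, which is why the Number of Splits property is described as a desired value.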
Examples
Initial Dataset
Kind: TaskType
Key | Label |
---|---|
name='DEV' | Development |
Kind: Task
Key | Parent | Priority |
---|---|---|
id=1 | Key(TaskType, 'DEV') | 1 |
id=2 | | 1 |
Source examples
Read data filtered by ancestor and property, including Key
Kind | Task | |
---|---|---|
Has Ancestor | Key(TaskType, 'DEV') | |
Filters | Priority | 1 |
Key Type | Key literal | |
Key Alias | TaskKey |
Output Schema
TaskKey | String |
---|---|
Priority | Long |
Output Dataset
TaskKey | Priority |
---|---|
key(Task, 1) | 1 |
Read data filtered by property, without including Key
Kind | Task | |
---|---|---|
Has Ancestor | ||
Filters | Priority | 1 |
Key Type | None | |
Key Alias |
Output Schema
Priority | Long |
---|
Output Dataset
Priority |
---|
1 |
1 |
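The equality semantics of the Filters property used in the examples above can be sketched as follows (Python, illustrative only; entities are modeled as plain dicts, which is not the plugin's actual representation):

```python
def apply_filters(entities, filters):
    """Apply equality filters (property name -> required value) to a list of
    entities represented as dicts: an entity is kept only if every filtered
    property is present and equal to the required value."""
    return [entity for entity in entities
            if all(entity.get(prop) == value for prop, value in filters.items())]
```

For the initial dataset, filtering on Priority = 1 keeps the two Task entities with that priority and drops any others, matching the Output Dataset shown above.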
Sink examples
Insert new row with Ancestor and Custom name
Input Dataset
TaskId | Priority |
---|---|
3 | 2 |
Sink properties
Kind | Task |
---|---|
Ancestor | Key(TaskType, 'DEV') |
Key Type | Custom name |
Key Alias | TaskId |
Input Schema
TaskId | Long |
---|---|
Priority | Long |
Resulting Dataset (new row was inserted)
Key | Parent | Priority |
---|---|---|
id=1 | Key(TaskType, 'DEV') | 1 |
id=2 | | 1 |
id=3 | Key(TaskType, 'DEV') | 2 |
Insert new row without Ancestor and with Auto-generated key
Input Dataset
Priority |
---|
2 |
Sink properties
Kind | Task |
---|---|
Ancestor | |
Key Type | Auto-generated key |
Key Alias |
Resulting Dataset (new row was inserted)
Key | Parent | Priority |
---|---|---|
id=1 | Key(TaskType, 'DEV') | 1 |
id=2 | | 1 |
id=11010104985 | | 2 |
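The upsert behaviour implied by the sink examples can be sketched as follows (Python, illustrative only; the in-memory store keyed by the key column stands in for a Datastore kind):

```python
def upsert(store, entities, key_field):
    """Upsert entities (dicts) into a store keyed by key_field: an entity
    whose key already exists replaces the stored one, otherwise it is
    inserted. Sketches the upsert semantics of the sink's Commit calls."""
    for entity in entities:
        store[entity[key_field]] = entity
    return store
```

Writing an entity with an existing key (e.g. TaskId 1) overwrites it, while a new key (e.g. TaskId 3) adds a row, which is why a key type other than Auto-generated key is needed to perform upserts.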
Security
Limitation(s)
Future Work
- Some future work – HYDRATOR-99999
- Another future work – HYDRATOR-99999
Test Case(s)
- Test case #1
- Test case #2
Sample Pipeline
Please attach one or more sample pipeline(s) and associated data.
Pipeline #1
Pipeline #2
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature