Introduction
Google Cloud Datastore is a NoSQL document database built for automatic scaling, high performance, and ease of application development offered by Google on the Google Cloud Platform. Cloud Datastore is built upon Google's Bigtable and Megastore technology.
Use case(s)
- Users would like to build a batch data pipeline to read a complete table from a Google Cloud Datastore instance.
- Users would like to build a batch data pipeline to perform inserts/upserts into Google Cloud Datastore tables.
- Users should get relevant information from the tooltips while configuring the Google Cloud Datastore source and sink.
- The tooltips should accurately describe what each field is used for.
- Users should get field-level lineage for the source and sink being used.
- Reference documentation should be available from the source and sink plugins.
User Story(s)
- Source code in data integrations org
- Integration test code
- Relevant documentation in the source repo and reference documentation section in plugin
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Design
Properties
Source
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Project ID | String | Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. | Required. |
JSON key file path | String | The credential JSON key file path. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster. https://cloud.google.com/storage/docs/authentication#generating-a-private-key | Required. |
Namespace | String | A namespace partitions entities into a subset of Cloud Datastore. https://cloud.google.com/datastore/docs/concepts/multitenancy | Optional. If not provided, the [default] namespace will be used. |
Kind | String | The kind of an entity categorizes it for the purpose of Datastore queries; it is the equivalent of a relational database table. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Optional. Should be empty if GQL is indicated. |
Ancestors | String | List of ancestor paths that identifies the common root entity in which the entities are grouped. Each ancestor must have a kind and a key. A key can be named (name=string_value) or numeric (id=long_value). Example: name=KEY_NAME, id=100 https://cloud.google.com/datastore/docs/concepts/structuring_for_strong_consistency | Optional. |
Filter property types | String | List of property types by which an equality filter will be applied. Must include the property name and its type. Note: property names and their count must match the properties provided in `Filter property values`. | Optional. |
Filter property values | String | List of property values by which an equality filter will be applied. Must include the property name and its value. Note: property names and their count must match the properties provided in `Filter property types`. | Optional. |
Number of splits | Integer | Desired number of splits to divide a query into multiple shards during execution. Will create up to the desired number of splits, but may return fewer splits if the desired number is unavailable. Only applicable for `Query by Kind`, which allows only ancestor queries and equality filters by properties. | Required. Min value: 1. Max value: 2147483647. |
GQL | String | SQL-like language that allows querying data by a specific kind or keys, with the option to apply various filter conditions. Should be empty if `Kind` is indicated. Note: GQL queries cannot be split into multiple shards. Example: SELECT * FROM myKind WHERE myProp >= 100 AND myProp < 200 https://cloud.google.com/datastore/docs/concepts/queries | Optional. |
Include Key | Boolean | The key is the unique identifier assigned to the entity when it is created. If this property is set to true, the __key__ column (or its alias indicated in the Key Alias property) must be present in the schema definition. The key type is String. For named keys, the key value is returned in name=string_value format; for numeric keys, in id=long_value format. Needed when performing upserts to Cloud Datastore. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. False by default. |
Key Alias | String | Allows setting a user-friendly name for the key column, whose default name is __key__. Only applicable if Include Key is enabled. If not set and Include Key is enabled, the default name will be used. | Optional. |
Schema | JSON schema | The schema of records output by the source. Will be mapped to the data returned from the query. Should contain column name, type, and nullability. Can be imported or obtained using the Get Schema button. | Required. |
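The named and numeric key string formats described above (name=string_value and id=long_value) can be sketched with a small helper. This is an illustrative Python sketch only; `format_key` and `format_ancestor_path` are hypothetical names, not part of the plugin:

```python
def format_key(name_or_id):
    """Render a Datastore key element in the plugin's string format:
    numeric IDs become 'id=<long>', named keys become 'name=<string>'."""
    if isinstance(name_or_id, int):
        return f"id={name_or_id}"
    return f"name={name_or_id}"

def format_ancestor_path(path):
    """Render an ancestor path given as (kind, key) pairs, e.g.
    [('TaskType', 'DEV')] -> 'Key(TaskType, name=DEV)'."""
    parts = ", ".join(f"{kind}, {format_key(key)}" for kind, key in path)
    return f"Key({parts})"

print(format_key("DEV"))   # name=DEV
print(format_key(100))     # id=100
```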
Sink
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Project ID | String | Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. | Required. |
JSON key file path | String | The credential JSON key file path. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster. https://cloud.google.com/storage/docs/authentication#generating-a-private-key | Required. |
Namespace | String | A namespace partitions entities into a subset of Cloud Datastore. https://cloud.google.com/datastore/docs/concepts/multitenancy | Optional. If not provided, the [default] namespace will be used. |
Kind | String | The kind of an entity categorizes it for the purpose of Datastore queries; it is the equivalent of a relational database table. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. |
Indexed Properties | String | List of property names to be marked as indexed; the equivalent of relational database columns. | Optional. If not indicated, all properties are considered indexed by default. |
Allow Auto-generated Key | Boolean | The key is the unique identifier assigned to the entity when it is created. The user can specify a custom key for the entity, or an already existing key to perform upserts. If this property is set to false, the __key__ column (or its alias indicated in the Key Alias property) must be present in the schema definition; the key type is String. Otherwise, Cloud Datastore will automatically assign a numeric ID to the entity. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. True by default. |
Key Alias | String | Indicates the key name alias if it is different from the default one (__key__). | Optional. |
Ancestors | String | List of ancestor paths that identifies the common root entity in which the entities are grouped. Each ancestor must have a kind and a key. A key can be named (name=string_value) or numeric (id=long_value). Example: name=KEY_NAME, id=100 https://cloud.google.com/datastore/docs/concepts/structuring_for_strong_consistency | Optional. |
Batch size | Integer | Maximum number of entities that can be passed in one batch to a Commit operation. | Required. Default value 25. Min value 1. Max value 500. |
Transactional | Boolean | Datastore commits are either transactional, meaning they take place in the context of a transaction and the transaction's set of mutations is applied all-or-nothing, or non-transactional, meaning the set of mutations may not apply as all-or-nothing. https://cloud.google.com/datastore/docs/concepts/transactions | Required. False by default. |
Schema | JSON schema | The schema of records to be written. Should contain column name, type, and nullability. Can be imported. | Required. |
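Because a single Commit operation accepts a bounded number of mutations (batch size defaults to 25, with a maximum of 500 per the table above), the sink will need to chunk incoming records before committing. A minimal sketch of that chunking, assuming the stated bounds; `chunk_mutations` is an illustrative helper, not the plugin's actual code:

```python
def chunk_mutations(mutations, batch_size=25):
    """Split pending mutations into batches no larger than batch_size,
    each suitable for a single Datastore Commit call."""
    if not 1 <= batch_size <= 500:
        raise ValueError("batch size must be between 1 and 500")
    return [mutations[i:i + batch_size]
            for i in range(0, len(mutations), batch_size)]

# 60 records with the default batch size of 25 -> batches of 25, 25, 10.
batches = chunk_mutations(list(range(60)))
print([len(b) for b in batches])  # [25, 25, 10]
```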
Implementation Tips
Implementation will be done using the Google Cloud Datastore Data API, since it provides mechanisms for query splitting during data reads.
https://github.com/GoogleCloudPlatform/google-cloud-datastore
Examples
Initial Dataset
Kind: TaskType
Key | Label |
---|---|
name='DEV' | Development |
Kind: Task
Key | Parent | Priority |
---|---|---|
id=1 | Key(TaskType, 'DEV') | 1 |
id=2 | | 1 |
Source examples
Read data filtered by ancestor and property, including Key
Kind | Task | |
---|---|---|
Ancestors | Key(TaskType, 'DEV') | |
Filter Property Value | Priority | 1 |
Include Key | true | |
Key Alias | TaskKey |
Output Schema
TaskKey | String |
---|---|
Priority | Integer |
Output Dataset
TaskKey | Priority |
---|---|
id=1 | 1 |
Read data filtered by property, without including Key
Kind | Task | |
---|---|---|
Ancestors | ||
Filter Property Value | Priority | 1 |
Include Key | false | |
Key Alias | |
Output Schema
Priority | Integer |
---|
Output Dataset
Priority |
---|
1 |
1 |
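The filtering behavior in the two source examples above can be sketched in memory (illustration only; the real plugin pushes the equality filter and ancestor restriction down into the Datastore query):

```python
# Rows from the initial 'Task' dataset: (key, parent, priority).
rows = [
    ("id=1", "Key(TaskType, 'DEV')", 1),
    ("id=2", None, 1),
]

def filter_by_property(rows, priority, ancestor=None):
    """Apply the Priority equality filter, optionally restricted to an
    ancestor, mirroring the two source examples above."""
    return [r for r in rows
            if r[2] == priority and (ancestor is None or r[1] == ancestor)]

# Example 1: ancestor + Priority=1 -> only id=1.
print(filter_by_property(rows, 1, ancestor="Key(TaskType, 'DEV')"))
# Example 2: Priority=1 with no ancestor -> both rows.
print(filter_by_property(rows, 1))
```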
Sink examples
Insert new row with ancestor and custom Key
Input Dataset
TaskKey | Priority |
---|---|
id=3 | 2 |
Sink properties
Kind | Task |
---|---|
Ancestors | Key(TaskType, 'DEV') |
Allow Auto-generated Key | false |
Key Alias | TaskKey |
Input Schema
TaskKey | String |
---|---|
Priority | Integer |
Resulting Dataset (new row was inserted)
Key | Parent | Priority |
---|---|---|
id=1 | Key(TaskType, 'DEV') | 1 |
id=2 | | 1 |
id=3 | Key(TaskType, 'DEV') | 2 |
Insert new row without ancestor and with auto-generated Key
Input Dataset
Priority |
---|
2 |
Sink properties
Kind | Task |
---|---|
Ancestors | |
Allow Auto-generated Key | true |
Key Alias |
Resulting Dataset (new row was inserted)
Key | Parent | Priority |
---|---|---|
id=1 | Key(TaskType, 'DEV') | 1 |
id=2 | | 1 |
id=11010104985 | | 2 |
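The upsert semantics in the sink examples (a write with an existing key replaces that entity, a new key inserts one, and a missing key triggers a server-assigned numeric ID) can be sketched as follows. This is a simplified in-memory model; `upsert` and the counter-based ID allocator are illustrative stand-ins for Datastore's behavior:

```python
import itertools

_auto_id = itertools.count(1000)  # stand-in for Datastore's server-side ID allocator

def upsert(store, key, entity):
    """Insert or replace an entity in a dict-backed store; when key is
    None, emulate an auto-generated numeric key."""
    if key is None:
        key = f"id={next(_auto_id)}"
    store[key] = entity
    return key

store = {"id=1": {"Priority": 1}, "id=2": {"Priority": 1}}
upsert(store, "id=3", {"Priority": 2})  # explicit key -> new row inserted
upsert(store, None, {"Priority": 2})    # auto-generated numeric key
```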
Security
Limitation(s)
Future Work
- Some future work – HYDRATOR-99999
- Another future work – HYDRATOR-99999
Test Case(s)
- Test case #1
- Test case #2
Sample Pipeline
Please attach one or more sample pipeline(s) and associated data.
Pipeline #1
Pipeline #2
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature