Comment: Updated design document according to implementation.

Introduction

Google Cloud Datastore is a NoSQL document database offered by Google on the Google Cloud Platform, built for automatic scaling, high performance, and ease of application development. Cloud Datastore is built upon Google's Bigtable and Megastore technology.

Use case(s)

  • Users would like to build a batch data pipeline that reads a complete table from a Google Cloud Datastore instance.
  • Users would like to build a batch data pipeline that performs inserts / upserts into Google Cloud Datastore tables.
  • Users should get relevant information from the tooltip while configuring the Google Cloud Datastore source and Google Cloud Datastore sink.
    • The tooltip should accurately describe what each field is used for.
  • Users should get field-level lineage for the source and sink being used.
  • Reference documentation should be available from the source and sink plugins.

User Stories

  • Source code in data integrations org
  • Integration test code 
  • Relevant documentation in the source repo and reference documentation section in plugin

Plugin Type

  •  Batch Source
  •  Batch Sink 
  •  Real-time Source
  •  Real-time Sink
  •  Action
  •  Post-Run Action
  •  Aggregate
  •  Join
  •  Spark Model
  •  Spark Compute

Design

Properties

Source

Each property is listed with its user-facing name, type, description, and constraints.

Project ID
  Type: String
  Description: Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console.
  Constraints: Required.

JSON key file path
  Type: String
  Description: Path on the local file system to the service account JSON key file used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster.
  Reference: https://cloud.google.com/storage/docs/authentication#generating-a-private-key
  Constraints: Required.

Namespace
  Type: String
  Description: A namespace partitions entities into a subset of Datastore.
  Reference: https://cloud.google.com/datastore/docs/concepts/multitenancy
  Constraints: Optional. If not provided, the [default] namespace will be used.

Kind
  Type: String
  Description: The kind of an entity categorizes it for the purpose of Datastore queries. Equivalent to the relational database notion of a table.
  Reference: https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers
  Constraints: Required.

Has Ancestor
  Type: String
  Description: The ancestor identifies the common root entity under which the entities are grouped. Must be written in key literal format: key(kind_1, identifier_1, kind_2, identifier_2, [...]). Example: key(kind_1, 'stringId', kind_2, 100)
  Reference: https://cloud.google.com/datastore/docs/concepts/structuring_for_strong_consistency
  Constraints: Optional.

Filters
  Type: String
  Description: List of property name and value pairs to which an equality filter will be applied. This is a semicolon-separated list of key-value pairs, where the key and value of each pair are separated by a pipe sign `|`. Filter properties must be present in the schema. Allowed property types are STRING, LONG, DOUBLE, BOOLEAN and TIMESTAMP. A property value given as the string `null` is treated as an `is null` clause. A TIMESTAMP string must be in RFC 3339 format without a timezone offset (it always ends in Z). Expected pattern: yyyy-MM-dd'T'HH:mm:ssX, example: 2011-10-02T13:12:55Z
  Constraints: Optional.

Number of Splits
  Type: Integer
  Description: Desired number of splits used to divide a query into multiple shards during execution. Up to the desired number of splits will be created; fewer splits may be created if the desired number is unavailable.
  Constraints: Required. Min value: 1. Max value: 2147483647.

Key Type
  Type: String
  Description: The key is the unique identifier assigned to the entity when it is created. This property defines whether the key will be included in the output, which is commonly needed to perform upserts to Cloud Datastore. Can be one of three options:
    • None: the key will not be included.
    • Key literal: the key will be included in Datastore key literal format, including the complete path with ancestors.
    • URL-safe key: the key will be included in an encoded form that can be used as part of a URL.
  Note: if Key literal or URL-safe key is selected, the default key name (__key__) or its alias must be present in the schema with non-nullable STRING type.
  Reference: https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers
  Constraints: Required. None by default.

Key Alias
  Type: String
  Description: Sets a user-friendly name for the key column, whose default name is __key__. Only applicable if Key Type is set to Key literal or URL-safe key. If Key Type is set to None, this property must be empty.
  Constraints: Optional.

Schema
  Type: JSON schema
  Description: Schema of the data to read. It can be imported or fetched by clicking the Get Schema button.
  Constraints: Required.
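
The Filters format described above can be illustrated with a small parsing sketch. This is only an illustration under stated assumptions, not the plugin's code: the function names `parse_filters` and `parse_timestamp` are hypothetical, values are kept as strings (conversion to the schema type is left out), and property names and values are assumed not to contain `;` or `|`.

```python
from datetime import datetime, timezone


def parse_filters(raw):
    """Parse 'name|value;name|value' into a dict.

    The literal string 'null' is mapped to None, standing in for an
    `is null` clause. Hypothetical helper, not part of the plugin.
    """
    filters = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing semicolon
        name, sep, value = pair.partition("|")
        name = name.strip()
        if not sep or not name:
            raise ValueError(f"Malformed filter pair: {pair!r}")
        value = value.strip()
        filters[name] = None if value == "null" else value
    return filters


def parse_timestamp(value):
    """Parse a TIMESTAMP filter value such as 2011-10-02T13:12:55Z as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)
```

For example, `parse_filters("Priority|1;Done|null")` yields a dict with `Priority` mapped to the string `"1"` and `Done` mapped to `None`.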

Sink

Each property is listed with its user-facing name, type, description, and constraints.

Project ID
  Type: String
  Description: Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console.
  Constraints: Required.

JSON key file path
  Type: String
  Description: Path on the local file system to the service account JSON key file used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster.
  Reference: https://cloud.google.com/storage/docs/authentication#generating-a-private-key
  Constraints: Required.

Namespace
  Type: String
  Description: A namespace partitions entities into a subset of Datastore.
  Reference: https://cloud.google.com/datastore/docs/concepts/multitenancy
  Constraints: Optional. If not provided, the [default] namespace will be used.

Kind
  Type: String
  Description: The kind of an entity categorizes it for the purpose of Datastore queries. Equivalent to the relational database notion of a table.
  Reference: https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers
  Constraints: Required.

Key Type
  Type: String
  Description: The key is the unique identifier assigned to the entity when it is created. This property defines what type of key will be added to the entity, which is commonly needed to perform upserts to Cloud Datastore. Can be one of four options:
    • Auto-generated key: the key will be generated by Datastore as a numeric ID.
    • Custom name: the key will be provided by the user. Supported types: non-nullable STRING, INT or LONG.
    • Key literal: the key will be provided in Datastore key literal format, including the complete path with ancestors. Supported type: non-nullable STRING in key literal format: key(<kind>, <identifier>, <kind>, <identifier>, [...]). Example: key(kind_name, 'stringId')
    • URL-safe key: the key will be provided in an encoded form that can be used as part of a URL. Supported type: non-nullable STRING in URL-safe key format.
  Note: if Custom name, Key literal or URL-safe key is selected, the default key name (__key__) or its alias must be present in the schema.
  Reference: https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers
  Constraints: Required. Auto-generated key by default.

Key Alias
  Type: String
  Description: Sets a user-friendly name for the key column, whose default name is __key__. Only applicable if Key Type is set to Custom name, Key literal or URL-safe key. If Key Type is set to Auto-generated key, this property must be empty.
  Constraints: Optional.

Ancestor
  Type: String
  Description: The ancestor identifies the common root entity under which the entities are grouped. Must be written in key literal format: key(kind_1, identifier_1, kind_2, identifier_2, [...]). Example: key(kind_1, 'stringId', kind_2, 100)
  Reference: https://cloud.google.com/datastore/docs/concepts/structuring_for_strong_consistency
  Constraints: Optional.

Index Strategy
  Type: String
  Description: Defines which fields in the schema will be indexed in Cloud Datastore. Can be one of three options:
    • All: all fields will be indexed.
    • None: no fields will be indexed.
    • Custom: the indexed fields will be provided in Indexed Properties.
  Constraints: Required. All by default.

Indexed Properties
  Type: String
  Description: Comma-separated list of property names to be marked as indexed.
  Reference: https://cloud.google.com/datastore/docs/concepts/indexes
  Constraints: Optional. Must be provided if Index Strategy is set to Custom, otherwise must be empty.

Batch size
  Type: Integer
  Description: Maximum number of entities that can be passed in one batch to a Commit operation.
  Reference: https://cloud.google.com/datastore/docs/concepts/limits
  Constraints: Required. Default value 25. Min value 1. Max value 500.
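
The Batch size constraint can be sketched as a generator that groups entities into batches before each Commit operation. This is a minimal illustration, not the plugin's implementation: the function name `batches` is hypothetical, entities are represented by arbitrary Python objects, and the actual Commit call is omitted.

```python
def batches(entities, batch_size=25):
    """Yield lists of at most batch_size entities, preserving input order.

    The bounds 1..500 and the default of 25 follow the Batch size
    constraints listed above. Hypothetical helper for illustration only.
    """
    if not 1 <= batch_size <= 500:
        raise ValueError("batch_size must be between 1 and 500")
    batch = []
    for entity in entities:
        batch.append(entity)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

With the default batch size, 60 entities would be grouped into batches of 25, 25 and 10, each of which would be passed to one Commit operation.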

Implementation Tips

Implementation will be done using Google Cloud Datastore Data API since it provides mechanisms to allow query splitting during data reads.

https://cloud.google.com/datastore/docs/reference/data/rpc/

https://github.com/GoogleCloudPlatform/google-cloud-datastore
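
The "up to the desired number of splits" behaviour of the Number of Splits property can be illustrated with a simplified sketch. It does not reproduce how the Datastore API actually splits queries. It only shows the contract that fewer splits may be produced than requested. The function name `split_ranges` is hypothetical, and the input is assumed to be an already sorted in-memory list of keys.

```python
def split_ranges(sorted_keys, desired_splits):
    """Divide sorted keys into at most desired_splits contiguous chunks.

    Fewer chunks are returned when there are fewer keys than requested
    splits. Simplified illustration, not the Datastore splitting algorithm.
    """
    if desired_splits < 1:
        raise ValueError("desired_splits must be at least 1")
    if not sorted_keys:
        return []
    count = min(desired_splits, len(sorted_keys))
    base, extra = divmod(len(sorted_keys), count)
    chunks = []
    start = 0
    for index in range(count):
        # The first `extra` chunks take one additional key each.
        size = base + (1 if index < extra else 0)
        chunks.append(sorted_keys[start:start + size])
        start += size
    return chunks
```

Requesting 5 splits over only 2 keys returns 2 chunks, mirroring the case where the desired number of splits is unavailable.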

Examples

Initial Dataset

Kind: TaskType

Key | Label
name='DEV' | Development

Kind: Task

Key | Parent | Priority
id=1 | Key(TaskType, 'DEV') | 1
id=2 | (none) | 1

Source examples

Read data filtered by ancestor and property, including Key

Kind: Task
Has Ancestor: Key(TaskType, 'DEV')
Filters: Priority|1
Key Type: Key literal
Key Alias: TaskKey

Output Schema

TaskKey: String
Priority: Long

Output Dataset

TaskKey | Priority
key(Task, 1) | 1
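
The key literal format used by Has Ancestor and by the Key literal key type can be illustrated with a small parser. This is a sketch under simplifying assumptions, not the plugin's parser: the function name `parse_key_literal` is hypothetical, the `key` prefix is matched case-insensitively, and kinds and string identifiers are assumed to contain no commas or quotes.

```python
import re

# Matches key(...) and captures everything between the parentheses.
_KEY_LITERAL = re.compile(r"^\s*key\s*\((.*)\)\s*$", re.IGNORECASE)


def parse_key_literal(text):
    """Parse key(kind_1, id_1, kind_2, id_2, ...) into (kind, identifier) pairs.

    Quoted identifiers become strings (named keys); unquoted identifiers
    are parsed as integers (numeric IDs). Hypothetical helper.
    """
    match = _KEY_LITERAL.match(text)
    if not match:
        raise ValueError(f"Not a key literal: {text!r}")
    parts = [part.strip() for part in match.group(1).split(",")]
    if len(parts) < 2 or len(parts) % 2 != 0:
        raise ValueError("Key literal needs kind/identifier pairs")
    path = []
    for kind, identifier in zip(parts[0::2], parts[1::2]):
        if len(identifier) >= 2 and identifier[0] == identifier[-1] == "'":
            path.append((kind, identifier[1:-1]))  # named key
        else:
            path.append((kind, int(identifier)))  # numeric ID
    return path
```

For instance, `parse_key_literal("key(kind_1, 'stringId', kind_2, 100)")` produces the path `[("kind_1", "stringId"), ("kind_2", 100)]`.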

Read data filtered by property, without including Key

Kind: Task
Has Ancestor: (empty)
Filters: Priority|1
Key Type: None
Key Alias: (empty)

Output Schema

Priority: Long

Output Dataset

Priority
1
1

Sink examples

Insert new row with Ancestor and Custom name

Input Dataset

TaskId | Priority
3 | 2

Sink properties

Kind: Task
Ancestor: Key(TaskType, 'DEV')
Key Type: Custom name
Key Alias: TaskId

Input Schema

TaskId: Long
Priority: Long

Resulting Dataset (new row was inserted)

Key | Parent | Priority
id=1 | Key(TaskType, 'DEV') | 1
id=2 | (none) | 1
id=3 | Key(TaskType, 'DEV') | 2

Insert new row without Ancestor and with Auto-generated key

Input Dataset

Priority
2

Sink properties

Kind: Task
Ancestor: (empty)
Key Type: Auto-generated key
Key Alias: (empty)

Resulting Dataset (new row was inserted)

Key | Parent | Priority
id=1 | Key(TaskType, 'DEV') | 1
id=2 | (none) | 1
id=11010104985 | (none) | 2

Security

Limitation(s)

Future Work

  • Some future work – HYDRATOR-99999
  • Another future work – HYDRATOR-99999

Test Case(s)

  • Test case #1
  • Test case #2

Sample Pipeline

Please attach one or more sample pipeline(s) and associated data. 

Pipeline #1

Pipeline #2





Checklist

  •  User stories documented 
  •  User stories reviewed 
  •  Design documented 
  •  Design reviewed 
  •  Feature merged 
  •  Examples and guides 
  •  Integration tests 
  •  Documentation for feature 
  •  Short video demonstrating the feature