Introduction
Google Cloud Datastore is a NoSQL document database built for automatic scaling, high performance, and ease of application development offered by Google on the Google Cloud Platform. Cloud Datastore is built upon Google's Bigtable and Megastore technology.
Use case(s)
- Users would like to build a batch data pipeline to read a complete table from a Google Cloud Datastore instance.
- Users would like to build a batch data pipeline to perform inserts/upserts into Google Cloud Datastore tables.
- Users should get relevant information from the tooltips while configuring the Google Cloud Datastore source and sink.
- The tooltips should accurately describe what each field is used for.
- Users should get field-level lineage for the source and sink being used.
- Reference documentation should be available from the source and sink plugins.
User Story(s)
- Source code in data integrations org
- Integration test code
- Relevant documentation in the source repo and reference documentation section in plugin
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Design
Properties
Source
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Project ID | String | Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. | Required. |
JSON key file path | String | The credential JSON key file path. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster. https://cloud.google.com/storage/docs/authentication#generating-a-private-key | Required. |
Namespace | String | A namespace partitions entities into a subset of Cloud Datastore. https://cloud.google.com/datastore/docs/concepts/multitenancy | Optional. If not provided, the [default] namespace will be used. |
Kind | String | The kind of an entity categorizes it for the purpose of Datastore queries; it is the equivalent of a relational database table. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Optional. Should be empty if GQL is indicated. |
Ancestors | String | List of ancestor paths that identifies the common root entity in which the entities are grouped. Each ancestor must have a kind and a key. A key can be named (name=string_value) or numeric (id=long_value). Example: name=KEY_NAME, id=100 https://cloud.google.com/datastore/docs/concepts/structuring_for_strong_consistency | Optional. |
Filter property types | String | List of property types by which an equality filter will be applied. Must include the property name and its type. Note: property names and their count must match the properties provided in `Filter property values`. | Optional. |
Filter property values | String | List of property values by which an equality filter will be applied. Must include the property name and its value. Note: property names and their count must match the properties provided in `Filter property types`. | Optional. |
Number of splits | Integer | Desired number of splits to divide a query into multiple shards during execution. Will create up to the desired number of splits, but may return fewer splits if the desired number is unavailable. Only applicable for `Query by Kind`, which allows only ancestor queries and equality filters by properties. | Required. Min value: 1. Max value: 2147483647. |
GQL | String | SQL-like language that allows querying data by a specific kind or keys, with the option to apply various filter conditions. Should be empty if `Kind` is indicated. Note: GQL queries cannot be split into multiple shards. Example: SELECT * FROM myKind WHERE myProp >= 100 AND myProp < 200 https://cloud.google.com/datastore/docs/concepts/queries | Optional. |
Include Key | Boolean | The key is the unique identifier assigned to the entity when it is created. If this property is set to true, the __key__ column (or its alias indicated in the Key Alias property) must be present in the schema definition. The key type is String. For named keys, the key value is returned in name=string_value format; for numeric keys, in id=long_value format. Needed when performing upserts to Cloud Datastore. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. False by default. |
Key Alias | String | Allows setting a user-friendly name for the key column, whose default name is __key__. Only applicable if Include Key is enabled. If not set and Include Key is enabled, the default name will be used. | Optional. |
Schema | JSON schema | The schema of records output by the source. Will be mapped to the data returned from the query. Should contain column name, type, and nullability. Can be imported or obtained using the Get Schema button. | Required. |
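The named and numeric key string formats described above (name=string_value and id=long_value) can be sketched with a small helper. This is an illustrative Python sketch only; `format_key` and `format_ancestor_path` are hypothetical names, not part of the plugin:

```python
def format_key(name_or_id):
    """Render a Datastore key element in the plugin's string format:
    numeric IDs become 'id=<long>', named keys become 'name=<string>'."""
    if isinstance(name_or_id, int):
        return f"id={name_or_id}"
    return f"name={name_or_id}"

def format_ancestor_path(path):
    """Render an ancestor path given as (kind, key) pairs, e.g.
    [('TaskType', 'DEV')] -> 'Key(TaskType, name=DEV)'."""
    parts = ", ".join(f"{kind}, {format_key(key)}" for kind, key in path)
    return f"Key({parts})"

print(format_key("DEV"))   # name=DEV
print(format_key(100))     # id=100
```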
Sink
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Project ID | String | Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. | Required. |
JSON key file path | String | The credential JSON key file path. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster. https://cloud.google.com/storage/docs/authentication#generating-a-private-key | Required. |
Namespace | String | A namespace partitions entities into a subset of Cloud Datastore. https://cloud.google.com/datastore/docs/concepts/multitenancy | Optional. If not provided, the [default] namespace will be used. |
Kind | String | The kind of an entity categorizes it for the purpose of Datastore queries; it is the equivalent of a relational database table. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. |
Indexed Properties | String | List of property names to be marked as indexed; the equivalent of relational database columns. | Optional. If not indicated, all properties are considered indexed by default. |
Allow Auto-generated Key | Boolean | The key is the unique identifier assigned to the entity when it is created. The user can specify a custom key for the entity, or an already existing key to perform upserts. If this property is set to false, the __key__ column (or its alias indicated in the Key Alias property) must be present in the schema definition; the key type is String. Otherwise, Cloud Datastore will automatically assign a numeric ID to the entity. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. True by default. |
Key Alias | String | Indicates the key name alias if it is different from the default one (__key__). | Optional. |
Ancestors | String | List of ancestor paths that identifies the common root entity in which the entities are grouped. Each ancestor must have a kind and a key. A key can be named (name=string_value) or numeric (id=long_value). Example: name=KEY_NAME, id=100 https://cloud.google.com/datastore/docs/concepts/structuring_for_strong_consistency | Optional. |
Batch size | Integer | Maximum number of entities that can be passed in one batch to a Commit operation. | Required. Default value 25. Min value 1. Max value 500. |
Transactional | Boolean | Datastore commits are either transactional, meaning they take place in the context of a transaction and the transaction's set of mutations is applied all-or-nothing, or non-transactional, meaning the set of mutations may not apply as all-or-nothing. https://cloud.google.com/datastore/docs/concepts/transactions | Required. False by default. |
Schema | JSON schema | The schema of records to be written. Should contain column name, type, and nullability. Can be imported. | Required. |
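Because a single Commit operation accepts a bounded number of mutations (batch size defaults to 25, with a maximum of 500 per the table above), the sink will need to chunk incoming records before committing. A minimal sketch of that chunking, assuming the stated bounds; `chunk_mutations` is an illustrative helper, not the plugin's actual code:

```python
def chunk_mutations(mutations, batch_size=25):
    """Split pending mutations into batches no larger than batch_size,
    each suitable for a single Datastore Commit call."""
    if not 1 <= batch_size <= 500:
        raise ValueError("batch size must be between 1 and 500")
    return [mutations[i:i + batch_size]
            for i in range(0, len(mutations), batch_size)]

# 60 records with the default batch size of 25 -> batches of 25, 25, 10.
batches = chunk_mutations(list(range(60)))
print([len(b) for b in batches])  # [25, 25, 10]
```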
Implementation Tips
Implementation will be done using the Google Cloud Datastore Data API, since it provides mechanisms for query splitting during data reads.
https://github.com/GoogleCloudPlatform/google-cloud-datastore
Examples
Initial Dataset
Kind: TaskType
Key | Label |
---|---|
name='DEV' | Development |
Kind: Task
Key | Parent | Priority |
---|---|---|
id=1 | Key(TaskType, 'DEV') | 1 |
id=2 | | 1 |
Source examples
Read data filtered by ancestor and property, including Key
Kind | Task | |
---|---|---|
Ancestors | Key(TaskType, 'DEV') | |
Filter Property Value | Priority | 1 |
Include Key | true | |
Key Alias | TaskKey |
Output Schema
TaskKey | String |
---|---|
Priority | Integer |
Output Dataset
TaskKey | Priority |
---|---|
id=1 | 1 |
Read data filtered by property, without including Key
Kind | Task | |
---|---|---|
Ancestors | ||
Filter Property Value | Priority | 1 |
Include Key | false | |
Key Alias | |
Output Schema
Priority | Integer |
---|
Output Dataset
Priority |
---|
1 |
1 |
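The filtering behavior in the two source examples above can be sketched in memory (illustration only; the real plugin pushes the equality filter and ancestor restriction down into the Datastore query):

```python
# Rows from the initial 'Task' dataset: (key, parent, priority).
rows = [
    ("id=1", "Key(TaskType, 'DEV')", 1),
    ("id=2", None, 1),
]

def filter_by_property(rows, priority, ancestor=None):
    """Apply the Priority equality filter, optionally restricted to an
    ancestor, mirroring the two source examples above."""
    return [r for r in rows
            if r[2] == priority and (ancestor is None or r[1] == ancestor)]

# Example 1: ancestor + Priority=1 -> only id=1.
print(filter_by_property(rows, 1, ancestor="Key(TaskType, 'DEV')"))
# Example 2: Priority=1 with no ancestor -> both rows.
print(filter_by_property(rows, 1))
```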
Sink examples
Insert new row with ancestor and custom Key
Input Dataset
TaskKey | Priority |
---|---|
id=3 | 2 |
Sink properties
Kind | Task |
---|---|
Ancestors | Key(TaskType, 'DEV') |
Allow Auto-generated Key | false |
Key Alias | TaskKey |
Input Schema
TaskKey | String |
---|---|
Priority | Integer |
Resulting Dataset (new row was inserted)
Key | Parent | Priority |
---|---|---|
id=1 | Key(TaskType, 'DEV') | 1 |
id=2 | | 1 |
id=3 | Key(TaskType, 'DEV') | 2 |
Insert new row without ancestor and with auto-generated Key
Input Dataset
Priority |
---|
2 |
Sink properties
Kind | Task |
---|---|
Ancestors | |
Allow Auto-generated Key | true |
Key Alias |
Resulting Dataset (new row was inserted)
Key | Parent | Priority |
---|---|---|
id=1 | Key(TaskType, 'DEV') | 1 |
id=2 | | 1 |
id=11010104985 | | 2 |
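The upsert semantics in the sink examples (a write with an existing key replaces that entity, a new key inserts one, and a missing key triggers a server-assigned numeric ID) can be sketched as follows. This is a simplified in-memory model; `upsert` and the counter-based ID allocator are illustrative stand-ins for Datastore's behavior:

```python
import itertools

_auto_id = itertools.count(1000)  # stand-in for Datastore's server-side ID allocator

def upsert(store, key, entity):
    """Insert or replace an entity in a dict-backed store; when key is
    None, emulate an auto-generated numeric key."""
    if key is None:
        key = f"id={next(_auto_id)}"
    store[key] = entity
    return key

store = {"id=1": {"Priority": 1}, "id=2": {"Priority": 1}}
upsert(store, "id=3", {"Priority": 2})  # explicit key -> new row inserted
upsert(store, None, {"Priority": 2})    # auto-generated numeric key
```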
Security
Limitation(s)
Future Work
- Some future work – HYDRATOR-99999
- Another future work – HYDRATOR-99999
Test Case(s)
- Test case #1
- Test case #2
Sample Pipeline
Please attach one or more sample pipeline(s) and associated data.
Pipeline #1
Pipeline #2
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature