Google Cloud Datastore Source and Sink
- Arina Ielchiieva
Introduction
Google Cloud Datastore is a NoSQL document database built for automatic scaling, high performance, and ease of application development offered by Google on the Google Cloud Platform. Cloud Datastore is built upon Google's Bigtable and Megastore technology.
Use case(s)
- Users would like to build a batch data pipeline to read a complete table from a Google Cloud Datastore instance.
- Users would like to build a batch data pipeline to perform inserts / upserts into Google Cloud Datastore tables.
- Users should get relevant information from the tooltip while configuring the Google Cloud Datastore source and Google Cloud Datastore sink.
- The tooltip should describe accurately what each field is used for.
- Users should get field-level lineage for the source and sink that is being used.
- Reference documentation should be available from the source and sink plugins.
User Story(s)
- Source code in data integrations org
- Integration test code
- Relevant documentation in the source repo and reference documentation section in plugin
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Design
Properties
Source
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Project ID | String | Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. | Required. |
JSON key file path | String | The credential JSON key file path. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster. https://cloud.google.com/storage/docs/authentication#generating-a-private-key | Required. |
Namespace | String | A namespace partitions entities into a subset of Datastore. https://cloud.google.com/datastore/docs/concepts/multitenancy | Optional. If not provided, [default] namespace will be used. |
Kind | String | The kind of an entity categorizes it for the purpose of Datastore queries. Equivalent to relational database table notion. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. |
Has Ancestor | String | Ancestor identifies the common root entity in which the entities are grouped. Must be written in Key Literal format: key(kind_1, identifier_1, kind_2, identifier_2, [...]). Example: key(kind_1, 'stringId', kind_2, 100) https://cloud.google.com/datastore/docs/concepts/structuring_for_strong_consistency | Optional. |
Filters | String | List of filter property name and value pairs to which an equality filter will be applied. | Optional. |
Number of Splits | Integer | Desired number of splits into which the query will be sharded during execution. | Required. Min value: 1. Max value: 2147483647. |
Key Type | String | Key is a unique identifier assigned to the entity when it is created. None - key will not be included. Key literal - key will be included in Datastore key literal format, including the complete path with ancestors. URL-safe key - key will be included in an encoded form that can be used as part of a URL. Note: if Key literal or URL-safe key is selected, the default key name (__key__) or its alias must be present in the schema with non-nullable STRING type. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. None by default. |
Key Alias | String | Allows setting a user-friendly name for the key column, whose default name is __key__. Only applicable if Key Type is set to Key literal or URL-safe key. If Key Type is set to None, the property must be empty. | Optional. |
Schema | JSON schema | Schema of the data to read, can be imported or fetched by clicking the Get Schema button. | Required. |
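As an illustration of the Key Literal format used by the Has Ancestor property above, the following sketch parses a literal such as key(kind_1, 'stringId', kind_2, 100) into (kind, identifier) pairs. This is an illustrative Python sketch only; the function name and representation are hypothetical and not part of the plugin:

```python
import re

def parse_key_literal(literal):
    """Parse a Datastore key literal like key(kind_1, 'stringId', kind_2, 100)
    into a list of (kind, identifier) pairs. Identifiers wrapped in single
    quotes are treated as names (str); bare numbers as numeric ids (int)."""
    match = re.fullmatch(r"\s*key\((.*)\)\s*", literal, flags=re.IGNORECASE)
    if not match:
        raise ValueError("Key literal must have the form key(kind, identifier, ...)")
    parts = [p.strip() for p in match.group(1).split(",")]
    if len(parts) == 0 or len(parts) % 2 != 0:
        raise ValueError("Key literal must contain kind/identifier pairs")
    path = []
    for kind, ident in zip(parts[::2], parts[1::2]):
        if ident.startswith("'") and ident.endswith("'"):
            path.append((kind, ident[1:-1]))   # quoted string identifier (name)
        else:
            path.append((kind, int(ident)))    # bare numeric identifier (id)
    return path
```

A validator along these lines could back the Has Ancestor and Key literal configuration checks at pipeline deployment time.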
Sink
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Project ID | String | Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. | Required. |
JSON key file path | String | The credential JSON key file path. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster. https://cloud.google.com/storage/docs/authentication#generating-a-private-key | Required. |
Namespace | String | A namespace partitions entities into a subset of Datastore. https://cloud.google.com/datastore/docs/concepts/multitenancy | Optional. If not provided, [default] namespace will be used. |
Kind | String | The kind of an entity categorizes it for the purpose of Datastore queries. Equivalent to relational database table notion. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. |
Key Type | String | Key is a unique identifier assigned to the entity when it is created. The property defines what type of key will be added to the entity; this is commonly needed to perform upserts to Cloud Datastore. Can be one of four options: Auto-generated key - key will be generated by Datastore as a numeric ID. Custom name - key will be provided by the user. Supported types: non-nullable STRING, INT or LONG. Key literal - key will be provided in Datastore key literal format, including the complete path with ancestors. Supported type: non-nullable STRING in key literal format: key(<kind>, <identifier>, <kind>, <identifier>, [...]). Example: key(kind_name, 'stringId'). URL-safe key - key will be provided in an encoded form that can be used as part of a URL. Supported type: non-nullable STRING in URL-safe key format. Note: if Custom name, Key literal or URL-safe key is selected, the default key name (__key__) or its alias must be present in the schema. https://cloud.google.com/datastore/docs/concepts/entities#kinds_and_identifiers | Required. Auto-generated key by default. |
Key Alias | String | Allows setting a user-friendly name for the key column, whose default name is __key__. | Optional. |
Ancestor | String | Ancestor identifies the common root entity in which the entities are grouped. Must be written in Key Literal format: key(kind_1, identifier_1, kind_2, identifier_2, [...]). Example: key(kind_1, 'stringId', kind_2, 100) https://cloud.google.com/datastore/docs/concepts/structuring_for_strong_consistency | Optional. |
Index Strategy | String | Index strategy defines which fields defined in the schema will be indexed in Cloud Datastore. Can be one of three options: All - all fields will be indexed. None - none of the fields will be indexed. Custom - indexed fields will be provided in Indexed Properties. | Required. All by default. |
Indexed Properties | String | Comma-separated list of property names to be marked as indexed. | Optional. Must be provided if Index Strategy is set to Custom, otherwise must be empty. |
Batch size | Integer | Maximum number of entities that can be passed in one batch to a Commit operation. | Required. Default value 25. Min value 1. Max value 500. |
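Since a single Datastore Commit is limited to 500 mutations, entities written by the sink have to be chunked by the Batch size property. A minimal sketch of that chunking (Python, illustrative only; the function name is hypothetical):

```python
def batch_entities(entities, batch_size=25):
    """Split a sequence of entities into batches no larger than batch_size,
    so that each batch can be passed to a single Datastore Commit operation.
    Datastore allows at most 500 mutations per Commit, hence the upper bound."""
    if not 1 <= batch_size <= 500:
        raise ValueError("Batch size must be between 1 and 500")
    return [entities[i:i + batch_size] for i in range(0, len(entities), batch_size)]
```

With the default Batch size of 25, a run of 60 records would produce commits of 25, 25, and 10 mutations.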
Implementation Tips
Implementation will be done using the Google Cloud Datastore Data API, since it provides mechanisms to split a query into shards during data reads.
https://cloud.google.com/datastore/docs/reference/data/rpc/
https://github.com/GoogleCloudPlatform/google-cloud-datastore
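The Data API's client library provides a query-splitting helper that divides a query into contiguous shards for parallel reads. The sketch below is not the actual splitting algorithm, only a Python illustration of the general idea: partitioning an ordered range into near-equal shards, one per desired split:

```python
def split_into_shards(num_items, num_splits):
    """Divide num_items consecutive items into at most num_splits contiguous
    shards of near-equal size, mimicking how a query can be sharded for
    parallel reads. Returns (start, end) index pairs, end exclusive."""
    num_splits = min(num_splits, num_items) if num_items else 0
    shards = []
    base, extra = divmod(num_items, num_splits) if num_splits else (0, 0)
    start = 0
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        shards.append((start, start + size))
        start += size
    return shards
```

Note that fewer shards than requested may be produced when the data does not support the desired number of splits, which is why the Number of Splits property is described as a desired value.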
Examples
Initial Dataset
Kind: TaskType
Key | Label |
---|---|
name='DEV' | Development |
Kind: Task
Key | Parent | Priority |
---|---|---|
id=1 | Key(TaskType, 'DEV') | 1 |
id=2 | | 1 |
Source examples
Read data filtered by ancestor and property, including Key
Kind | Task | |
---|---|---|
Has Ancestor | Key(TaskType, 'DEV') | |
Filters | Priority | 1 |
Key Type | Key literal | |
Key Alias | TaskKey |
Output Schema
TaskKey | String |
---|---|
Priority | Long |
Output Dataset
TaskKey | Priority |
---|---|
key(Task, 1) | 1 |
Read data filtered by property, without including Key
Kind | Task | |
---|---|---|
Has Ancestor | ||
Filters | Priority | 1 |
Key Type | None | |
Key Alias |
Output Schema
Priority | Long |
---|
Output Dataset
Priority |
---|
1 |
1 |
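The equality semantics of the Filters property used in the examples above can be sketched as follows (Python, illustrative only; entities are modeled as plain dicts, which is not the plugin's actual representation):

```python
def apply_filters(entities, filters):
    """Apply equality filters (property name -> required value) to a list of
    entities represented as dicts: an entity is kept only if every filtered
    property is present and equal to the required value."""
    return [entity for entity in entities
            if all(entity.get(prop) == value for prop, value in filters.items())]
```

For the initial dataset, filtering on Priority = 1 keeps the two Task entities with that priority and drops any others, matching the Output Dataset shown above.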
Sink examples
Insert new row with Ancestor and Custom name
Input Dataset
TaskId | Priority |
---|---|
3 | 2 |
Sink properties
Kind | Task |
---|---|
Ancestor | Key(TaskType, 'DEV') |
Key Type | Custom name |
Key Alias | TaskId |
Input Schema
TaskId | Long |
---|---|
Priority | Long |
Resulting Dataset (new row was inserted)
Key | Parent | Priority |
---|---|---|
id=1 | Key(TaskType, 'DEV') | 1 |
id=2 | | 1 |
id=3 | Key(TaskType, 'DEV') | 2 |
Insert new row without Ancestor and with Auto-generated key
Input Dataset
Priority |
---|
2 |
Sink properties
Kind | Task |
---|---|
Ancestor | |
Key Type | Auto-generated key |
Key Alias |
Resulting Dataset (new row was inserted)
Key | Parent | Priority |
---|---|---|
id=1 | Key(TaskType, 'DEV') | 1 |
id=2 | | 1 |
id=11010104985 | | 2 |
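The upsert behaviour implied by the sink examples can be sketched as follows (Python, illustrative only; the in-memory store keyed by the key column stands in for a Datastore kind):

```python
def upsert(store, entities, key_field):
    """Upsert entities (dicts) into a store keyed by key_field: an entity
    whose key already exists replaces the stored one, otherwise it is
    inserted. Sketches the upsert semantics of the sink's Commit calls."""
    for entity in entities:
        store[entity[key_field]] = entity
    return store
```

Writing an entity with an existing key (e.g. TaskId 1) overwrites it, while a new key (e.g. TaskId 3) adds a row, which is why a key type other than Auto-generated key is needed to perform upserts.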
Security
Limitation(s)
Future Work
- Some future work – HYDRATOR-99999
- Another future work – HYDRATOR-99999
Test Case(s)
- Test case #1
- Test case #2
Sample Pipeline
Please attach one or more sample pipeline(s) and associated data.
Pipeline #1
Pipeline #2
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature