Introduction
A batch source that ingests data from DynamoDB into Hydrator pipelines.
Use case(s)
- An organization dumps all of its spam emails into a DynamoDB table. As a data scientist, I want to train my machine learning models in Hydrator pipelines on the data from those DynamoDB tables.
User Storie(s)
- User should be able to provide the table name in DynamoDB.
- User should be able to provide the AWS endpoint URL for the DynamoDB instance.
- User should be able to provide the AWS region ID for the DynamoDB instance.
- User should be able to provide the throughput for the DynamoDB instance. (DynamoDB charges are based on throughput, so the user should be able to control it.)
- User should be able to provide the AWS access id.
- User should be able to provide the AWS access key.
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Configurables
This section defines properties that are configurable for this plugin.
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Table name | String | Name of the DynamoDB table | Naming convention constraints from AWS |
endpoint url | String | AWS endpoint URL for the DynamoDB instance | Optional, could be reconstructed using regionId |
region id | String | AWS region ID for the DynamoDB instance | Optional, with default value set as us-west-2 |
throughput | Int | Intended throughput for DynamoDB | (Optional) |
access id | password | AWS access key | |
access key | password | AWS secret access key | |
query | String | Query to get the data | |
filterQuery | String | Query to filter the fetched data before returning it to the client | (Optional) |
partition key | String | Partition key to get the data | |
sort key | String | Sort key to refine/sort the fetched data | (Optional) |
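The "Naming convention constraints from AWS" on the table name could be checked at configure time. The following is a minimal sketch, assuming the DynamoDB rule that table names are 3 to 255 characters long and may contain only a-z, A-Z, 0-9, underscore, hyphen, and dot. The class name TableNameValidator and its methods are hypothetical and are not taken from the plugin source.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the plugin source. */
final class TableNameValidator {
  // Assumed AWS rule: 3 to 255 characters from a-z, A-Z, 0-9, '_', '-', '.'.
  private static final Pattern VALID_NAME = Pattern.compile("[a-zA-Z0-9_.-]{3,255}");

  private TableNameValidator() { }

  /** Returns true when the name satisfies the assumed naming rule. */
  static boolean isValid(String tableName) {
    return tableName != null && VALID_NAME.matcher(tableName).matches();
  }

  /** Throws IllegalArgumentException so a pipeline can fail before it runs. */
  static void validate(String tableName) {
    if (!isValid(tableName)) {
      throw new IllegalArgumentException(
        "Invalid DynamoDB table name: '" + tableName + "'. Expected 3 to 255 "
          + "characters from a-z, A-Z, 0-9, '_', '-' and '.'.");
    }
  }
}
```

Failing early on a malformed name avoids a round trip to AWS for an error that can be detected locally.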
Design / Implementation Tips
- For testing purposes, tables can be created either with the AWS CLI or with Java code: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaDocumentAPITablesExample.html
- AWS DynamoDB CLI reference: http://docs.aws.amazon.com/cli/latest/reference/dynamodb/
- Java example for CRUD operations: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/batch-operation-document-api-java.html
- Java example for working with queries: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryingJavaDocumentAPI.html
- Please reuse/modify the RecordReader and InputFormat classes present here: https://github.com/awslabs/emr-dynamodb-connector
Design
DynamoDB JSON Format:
{
"name": "DynamoDb",
"type": "batchsource",
"properties": {
"accessKey": "xyz",
"secretAccessKey": "abc",
"regionId": "us-east-1",
"endpointUrl": "localhost:8000",
"tableName": "Movies",
"throughput": "10",
"query": "Id = :v_Id",
"filterQuery": "rating > :v_rating",
"valueMap": "v_Id:120,v_rating:18",
"partitionKey": "Id",
"sortKey": "salary"
}
}
Approach(s)
- A dropdown with the list of regions will be provided so the user can select the AWS DynamoDB region to connect to. Supported regions are:
"us-gov-west-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2", "eu-west-1", "eu-west-2", "eu-central-1", "ap-south-1", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "ap-northeast-2", "sa-east-1", "cn-north-1", "ca-central-1", "getCurrentRegion". (Referred from: http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/ and http://docs.aws.amazon.com/general/latest/gr/rande.html#ddb_region)
- If the user does not select any region, the default region, us-west-2, will be used.
- getCurrentRegion, when selected from the list, returns a Region object representing the region the application is running in when running on EC2. If it is called from a non-EC2 environment, it returns null.
- The user will provide the complete query (fields and their values, conditions, expressions, and constraints, if any) through the “Query” widget; it will be used to fetch the data. For example: year_id = 1985 and rating > 5
- If a filter query is required to fetch the data, it can be provided through the “Filter Query” widget.
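The region defaulting described above, and the reconstruction of the endpoint URL from the region, can be sketched as follows. This is an illustration only: the class RegionResolver and its methods are hypothetical, the default is assumed to be the region ID us-west-2, and the endpoint is assumed to follow the pattern dynamodb.<region>.amazonaws.com (with the .com.cn suffix for China regions). The getCurrentRegion entry is left out because it depends on EC2 metadata at runtime.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the plugin source. */
final class RegionResolver {
  // Assumption: the default is expressed as a region ID, not an enum name.
  static final String DEFAULT_REGION = "us-west-2";

  // Region IDs from the supported list, excluding "getCurrentRegion".
  static final Set<String> SUPPORTED = new LinkedHashSet<>(Arrays.asList(
    "us-gov-west-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2",
    "eu-west-1", "eu-west-2", "eu-central-1", "ap-south-1", "ap-southeast-1",
    "ap-southeast-2", "ap-northeast-1", "ap-northeast-2", "sa-east-1",
    "cn-north-1", "ca-central-1"));

  private RegionResolver() { }

  /** Falls back to the default when no region is given; rejects unknown IDs. */
  static String resolveRegion(String regionId) {
    if (regionId == null || regionId.trim().isEmpty()) {
      return DEFAULT_REGION;
    }
    String trimmed = regionId.trim();
    if (!SUPPORTED.contains(trimmed)) {
      throw new IllegalArgumentException("Unsupported region: " + trimmed);
    }
    return trimmed;
  }

  /** Builds the endpoint URL from the region (assumed URL pattern). */
  static String endpointFor(String regionId) {
    String region = resolveRegion(regionId);
    String suffix = region.startsWith("cn-") ? ".amazonaws.com.cn" : ".amazonaws.com";
    return "https://dynamodb." + region + suffix;
  }
}
```

With this shape, an explicitly configured endpointUrl (for example a local test instance) would simply take precedence over the value returned by endpointFor.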
Properties
- endpointUrl: AWS endpoint (see http://docs.aws.amazon.com/general/latest/gr/rande.html#ddb_region). This could be reconstructed using regionId.
- regionId: The region for AWS Dynamo DB to connect to.
- accessKey: Access key for AWS Dynamo DB.
- secretAccessKey: Secret access key for AWS Dynamo DB.
- tableName: The table to read the data from.
- throughput: Intended throughput for AWS Dynamo DB.
- query: The query that will fetch the data from table.
- filterQuery: Query that will be used to filter the fetched data, before returning the final results.
- nameMap: Comma-separated list of key-value pairs, where each key is an attribute name placeholder used in Query/FilterQuery and each value is the attribute name that replaces it.
- valueMap: Comma-separated list of key-value pairs, where each key is a value placeholder used in Query/FilterQuery for the items to be searched and each value is the value that replaces it.
- partitionKey: Partition key, that will be used to fetch the data from table.
- sortKey: Sort Key, that will be used to sort/refine the fetched data from table.
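The nameMap and valueMap properties share the same "key:value,key:value" shape, so one parser could serve both. Below is a minimal sketch, assuming pairs are separated by commas and each key is separated from its value by the first colon, as in "v_Id:120,v_rating:18". The class PlaceholderMapParser is hypothetical, and the sketch keeps every value as a string; mapping values to DynamoDB types (number, string, and so on) is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the plugin source. */
final class PlaceholderMapParser {
  private PlaceholderMapParser() { }

  /**
   * Parses "k1:v1,k2:v2" into an insertion-ordered map.
   * A null or blank input yields an empty map, since both properties are optional.
   */
  static Map<String, String> parse(String raw) {
    Map<String, String> result = new LinkedHashMap<>();
    if (raw == null || raw.trim().isEmpty()) {
      return result;
    }
    for (String pair : raw.split(",")) {
      String trimmed = pair.trim();
      // Split on the first colon only, so values may themselves contain colons.
      int sep = trimmed.indexOf(':');
      if (sep <= 0 || sep == trimmed.length() - 1) {
        throw new IllegalArgumentException("Expected key:value but got '" + trimmed + "'");
      }
      String key = trimmed.substring(0, sep).trim();
      String value = trimmed.substring(sep + 1).trim();
      if (key.isEmpty() || value.isEmpty()) {
        throw new IllegalArgumentException("Expected key:value but got '" + trimmed + "'");
      }
      result.put(key, value);
    }
    return result;
  }
}
```

For the sample configuration, parse("v_Id:120,v_rating:18") yields a map with v_Id mapped to "120" and v_rating mapped to "18".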
Security
- The AWS access keys should be password fields and macro-enabled.
Limitation(s)
Future Work
Test Case(s)
Sample Pipeline
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature