- Created by abhinavc (Unlicensed), last modified by Rashi Gandhi on Jan 30, 2017
Introduction
A batch source that ingests data from DynamoDB into Hydrator pipelines.
Use case(s)
- An organization dumps all of its spam emails into a DynamoDB table. As a data scientist, I want to train my machine learning models in Hydrator pipelines on the data in that table.
User Stories
- User should be able to provide the table name in DynamoDb.
- User should be able to provide the AWS endpoint url for DynamoDb instance.
- User should be able to provide the AWS region id for DynamoDb instance.
- User should be able to provide the throughput for DynamoDb instance. (DynamoDB charges are based on throughput, so the user should be able to control it.)
- User should be able to provide the AWS access id.
- User should be able to provide the AWS access key.
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Configurables
This section defines properties that are configurable for this plugin.
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Table name | String | Name of the DynamoDB table | Naming convention constraints from AWS |
endpoint url | String | AWS endpoint URL for the DynamoDB instance | Optional; can be reconstructed from the region id |
region id | String | AWS region id for the DynamoDB instance | Optional; defaults to us-west-2 |
access id | password | AWS access key | |
access key | password | AWS secret access key | |
query | String | Query used to fetch the data | |
filterQuery | String | Query used to filter the fetched data before it is returned to the client | (Optional) |
nameMap | String | Comma-separated list of key-value pairs, where each key is an attribute name placeholder used in Query/FilterQuery and each value is the name that replaces it | (Optional) |
valueMap | String | Comma-separated list of key-value pairs, where each key is a value placeholder used in Query/FilterQuery for the items to be searched and each value replaces that placeholder | |
placeholderType | String | Attribute value placeholders and their types | |
readThroughput | String | Read throughput for the DynamoDB table to connect to, given as a double. Default is 1. | (Optional) |
readThroughputPercentage | String | Read throughput percentage for the DynamoDB table to connect to. Default is 0.5. | (Optional) |
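The table notes that the endpoint URL is optional and can be rebuilt from the region id. A minimal sketch of that fallback is shown here, assuming the public endpoint pattern `dynamodb.<region>.amazonaws.com` (with the `amazonaws.com.cn` domain for China regions). The class and method names are hypothetical and are not taken from the plugin.

```java
// Hypothetical helper, not the plugin's actual code.
final class DynamoDbEndpoints {
  // Documented default region, written in AWS's hyphenated form.
  static final String DEFAULT_REGION = "us-west-2";

  private DynamoDbEndpoints() { }

  /** Returns the explicit endpoint if one is set, otherwise derives it from the region id. */
  static String resolve(String endpointUrl, String regionId) {
    if (endpointUrl != null && !endpointUrl.trim().isEmpty()) {
      return endpointUrl.trim();
    }
    String region = (regionId == null || regionId.trim().isEmpty())
        ? DEFAULT_REGION
        : regionId.trim();
    // Assumption: China regions are served from the amazonaws.com.cn domain.
    String domain = region.startsWith("cn-") ? "amazonaws.com.cn" : "amazonaws.com";
    return "https://dynamodb." + region + "." + domain;
  }
}
```

For example, `DynamoDbEndpoints.resolve("localhost:8000", "us-east-1")` keeps the local endpoint, while `DynamoDbEndpoints.resolve(null, "us-east-1")` falls back to the regional one.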
Design / Implementation Tips
- For testing purposes, tables can be created either with the AWS CLI or with Java code: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaDocumentAPITablesExample.html
- AWS DynamoDB CLI reference: http://docs.aws.amazon.com/cli/latest/reference/dynamodb/
- Java example for CRUD operations http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/batch-operation-document-api-java.html
- Java Example for working with queries http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryingJavaDocumentAPI.html
- Please reuse/modify RecordReader, InputFormat classes present here https://github.com/awslabs/emr-dynamodb-connector
Design
DynamoDB JSON format:
{
"name": "DynamoDb",
"type": "batchsource",
"properties": {
"accessKey": "xyz",
"secretAccessKey": "abc",
"regionId": "us-east-1",
"endpointUrl": "localhost:8000",
"tableName": "Movies",
"query": "Id = :v_Id",
"filterQuery": "rating > :v_rating",
"valueMap": ":v_Id|120,:v_rating|18",
"placeholderType": ":v_Id|int,:v_rating|int"
}
}
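The `valueMap` and `placeholderType` strings in the configuration above use a `placeholder|value` layout, with pairs separated by commas. As an illustration only, one way to split such a string into an ordered map could look like the sketch below. The `PlaceholderMaps` class is hypothetical, and the sketch assumes that values never contain `,` or `|`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the plugin's actual code.
final class PlaceholderMaps {
  private PlaceholderMaps() { }

  /** Parses "key1|value1,key2|value2" into an insertion-ordered map. */
  static Map<String, String> parse(String spec) {
    Map<String, String> result = new LinkedHashMap<>();
    if (spec == null || spec.trim().isEmpty()) {
      return result;
    }
    // Assumption: neither keys nor values contain ',' or '|'.
    for (String pair : spec.split(",")) {
      String trimmed = pair.trim();
      int bar = trimmed.indexOf('|');
      if (bar <= 0 || bar == trimmed.length() - 1) {
        throw new IllegalArgumentException("Expected 'placeholder|value' but got: " + pair);
      }
      result.put(trimmed.substring(0, bar).trim(), trimmed.substring(bar + 1).trim());
    }
    return result;
  }
}
```

Calling `PlaceholderMaps.parse(":v_Id|120,:v_rating|18")` would give a map with `:v_Id` mapped to `120` and `:v_rating` mapped to `18`, both still as strings.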
Approach(s)
- A dropdown with the list of regions will be provided to the user, to select the region for AWS DynamoDB to connect to. Supported regions are:
"us-gov-west-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2", "eu-west-1", "eu-west-2", "eu-central-1", "ap-south-1", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "ap-northeast-2", "sa-east-1", "cn-north-1", "ca-central-1", "getCurrentRegion". (Referred from: http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/ and http://docs.aws.amazon.com/general/latest/gr/rande.html#ddb_region)
- If the user does not select any region, the default region, us-west-2, will be used.
- Selecting getCurrentRegion from the list returns a Region object representing the region the application is running in, when running in EC2. If this method is called from a non-EC2 environment, it returns null.
- User will provide the query in the form of a key condition expression through the “Query” widget; it will be used to fetch the items from the DynamoDb table. For example: Id = :v_id and rating > :v_rating
Referred from: http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/dynamodbv2/datamodeling/DynamoDBQueryExpression.html#setKeyConditionExpression-java.lang.String-
- If a filter query is required to fetch the data, it can be provided through the “Filter Query” widget, in the same way as the query.
- User will provide the actual values for the placeholders used in Query/FilterQuery through the nameMap and valueMap widgets.
- The current implementation supports the boolean, int, long, double, float, and string types for searching, i.e. as attribute value types.
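The last two points imply that each placeholder value arrives as a string and has to be converted to its declared type before the query runs. The sketch below shows one possible conversion for the six listed types. The `PlaceholderValues` class is a hypothetical name, and how the plugin actually performs this step is not stated on this page.

```java
import java.util.Locale;

// Hypothetical helper, not the plugin's actual code.
final class PlaceholderValues {
  private PlaceholderValues() { }

  /** Converts a raw string to the Java type named by the placeholder type. */
  static Object convert(String rawValue, String type) {
    switch (type.trim().toLowerCase(Locale.ROOT)) {
      case "boolean":
        // Note: anything other than "true" (ignoring case) becomes false.
        return Boolean.parseBoolean(rawValue);
      case "int":
        return Integer.parseInt(rawValue);
      case "long":
        return Long.parseLong(rawValue);
      case "double":
        return Double.parseDouble(rawValue);
      case "float":
        return Float.parseFloat(rawValue);
      case "string":
        return rawValue;
      default:
        throw new IllegalArgumentException("Unsupported placeholder type: " + type);
    }
  }
}
```

With the sample configuration, `PlaceholderValues.convert("120", "int")` would yield the `Integer` 120 for `:v_Id`. Numeric types throw `NumberFormatException` on malformed input, which a real plugin would probably want to surface at configuration time.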
Properties
- endpointUrl: AWS endpoint (see http://docs.aws.amazon.com/general/latest/gr/rande.html#ddb_region). This can be reconstructed from regionId.
- regionId: The region for AWS Dynamo DB to connect to.
- accessKey: Access key for AWS Dynamo DB.
- secretAccessKey: Secret access key for AWS Dynamo DB.
- tableName: The table to read the data from.
- throughput: Intended throughput for AWS Dynamo DB.
- query: The query that will fetch the data from table.
- filterQuery: Query that will be used to filter the fetched data, before returning the final results.
- nameMap: Comma-separated list of key-value pairs, where each key is an attribute name placeholder used in Query/FilterQuery and each value is the name that replaces it.
- valueMap: Comma-separated list of key-value pairs, where each key is a value placeholder used in Query/FilterQuery for the items to be searched and each value replaces that placeholder.
- placeholderType: List of attribute value placeholders and their types.
- readThroughput: Read throughput for the DynamoDB table to connect to, given as a double. Default is 1. (Macro Enabled)
- readThroughputPercentage: Read throughput percentage for the DynamoDB table to connect to. Default is 0.5. (Macro Enabled)
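This page does not spell out how readThroughput and readThroughputPercentage combine. The sketch below assumes the target read rate is simply their product, which is an assumption made here for illustration and not something this page confirms. The `ReadThroughput` class is hypothetical; the defaults of 1 and 0.5 are the documented ones.

```java
// Hypothetical helper, not the plugin's actual code.
final class ReadThroughput {
  static final double DEFAULT_READ_THROUGHPUT = 1.0;   // documented default
  static final double DEFAULT_READ_PERCENTAGE = 0.5;   // documented default

  private ReadThroughput() { }

  /** Assumption: target read units = readThroughput * readThroughputPercentage. */
  static double targetReadUnits(String readThroughput, String readThroughputPercentage) {
    double throughput = parseOrDefault(readThroughput, DEFAULT_READ_THROUGHPUT);
    double percentage = parseOrDefault(readThroughputPercentage, DEFAULT_READ_PERCENTAGE);
    if (throughput <= 0 || percentage <= 0) {
      throw new IllegalArgumentException("Throughput and percentage must be positive");
    }
    return throughput * percentage;
  }

  // Both properties are configured as strings, so blank values fall back to the default.
  private static double parseOrDefault(String value, double defaultValue) {
    if (value == null || value.trim().isEmpty()) {
      return defaultValue;
    }
    return Double.parseDouble(value.trim());
  }
}
```

Under that assumption, leaving both properties unset gives a target of 0.5 read units, and a readThroughput of 10 with the default percentage gives 5.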
Security
- The AWS access keys should be password fields and should be macro enabled.
Limitation(s)
Future Work
Test Case(s)
DynamoDB batch source - query using partition key.
DynamoDB batch source - query using partition and sort key.
DynamoDB batch source - with query and filter query.
Sample Pipeline
DynamoDBSourcePipeline-cdap-data-pipeline.json
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature