Introduction

A batch source that ingests data from DynamoDB into Hydrator pipelines.

Use case(s)

An organization dumps all of its spam emails into a DynamoDB table. As a data scientist, I want to train my machine learning models in Hydrator pipelines using the data from the DynamoDB tables.

User Storie(s)

User should be able to provide the table name in DynamoDB.
User should be able to provide the AWS endpoint URL for the DynamoDB instance.
User should be able to provide the AWS region ID for the DynamoDB instance.
User should be able to provide the throughput for the DynamoDB instance. (DynamoDB charges are based on throughput, so the user should be able to control it.)
User should be able to provide the AWS access ID.
User should be able to provide the AWS access key.

Plugin Type

Batch Source
Batch Sink
Real-time Source
Real-time Sink
Action
Post-Run Action
Aggregate
Join
Spark Model
Spark Compute

Configurables

This section defines properties that are configurable for this plugin.

User Facing Name	Type	Description	Constraints
Table name	String	Name of the DynamoDB table	Must follow the AWS table naming conventions
endpoint url	String	AWS endpoint URL for the DynamoDB instance	Optional; can be reconstructed from the region id
region id	String	AWS region ID for the DynamoDB instance	Optional; defaults to us-west-2
throughput	Int	Intended throughput for DynamoDB	Optional
access id	password	AWS access key
access key	password	AWS secret access key
query	String	Query to get the data
filterQuery	String	Query to filter the fetched data before returning it to the client	Optional
partition key	String	Partition key to get the data
sort key	String	Sort key to refine/sort the fetched data	Optional
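
The "naming convention constraints from AWS" for the table name can be checked before the pipeline runs. The following is a minimal sketch, not the plugin's actual validation code: the class name TableNameValidator is hypothetical, and the rule it encodes (3 to 255 characters from a-z, A-Z, 0-9, underscore, hyphen, and dot) is the one AWS documents for DynamoDB table names.

```java
import java.util.regex.Pattern;

// Hypothetical helper, shown only to illustrate the table name constraint.
final class TableNameValidator {
  // DynamoDB table names: 3 to 255 characters from a-z, A-Z, 0-9, '_', '-' and '.'.
  private static final Pattern VALID_NAME = Pattern.compile("[a-zA-Z0-9_.-]{3,255}");

  private TableNameValidator() { }

  // Returns true only when the name is non-null and matches the whole pattern.
  static boolean isValid(String tableName) {
    return tableName != null && VALID_NAME.matcher(tableName).matches();
  }
}
```

A configuration check could call TableNameValidator.isValid(tableName) and reject the configuration early instead of failing at read time.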

Design / Implementation Tips

For testing purposes, tables can be created using either the AWS CLI or Java code: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaDocumentAPITablesExample.html
AWS DynamoDB CLI reference: http://docs.aws.amazon.com/cli/latest/reference/dynamodb/
Java example for CRUD operations: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/batch-operation-document-api-java.html
Java example for working with queries: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryingJavaDocumentAPI.html
Please reuse/modify the RecordReader and InputFormat classes present here: https://github.com/awslabs/emr-dynamodb-connector

Design

DynamoDB JSON format:

{
  "name": "DynamoDb",
  "type": "batchsource",
  "properties": {
    "accessKey": "xyz",
    "secretAccessKey": "abc",
    "regionId": "us-east-1",
    "endpointUrl": "localhost:8000",
    "tableName": "Movies",
    "throughput": "10",
    "query": "Id = :v_Id",
    "filterQuery": "rating > :v_rating",
    "valueMap": "v_Id:120,v_rating:18",
    "partitionKey": "Id",
    "sortKey": "salary"
  }
}

Approach(s)

A dropdown with the list of regions will be provided so the user can select the AWS DynamoDB region to connect to. Supported regions are:
"us-gov-west-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2", "eu-west-1", "eu-west-2", "eu-central-1", "ap-south-1", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "ap-northeast-2", "sa-east-1", "cn-north-1", "ca-central-1", "getCurrentRegion". (Referred from: http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/ and http://docs.aws.amazon.com/general/latest/gr/rande.html#ddb_region)
If the user does not select a region, the default region, us-west-2, will be used.
The getCurrentRegion option returns a Region object representing the region the application is running in when it runs on EC2. When called from a non-EC2 environment, it returns null.
The user will provide the complete query (fields and their values, conditions, expressions, and constraints, if any) through the “Query” widget; this query will be used to fetch the data. For example: year_id = 1985 and rating > 5
If a filter query is required to fetch the data, it can be provided through the “Filter Query” widget.
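
The region default and the "endpoint can be reconstructed from the region" rule could be combined as in the sketch below. This is an illustration under stated assumptions, not the plugin's implementation: the class name DynamoDbEndpoints and the method resolve are hypothetical, the https scheme is an assumption, and the "getCurrentRegion" option is not handled because resolving it requires the AWS SDK. The host pattern dynamodb.<region>.amazonaws.com (with the .com.cn suffix for China regions) follows the AWS regions and endpoints page referenced above.

```java
// Hypothetical helper, shown only to illustrate endpoint reconstruction.
final class DynamoDbEndpoints {
  // Default region named in this design.
  static final String DEFAULT_REGION = "us-west-2";

  private DynamoDbEndpoints() { }

  // An explicit endpoint URL wins; otherwise the endpoint is derived from the region,
  // falling back to the default region when none is given.
  static String resolve(String endpointUrl, String regionId) {
    if (endpointUrl != null && !endpointUrl.trim().isEmpty()) {
      return endpointUrl.trim();
    }
    String region = (regionId == null || regionId.trim().isEmpty())
        ? DEFAULT_REGION
        : regionId.trim();
    // China regions use the amazonaws.com.cn domain.
    String suffix = region.startsWith("cn-") ? ".amazonaws.com.cn" : ".amazonaws.com";
    return "https://dynamodb." + region + suffix;
  }
}
```

With this shape, a local test endpoint such as localhost:8000 is passed through unchanged, while an empty endpoint with an empty region resolves to the us-west-2 endpoint.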

Properties

endpointUrl: AWS endpoint (see http://docs.aws.amazon.com/general/latest/gr/rande.html#ddb_region). This can be reconstructed from regionId.
regionId: The AWS DynamoDB region to connect to.
accessKey: Access key for AWS DynamoDB.
secretAccessKey: Secret access key for AWS DynamoDB.
tableName: The table to read the data from.
throughput: Intended throughput for AWS DynamoDB.
query: The query that will fetch the data from the table.
filterQuery: Query used to filter the fetched data before the final results are returned.
nameMap: Comma-separated list of key-value pairs, where each key is an attribute name placeholder used in query/filterQuery and each value is the attribute name that replaces it.
valueMap: Comma-separated list of key-value pairs, where each key is a value placeholder used in query/filterQuery for the items to be searched and each value replaces that placeholder.
partitionKey: Partition key used to fetch the data from the table.
sortKey: Sort key used to sort/refine the data fetched from the table.
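
The nameMap and valueMap properties share the same comma-separated key-value shape, so one parser could serve both. The sketch below assumes the pair separator is a colon, as in the sample value "v_Id:120,v_rating:18" from the JSON above; the class name PlaceholderMapParser is hypothetical, and how the parsed values are typed and bound to the query is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, shown only to illustrate parsing of nameMap/valueMap.
final class PlaceholderMapParser {
  private PlaceholderMapParser() { }

  // Parses "k1:v1,k2:v2" into an insertion-ordered map.
  // Assumption: keys and values contain neither ',' nor ':' themselves.
  static Map<String, String> parse(String spec) {
    Map<String, String> result = new LinkedHashMap<>();
    if (spec == null || spec.trim().isEmpty()) {
      return result;
    }
    for (String pair : spec.split(",")) {
      int idx = pair.indexOf(':');
      if (idx < 0) {
        throw new IllegalArgumentException("Expected key:value but got '" + pair + "'");
      }
      String key = pair.substring(0, idx).trim();
      String value = pair.substring(idx + 1).trim();
      if (key.isEmpty() || value.isEmpty()) {
        throw new IllegalArgumentException("Empty key or value in '" + pair + "'");
      }
      result.put(key, value);
    }
    return result;
  }
}
```

For the sample configuration, PlaceholderMapParser.parse("v_Id:120,v_rating:18") yields v_Id mapped to 120 and v_rating mapped to 18. Note that the sample keys omit the leading colon that the placeholders carry inside the query text (:v_Id), so the caller would need to add it when binding values.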

Security

The AWS access keys should be password fields and should be macro-enabled.

Limitation(s)

Future Work

Test Case(s)

Sample Pipeline

Checklist

User stories documented
User stories reviewed
Design documented
Design reviewed
Feature merged
Examples and guides
Integration tests
Documentation for feature
Short video demonstrating the feature
