Introduction
A batch sink that writes data from Hydrator pipelines into DynamoDB tables.
Use case(s)
- An organization wants to parse the logs generated by a system and store the metadata in DynamoDB tables.
User Stories
- User should be able to provide the table name in DynamoDB.
- User should be able to provide the primary key of the table.
- User should be able to provide the type of primary key (hash or range).
- The table should be created if it does not already exist.
- User should be able to provide the AWS endpoint URL for the DynamoDB instance.
- User should be able to provide the AWS region id for the DynamoDB instance.
- User should be able to provide the AWS access id.
- User should be able to provide the AWS access key.
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Configurables
This section defines properties that are configurable for this plugin.
User Facing Name | Type | Description | Constraints
---|---|---|---
Table name | String | Name of the DynamoDB table | Naming convention constraints from AWS
Primary key fields | List<Map<String,String>> | Primary key fields of the table | There should be at least 1 primary key
endpoint url | String | AWS endpoint URL for the DynamoDB instance | Optional, could be reconstructed using regionId
region id | String | AWS region id for the DynamoDB instance |
throughput | Int | Intended throughput for DynamoDB | Optional
access id | String | AWS access id |
access key | password | AWS access key |
Primary key types | List<Map<String,String>> | Key types for the primary keys, used for creating the table | The primary key type can only have 2 values: HASH and RANGE
Read capacity units | Long | The number of strongly consistent reads per second of items up to 4 KB in size |
Write capacity units | Long | The number of 1 KB writes per second |
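The capacity descriptions above imply a simple sizing rule: one read unit covers an item of up to 4 KB and one write unit covers an item of up to 1 KB, with larger items rounded up to the next whole unit. The following is a minimal sketch of that arithmetic. `CapacityEstimate` and its method names are hypothetical helpers for illustration, not part of the plugin or the AWS SDK.

```java
// Hypothetical helper for illustration only; not part of the plugin or the AWS SDK.
final class CapacityEstimate {
    static final int READ_UNIT_BYTES = 4 * 1024; // one read unit covers an item of up to 4 KB
    static final int WRITE_UNIT_BYTES = 1024;    // one write unit covers an item of up to 1 KB

    /** Units consumed by one operation on an item of the given size, rounded up. */
    static long unitsPerItem(long itemBytes, int unitBytes) {
        if (itemBytes <= 0) {
            throw new IllegalArgumentException("itemBytes must be positive");
        }
        return (itemBytes + unitBytes - 1) / unitBytes;
    }

    /** Write capacity units needed to sustain the given write rate. */
    static long writeCapacityUnits(long itemBytes, long writesPerSecond) {
        return unitsPerItem(itemBytes, WRITE_UNIT_BYTES) * writesPerSecond;
    }

    /** Read capacity units needed to sustain the given strongly consistent read rate. */
    static long readCapacityUnits(long itemBytes, long readsPerSecond) {
        return unitsPerItem(itemBytes, READ_UNIT_BYTES) * readsPerSecond;
    }
}
```

For a sink, the write figure is usually the one that matters: writing 100 items per second of 2.5 KB each needs 3 units per item, so 300 write capacity units.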
Design / Implementation Tips
- For testing purposes, tables can be created either using the AWS CLI or using Java code http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaDocumentAPITablesExample.html
- AWS DynamoDB CLI reference http://docs.aws.amazon.com/cli/latest/reference/dynamodb/
- Java example for CRUD operations http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/batch-operation-document-api-java.html
- Please reuse/modify RecordReader, InputFormat classes present here https://github.com/awslabs/emr-dynamodb-connector
Design
DynamoDB Sink JSON format:

```json
{
  "name": "DynamoDb",
  "type": "batchsink",
  "properties": {
    "endpointUrl": "",
    "regionId": "us-east-1",
    "accessKey": "xyz",
    "secretAccessKey": "abc",
    "tableName": "Movies",
    "primaryKeyFields": "Id:N",
    "primaryKeyTypes": "Id:HASH",
    "readCapacityUnits": "10",
    "writeCapacityUnits": "10"
  }
}
```
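The `primaryKeyFields` and `primaryKeyTypes` values in the JSON above are `name:value` pairs. A minimal sketch of how such values might be parsed and checked follows. `KeySpec` is a hypothetical class, and the comma separator for multiple pairs is an assumption, since the example shows only a single pair. The checks reflect the HASH/RANGE constraint from the Configurables table and the DynamoDB rule that a primary key has exactly one HASH key and at most one RANGE key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the plugin's actual parsing code.
final class KeySpec {
    /** Parses "name:value,name:value" into an ordered map (comma separator is an assumption). */
    static Map<String, String> parsePairs(String spec) {
        Map<String, String> out = new LinkedHashMap<>();
        if (spec == null || spec.trim().isEmpty()) {
            return out;
        }
        for (String pair : spec.split(",")) {
            String[] kv = pair.trim().split(":", 2);
            if (kv.length != 2 || kv[0].trim().isEmpty() || kv[1].trim().isEmpty()) {
                throw new IllegalArgumentException("Expected name:value but got '" + pair + "'");
            }
            out.put(kv[0].trim(), kv[1].trim());
        }
        return out;
    }

    /** Checks attribute types (S, N, B) and key types (exactly one HASH, at most one RANGE). */
    static void validate(Map<String, String> fields, Map<String, String> types) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("At least one primary key field is required");
        }
        if (!fields.keySet().equals(types.keySet())) {
            throw new IllegalArgumentException("Key fields and key types must name the same fields");
        }
        for (String attrType : fields.values()) {
            if (!attrType.equals("S") && !attrType.equals("N") && !attrType.equals("B")) {
                throw new IllegalArgumentException("Unsupported attribute type: " + attrType);
            }
        }
        int hash = 0;
        int range = 0;
        for (String keyType : types.values()) {
            if (keyType.equals("HASH")) {
                hash++;
            } else if (keyType.equals("RANGE")) {
                range++;
            } else {
                throw new IllegalArgumentException("Key type must be HASH or RANGE: " + keyType);
            }
        }
        if (hash != 1 || range > 1) {
            throw new IllegalArgumentException("Need exactly one HASH key and at most one RANGE key");
        }
    }
}
```

With the sample configuration, `KeySpec.validate(KeySpec.parsePairs("Id:N"), KeySpec.parsePairs("Id:HASH"))` passes, while a configuration with only a RANGE key is rejected before any table is created.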
Approach(s)
- Dropdown with the list of regions will be provided to the user, to select the region for AWS DynamoDB to connect to. Supported regions are:
"us-gov-west-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2", "eu-west-1", "eu-west-2", "eu-central-1", "ap-south-1", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "ap-northeast-2", "sa-east-1", "cn-north-1", "ca-central-1", "getCurrentRegion". (Referred from: http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/ and http://docs.aws.amazon.com/general/latest/gr/rande.html#ddb_region)
- If the user does not select any region, then the default region will be used, i.e. us-west-2.
- Selecting getCurrentRegion from the list returns a Region object representing the region the application is running in, when running in EC2. If this method is called from a non-EC2 environment, it will return null.
- The plugin will support the following CDAP data types in the schema: String, Number (int, long, float, double), Boolean, NULL, Map, List, Array of String, Array of Number, and Bytes (converted to binary when stored in DynamoDB).
- Key value drop-down to take the name of the primary key fields and attribute type. The drop-down will allow the following values: String, Number (int, long, float, double), Boolean, NULL, Map, List, Array of String, and Array of Number.
- Key value drop-down to take the name of the primary key fields and key type. The drop-down will have the following values: "N" (number), "S" (string), and "B" (binary; the byte[] value received from the previous stage will be converted to binary when storing the data in DynamoDB).
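The supported data types listed above line up with DynamoDB's attribute type descriptors (S, N, B, BOOL, NULL, M, L, SS, NS). The sketch below shows one way a record value could be mapped to a descriptor. `AttributeTypes` is a hypothetical class, and treating "Array of String" and "Array of Number" as the set types SS and NS is an assumption; DynamoDB sets require unique elements, so an implementation may prefer the list type L for arrays.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; the array-to-set mapping is an assumption.
final class AttributeTypes {
    /** Returns the DynamoDB type descriptor that would hold the given Java value. */
    static String descriptorFor(Object value) {
        if (value == null) {
            return "NULL";
        }
        if (value instanceof String) {
            return "S";
        }
        if (value instanceof Number) {
            return "N"; // int, long, float and double all map to the single number type
        }
        if (value instanceof Boolean) {
            return "BOOL";
        }
        if (value instanceof byte[] || value instanceof ByteBuffer) {
            return "B"; // bytes are stored as binary
        }
        if (value instanceof Map) {
            return "M";
        }
        if (value instanceof String[]) {
            return "SS"; // assumption: array of String becomes a string set
        }
        if (value instanceof Number[] || value instanceof int[] || value instanceof long[]
                || value instanceof float[] || value instanceof double[]) {
            return "NS"; // assumption: array of Number becomes a number set
        }
        if (value instanceof List) {
            return "L";
        }
        throw new IllegalArgumentException("Unsupported type: " + value.getClass().getName());
    }
}
```

Only S, N, and B are valid for primary key attributes, so a check like this would apply to non-key fields; key fields need the narrower validation.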
Properties
- endpointUrl: AWS endpoint http://docs.aws.amazon.com/general/latest/gr/rande.html#ddb_region This could be reconstructed using regionId.
- regionId: The region for AWS DynamoDB to connect to.
- accessKey: Access key for AWS DynamoDB.
- secretAccessKey: Secret access key for AWS DynamoDB.
- tableName: The table to write the data to.
- primaryKeyFields: The field names to be used as the primary key and their attribute types (for example Id:N).
- primaryKeyTypes: Primary key field names and their key types, HASH or RANGE (for example Id:HASH).
- throughput: Intended throughput for DynamoDB.
- readCapacityUnits: The maximum number of strongly consistent reads consumed per second before DynamoDB returns a ThrottlingException. This will be used when creating a new table if the table name specified by the user does not exist.
- writeCapacityUnits: The maximum number of writes consumed per second before DynamoDB returns a ThrottlingException. This will be used when creating a new table if the table name specified by the user does not exist.
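Since endpointUrl is optional and can be reconstructed from regionId, a minimal sketch of that fallback is shown here. `EndpointResolver` is a hypothetical class; it assumes the standard `dynamodb.<region>.amazonaws.com` host pattern (with the `.com.cn` suffix for China regions), uses us-west-2 as the default when no region is selected, and leaves out the "getCurrentRegion" option, which needs the AWS SDK and an EC2 environment.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not the plugin's actual configuration code.
final class EndpointResolver {
    static final String DEFAULT_REGION = "us-west-2";

    // Region ids from the supported list; "getCurrentRegion" is intentionally not handled here.
    static final Set<String> SUPPORTED_REGIONS = new HashSet<>(Arrays.asList(
        "us-gov-west-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2",
        "eu-west-1", "eu-west-2", "eu-central-1", "ap-south-1", "ap-southeast-1",
        "ap-southeast-2", "ap-northeast-1", "ap-northeast-2", "sa-east-1",
        "cn-north-1", "ca-central-1"));

    /** Returns the configured region, or the default when none is given. */
    static String resolveRegion(String regionId) {
        if (regionId == null || regionId.trim().isEmpty()) {
            return DEFAULT_REGION;
        }
        String region = regionId.trim();
        if (!SUPPORTED_REGIONS.contains(region)) {
            throw new IllegalArgumentException("Unsupported region: " + region);
        }
        return region;
    }

    /** Returns the explicit endpoint if set, otherwise one built from the region id. */
    static String resolveEndpoint(String endpointUrl, String regionId) {
        if (endpointUrl != null && !endpointUrl.trim().isEmpty()) {
            return endpointUrl.trim();
        }
        String region = resolveRegion(regionId);
        String suffix = region.startsWith("cn-") ? "amazonaws.com.cn" : "amazonaws.com";
        return "https://dynamodb." + region + "." + suffix;
    }
}
```

An explicit endpointUrl always wins in this sketch, which keeps local test endpoints usable regardless of the selected region.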
Security
- The AWS access keys should be password fields with macros enabled.
Limitation(s)
Future Work
Test Case(s)
Sample Pipeline
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature