Introduction

A batch source that ingests data from DynamoDB into Hydrator pipelines.

Use case(s)

  • An organization dumps all of its spam emails into a DynamoDB table. As a data scientist, I want to train my machine learning models in Hydrator pipelines on the data from that table.

User Stories

  • User should be able to provide the table name in DynamoDB.
  • User should be able to provide the AWS endpoint URL for the DynamoDB instance.
  • User should be able to provide the AWS region id for the DynamoDB instance.
  • User should be able to provide the throughput for the DynamoDB instance. (DynamoDB charges are based on throughput, so the user should be able to control it.)
  • User should be able to provide the AWS access id.
  • User should be able to provide the AWS access key.


Plugin Type

  •  Batch Source
  •  Batch Sink 
  •  Real-time Source
  •  Real-time Sink
  •  Action
  •  Post-Run Action
  •  Aggregate
  •  Join
  •  Spark Model
  •  Spark Compute

Configurables

This section defines properties that are configurable for this plugin. 

User Facing Name | Type | Description | Constraints
Table name | String | Name of the DynamoDB table | Naming convention constraints from AWS
endpoint url | String | AWS endpoint URL for the DynamoDB instance | Optional, could be reconstructed using regionId
region id | String | AWS region id for the DynamoDB instance |
throughput | Int | Intended throughput for DynamoDB | Optional
access id | String | AWS access id |
access key | password | AWS access key |
query | String | Query to get the data |
partition key | String | Partition key to get the data |
sort key | String | Sort key to refine/sort the fetched data | Optional
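
The "naming convention constraints from AWS" on the table name could be checked when the pipeline is configured. The following is a minimal sketch, assuming DynamoDB's documented rule that table names are 3 to 255 characters long and may contain only a-z, A-Z, 0-9, underscore, hyphen, and dot. The function name validate_table_name is hypothetical and not part of the plugin.

```python
import re

# DynamoDB table names: 3-255 characters from a-z, A-Z, 0-9, '_', '-', '.'
_TABLE_NAME_PATTERN = re.compile(r"[A-Za-z0-9_.\-]{3,255}")


def validate_table_name(name: str) -> None:
    """Raise ValueError if `name` is not a legal DynamoDB table name."""
    if not _TABLE_NAME_PATTERN.fullmatch(name):
        raise ValueError(
            f"Invalid table name {name!r}: expected 3-255 characters "
            "from a-z, A-Z, 0-9, '_', '-', '.'"
        )


validate_table_name("Movies")  # passes silently
```

Failing early on an illegal name avoids starting a batch run that can only fail once the first request reaches DynamoDB.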

Design / Implementation Tips

Design

We will provide a dropdown listing the supported regions, so the user can select the AWS DynamoDB region to connect to.

DynamoDB JSON format:

{
  "name": "DynamoDb",
  "type": "batchsource",
  "properties": {
    "accessKey": "xyz",
    "secretAccessKey": "abc",
    "regionId": "us-east-1",
    "endpointUrl": "localhost:8000",
    "tableName": "Movies",
    "throughput": "10",
    "query": "ID = :v_ID",
    "partitionKey": "Id",
    "sortKey": "salary"
  }
}
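
In the JSON above every property value is a string, while the Configurables table types throughput as Int. The following is a minimal sketch of reading such a stage definition and coercing throughput. The helper name read_properties and the choice of which keys count as required are assumptions made for illustration; only endpointUrl, throughput, and sortKey are marked optional in this document.

```python
import json

# Assumption: these keys are mandatory; endpointUrl, throughput and sortKey
# are the ones marked optional in the Configurables table.
REQUIRED = ("accessKey", "secretAccessKey", "regionId", "tableName",
            "query", "partitionKey")


def read_properties(stage_json: str) -> dict:
    """Parse a stage definition and return its properties, throughput as int."""
    stage = json.loads(stage_json)
    props = dict(stage["properties"])
    missing = [key for key in REQUIRED if not props.get(key)]
    if missing:
        raise ValueError(f"Missing required properties: {', '.join(missing)}")
    if "throughput" in props:
        props["throughput"] = int(props["throughput"])  # "10" -> 10
    return props


example = """
{
  "name": "DynamoDb",
  "type": "batchsource",
  "properties": {
    "accessKey": "xyz",
    "secretAccessKey": "abc",
    "regionId": "us-east-1",
    "tableName": "Movies",
    "throughput": "10",
    "query": "ID = :v_ID",
    "partitionKey": "Id"
  }
}
"""
props = read_properties(example)
print(props["tableName"], props["throughput"])
```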

Approach(s)

Properties

  • endpointUrl: AWS endpoint for DynamoDB (see http://docs.aws.amazon.com/general/latest/gr/rande.html#ddb_region). Optional; it could be reconstructed using regionId.
  • regionId: The AWS DynamoDB region to connect to.
  • accessKey: Access key for AWS DynamoDB.
  • secretAccessKey: Secret access key for AWS DynamoDB.
  • tableName: The table to read the data from.
  • throughput: Intended throughput for AWS DynamoDB.
  • query: The query that will fetch the data from the table.
  • partitionKey: Partition key that will be used to fetch the data from the table.
  • sortKey: Sort key that will be used to sort/refine the data fetched from the table.
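
The note that endpointUrl could be reconstructed from regionId can be sketched as follows. This assumes the standard public endpoint pattern dynamodb.<region>.amazonaws.com, with the .com.cn suffix for China regions. The function name resolve_endpoint is hypothetical, and special cases such as FIPS endpoints are not handled.

```python
from typing import Optional


def resolve_endpoint(region_id: str, endpoint_url: Optional[str] = None) -> str:
    """Return the explicit endpoint if given, else derive one from the region."""
    if endpoint_url:
        return endpoint_url  # e.g. "localhost:8000" for a local DynamoDB
    suffix = "amazonaws.com.cn" if region_id.startswith("cn-") else "amazonaws.com"
    return f"https://dynamodb.{region_id}.{suffix}"


print(resolve_endpoint("us-east-1"))                    # https://dynamodb.us-east-1.amazonaws.com
print(resolve_endpoint("us-east-1", "localhost:8000"))  # localhost:8000
```

Letting an explicit endpointUrl take precedence keeps local testing possible, while the derived value covers the common case where only the region is supplied.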

Security

  • The AWS access keys should be password fields and should be macro-enabled.

Limitation(s)

Future Work

Test Case(s)

Sample Pipeline

 

 


Checklist

  •  User stories documented 
  •  User stories reviewed 
  •  Design documented 
  •  Design reviewed 
  •  Feature merged 
  •  Examples and guides 
  •  Integration tests 
  •  Documentation for feature 
  •  Short video demonstrating the feature