Amazon AuroraDB Source and Sink
- Sree Raman
- Illia
Introduction
Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database offered as a managed service. Users need to write to and read from Aurora DB as part of their data pipelines.
Use case(s)
- Users would like to build a batch data pipeline that reads a complete table from an Amazon Aurora DB instance and writes it to Bigtable.
- Users would like to build a batch data pipeline that performs upserts on Aurora DB tables (see the JDBC sketch after this list).
- Users should get relevant information from the tooltips while configuring the Aurora DB source and Aurora DB sink.
- The tooltip for the connection string should be customized for the specific database.
- The tooltip should describe accurately what each field is used for.
- Users should get field-level lineage for the source and sink being used.
- Reference documentation should be available from the source and sink plugins.
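The upsert use case above is essentially a batched JDBC write. Below is a minimal sketch of what that could look like against an Aurora MySQL cluster; the endpoint, database, table, columns, and credentials are hypothetical, and the MySQL-specific `INSERT ... ON DUPLICATE KEY UPDATE` syntax is used (an Aurora PostgreSQL flavor would use `INSERT ... ON CONFLICT` instead).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class AuroraUpsertSketch {
  public static void main(String[] args) throws SQLException {
    // Hypothetical cluster endpoint, database, and credentials.
    String url = "jdbc:mysql://my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com:3306/testdb";
    String upsert = "INSERT INTO users (id, name) VALUES (?, ?) "
        + "ON DUPLICATE KEY UPDATE name = VALUES(name)";

    try (Connection conn = DriverManager.getConnection(url, "admin", "password");
         PreparedStatement stmt = conn.prepareStatement(upsert)) {
      conn.setAutoCommit(false);
      // Each record from the pipeline becomes one entry in the JDBC batch.
      stmt.setInt(1, 1);
      stmt.setString(2, "alice");
      stmt.addBatch();
      stmt.setInt(1, 2);
      stmt.setString(2, "bob");
      stmt.addBatch();
      stmt.executeBatch();
      conn.commit();
    }
  }
}
```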
Deliverables
- Source code in the data-integrations org
- Integration test code
- Relevant documentation in the source repo and in the plugin's reference documentation section
Relevant links
- Data-integrations org: https://github.com/data-integrations/
- Field-level lineage: https://docs.cdap.io/cdap/6.0.0-SNAPSHOT/en/developer-manual/metadata/field-lineage.html
- Integration test repo: https://github.com/caskdata/cdap-integration-tests
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Design / Implementation Tips
- Amazon's SDK for the Relational Database Service (of which Aurora DB is a part) is only useful for programmatically managing database clusters and instances (creating and deleting databases, changing instance state, managing users, creating dumps to S3, etc.). From the client's perspective, all that is needed to work with Aurora DB is a JDBC driver.
- There are two ways to connect to Aurora DB: using JDBC and using the RDS Data API client. According to the documentation, the Data API client 'provides Http Endpoint to query RDS databases'. This means the data is accessed via a REST API backed by the same JDBC calls, so in addition to the network delay, extra time is spent serializing and deserializing the SDK classes. HTTP is also a poor fit for database batch operations. Another reason to stick with JDBC is that most of the CPU time in the database plugins is spent on data-type mapping and conversion (as discovered during performance testing), not on I/O operations. The trade-offs are summarized below.
| JDBC | RDS API |
|---|---|
| + Drivers are designed to connect to remote databases | + Database-agnostic; the same calls appear to work with a cluster of any type (PostgreSQL, MySQL) |
| + Database connections are reused; connection pooling is available | - Serialization/deserialization overhead |
| - Requires a driver to be provided | - Sends large batches of data as text over the relatively slow HTTP protocol |
| | - Still uses JDBC somewhere in the AWS infrastructure |
| | - Plugin would be responsible for preventing SQL injection |
- Taking the above arguments into account, connecting to the Aurora DB cluster through a JDBC driver appears to be the most rational choice, so it also makes sense to reuse the existing database source and sink functionality from the database-plugins project. A minimal connectivity sketch follows this list.
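To illustrate the "all you need is a JDBC driver" point, here is a minimal sketch of reading a full table from an Aurora MySQL cluster over plain JDBC. The endpoint, credentials, and table name are hypothetical, and the driver jar is assumed to be on the classpath (it is not bundled with the plugin).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class AuroraReadSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical Aurora MySQL cluster endpoint and credentials.
    String url = "jdbc:mysql://my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com:3306/testdb";

    try (Connection conn = DriverManager.getConnection(url, "admin", "password");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT * FROM users")) {
      ResultSetMetaData meta = rs.getMetaData();
      while (rs.next()) {
        // The plugin would map each column's JDBC type to a CDAP schema type here;
        // this sketch just prints the raw values.
        for (int i = 1; i <= meta.getColumnCount(); i++) {
          System.out.print(rs.getString(i) + (i < meta.getColumnCount() ? ", " : "\n"));
        }
      }
    }
  }
}
```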
Design
- For a better user experience, it is suggested to create separate plugins for Aurora DB MySQL and Aurora DB PostgreSQL. The sketch below illustrates the connection defaults that would differ between the two.
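The constants below are illustrative only; the actual plugin names, driver classes, and URL formats would be finalized during implementation.

```java
/**
 * Illustrative connection defaults for the two proposed Aurora plugins.
 * These values are assumptions for the sketch, not finalized plugin config.
 */
public final class AuroraConnectionDefaults {
  // Aurora MySQL flavor
  public static final String MYSQL_DRIVER = "com.mysql.jdbc.Driver";
  public static final String MYSQL_URL_FORMAT = "jdbc:mysql://%s:%d/%s";        // host, port (3306), database

  // Aurora PostgreSQL flavor
  public static final String POSTGRES_DRIVER = "org.postgresql.Driver";
  public static final String POSTGRES_URL_FORMAT = "jdbc:postgresql://%s:%d/%s"; // host, port (5432), database

  private AuroraConnectionDefaults() { }
}
```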
Future Work
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature