Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post

Introduction 

One common use case is that of a user running a relational database with multiple tables. They would like to create copies of those tables in a data warehouse like BigQuery in a single, simple operation. All existing data should be copied first, then new changes (inserts, updates, deletes) that are applied to the relational db tables should be reflected in the BigQuery tables within minutes. Newly created tables in the relational db should automatically appear in BigQuery. Tables that are deleted in the relational db should be delete in BigQuery. Compatible schema changes should also be reflected.

Pipelines are usually not suitable for these types of use cases, which more closely resemble replication than incremental loads. It is possible to incrementally load data from a single table to a single BigQuery table if the table never has deletes or updates and has a monotonically increasing column. Most users do not have a write pattern like this, so a better solution is required. 

Goals

Design a way for users to easily create a continuously updating copy of their existing data.

User Stories

  1. As a data admin, I want to be able to copy data from Oracle, MySQL, or SQL Server
  2. As a data admin, I want to be able to copy data into BigQuery, Spanner, Redshift, or Snowflake
  3. As a data admin, I want my copied tables to be updated within 5 minutes of the source table being updated
  4. As a data admin, if a data write failed but could possibly succeed on retry (ex: system temporarily down, somebody revoked privileges on the service account, etc), I want the copy to be paused until the problem is fixed
  5. As a data admin, if a data write failed in a way that cannot succeed on retry (ex: db column type was altered, which isn't supported in the destination), I want the data to be captured somewhere so that
is continuously updated. 

User Stories 

  1. Breakdown of User-Stories 
  2. User Story #1
  3. User Story #2
  4. User Story #3they can be handled manually
  5. As a data admin, I want to know how many records failed to write to each destination table
  6. As a data admin, I do not want any data to be lost even if the copy process crashes
  7. As a data admin, I do not want duplicate data in the destination even if the copy process crashes
  8. As a data admin, I want to be able to tell how far behind my destination tables are compared to my source tables
  9. As a data admin, I want to have some metrics around how quickly data is being copied
  10. As a data admin, I want to be able to pause and resume my data copy processes
  11. As a data admin, I want to be able to delete my data copies
  12. As a data admin, I want to be able to select a subset of database tables to copy to my destination system
  13. As a data admin, I want to be able to configure whether new tables in my source database should automatically appear in my destination system
  14. As a data admin, I want to be able to configure whether tables that are dropped in my source database should automatically be dropped in my destination system
  15. As a data admin, I want to be able to see logs about my data copy in case there are issues (out of memory, permissions errors, etc)
  16. As a data admin, I want to be able to find documentation about what type of database setup I need to perform on my source database
  17. As a data analyst, I want queries executed against BigQuery, Redshift, or Snowflake to return results consistent with a state that the origin was in at some point in the past (transactions are honored) (future work)

Design


Cover details on assumptions made, design alternatives considered, high level design

Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

PathMethodDescriptionResponse CodeResponse
/v3/apps/<app-id>GETReturns the application spec for a given application

200 - On success

404 - When application is not available

500 - Any internal errors







Deprecated REST API

PathMethodDescription
/v3/apps/<app-id>GETReturns the application spec for a given application

CLI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

UI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

Security Impact 

What's the impact on Authorization and how does the design take care of this aspect

Impact on Infrastructure Outages 

System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect

Test Scenarios

Test IDTest DescriptionExpected Results












Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3


Future work