Salesforce Sync plugins can help users automatically sync their Salesforce objects to specified destinations with only a few clicks.
Use-case
A user would like to specify multiple Salesforce objects to be synced with corresponding objects in a destination (e.g. BigQuery tables, GCS buckets). The user specifies a set of source objects and connects the appropriate destination (a BigQuery dataset, or a GCS bucket). The pipeline syncs data into the destination separately for each object: the objects are not joined, but are synced independently into their corresponding destination objects. In addition, a user should be able to specify whether to sync an object's data since the beginning of time, or since a given timestamp.
User Stories
- As a data pipeline user, I would like to be able to create a pipeline to sync a single Salesforce object to a specified destination, so that I can use the Salesforce object for analytics in my downstream processes.
- As a data pipeline user, I would like to be able to create a pipeline to sync multiple Salesforce objects to their corresponding destinations, so that I can use the Salesforce objects for analytics in my downstream processes.
Requirements
- Support BigQuery and Cloud SQL as destinations.
- Select source object(s) and a destination. For each object, create a corresponding object in the destination (if it doesn't already exist), and push data to it.
- A user should be able to easily identify the destination object by its name. The name must be unique and must contain the source object name; additional information may be encoded in the destination object name for uniqueness.
- Support full (get all data since the beginning of time) and incremental (get data since a specified timestamp) modes.
- Automatically handle schema changes between source and destination. If a destination exists but its schema differs from the source's because the source schema has been updated, the destination's schema should be updated to match.
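The naming and incremental-mode requirements can be sketched as two small helpers. This is a minimal illustration only: the function names, the suffix-based uniqueness scheme, and the use of Salesforce's SystemModstamp audit field as the incremental timestamp are assumptions, not part of this spec.

```python
import re
from typing import List, Optional

def destination_name(source_object: str, suffix: str = "") -> str:
    """Build a destination object name that is unique and contains the
    source object name; extra information (e.g. an org id) can be
    encoded in the suffix for uniqueness (an assumed scheme)."""
    base = re.sub(r"[^A-Za-z0-9_]", "_", source_object)
    return f"{base}_{suffix}" if suffix else base

def build_query(sobject: str, fields: List[str],
                since: Optional[str] = None) -> str:
    """Return a SOQL query: a full refresh when `since` is None,
    otherwise incremental on records modified after `since`."""
    query = f"SELECT {', '.join(fields)} FROM {sobject}"
    if since is not None:
        query += f" WHERE SystemModstamp > {since}"
    return query
```

For example, `build_query("Account", ["Id", "Name"])` performs a full refresh, while passing a timestamp limits the pull to records modified after it.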
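The schema-change requirement amounts to diffing the source schema against the existing destination schema before each run. A sketch, assuming schemas are represented as simple `{field_name: type_name}` dicts (a representation chosen for illustration only):

```python
def schema_changes(source, destination):
    """Return (fields to add, fields whose type changed) needed to
    bring the destination schema in line with the source schema."""
    added = {n: t for n, t in source.items() if n not in destination}
    changed = {n: (destination[n], t) for n, t in source.items()
               if n in destination and destination[n] != t}
    return added, changed
```

The sink would then issue the corresponding DDL (e.g. add columns) before pushing data; how type changes and dropped fields are handled is a policy decision this sketch leaves open.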
User flow
1. User specifies Salesforce credentials (username, password, clientId, and clientSecret).
2. The source plugin determines the Salesforce instance to use based on these credentials.
3. User selects the Salesforce objects they want to replicate.
   - There may be a limitation here: CDAP currently does not have the capability to list all objects for the user to select from. If that's the case, we may initially require the user to enter the object names manually, and enhance this experience later (perhaps when connections and plugins are unified).
4. User specifies whether to run a "full refresh" (pull all data since the beginning of time) or an incremental sync (pull data since a specified timestamp).
5. User specifies the destination.
   - This is perhaps a different set of sink plugins.
   - There is one such plugin each for GCS, Cloud SQL, BigQuery, and Spanner.
   - For each such plugin, the user specifies only the top-level container, e.g. a bucket for GCS, a dataset for BigQuery, a database for Cloud SQL, and so on.
6. For each selected Salesforce object:
   - The sink plugin creates a corresponding object in the destination, e.g. a directory in the GCS bucket, a table in the BigQuery dataset, or a table in the Cloud SQL database.
   - The sink plugin ensures that the schema of the destination object is consistent with the schema of the source at the time of the run.
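The credential step works because Salesforce's OAuth 2.0 username-password token endpoint returns both an access token and the instance URL, which is how a plugin can determine the instance from credentials alone. A sketch using only the standard library; the helper names are assumptions, while the endpoint and form fields follow the documented Salesforce token request:

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://login.salesforce.com/services/oauth2/token"

def token_request_payload(username, password, client_id, client_secret):
    """Form fields for the OAuth 2.0 password grant."""
    return {
        "grant_type": "password",
        "client_id": client_id,
        "client_secret": client_secret,
        "username": username,
        "password": password,
    }

def resolve_instance(username, password, client_id, client_secret):
    """POST the token request; return (instance_url, access_token)."""
    data = urllib.parse.urlencode(
        token_request_payload(username, password, client_id, client_secret)
    ).encode()
    with urllib.request.urlopen(TOKEN_URL, data=data) as resp:
        body = json.load(resp)
    return body["instance_url"], body["access_token"]
```

All subsequent REST calls (including listing or querying objects) are then made against the returned `instance_url`.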
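The per-object loop at the end of the flow can be illustrated with a runnable toy version, using an in-memory "sink" standing in for the real GCS/BigQuery/Cloud SQL/Spanner sinks; all class and function names here are assumptions made for the sketch:

```python
class InMemorySink:
    def __init__(self):
        self.objects = {}

    def ensure(self, name, schema):
        """Create the destination object if missing; otherwise update
        its schema to stay consistent with the source."""
        table = self.objects.setdefault(name, {"schema": {}, "rows": []})
        table["schema"].update(schema)

    def write(self, name, rows):
        self.objects[name]["rows"].extend(rows)

def sync(objects, read_rows, sink):
    """For each selected object ({name: schema}), create/update the
    corresponding destination object, then push the source rows into
    it. Objects are handled independently, never joined."""
    for name, schema in objects.items():
        sink.ensure(name, schema)
        sink.write(name, read_rows(name))
```

For example, `sync({"Account": {...}, "Contact": {...}}, reader, sink)` produces one destination object per source object, matching the requirement that objects sync separately rather than being combined.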