Data Fusion CI/CD Best Practices
1. Introduction
This document is a WIP.
1.1 Why is CI/CD important?
Using DevOps principles including Continuous Integration and Continuous Delivery / Continuous Deployment (CI/CD) is part of the scalability pillar of the Google Cloud Adoption Framework (GCAF). The diagram below shows the GCAF Cloud Adoption epics, where you can see CI/CD referenced as part of the "Scale" pillar (shown in blue).
By using CI/CD techniques we can scale our development team, and allow them to focus on developing code and reduce the overhead of things like manual testing and deployment. We do this by automating many of the steps that may have been done manually. This also has the added benefit of reducing risks associated with errors during the deployment process as checks are built into the process. It also helps to shorten the time to release, and allows teams to deploy smaller releases more often.
This article will not go into depth on what CI/CD is and we will assume the reader is familiar with DevOps and CI/CD. For more information about CI/CD, please refer to one of the following resources:
1.2 How CI/CD applies to Data Engineering
Data Engineers are often required to quickly onboard new data or modify existing data pipelines with relatively lean teams. As more and more data pipelines are created over time these can become an operational burden to manage if done manually. By adopting CI/CD practices, we can allow these teams to focus on the value-adding components such as pipeline creation and manipulation.
Many existing CI/CD guides focus on software development. Data Engineering pipelines come with some unique constraints that need to be addressed specifically, and this document aims to address them.
1.3 What this guide will cover
This guide will provide an overview of how to use GCP tools to implement a CI/CD pipeline for Cloud Data Fusion. It will also touch on deployment of Cloud Composer pipelines, since Cloud Composer is commonly used to orchestrate Data Fusion pipelines. Throughout this document you will find tips and best practices highlighted in callout boxes.
1.4 Intended Audience
This guide is intended for developers of Data Fusion pipelines who want to understand how to apply CI/CD processes to pipeline development with Data Fusion. It also covers Composer deployments where they relate to Data Fusion.
1.5 Assumed Knowledge
There is some assumed knowledge for readers including:
Knowledge of how to use git. If this is a new topic to readers, they can view the git-scm site for tutorials and guides to learn git.
Familiarity with Cloud Data Fusion pipeline development
Familiarity with Google Cloud Platform products (including but not limited to Cloud Source Repositories, Cloud Build, Cloud Composer, Cloud Data Fusion, Cloud Dataproc, Stackdriver Monitoring, Stackdriver Logging, BigQuery, Google Cloud Storage)
Familiarity with CI/CD concepts
Familiarity with bash scripting
2. Environment
This section shows one possible configuration of Data Fusion and Composer. As Data Fusion evolves, this recommended setup is likely to change. In particular the rollout of Role Based Access Controls (RBAC) is likely to have an effect on this design. However, the underlying CI/CD Principles will be the same.
2.1 Data Processing Environment
The above diagram shows the environments that would be used for development of data ingestion pipelines:
Non-Prod Instance (left): An enterprise edition of Cloud Data Fusion with administrator access for developers
namespace: dev - for IT development of data pipelines
Production Instance (right): This would be an enterprise edition of Cloud Data Fusion with view-only access for developers. Operations should have Administrator access.
namespace: test - for integration testing
namespace: prod - for operational execution of pipelines - restricted access would apply for developers (roles/datafusion.viewer)
2.2 CI/CD Environment
To create a CI/CD pipeline with Data Fusion, we'll need to address the following components:
Version Control: This will be used to store Data Fusion pipeline code and key artifacts so we can keep track of changes and manage the code review process. We'll also be using version control merges to trigger the promotion process that will move our pipelines between environments.
Deployment: To promote our pipelines, we need a tool that triggers an action when a check-in occurs, usually a set of unit tests or a command that promotes our code between environments. We will also use the CDAP Pipeline REST API to perform actions against our Data Fusion environment.
Testing: We can break this down into two types of tests:
Data Fusion Pipeline testing: We will address how to test a Data Fusion pipeline to ensure it functions as expected.
Composer Integration testing: To ensure that all our objects work together end-to-end, we will need to perform a complete run of a Composer pipeline in a test environment. To do this we will execute Composer pipelines that trigger a number of Data Fusion pipelines, and then validate that these ran correctly and completed processing within a reasonable timeframe.
Secret Management: To deploy our pipeline from Cloud Build using the CDAP API, we'll need a place to store our service account key file so we can generate an access token we can use when we call the CDAP API.
2.2.2 High-Level CI/CD Architecture
Below is a typical CI/CD pipeline:
2.2.2.1 Data Fusion CI/CD flow
For Data Fusion, we can implement the migration from the Dev to Test environment as shown:
The numbered lines shown in black are steps that need to be executed by the development team. The other actions would be set up one time only and applied automatically by the CI/CD process.
To start the migration process, the developer would need to perform the following steps:
Create a pipeline in the Cloud Data Fusion development environment.
Export the pipeline JSON file.
Commit the pipeline to version control.
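Submit a pull request to merge from the feature branch into the develop branch (a merge to the develop branch triggers the deployment to the Test environment, see Section 3.1.2 Branching Structure). Where there are code review tasks set up within the version control process, these would take place at this stage.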
Once the code has been merged, an automated process can take over the deployment and testing. This involves:
The 3rd Party Version Control repository must be set up for mirroring with a Cloud Source Repository. Alternatively, you can also trigger directly from a 3rd party git app if you prefer - see here.
When Cloud Build detects a new change on the Cloud Source Repository (as defined by the trigger), it automatically runs the actions specified in the associated cloudbuild.yaml file.
The cloudbuild.yaml file (also stored in version control) should be set up to:
Deploy the pipeline using the CDAP REST APIs from the development environment to the test environment.
Start the pipeline execution using the CDAP REST APIs on the test environment.
Wait for pipeline completion, then run automated testing over the pipeline to ensure it ran successfully.
Details of the Cloud Build deployment process will be addressed in later sections when we discuss continuous deployment.
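As a rough illustration of the start-and-wait steps listed above, the sketch below starts a batch pipeline and polls its most recent run through the CDAP REST API. It assumes AUTH_TOKEN and CDF_API are set in the environment, that the pipeline is a batch pipeline (whose workflow is named DataPipelineWorkflow), and that jq is available in the build image. The function names, attempt count, and delay are hypothetical choices for this example, not part of the original scripts.

```shell
# Start a deployed batch pipeline. Arguments: <namespace> <pipeline-name>
start_pipeline() {
  curl -s -X POST -H "Authorization: Bearer ${AUTH_TOKEN}" \
    "${CDF_API}/v3/namespaces/$1/apps/$2/workflows/DataPipelineWorkflow/start"
}

# Print the status of the most recent run. Arguments: <namespace> <pipeline-name>
# Assumption: jq is installed; the runs endpoint returns a JSON array of run records.
latest_run_status() {
  curl -s -H "Authorization: Bearer ${AUTH_TOKEN}" \
    "${CDF_API}/v3/namespaces/$1/apps/$2/workflows/DataPipelineWorkflow/runs?limit=1" \
    | jq -r '.[0].status'
}

# Poll until the run completes, fails, or the attempt budget is used up.
# Arguments: <namespace> <pipeline-name> [max-attempts] [delay-seconds]
# Returns 0 on COMPLETED, 1 on FAILED or KILLED, 2 on timeout.
wait_for_pipeline() {
  namespace="$1"; pipeline="$2"; max_attempts="${3:-60}"; delay="${4:-30}"
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    status="$(latest_run_status "$namespace" "$pipeline")"
    case "$status" in
      COMPLETED) return 0 ;;
      FAILED|KILLED)
        echo "Pipeline ${pipeline} ended with status ${status}" >&2
        return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "Timed out waiting for pipeline ${pipeline}" >&2
  return 2
}
```

A Cloud Build step could call start_pipeline followed by wait_for_pipeline and let a non-zero return code fail the build, so that a failed or stalled test run blocks the promotion.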
Once the migration to the test environment has been successful, deployment to production is a simple process that involves merging the develop branch into the main (production) branch. Branch restrictions would be applied to the main branch, as any merge will trigger a deployment to production.
Additional Resources:
For publicly available information about CI/CD with CDAP see:
Medium CDAP.IO blog - CI/CD and Change Management for Pipelines - Part 1: Covers the overall process and defines the concepts.
Medium CDAP.IO blog - CI/CD and Change Management for Pipelines - Part 2: Focuses on how to extract artifacts from a CDF/CDAP environment and store them on GitHub.
Medium CDAP.IO blog - CI/CD and Change Management for Pipelines - Part 3: Discusses the process for migrating artifacts from GitHub into a TEST, QA, or PROD environment.
Note that some instructions in these CDAP documents may differ slightly from those for Cloud Data Fusion, but the overall process is similar.
2.2.2.2 Composer CI/CD flow
For Composer, we can implement the migration from the Dev to Test environment as shown below. The process is essentially the same as for Data Fusion, apart from Steps 1 and 2, since the authoring process is different and the deployment process involves copying DAG files to the Composer GCS bucket. The Cloud Build deployment process is also different; this will be addressed in later sections when we discuss continuous deployment.
2.2.3 Version Control System
Before we can begin creating a CI/CD pipeline, we must have a version control system (VCS) in place. This is the foundation that is used to control activities in the CI/CD process.
In this design, we use a third-party version control system such as GitHub, BitBucket or GitLab. This allows us to implement a code review process and use branch merge restrictions. The third-party repository can be mirrored to a Google Cloud Source Repository through the Cloud Source Repositories console. Mirroring allows us to use a Cloud Source Repository as a trigger source for Cloud Build. More details on the configuration of the version control repository are provided in Section 3.1 Version Control.
2.2.4 Integrated Development Environment
We recommend developers set up a local development environment to work with that is integrated with your version control system. This makes it easier to manage files from a single place, and group and commit changes using the built-in capabilities of the IDE.
For instructions on how to set your IDE up with version control, please refer to instructions for the specific version of the IDE you are using.
3. Continuous Integration
The core of Continuous Integration is version control. This section will outline how to work with Version Control.
3.1 Version Control
3.1.1 Version Control Structure
The following structure is recommended to initialize the version control repository. Note that this is likely to change over time as use cases and user needs are taken into consideration.
Folders
composer/
    env-parameters/
        dev/
        test/
        prod/
    dags/
    dag-parameters/
    dag-dependencies/ - e.g. read_params.py - common across systems
datafusion/
    pipelines/ - JSON files
    pipeline-arguments/ - argument JSON files
    pipeline-tests/ - input csv files used for testing
    metrics/ - query JSON files
    plugins/
    plugin-tests/ - note that tests can also be contained within the maven build - see the "example-transform" code
    udds/
    udd-tests/
bigquery/
    environment-definitions/
        dev/
        test/
        prod/
    table-definitions/ - audit tables (manually created)
    view-definitions/ - custom views (manually created)
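The folder structure above could be initialized with a short script such as the following sketch. The nesting is inferred from how the folders are referenced elsewhere in this guide (for example composer/dags and datafusion/plugins), and placing metrics/ under datafusion/ is an assumption. The function name and the use of .gitkeep placeholder files are also choices made for this example, since git does not track empty directories.

```shell
# Create the recommended repository skeleton under a given root directory.
# Argument: [root-dir] (defaults to the current directory)
init_repo_structure() {
  root="${1:-.}"
  for dir in \
    composer/env-parameters/dev \
    composer/env-parameters/test \
    composer/env-parameters/prod \
    composer/dags \
    composer/dag-parameters \
    composer/dag-dependencies \
    datafusion/pipelines \
    datafusion/pipeline-arguments \
    datafusion/pipeline-tests \
    datafusion/metrics \
    datafusion/plugins \
    datafusion/plugin-tests \
    datafusion/udds \
    datafusion/udd-tests \
    bigquery/environment-definitions/dev \
    bigquery/environment-definitions/test \
    bigquery/environment-definitions/prod \
    bigquery/table-definitions \
    bigquery/view-definitions
  do
    # A placeholder file keeps the otherwise empty folder under version control.
    mkdir -p "${root}/${dir}" && touch "${root}/${dir}/.gitkeep" || return 1
  done
}
```

Running init_repo_structure from the root of a freshly cloned repository, followed by a normal add and commit, would give every developer the same starting layout.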
3.1.2 Branching Structure
The branching structure to support our git workflow will be configured as follows:
The "main" branch is reserved for production code that is deployable. It should only contain stable code, meaning that it should have passed all tests and reviews prior to any code being merged. When you first create a repository this branch is created by default.
The "develop" branch is used to share code in development that is ready for testing.
The "feature" branches are used for individual development. Often these branches contain code that is in development and may not be ready for testing.
3.1.2.1 Folders and Branches
The following diagram shows how folders and branches interact with one another. Folders are generally the same across each branch (unless new folders are created on feature branches and are not yet merged). The diagram below shows how a user would create a pipeline "mypipeline", how this would go through each branch until it reaches production, and how deployments would occur after each branch merge.
3.1.3 Version Control Repository Configuration
The third party version control repository will be set as the working repository. Within this working repository the following actions will take place:
Developers will track ("git add"), commit ("git commit"), and push ("git pull" followed by "git push") pipeline code to the remote.
Developers will pull ("git pull") the latest code from the remote.
When the developer is ready to merge from one branch to another (e.g. from a feature branch to the develop branch), they will submit a "pull request" to signal they would like to merge. Within third-party version control systems this can usually be done within the UI.
You can find more information about pull requests (aka merge requests) with Bitbucket in the Atlassian Bitbucket Server documentation, for GitHub in the GitHub documentation, and for GitLab in the GitLab documentation. For more general information, see this Atlassian Tutorial on Pull Requests.
Within the third-party version control system (e.g. BitBucket, GitHub, GitLab), conditions will be set up on the branches (branch permissions) to indicate what should happen when a pull request is made. You can find more information about branch permissions available for Bitbucket in the Atlassian Bitbucket documentation, for GitHub in the GitHub documentation, and for GitLab in the GitLab documentation.
Prior to a merge to the develop branch, a code review should take place. The primary purpose of this review is to ensure understanding across the development team (avoiding a knowledge base centered around a single person).
Prior to a merge to the main branch, "admins" should approve the main branch merge. Admins could be code owners, the operations team or the security team for example. The purpose of this approval is to ensure that the timing for pushing to production causes minimal interference with business processes while the upgrade takes place. Depending on the code base being deployed, sometimes security reviews will also occur at this point to ensure code doesn't introduce vulnerabilities. This should be done if the code is going to connect to external systems outside of the GCP environment.
A second, read-only Cloud Source Repository can be created to mirror the Bitbucket Server repository. Developers do not interact directly with this repository, which we ensure by making it read-only. The purpose of mirroring the third-party version control repository is to allow us to trigger a Cloud Build process. How these processes are triggered is discussed further in the following sections. The main thing to note here is that a merge to the develop branch triggers a deployment to the Test environment, and a merge to the main branch triggers a deployment to the Production environment.
The process described in this subsection is also drawn below for reference. In this example the third-party version control system used is BitBucket Server:
3.1.4 Git Workflow
Section 3.1.2 describes a simple branching implementation that allows us to follow the GitHub Flow methodology for branching. The GitHub site describes GitHub Flow in an easy-to-read format here.
Developers new to version control should begin by adopting GitHub Flow, since it is likely to meet the needs of the customer initially without introducing a high degree of complexity.
Note that there are a number of branching strategies that could be adopted. For more complex scenarios, this document provides a good overview of another approach that can be used and this same workflow is also described on the Bitbucket site.
3.1.5 Dealing with secret management mishaps in version control
It's important to ensure that secrets such as username and password information are not stored in version control, since the code base is more widely accessible.
If code containing secrets is accidentally committed, the secrets should be removed immediately. Because version control tracks history, it is not as simple as deleting the file and re-committing; the commit history also needs to be updated. If you catch this soon after committing, you can use the "git reset --hard <commit-id>" command to return the branch to an earlier commit. Note that this command only rewrites your local history, so if the commit has already been pushed, the history on the remote needs to be rewritten as well. Note: Only an experienced version control administrator should perform this action, since it removes commits and there is the potential to lose code. For secrets that have been committed and available in the repository for an extended period of time, it is recommended that you change the secret itself (for example, change your password), as it is likely the secret has already been exposed.
To manage secrets a secret management tool is generally used. Doing this gives the ability to refer to a secret without revealing the secret's contents. How this is used for deployment with Cloud Build is described in further detail in Section 4.2 Cloud Build Deployment Secret Management.
3.2 Committing to version control
3.2.1 Committing Data Fusion files to version control
This section describes how to export pipeline code from Data Fusion and commit it to Version Control. A high-level view of this process is described in Section 2.2.2.1 Data Fusion CI/CD flow.
The following sections go a level deeper to describe how each artifact e.g. code or file, should be gathered and committed to version control.
3.2.1.1 Data Fusion Pipeline JSON files
Before you can commit your pipeline to version control, you first need to export a copy of it. You would do this by going to the Actions menu within Studio and clicking Export as shown:
Tip: When you export the pipeline it's a good idea to settle on a standard nomenclature that suits your environment. For example <application>_<usecase>_pipeline_<version>.json.
Once you have the pipeline JSON file you can commit this to version control. To do this:
First, copy the file to your version control directory by locating your cloned repository and copying the file into it. So, if you cloned your repository to C:/dev/myrepo/, then you would navigate to the datafusion/pipelines folder and paste the pipeline.json file into it.
Double check that you are on the correct branch e.g. a feature branch. If not, switch branches ("git checkout <branch-name>").
Next, stage the file ("git add") so that it is tracked by your version control repository.
When you've staged all the files you want to check in, proceed with a commit ("git commit -m <commit-description>"). This will create a new local commit that only you will be able to access.
When you want to share this with other developers, do a "git pull" followed by a "git push". This pull retrieves the latest code from the remote after which you can push your latest code back. The code is now accessible to other developers who pull updates from the branch you pushed to.
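The steps above can be sketched as a small helper. This is only an illustration: the function name and argument order are hypothetical, and the helper switches branches before copying the file so that the exported JSON never lands on the wrong branch.

```shell
# Copy an exported pipeline JSON file into the repository and commit it locally.
# Arguments: <repo-dir> <exported-json> <branch-name> <commit-message>
commit_pipeline() {
  repo="$1"; json="$2"; branch="$3"; message="$4"
  # Switch to the target branch, creating it if it does not exist yet.
  git -C "$repo" checkout "$branch" 2>/dev/null \
    || git -C "$repo" checkout -b "$branch" || return 1
  # Place the file in the pipelines folder and stage it.
  cp "$json" "${repo}/datafusion/pipelines/" || return 1
  git -C "$repo" add "datafusion/pipelines/$(basename "$json")" || return 1
  # Create the local commit; it is not shared until it is pushed.
  git -C "$repo" commit -m "$message"
}

# To share the commit afterwards, run "git pull" followed by "git push"
# from inside the repository, as described in the last step above.
```

Keeping the push as a separate manual step leaves the developer a chance to review the commit before it becomes visible to others.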
3.2.1.2 Other Data Fusion migration files and dependencies
When moving across different environments, we also need to make sure that we move all dependencies that are required by the pipeline as well. These are likely to include:
Plugins that you have deployed from the Hub
Custom Plugins or Wrangler UDDs that you may have created
Python dependencies that you may have created
Reusable Pipeline Argument files
Other Data Fusion pipelines that your current pipeline depends upon
However, if orchestration is managed by Composer, dependencies for pipelines are unlikely to exist in Data Fusion.
Schedule files
However, if Composer is used for orchestration, Data Fusion will not handle scheduling and therefore there will be no scheduling files to deploy.
The sections below describe how to store the Data Fusion Plugins and UDDs in Version Control.
Custom Data Fusion Plugins
Custom plugins developed by a customer need their own deployment process.
To deploy custom plugins, place the Java code for the plugin itself into the datafusion/plugins folder and copy any unit test code into the datafusion/plugin-tests folder.
Once this is done, the third-party version control repository is automatically mirrored to the Cloud Source Repository. When the code is merged, a Cloud Build process is triggered automatically that compiles the code (with Maven), creates the artifacts (JAR and JSON files), and deploys them to your environment.
Wrangler User Defined Directives (UDDs)
Wrangler provides an array of built-in functions and directives to transform your data. However during the course of development, the need may arise to build additional functionality in the form of Wrangler User Defined Directives (UDDs).
A directive is a single data manipulation instruction, specified to either transform, filter, or pivot a single record into zero or more records. You can use the Wrangler UI to add a directive to a recipe.
UDD code can be developed and checked into the Version Control repository in the datafusion/udds folder for the UDD code itself and datafusion/udd-tests for any unit test code.
3.2.2 Committing Composer files to version control
There are often a number of objects associated with Composer - some of these will need to be checked into version control. These may include the following:
An Environment Parameter file - a JSON file containing environment specifications. There is one file per environment (i.e. dev, test, prod). This should be placed in the composer/env-parameters folder, in the respective environment sub-folder.
The Composer DAG code - a python file containing the pipeline code in the form of Directed Acyclic Graph (DAG) code. This should be placed in the composer/dags folder.
A DAG Parameter file - a JSON file containing the parameters needed to run a Composer DAG and to generate argument files for Data Fusion pipelines. There is one file per DAG. It includes configuration information about the type of source (e.g. GCS file, sFTP, API), system name, retry attempts, delimiters, etc., and can be used when you want to specify all parameters for a source. This should be placed in the composer/dag-parameters folder.
Python modules that are used by Composer - these are provided as individual Python files. These should be placed in the composer/dag-dependencies folder.
Python dependencies used by Composer - these are provided to Composer in a requirements.txt file. These should be placed in the composer folder.
3.2.3 Committing SQL code files to version control
Some additional BigQuery objects often need to be created independently of the Composer and Data Fusion pipelines. These should be stored in version control in the bigquery folder.
One such object is the BigQuery Audit Table which captures details from each Composer run. This should be stored in the bigquery/table-definitions folder. It's possible to automate the deployment of this by specifying the environment (project-id and dataset-id) in the associated environment folder e.g. bigquery/environment-definitions/dev with the appropriate permissions. A simple shell script would allow the creation of the complete SQL statement that we can execute at build time.
Following this approach allows us to have a single parameterised CREATE TABLE statement and Cloud Build will insert the environment-specific parameters (project id and dataset name) upon deployment.
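A minimal sketch of such a shell script is shown below. It assumes the CREATE TABLE statement is stored as a template containing the placeholders {{PROJECT_ID}} and {{DATASET_ID}}; these placeholder names, the function name, and the example file name are hypothetical and not taken from the original deployment scripts.

```shell
# Render a parameterised SQL template by substituting environment-specific values.
# Arguments: <template-file> <project-id> <dataset-id>
# The rendered statement is written to standard output.
render_sql() {
  template="$1"; project_id="$2"; dataset_id="$3"
  # "|" is used as the sed delimiter because it cannot appear in project or dataset IDs.
  sed -e "s|{{PROJECT_ID}}|${project_id}|g" \
      -e "s|{{DATASET_ID}}|${dataset_id}|g" \
      "$template"
}

# Example (hypothetical file name):
#   render_sql bigquery/table-definitions/audit_table.sql my-project my_dataset
```

At build time, the project and dataset values could be read from the matching bigquery/environment-definitions sub-folder and the rendered statement passed to the BigQuery command-line tool.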
There may also be other objects created in the future that are not auto-generated by Data Fusion or Composer. Such files should be stored in the bigquery folder in version control, and the developer could update the scripts to automate the dataset and table creation across environments.
3.3 Pull Request Processes
3.3.1 Code Review
A "code review" i.e. a peer review, should be performed by another developer on the same team. This helps to ensure that a peer understands the code or data pipeline that is produced, and provides an opportunity to improve the shared knowledge base across the team of what's been created.
As mentioned previously this would occur when the developer performs a pull request to merge to the "develop" branch.
Peer reviews should aim to help improve the overall code base/solution. The intent behind code reviews is to help each other to learn & develop, improve the quality of the code, and to reduce time to production.
What should a peer review of the pipeline code cover?
Google has open-sourced its internal code review guide, which you can find on GitHub here. It discusses code reviews from the perspective of software engineering, so some guidance may differ slightly when working with data pipelines, but it is a good first point of reference for understanding the origins of code reviews.
The code review should be focused around just a couple of elements initially. If developers would like to add to this list in future they can do this, but initially it is more important to get the process established and then modify this process as needs change:
Readability: Another developer should be able to look at the pipeline that has been developed and understand what the developer is trying to accomplish.
Functionality: Does the code do what it is supposed to?
Object Naming: Names of objects should be easy to interpret. This is important for logging and auditing. If the name is insufficiently descriptive, troubleshooting issues will become a challenge. Some naming conventions should be followed to assist with this.
(Optional) Documentation: If additional documentation is required to explain what the pipeline is doing beyond what is in the code, then this should be included as one of the code review items.
3.3.2 Production Merge Approval
A merge approval takes place when merging to the "main" branch, because a merge to the main branch triggers a deployment to production.
The purpose of this merge approval is to ensure that the deployment occurs at the right time, and minimizes disruption to production systems. In cases where the system is external-facing such as websites, sometimes there are also security reviews at this point. However, where the systems are internal, this is not usually required. One exception to this is where data pipelines may reach out to external sources. The need for these review tasks can be assessed by the merge approver.
4. Continuous Deployment
In the previous section we discussed version control and how to set up your repositories so you can merge to develop and main branches. This section will cover details about how to automate deployments using Cloud Build.
Example deployment scripts are captured in Appendix A and referred to throughout the following sections. The naming convention for deployment scripts is:
{object-type}-cloudbuild-{environment-abbreviation}.yaml
4.1 The Deployment Process
Once code is merged to a branch, we need to define what happens from a deployment perspective. There are two key moments when a merge will trigger a deployment, namely:
When code is merged to the "develop" branch, a deployment to the Test environment will occur.
When code is merged to the "main" branch, a deployment to the Production environment will occur.
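The mapping between branches, environments, and deployment script names can be expressed in a few lines of shell. In this sketch the environment abbreviations "test" and "prod" and the object-type value used in the example are assumptions; only the overall naming pattern comes from this guide.

```shell
# Map a branch name to the environment that a merge to it deploys to.
# Prints the environment abbreviation, or returns 1 for branches that do not deploy.
env_for_branch() {
  case "$1" in
    develop) echo "test" ;;
    main)    echo "prod" ;;
    *)       return 1 ;;   # e.g. feature branches do not trigger a deployment
  esac
}

# Build a deployment script name following
# {object-type}-cloudbuild-{environment-abbreviation}.yaml
# Arguments: <object-type> <environment-abbreviation>
build_script_name() {
  echo "${1}-cloudbuild-${2}.yaml"
}

# Example with a hypothetical object type:
#   build_script_name composer-dag "$(env_for_branch main)"
```

Keeping this mapping in one place makes it harder for a trigger on one branch to pick up the build file intended for another environment.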
To automate deployment, we will need to set up some Cloud Build scripts. The following subsections describe the build process for each type of object we're deploying.
For any command line arguments referring to AUTH_TOKEN or CDF_API, you can set them as follows (if you want to run under your own account), updating the CDF API endpoint with your own.
export AUTH_TOKEN=$(gcloud auth print-access-token)
export CDF_API=https://customerdemo-dev-demonstration-wonderland-dot-usw1.datafusion.googleusercontent.com/api
4.1.1 Deploying Data Fusion Pipelines
For an example of a Cloud Build script to deploy a folder containing many Data Fusion pipelines, see Appendix A1: Cloud Build code to deploy a Data Fusion Pipeline.
If we want to replace an existing pipeline, we should delete the existing pipeline first since a pipeline with the same name will not be redeployed if one already exists in the target environment. To delete a pipeline, we can either:
Manually delete pipelines on the command line
Automate deletion of pipelines based on a file
To manually delete, use the following API call:
curl -X DELETE -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDF_API}/v3/namespaces/<namespace-id>/apps/<pipeline-name>"
Alternatively, you can create a file that specifies which items to delete. For example, within the datafusion/pipelines folder, create a new file called "pipelines-to-delete.txt" that contains the names of the pipelines to delete (one pipeline name per line). You can run this on the command line as shown:
for pipeline in $(cat demo1/datafusion/pipelines/pipelines-to-delete.txt) ; do curl -X DELETE -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDF_API}/v3/namespaces/<namespace-id>/apps/$pipeline"; done
You can also do this through a Cloud Build step. The example below deletes the listed pipelines from the namespace "demo". You should change the namespace in the Cloud Build script depending on the environment you are deploying to (each namespace is an environment):
- name: gcr.io/cloud-builders/gcloud
  entrypoint: 'bash'
  args:
  - '-c'
  - |
    for pipeline in $(cat demo1/datafusion/pipelines/pipelines-to-delete.txt)
    do
      curl -X DELETE -H "Authorization: Bearer $(cat access-token.txt)" \
        "https://customerdemo-dev-demonstration-wonderland-dot-usw1.datafusion.googleusercontent.com/api/v3/namespaces/demo/apps/$$pipeline"
    done
Trying to deploy a pipeline that already exists has no effect on the existing target. However, if you want to specify particular pipelines to deploy rather than all of them, you can adopt the same approach as done for deletion, and use a "pipelines-to-deploy.txt" file to explicitly define the pipelines you want to deploy (again, one pipeline filename per new line). To keep your intentions traceable, it is best to update this file with only the pipelines you want to deploy, and remove any that you do not.
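A deployment loop driven by "pipelines-to-deploy.txt" might look like the sketch below. It assumes AUTH_TOKEN and CDF_API are set as described in Section 4.1, that each line of the file is a pipeline JSON filename, and that the pipeline name is the filename without its .json extension. The function name and the DRY_RUN switch are hypothetical additions for this example.

```shell
# Deploy every pipeline listed in a "pipelines-to-deploy.txt" style file.
# Arguments: <list-file> <pipelines-dir> <namespace>
# With DRY_RUN=1 the target URL and source file are printed instead of calling the API.
deploy_pipelines() {
  list_file="$1"; pipelines_dir="$2"; namespace="$3"
  # The "|| [ -n ... ]" part handles a last line without a trailing newline.
  while IFS= read -r filename || [ -n "$filename" ]; do
    # Skip blank lines.
    [ -z "$filename" ] && continue
    # Assumption: the pipeline name is the filename without the .json extension.
    name="${filename%.json}"
    url="${CDF_API}/v3/namespaces/${namespace}/apps/${name}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "PUT ${url} <- ${pipelines_dir}/${filename}"
    else
      # The exported pipeline JSON is sent as the request body.
      curl -X PUT -H "Authorization: Bearer ${AUTH_TOKEN}" \
        "${url}" -d "@${pipelines_dir}/${filename}" || return 1
    fi
  done < "$list_file"
}
```

Running the function once with DRY_RUN=1 before a real deployment is a cheap way to confirm that the list file contains only the pipelines you intend to promote.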
4.1.2 Deploying Custom Data Fusion Plugins and UDDs
Removing old Data Fusion Plugins and UDDs
Before you deploy a new plugin or UDD, you may wish to delete an older one. This is not part of the Cloud Build process since deployed pipelines may depend on earlier versions of plugins, and deleting these would cause them to fail.
Note that plugins in existing pipelines are not automatically updated, so before deleting a plugin, consider whether you can keep multiple plugin versions deployed instead. This allows existing pipelines to keep running without interruption while you proceed with any updates.
To remove old plugins, you can run the following API command, replacing the variables with your own:
export PLUGIN_NAME=example-transform
export PLUGIN_VERSION=1.1.0-SNAPSHOT
curl -X DELETE -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDF_API}/v3/namespaces/demo/artifacts/${PLUGIN_NAME}/versions/${PLUGIN_VERSION}"
Deploying Data Fusion Plugins and UDDs
To see an example of how you can deploy a new plugin with Cloud Build, you can refer to A2: Cloud Build code to deploy a Custom Data Fusion Plugin or UDD.
4.1.3 Deploying Composer DAGs
Deploying a Composer DAG is relatively simple: it just involves copying the DAG file to the GCS bucket that corresponds to the target Composer environment.
Composer GCS buckets follow a standard naming convention along with a set GCS folder structure.
For a Cloud Build code example to deploy a Composer DAG, see Appendix A3: Cloud Build code to deploy a Composer Pipeline
So, given a particular Composer DAG file in the composer/dags version control directory, e.g. dl_load_{system_name1}.py, we copy it to the dags folder of the Composer GCS bucket.
4.1.4 Deploying Composer Python Packages
The Python packages that you want to have available in your Composer environment are listed in a "requirements.txt" file, which is stored in the composer directory of the version control repository. Updating the requirements.txt file triggers a build that also updates the Composer instance. If you run this as a separate Cloud Build process, it will not affect your Data Fusion instance. However, any Composer and Data Fusion pipelines that are running during this time will be interrupted.
On the command line, you can do this with a command similar to the following example:
gcloud composer environments update customerdemo-test \
--update-pypi-packages-from-file path/to/requirements.txt \
--location us-west3
For an example of a Cloud Build script to deploy the Python packages needed by Composer, see Appendix A4: Cloud Build code to deploy Composer Python Packages.
Note that this process causes a short period of downtime for the Composer environment while it updates. There is no impact on the Cloud Data Fusion instance.
4.1.5 Deploying Composer Dependencies
Nearly all other files located in Composer GCS buckets simply need to be copied to the Composer bucket.
To do this, we first need to get the GCS bucket that's associated with our Composer environment.
On the command line, you can do this as shown in the following example:
DAG_FOLDER=$(gcloud composer environments describe customerdemo-test \
--location us-west3 \
--format="get(config.dagGcsPrefix)")
echo $DAG_FOLDER
The result should look something like this:
gs://us-west3-customerdemo-test-623e32ea-bucket/dags
Then we need to copy files to the appropriate sub-folder within the Composer GCS Bucket.
For files going into the data/ folder, you will need to strip the "dags" folder from the end of the GCS path. You can do this using the sed command as follows:
echo $DAG_FOLDER | sed 's|\(.*\)/.*|\1|'
The result should look something like this:
gs://us-west3-customerdemo-test-623e32ea-bucket
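As an alternative to sed, the same trimming can be done with shell parameter expansion. The sketch below uses the sample bucket path shown above; the BUCKET_ROOT and DATA_FOLDER variable names are illustrative assumptions.

```shell
#!/bin/sh
# Sample value copied from the example output above. In a real build this
# would come from the gcloud describe command.
DAG_FOLDER="gs://us-west3-customerdemo-test-623e32ea-bucket/dags"

# "%/*" removes the shortest trailing match of "/*", i.e. the final "/dags".
BUCKET_ROOT="${DAG_FOLDER%/*}"

# Hypothetical variable: destination for files that belong in data/.
DATA_FOLDER="${BUCKET_ROOT}/data"

echo "${BUCKET_ROOT}"
echo "${DATA_FOLDER}"
```

Parameter expansion avoids starting a separate sed process and keeps the bucket root in a variable that later copy commands can reuse.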
While we have provided a descriptive mapping previously in Section 3.2.2 Committing Composer files to version control, this table provides an easy-to-reference mapping between the source control path and the Composer GCS bucket path:
Source Control Path | GCS Bucket Path |
|
|
|
|
|
|
|
|
Once we obtain the correct filename and path, these can be copied with Cloud Build using the gsutil cp command.
In most cases we will not want to overwrite the destination files, so we can use the copy command with the "-n" option, which stands for "no-clobber" and will skip any files that already exist in the destination. You can find more details about the gsutil cp options in the documentation here. If you do have a requirement to overwrite, you can simply modify the copy command to allow this.
For an example of how this is written in Cloud Build, see Appendix A5: Cloud Build code to deploy Composer Python dependencies & files.
4.1.7 Deploying BigQuery Objects
In some instances there may be BigQuery objects that you need to create in your test and production environments.
Note that if you are using Cloud Build to run BigQuery commands as per the example shown in Appendix A, the Cloud Build Service Account will need to be granted permission to access BigQuery.
For an example of a Cloud Build script to deploy the BigQuery objects, see Appendix A6: Cloud Build code to create BigQuery objects.
To use the Cloud Build script process, we need a number of supporting files that contain information and commands to create the tables. To deploy these scripts, we want a single script for our table and we'll use Cloud Build to point to the right environment file. These variables can then be populated in the table script for dev/test/prod so we only ever need a single create table script and we do not need to update it three times (one per environment).
Some examples of the files required are shown here. We have a file defining the environment variables, a file creating any datasets where they don't exist, and a file that will create any tables where they don't exist. You can also create views in a similar way if you wish by modifying the SQL code.
bigquery/environment-definitions/env-dev (Sets the environment variables including any datasets used in subsequent scripts - there should be one of these files for each environment e.g. env-dev, env-test, env-prod)
export PROJECT_ID=<dev-env-project-id>
export DATASET_AUDIT=<dev-audit-dataset>
export DATASET_AUDIT_DESCRIPTION=<dev-audit-dataset-description>
export DATASET_LOCATION=<dev-dataset-location>
bigquery/environment-definitions/create-datasets.sh (Creates the dataset if it doesn't exist)
#!/bin/bash
# Double quotes are required so that the shell expands the variables.
bq --location="${DATASET_LOCATION}" mk \
  --dataset \
  --description "${DATASET_AUDIT_DESCRIPTION}" \
  "${PROJECT_ID}:${DATASET_AUDIT}"
bigquery/table-definitions/create-table-load_audit.sh (Script that creates the table if it doesn't exist)
#!/bin/sh
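# Build the statement once so it can be logged and then executed.
SQL1="CREATE TABLE IF NOT EXISTS \`${PROJECT_ID}.${DATASET_AUDIT}.load_audit\` \
( source_id string ,source_name string);"
echo "${SQL1}"
# Quote the variable so the whole statement is passed as a single argument.
bq query --nouse_legacy_sql "${SQL1}"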
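To illustrate how one table script can serve every environment, the sketch below writes two stand-in environment files to a temporary directory, sources the one selected by a TARGET_ENV variable, and prints the resulting SQL instead of running it. TARGET_ENV, the render_sql function, and the placeholder project and dataset names are assumptions for illustration. In a real build the env-dev, env-test, or env-prod file from version control would be sourced and the statement passed to bq query.

```shell
#!/bin/sh
# Stand-in environment files with placeholder values (hypothetical names).
ENV_DIR=$(mktemp -d)
printf 'export PROJECT_ID=example-dev-project\nexport DATASET_AUDIT=audit_dev\n' > "${ENV_DIR}/env-dev"
printf 'export PROJECT_ID=example-test-project\nexport DATASET_AUDIT=audit_test\n' > "${ENV_DIR}/env-test"

# Print the CREATE TABLE statement for whichever environment is loaded.
render_sql() {
  printf "CREATE TABLE IF NOT EXISTS \`%s.%s.load_audit\` (source_id STRING, source_name STRING);\n" \
    "${PROJECT_ID}" "${DATASET_AUDIT}"
}

# TARGET_ENV is an assumed variable that the build would set per environment.
TARGET_ENV="${TARGET_ENV:-dev}"
. "${ENV_DIR}/env-${TARGET_ENV}"
render_sql
```

Because only the sourced file changes between environments, the table definition itself is written once and never edited per environment.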
Best Practice Tip:
|
4.2 Cloud Build Deployment Secret Management
This section refers specifically to how deployment secrets to access the CDAP API are managed when using Cloud Build.
For Data Fusion deployment with Cloud Build, we'll make use of the CDAP REST API. In order to call this API, we need an access token. If we execute this API directly as a user, we will be using our own account, but if we want to allow other services like Cloud Build to execute the API, we need to create a Service Account with access permissions to the target environments.
To generate an access token using a service account, we first need to create a service account. Each service account can have a keyfile (JSON file) associated with it.
Best Practice Tip:
|
4.2.1 Storing Deployment Secrets for Cloud Build with Secret Manager
The following instructions explain how to generate a secret and call the API with a service account.
Using Secret Manager to call the CDAP API:
(If not already enabled) Enable the Secret Manager API.
Create a service account (e.g. cdfbuild@...) and download the account's JSON key file.
Create a new secret in Secret Manager containing the service account JSON key file. See Appendix B1: How to create a Secret in Secret Manager for details on how to do this.
At runtime, generate an access token using the keyfile retrieved from Secret Manager. If using Cloud Build, the Cloud Build service account will need access to the secret (grant Cloud Build the "Secret Manager Secret Accessor" role). See Appendix A1: Cloud Build code to deploy a Data Fusion Pipeline for a code example showing how to do this.
You can find more information here:
In the Cloud Build YAML file:
Clone the repo containing the pipeline json file.
Get the cdfbuild@ service account keyfile from Secret Manager.
Activate and get an access token for the cdfbuild@ service account.
Run the curl command providing the access token as Authorization: Bearer.
5. Testing
Testing is part of the CI/CD process and helps to speed up the rate of development by automating tests and reducing the occurrence of production issues. Tests ensure that consideration is given to the various scenarios that might arise in the production environment. When working with data, particular attention should be paid to boundary cases whereby data input does not match the expected format.
In the following subsections, we discuss:
Testing Data Fusion pipelines
Testing Composer pipelines
How to generate and use test data by taking a copy of production into the test environment
One of the most common questions is "how much should I test?". Initially we recommend that customers new to CI/CD focus on writing automated tests that would cover the most common scenarios, rather than aiming to cover every scenario that may ever happen. By having automated testing processes, we can reduce the time to make changes by testing the pipeline under the same conditions. Writing automated test cases may provide a diminishing return where you may see infrequent code changes (meaning the tests are unlikely to be ever run again), or where the impact of errors in the production environment is inconsequential.
Best Practice Tip:
|
5.1 Data Fusion Pipeline Testing
5.1.1 During development
There are a number of tests that developers should perform during development. The most obvious is for the developer to ensure their pipeline runs end-to-end successfully in development. The developer should also take care to consider error handling and logging.
In most cases the developer can write a simple local test. A single test could consist of the following:
Generate test data manually.
Perform some transformations on this data using your Data Fusion pipeline.
Check that the values produced are as you would expect. There are a number of ways you could do this - for example you could write a script to check values, or use BigQuery Assert (see below).
Use BigQuery Assert
Where the data is located in BigQuery, one option for conducting local unit testing is to provide expected input and then use the BigQuery Assert statement to run a series of tests to validate the data. Any inconsistencies can then be addressed prior to merging to the "develop" branch and thus the test environment. This helps to avoid developers performing a lot of manual inspection of output tables during development.
You can find the syntax for the BigQuery Assert statement in the documentation here.
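As an illustration, a range check can be expressed as an ASSERT statement and run with bq query. The sketch below only builds the statement. The function name, the table name, and the score column are hypothetical, and the 0-100 range is just an example.

```shell
#!/bin/sh
# Hypothetical helper: print an ASSERT statement that fails when any row of
# the given table has a value outside [min, max] in the given column.
# Rows where the column is NULL are not counted by this check.
# Arguments: fully qualified table, column, min, max.
build_range_assert() {
  printf "ASSERT ((SELECT COUNT(*) FROM \`%s\` WHERE %s NOT BETWEEN %s AND %s) = 0) AS '%s outside %s-%s';\n" \
    "$1" "$2" "$3" "$4" "$2" "$3" "$4"
}

# Placeholder table and column names.
build_range_assert "my-project.my_dataset.output_table" "score" 0 100
```

The printed statement can then be passed to bq query --nouse_legacy_sql. A failed assertion is reported as a query error, which a build step can treat as a test failure.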
5.1.2 In the development environment
Once we merge to our feature branch, we will perform some automated unit tests that will test the functionality of an individual component, in this case, our Data Fusion data pipeline. This can be done in the development environment to ensure that all test cases are addressed.
Pipeline Tests
When we conduct testing, we need to validate what happens under different circumstances. Therefore, we would want a range of tests to cover these circumstances. For example, we might check the results against likely scenarios such as:
Standard inputs
Boundary cases - e.g. when the range of possible values is 0-100, test values at and just outside the min/max, e.g. -1, 0, 100, 101
Null value handling
Data type inconsistencies - check how your pipeline handles the result when a STRING is provided instead of an INT for example
Missing columns
Formatting issues
Test data files providing various inputs should be stored in the version control repository within the datafusion/pipeline-tests folder.
The test file naming convention should be {dag-name}-test{test-number}-{test-description}.{data-file-extension}
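A build step could check file names against this convention before running any tests. The following is a sketch only: the function name is hypothetical, and the set of characters allowed in each part (letters, digits, underscores, and hyphens) is an assumption, since the convention above does not specify it.

```shell
#!/bin/sh
# Hypothetical check: succeed when a file name follows
# {dag-name}-test{test-number}-{test-description}.{data-file-extension}.
is_valid_test_filename() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9_]+(-[A-Za-z0-9_]+)*-test[0-9]+-[A-Za-z0-9_-]+\.[A-Za-z0-9]+$'
}

# Example names (made up for illustration).
for name in dl_load_sales-test1-null-values.csv dl_load_sales-nulls.csv; do
  if is_valid_test_filename "${name}"; then
    echo "ok: ${name}"
  else
    echo "bad: ${name}"
  fi
done
```

The first name is accepted and the second is rejected because it lacks the test{test-number} part.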
An example of a test file is shown below.
|
|
To trigger this in a Cloud Build script, you will need to perform two steps:
Call the CDF API to start the pipeline with each input file in the datafusion/pipeline-tests folder.
Wait for completion and check the results using the BigQuery Assert statement.
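The "wait for completion" part of these steps can be sketched as a polling loop. The helper below is hypothetical and deliberately generic: it takes the status command as arguments, so the real status call against the CDF API (not shown here) can be substituted for the stand-in. The terminal status names COMPLETED, FAILED, and KILLED are assumptions about what that call returns.

```shell
#!/bin/sh
# Hypothetical helper: run a status command repeatedly until it prints a
# terminal status. Arguments: max attempts, seconds between attempts,
# then the command to run.
# Returns 0 on COMPLETED, 1 on FAILED or KILLED, 2 if attempts run out.
wait_for_terminal_status() {
  max_attempts=$1
  interval=$2
  shift 2
  attempt=1
  while [ "${attempt}" -le "${max_attempts}" ]; do
    status=$("$@")
    case "${status}" in
      COMPLETED) echo "${status}"; return 0 ;;
      FAILED|KILLED) echo "${status}"; return 1 ;;
    esac
    sleep "${interval}"
    attempt=$((attempt + 1))
  done
  echo "TIMEOUT"
  return 2
}

# Stand-in for the real status call, used only to show the control flow:
# it reports RUNNING twice and then COMPLETED.
COUNTER_FILE=$(mktemp)
echo 0 > "${COUNTER_FILE}"
fake_status() {
  n=$(cat "${COUNTER_FILE}")
  n=$((n + 1))
  echo "${n}" > "${COUNTER_FILE}"
  if [ "${n}" -lt 3 ]; then echo RUNNING; else echo COMPLETED; fi
}

wait_for_terminal_status 5 0 fake_status
```

Only when the loop returns successfully would the build go on to run the BigQuery Assert checks. A non-zero return can be used to fail the build step.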
5.2 Composer Pipeline Testing
5.2.1 During development
During development it's recommended the team use a python linter (pylint) to help write neatly formatted code. This can be added as an extension to VS Code. You can find more information on the VS Code site here.
If using a Composer DAG that triggers reusable pipelines, we will need at least two Data Fusion pipelines that have already been tested (i.e. clean pipelines with no errors) as input, and we then check that the DAG runs as expected. We do this to avoid mixing up the testing of Data Fusion pipelines and Composer DAGs, which would create confusion about what has caused a failure. The idea of testing Composer DAGs is to make sure that the pipeline runs as expected and that log files are written out. Results of testing can also be checked using BigQuery Assert.
You can also perform DAG testing by following these instructions. These instructions provide guidance on checking for PyPI package errors, syntax errors and task errors.
You can also perform a dry run of the DAG using the Airflow CLI's dry-run option, which can be run through gcloud. For more details, see the airflow reference here.
5.2.2 In the test environment
Once our pipelines are running successfully, we also need to perform integration testing to ensure that entire workflows triggered by Composer are going to run without any issues in our production environment. We'll do this by running them in the "test" environment (i.e. Composer test environment and Data Fusion test environment).
While we could rely on scheduling, we may wish to have a little more control over when our test runs, so we can use the gcloud command to trigger a DAG run. More information on this can be found in the Google documentation here.
Best Practice Tip:
|
5.3 Test Data
5.3.1 Unit Test Data
Unit testing should be done on pre-defined datasets that outline the specific cases you want to test. This is described in Section 5.1 Data Fusion Pipeline Testing.
5.3.2 Integration Test Data
It's important to note that you don't want data changes occurring constantly while running integration tests. In other words, we don't want a constant stream of updates from our production data. It is recommended that the development team take a snapshot and only refresh sample data from production at intervals. We want some consistency between our test results so we know what's causing any pipeline failures. We would like to avoid the situation where we are unsure if integration test failures are a result of pipeline changes or testing data changes.
It is recommended that this be performed on an "as needed" basis at a time agreed across all developers using this test dataset. This time period depends on the significance of data changes to the production environment, particularly where these changes are necessary for further development and/or testing.
We recommend using a script to copy data from views to tables in a test dataset. To do this, developers may need to create some views that redact sensitive data from the testing datasets in BigQuery. Once done, a script can be used to materialize this data into a replica testing environment.
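A minimal sketch of such a script is shown below. It only prints the statements. The function name and the project, dataset, and view names are placeholders, and each view is assumed to be copied into a table of the same name in the test dataset.

```shell
#!/bin/sh
# Hypothetical helper: print one CREATE OR REPLACE TABLE statement per view,
# copying each redacted view into a table of the same name in a test dataset.
# Arguments: source dataset (project.dataset), target dataset, then view names.
build_snapshot_sql() {
  src=$1
  dst=$2
  shift 2
  for view in "$@"; do
    printf "CREATE OR REPLACE TABLE \`%s.%s\` AS SELECT * FROM \`%s.%s\`;\n" \
      "${dst}" "${view}" "${src}" "${view}"
  done
}

# Placeholder project, dataset, and view names.
build_snapshot_sql "prod-project.redacted_views" "test-project.test_data" customers orders
```

Each printed statement can be run with bq query --nouse_legacy_sql. Because the statements replace the tables, the script is best run only at the agreed refresh times so that test data stays stable between runs.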
Best Practice Tip:
|
Appendix A: Cloud Build Example Code
This appendix contains sample Cloud Build code that can be used to deploy various objects.
A1: Cloud Build code to deploy a Data Fusion Pipeline
The following code iterates through all Data Fusion pipelines in a folder and deploys them.
It also retrieves the secret (a service account keyfile), stores it in a local JSON file, and uses it to obtain an access token for the CDAP API. We then pass this token in a cURL command to deploy each pipeline to our specified Data Fusion instance.
Filename: cdfpipeline-cloudbuild-dev.yaml
steps:
- name: gcr.io/cloud-builders/git
  args: ['clone', '-b', 'develop', 'https://lizsuperawesome@bitbucket.org/lizsuperawesome/demo1.git']
- name: gcr.io/cloud-builders/gcloud
  entrypoint: 'bash'
  args: [ '-c', 'gcloud secrets versions access latest --secret=cdfbuild-sa-keyfile > cdfbuild-sa-keyfile.json' ]
- name: gcr.io/cloud-builders/gcloud
  entrypoint: 'bash'
  args: [ '-c', 'gcloud auth activate-service-account --key-file cdfbuild-sa-keyfile.json' ]
- name: gcr.io/cloud-builders/gcloud
  entrypoint: 'bash'
  args: [ '-c', 'gcloud auth print-access-token > access-token.txt' ]
- name: gcr.io/cloud-builders/gcloud
  entrypoint: 'bash'
  args:
  - '-c'
  - |
    for FILE in demo1/datafusion/pipelines/*.json
    do
      CURRFILE=$(echo $$FILE | sed "s/.*\///" | cut -f 1 -d '.')
      curl -X PUT -H "Authorization: Bearer $(cat access-token.txt)" "https://customerdemo-dev-demonstration-wonderland-dot-usw1.datafusion.googleusercontent.com/api/v3/namespaces/demo/apps/$$CURRFILE" -d "@./demo1/datafusion/pipelines/$$CURRFILE.json"
    done
A Cloud Build trigger is set up to run this as shown:
A2: Cloud Build code to deploy a Custom Data Fusion Plugin or UDD
The following code is placed within the datafusion/plugins/example-transform folder. It clones the develop branch, copies the plugin files to the root directory, uses maven to build the plugin jar/json file, gets the secret required by the CDAP API, and then uses the secret to call two API commands that deploy the plugin's jar and the json file to the designated environment.
Filename: cdf-example-transform-cloudbuild-dev.yaml
steps:
- name: gcr.io/cloud-builders/git
  args: ['clone', '-b', 'develop', 'https://lizsuperawesome@bitbucket.org/lizsuperawesome/demo1.git']
- name: gcr.io/cloud-builders/gcloud
  entrypoint: 'bash'
  args:
  - '-c'
  - |
    cp -R demo1/datafusion/plugins/example-transform/* .
- name: maven:3-jdk-8
  entrypoint: 'mvn'
  args: ['clean', 'package']
- name: gcr.io/cloud-builders/gcloud
  entrypoint: 'bash'
  args: [ '-c', 'gcloud secrets versions access latest --secret=cdfbuild-sa-keyfile > cdfbuild-sa-keyfile.json' ]
- name: gcr.io/cloud-builders/gcloud
  entrypoint: 'bash'
  args: [ '-c', 'gcloud auth activate-service-account --key-file cdfbuild-sa-keyfile.json' ]
- name: gcr.io/cloud-builders/gcloud
  entrypoint: 'bash'
  args: [ '-c', 'gcloud auth print-access-token > access-token.txt' ]
- name: gcr.io/cloud-builders/gcloud
  entrypoint: 'bash'
  args:
  - '-c'
  - |
    PLUGIN_VERSION=1.1.0-SNAPSHOT
    PLUGIN_PROPERTIES=$(cat /workspace/target/example-transform-$$PLUGIN_VERSION.json | python -c "import sys, json; print(json.dumps(json.load(sys.stdin)['properties']))")
    PLUGIN_PARENTS=$(cat /workspace/target/example-transform-$$PLUGIN_VERSION.json | python -c "import sys, json; print('/'.join(json.load(sys.stdin)['parents']))")
    curl -X POST -H "Authorization: Bearer $(cat access-token.txt)" \
      "https://customerdemo-dev-demonstration-wonderland-dot-usw1.datafusion.googleusercontent.com/api/v3/namespaces/demo/artifacts/example-transform" \
      -H "Artifact-Version: $$PLUGIN_VERSION" \
      -H "Artifact-Extends: $$PLUGIN_PARENTS" \
      --data-binary @/workspace/target/example-transform-$$PLUGIN_VERSION.jar
    curl -X PUT -H "Authorization: Bearer $(cat access-token.txt)" \
      "https://customerdemo-dev-demonstration-wonderland-dot-usw1.datafusion.googleusercontent.com/api/v3/namespaces/demo/artifacts/example-transform/versions/$$PLUGIN_VERSION/properties" \
      -d "$$PLUGIN_PROPERTIES"
A trigger is set up to run this Cloud Build file as shown:
A3: Cloud Build code to deploy a Composer Pipeline
Filename: dags-cloudbuild-dev.yaml
A trigger is set up to run this Cloud Build file as shown:
A4: Cloud Build code to deploy Composer Python packages
Python Packages
The following Cloud Build file deploys the requirements.txt file containing the Python packages for Composer to the Composer environment.
Note that this may take some time to complete while Composer updates. We have therefore increased the timeout to 1 hour, up from the default of 10 minutes.
Filename: composer-py-packages-cloudbuild-dev.yaml
A Cloud Build trigger is set up as shown:
A5: Cloud Build code to deploy Composer Python dependencies & files
DAG Python Dependencies
Within the A&G data lake design there also exist some Python dependencies. These are located in the composer/dag-dependencies folder in version control. These are files such as read_params.py that perform specific operations within Composer.
There is no specific deployment that needs to take place. We simply need to move these dependencies into the correct folder i.e. dag/dependencies within Composer's GCS bucket.
For more information on installing dependencies using gcloud, see the documentation here.
Filename: dag-dependencies-cloudbuild-dev.yaml
A Cloud Build trigger is set up as shown:
A6: Cloud Build code to create BigQuery objects
The following code shows an example of how to create a BigQuery dataset and tables that do not already exist.
Filename: bq-cloudbuild-dev.yaml
You should set this up to trigger based on new files being created in the "bigquery" folder as shown:
Appendix B: How To Guides
B1: How to create a Secret in Secret Manager
To begin with, we will need to create a service account that will be used to call the CDAP API. On creation, you can download the JSON keyfile.
The image below shows the service account listed in IAM:
The next step is to store our secret into Secret Manager. To do this:
Navigate from the main menu to Security > Secret Manager.
Once you're in the Secret Manager, click on Create Secret.
Enter a name for the secret, and then upload the keyfile.json file.
You should now be able to see the secret listed on the main page of Secret Manager.
Click on it and you will be taken through to see the details.
You've now stored your secret in Secret Manager.