Deploying Artifacts to The Hub

This article covers the steps needed to deploy an artifact (plugin, driver, pipeline, etc.) to The Hub.

Requirements

Background Info

The Hub contains many artifacts that users of Cloud Data Fusion (CDF) can deploy at any time. The artifacts are stored in Google Cloud Storage, in this bucket. The packages.json file in the bucket dictates which artifacts are visible and what their properties are. The artifacts themselves are stored in the packages/ directory. Using the info from packages.json, CDF pulls the artifacts directly from the GCS bucket and deploys them to the user's instance.
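
For orientation, you can inspect the live Hub contents directly with gsutil; a minimal sketch, using <hub-bucket> as a placeholder for the real bucket name:

    # <hub-bucket> is a placeholder -- substitute the actual Hub bucket name.
    gsutil cat gs://<hub-bucket>/packages.json | head    # the generated catalog
    gsutil ls gs://<hub-bucket>/packages/                # one directory per artifact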

The packages.json file is generated; it should not be edited directly. The Packager generates it from the spec.json files present in each directory. The directory structure of the Hub is as follows:

    packages/<name>/<version>/spec.json
    packages/<name>/<version>/icon.jpg
    packages/<name>/<version>/<other files>

If there are multiple versions of the same artifact, multiple directories are created under the <name> directory. The <other files> depend on the type of artifact; for example, if the artifact is a plugin the structure would be:

    packages/<name>/<version>/spec.json
    packages/<name>/<version>/icon.jpg
    packages/<name>/<version>/<plugin-name>-<plugin-version>.jar
    packages/<name>/<version>/<plugin-name>-<plugin-version>.json

The exact naming of the plugin artifacts is not important as long as it matches the names in spec.json, but following this naming convention is recommended.

The spec.json file needs to be configured correctly so the Packager can properly create the packages.json file. More details on the format can be found here. This is an example spec.json for a plugin; it is fairly straightforward to adapt it for other plugins.

{ "specVersion": "1.0", "label": "Data Loss Prevention", "description": "Data Loss Prevention plugins to filter, redact and decrypt sensitive data directly in a pipeline.", "author": "Cask", "org": "Cask Data, Inc.", "created": 1589218398, "categories": [ "hydrator-plugin" ], "cdapVersion": "[6.1.1,7.0.0-SNAPSHOT)", "paidLink":"https://cloud.google.com/dlp/pricing", "beta":false, "actions": [ { "type": "one_step_deploy_plugin", "label": "Deploy Data Loss Prevention Plugins", "arguments": [ { "name": "name", "value": "dlp", "canModify": false }, { "name": "version", "value": "1.2.2", "canModify": false }, { "name": "scope", "value": "user", "canModify": false }, { "name": "config", "value": "dlp-1.2.2.json", "canModify": false }, { "name": "jar", "value": "dlp-1.2.2.jar", "canModify": false } ] } ] }

Steps

  1. Clone The Hub GitHub repo locally.

  2. Create a new branch for your changes.
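
A minimal sketch of steps 1 and 2; the repo URL and branch name are assumptions, so check with the team for the canonical repo location:

    # Repo URL is an assumption -- verify the canonical location before cloning.
    git clone https://github.com/cdapio/hub.git
    cd hub
    git checkout -b add-dlp-1.2.2    # example branch name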

  3. If you are deploying a new version of an existing artifact, proceed to step 4. If you are deploying a new artifact that does not currently exist in the Hub, create a directory under packages/ with the name of the artifact. The naming convention is "<artifact-type>-<artifact-name>".

  4. Create a new directory under the packages/<artifact>/ directory and name it with the version number. For example, to deploy version 1.0.0 the path would be packages/<artifact>/1.0.0/
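
For example, using the DLP plugin from the spec above (the plugin-dlp directory name is an assumption following the <artifact-type>-<artifact-name> convention):

    # Creates the artifact and version directories in one step.
    mkdir -p packages/plugin-dlp/1.2.2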

  5. If you are updating an existing artifact, it is recommended that you delete the old version if they target the same version of CDAP. For example, if you’re adding version 1.1.0 and the existing version is 1.0.0 and they both target CDAP 6.1.1, then we recommend that you delete 1.0.0 since there is no reason anyone should deploy the older version.
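
For the example above, removing the superseded version might look like:

    # Only remove the old version when it targets the same CDAP range as the new one.
    git rm -r packages/<artifact>/1.0.0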

  6. Place the appropriate files in the directory you just created, following the info presented in the Background Info section.
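
Continuing the plugin example, that means copying the spec, icon, and plugin files into the version directory (the source paths here are hypothetical):

    # Source paths are hypothetical -- use wherever your build outputs live.
    cp /path/to/spec.json /path/to/icon.jpg packages/plugin-dlp/1.2.2/
    cp /path/to/dlp-1.2.2.jar /path/to/dlp-1.2.2.json packages/plugin-dlp/1.2.2/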

  7. Create a Pull Request with your changes and send the link to someone from the Cloud Data Fusion team for approval.

  8. Once the PR is approved, merge it into the master branch.

  9. From the hub/ directory, run the following commands to regenerate packages.json:
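
(The exact Packager invocation is documented in the repo itself; the following is only a hedged sketch that assumes a Maven-built Packager module, so the jar name and build steps are assumptions.)

    # Hypothetical sketch -- consult the hub repo README for the real invocation.
    cd packager && mvn clean package && cd ..    # build the Packager
    java -jar packager/target/packager.jar       # regenerate packages.json from the spec.json files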

  10. Open the packages.json file and ensure your new artifact appears in the JSON with the correct version.
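
A quick way to check from the shell; this assumes jq is installed and that packages.json is a top-level JSON array of package entries, which you should verify against the generated file:

    # Schema assumption: packages.json is a JSON array with name/version fields.
    jq '.[] | select(.name == "dlp") | .version' packages.json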

  11. When you are ready to deploy, run the following commands to upload the artifacts to the Hub bucket:
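
(A hedged sketch; <central-hub-bucket> is a placeholder for the real prod bucket, and the canonical commands are in the repo docs.)

    # <central-hub-bucket> is a placeholder -- double check before running.
    gsutil -m rsync -r packages/ gs://<central-hub-bucket>/packages/
    gsutil cp packages.json gs://<central-hub-bucket>/packages.json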

Warning: This will push your changes to prod instantly; there is no staging environment or gradual rollout. As soon as the upload is done, the changes are live. Please double-check that all the required artifact files are present and that packages.json contains the correct versions before deploying.

  12. The previous steps copied the artifacts to the central hub. There are also regional hubs, which are used by CDF 6.1.4 (and newer) instances. Run the following commands to sync all regional hubs with the central hub:
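
(A hedged sketch; the bucket names and the region list are placeholders, not the real values.)

    # Bucket names and regions are placeholders -- substitute the real ones.
    # rsync -d deletes regional objects that no longer exist in the central hub.
    for region in us-east1 europe-west1 asia-east1; do
      gsutil -m rsync -d -r gs://<central-hub-bucket> gs://<regional-hub-bucket>-${region}
    done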

It is possible for one of the gsutil commands to hang or freeze when processing this volume of data. The best solution is to kill the command (using Ctrl + C) and rerun it.

  13. Wait a couple of minutes for the changes to propagate, then try to deploy your new artifact from The Hub in a CDF instance.
