Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
CDAP pipelines can run in various environments, such as native (Hadoop) or cloud (GCP/EMR/Azure). Some CDAP plugins are only capable of running in a few specific environments. For example, transactional plugins like CDAP Table are not compatible with cloud and can only be used natively. Furthermore, some plugins are compatible only with certain versions of the underlying processing or storage platform. For example, a plugin might only be compatible with Spark 2 and not with Spark 1.
Plugins can also be compatible or incompatible with various business rules; for example, a plugin might be PII compatible. A CDAP administrator might want to make only PII compliant plugins available to pipeline developers to ensure compliance.
It is a bad user experience when a user builds a pipeline only for it to fail because it was incompatible with the environment or business rules in which it was running. We would like to support filtering of the plugins available to pipeline developers based on their compatibility.
Goals
There are three goals which we want to achieve to improve the user experience around compatibility of plugins:
A plugin developer should be able to easily and effectively specify the compatibility of the plugin being developed.
The CDAP platform should be able to capture and provide compatibility information of plugins.
If an incompatible plugin runs, it should fail early and with an appropriate message.
User Stories
As a CDAP plugin developer, I should be able to specify compatibility of my plugin.
As a CDAP pipeline developer, I should only see plugins which are compatible.
As a CDAP administrator, I want to enforce that plugins have certain capabilities to run in my CDAP instance.
As a CDAP pipeline developer and/or CDAP administrator, if a pipeline containing an incompatible plugin runs, I would like it to fail early and with an appropriate message.
Scenarios
Scenario 1: Specifying Capability
Scenario 1.1
Alice is a CDAP plugin developer who is developing a CDAP Dataset plugin (transactional). Her plugin is supported only in a transactional environment. She would like to specify this in her plugin so that pipeline developers don’t use her plugin in other modes.
Scenario 1.2
Alice is also developing an Action plugin which stores some state information in a CDAP Dataset. Since her action plugin uses a CDAP Dataset, it can only run in native mode. She would like to specify this in her plugin so that pipeline developers don’t use her plugin in other modes.
Scenario 1.3
Alice is a CDAP plugin developer who is developing a Spark ML transform which uses libraries available only in Spark 2. She would like to specify that her plugin is only compatible with Spark 2.
Scenario 1.4
Alice is a CDAP plugin developer who is developing a PII compliant plugin. She would like to specify that her plugin is PII compliant so that, when she deploys it in a CDAP instance which only allows PII compliant plugins to run, her plugin can be run and used by pipeline developers.
Scenario 2: Plugin Filtering
Scenario 2.1
Bob is a data analyst who is evaluating CDAP. He is running CDAP in a particular environment and he sees a lot of plugins which do not seem compatible with his environment. He would like to be able to filter plugins on compatibility to see only the plugins which are compatible with his environment.
Scenario 2.2
Eve is a CDAP administrator who is setting up a CDAP instance in the cloud. She would like to enforce that only plugins which are capable of running in the cloud are available to pipeline developers.
Scenario 2.3
Eve is a CDAP administrator who is trying to set up a CDAP environment in production for data processing. Eve’s organization has strict compliance requirements and she wants to allow only plugins which meet certain compliance rules to be used by the data analysts in her organization. Furthermore, she does not want any data analyst to be able to override her settings and run non-compliant plugins.
Scenario 3: Failing Early and Gracefully
Scenario 3.1
Bob is trying to develop a pipeline to process some data which is stored in a CDAP Table. He builds a pipeline with the appropriate plugin and configuration, and the pipeline fails at runtime with a lot of cryptic error messages in the logs. Bob rechecks his plugin configuration and tries to debug the issue, but he is not able to run the pipeline successfully. Disappointed with the platform, Bob reaches out to the CDAP support group for help. After some back and forth, Bob learns that he was running the plugin in cloud mode and that to run this pipeline he needs to set the correct compute profile at runtime. This makes sense to him, but he wonders why the error messages in the logs did not point it out: he could easily have corrected it himself and saved the time spent with support.
Scenario 3.2
Bob receives an exported pipeline from another pipeline developer and tries to run it. The pipeline fails for him but works perfectly fine for his colleague. Bob tries to debug the issue by looking into the logs, but he is again greeted by cryptic error messages. He reaches out to CDAP support and is told that he is running his pipeline in the incorrect mode. He is furious that the CDAP logs do not show any information for such a common problem.
Design
API
A plugin developer will be responsible for specifying the capabilities of the plugin. Plugin developers can use an annotation provided by the platform to specify this, just as they specify the Name or Description of the plugin.
To support this we will add the following annotation:
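```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/**
 * Annotates the different environments and versions in which the element is supported.
 */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface Capability {
  String[] value();
}
```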
```java
/**
 * Defines the different capability options for CDAP programs/plugins to specify.
 */
public final class Capabilities {

  /**
   * Mode Category: Defines different modes for CDAP.
   */
  public static final class Mode {
    public static final String CLOUD = "mode_cloud";   // defines compatibility in cloud mode
    public static final String NATIVE = "mode_native"; // defines compatibility in native (non-cloud) mode
    public static final String ALL = "mode_all";       // defines compatibility in all modes
  }

  /**
   * Spark Category: Defines different Spark versions.
   */
  public static final class Spark {
    public static final String V1 = "spark_1";    // defines compatibility with Spark 1.x
    public static final String V2 = "spark_2";    // defines compatibility with Spark 2.x
    public static final String ALL = "spark_all"; // defines compatibility with Spark 1.x and Spark 2.x
  }

  // Defines compatibility with all predefined options.
  // Used as the default option if no compatibility is defined for the plugin.
  public static final String ALL = "system_all";
}
```
This will allow the plugin developer to specify various capabilities exposed by the platform in the following way:
```java
@Plugin(type = BatchSource.PLUGIN_TYPE)
@Name("Mock")
@Compatible({Capabilities.Mode.NATIVE, Capabilities.Spark.V2})
public class MockSource extends BatchSource<byte[], Row, StructuredRecord> {
  ....
  ...
}
```
Plugins can also be annotated with custom values to specify compatibility with business rules. For example, a plugin developer can specify that the plugin is PII compatible by annotating it with:
```java
@Compatible({Capabilities.Mode.NATIVE, Capabilities.Spark.V2, "PII"})
```
Capability values will be case insensitive. If no capability option is specified, the default capability will be used: the plugin will be considered capable of all system defined capabilities, but not of any user defined capability options such as PII. Custom capabilities need to be explicitly declared by the plugin developer.
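The defaulting and case-insensitivity rules can be sketched as follows. This is a minimal illustration rather than the platform implementation: `CapabilityDefaults` and `resolve` are hypothetical names, and the set of system defined values is assumed to be the four concrete constants from the `Capabilities` class (the `*_all` aliases are not expanded here).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of CDAP) that resolves a plugin's effective capabilities. */
final class CapabilityDefaults {

  // Assumption: the system defined capabilities are exactly these four values.
  static final Set<String> SYSTEM_DEFINED = Collections.unmodifiableSet(
      new LinkedHashSet<>(Arrays.asList("mode_cloud", "mode_native", "spark_1", "spark_2")));

  private CapabilityDefaults() { }

  /**
   * Returns the declared capabilities in lower case, or all system defined
   * capabilities when the plugin declares none. Custom values such as "PII"
   * are never added implicitly.
   */
  static Set<String> resolve(String... declared) {
    if (declared == null || declared.length == 0) {
      return SYSTEM_DEFINED;
    }
    Set<String> resolved = new LinkedHashSet<>();
    for (String value : declared) {
      resolved.add(value.trim().toLowerCase(Locale.ROOT));
    }
    return resolved;
  }
}
```

Under these assumptions, a plugin with no declared capabilities resolves to the four system values and does not include `pii`, while `resolve("Mode_Native", "PII")` yields `mode_native` and `pii`.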
Allowing users to specify custom capability values opens up issues with standardization of such values. For example, two plugin developers working on different plugins may choose to annotate their plugins with different names for the same business rule. This might lead to confusion, and we might end up with many filter options representing the same business rule. The CDAP Metadata system currently suffers from the same problem, where two different tags, say ‘sensitive’ and ‘confidential’, might be used in a similar context. One way to standardize the capability options available to plugin developers would be for the CDAP platform to define the allowed capability options explicitly. For simplicity, however, in this release we will not support any mechanism of standardization. It will be the responsibility of plugin developers to use a consistent taxonomy among themselves.
Platform
Processing
Currently, when a plugin is deployed in CDAP, we inspect it to collect various information about the plugin. In this step we can also inspect and collect the capability information which is provided.
The capability information will be processed in the artifact inspection stage by our existing ArtifactInspector class. Here we will look for the @Capability annotation on the plugin and collect all the information. The capability information will be stored in PluginClass, which is a field of ArtifactClasses.
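As a rough sketch of what this inspection step could look like, the annotation can be read reflectively from the plugin class. The code below does not use the real ArtifactInspector; `CapabilityInspector`, `ExamplePlugin` and `UnannotatedPlugin` are hypothetical names, and the annotation is redeclared locally only so that the example compiles on its own.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Local stand-in for the proposed annotation so the sketch is self-contained. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Capability {
  String[] value();
}

/** Hypothetical plugin used only for illustration. */
@Capability({"mode_native", "spark_2", "PII"})
class ExamplePlugin { }

/** Hypothetical plugin that declares no capabilities. */
class UnannotatedPlugin { }

/** Hypothetical inspector (not the real ArtifactInspector). */
final class CapabilityInspector {

  private CapabilityInspector() { }

  /** Reads the annotation reflectively; an empty set means nothing was declared. */
  static Set<String> inspect(Class<?> pluginClass) {
    Capability annotation = pluginClass.getAnnotation(Capability.class);
    if (annotation == null) {
      return Collections.emptySet();
    }
    Set<String> capabilities = new LinkedHashSet<>();
    for (String value : annotation.value()) {
      // Values are case insensitive, so store them in a canonical lower-case form.
      capabilities.add(value.toLowerCase(Locale.ROOT));
    }
    return capabilities;
  }
}
```

Because the annotation has runtime retention, `getAnnotation` can see it once the plugin class is loaded. How an empty result is mapped to the default capability is left to the caller in this sketch.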
```java
/**
 * Contains information about a plugin class.
 */
@Beta
public class PluginClass {
  private final String type;
  private final String name;
  private final String description;
  private final String className;
  private final String configFieldName;
  private final Map<String, PluginPropertyField> properties;
  private final Set<String> endpoints;
  private final Set<String> compatibility; // all the capabilities of this plugin
}
```
Storage
We need to store the capability information at the plugin level, since one artifact can contain any number of plugins and each of them will have its own capability information.
Approach 1: Artifact Store
We are storing the plugin capability information in PluginClass which is contained in ArtifactMeta. Hence, we can easily store the capability information of a plugin in the ArtifactStore as a part of ArtifactMeta itself. This will allow us to store all the plugin information in one store.
Approach 2: Metadata
The capability information can also be stored as system metadata of the plugin by the ArtifactSystemMetadataWriter. Since Plugin is not an EntityId in CDAP, we will use the Metadata system’s ability to store metadata for custom entities, where Plugin will be a custom entity under Artifact.
The custom entity hierarchy will be as follows:
```
namespace=<namespace-name> | artifact=<artifact-name> | version=<artifact-version> | plugin=<plugin-name>
```
Note: | and = are used only as separators here for readability. In the actual serialized form we use byte-length encoding.
This capability information will be stored as a metadata property where the key will be ‘capability’ and the value will be a comma separated list of unique strings representing the capabilities.
```
capability = mode_sandbox, mode_cloud, pii
```
Note: = and , are our standard key-value and individual value separators in Metadata storage.
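The property value format can be illustrated with a small encoder and decoder. This is only a sketch under assumptions: `CapabilityMetadataCodec` is a hypothetical name, values are lower-cased and sorted to make the output deterministic, and the real metadata writer may serialize differently.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical codec for the 'capability' metadata property value. */
final class CapabilityMetadataCodec {

  static final String KEY = "capability";

  private CapabilityMetadataCodec() { }

  /** Joins unique, normalized capabilities into a comma separated string. */
  static String encode(Collection<String> capabilities) {
    return String.join(",", normalize(capabilities.stream()));
  }

  /** Splits a stored value back into the set of capabilities. */
  static Set<String> decode(String value) {
    return normalize(Arrays.stream(value.split(",")));
  }

  // Trims, lower-cases, drops empty tokens and removes duplicates (sorted for stable output).
  private static Set<String> normalize(Stream<String> values) {
    return values
        .map(v -> v.trim().toLowerCase(Locale.ROOT))
        .filter(v -> !v.isEmpty())
        .collect(Collectors.toCollection(TreeSet::new));
  }
}
```

Decoding tolerates the spaces shown in the example value, so `mode_sandbox, mode_cloud, pii` and `mode_cloud,mode_sandbox,pii` represent the same set.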
Comparison
The table below compares the two approaches:
Approach | Pros | Cons |
---|---|---|
Approach 1: Artifact Store | | |
Approach 2: Metadata | | |
Filtering
CDAP will support filtering of plugins at two levels. The first is for administrators, to enforce strict environment and business rules; this will be done through a configuration property in cdap-site.xml. The second is for data analysts, to help them see plugins which are compatible with different environments and rules.
Admin Level Filtering
In cdap-site.xml we will add a new property which specifies the requirements a plugin needs to meet to be displayed/enabled. This configuration will be used by CDAP administrators to enforce strict environment and business level rules where they want to display/enable only certain plugins. Examples of this are Scenario 2.2 and Scenario 2.3.
```xml
<property>
  <name>plugin.required.capabilities</name>
  <value>mode_cloud</value>
  <description>
    Comma separated list of capabilities values which are required by default.
    If a system level capability category is undefined no capability is required
    in that category and all plugins for that category will be displayed/enabled.
  </description>
</property>
```
Capabilities specified in this configuration will be considered mandatory, and only plugins which have all of these capabilities will be displayed. Plugins which lack any of the capabilities specified here will be filtered out. Please see the Filtering Examples section for examples of different use cases.
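A minimal sketch of this rule, assuming the raw property value is available as a string: `RequiredCapabilityFilter` and `isVisible` are hypothetical names, and matching is done on lower-cased values because capability values are case insensitive.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical filter built from the raw value of plugin.required.capabilities. */
final class RequiredCapabilityFilter {

  private final Set<String> required = new LinkedHashSet<>();

  /** @param configValue comma separated capabilities; null or empty means no requirement */
  RequiredCapabilityFilter(String configValue) {
    if (configValue != null) {
      for (String token : configValue.split(",")) {
        String normalized = token.trim().toLowerCase(Locale.ROOT);
        if (!normalized.isEmpty()) {
          required.add(normalized);
        }
      }
    }
  }

  /** A plugin is visible only if it has every required capability. */
  boolean isVisible(Set<String> pluginCapabilities) {
    Set<String> normalized = new LinkedHashSet<>();
    for (String capability : pluginCapabilities) {
      normalized.add(capability.toLowerCase(Locale.ROOT));
    }
    return normalized.containsAll(required);
  }
}
```

With the value `mode_cloud, PII`, a plugin declaring `mode_cloud` and `PII` is visible, while plugins declaring only `mode_native, spark_2, PII` or only `mode_cloud, mode_native` are filtered out. With an empty value every plugin is visible.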
Any changes to this will require a CDAP restart and that is acceptable since we expect such changes to happen very infrequently.
If a pipeline was created before a capability became required (i.e. the capability was added as a requirement in the above configuration after the pipeline was deployed), the pipeline will start failing with an appropriate error message (User Story 4).
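One possible shape for such an early check is sketched below. The class name `PluginCapabilityValidator`, the exception type and the message wording are all assumptions made for illustration; both sets are expected to be normalized to lower case already.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical pre-run check that fails fast with a descriptive message. */
final class PluginCapabilityValidator {

  private PluginCapabilityValidator() { }

  /**
   * Throws if the plugin lacks any required capability.
   * Both sets are assumed to contain lower-case values.
   */
  static void validate(String pluginName, Set<String> pluginCapabilities, Set<String> required) {
    Set<String> missing = new TreeSet<>(required);
    missing.removeAll(pluginCapabilities);
    if (!missing.isEmpty()) {
      throw new IllegalStateException(String.format(
          "Plugin '%s' cannot run in this CDAP instance: it lacks required capabilities %s "
              + "(plugin.required.capabilities = %s).",
          pluginName, missing, new TreeSet<>(required)));
    }
  }
}
```

Naming the missing capabilities and the configuration property in the message is the point of the sketch: it addresses the cryptic failures described in Scenario 3.1 and Scenario 3.2.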
Pipeline Developer Filtering
The second level of filtering is provided to pipeline developers. A pipeline developer can further filter the available plugins to see only the plugins which match a capability.
Approach 1 (Selected)
Currently, the available plugins are retrieved by calling:
```
GET /namespaces/{namespace-id}/artifacts/{artifact-name}/versions/{artifact-version}/extensions/{plugin-type}
```
This call returns a summary of Plugins for the provided plugin-type. The result will now only include plugins which meet the required capability defined by the configuration plugin.required.capabilities. The response will now contain the capability options of the plugin.
```json
[
  {
    "name": "Plugin1",
    "type": "dummy",
    "description": "This is plugin1",
    "className": "co.cask.cdap.internal.app.runtime.artifact.plugin.Plugin1",
    "artifact": {
      "name": "plugins",
      "version": "1.0.0",
      "scope": "USER"
    },
    "capability": [
      "mode_cloud",
      "spark_2"
    ]
  },
  {
    "name": "Plugin2",
    "type": "dummy",
    "description": "This is plugin2",
    "className": "co.cask.cdap.internal.app.runtime.artifact.plugin.Plugin2",
    "artifact": {
      "name": "plugins",
      "version": "2.0.0",
      "scope": "USER"
    },
    "capability": [
      "mode_cloud",
      "spark_1"
    ]
  }
]
```
(Edwin Elia: Please provide feedback for the below UI based design decision)
The client/UI will be responsible for processing the capability lists of all the plugins and, if needed, rendering a view which shows all the unique capability values to allow further filtering.
A pipeline developer will be able to further filter the available plugins and see only plugins which have a certain capability. This capability option will be passed as a query parameter to the above call.
```
GET /namespaces/{namespace-id}/artifacts/{artifact-name}/versions/{artifact-version}/extensions/{plugin-type}?capability=spark_2
```
This will further filter out the plugins from the above list to display only the plugins which have spark_2 capability.
```json
[
  {
    "name": "Plugin1",
    "type": "dummy",
    "description": "This is plugin1",
    "className": "co.cask.cdap.internal.app.runtime.artifact.plugin.Plugin1",
    "artifact": {
      "name": "plugins",
      "version": "1.0.0",
      "scope": "USER"
    },
    "capability": [
      "mode_cloud",
      "spark_2"
    ]
  }
]
```
Including the capability information of each plugin in the response is beneficial, as it allows the UI/client to subdivide or label individual plugins based on their capabilities.
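The client-side processing described here amounts to collecting the distinct values across all returned plugins. A minimal sketch follows, assuming the response has already been parsed into a map from plugin name to its capability list; `CapabilityFacets` and `uniqueCapabilities` are hypothetical names.

```java
import java.util.Collection;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical client-side helper that derives the filter options to display. */
final class CapabilityFacets {

  private CapabilityFacets() { }

  /**
   * @param pluginCapabilities plugin name mapped to the capabilities listed in the response
   * @return the distinct capability values, sorted for stable display order
   */
  static SortedSet<String> uniqueCapabilities(
      Map<String, ? extends Collection<String>> pluginCapabilities) {
    SortedSet<String> unique = new TreeSet<>();
    for (Collection<String> capabilities : pluginCapabilities.values()) {
      unique.addAll(capabilities);
    }
    return unique;
  }
}
```

For the two-plugin response shown earlier, this yields `mode_cloud`, `spark_1` and `spark_2`, each of which can then be offered as a value for the `capability` query parameter.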
Note:
For simplicity, in 5.1 we will only support the second level of filtering on one condition i.e. a user can only pass one capability option as query parameters. The current design supports providing multiple capability options as query parameters but we don’t have a known use case for it.
The values provided in plugin.required.capabilities take precedence over the filtering parameters specified by the pipeline developer as query parameters. If a query parameter requests a capability which is not in plugin.required.capabilities, the call will return an empty result even if there are compatible plugins known in the system.
Approach 2 (Considered)
As mentioned before, the available plugins are currently retrieved by calling:
```
GET /namespaces/{namespace-id}/artifacts/{artifact-name}/versions/{artifact-version}/extensions/{plugin-type}
```
This call returns a summary of Plugins for the provided plugin-type.
We will add an additional REST API which will provide all the capability options which are known in the system.
```
GET /namespaces/{namespace-id}/capabilities
```
returns
```json
[
  "mode_cloud",
  "spark_2",
  "spark_1",
  "mode_native"
]
```
The values from this list can be passed as a query parameter:
```
GET /namespaces/{namespace-id}/artifacts/{artifact-name}/versions/{artifact-version}/extensions/{plugin-type}?capability=spark_2
```
This will further filter out the plugins from the above list to display only the plugins which have spark_2 capability.
```json
[
  {
    "name": "Plugin1",
    "type": "dummy",
    "description": "This is plugin1",
    "className": "co.cask.cdap.internal.app.runtime.artifact.plugin.Plugin1",
    "artifact": {
      "name": "plugins",
      "version": "1.0.0",
      "scope": "USER"
    },
    "capability": [
      "mode_cloud",
      "spark_2"
    ]
  }
]
```
Comparison
Approach | Pros | Cons |
---|---|---|
Approach 1 | | |
Approach 2 | | |
Filtering Examples
To understand how the configuration and filtering will work in practice, let us consider a few known use cases and see how they can be addressed through the above design.
Cloud
CDAP is running in a cloud environment and we only want plugins capable of running in the cloud to be displayed.
plugin.required.capabilities | CDAP Table Source @Compat(Mode.NATIVE, Spark.V2, "PII") | BigTable Source @Compat(Mode.CLOUD, "PII") | AWS S3 Source @Compat(Mode.CLOUD, Mode.NATIVE) |
---|---|---|---|
mode_cloud | Filtered out | Visible | Visible |
On-Prem Hadoop
CDAP is running in a Hadoop environment where connections to outside cloud services or servers are restricted, and we only want to allow plugins capable of running in Hadoop.
plugin.required.capabilities | CDAP Table Source @Compat(Mode.NATIVE, Spark.V2, "PII") | BigTable Source @Compat(Mode.CLOUD, "PII") | AWS S3 Source @Compat(Mode.CLOUD, Mode.NATIVE) |
---|---|---|---|
mode_native | Visible | Filtered out | Visible |
On-Prem Hadoop
CDAP is running in a Hadoop environment and we only want to allow plugins capable of running in both Hadoop and cloud.
plugin.required.capabilities | CDAP Table Source @Compat(Mode.NATIVE, "PII") | BigTable Source @Compat(Mode.CLOUD, "PII") | AWS S3 Source @Compat(Mode.CLOUD, Mode.NATIVE) |
---|---|---|---|
mode_native, mode_cloud | Filtered out | Filtered out | Visible |
Sandbox
CDAP is running in sandbox and we want to allow plugins which are capable of running in native or cloud mode.
plugin.required.capabilities | CDAP Table Source @Compat(Mode.NATIVE, Spark.V2, "PII") | BigTable Source @Compat(Mode.CLOUD, "PII") | AWS S3 Source @Compat(Mode.CLOUD, Mode.NATIVE) |
---|---|---|---|
<empty> | Visible | Visible | Visible |
Note: When plugin.required.capabilities is empty, the instance does not require any capability in any category, and hence all plugins will be shown.
Sandbox: Spark 2
CDAP is running in sandbox and we want to allow plugins which are capable of running with Spark 2.
plugin.required.capabilities | CDAP Table Source @Compat(Mode.NATIVE, Spark.V2, "PII") | BigTable Source @Compat(Mode.CLOUD, "PII") | AWS S3 Source @Compat(Mode.CLOUD, Mode.NATIVE) |
---|---|---|---|
spark_2 | Visible | Filtered out | Filtered out |
Note: Here plugin.required.capabilities does not specify any requirement for Mode, so the instance does not require any mode capability and plugins supporting any mode are candidates to be displayed. It does define a Spark requirement, however, so only plugins which are compatible with Spark 2 will be shown.
Sandbox: Compliance Required
CDAP is running in sandbox and we want to allow plugins which are capable of running in native or cloud mode, but we also want to satisfy a compliance need, and hence plugins must have the PII capability.
plugin.required.capabilities | CDAP Table Source @Compat(Mode.NATIVE, Spark.V2, "PII") | BigTable Source @Compat(Mode.CLOUD, "PII") | AWS S3 Source @Compat(Mode.CLOUD, Mode.NATIVE) |
---|---|---|---|
PII | Visible | Visible | Filtered out |
Cloud: Compliance Required
CDAP is running in the cloud and we want to allow only plugins which are capable of running in the cloud and are PII compliant.
plugin.required.capabilities | CDAP Table Source @Compat(Mode.NATIVE, Spark.V2, "PII") | BigTable Source @Compat(Mode.CLOUD, "PII") | AWS S3 Source @Compat(Mode.CLOUD, Mode.NATIVE) |
---|---|---|---|
mode_cloud, PII | Filtered out | Visible | Filtered out |
API changes
New Programmatic APIs
Compatible Annotation
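```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/**
 * Annotates the different environments and versions in which the element is supported.
 */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface Compatible {
  String[] value();
}
```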
Deprecated Programmatic APIs
None
Updated Programmatic APIs
None
New REST APIs
None
Deprecated REST APIs
None
Updated REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
v3/namespaces/{namespace-id}/artifacts/{artifact-name}/versions/{artifact-version}/extensions/{plugin-type}?capability=mode_cloud | GET | Returns the plugin information, including its capabilities | no change | |
CLI Impact or Changes
- list artifact plugins <artifact-name> <artifact-version> <plugin-type> [<scope>] will be modified to take another parameter to filter just like the REST API.
UI Impact or Changes
- UI will need to support filtering: https://issues.cask.co/browse/CDAP-14002
Security Impact
None.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
1 | Deploying a plugin which does not have a Capability annotation | Plugin must be deployed and should be considered capable of all system defined capabilities |
2 | Deploying a plugin which has a Capability annotation | Plugin must be deployed and should be capable of only the options defined in the annotation |
3 | Deploying a plugin with two Capability annotations (which might or might not have the same options) | Plugin must be deployed and should be capable of the union of the capabilities defined in the annotations |
4 | Redeploying a plugin with an updated Capability annotation | Plugin must be redeployed and its capability information should be updated |
5 | Missing or empty plugin.required.capabilities in cdap-site.xml | All plugins in the system should be displayed |
6 | plugin.required.capabilities = mode_cloud | Only plugins with cloud capability should be displayed |
7 | plugin.required.capabilities = mode_native | Only plugins with native capability should be displayed |
8 | plugin.required.capabilities = spark_1 | Plugins which are capable of running in any mode and with Spark 1 should be displayed |
9 | plugin.required.capabilities = mode_native, mode_cloud | Plugins which are capable of both cloud and native mode should be displayed |
10 | plugin.required.capabilities = mode_native, spark_1 | Plugins which are capable of running in native mode and with Spark 1 should be displayed |
11 | plugin.required.capabilities = mode_native, pii | Plugins which are capable of running in native mode and are PII compatible should be displayed |
12 | plugin.required.capabilities = mode_cloud and the following call is made GET /namespaces/{namespace-id}/artifacts/{artifact-name}/versions/{artifact-version}/extensions/{plugin-type}?capability=mode_native | Should return an empty result, as the system has only enabled cloud compatible plugins |
13 | plugin.required.capabilities = <empty> and the following call is made GET /namespaces/{namespace-id}/artifacts/{artifact-name}/versions/{artifact-version}/extensions/{plugin-type}?capability=mode_native | Should only return plugins which are capable of running in native mode |
14 | plugin.required.capabilities = mode_cloud, mode_native and the following call is made GET /namespaces/{namespace-id}/artifacts/{artifact-name}/versions/{artifact-version}/extensions/{plugin-type}?capability=mode_native | Should show plugins which are capable of both native and cloud mode. Note: The second level of filtering is applied on top of the first level result set. When plugin.required.capabilities = mode_cloud, mode_native, only plugins which support both modes are enabled, and the second level of filtering is applied to this set. Plugins which are only capable of native mode will not be displayed in the result of this REST call, since they are not enabled in the first place due to the requirement set in plugin.required.capabilities. (Scenario 2.3) |
Releases
Release 5.1
Related Work
Future work
Complex Filtering
- Support filtering based on multiple capability query parameters.
Dynamic Filtering
- Support tagging and filtering of plugins on the fly
Standardization
- Support for standardization of plugin capabilities