Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction 

Prior to CDAP 6.0.0, extensions were often added as CDAP applications. Data Prep, Analytics, and Reports were all implemented as CDAP applications, with Data Prep and Analytics running in each namespace where they are required, and Reports running in the system namespace. Running an application in each namespace wastes resources, so it is desirable to move Data Prep and Analytics into the system namespace. However, each of them has namespaced entities, so once moved to the system namespace they both need some form of namespacing capability. The Reports application also keeps part of its logic in the application and part in the CDAP system itself, due to requirements around accessing system information (run records). In order to cleanly implement these extensions as applications, additional functionality must be provided for system applications.

Goals

Manage a single Data Prep, Analytics, and Reports application for use across all namespaces. 

User Stories 

  • As a CDAP admin, I want to manage a single system application and not an application per namespace
  • As a CDAP admin, I want to be able to dynamically scale the Services and Spark Services run by system applications
  • As a CDAP admin, I do not want users to be able to create Analytics experiments in a namespace that does not exist
  • As a CDAP user, I want Analytics experiments and models to be local to a namespace
  • As a CDAP user, I want Analytics experiments and models in a namespace to be deleted when the namespace is deleted
  • As a CDAP system developer, I want to be able to receive notifications of program lifecycle events (pending, starting, running, etc)
  • As a CDAP system developer, I want my namespaced resources to be deleted when the namespace is deleted
  • As a CDAP system developer, I want to be able to instantiate plugins using an artifact from another namespace

Design

This design will focus on the needs that have been brought up by the Data Prep, Analytics, and Reports applications. Future system applications may require additional functionality, but that is out of scope of this design.

All system apps will be moved to run in the system namespace instead of having one application per namespace. The system apps also need to be changed to be namespace-aware, meaning they need to explicitly take a namespace in their requests and store data in such a way that namespace isolation is achieved. Reports is cross namespace by nature and does not need to worry about namespace isolation.

The following entities need to be namespaced:

App          Entity
DataPrep     connection
DataPrep     workspace
Analytics    experiment
Analytics    split
Analytics    model


The introduction of a namespace concept to application specific entities (connections, experiments, etc) is explicitly handled by each system app. The system apps are responsible for managing namespaces themselves, using a data model that meets their needs. CDAP will be extended to ensure apps have the required capabilities to do this.

Data Model

System datasets have a usage pattern where they use a Table based dataset as a metadata and entity store for CRUD operations, and a FileSet based dataset as a blob store. The dataset types used before 6.0.0 are:

App          Type                       Usage
DataPrep     Table                      Connection entities
DataPrep     Table                      Recipe entities
DataPrep     FileSet                    Index files for the File connection type
DataPrep     custom WorkspaceDataset    Workspace entities and metadata. Just a thin wrapper around a Table
Reports      FileSet                    Report files
Analytics    IndexedTable               Experiment entities and metadata
Analytics    IndexedTable               Model entities and metadata
Analytics    FileSet                    Trained model files
Analytics    PartitionedFileSet         Data splits and metadata

System applications will require these dataset types, or dataset types that are comparable in functionality.

In order to achieve namespace isolation and automatic entity deletion when a namespace is deleted, each system app will create a dataset instance in every namespace where it needs one. For example, a connection in the 'default' namespace will be stored in a Table in the 'default' namespace. In this way, isolation and automatic deletion are handled by the platform; the app only needs to use the correct dataset.
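The per-namespace layout can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: NamespacedStoreSketch and all of its methods are hypothetical stand-ins for the platform, not CDAP APIs, and the dataset name "connections" is made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for platform-managed datasets (not a CDAP API).
public class NamespacedStoreSketch {
  // namespace -> (dataset name -> (row key -> value))
  private final Map<String, Map<String, Map<String, String>>> namespaces = new HashMap<>();

  public void createNamespace(String namespace) {
    namespaces.putIfAbsent(namespace, new HashMap<>());
  }

  // Deleting the namespace drops every dataset in it; the app does no cleanup itself.
  public void deleteNamespace(String namespace) {
    namespaces.remove(namespace);
  }

  // Writes into the dataset instance that lives in the given namespace,
  // creating that instance on first use.
  public void put(String namespace, String dataset, String key, String value) {
    Map<String, Map<String, String>> datasets = namespaces.get(namespace);
    if (datasets == null) {
      throw new IllegalArgumentException("Namespace " + namespace + " not found");
    }
    datasets.computeIfAbsent(dataset, d -> new HashMap<>()).put(key, value);
  }

  // Returns null when the namespace, dataset, or key does not exist.
  public String get(String namespace, String dataset, String key) {
    Map<String, Map<String, String>> datasets = namespaces.get(namespace);
    if (datasets == null) {
      return null;
    }
    Map<String, String> table = datasets.get(dataset);
    return table == null ? null : table.get(key);
  }

  public static void main(String[] args) {
    NamespacedStoreSketch store = new NamespacedStoreSketch();
    store.createNamespace("default");
    store.createNamespace("other");
    store.put("default", "connections", "conn1", "jdbc");
    System.out.println(store.get("default", "connections", "conn1")); // prints "jdbc"
    System.out.println(store.get("other", "connections", "conn1"));   // prints "null"
    store.deleteNamespace("default");
    System.out.println(store.get("default", "connections", "conn1")); // prints "null"
  }
}
```

Because each entity is keyed by the namespace that owns its dataset, the same dataset name can be reused in every namespace without collisions.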

REST

Namespace needs to be added as a prefix to all Data Prep and Analytics endpoints that manage namespaced entities. For example, Data Prep connection and workspace endpoints will all be prefixed by:

/v2/namespaces/<namespace>/connections
/v2/namespaces/<namespace>/workspaces

Similarly, all Analytics endpoints will be prefixed by:

/v2/namespaces/<namespace>/experiments

Note that this results in fairly confusing full paths, as versions and namespaces appear in multiple parts of the path. For example, to get the list of connections in the default namespace, the API would be:

GET /v3/namespaces/system/apps/dataprep/services/service/methods/v2/namespaces/default/connections

and to get the list of experiments in the default namespace, the API would be:

GET /v3/namespaces/system/apps/ModelManagementApp/spark/ModelManagerService/methods/v2/namespaces/default/experiments
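The two layers of these paths (the CDAP route to the service method, then the app's own namespaced route) can be kept apart with a small helper. This is a minimal sketch: the class SystemAppPaths and both method names are hypothetical and exist only for illustration.

```java
// Hypothetical helper for composing system app paths (not part of CDAP).
public class SystemAppPaths {

  // Inner layer: the path handled by the system app itself.
  public static String namespaced(String namespace, String entity) {
    return String.format("/v2/namespaces/%s/%s", namespace, entity);
  }

  // Outer layer: the CDAP route to a service method of an app in the system namespace.
  // programType is "services" or "spark", matching the two examples above.
  public static String servicePath(String app, String programType, String program, String methodPath) {
    return String.format("/v3/namespaces/system/apps/%s/%s/%s/methods%s",
                         app, programType, program, methodPath);
  }

  public static void main(String[] args) {
    // prints "/v3/namespaces/system/apps/dataprep/services/service/methods/v2/namespaces/default/connections"
    System.out.println(servicePath("dataprep", "services", "service",
                                   namespaced("default", "connections")));
  }
}
```

The outer namespace is always 'system' because that is where the app runs; only the inner namespace varies per request.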

Programmatic APIs

In order to implement namespace logic, system apps need to be able to perform several operations that are not supported prior to CDAP 6.0.0.

  1. Check namespace existence – System apps need to be able to check that a namespace exists in CDAP before managing any custom entities in that namespace.

  2. Dataset admin operations in another namespace – System apps need to be able to create a dataset in a specific namespace if it doesn't already exist. DatasetManager methods are not namespace aware and currently only operate within the namespace of the application.

  3. Plugin operations in another namespace – DataPrep needs to be able to instantiate UDDs (User Defined Directives) in order to execute directive lists. This means system applications need to be able to instantiate plugins whose artifacts are user scoped in some namespace. To give a more concrete example, suppose a 'my-custom-directive' UDD is deployed as a user artifact in namespace 'default'. The Data Prep system application needs to be able to instantiate that directive even though the app is running in the 'system' namespace.

Approach 1

In this approach, several existing interfaces are enhanced with namespaced versions of their existing methods. DatasetContext already has a way to get a dataset from another namespace, but DatasetManager cannot check existence of or create a dataset in another namespace. DatasetManager will need to be enhanced with namespaced versions of its current methods:

public interface DatasetManager {
  boolean datasetExists(String name) throws DatasetManagementException;
  boolean datasetExists(String namespace, String name) throws DatasetManagementException;
  ...
  void createDataset(String name, String type, DatasetProperties properties) throws DatasetManagementException;
  void createDataset(String namespace, String name, String type, DatasetProperties properties) throws DatasetManagementException;
  ...
}

Similarly, the ArtifactManager interface available to Service programs must also be modified.

public interface ArtifactManager {
  List<ArtifactInfo> listArtifacts() throws IOException;
  List<ArtifactInfo> listArtifacts(String namespace) throws IOException;  

  CloseableClassLoader createClassLoader(ArtifactInfo artifactInfo,
                                         @Nullable ClassLoader parentClassLoader) throws IOException;
}

The Admin interface will be enhanced to check for namespace existence:

public interface Admin extends DatasetManager, SecureStoreManager, MessagingAdmin {
  boolean namespaceExists(String namespace);
}

This has a side benefit of bringing more consistency to the APIs instead of having mixed APIs that sometimes allow cross namespace operations and sometimes don't.

System app service methods would typically look something like:

@Path("/v2")
public class ModelManagerServiceHandler implements SparkHttpServiceHandler {

  @GET
  @Path("/namespaces/{namespace}/experiments")
  public void listExperiments(HttpServiceRequest request, HttpServiceResponder responder,
                              @PathParam("namespace") String namespace) {
    Admin admin = getContext().getAdmin();
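    if (!admin.namespaceExists(namespace)) {
      responder.sendStatus(404, "Namespace " + namespace + " not found.");
      // Stop here; otherwise the handler would go on to create a dataset in a missing namespace.
      return;
    }
    // Check and create the dataset in the requested namespace, not the app's own (system) namespace.
    if (!admin.datasetExists(namespace, EXPERIMENTS_DATASET)) {
      admin.createDataset(namespace, EXPERIMENTS_DATASET, "table", EXPERIMENTS_DATASET_PROPERTIES);
    }
    getContext().execute(datasetContext -> {
      Table experiments = datasetContext.getDataset(namespace, EXPERIMENTS_DATASET);
      ...
      responder.sendJson(...);
    });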
  }
  ...
}

Approach 2

In this approach, HttpServiceContext and SparkHttpServiceContext are enhanced to have a new method that returns another instance of HttpServiceContext that uses a different namespace:

public interface Namespaced<T> {
  T namespace(String namespace) throws NamespaceNotFoundException;
}


public interface HttpServiceContext extends Namespaced<HttpServiceContext>, ... {

  ...

  /**
   * Create a context for another namespace, allowing access to entities from that namespace.
   */
  HttpServiceContext namespace(String namespace) throws NamespaceNotFoundException;
}

The service endpoint looks similar to approach 1 except the namespace parameter is used in just one place and no explicit check for namespace existence is done:

@Path("/v2")
public class ModelManagerServiceHandler implements SparkHttpServiceHandler {

  @GET
  @Path("/namespaces/{namespace}/experiments")
  public void listExperiments(HttpServiceRequest request, HttpServiceResponder responder,
                              @PathParam("namespace") String namespace) {
    HttpServiceContext namespaceContext = getContext().namespace(namespace);
    Admin admin = namespaceContext.getAdmin();
    if (!admin.datasetExists(EXPERIMENTS_DATASET)) {
      admin.createDataset(EXPERIMENTS_DATASET, "table", EXPERIMENTS_DATASET_PROPERTIES);
    }
    namespaceContext.execute(datasetContext -> {
      Table experiments = datasetContext.getDataset(EXPERIMENTS_DATASET);
      ...
      responder.sendJson(...);
    });
  }
  ...
}

This approach makes it impossible for a developer to forget to check for namespace existence. It also enforces that every method in the context interfaces works across namespaces instead of having a mix where some operations can be done across namespaces and others cannot. However, it may be confusing to have both this method and the existing methods that take a namespace as a parameter. It still makes sense to keep both, in case datasets from different namespaces need to be accessed in the same transaction.
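The way Approach 2 folds the existence check into obtaining the context can be shown with a self-contained sketch. Everything here is a hypothetical stand-in: the nested Namespaced interface and NamespaceNotFoundException mirror the names proposed above, while Context and its set of known namespaces are assumptions made only so the example runs without CDAP.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Self-contained illustration of the Namespaced<T> idea; not the CDAP implementation.
public class NamespacedContextSketch {

  public static class NamespaceNotFoundException extends Exception {
    public NamespaceNotFoundException(String namespace) {
      super("Namespace " + namespace + " not found.");
    }
  }

  public interface Namespaced<T> {
    T namespace(String namespace) throws NamespaceNotFoundException;
  }

  // Hypothetical context that knows which namespaces exist and which one it is bound to.
  public static class Context implements Namespaced<Context> {
    private final Set<String> knownNamespaces;
    private final String current;

    public Context(Set<String> knownNamespaces, String current) {
      this.knownNamespaces = knownNamespaces;
      this.current = current;
    }

    public String getNamespace() {
      return current;
    }

    // The existence check happens here, so callers cannot obtain a context
    // for a namespace that does not exist.
    @Override
    public Context namespace(String namespace) throws NamespaceNotFoundException {
      if (!knownNamespaces.contains(namespace)) {
        throw new NamespaceNotFoundException(namespace);
      }
      return new Context(knownNamespaces, namespace);
    }
  }

  public static void main(String[] args) {
    Context system = new Context(new HashSet<>(Arrays.asList("system", "default")), "system");
    try {
      System.out.println(system.namespace("default").getNamespace()); // prints "default"
      system.namespace("missing"); // throws NamespaceNotFoundException
    } catch (NamespaceNotFoundException e) {
      System.out.println(e.getMessage()); // prints "Namespace missing not found."
    }
  }
}
```

The original context is left unchanged, so a handler can still hold contexts for several namespaces at once.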

Upgrade

When upgrading from CDAP 5.1.x to 6.0.0, no additional work needs to be done regarding datasets, as they will already be in their respective namespaces. Old versions of the DataPrep and Analytics apps will remain in any namespace they were enabled in, so it should be documented that users should delete these apps. CDAP could also provide a tool to do this cleanup. 

API changes

New Programmatic APIs

There will be new programmatic APIs to allow more cross namespace access.

Deprecated Programmatic APIs

None

New REST APIs

There are no new CDAP APIs, but almost every Data Prep and Analytics endpoint will be modified to include namespace in the path.

Deprecated REST API

None

CLI Impact or Changes

  • None

UI Impact or Changes

  • UI needs to be updated to use the new REST APIs.

Security Impact 

Care needs to be taken to ensure that system applications are authorized to access CDAP entities in other namespaces.

Impact on Infrastructure Outages 

None

Test Scenarios

Test ID    Test Description    Expected Results

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3


Future work
