...

  • Flexible in launching the runtime in any targeted cluster, whether in the cloud or on premises, and whether it is the same cluster CDAP runs in or a different one.
    • A CDAP instance can use different clouds / clusters simultaneously for different program executions
  • Flexible in integrating with different runtime frameworks (MR, Spark, Beam/DataFlow, whatever comes next...).
  • Runtime is autonomous.
    • Once launched, it manages every runtime aspect of a program execution.
      • Logs, metrics, metadata collection are done within that runtime
    • The runtime shouldn't initiate any communication to the CDAP master
      • There are practical concerns about scalability and availability
        • Scalability
          • The scaling of the CDAP master should be based only on the number of concurrent program executions, rather than on individual program logic
        • Availability
          • Although the CDAP master should support HA, it shouldn't be in the critical path of program execution. This means that if the CDAP master is down, programs that are already running should keep running without any blockage or error
    • The runtime provides a REST endpoint for the CDAP master to periodically poll for information updates
      • Program states, metadata, workflow tokens, etc.
    • The runtime provides a REST endpoint for the CDAP master to control the runtime
      • Suspend / Terminate
  • The real-time log and metrics collection mechanism is pluggable (via common services as in current CDAP, per program run, relying on the cloud provider, ...)
    • This also means it can be disabled (no-op) based on launch-time configuration.
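The "runtime is autonomous, master only polls / controls" idea above can be sketched as a tiny HTTP server embedded in the runtime. This is an illustrative sketch only: the endpoint paths (`/v1/runtime/status`, `/v1/runtime/terminate`) and the JSON shape are assumptions, not CDAP's actual REST API.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

/**
 * Sketch of an autonomous runtime exposing REST endpoints for the CDAP
 * master to poll for updates and to control the runtime. Endpoint paths
 * and payload format are made up for illustration.
 */
class RuntimeEndpointSketch {
    private volatile String programState = "RUNNING";
    private HttpServer server;

    /** Starts the server on an ephemeral port and returns that port. */
    int start() throws Exception {
        server = HttpServer.create(new InetSocketAddress(0), 0);
        // Polled periodically by the CDAP master for information updates.
        server.createContext("/v1/runtime/status", exchange -> {
            byte[] body = ("{\"state\":\"" + programState + "\"}")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        // Control endpoint: the master asks the runtime to terminate itself.
        server.createContext("/v1/runtime/terminate", exchange -> {
            programState = "KILLED";
            exchange.sendResponseHeaders(200, -1); // no response body
            exchange.close();
        });
        server.start();
        return server.getAddress().getPort();
    }

    void stop() { server.stop(0); }

    String state() { return programState; }
}
```

Note the direction of communication: the runtime never calls the master; it only answers the master's poll and control requests, which keeps the master out of the execution critical path.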

Design

...

CDAP Architecture

The guiding principles of the architecture are as follows:

  • A stable and scalable core platform that provides essential functionalities
    • An application API for Data Application development
    • Manages and operates Data Applications
    • A central catalog for entities' metadata
  • Provide a well-defined API to support additional capabilities
    • Enable fast iteration of new ideas
    • Allow running data / compute intensive jobs for the system (e.g. Spark, Elasticsearch, etc.)
    • Individual extended systems can be turned on / off independently


CDAP Core System

The major components of the CDAP core are as follows:

  • Metadata Catalog
    • Responsible for collecting, storing, serving and exporting metadata as defined by users and applications
  • Artifacts and Application Repository
    • Responsible for artifact and application deployment, update (versioning), removal, and surfacing of that information
  • Runtime Manager
    • Responsible for all aspects of the program lifecycle, including resource provisioning, execution, run records, etc.
  • Scheduler
    • Responsible for launching programs based on triggers and constraints
  • Transaction
    • Apache Tephra to provide transactional operations on HBase
    • The Transaction Service may not be available to the application runtime, depending on the target execution cluster, and we may eventually remove it
  • Metrics and Monitoring
    • Responsible for collecting and querying metrics
    • May integrate with external metrics systems in the future
  • TMS
    • Provides a messaging service for event-based design, to decouple components
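The decoupling that a topic-based messaging service like TMS provides can be illustrated with a minimal in-process topic bus. This is an illustrative stand-in only, not the actual TMS API: the point is that producers (e.g. a runtime publishing state changes) never need to know which components consume the events.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/**
 * In-process stand-in for a topic-based messaging service (not the real
 * TMS API). Producers publish to a topic; subscribers on that topic are
 * notified, with neither side knowing about the other.
 */
class TopicBus {
    private final Map<String, List<Consumer<String>>> subscribers = new ConcurrentHashMap<>();

    // A component (e.g. the Runtime Manager) subscribes to a topic it cares about.
    void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new CopyOnWriteArrayList<>()).add(handler);
    }

    // A producer publishes without knowing who, if anyone, consumes.
    void publish(String topic, String message) {
        subscribers.getOrDefault(topic, List.of()).forEach(h -> h.accept(message));
    }
}
```

A real messaging service would additionally persist messages and let consumers poll from an offset, so a component that is down does not lose events.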

CDAP Extended System

 

Metadata Management

This contains the various services that manage CDAP metadata for artifacts, applications, program runs and datasets. There are no big architectural changes to those services, except for the removal of the Dataset Service, which we will cover in a later section.

...

For each role of the Dataset Service, here is what we learned:
Dataset Modules and Types Management
  • Very complicated, hard to understand dependency management
  • Never actually needed as there is no real use case for such complexity
  • Unresolved classloading problem when it involves multiple jars
  • Does not work well with dataset type defined inside plugin artifacts
Dataset Instance Management
  • Useful for providing a default set of dataset properties, but could be replaced / superseded with preferences and runtime arguments
  • Dataset type association is useful only for core datasets, but becomes too restrictive when a dataset instance is not tied to dataset type code.
    • For example, in a data pipeline it is quite common to have a "dataset" defined as a source / sink, with the platform tracking lineage information for it,
      but there is no dataset type code since the data logic code is part of the plugin (e.g. an InputFormat implementation).
    • Currently we have ExternalDataset as a workaround. It defines a dataset implementation without data logic, merely to satisfy the Dataset Service.
Dataset Admin Operations
  • Was designed to handle impersonation before the impersonation requirements existed; it turned out to be an over-engineered design.
  • With CDAP program impersonation support, those admin operations can be done directly from the user program container if necessary.
Explore Service Integration
  • Hive-specific queries are coded inside the Dataset Service, rather than exposed to dataset implementations to let each dataset make its own decisions.
  • The integration is only with Hive (which is more an Explore problem than a Dataset Service problem), making it not useful for integrating with other types of query engines (e.g. BigQuery).

Based on what we've learned, the Dataset Service yields more problems and restrictions than the benefits it brings. Also, the current implementation requires the runtime to talk to the Dataset Service directly for all dataset-related operations, which violates the "runtime is autonomous" requirement described above. Therefore, removing the Dataset Service is desirable for the future of CDAP.

...


Given how deeply we rely on Datasets and the Dataset Service, the removal of the Dataset Service should be done in phases, together with changes to the public API and REST API, as well as to dataset support.

Missing Features after Dataset Service Removal

The removal of the Dataset Service will leave certain CDAP features missing; they will instead have to be done through the application. Here is the list:

  • Creation of a dataset instance through the REST API with a known dataset type
    • This is a seldom-used, if not unused, feature, with little impact if removed
  • Deletion of a dataset instance through the REST API
    • Again, this is rarely used. Also, without dataset instance management, it is not needed
  • Explore (Hive) integration
  • Dataset properties association
    • Can be managed via metadata manager

Dataset Changes

With the removal of the Dataset Service, we need to define what a Dataset is, as well as the changes to the CDAP API and the impact on applications.

...

  1. Gather CDAP runtime jars, program artifacts, program runtime data, configurations, etc.
  2. Launch the CDAP runtime on the target runtime cluster with the information gathered in step 1
    • This is a cluster-specific step, but it generally involves localizing files and launching the CDAP runtime JVM on the cluster. Here are some possibilities:
      • ssh + scp the files and run the launch command
      • Talk to the YARN RM directly via the RM client protocol for localization as well as launching the YARN application
      • Build and publish a Docker image (localization) and interact with the Kubernetes master to launch the CDAP runtime in a container
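The first option above (ssh + scp) can be sketched as simple command construction. This is only an illustration of the two launch sub-steps; the host name, paths, and the `io.cdap.runtime.Launcher` main class are made-up placeholders, and a real provisioner would also handle SSH keys, retries, and cleanup.

```java
import java.util.List;

/**
 * Sketch of the "ssh + scp" launch option: localize the gathered files
 * to the target host, then start the CDAP runtime JVM there. All names
 * and paths below are hypothetical.
 */
class SshLaunchSketch {
    /** Step 2a: copy the gathered runtime jars / configs to the target host. */
    static List<String> scpCommand(String host, String localBundle, String remoteDir) {
        return List.of("scp", localBundle, host + ":" + remoteDir);
    }

    /** Step 2b: launch the CDAP runtime JVM on the target host over ssh. */
    static List<String> sshLaunchCommand(String host, String remoteDir) {
        // "io.cdap.runtime.Launcher" is a placeholder main class, not a real one.
        return List.of("ssh", host,
                "java", "-cp", remoteDir + "/*", "io.cdap.runtime.Launcher");
    }
}
```

Either command list could then be handed to `ProcessBuilder` for execution; the YARN and Kubernetes options replace these two steps with their respective localization and launch APIs.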

Runtime Monitor

The Runtime Monitor collects information emitted by the running program, including

  • Program lifecycle states
    • Starting, Running, Suspended, Completed, Killed, Failed
  • Metadata
    • Lineage, metadata, workflow tokens, etc.
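The program lifecycle states that the Runtime Monitor collects can be modeled as a small state machine. The state names come from the list above; the transition rules here are an illustrative assumption, not CDAP's exact semantics.

```java
import java.util.EnumSet;
import java.util.Set;

/**
 * Sketch of the program lifecycle states reported to the Runtime Monitor.
 * Transition rules are assumed for illustration: e.g. only a RUNNING
 * program can complete, and COMPLETED / KILLED / FAILED are terminal.
 */
enum ProgramState {
    STARTING, RUNNING, SUSPENDED, COMPLETED, KILLED, FAILED;

    /** Legal next states from this state (terminal states allow none). */
    Set<ProgramState> nextStates() {
        switch (this) {
            case STARTING:  return EnumSet.of(RUNNING, FAILED, KILLED);
            case RUNNING:   return EnumSet.of(SUSPENDED, COMPLETED, KILLED, FAILED);
            case SUSPENDED: return EnumSet.of(RUNNING, KILLED);
            default:        return EnumSet.noneOf(ProgramState.class); // terminal
        }
    }

    boolean isTerminal() { return nextStates().isEmpty(); }
}
```

Since the master polls rather than receiving pushes, a state machine like this also lets the monitor reject out-of-order or impossible state reports from a stale poll.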

 

Program Runtime

The Program Runtime is the CDAP component running inside the target execution (cloud) environment. It is responsible for

...