...

  • Flexibility to launch a runtime in any target cluster, whether in the cloud or on premises, and whether it is the same cluster that CDAP runs in or a different one.
    • A CDAP instance can use different clouds / clusters simultaneously for different program executions
  • Flexibility to integrate with different runtime frameworks (MR, Spark, Beam/DataFlow, whatever comes next...).
  • Runtime is autonomous.
    • Once launched, it manages every runtime aspect of a program execution.
      • Logs, metrics, metadata collection are done within that runtime
    • Runtime shouldn't initiate any communication to CDAP master
      • There are practical concerns about scalability and availability
        • Scalability
          • The scaling of the CDAP master should be based only on the number of concurrent program executions, not on individual program logic
        • Availability
          • Although the CDAP master should support HA, it shouldn't be in the critical path of program execution. That is, if the CDAP master is down, programs that are already running should keep running without blocking or failing
    • The runtime provides a REST endpoint for the CDAP master to periodically poll the runtime for information updates
      • Program states, metadata, workflow tokens, etc
    • The runtime provides a REST endpoint for the CDAP master to control the runtime
      • Suspend / Terminate
  • The real-time log and metrics collection mechanism is pluggable (via common services as in current CDAP, per program run, relying on the cloud provider, ...)
    • This also means it can be disabled (no-op) based on launch-time configuration.
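
The pull model and the pluggable log collection described above can be sketched roughly as follows. This is only a minimal in-process illustration, not CDAP code: every type name, the "log.collector" configuration key, and the state strings are assumptions made for the example, and plain method calls stand in for the REST endpoints. The point is the direction of communication: the runtime holds no reference to the master, and the master-side monitor initiates every poll and control request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// All names below are hypothetical and for illustration only; none are CDAP APIs.
final class PullModelSketch {

  /** Operations the master may invoke on a runtime (REST endpoints in the design). */
  interface RuntimeEndpoint {
    String getProgramState();  // polled periodically by the master
    void terminate();          // control request initiated by the master
  }

  /** Pluggable sink for real-time logs, selected at launch time. */
  interface LogCollector {
    void collect(String line);
  }

  /** Used when log collection is disabled: every call is a no-op. */
  static final class NoOpLogCollector implements LogCollector {
    @Override
    public void collect(String line) { }
  }

  /** Buffers log lines in memory, standing in for a real log service. */
  static final class InMemoryLogCollector implements LogCollector {
    final List<String> lines = new ArrayList<>();

    @Override
    public void collect(String line) {
      lines.add(line);
    }
  }

  /** Owns its own state and logs; holds no reference to the master. */
  static final class AutonomousRuntime implements RuntimeEndpoint {
    private final LogCollector logCollector;
    private volatile String state = "RUNNING";

    AutonomousRuntime(Map<String, String> launchConf) {
      // "log.collector" is an assumed configuration key, not a real CDAP property.
      this.logCollector = "none".equals(launchConf.get("log.collector"))
          ? new NoOpLogCollector()
          : new InMemoryLogCollector();
      logCollector.collect("program started");
    }

    LogCollector logCollector() {
      return logCollector;
    }

    @Override
    public String getProgramState() {
      return state;
    }

    @Override
    public void terminate() {
      logCollector.collect("termination requested");
      state = "KILLED";
    }
  }

  /** Master-side monitor: all communication is initiated from here. */
  static final class RuntimeMonitor {
    private final RuntimeEndpoint endpoint;
    private String lastKnownState = "UNKNOWN";

    RuntimeMonitor(RuntimeEndpoint endpoint) {
      this.endpoint = endpoint;
    }

    /** One polling round; a real monitor would run this on a schedule. */
    String poll() {
      lastKnownState = endpoint.getProgramState();
      return lastKnownState;
    }

    void requestTermination() {
      endpoint.terminate();
    }
  }

  public static void main(String[] args) {
    AutonomousRuntime runtime = new AutonomousRuntime(Map.of("log.collector", "none"));
    RuntimeMonitor monitor = new RuntimeMonitor(runtime);
    System.out.println(monitor.poll());
    monitor.requestTermination();
    System.out.println(monitor.poll());
  }
}
```

Because the runtime never calls back, a monitor that stops polling (for example, while the master is down) does not affect the running program; it simply picks up the latest state on its next poll. Choosing the collector from launch configuration keeps the disabled case a plain no-op implementation rather than a special code path.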

Design

The current CDAP functionality can be roughly categorized as follows:

  • Application API
    • For user application to provide specifications at deployment time
    • For user program to access / interact with CDAP system at runtime
      • Runtime information, Dataset, Transaction, Metrics, KMS, etc
  • Metadata Management
    • For writing / reading entity metadata
      • Artifacts, namespaces, applications, plugins, datasets, 
      • Currently scattered mainly across the app-fabric and data-fabric modules

Generally speaking, the CDAP master has two main roles: Metadata Management and Runtime Management. Other services are also provided as part of the master, such as transaction, stream, logs, metrics and explore. We'll cover those services in a later section.

Metadata Management

This role comprises the various services that manage CDAP metadata for artifacts, applications, program runs and datasets. There are no major architectural changes to those services, except for the removal of the dataset service, which we will cover in a later section.

...