...
- Flexible in launching the runtime in any target cluster, whether cloud or on-premises, and whether it is the same cluster that CDAP runs in or a different one.
- A CDAP instance can use different clouds / clusters simultaneously for different program executions.
- Flexible in integrating with different runtime frameworks (MR, Spark, Beam/DataFlow, whatever comes next...).
- Runtime is autonomous.
- Once launched, it manages every runtime aspect of a program execution.
- Log, metrics, and metadata collection is done within that runtime
- The runtime shouldn't initiate any communication with the CDAP master
- There are practical concerns about scalability and availability
- Scalability
- The scaling of the CDAP master should be based only on the number of concurrent program executions, rather than on individual program logic
- Availability
- Although the CDAP master should support HA, it shouldn't be in the critical path of program execution. That is, if the CDAP master is down, programs that are already running should keep running without being blocked or failing
- The runtime provides a REST endpoint for the CDAP master to periodically poll the runtime for information updates
- Program states, metadata, workflow tokens, etc.
- The runtime provides a REST endpoint for the CDAP master to control the runtime
- Suspend / Termination
- The real-time log and metrics collection mechanism is pluggable (for example, via common services as in current CDAP, per program run, by relying on the cloud provider, ...)
- This also means it can be disabled (no-op) based on launch-time configuration.
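The pull model described in the list above (the runtime exposes endpoints, the CDAP master polls and controls, and the runtime never calls back) can be illustrated with a minimal in-process sketch. It is not the CDAP implementation: the class names, the `get_status` and `control` methods standing in for REST endpoints, and the state names are all hypothetical, and no real HTTP is involved.

```python
from dataclasses import dataclass, field
from typing import Dict


class RuntimeUnreachable(Exception):
    """Raised when the master cannot reach a runtime's endpoint."""


@dataclass
class Runtime:
    """Stands in for a launched program runtime and its REST endpoints."""
    run_id: str
    state: str = "RUNNING"      # hypothetical state names
    reachable: bool = True

    def get_status(self) -> Dict[str, str]:
        # Hypothetical equivalent of a GET on a status endpoint.
        if not self.reachable:
            raise RuntimeUnreachable(self.run_id)
        return {"runId": self.run_id, "state": self.state}

    def control(self, action: str) -> None:
        # Hypothetical equivalent of a POST on a control endpoint.
        if not self.reachable:
            raise RuntimeUnreachable(self.run_id)
        if action == "suspend":
            self.state = "SUSPENDED"
        elif action == "terminate":
            self.state = "KILLED"
        else:
            raise ValueError(f"unknown action: {action}")


@dataclass
class Master:
    """Polls runtimes; the runtime never calls back into the master."""
    runtimes: Dict[str, Runtime] = field(default_factory=dict)
    last_known: Dict[str, str] = field(default_factory=dict)

    def poll_once(self) -> None:
        for run_id, runtime in self.runtimes.items():
            try:
                self.last_known[run_id] = runtime.get_status()["state"]
            except RuntimeUnreachable:
                # Keep the last known state and retry on the next poll.
                continue
```

Because all communication is initiated by the master, a master outage in this sketch only delays status updates; the `Runtime` object keeps its own state and is unaffected, which mirrors the availability goal stated above.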
Design
...
CDAP Architecture
The guiding principles of the architecture are as follows:
- A stable and scalable core platform that provides essential functionalities
- An application API for Data Application development
- Manages and operates Data Applications
- A central catalog for entities' metadata
- Provide a well-defined API to support additional capabilities
- Enable fast iteration of new ideas
- Allow running data / compute intensive jobs for the system (e.g. Spark, Elasticsearch, etc.)
- Individual extended systems can be turned on / off independently
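As a rough illustration of the last principle, the sketch below shows one way a core could keep track of extended systems that are switched on and off independently. The `ExtensionRegistry` class and the extension names used with it are hypothetical and are not taken from CDAP.

```python
from typing import Callable, Dict, List


class ExtensionRegistry:
    """Tracks extended systems and which of them are currently enabled."""

    def __init__(self) -> None:
        self._start_hooks: Dict[str, Callable[[], None]] = {}
        self._enabled: Dict[str, bool] = {}

    def register(self, name: str, start: Callable[[], None]) -> None:
        # Extensions start out disabled; the core does not depend on them.
        self._start_hooks[name] = start
        self._enabled[name] = False

    def enable(self, name: str) -> None:
        # Run the start hook only on a disabled -> enabled transition.
        if not self._enabled[name]:
            self._start_hooks[name]()
            self._enabled[name] = True

    def disable(self, name: str) -> None:
        self._enabled[name] = False

    def enabled(self) -> List[str]:
        # Names of all enabled extensions, in sorted order.
        return sorted(n for n, on in self._enabled.items() if on)
```

Enabling or disabling one entry never touches another, which is the property the principle asks for.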
CDAP Core System
The major components of the CDAP core are as follows:
- Metadata Catalog
- Responsible for collecting, storing, serving and exporting metadata as defined by users and applications
- Artifacts and Application Repository
- Responsible for artifact and application deployment, update (versioning), and removal, and for surfacing that information
- Runtime Manager
- Responsible for all aspects of the program lifecycle, including resource provisioning, execution, run records, etc.
- Scheduler
- Responsible for launching programs based on triggers and constraints
- Transaction
- Uses Apache Tephra to provide transactional operations on HBase
- #The Transaction Service may not be available to the application runtime, depending on the target execution cluster, and we may eventually remove it
- Metrics and Monitoring
- Responsible for collecting and querying metrics
- *May integrate with external metrics systems in the future
- TMS
- Provides a messaging service for event-based design to decouple components
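To make the decoupling role of the messaging component concrete, here is a minimal in-memory sketch in which one component publishes program status events and a scheduler reacts to them without either calling the other directly. The topic name, the event fields, the `COMPLETED` state, and the `MessagingService` and `Scheduler` classes are assumptions made for illustration; they are not the TMS API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class MessagingService:
    """In-memory stand-in for a topic-based messaging service."""

    def __init__(self) -> None:
        self._topics: Dict[str, List[dict]] = defaultdict(list)

    def publish(self, topic: str, payload: dict) -> int:
        # Append the message and return its offset within the topic.
        messages = self._topics[topic]
        messages.append(payload)
        return len(messages) - 1

    def fetch(self, topic: str, after: int = -1) -> List[Tuple[int, dict]]:
        # Return (offset, payload) pairs published after the given offset.
        messages = self._topics[topic]
        return [(i, messages[i]) for i in range(after + 1, len(messages))]


class Scheduler:
    """Launches a program when a run of its upstream program completes."""

    def __init__(self, messaging: MessagingService, triggers: Dict[str, str]) -> None:
        self.messaging = messaging
        self.triggers = triggers          # upstream program -> downstream program
        self.offset = -1                  # last offset already processed
        self.launched: List[str] = []

    def process_events(self) -> None:
        for offset, event in self.messaging.fetch("program.status", self.offset):
            self.offset = offset
            downstream = self.triggers.get(event["program"])
            if downstream and event["state"] == "COMPLETED":
                self.launched.append(downstream)
```

The scheduler remembers the last offset it consumed, so calling `process_events` again does not launch the same program twice for the same event.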
CDAP Extended System
Metadata Management
This contains the various services that manage CDAP metadata for artifacts, applications, program runs, and datasets. There are no big architectural changes to those services, except for the removal of the dataset service, which we will cover in a later section.
...