Goals
- CDAP and CDAP Applications can withstand short, transient infrastructure outages
- While one or more underlying services are interrupted, CDAP and CDAP Applications can operate with degraded performance or limited functionality
- Users will not be able to deploy apps, start programs, or perform other new data or application lifecycle operations
- However, all applications that are already running should keep running
- Once the interruption in the underlying service is resolved, or the services return to normal operation, CDAP and CDAP Applications will return to their normal state
- Interruptions in service are expected to come from node failures, service failures, or compatible rolling upgrades or downgrades in progress
- Does not include incompatible upgrades or downgrades of the underlying infrastructure
- Does not include prolonged unavailability of services or infrastructure
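The degraded-mode goals above can be illustrated with a minimal Java sketch. It is not CDAP code: `DegradedModeGate`, `Dependency` and `Mode` are hypothetical names, and the sketch assumes that an outage of any tracked component blocks all new lifecycle operations, which the goals do not specify per component. Programs that are already running are simply never touched by the gate.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of a degraded-mode gate; none of these names are CDAP classes. */
final class DegradedModeGate {

  /** Infrastructure components whose availability is tracked. */
  enum Dependency { HDFS, HBASE, HIVE, KAFKA, YARN, ZOOKEEPER, KMS }

  enum Mode { NORMAL, DEGRADED }

  // Components currently reported as unavailable (for example by a health checker).
  private final Set<Dependency> unavailable = EnumSet.noneOf(Dependency.class);

  synchronized void markUnavailable(Dependency dependency) {
    unavailable.add(dependency);
  }

  synchronized void markAvailable(Dependency dependency) {
    unavailable.remove(dependency);
  }

  /** NORMAL when every tracked component is available, DEGRADED otherwise. */
  synchronized Mode mode() {
    return unavailable.isEmpty() ? Mode.NORMAL : Mode.DEGRADED;
  }

  /**
   * Assumption: new lifecycle operations (deploy, start, create) are accepted
   * only in NORMAL mode. Running programs are not affected by this check.
   */
  synchronized boolean allowsLifecycleOperation() {
    return unavailable.isEmpty();
  }
}
```

A request handler could consult `allowsLifecycleOperation()` before accepting a deploy or start request and reject it while the gate reports `DEGRADED`; once `markAvailable` has been called for every affected component, the gate returns to `NORMAL` on its own.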
Open Item/Discussion point
- Define long and short/transient outages
Infrastructure components used by Cask Data Application Platform (CDAP)
The following underlying infrastructure components are used by CDAP and/or by CDAP Applications running in CDAP. They are listed in no particular order.
- HDFS
- HBase
- Hive
- Kafka
- YARN
- Zookeeper
- KMS
Functional use of infrastructure components
This section describes what each underlying component is used for.
HDFS
- CDAP Stream
- Apache Tephra WAL
- Deployed Application Artifact and Dataset Artifact
- Aggregated Logs
- CDAP Fileset Dataset
- YARN distributed cache
- Coprocessor jars
HBase
- CDAP System data/metadata (e.g. Preferences, Application, Namespace, Artifact…)
- Metrics Cube
- Lineage
- Workflow Statistics
- Run Record and Statistics
- Checkpoint information
- CDAP Table Dataset
Kafka
- Logs
- Metrics
- Audit Logs (Will be moved to HBase in 4.0)
- Metadata updates (Will be moved to HBase in 4.0)
- Notifications (Will be moved to HBase in 4.x)
YARN
- System Services
- User applications
Zookeeper
- Routing Tables
- Coordination
- Secret keys
- Auth keys
Hive
- Dataset integration
- Schema
- Properties
- Serde
KMS
- User Secrets (e.g. passwords, access tokens, etc.)
Failure Scenarios
- HDFS
- Upgrade
- Downgrade
- Restart
- Data Node Outage
- HBase
- Upgrade
- Downgrade
- Restart
- Region Server Outage
- Zookeeper
- Upgrade
- Downgrade
- Network Partition
- YARN
- Upgrade
- Downgrade
- Node Manager Outage
- Resource Manager (RM) Outage
- Kafka
- Upgrade
- Downgrade
- Disk Outage
- KMS
- Upgrade
- Downgrade
- Outage
Initiatives In Progress
- [3.6] CDAP Service version and upgrade support
- [3.6] Application versioning
- [4.0] Messaging Service with goal of centralizing all transactional activities for metadata in HBase
- [4.0?] Non-Transactional datasets
- [4.0?] HBase Coprocessor Upgrade Management: handling minor version changes efficiently without disabling HBase tables
- [4.0?] Upgrade tool improvements: Coprocessor Upgrade removal, faster data conversions where needed, and logic to reduce the impact on running services
- [4.0?] CDAP Service upgrade capability; this might require a change in Apache Twill
- [4.0?] Move configuration and operational updates to messaging services
Initiatives In Plan
- Clients use retry and back-off mechanisms so that they can operate in degraded mode
- YARN application resilience through Apache Twill
- Move the Dataset Service, which currently runs in the Master, into a YARN application
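The client-side retry and back-off item above can be sketched as follows. This is a minimal illustration and not the planned CDAP implementation: `RetryWithBackoff`, `delayMillis` and `call` are hypothetical names, and the sketch assumes exponential back-off with a cap and random jitter. For brevity it retries on every exception; a real client would likely retry only on errors known to be transient.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical retry helper; not part of the CDAP API. */
final class RetryWithBackoff {

  private RetryWithBackoff() { }

  /**
   * Upper bound of the wait before retry number {@code attempt} (0-based):
   * baseDelayMs doubled {@code attempt} times, capped at maxDelayMs.
   */
  static long delayMillis(int attempt, long baseDelayMs, long maxDelayMs) {
    long delay = baseDelayMs;
    for (int i = 0; i < attempt; i++) {
      if (delay > maxDelayMs / 2) {
        return maxDelayMs; // doubling again would exceed the cap
      }
      delay *= 2;
    }
    return Math.min(delay, maxDelayMs);
  }

  /**
   * Runs the operation up to maxAttempts times, sleeping a random time
   * between 0 and the capped exponential delay after each failure.
   * Rethrows the last failure once all attempts are used up.
   */
  static <T> T call(Callable<T> operation, int maxAttempts,
                    long baseDelayMs, long maxDelayMs) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    Exception last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return operation.call();
      } catch (InterruptedException e) {
        // Never retry an interrupt; restore the flag and propagate.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        last = e;
        if (attempt + 1 < maxAttempts) {
          long cap = delayMillis(attempt, baseDelayMs, maxDelayMs);
          // Jitter spreads out retries from many clients hitting a recovering service.
          Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
        }
      }
    }
    throw last;
  }
}
```

A client could wrap a remote call as `RetryWithBackoff.call(() -> fetchStatus(), 5, 100, 5000)`, where `fetchStatus` stands in for any hypothetical operation that may fail during a transient outage. The cap keeps the wait bounded during a longer outage, so the client recovers promptly once the service is back.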