Goals
- CDAP and CDAP Applications have the ability to withstand short and transient infrastructural outages
- During interruption of underlying services (one or more), CDAP or CDAP Applications can operate under degraded performance/limited functionalities
- Users will not be able to perform operations such as deploying apps, starting programs, or other new data or application lifecycle operations.
- However, all applications that are already running should keep running.
- Once the interruption in the underlying service is resolved, or the services return to normal operation, CDAP and CDAP Applications go back to their normal state
- Interruptions in service would be due to node failure, service failures or compatible rolling upgrades or downgrades in progress
- Does not include incompatible upgrades or downgrades of underlying infrastructure
- Does not include long unavailability of service and infrastructure
Open Item/Discussion point
- Define long and short/transient outages
- More information needs to be gathered here to understand the length of outages.
- When outages last multiple hours, how should the system handle them?
Action Items - Oct 7th 2016
- Send the HBase versions supported by CDAP
- Gather information about CDH version compatibility changes – Talk to Cloudera and compile
Failure Scenarios
- HDFS
- Upgrade
- Downgrade
- Restart
- Data Node Outage
- HBase
- Upgrade
- Downgrade
- Restart
- Region Server Outage
- Zookeeper
- Upgrade
- Downgrade
- Network Partition
- YARN
- Upgrade
- Downgrade
- Node Manager Outage
- RM Outage
- Kafka
- Upgrade
- Downgrade
- Disk Outage
- KMS
- Upgrade
- Downgrade
- Outage
Initiatives In Progress
- [3.6] CDAP Service version and upgrade support
- [3.6] Application versioning
- [4.0] Messaging Service with goal of centralizing all transactional activities for metadata in HBase
- [4.0?] Non-Transactional datasets
- [4.0?] HBase Coprocessor Upgrade Management: handling minor version changes efficiently without disabling HBase Tables.
- [4.0?] Upgrade tool improvements: Coprocessor Upgrade removal, faster data conversions if needed, and logic to reduce the impact on running services
- [4.0?] CDAP Service Upgrade capability; may require a change in Apache Twill
- [4.0?] Move configuration and operational updates to messaging services
Initiatives In Plan
- Clients have retry and back-off mechanism to operate in degraded mode
- YARN application resilience through Apache Twill
- Move the Dataset Service, which currently runs in the Master, to run as a YARN Application
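
The retry and back-off item above could look roughly like the sketch below. It is only an illustration of capped exponential back-off with jitter, not CDAP's client code: `ServiceUnavailable`, `call_with_backoff`, and all parameter names and defaults are hypothetical, chosen for the example.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Hypothetical error meaning an underlying service is unreachable."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on ServiceUnavailable.

    The wait before retry k is a random fraction (jitter) of
    min(max_delay, base_delay * 2**k). The last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                # Out of attempts: the outage is not short or transient.
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * rng())
```

The `sleep` and `rng` parameters exist only so the behavior can be exercised without real waiting. The values of `max_attempts` and `max_delay` would depend on how the open item on short versus long outages is resolved.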