CDAP and CDAP Applications can withstand short, transient infrastructure outages.
During an interruption of one or more underlying services, CDAP and CDAP Applications can operate with degraded performance or limited functionality.
Users will not be able to perform operations such as deploying applications, starting programs, or performing new data or application lifecycle operations.
However, all applications that are already running should continue to run.
Once the interruption in the underlying service is resolved and services return to normal operation, CDAP and CDAP Applications return to their normal state.
Interruptions in service may be due to node failures, service failures, or compatible rolling upgrades or downgrades in progress.
This does not include incompatible upgrades or downgrades of the underlying infrastructure.
This does not include long unavailability of services or infrastructure.
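The degraded-mode behavior described above can be pictured as a simple gate on incoming operations. The following is a minimal sketch, not CDAP code: the SystemState enum, the operation names, and is_operation_allowed are hypothetical names introduced only for illustration.

```python
from enum import Enum


class SystemState(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"


# Hypothetical operation names. Lifecycle operations are the ones refused
# while an underlying service is interrupted.
LIFECYCLE_OPERATIONS = {"deploy_app", "start_program", "create_dataset"}


def is_operation_allowed(state: SystemState, operation: str) -> bool:
    """Return True if the operation may proceed in the given system state."""
    if state is SystemState.NORMAL:
        return True
    # Degraded mode: refuse new lifecycle operations and allow everything
    # else, for example status queries on programs that are already running.
    return operation not in LIFECYCLE_OPERATIONS
```

In this sketch, programs that are already running are never touched by the gate; only new lifecycle requests are rejected until the state returns to NORMAL.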
Area of Focus
CDAP system resiliency to infrastructure unavailability or interruption for long periods of time
CDAP Rolling Upgrades
CDAP Application Rolling Upgrades
High Level Requirements
Compatible Upgrade or Downgrade of underlying infrastructure Hadoop components
While the underlying Hadoop infrastructure is being upgraded or downgraded, CDAP and CDAP Applications are expected to tolerate, and be resilient to, infrastructure services being unavailable.
The upgrade or downgrade process could take anywhere from 30 minutes to 18 hours or more.
During the period of service unavailability or interruption, CDAP and CDAP Applications operate in degraded mode.
The Hadoop infrastructure upgrade or downgrade has to be compatible with CDAP and CDAP Applications in order for the upgrade to be smooth.
If there are issues during the upgrade, CDAP should be resilient to rollbacks.
CDAP and CDAP Applications should also be able to withstand compatible downgrades
The compatibility matrix should be available to users to ensure smooth upgrades
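One way to read the compatibility requirement above is as a lookup against a published matrix before an infrastructure upgrade or downgrade starts. The sketch below is an assumption-laden illustration: the matrix contents are placeholder labels, not real supported versions, and is_compatible and can_roll are hypothetical helper names.

```python
# Placeholder data only: these labels are illustrative and do not state
# which HBase versions any CDAP release actually supports.
COMPATIBILITY_MATRIX = {
    "cdap-A": {"hbase-1", "hbase-2"},
    "cdap-B": {"hbase-2", "hbase-3"},
}


def is_compatible(cdap_version: str, hbase_version: str) -> bool:
    """Return True if the matrix lists hbase_version for cdap_version."""
    return hbase_version in COMPATIBILITY_MATRIX.get(cdap_version, set())


def can_roll(cdap_version: str, current_hbase: str, target_hbase: str) -> bool:
    """Check a rolling upgrade or downgrade between two HBase versions.

    Both ends must be supported: the two versions coexist while the roll is
    in progress, and a rollback returns the cluster to the current version.
    """
    return (is_compatible(cdap_version, current_hbase)
            and is_compatible(cdap_version, target_hbase))
```

The same check covers downgrades, since can_roll treats the current and target versions symmetrically.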
Upgrade / Downgrade of CDAP
Upgrade a CDAP version. Major and minor version upgrades could have different impacts; these impacts are discussed further in the document.
Roll back of CDAP upgrade or downgrade
CDAP version compatibility matrix available to users
Dry run for upgrade and downgrade
Upgrade / Downgrade of CDAP Applications
Upgrade or Downgrade a CDAP Application
Rolling upgrade of live services like CDAP Services, Flow and Spark Streaming
Technical Breakdown
RS-001 : Uninterrupted update of compat modules in coprocessors
CDAP uses a few HBase coprocessors to optimize the operations it performs on HBase. When the underlying HBase is upgraded, the tables have to be altered, which means they have to be disabled first. Disabling a table can have multiple side effects on CDAP, so the currently recommended approach is to stop CDAP as well as the applications running within it. For each non-compatible version of HBase, CDAP has a compat module that has to be updated.
RS-002 : Client Resiliency
RS-003 : Move Dataset management out of CDAP Master
RS-004 : CDAP Version definition and guarantees of version
RS-005 : Rolling upgrade definition
RS-006 : Internal Schema Evolution and Management
RS-007 : Managing Infrastructure Incompatibility
RS-008 : System state transition and management
RS-009 : Apache Twill Application rolling upgrade
RS-010 : Upgrade Orchestrator
RS-011 : Progressive background upgrade tool
RS-012 : Hydrator Pipeline Upgrade
RS-013 : Dataset Upgrade
RS-014 : Test Framework and Chaos Monkey
RS-015 : User Interface / REST APIs / CLI
Open Item/Discussion point
Define long and short/transient outages
More information needs to be gathered here to understand the length of outages.
When outages last multiple hours, how should the system handle them?
Action Items - Oct 7th 2016
Send the list of HBase versions supported by CDAP
Gather information about CDH version compatibility changes – Talk to Cloudera and compile
Initiatives In Progress
[3.6] CDAP Service version and upgrade support
[3.6] Application versioning
[4.0] Messaging Service with goal of centralizing all transactional activities for metadata in HBase
[4.0?] Non-Transactional datasets
[4.0?] HBase Coprocessor Upgrade Management — Handling minor version changes efficiently without disabling HBase Tables.
[4.0?] Upgrade tool improvements — Coprocessor Upgrade removal, faster data conversions if needed, smarts to reduce the impact to running services
[4.0?] CDAP Service Upgrade capability, might have Apache Twill change
[4.0?] Move configuration and operational updates to messaging services
Initiatives In Plan
Clients have a retry and back-off mechanism to operate in degraded mode
YARN application resilience through Apache Twill
Move the Dataset Service, which currently runs in the Master, to run as a YARN application
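The client retry and back-off item above could take a shape like the following. This is a minimal sketch of exponential back-off with jitter, not the planned CDAP client implementation: call_with_backoff, its parameters, and the choice of ConnectionError as the retriable failure are all assumptions made for illustration.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep, jitter=random.random):
    """Call operation() until it succeeds or the attempts run out.

    Only ConnectionError is treated as retriable here (an assumption); any
    other exception propagates immediately.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Double the delay on each failed attempt, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Scale by a factor in [0.5, 1.0) so that many clients do not
            # all retry at the same instant once the service returns.
            sleep(delay * (0.5 + 0.5 * jitter()))
```

The sleep and jitter arguments are injectable so the back-off schedule can be checked without real waiting. For outages lasting hours, a caller would likely need a larger max_attempts or an outer loop, which this sketch leaves open.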