Area of Focus
CDAP system resiliency to infrastructure unavailability or service interruptions
CDAP Application resiliency
CDAP Rolling Upgrades
CDAP Application Rolling Upgrades
Underlying infrastructure rolling upgrades
High Level Requirements
Technical Breakdown
RS-001 : Minimize interruption caused by update of coprocessor
The CDAP system uses a few HBase coprocessors to optimize the operations performed on HBase. When the underlying HBase is upgraded, the coprocessors may also need to be upgraded because of coprocessor API changes. Upgrading a coprocessor requires the HBase tables to be disabled. Disabling a table can have multiple side effects on CDAP, so the currently recommended approach is to stop CDAP as well as the applications running within it. Ideally, stopping CDAP or CDAP applications shouldn’t be required. For a rolling upgrade of CDAP, disabling HBase tables shouldn’t be required.
RS-002 : Client resiliency
CDAP as a system, or CDAP Applications through CDAP APIs, connect to Kafka, HBase, HDFS, YARN and Zookeeper as well as other CDAP systems. All client APIs currently have a predefined timeout before they fail. This behavior is not suitable for handling failures in the underlying systems. In case of failures, the clients should exhibit back-off behavior wherever applicable, resulting in degraded behavior. Once the issue is resolved, the client should return to normal operation. CDAP will document the impact on, and the behavior of, each program type during an infrastructure outage.
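The back-off behavior described in RS-002 could be sketched as follows. This is a minimal illustration, not CDAP's client code: the function name `call_with_backoff`, its parameters, and the choice of `ConnectionError` as the retriable failure are all assumptions made for the example.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with exponential back-off.

    All names and defaults here are hypothetical; `sleep` is injectable so the
    retry schedule can be observed without actually waiting.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except ConnectionError:
            attempt += 1
            if attempt >= max_attempts:
                # Give up and surface the failure to the caller.
                raise
            # Double the delay after each failure, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            # Jitter spreads out retries from many clients hitting one service.
            sleep(delay * random.uniform(0.5, 1.0))
```

Once the underlying service recovers, the next attempt succeeds and the caller continues normally, which matches the "get back to normal operation" requirement.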
RS-003 : Make CDAP system services HA
All CDAP system services (such as the Dataset service, TX manager, etc.) should support HA and have minimal failover time. Together with RS-002 client resiliency, CDAP and CDAP applications should be able to withstand any interruption of CDAP system services.
RS-004 : CDAP version definition and guarantees of version
A CDAP version has to provide strong guarantees. For example, a change in major version might not support rolling upgrade, while patch upgrades should be able to jump to any patch within the same minor version. Still to be defined: the guarantees for minor version upgrades; handling of API deprecation, Beta and GA; whether Beta API contracts affect rolling upgrade; which version component guarantees binary compatibility, source compatibility and wire-format compatibility; when a CDAP app needs to be rebuilt; and, if it has to be rebuilt, how the application should be upgraded.
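One way to make such guarantees checkable is to encode them as a function over version numbers. The sketch below assumes a three-part `major.minor.patch` version string and encodes only the two examples given above; the function names and the returned labels are hypothetical, and minor-version upgrades are deliberately reported as undecided because the requirement leaves them open.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text):
    """Parse a 'major.minor.patch' string (assumed format)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def upgrade_mode(current, target):
    """Classify an upgrade under the policy assumed for this sketch."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt.major != cur.major:
        # Assumed: a major version change may not support rolling upgrade.
        return "full-restart"
    if tgt.minor != cur.minor:
        # The requirement does not yet say what minor upgrades guarantee.
        return "undecided"
    # Any patch within the same minor version can be reached directly.
    return "rolling"
```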
RS-005 : Internal schema evolution and management
Most network endpoints are versioned, but the versioning is not complete. All internal schemas should be versioned (schema hash concept) and should support compatible schema changes.
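The schema hash concept can be illustrated with a small sketch. The schema layout (a dict with a `fields` map), the use of SHA-256 over canonical JSON, and the compatibility rule (existing fields keep their type, new fields must carry a default) are all assumptions for illustration, not CDAP's actual internal format.

```python
import hashlib
import json


def schema_hash(schema):
    """Return a stable identifier for a schema.

    Serializing with sorted keys makes the hash independent of dict ordering.
    """
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_compatible(old, new):
    """Check whether data written with `old` can be read with `new`.

    Assumed rule: every old field must still exist with the same type, and
    every added field must declare a default value.
    """
    old_fields, new_fields = old["fields"], new["fields"]
    for name, spec in old_fields.items():
        if name not in new_fields or new_fields[name]["type"] != spec["type"]:
            return False
    return all("default" in spec
               for name, spec in new_fields.items()
               if name not in old_fields)
```

A writer would store the hash alongside each record, so a reader can look up the exact schema the record was written with and decide whether it can be read directly.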
RS-006 : Managing infrastructure incompatibility
When an upgrade or downgrade of the underlying infrastructure creates an incompatibility, the CDAP system and CDAP Applications should be able to handle transient incompatibilities and service disruptions. This might be prevented through documentation and publication of a compatibility matrix, but the system should still be able to handle the impact.
RS-007 : System state transition and management
During the rolling upgrade process the system has to be transitioned from one state to another. Different sub-systems could be in different states, and those need to be managed. This applies not only to the CDAP system, but also to CDAP Applications.
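A sketch of per-sub-system state tracking is shown below. The state names, the allowed transitions and the `StateTracker` class are hypothetical; the requirement itself does not define the set of states.

```python
from enum import Enum


class UpgradeState(Enum):
    RUNNING_OLD = "running-old"
    UPGRADING = "upgrading"
    RUNNING_NEW = "running-new"
    ROLLING_BACK = "rolling-back"


# Assumed transition table: each state maps to the states it may move to.
ALLOWED = {
    UpgradeState.RUNNING_OLD: {UpgradeState.UPGRADING},
    UpgradeState.UPGRADING: {UpgradeState.RUNNING_NEW, UpgradeState.ROLLING_BACK},
    UpgradeState.ROLLING_BACK: {UpgradeState.RUNNING_OLD},
    UpgradeState.RUNNING_NEW: set(),
}


class StateTracker:
    """Tracks the upgrade state of each sub-system independently."""

    def __init__(self, subsystems):
        self.states = {name: UpgradeState.RUNNING_OLD for name in subsystems}

    def transition(self, name, target):
        current = self.states[name]
        if target not in ALLOWED[current]:
            raise ValueError(
                f"{name}: {current.value} -> {target.value} is not allowed")
        self.states[name] = target

    def is_mixed(self):
        """True while sub-systems are in different states."""
        return len(set(self.states.values())) > 1
```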
RS-008 : Apache Twill Application rolling upgrade
To support rolling upgrades of CDAP Applications, the capability needs to be added to Apache Twill Application.
RS-009 : Upgrade orchestrator
The whole upgrade process has to be coordinated across multiple sub-systems for the CDAP system and across components for CDAP Applications. The orchestrator is responsible for managing the lifecycle of a rolling upgrade and reporting its status.
A rolling upgrade at times involves transitioning data and metadata from one format to another. If this process has to be non-intrusive, it should be implemented as a progressive process.
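A progressive migration of this kind is often implemented by upgrading records lazily when they are read, plus a bounded background pass for records that are never read. The sketch below assumes an in-memory dict as the store and an invented format change (`ttl_seconds` in format 1 becomes `ttl_ms` in format 2); none of these names come from CDAP.

```python
CURRENT_FORMAT = 2  # hypothetical format number


def upgrade_record(record):
    """Return the record in the current format, converting it if needed."""
    if record.get("format", 1) == CURRENT_FORMAT:
        return record
    upgraded = {k: v for k, v in record.items() if k != "ttl_seconds"}
    upgraded["ttl_ms"] = record["ttl_seconds"] * 1000
    upgraded["format"] = CURRENT_FORMAT
    return upgraded


def read(store, key):
    """Read a record, upgrading and writing it back if it is in the old format."""
    record = store[key]
    upgraded = upgrade_record(record)
    if upgraded is not record:
        store[key] = upgraded
    return upgraded


def migrate_batch(store, limit):
    """Background pass: upgrade at most `limit` old records.

    Returns the number of records upgraded, so the caller knows the
    migration is complete when the result is 0.
    """
    migrated = 0
    for key in list(store):
        if migrated >= limit:
            break
        if store[key].get("format", 1) != CURRENT_FORMAT:
            store[key] = upgrade_record(store[key])
            migrated += 1
    return migrated
```

Because readers accept both formats during the transition, the system never has to stop while the data is converted.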
RS-011 : Hydrator pipeline upgrade
Hydrator pipelines are currently not compatible across major, minor or bug-fix releases, as they are tightly tied to the exact version. This should follow the same or similar guidelines as RS-004.
RS-012 : Dataset upgrade
In some cases there might be system datasets used by the CDAP system, or user datasets that are part of CDAP Applications, that need to be migrated during the upgrade process, so the system should support upgrading both types of datasets as part of RS-010.
RS-013 : Test framework and chaos monkey
From a platform perspective, there should exist a solid end-to-end testing framework for testing known scenarios, but a chaos monkey would provide more comprehensive testing.
RS-014 : User Interface / REST APIs / CLI
There should be the ability to initiate, manage, monitor and track the progress of rolling upgrades and downgrades. This should be accessible through the CDAP User Interface, REST API and Command Line Interface.
RS-015 : Support for rollbacks
If an upgrade or downgrade fails midway through the process, the orchestrator (RS-009) should be able to roll back and restore the system to its state before the process started.
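The rollback behavior can be sketched as an orchestrator that applies steps in order and, on failure, undoes the completed steps in reverse. The `UpgradeStep` type, the `run_upgrade` function and the status labels are hypothetical, and the sketch assumes a failed step leaves no partial changes behind.

```python
from typing import Callable, NamedTuple


class UpgradeStep(NamedTuple):
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


def run_upgrade(steps):
    """Apply steps in order; on failure undo completed steps in reverse order.

    Returns (succeeded, status), where status is a list of (step name, outcome)
    pairs that can be reported to the user.
    """
    completed = []
    status = []
    for step in steps:
        try:
            step.apply()
        except Exception:
            status.append((step.name, "failed"))
            # Assumption: the failed step itself made no partial changes,
            # so only the previously completed steps are rolled back.
            for done in reversed(completed):
                done.rollback()
                status.append((done.name, "rolled-back"))
            return False, status
        completed.append(step)
        status.append((step.name, "upgraded"))
    return True, status
```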
RS-016 : Application Versioning
Support versioning of CDAP Applications. The specification of an application is versioned, and running multiple versions of an application simultaneously is supported.
Open Item/Discussion point
Action Items