Goals

  • CDAP and CDAP Applications can withstand short, transient infrastructure outages
  • During an interruption of one or more underlying services, CDAP and CDAP Applications can operate with degraded performance or limited functionality
    • Users will not be able to perform operations such as deploying apps, starting programs, or other new data or application lifecycle operations
      • However, all applications that are already running should continue to run
  • Once the interruption in the underlying service is resolved, or services return to normal operation, CDAP and CDAP Applications return to their normal state
  • Interruptions in service may be due to node failures, service failures, or compatible rolling upgrades or downgrades in progress
  • Does not include incompatible upgrades or downgrades of underlying infrastructure
  • Does not include long unavailability of services and infrastructure

Area of Focus

  • CDAP system resiliency to infrastructure unavailability or interruption for long periods of time
  • CDAP Rolling Upgrades
  • CDAP Application Rolling Upgrades

High Level Requirements

  • Compatible Upgrade or Downgrade of underlying Hadoop infrastructure components
    • While the underlying Hadoop infrastructure is being upgraded or downgraded, CDAP and CDAP Applications are expected to tolerate, and be resilient to, infrastructure services being unavailable during the process.
    • The upgrade or downgrade process could take anywhere from 30 minutes to 18 hours or more.
    • During the period of service unavailability or interruption, CDAP and CDAP Applications operate in degraded mode.
    • The Hadoop infrastructure upgrade or downgrade has to be compatible with CDAP and CDAP Applications in order to have smooth upgrades
    • If there are issues during the upgrade, CDAP should be resilient to rollbacks
    • CDAP and CDAP Applications should also be able to withstand compatible downgrades
    • The compatibility matrix should be available to users to ensure smooth upgrades
  • Upgrade / Downgrade of CDAP
    • Upgrade a CDAP version. Major and minor version upgrades could have different impacts. These impacts are discussed further in the document.
    • Rollback of a CDAP upgrade or downgrade
    • CDAP version compatibility matrix available to users
    • Dry run for upgrade and downgrade
  • Upgrade / Downgrade of CDAP Applications
    • Upgrade or Downgrade a CDAP Application
    • Rolling upgrade of live services like CDAP Services, Flow and Spark Streaming

Technical Breakdown

RS-001 : Uninterrupted update of compat modules in coprocessor

The CDAP system uses a few HBase coprocessors to optimize the operations performed on HBase. When the underlying HBase is upgraded, the table has to be altered, which means the table has to be disabled. Disabling the table can have multiple side effects on CDAP, so the currently recommended approach is to stop CDAP as well as the applications running within it. For each non-compatible version of HBase, CDAP has a compat module that has to be updated.

RS-002 : Client resiliency

CDAP as a system, and CDAP Applications through CDAP APIs, connect directly or indirectly with Kafka, HBase, HDFS, YARN, and ZooKeeper, as well as with other CDAP systems. All client APIs currently have a predefined timeout before they fail. This behavior is not suitable for handling failures in the underlying systems. Instead, clients should exhibit back-off behavior in case of failures, resulting in degraded behavior rather than outright failure. Once the issue is resolved, the client should immediately return to normal operation.
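The back-off behavior described for clients could look roughly like the sketch below. It is only an illustration under stated assumptions: `BackoffRetry` and its methods are hypothetical names and not part of any CDAP API, every exception is treated as a transient failure, and the operation is retried indefinitely with a capped, randomized exponential delay. Each call starts again from the base delay, so a client returns to normal latency as soon as the underlying service recovers.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper for illustration only; not part of the CDAP API. */
final class BackoffRetry {
  private final long baseDelayMs;
  private final long maxDelayMs;

  BackoffRetry(long baseDelayMs, long maxDelayMs) {
    if (baseDelayMs <= 0 || maxDelayMs < baseDelayMs) {
      throw new IllegalArgumentException("need 0 < baseDelayMs <= maxDelayMs");
    }
    this.baseDelayMs = baseDelayMs;
    this.maxDelayMs = maxDelayMs;
  }

  /** Doubles the current delay ceiling, capped at maxDelayMs (overflow-safe). */
  long nextCeilingMs(long currentMs) {
    return currentMs > maxDelayMs / 2 ? maxDelayMs : currentMs * 2;
  }

  /**
   * Runs op until it succeeds. Assumption: every exception is transient.
   * A real client would rethrow non-retryable errors instead of looping.
   */
  <T> T call(Callable<T> op) throws InterruptedException {
    long ceilingMs = baseDelayMs; // reset per call: recovery is immediate
    while (true) {
      try {
        return op.call();
      } catch (InterruptedException e) {
        throw e; // honor cancellation rather than retrying
      } catch (Exception e) {
        // Sleep a random time in [1, ceilingMs] so that many clients
        // do not all retry at the same instant, then widen the window.
        Thread.sleep(ThreadLocalRandom.current().nextLong(ceilingMs) + 1);
        ceilingMs = nextCeilingMs(ceilingMs);
      }
    }
  }
}
```

A caller would wrap each remote operation, for example `new BackoffRetry(100, 30_000).call(() -> fetchSomething())`, where `fetchSomething` stands in for any client call. Capping the delay keeps the wait bounded once the service is back, even after a long outage.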

RS-003 : Move Dataset management out of CDAP Master

The Dataset Manager service currently resides within CDAP Master. All dataset initialization requires contacting the Dataset Service within CDAP Master to load dataset artifacts. If CDAP Master is unavailable, dataset initialization fails, which causes clients to fail and in turn fails the programs performing the dataset operation. Moving the Dataset Service out of CDAP Master and moving dataset libraries into the standard artifact infrastructure would reduce this dependency.

RS-004 : CDAP version definition and guarantees of version

CDAP versions would have to provide strong guarantees. For example, a change in major version might not support rolling upgrades, while patch upgrades should be able to jump to any patch within a minor version. Open questions include: how minor version upgrades are handled; how API deprecation, Beta, and GA are handled; whether Beta API contracts affect rolling upgrades; which version component guarantees binary compatibility, source compatibility, and wire-format compatibility; when a CDAP app needs to be rebuilt; and, if it has to be rebuilt, how the application should be upgraded.

RS-005 : Rolling upgrade definition

RS-006 : Internal schema evolution and management

RS-007 : Managing infrastructure incompatibility

RS-008 : System state transition and management 

RS-009 : Apache Twill Application rolling upgrade 

RS-010 : Upgrade orchestrator

RS-011 : Progressive background upgrade tool 

RS-012 : Hydrator pipeline upgrade

RS-013 : Dataset upgrade

RS-014 : Test framework and Chaos Monkey

RS-015 : User Interface / REST APIs / CLI

RS-016 : Support for rollbacks

Open Item/Discussion point

  • Define long and short/transient outages
    • More information needs to be gathered here to understand the length of outages.
    • When outages last multiple hours, how should the system handle them?

Action Items

  • Oct 7th 2016

    • (done) Send the list of HBase versions supported by CDAP
    • Gather information about CDH version compatibility changes – Talk to Cloudera and compile 

 
