Goals

  • CDAP and CDAP Applications have the ability to withstand short and transient infrastructural outages

  • During interruption of one or more underlying services, CDAP or CDAP Applications can operate with degraded performance or limited functionality

    • Users may not be able to perform admin operations such as creating, updating or deleting namespaces, adding roles, or granting privileges

    • Users may not be able to perform operations such as deploying apps, starting programs, or other new data or application lifecycle operations

    • However, all applications that are already running should continue to run

  • Once the interruption in the underlying service is resolved or the services return to normal operation, CDAP and CDAP Applications will go back to their normal state

  • Interruptions in service would be due to node failure, service failures or compatible rolling upgrades or downgrades in progress

  • Does not include incompatible upgrades or downgrades of underlying infrastructure

  • Does not include long unavailability of service and infrastructure

Area of Focus

  • CDAP system resiliency to infrastructure unavailability or service interruptions

  • CDAP Application resiliency

  • CDAP Rolling Upgrades

  • CDAP Application Rolling Upgrades

  • Underlying infrastructure rolling upgrades

High Level Requirements

  • Compatible Upgrade or Downgrade of underlying infrastructure Hadoop components

    • Underlying Hadoop infrastructure is either being upgraded or downgraded and the expectation is that CDAP and CDAP Applications should tolerate and be resilient to infrastructure services not being available during the upgrade or downgrade process.

     
    • The whole upgrade or downgrade process could take anywhere between 30 mins - 18 hours or more.

     
      • However, CDAP expects intermittent service interruptions caused by rolling restarts, not a complete shutdown of the service for the entire upgrade/downgrade process, which would cause long unavailability of the service

    • During the period of upgrade / downgrade, CDAP and CDAP Applications operate in degraded mode.

    • Hadoop infrastructure upgrade / downgrade has to be compatible with CDAP and CDAP Applications

    • In case there are issues during the upgrade, CDAP should be resilient to rollbacks

    • CDAP and CDAP Applications will continue to run and will not require a restart after the upgrade is done.

    • The compatibility matrix should be available to users to ensure smooth upgrades

      • HBase compatibility is defined at the HBase client level, not at the coprocessor level

  • Upgrade of CDAP

    • Upgrade a CDAP version. Major and minor version upgrades could have different impacts, which are discussed further in this document.

     
    • Rollback of CDAP upgrade

    • CDAP version compatibility matrix available to users

    • Rolling upgrade of CDAP

  • Upgrade of CDAP Applications

    • Rolling upgrade of live services like CDAP Services, Flow and Spark Streaming

  • Downgrade of CDAP and CDAP Applications

Technical Breakdown

RS-001 : Minimize interruption caused by update of coprocessor

The CDAP system uses a few HBase coprocessors to optimize the operations performed on HBase. When the underlying HBase is upgraded, the coprocessors may also need to be upgraded because of coprocessor API changes. Upgrading a coprocessor requires the HBase tables to be disabled. Disabling a table can have multiple side effects on CDAP, so the currently recommended approach is to stop CDAP as well as the applications running within it. Ideally, stopping CDAP or CDAP applications shouldn't be required. For each incompatible HBase version, the corresponding CDAP compat module has to be updated. For a rolling upgrade of CDAP, disabling HBase tables shouldn't be required.

RS-002 : Client resiliency

CDAP as a system, and CDAP Applications through CDAP APIs, connect to Kafka, HBase, HDFS, YARN and ZooKeeper as well as to other CDAP systems. All client APIs currently have a predefined timeout after which they fail. This behavior is not suitable for handling failures in the underlying systems. Wherever applicable, clients should instead back off on failure and operate in degraded mode. Once the issue is resolved, the clients should return to normal operation. CDAP will document the impact and behavior of each program type during an infrastructure outage.
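
The back-off behavior described for clients could look roughly like the following Python sketch. It is illustrative only: `TransientServiceError`, `call_with_backoff` and all parameter names and defaults are assumptions made for this example, not CDAP APIs, and a real client would need to decide which failures count as transient.

```python
import random
import time


class TransientServiceError(Exception):
    """Hypothetical marker for failures that are expected to clear up."""


def call_with_backoff(operation, base_delay=0.5, max_delay=30.0,
                      max_elapsed=None, sleep=time.sleep, clock=time.monotonic):
    """Retry `operation` until it succeeds, backing off between attempts.

    The delay ceiling doubles after each failure and is capped at
    `max_delay`. Each sleep is a random duration up to that ceiling, so
    many clients do not retry in lockstep. When `max_elapsed` is None the
    call retries indefinitely; otherwise the last error is re-raised once
    that much time has passed.
    """
    start = clock()
    delay = base_delay
    while True:
        try:
            return operation()
        except TransientServiceError:
            if max_elapsed is not None and clock() - start >= max_elapsed:
                raise
            # Sleep for a random duration up to the current ceiling.
            sleep(random.uniform(0, delay))
            delay = min(delay * 2, max_delay)
```

With `max_elapsed=None` a long-running program keeps waiting for the service to return rather than failing after a fixed timeout, which matches the degraded-mode goal. A bounded `max_elapsed` may be more appropriate for interactive requests.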

RS-003 : Make CDAP system services HA

All CDAP system services (such as Dataset service, TX manager, etc.) should support HA and have minimal failover time. Together with RS-002 client resiliency, CDAP and CDAP applications should be able to withstand any CDAP system services interruptions.

RS-004 : CDAP version definition and guarantees

    CDAP versions would have to provide strong guarantees. For example, a change in major version might not support rolling upgrade, while patch upgrades should be able to jump to any patch within a minor version. The guarantees for minor version upgrades also need to be defined. Open questions include: how API deprecation, Beta and GA are handled; whether Beta API contracts affect rolling upgrade; which version component guarantees binary compatibility, source compatibility and wire-format compatibility; when a CDAP app needs to be rebuilt; and, if it has to be rebuilt, how the application should be upgraded.
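
One way to make such guarantees checkable is to classify an upgrade by the most significant version component that changes and then look up a policy for that class. The Python sketch below is hypothetical: it assumes plain `major.minor.patch` version strings, and the `ROLLING_UPGRADE_POLICY` table is an assumed policy for illustration (in particular, the entry for minor upgrades is a placeholder, since this document leaves that guarantee open).

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Assumes exactly three dot-separated integers, e.g. "3.5.1".
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)


def upgrade_kind(current: Version, target: Version) -> str:
    """Name the most significant version component that changes."""
    if current.major != target.major:
        return "major"
    if current.minor != target.minor:
        return "minor"
    if current.patch != target.patch:
        return "patch"
    return "none"


# Assumed policy, for illustration only: which kinds of change may be
# applied as a rolling upgrade. The "minor" entry is a placeholder.
ROLLING_UPGRADE_POLICY = {"none": True, "patch": True, "minor": True, "major": False}


def rolling_upgrade_supported(current: str, target: str) -> bool:
    kind = upgrade_kind(Version.parse(current), Version.parse(target))
    return ROLLING_UPGRADE_POLICY[kind]
```

Keeping the policy in a table separate from the classification means the guarantees can be revised without touching the comparison logic.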

    RS-005 : Internal schema evolution and management

    Most network endpoints are versioned, but the versioning is not complete. All internal schemas should be versioned (schema hash concept) and should support compatible schema changes.
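
A minimal sketch of the schema hash idea, in Python: serialize the schema in a canonical form and hash it, so that two components can compare a short digest instead of the full schema. The schema layout (a dict with a `fields` mapping of name to type) and the compatibility rule are assumptions made for this example, not the CDAP internal format.

```python
import hashlib
import json


def schema_hash(schema: dict) -> str:
    """Return a stable hex digest for a JSON-serializable schema."""
    # Sorting keys and fixing separators makes the serialization canonical,
    # so logically equal schemas always hash to the same value.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_compatible_change(old: dict, new: dict) -> bool:
    """Assumed rule: every old field must survive with the same type.

    Adding fields is treated as compatible; removing a field or changing
    its type is not.
    """
    old_fields = old["fields"]
    new_fields = new["fields"]
    return all(new_fields.get(name) == type_ for name, type_ in old_fields.items())
```

The digest identifies a schema version without a manually maintained counter, while the compatibility check decides whether readers of the old schema can keep working during a rolling upgrade.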

    RS-006 : Managing infrastructure incompatibility

    When an underlying upgrade or downgrade creates incompatibility, the CDAP system and CDAP Applications should be able to handle transient incompatibilities and service disruptions. This might be prevented through documentation and a published compatibility matrix, but the system should still be able to handle the impact.

    RS-007 : System state transition and management

    During the rolling upgrade process, the system has to be transitioned from one state to another. Different sub-systems could be in different states, and those need to be managed. This applies not only to the CDAP system but also to CDAP Applications.

    RS-008 : Apache Twill Application rolling upgrade

    In order to support rolling upgrade of CDAP Applications, capability needs to be added to Apache Twill Application.

    RS-009 : Upgrade orchestrator

    The whole upgrade process has to be coordinated across multiple sub-systems for the CDAP system and across components for CDAP Applications. The orchestrator is responsible for managing the lifecycle of the rolling upgrade and reporting the status of the upgrade.

    RS-010 : Progressive background upgrade tool

    A rolling upgrade would at times involve transitioning data and metadata from one format to another. If this process has to be non-intrusive, it should be implemented as a progressive process.
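
A progressive conversion can be sketched as small batches with a checkpoint between them, so the work can pause, yield to running services and resume later. The Python example below is a hypothetical illustration: the in-memory `store` dict stands in for a real table, and `migrate_batch`, `migrate_all` and `pause` are names invented for this sketch.

```python
def migrate_batch(store: dict, convert, batch_size: int, checkpoint=None):
    """Convert up to `batch_size` records and return the new checkpoint.

    Keys are visited in sorted order, starting after `checkpoint`.
    Returns None once no keys remain past the converted batch.
    """
    pending = sorted(k for k in store if checkpoint is None or k > checkpoint)
    batch = pending[:batch_size]
    for key in batch:
        store[key] = convert(store[key])
    if len(pending) <= batch_size:
        return None  # nothing left after this batch
    return batch[-1]


def migrate_all(store: dict, convert, batch_size: int, pause=lambda: None) -> int:
    """Run batches until done, calling `pause` between them; return batch count."""
    checkpoint = None
    batches = 0
    while True:
        checkpoint = migrate_batch(store, convert, batch_size, checkpoint)
        batches += 1
        if checkpoint is None:
            return batches
        # A real tool might sleep or check system load here.
        pause()
```

Persisting the checkpoint between batches would let the tool resume after a restart instead of starting over. This sketch keeps it in memory only.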

    RS-011 : Hydrator pipeline upgrade

    Hydrator pipelines are currently not compatible across major, minor or bug-fix releases because they are tightly tied to the exact version. This should follow the same or similar guidelines as RS-004.

    RS-012 : Dataset upgrade

    In some cases, system datasets used by the CDAP system or user datasets that are part of CDAP Applications might need to be migrated during the upgrade process, so the system should support upgrading both types of datasets as part of RS-010.

    RS-013 : Test framework and chaos monkey

    From the platform perspective, there should exist a solid end-to-end testing framework for testing known scenarios, while a chaos monkey would provide more comprehensive testing.

    RS-014 : User Interface / REST APIs / CLI

    It should be possible to initiate, manage, monitor and track the progress of rolling upgrades / downgrades through the CDAP User Interface, REST API and Command Line Interface.

    RS-015 : Support for rollbacks

    If an upgrade or downgrade fails midway through the process, the upgrade orchestrator (RS-009) should be able to roll back and restore the system to its state before the process started.
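
The rollback requirement can be illustrated with a small Python sketch in which each upgrade step carries its own inverse and a failure undoes the completed steps in reverse order. `UpgradeStep` and `run_upgrade` are hypothetical names for this example, and a real orchestrator would also need to persist progress and handle failures during the undo itself.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UpgradeStep:
    """Hypothetical unit of work: an action plus the action that reverses it."""
    name: str
    apply: Callable[[], None]
    undo: Callable[[], None]


def run_upgrade(steps: List[UpgradeStep]) -> bool:
    """Apply steps in order; on any failure, undo completed steps in reverse.

    Returns True if every step was applied, False if the upgrade was
    rolled back.
    """
    completed = []
    try:
        for step in steps:
            step.apply()
            completed.append(step)
    except Exception:
        # The failing step is assumed to have left nothing to undo.
        for step in reversed(completed):
            step.undo()
        return False
    return True
```

Undoing in reverse order matters because later steps may depend on the effects of earlier ones.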

    RS-016 : Application Versioning

    Support versioning of CDAP Applications. The specification of an application is versioned, and running multiple versions of an application simultaneously is supported.

    Open Item/Discussion point

    • Define long and short/transient outages

      • More information needs to be gathered here to understand the length of outages.

      • When outages last multiple hours, how should the system handle them?

      • Rolling upgrades take up to 6 hours

    Action Items

    • Oct 7th 2016

      • (tick) Send the HBase versions supported by CDAP

      • Gather information about CDH version compatibility changes: talk to Cloudera and compile

    Initiatives In Progress

    • [3.6] CDAP Service version and upgrade support
    • [3.6] Application versioning
    • [4.0] Messaging Service with goal of centralizing all transactional activities for metadata in HBase
    • [4.0?] Non-Transactional datasets
    • [4.0?] HBase Coprocessor Upgrade Management — Handling minor version changes efficiently without disabling HBase Tables. 
    • [4.0?] Upgrade tool improvements — Coprocessor Upgrade removal, faster data conversions if needed, smarts to reduce the impact to running services
    • [4.0?] CDAP Service Upgrade capability, might have Apache Twill change
    • [4.0?] Move configuration and operational updates to messaging services

    Initiatives In Plan

    • Clients have retry and back-off mechanism to operate in degraded mode
    • YARN application resilience through Apache Twill
    • Move the Dataset Service, which currently runs in Master, into a YARN Application