Goals

  • CDAP and CDAP applications can withstand short, transient infrastructure outages
  • During an interruption of one or more underlying services, CDAP and CDAP applications can operate with degraded performance or limited functionality
    • Users will not be able to perform operations such as deploying apps, starting programs, or other new data or application lifecycle operations.
      • However, all applications that are already running should continue to run
  • Once the interruption in the underlying service is resolved and services return to normal operation, CDAP and CDAP applications return to their normal state
  • Interruptions in service may be due to node failures, service failures, or compatible rolling upgrades or downgrades in progress
  • Does not include incompatible upgrades or downgrades of the underlying infrastructure
  • Does not include long unavailability of services and infrastructure

Open Item/Discussion point

  • Define long and short/transient outages
    • More information needs to be gathered to understand the typical length of outages.
    • When outages last multiple hours, how should the system handle them?

Action Items - Oct 7th 2016

  • (tick) Send the HBase versions supported by CDAP
  • Gather information about CDH version compatibility changes – Talk to Cloudera and compile 

Failure Scenarios

  • HDFS
    • Upgrade
    • Downgrade
    • Restart
    • Data Node Outage
  • HBase
    • Upgrade
    • Downgrade
    • Restart
    • Region Server Outage
  • ZooKeeper
    • Upgrade
    • Downgrade
    • Network Partition 
  • YARN
    • Upgrade
    • Downgrade
    • Node Manager Outage
    • RM Outage
  • Kafka
    • Upgrade
    • Downgrade
    • Disk Outage
  • KMS
    • Upgrade 
    • Downgrade 
    • Outage

Initiatives In Progress

  • [3.6] CDAP Service version and upgrade support
  • [3.6] Application versioning
  • [4.0] Messaging Service, with the goal of centralizing all transactional activities for metadata in HBase
  • [4.0?] Non-Transactional datasets
  • [4.0?] HBase Coprocessor Upgrade Management: handling minor version changes efficiently without disabling HBase tables
  • [4.0?] Upgrade tool improvements: Coprocessor Upgrade removal, faster data conversions if needed, and logic to reduce the impact on running services
  • [4.0?] CDAP Service Upgrade capability; may require a change in Apache Twill
  • [4.0?] Move configuration and operational updates to messaging services

Initiatives In Plan

  • Clients have a retry and back-off mechanism so that they can operate in degraded mode
  • YARN application resilience through Apache Twill
  • Move the Dataset Service, which currently runs in Master, to run as a YARN application
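
The client retry and back-off item above could look roughly like the following. This is only an illustrative sketch, not CDAP's actual client code: the names `ServiceUnavailableError` and `call_with_backoff`, and all default values, are hypothetical. It retries an operation that fails while an underlying service is unavailable, doubling the wait between attempts up to a cap and adding jitter so that many clients do not retry at the same moment.

```python
import random
import time


class ServiceUnavailableError(Exception):
    """Hypothetical error raised while an underlying service is down."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, jitter=random.random):
    """Run operation, retrying transient failures with capped exponential back-off.

    All parameter names and defaults are assumptions made for this sketch.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ServiceUnavailableError:
            if attempt == max_attempts:
                # The outage outlasted the retry budget: surface the failure.
                raise
            # The delay doubles with each attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Scale by a factor in [0.5, 1.0) so clients do not retry in lockstep.
            sleep(delay * (0.5 + jitter() / 2))
```

A caller would wrap a single request, for example `call_with_backoff(lambda: client.get_status())`, where `client.get_status` stands in for any call that raises `ServiceUnavailableError` during an outage. Bounding `max_attempts` matches the goal of tolerating short, transient outages only; how to treat outages lasting multiple hours is still an open item.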