Goals

  • CDAP and CDAP applications can withstand short, transient infrastructure outages.
  • While one or more underlying services are interrupted, CDAP and CDAP applications continue to operate with degraded performance or limited functionality.
    • Users will not be able to perform operations such as deploying applications, starting programs, or other new data or application lifecycle operations.
      • However, all applications that are already running should keep running.
  • Once the interruption in the underlying service is resolved, or the services return to normal operation, CDAP and CDAP applications return to their normal state.
  • Interruptions in service may be caused by node failures, service failures, or compatible rolling upgrades or downgrades in progress.
  • Does not include incompatible upgrades or downgrades of the underlying infrastructure.
  • Does not include prolonged unavailability of services or infrastructure.

Open Item/Discussion point

  • Define long and short/transient outages 

Infrastructure components used by Cask Data Application Platform (CDAP)

The following underlying infrastructure components are used by CDAP and/or by CDAP applications running in CDAP. They are listed in no particular order.
  • HDFS
  • HBase
  • Hive
  • Kafka
  • YARN
  • ZooKeeper
  • KMS

Functional use of infrastructure components

This section describes how, and for what purpose, each underlying component is used.

...

  • User secrets (e.g., passwords, access tokens)

Failure Scenarios

  • HDFS
    • Upgrade
    • Downgrade
    • Restart
    • Data Node Outage
  • HBase
    • Upgrade
    • Downgrade
    • Restart
    • Region Server Outage
  • ZooKeeper
    • Upgrade
    • Downgrade
    • Network Partition 
  • YARN
    • Upgrade
    • Downgrade
    • Node Manager Outage
    • ResourceManager (RM) Outage
  • Kafka
    • Upgrade
    • Downgrade
    • Disk Outage
  • KMS
    • Upgrade 
    • Downgrade 
    • Outage

Initiatives In Progress

  • [3.6] CDAP Service version and upgrade support
  • [3.6] Application versioning
  • [4.0] Messaging Service, with the goal of centralizing all transactional activities for metadata in HBase
  • [4.0?] Non-Transactional datasets 
  • [4.0?] HBase Coprocessor Upgrade Management: handle minor version changes efficiently without disabling HBase tables
  • [4.0?] Upgrade tool improvements: coprocessor upgrade removal, faster data conversions where needed, and smarts to reduce the impact on running services
  • [4.0?] CDAP Service upgrade capability; might require an Apache Twill change
  • [4.0?] Move configuration and operational updates to messaging services

Initiatives In Plan

  • Clients have a retry and back-off mechanism so they can operate in degraded mode
  • YARN application resilience through Apache Twill
  • Move the Dataset Service, which currently runs in the Master, to run as a YARN application
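
The client-side retry and back-off item above could look roughly like the following. This is a minimal sketch and not CDAP's client code: `ServiceUnavailable`, `call_with_backoff`, and all parameter names and defaults are hypothetical, chosen only for illustration. It retries a failing call with capped exponential back-off plus jitter, and gives up once the retry budget is spent, which is one possible way to separate a transient outage from a long one.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Hypothetical error raised when an underlying service cannot be reached."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on ServiceUnavailable with capped exponential back-off.

    All parameters are illustrative assumptions:
      max_attempts: total number of calls before giving up
      base_delay:   upper bound (seconds) for the first wait; doubles per retry
      max_delay:    ceiling (seconds) for any single wait
      sleep, rng:   injectable so the behaviour can be tested without real waiting
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                # The outage outlasted the retry budget: surface the error
                # instead of retrying forever.
                raise
            # Jittered wait: a random fraction of min(max_delay, base_delay * 2^attempt),
            # so many clients do not retry in lockstep when the service returns.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)
```

A caller would wrap each request to an underlying service, for example `call_with_backoff(lambda: fetch_metadata())`, where `fetch_metadata` stands in for any operation that raises `ServiceUnavailable` during an outage. The values for `max_attempts` and `max_delay` effectively encode the boundary between short and long outages, which this page still lists as an open item.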