Installation Improvements

Goals

  • Provide a great out-of-the-box installation experience
    • CDAP services on startup should check necessary pre-reqs and provide meaningful error messages if pre-reqs are not met
    • Provide meaningful messages in the log files for services on failure
  • Improve the documentation for installation

Checklist

  • User stories documented (Albert/Alvin)
  • User stories reviewed (Nitin)
  • Design documented (Albert/Alvin)
  • Design reviewed (Andreas)
  • Feature merged (Albert/Alvin)
  • Examples and guides (Albert/Alvin) (not needed for the feature)
  • Integration tests (Albert/Alvin) 
  • Documentation for feature (Albert/Alvin)
  • Blog post 

Design

Master

Improve exception logging
* Instead of stack trace, provide error code and message -- prescribe corrective actions where possible
* Stack trace should only be shown when CDAP fails unexpectedly
* Failing to bind to port
* Failing to reach a system service
* Failing to reach an underlying infra component
* Failing to load a class
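
As a rough illustration of the "error code and message instead of stack trace" idea above, the following Python sketch maps a few expected failure types to a code, a message, and a corrective action, and falls back to a full stack trace only for unexpected failures. The error codes (E-BIND, E-CONNECT, E-CLASSLOAD), the StartupError type, and both function names are hypothetical and are not taken from CDAP; ImportError stands in for a class-loading failure.

```python
import errno
import traceback
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StartupError:
    code: str     # illustrative code, not an actual CDAP error code
    message: str  # what went wrong, in user terms
    action: str   # suggested corrective action


def describe_failure(exc: BaseException,
                     port: Optional[int] = None) -> Optional[StartupError]:
    """Map an expected failure to a StartupError; return None if unexpected."""
    if isinstance(exc, ConnectionError):
        return StartupError("E-CONNECT", "Could not reach a required service",
                            "Check that the service is running and reachable")
    if isinstance(exc, OSError) and exc.errno == errno.EADDRINUSE:
        return StartupError("E-BIND", f"Failed to bind to port {port}",
                            "Stop the process using the port or configure another port")
    if isinstance(exc, ImportError):
        return StartupError("E-CLASSLOAD", f"Failed to load {exc.name or 'a module'}",
                            "Check the classpath and installed packages")
    return None


def format_failure(exc: BaseException, port: Optional[int] = None) -> str:
    """One-line message for expected failures, full stack trace otherwise."""
    err = describe_failure(exc, port)
    if err is None:
        return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return f"ERROR [{err.code}] {err.message}. Action: {err.action}."
```

For example, format_failure(OSError(errno.EADDRINUSE, "Address already in use"), port=10000) yields a single log line with the code and the suggested action, while an unrecognized exception keeps its stack trace.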

Perform integrity tests on startup
* Log important configuration info (SSL, ports, security, explore)
* Check configuration
* Check HDFS directory permissions
* Check status and versions of underlying systems (HDFS, YARN, HBase, Hive, ZooKeeper, Kafka)
* Check if there are appropriate resources available (may not be possible for YARN CPU/Memory)
* Log a warning if YARN only has 12 GB and CDAP requires 10 GB
* Log error if YARN only has 12 GB and CDAP requires > 12 GB
* Check that system services are up (tx, dataset, log, metric, transaction)
* Fail to start if any critical checks fail
* Checks can be disabled via cdap-site.xml
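
The checks above could be organized along the following lines. This is only a sketch under stated assumptions: the CheckResult type, the function names, and the 25% headroom threshold used to decide between OK and a warning are all hypothetical, and the `enabled` flag stands in for whatever cdap-site.xml property ends up controlling the checks. The YARN memory check reproduces the 12 GB examples from the list: 10 GB required gives a warning, more than 12 GB gives an error.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable

log = logging.getLogger("startup-checks")


@dataclass(frozen=True)
class CheckResult:
    name: str
    level: str   # "OK", "WARN", or "ERROR"
    detail: str


def check_yarn_memory(available_mb: int, required_mb: int,
                      headroom: float = 0.25) -> CheckResult:
    """ERROR if CDAP cannot fit at all; WARN if it fits with little headroom.

    The headroom fraction is an assumed threshold, not a CDAP setting.
    """
    detail = f"CDAP requires {required_mb} MB, YARN has {available_mb} MB"
    if required_mb > available_mb:
        return CheckResult("yarn-memory", "ERROR", detail)
    if required_mb > available_mb * (1 - headroom):
        return CheckResult("yarn-memory", "WARN", detail)
    return CheckResult("yarn-memory", "OK", detail)


def run_checks(checks: Iterable[Callable[[], CheckResult]],
               enabled: bool = True) -> bool:
    """Run every check, log every result, and return False if any is critical."""
    if not enabled:
        log.info("startup checks disabled by configuration")
        return True
    results = [check() for check in checks]  # run all checks before deciding
    levels = {"OK": logging.INFO, "WARN": logging.WARNING}
    for result in results:
        log.log(levels.get(result.level, logging.ERROR),
                "%s: %s", result.name, result.detail)
    return all(result.level != "ERROR" for result in results)
```

Running all checks before deciding means a single failed startup reports every problem at once, rather than one problem per restart.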

Improve master and system service interactions
* Start should be successful only if system services start successfully, otherwise shutdown and clean up
* Expose logs, status for each system service
* Status is OK only if the system service can perform its operations
* Master logs should only contain logs related to master (not app fabric, dataset, tx)
* Master logs and system service logs should contain classpath
* Master logs and system service logs should log which components are enabled or disabled
* Master logs and system service logs should log versions of underlying infra components
* Shutdown should clean up resources, show RED in Cloudera Manager

All top-level services

Improve service lifecycle management
* Fail startup if cannot bind to configured port
* If bound to 0.0.0.0, log all interfaces bound to
* Service status (/etc/init.d/<service> status) should work even if the service is stopped
* Improve retry logic
* When tx service is down, there are lots of exception logs
* When router has an issue, the UI sends a lot of requests causing lots of exception logs
* Give up after N tries, but keep retrying until it comes back up
* Flowlet: exponential backoff
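
One possible reading of the retry items above is: back off exponentially between attempts, log only the first N failures, and then keep retrying quietly until the dependency comes back. The Python sketch below follows that reading. The function names, the default delays, and the `probe` callable (which returns True once the dependency is reachable) are hypothetical.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0) -> Iterator[float]:
    """Yield an endless sequence of delays: base, base*factor, ... up to cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def wait_for_service(probe: Callable[[], bool], max_logged: int = 3,
                     sleep: Callable[[float], None] = time.sleep,
                     log: Callable[[str], None] = print) -> int:
    """Retry probe() until it succeeds; return the number of attempts made.

    Only the first max_logged failures are logged individually, followed by
    one notice that further failures are suppressed.
    """
    attempts = 0
    for delay in backoff_delays():
        attempts += 1
        if probe():
            return attempts
        if attempts <= max_logged:
            log(f"attempt {attempts} failed; retrying in {delay:.0f}s")
        elif attempts == max_logged + 1:
            log("service still unavailable; suppressing further failure logs")
        sleep(delay)
    return attempts  # not reached: backoff_delays() never ends
```

Passing `sleep` and `log` as parameters keeps the loop easy to exercise in tests without real waiting.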

UI

Improve UI logging
* Show meaningful messages if failed to start

Improve UI experience when Master is starting up or not running
* Show first loading screen when Master, Router is starting up
* Do status call for system services /v3/system/services/status
* Show normal UI when all critical system services are OK
* Show error screen when not all critical system services are OK
* If a non-critical system service is not OK, disable parts of the UI that use the service (e.g. Explore)
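
The screen selection described above can be sketched as a small decision function. The sketch assumes that the status call returns a mapping of service name to "OK" or another value, and that a failed call (Master or Router not reachable yet) is represented as None. The service names, the set of critical services, and the service-to-feature mapping are illustrative assumptions, not the actual CDAP lists.

```python
from typing import AbstractSet, Mapping, Optional, Set, Tuple


def choose_screen(status: Optional[Mapping[str, str]],
                  critical: AbstractSet[str],
                  feature_by_service: Mapping[str, str]) -> Tuple[str, Set[str]]:
    """Return (screen, disabled_features) for a system services status result.

    status is None when the status call itself failed.
    """
    if status is None:
        return "loading", set()
    # A critical service that is missing from the response counts as not OK.
    if any(status.get(service) != "OK" for service in critical):
        return "error", set()
    disabled = {feature for service, feature in feature_by_service.items()
                if status.get(service) != "OK"}
    return "normal", disabled
```

With a hypothetical mapping of {"explore.service": "Explore"}, a response in which only the explore service is not OK gives ("normal", {"Explore"}), so the UI loads with the Explore parts disabled.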

User Stories

  • User should be able to easily navigate to the installation docs for Ambari, Cloudera Manager, package managers, and MapR
  • User should be able to perform the necessary preparation steps for the installation using command line or UI
  • User should be able to perform step-by-step installation using command line or UI
  • Users should have a section for installing on HA clusters
    • HA secure clusters
    • HA YARN
    • HA HDFS
    • HA CDAP

  • User should be able to install non-standard components (nodejs) using instructions in docs
  • Users should be able to easily determine the configurations needed for CDAP installation by reading the docs
  • Users should have a section for installing on secure clusters
  • CDAP master services should not start up if required pre-requisites are not met
    • Pre-requisites: YARN containers required
    • HDFS permissions
    • CDAP system services running on Twill
    • Kafka services [Low Priority]
  • CDAP master service should log appropriate error messages if it fails to start and prescribe corrective actions where possible
  • CDAP master should start successfully only if all the corresponding system services start successfully
  • CDAP master service should serve requests only if app fabric and corresponding system services start up fine
  • Error logs in master logs should be meaningful to the end user and the end user should understand the behavior of the service easily
  • Master log should contain only app fabric logs and system services logs and not application logs
  • User should be able to determine the complete class path of the services (master, router, auth) by looking at the logs
  • Users should be able to determine what components are enabled and disabled by looking at the logs
  • Users should be able to determine the versions of underlying infra components by looking at the master logs
  • Users should be able to query for versions of the underlying infra components using HTTP REST API calls (Low) 
  • CDAP router should not start up if it cannot bind to configured port
  • CDAP router should log appropriate error in the logs if it fails to start
  • CDAP auth service should not start up if it cannot bind to configured port
  • CDAP auth service should log appropriate error in the logs if it fails to start
  • Services that are bound to 0.0.0.0 should log all the interfaces that are bound (Low)
  • CDAP UI should serve "services are starting up" content if the downstream services are not up, and should not make any other calls to the backend
  • CDAP UI should log meaningful messages on failure to start
  • CDAP UI should not start up if the version of nodejs installed is not compatible
  • Users should be able to optionally disable the checks performed during startup
  • Service status (/etc/init.d/<service> status) should work even if the service is stopped
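
For the nodejs compatibility story above, a startup script could compare the output of `node --version` (which prints a string such as v4.5.0) against a minimum version. The Python sketch below shows only the comparison. The function names are hypothetical, the minimum version is a parameter because this page does not state one, and pre-release suffixes are deliberately not handled.

```python
import re
from typing import Tuple


def parse_node_version(output: str) -> Tuple[int, ...]:
    """Parse `node --version` output such as 'v4.5.0' into (4, 5, 0)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", output.strip())
    if match is None:
        raise ValueError(f"unrecognized nodejs version: {output!r}")
    return tuple(int(part) for part in match.groups())


def is_node_compatible(output: str, minimum: Tuple[int, int, int]) -> bool:
    """True if the installed version is at least the required minimum."""
    # Tuples compare numerically, so 10.2.1 is correctly newer than 4.5.0.
    return parse_node_version(output) >= minimum
```

Comparing integer tuples rather than strings avoids the common mistake of treating "v10.2.1" as older than "v4.5.0".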

Additional Notes

Brainstorming notes (Andreas/Nitin/Albert/Sree)

 

Issue | Priority | Category
Master start-up should record versions (in the log, and also store that info and make it queryable) of underlying systems: HDFS, YARN, HBase, Hive, ZooKeeper, Kafka and any other system | H | Startup
Master on every startup should perform integrity tests: check that appropriate YARN resources (CPU/Memory) are available, Kafka is up, Log and Metric services are up, Transaction is up, HDFS directory permissions, checks for known configurations | H | Startup
UI should refuse to start if router and all underlying services are not up, or it should keep trying, with the UI showing that the system is coming up, until everything is up | H | Startup
Version compatibility checks before startup: check for the right version of Hive/HBase and throw meaningful errors | H | Startup
Log the necessary info during startup: classpath (no stars), what is enabled/disabled, versions of components, auth enabled/disabled | H | Startup
Right error messages should be printed in the logs on failure | H | Startup
Master service (App fabric) should register and serve requests only after the corresponding Twill application is up and running | H | Startup
Master logs should not be polluted with unnecessary exceptions during startup | H | Startup
It should be possible to optionally disable the startup checks | H | Startup
If any check fails, master should shut down, but all errors should be logged | H | Startup

 

Feedback from Installation Hackathon

Issue | Priority | Category
Installation and configuration document flow is all over the place: Max Client connections in pre-requisites and preparing cluster is the right place | H | Docs
Better highlighting in docs for cdap-site and cdap-security | M | Docs
Discrepancy in variable naming: /etc/security/keytabs/cdap.keytab vs /etc/security/keytabs/cdap.service.keytab | L |
Discrepancy in variable naming: router.server.address is described as IP in one place and hostname in another | L | Docs
More details on standard tasks: creating cdap Kerberos principal, update-alternatives | M | Docs
NTP not installed: should be a part of docs | M | Docs
Docs should clearly state CDAP HA/Hadoop HA and Kerberos is not supported | H | Docs
CDAP HA installation manual steps should be documented and should be linked from Ambari docs | H | Docs
Dependency of custom services is not supported; should be documented | H | Docs
CDAP with Explore enabled should have Hive client on master node(s) | H | Docs + Startup check
Ambari shows CDAP services RED most of the time | H | Ambari integration
Non-ASCII copyright causes the service scripts not to start | H | Service script
Ambari doesn't support importing hosts (can only install CDAP with Ambari if you installed Hadoop with Ambari) | L | Docs
Show docs for relevant errors in place (have error codes) | L | Startup check + Docs
Provide simple and meaningful errors on startup | H | Startup check
Should clearly specify what services are needed for Ambari/CDH integration | H | Docs
If the required pre-reqs are not installed there should be relevant messages on startup | H | Startup check
Inconsistencies in docs should be fixed: same services are described differently (Ambari doc uses Ambari terminology, etc) | H | Docs
Installation docs between CDH and Ambari are not consistent, structure is completely different | H | Docs
Installation steps for nodejs should be documented | H | Docs
UI startup script should check for right version of nodejs | H | Service script
Installation script should set right permissions for CDAP directories (/var/log/cdap) | H | Service script
The startup script should check for right permissions (duplicate) | H | Startup check
/etc/init.d/<service> status only works when service is running | H | Service script
HDFS permissions should be checked on startup (duplicate) | H | Startup check
Potential port conflicts for router + Hive should be documented (10000) | H | Docs
Service startups should detect port conflict and fail gracefully (check it dies and message is good) | H | Startup check
Master log is noisy (duplicate) | H | Platform
UI doesn't log anything if it doesn't start up | H | Platform
Docs search is not good | M | Docs Infra
Screenshots for installation steps need improvement | M | Docs
Add links to Cloudera docs for CM install | L | Docs
Compatibility matrix on which version of CM is needed etc needs to be prominent | H | Docs

 

 
