Goals:
- Provide a great out-of-the-box installation experience
- CDAP services on startup should check necessary pre-reqs and provide meaningful error messages if pre-reqs are not met
- Provide meaningful messages in the log files for services on failure
- Improve the documentation for installation
Checklist
- User stories documented (Albert/Alvin)
- User stories reviewed (Nitin)
- Design documented (Albert/Alvin)
- Design reviewed (Andreas)
- Feature merged (Albert/Alvin)
- Examples and guides (Albert/Alvin) (not needed for the feature)
- Integration tests (Albert/Alvin)
- Documentation for feature (Albert/Alvin)
- Blog post
Design:
Master
Improve exception logging:
* Instead of stack trace, provide error code and message -- prescribe corrective actions where possible
* Stack trace should only be shown when CDAP fails unexpectedly
* Failing to bind to port
* Failing to reach a system service
* Failing to reach an underlying infra component
* Failing to load a class
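As an illustration of the "error code and message" idea above, here is a minimal sketch that maps the four failure types listed to a code, a message, and a corrective action, and includes a stack trace only for unexpected failures. The class name `StartupErrors`, the codes `E001` to `E999`, and the action texts are all hypothetical placeholders, not existing CDAP identifiers. Both "failing to reach" cases are folded into one category for brevity.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.net.BindException;
import java.net.ConnectException;

/** Hypothetical helper: turns a startup failure into "code + message + action". */
final class StartupErrors {

  /** Placeholder error codes and corrective actions (assumptions, not CDAP's). */
  enum ErrorCode {
    PORT_BIND_FAILED("E001", "Free the configured port or configure a different one."),
    SERVICE_UNREACHABLE("E002", "Verify that the service or infra component is running and reachable."),
    CLASS_LOAD_FAILED("E003", "Verify that the required libraries are on the classpath."),
    UNEXPECTED("E999", "Report the stack trace that follows.");

    final String code;
    final String action;

    ErrorCode(String code, String action) {
      this.code = code;
      this.action = action;
    }
  }

  /** Classifies a failure by exception type; anything unknown is UNEXPECTED. */
  static ErrorCode classify(Throwable t) {
    if (t instanceof BindException) {
      return ErrorCode.PORT_BIND_FAILED;
    }
    if (t instanceof ConnectException) {
      return ErrorCode.SERVICE_UNREACHABLE;
    }
    if (t instanceof ClassNotFoundException || t instanceof NoClassDefFoundError) {
      return ErrorCode.CLASS_LOAD_FAILED;
    }
    return ErrorCode.UNEXPECTED;
  }

  /** Builds the log message; the stack trace is appended only for unexpected failures. */
  static String describe(Throwable t) {
    ErrorCode ec = classify(t);
    StringBuilder sb = new StringBuilder();
    sb.append(ec.code).append(' ').append(ec.name()).append(": ").append(t.getMessage());
    sb.append(" Action: ").append(ec.action);
    if (ec == ErrorCode.UNEXPECTED) {
      StringWriter sw = new StringWriter();
      t.printStackTrace(new PrintWriter(sw, true));
      sb.append(System.lineSeparator()).append(sw);
    }
    return sb.toString();
  }
}
```

A service's startup path could catch the failure once at the top level and log `StartupErrors.describe(e)` instead of the raw exception.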
Perform integrity tests on startup:
* Log important configuration info (SSL, ports, security, explore)
* Check configuration
* Check HDFS directory permissions
* Check status and versions of underlying systems (HDFS, YARN, HBase, Hive, ZooKeeper, Kafka)
* Check if there are appropriate resources available (may not be possible for YARN CPU/Memory)
* Log a warning if YARN only has 12 GB and CDAP requires 10 GB
* Log error if YARN only has 12 GB and CDAP requires > 12 GB
* Check that system services are up (transaction, dataset, log, metrics)
* Fail to start if any critical checks fail
* Checks can be disabled via cdap-site.xml
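The checks above could be organized roughly as follows. This is a sketch under stated assumptions: the names `StartupChecks`, `Result`, and `Severity` are hypothetical, the `enabled` flag stands in for whatever cdap-site.xml property ends up controlling the checks (the property name is not defined here), and the 80% warning threshold is an assumed value chosen only so that the 12 GB / 10 GB example yields a warning. All checks run before startup is failed, so that every error can be logged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical startup check runner: run every check, then decide whether to start. */
final class StartupChecks {

  enum Severity { OK, WARNING, ERROR }

  /** Outcome of a single check. */
  static final class Result {
    final String name;
    final Severity severity;
    final String message;

    Result(String name, Severity severity, String message) {
      this.name = name;
      this.severity = severity;
      this.message = message;
    }
  }

  /** Assumed threshold: warn when CDAP needs more than 80% of what YARN has. */
  static final long WARN_PERCENT = 80;

  /** Error if required memory exceeds available memory, warning if it is close. */
  static Result checkYarnMemory(long availableMb, long requiredMb) {
    String detail = "YARN has " + availableMb + " MB, CDAP requires " + requiredMb + " MB";
    if (requiredMb > availableMb) {
      return new Result("yarn-memory", Severity.ERROR, detail);
    }
    if (requiredMb * 100 > availableMb * WARN_PERCENT) {
      return new Result("yarn-memory", Severity.WARNING, detail);
    }
    return new Result("yarn-memory", Severity.OK, detail);
  }

  /** Runs all checks and collects every result; returns nothing when checks are disabled. */
  static List<Result> runAll(List<Supplier<Result>> checks, boolean enabled) {
    List<Result> results = new ArrayList<>();
    if (!enabled) {
      return results; // checks switched off in configuration
    }
    for (Supplier<Result> check : checks) {
      results.add(check.get());
    }
    return results;
  }

  /** Startup fails if any check reported an error. */
  static boolean shouldFailStartup(List<Result> results) {
    for (Result r : results) {
      if (r.severity == Severity.ERROR) {
        return true;
      }
    }
    return false;
  }
}
```

Collecting results before deciding keeps the "log all errors, then shut down" behavior in one place; a caller would log each `Result` and exit if `shouldFailStartup` returns true.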
Improve master and system service interactions:
* Start should be successful only if system services start successfully; otherwise shut down and clean up
* Expose logs, status for each system service
* Status is OK only if the system service can perform its operations
* Master logs should only contain logs related to master (not app fabric, dataset, tx)
* Master logs and system service logs should contain classpath
* Master logs and system service logs should log which components are enabled or disabled
* Master logs and system service logs should log versions of underlying infra components
* Shutdown should clean up resources, show RED in Cloudera Manager
All top-level services
Improve service lifecycle management
* Fail startup if cannot bind to configured port
* If bound to 0.0.0.0, log all interfaces bound to
* Service status (/etc/init.d/<service> status) should work even if the service is stopped
* Improve retry logic
* When tx service is down, there are lots of exception logs
* When router has an issue, the UI sends a lot of requests causing lots of exception logs
* Give up after N tries, but keep retrying until it comes back up
* Flowlet: exponential backoff
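The retry items above might look like the following sketch. It reads "give up after N tries, but keep retrying" as "stop logging each failure after N tries, but keep retrying quietly"; that reading is an assumption, as are the class name `Backoff` and all parameter names. Delays double on each attempt up to a cap.

```java
import java.util.concurrent.Callable;

/** Hypothetical retry helper with capped exponential backoff and throttled logging. */
final class Backoff {

  /** Delay before retrying after the given 1-based attempt: base * 2^(attempt-1), capped at maxMs. */
  static long delayMs(int attempt, long baseMs, long maxMs) {
    if (attempt < 1 || baseMs < 1 || maxMs < 1) {
      throw new IllegalArgumentException("attempt, baseMs and maxMs must be positive");
    }
    long delay = Math.min(baseMs, maxMs);
    for (int i = 1; i < attempt && delay < maxMs; i++) {
      // Double without overflowing past the cap.
      delay = (delay > maxMs / 2) ? maxMs : delay * 2;
    }
    return delay;
  }

  /**
   * Calls until success. Failures are logged for the first logLimit attempts only,
   * then a single "suppressing" line is printed and retries continue silently.
   */
  static <T> T retry(Callable<T> call, int logLimit, long baseMs, long maxMs)
      throws InterruptedException {
    for (int attempt = 1; ; attempt++) {
      try {
        return call.call();
      } catch (InterruptedException e) {
        throw e; // let the caller stop the retry loop
      } catch (Exception e) {
        if (attempt <= logLimit) {
          System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
        } else if (attempt == logLimit + 1) {
          System.err.println("Still failing; suppressing further messages and retrying");
        }
        Thread.sleep(delayMs(attempt, baseMs, maxMs));
      }
    }
  }
}
```

With `baseMs = 100` and `maxMs = 5000`, the waits are 100, 200, 400, 800 ms and so on, settling at 5000 ms, which bounds both log volume and request rate while a dependency such as the tx service is down.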
UI
Improve UI logging
* Show meaningful messages if failed to start
Improve UI experience when Master is starting up or not running
* Show a loading screen first when Master or Router is starting up
* Do status call for system services /v3/system/services/status
* Show normal UI when all critical system services are OK
* Show error screen when not all critical system services are OK
* If a non-critical system service is not OK, disable parts of the UI that use the service (e.g. Explore)
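The screen selection described above reduces to a small decision function. The sketch below is written in Java for consistency with the other examples on this page, although the UI itself runs on nodejs; the logic is language-neutral. It assumes the status call yields a map of service name to status string in which `"OK"` means healthy, and that a missing response (`null`) means Master or Router is still starting. The names `UiState`, `Screen`, and the service names used in the usage note are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical UI decision logic based on the system services status response. */
final class UiState {

  enum Screen { LOADING, NORMAL, ERROR }

  /**
   * statuses: service name to status, or null if the status call got no response.
   * critical: names of services the UI cannot work without.
   */
  static Screen screenFor(Map<String, String> statuses, Set<String> critical) {
    if (statuses == null) {
      return Screen.LOADING; // Master or Router not answering yet
    }
    for (String service : critical) {
      if (!"OK".equals(statuses.get(service))) {
        return Screen.ERROR; // a critical service is missing or not OK
      }
    }
    return Screen.NORMAL;
  }

  /** Non-critical services that are not OK; UI parts that use them should be disabled. */
  static Set<String> disabledFeatures(Map<String, String> statuses, Set<String> critical) {
    Set<String> disabled = new TreeSet<>();
    for (Map.Entry<String, String> e : statuses.entrySet()) {
      if (!critical.contains(e.getKey()) && !"OK".equals(e.getValue())) {
        disabled.add(e.getKey());
      }
    }
    return disabled;
  }
}
```

For example, with `transaction` and `dataset` treated as critical and both OK, a not-OK `explore` entry leaves the screen at `NORMAL` and returns `explore` from `disabledFeatures`.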
User Stories:
- User should be able to easily navigate to the installation docs for Ambari, Cloudera Manager, package managers, and MapR
- User should be able to perform the necessary preparation steps for the installation using command line or UI
- User should be able to perform step-by-step installation using command line or UI
- Users should have a section for installing on HA clusters:
  - HA secure clusters
  - HA YARN
  - HA HDFS
  - HA CDAP
- User should be able to install non-standard components (nodejs) using instructions in docs
- Users should be able to easily determine the configurations needed for CDAP installation by reading the docs
- Users should have section for installing on secure clusters
- CDAP master services should not start up if required pre-requisites are not met
- Pre-requisites: YARN containers required
- HDFS permissions
- CDAP system services running on Twill
- Kafka services [Low Priority]
- CDAP master service should log appropriate error messages if it fails to start and prescribe corrective actions where possible
- CDAP master should start successfully only if all the corresponding system services start successfully
- CDAP master service should serve requests only if app fabric and corresponding system services start up fine
- Error logs in master logs should be meaningful to the end user and the end user should understand the behavior of the service easily
- Master log should contain only app fabric logs and system services logs and not application logs
- User should be able to determine the complete class path of the services (master, router, auth) by looking at the logs
- Users should be able to determine what components are enabled and disabled by looking at the logs
- Users should be able to determine the versions of underlying infra components by looking at the master logs
- Users should be able to query for versions of the underlying infra components using HTTP REST API calls (Low)
- CDAP router should not start up if it cannot bind to configured port
- CDAP router should log appropriate error in the logs if it fails to start
- CDAP Auth should not start up if it cannot bind to configured port
- CDAP auth service should log appropriate error in the logs if it fails to start
- Services that are bound to 0.0.0.0 should log all the interfaces that are bound (Low)
- CDAP UI should serve only "services are starting up" content if the downstream services are not up, and should not make any other calls to the backend
- CDAP UI should log meaningful messages on failure to start
- CDAP UI should not start up if the version of nodejs installed is not compatible
- Users should be able to optionally disable the checks performed during startup
- Service status (/etc/init.d/<service> status) should work even if the service is stopped
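The port-related stories above (router and auth refusing to start when the port is taken, and services on 0.0.0.0 logging every bound interface) can be illustrated with the standard `java.net` classes. This is a sketch only: `PortCheck` and its method names are hypothetical. A check-then-bind probe like `tryBind` is inherently racy, so a real service would treat the failure of its own bind call as the authoritative signal.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.ServerSocket;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical port and interface checks for service startup. */
final class PortCheck {

  /** Tries to bind and releases the port again; returns null on success, else an error message. */
  static String tryBind(InetAddress addr, int port) {
    try (ServerSocket socket = new ServerSocket()) {
      socket.bind(new InetSocketAddress(addr, port));
      return null;
    } catch (IOException e) {
      return "Cannot bind to " + addr.getHostAddress() + ":" + port
          + " (" + e.getMessage() + "). Check for another process using this port.";
    }
  }

  /** Lists the addresses a listener is reachable on; expands the wildcard address to all interfaces. */
  static List<String> boundAddresses(InetAddress bindAddr) throws SocketException {
    List<String> result = new ArrayList<>();
    if (!bindAddr.isAnyLocalAddress()) {
      result.add(bindAddr.getHostAddress());
      return result;
    }
    Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
    if (interfaces == null) {
      return result; // no interfaces reported on this host
    }
    for (NetworkInterface nif : Collections.list(interfaces)) {
      for (InetAddress a : Collections.list(nif.getInetAddresses())) {
        result.add(nif.getName() + " " + a.getHostAddress());
      }
    }
    return result;
  }
}
```

A startup script or service could log the message from `tryBind` and exit with a non-zero status, and log each entry of `boundAddresses` when the configured bind address is 0.0.0.0.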
Additional Notes:
Brainstorming notes (Andreas/Nitin/Albert/Sree):
Issue | Priority | Category |
---|---|---|
Master start-up should record versions (in the log, and also store that info so it is queryable) of underlying systems - HDFS, YARN, HBase, Hive, ZooKeeper, Kafka and any other system | H | Startup |
Master on every startup should perform integrity test - Check if there are appropriate YARN resources (CPU/Memory) available, Kafka is up, Log and Metric services are up, Transaction is up, HDFS directory permissions, checks for known configurations | H | Startup |
UI should refuse to start if the router and all underlying services are not up, or it should keep trying, with the UI showing that the system is coming up, until everything is up | H | Startup |
Version compatibility checks - before startup - check for right version of Hive/HBase and throw meaningful errors | H | Startup |
Log the necessary info during startup - Classpath (no stars), what is enabled/disabled, versions of components, auth enabled/disabled | H | Startup |
Right error messages should be printed in the logs on failure | H | Startup |
Master service (App fabric) should register and serve requests only after the corresponding twill application is up and running | H | Startup |
Master logs should not be polluted with unnecessary exceptions during startup | H | Startup |
It should be possible to optionally disable the startup checks | H | Startup |
If any check fails, master should shut down, but all errors should be logged | H | Startup |
Feedback from Installation Hackathon
Issue | Priority | Category |
---|---|---|
Installation and configuration document flow is all over the place: max client connections in pre-requisites and preparing cluster is the right place | H | Docs |
Better highlighting in docs for cdap-site and cdap-security | M | Docs |
Discrepancy in variable naming: /etc/security/keytabs/cdap.keytab vs /etc/security/keytabs/cdap.service.keytab | L | |
Discrepancy in variable naming: router.server.address is described as IP in one place and hostname in another | L | Docs |
More details on standard tasks: Creating CDAP Kerberos principal, update-alternatives | M | Docs |
NTP not installed - should be a part of docs | M | Docs |
Docs should clearly state CDAP HA/Hadoop HA and Kerberos is not supported | H | Docs |
CDAP HA installation manual steps should be documented and should be linked from Ambari docs | H | Docs |
Dependencies on custom services are not supported; this should be documented | H | Docs |
CDAP with Explore enabled should have Hive client on master node(s) | H | Docs + Startup Check |
Ambari shows CDAP services RED most of the time | H | Ambari integration |
Non-ASCII copyright character causes the service scripts not to start | H | Service script |
Ambari doesn't support importing hosts (can only install CDAP w/ Ambari if you installed Hadoop w/ Ambari) | L | Docs |
Show docs for relevant errors in place (have error codes) | L | Startup check + Docs |
Provide simple and meaningful errors on startup | H | Startup check |
Should clearly specify what services are needed for Ambari/CDH integration | H | Docs |
If the required pre-reqs are not installed there should be relevant messages on startup | H | Startup check |
Inconsistencies in docs should be fixed - same services are described differently (Ambari doc uses Ambari terminology, etc) | H | Docs |
Installation docs between CDH and Ambari are not consistent, structure is completely different | H | Docs |
Installation steps for nodejs should be documented | H | Docs |
UI startup script should check for right version of nodejs | H | Service script |
Installation script should set right permissions for CDAP directories (/var/log/cdap) | H | Service script |
The startup script should check for right permissions (duplicate) | H | Startup check |
/etc/init.d/<service> status only works when service is running | H | Service script |
HDFS permissions should be checked on startup (duplicate) | H | Startup check |
Potential port conflicts for router + Hive should be documented (10000) | H | Docs |
Service startups should detect port conflict and fail gracefully (check it dies and message is good) | H | Startup check |
Master log is noisy (duplicate) | H | Platform |
UI doesn't log anything if it doesn't startup | H | Platform |
Docs search is not good | M | Docs Infra |
Screenshots for installation steps need improvement | M | Docs |
Add links to Cloudera docs for CM install | L | Docs |
Compatibility matrix (which version of CM is needed, etc.) needs to be prominent | H | Docs |