Goals
...
- Provide a great out-of-the-box installation experience
- CDAP services on startup should check necessary pre-reqs and provide meaningful error messages if pre-reqs are not met
- Provide meaningful messages in the log files for services on failure
- Improve the documentation for installation
...
- User stories documented (Albert/Alvin)
- User stories reviewed (Nitin)
- Design documented (Albert/Alvin)
- Design reviewed (Andreas)
- Feature merged (Albert/Alvin)
- Examples and guides (Albert/Alvin) (not needed for the feature)
- Integration tests (Albert/Alvin)
- Documentation for feature (Albert/Alvin)
- Blog post
Design
...
Master
Improve exception logging
...
* Instead of stack trace, provide error code and message -- prescribe corrective actions where possible
* Stack trace should only be shown when CDAP fails unexpectedly
* Failing to bind to port
* Failing to reach a system service
* Failing to reach an underlying infra component
* Failing to load a class
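The error-code idea above can be sketched as follows. This is a minimal illustration in Python (CDAP itself is Java); the error codes, messages, and the `describe_failure` helper are all hypothetical and only show the shape of the mapping: expected failures get a code, a message, and a corrective action, and only unexpected failures get a stack trace.

```python
import errno
import traceback

# Hypothetical catalog of expected failures: (exception type, code, message, action).
KNOWN_FAILURES = [
    (ConnectionError, "E1002", "Cannot reach a required service",
     "Verify the service is running and that its host and port are configured correctly."),
    (ImportError, "E1003", "Failed to load a class or module",
     "Check the classpath for missing or conflicting libraries."),
]


def describe_failure(exc):
    """Return a one-line message for expected failures, a stack trace otherwise."""
    # Port already in use is reported by the OS as EADDRINUSE.
    if isinstance(exc, OSError) and exc.errno == errno.EADDRINUSE:
        return ("[E1001] Cannot bind to the configured port: address already in use. "
                "Action: stop the process using the port or configure a different port.")
    for exc_type, code, message, action in KNOWN_FAILURES:
        if isinstance(exc, exc_type):
            return f"[{code}] {message}: {exc}. Action: {action}"
    # Unexpected failure: this is the only case where the stack trace is shown.
    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return "[E9999] Unexpected failure\n" + trace
```

A service's top-level exception handler would log `describe_failure(exc)` instead of logging the raw exception.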
Perform integrity tests on startup
...
* Log important configuration info (SSL, ports, security, explore)
* Check configuration
* Check HDFS directory permissions
* Check status and versions of underlying systems (HDFS, YARN, HBase, Hive, ZooKeeper, Kafka)
* Check if there are appropriate resources available (may not be possible for YARN CPU/Memory)
* Log a warning if YARN only has 12 GB and CDAP requires 10 GB
* Log error if YARN only has 12 GB and CDAP requires > 12 GB
* Check that system services are up (transaction, dataset, log, metric)
* Fail to start if any critical checks fail
* Checks can be disabled via cdap-site.xml
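A rough sketch of how these checks could be combined, written in Python for brevity. The `CheckResult` type, the function names, and the 80% warning threshold are assumptions made for illustration; the `enabled` flag stands in for whatever `cdap-site.xml` property ends up disabling the checks, which this page does not name.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    level: str    # "ok", "warn", or "error"
    message: str


def check_yarn_memory(available_gb, required_gb, warn_ratio=0.8):
    """Error if CDAP needs more than YARN has; warn if it needs most of it.

    warn_ratio is a hypothetical threshold, not a value from the design.
    """
    if required_gb > available_gb:
        return CheckResult("yarn-memory", "error",
                           f"CDAP requires {required_gb} GB but YARN has {available_gb} GB")
    if required_gb > warn_ratio * available_gb:
        return CheckResult("yarn-memory", "warn",
                           f"CDAP requires {required_gb} GB of the {available_gb} GB in YARN")
    return CheckResult("yarn-memory", "ok", "sufficient YARN memory")


def run_startup_checks(checks, enabled=True):
    """Run every check before deciding, so all problems are reported at once.

    Returns (can_start, results); can_start is False if any check is an error.
    """
    if not enabled:
        return True, []
    results = [check() for check in checks]
    can_start = all(r.level != "error" for r in results)
    return can_start, results
```

With these assumptions, 12 GB available and 10 GB required yields a warning, while 12 GB available and 13 GB required yields an error that blocks startup.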
...
Improve master and system service interactions
...
* Start should be successful only if system services start successfully; otherwise shut down and clean up
* Expose logs, status for each system service
* Status is OK only if the system service can perform its operations
* Master logs should only contain logs related to master (not app fabric, dataset, tx)
* Master logs and system service logs should contain classpath
* Master logs and system service logs should log which components are enabled or disabled
* Master logs and system service logs should log versions of underlying infra components
* Shutdown should clean up resources, show RED in Cloudera Manager
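The "start succeeds only if every system service starts, otherwise clean up" rule could look roughly like this. The `Service` class and `start_master` function are hypothetical stand-ins written in Python, not CDAP or Twill APIs.

```python
class Service:
    """Hypothetical stand-in for a system service with start/stop hooks."""

    def __init__(self, name, fails=False):
        self.name = name
        self.fails = fails
        self.running = False

    def start(self):
        if self.fails:
            raise RuntimeError(f"{self.name} could not start")
        self.running = True

    def stop(self):
        self.running = False


def start_master(services, log=print):
    """Start services in order; on the first failure, stop the ones already started."""
    started = []
    for svc in services:
        try:
            svc.start()
        except Exception as exc:
            log(f"ERROR {svc.name} failed to start: {exc}; shutting down")
            # Clean up in reverse order so no partially started system is left behind.
            for prev in reversed(started):
                prev.stop()
            return False
        started.append(svc)
    return True
```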
All top-level services
Improve service lifecycle management
* Fail startup if cannot bind to configured port
* If bound to 0.0.0.0, log all interfaces bound to
* Service status (/etc/init.d/<service> status) should work even if service is stopped
* Improve retry logic
* When tx service is down, there are lots of exception logs
* When router has an issue, the UI sends a lot of requests causing lots of exception logs
* Give up after N tries, but keep retrying until it comes back up
* Flowlet: exponential backoff
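One possible reading of the retry items above, sketched in Python: retry indefinitely with exponential backoff, but stop logging after the first N failures so the logs are not flooded. Interpreting "give up after N tries" as "stop logging after N tries" is an assumption, as are the function names and default delays.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield delays that grow exponentially up to a cap: 1, 2, 4, ... seconds."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)


def retry_until_up(operation, max_logged=3, sleep=time.sleep, log=print):
    """Retry `operation` until it succeeds; log only the first `max_logged` failures."""
    attempt = 0
    for delay in backoff_delays():
        try:
            return operation()
        except ConnectionError as exc:
            attempt += 1
            if attempt <= max_logged:
                log(f"attempt {attempt} failed: {exc}; retrying in {delay:.0f}s")
            elif attempt == max_logged + 1:
                log("still unavailable; suppressing further messages until recovery")
            sleep(delay)
```

Passing `sleep` and `log` as parameters keeps the sketch testable without real waiting.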
...
UI
Improve UI logging
* Show meaningful messages if failed to start
...
Improve UI experience when Master is starting up or not running
* Show a loading screen first while Master or Router is starting up
* Do status call for system services /v3/system/services/status
* Show normal UI when all critical system services are OK
* Show error screen when not all critical system services are OK
* If a non-critical system service is not OK, disable parts of the UI that use the service (e.g. Explore)
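The screen selection described above can be sketched as a small decision function. This Python sketch assumes the status call returns a map of service name to `"OK"` or another value; the service names, the set of critical services, and the feature-to-service mapping are hypothetical.

```python
# Hypothetical: which services must be OK before the normal UI is shown.
CRITICAL_SERVICES = {"appfabric", "dataset.executor", "transaction"}

# Hypothetical: UI feature -> non-critical service it depends on.
FEATURE_DEPENDENCIES = {"explore": "explore.service"}


def choose_screen(statuses):
    """Decide what the UI shows.

    statuses: dict of service name -> status string, or None when the
    status call itself cannot be answered (Master or Router still starting).
    """
    if statuses is None:
        return {"screen": "loading", "disabled": []}
    if any(statuses.get(svc) != "OK" for svc in CRITICAL_SERVICES):
        return {"screen": "error", "disabled": []}
    # All critical services are OK: disable only features whose service is not OK.
    disabled = sorted(feature for feature, svc in FEATURE_DEPENDENCIES.items()
                      if statuses.get(svc) != "OK")
    return {"screen": "normal", "disabled": disabled}
```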
User Stories
...
- User should be able to easily navigate to the installation docs for Ambari, Cloudera Manager, package managers, and MapR
- User should be able to perform the necessary preparation steps for the installation using command line or UI
- User should be able to perform step-by-step installation using command line or UI
- Users should have a section for installing on HA clusters
  - HA secure clusters
  - HA YARN
  - HA HDFS
  - HA CDAP
- User should be able to install non-standard components (nodejs) using instructions in docs
- Users should be able to easily determine the configurations needed for CDAP installation by reading the docs
- Users should have a section for installing on secure clusters
- CDAP master services should not start up if required pre-requisites are not met
- Pre-requisites: YARN containers required
- HDFS permissions
- CDAP system services running on Twill
- Kafka services [Low Priority]
- CDAP master service should log appropriate error messages if it fails to start and prescribe corrective actions where possible
- CDAP master should start successfully only if all the corresponding system services start successfully
- CDAP master service should serve requests only if app fabric and corresponding system services start up fine
- Error logs in master logs should be meaningful to the end user and the end user should understand the behavior of the service easily
- Master log should contain only app fabric logs and system services logs and not application logs
- User should be able to determine the complete class path of the services (master, router, auth) by looking at the logs
- Users should be able to determine what components are enabled and disabled by looking at the logs
- Users should be able to determine the versions of underlying infra components by looking at the master logs
- Users should be able to query for versions of the underlying infra components using HTTP REST API calls (Low)
- CDAP router should not start up if it cannot bind to its configured port
- CDAP router should log an appropriate error in the logs if it fails to start
- CDAP auth service should not start up if it cannot bind to its configured port
- CDAP auth service should log an appropriate error in the logs if it fails to start
- Services that are bound to 0.0.0.0 should log all the interfaces that are bound (Low)
- CDAP UI should serve "services are starting up" content if the downstream services are not up, and should not make any other calls to the backend
- CDAP UI should log meaningful messages on failure to start
- CDAP UI should not start up if the version of nodejs installed is not compatible
- Users should be able to optionally disable the checks performed during startup
- Service status (/etc/init.d/<service> status) should work even if service is stopped
Design
...
Additional Notes
...
Brainstorming notes (Andreas/Nitin/Albert/Sree):
Issue | Priority | Category |
---|---|---|
Master start-up should record versions (in the log, and also stored so that the info is queryable) of underlying systems: HDFS, YARN, HBase, Hive, ZooKeeper, Kafka, and any other system | H | Startup |
Master on every startup should perform integrity test - Check if there are appropriate YARN resources (CPU/Memory) available, Kafka is up, Log and Metric services are up, Transaction is up, HDFS directory permissions, checks for known configurations | H | Startup |
UI should refuse to start if the router and all underlying services are not up, or it should keep retrying while showing that the system is coming up until everything is up | H | Startup |
Version compatibility checks - before startup - check for right version of Hive/HBase and throw meaningful errors | H | Startup |
Log the necessary info during startup - Classpath (no stars), what is enabled/disabled, versions of components, auth enabled/disabled | H | Startup |
Right error messages should be printed in the logs on failure | H | Startup |
Master service (App fabric) should register and serve requests only after the corresponding twill application is up and running | H | Startup |
Master logs should not be polluted with unnecessary exceptions during startup | H | Startup |
It should be possible to optionally disable the startup checks | H | Startup |
If any check fails, master should shut down, but all errors should be logged | H | Startup |
...