Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
Metrics are written to the metrics table with a row key that is composed from all the dimensions in the metrics context and the metric name. The metric context contains the instance id of the container, which, in a MapReduce job, is the task ID of the mapper or reducer. A large MapReduce that runs 1000 mappers and emits 10 different metrics would therefore write to 10000 rows every second. Because the metrics table is not salted or otherwise spread out, all these writes (having the same prefix of namespace, app, program id, run id) go to the same region. This region goes under very heavy load, which can impact performance of other regions served on the same region server.
Goals
The goal is to salt following metrics tables to avoid hot spotting.
- cdap_system:metrics.v2.table.ts.1
- cdap_system:metrics.v2.table.ts.2147483647
- cdap_system:metrics.v2.table.ts.3600
- cdap_system:metrics.v2.table.ts.60
User Stories
- CDAP system should be able to salt metrics tables so that all the writes does not go to same region.
Design
In order to salt HBase metrics tables, we can create a new salted metrics tables:
- cdap_system:metrics.v3.table.ts.1
- cdap_system:metrics.v3.table.ts.2147483647
- cdap_system:metrics.v3.table.ts.3600
- cdap_system:metrics.v3.table.ts.60
Tables that do not require salting:
- cdap_system:metrics.kafka.meta
- cdap_system:metrics.v2.entity
Once the new salted tables are created, all the writes will go to salted tables. However, for reading, we can use CombinedHBaseMetricsTable which encapsulates both unsalted an salted metrics tables. Any read query on tables will be performed on both the tables and the results will be merged. While any write queries are only performed on salted HBase tables.
public class CombinedHBaseMetricsTable implements MetricsTable { private final MetricsTable unsaltedHBaseTable; private final MetricsTable saltedHBaseTable; @Nullable @Override public byte[] get(byte[] row, byte[] column) { return new byte[0]; } @Override public void put(SortedMap<byte[], ? extends SortedMap<byte[], Long>> updates) { // write only to saltedHBaseTable } @Override public void putBytes(SortedMap<byte[], ? extends SortedMap<byte[], byte[]>> updates) { // write only to saltedHBaseTable } @Override public boolean swap(byte[] row, byte[] column, byte[] oldValue, byte[] newValue) { return false; } @Override public void increment(byte[] row, Map<byte[], Long> increments) { } @Override public void increment(NavigableMap<byte[], NavigableMap<byte[], Long>> updates) { // write only to saltedHBaseTable } @Override public long incrementAndGet(byte[] row, byte[] column, long delta) { return 0; } @Override public void delete(byte[] row, byte[][] columns) { // delete records from both the tables } @Override public Scanner scan(@Nullable byte[] start, @Nullable byte[] stop, @Nullable FuzzyRowFilter filter) { // scan and merge records from both the tables return null; } @Override public void close() throws IOException { } }
Deletion of Existing Unsalted Tables
Metrics Table with resolution 1s has retention of 2 hours by default while 1m and 1h resolution tables have retention of 30 days by default. Now, once data in these tables gets expired, a background thread running in metrics processor can delete these tables. However, totals resolution table retains metrics for Integer.MAX_VALUE time.
Approach
Approach #1
Approach #2
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success 404 - When application is not available 500 - Any internal errors |
|
Deprecated REST API
Path | Method | Description |
---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application |
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
Security Impact
What's the impact on Authorization and how does the design take care of this aspect
Impact on Infrastructure Outages
System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
Releases
Release X.Y.Z
Release X.Y.Z
Related Work
- Work #1
- Work #2
- Work #3