Metadata 5.0 Upgrade
Goal
To migrate CDAP's existing metadata and keep it accessible in 5.0 and beyond.
Requirements
Functional Requirements:
Metadata upgrade must happen while CDAP is up and functional and should not require CDAP to be stopped.
Metadata upgrade should not generate any audit records, and all original audit records must remain valid and accessible during and after the upgrade.
If a timestamp is associated with any metadata change (add/delete/update), it must not be affected by the upgrade.
The CDAP metadata service must remain accessible during the upgrade, although limited functionality and degraded performance are acceptable. The following metadata functionality must be supported during the upgrade:
Adding new metadata: Users should be able to deploy an app, create a dataset, etc. while the metadata upgrade is running. This requires that CDAP be able to add metadata to “new entities/resources” during the upgrade.
Removing existing metadata: Users should be able to delete an app, pipeline, etc. while the metadata upgrade is running. This requires that CDAP metadata be available for removing metadata of existing entities/resources.
Non-Functional Requirements:
Metadata upgrade should run in the background and should require little to no human monitoring or intervention.
Metadata upgrade should be done in batches and/or with appropriately sized transactions so that transaction timeouts are avoided during the upgrade.
Metadata upgrade should be resilient to CDAP restarts and should not need to be redone from the start after one.
Metadata upgrade should be resilient to CDAP crashes and failures and should not need to be redone from the start after one.
Design
Daemon Thread: MetadataService will start a daemon thread which will be responsible for reading all the existing metadata from the old table and rewriting it to the new table.
Separation of Data: We will use two separate metadata tables to have a clear separation between the migrated and old data. (Note: If authorization is configured, this will require that the proper privileges are granted on the new metadata table.)
Batch Processing: We will do the upgrade in batches. The batch size will be a limit on the number of row keys in a scan rather than a targeted key range such as a particular namespace. It is very probable that most entities belong to a single namespace or application, so batching by such a key would be skewed: very few batches, each of them very large.
Restart and Failure Recovery: One of the key requirements for the upgrade is to handle failover and restarts of CDAP. For this we will use a simple checkpointing mechanism in which we periodically store the rows/keys that have been migrated, so that we can resume from the last checkpoint rather than from the very beginning.
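The batching and checkpointing described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual CDAP implementation: the class name BatchedMigrationSketch is hypothetical, in-memory sorted maps stand in for the old and new metadata tables, and the checkpoint is a single "last migrated row key" held in a field, whereas a real daemon would persist it, ideally in the same transaction as the batch.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch: sorted maps stand in for the old and new metadata tables. */
final class BatchedMigrationSketch {
  private final NavigableMap<String, String> oldTable;
  private final NavigableMap<String, String> newTable;
  private final int batchSize;
  // Last row key known to be copied. Assumption: a real daemon would persist this value.
  private String checkpoint;

  BatchedMigrationSketch(NavigableMap<String, String> oldTable,
                         NavigableMap<String, String> newTable,
                         int batchSize, String checkpoint) {
    this.oldTable = oldTable;
    this.newTable = newTable;
    this.batchSize = batchSize;
    this.checkpoint = checkpoint;
  }

  String getCheckpoint() {
    return checkpoint;
  }

  /** Copies at most batchSize rows that sort after the checkpoint; returns the number copied. */
  int migrateNextBatch() {
    // The batch is bounded by a row count, not by namespace or application.
    NavigableMap<String, String> remaining =
        checkpoint == null ? oldTable : oldTable.tailMap(checkpoint, false);
    int copied = 0;
    String lastKey = checkpoint;
    for (Map.Entry<String, String> row : remaining.entrySet()) {
      if (copied == batchSize) {
        break;
      }
      // Rewriting a row is idempotent, so replaying a batch after a crash is harmless.
      newTable.put(row.getKey(), row.getValue());
      lastKey = row.getKey();
      copied++;
    }
    // Advance the checkpoint only after the whole batch has been written.
    checkpoint = lastKey;
    return copied;
  }

  public static void main(String[] args) {
    NavigableMap<String, String> oldTable = new TreeMap<>();
    for (int i = 0; i < 5; i++) {
      oldTable.put("row" + i, "value" + i);
    }
    NavigableMap<String, String> newTable = new TreeMap<>();

    BatchedMigrationSketch first = new BatchedMigrationSketch(oldTable, newTable, 2, null);
    first.migrateNextBatch();

    // Simulated restart: a fresh instance resumes from the saved checkpoint.
    BatchedMigrationSketch resumed =
        new BatchedMigrationSketch(oldTable, newTable, 2, first.getCheckpoint());
    while (resumed.migrateNextBatch() > 0) {
      // keep going until a batch comes back empty
    }
    System.out.println("migrated all rows: " + newTable.equals(oldTable));
  }
}
```

Because the checkpoint moves only after a full batch is written, a crash in the middle of a batch costs at most one batch of repeated work.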
Scalability and Future-proofing: CDAP Metrics has in the past suffered from HBase hot-spotting, leading to degraded performance and region server failures. This was because metrics are periodically and automatically emitted from all running programs, which usually run in a single namespace or very few namespaces. It was resolved by salting the HBase keys, which is a standard practice when dealing with sequential row keys. With the introduction of the feature to emit metadata from programs and pipelines, we suspect metadata could run into similar issues and failures depending on how a customer uses it. One major difference between metrics and metadata is that metrics are emitted automatically at very short intervals, whereas metadata will be emitted by user code. If a user decides to emit metadata at very short intervals, we might end up with a similar issue. However, we believe such usage is not realistic and we don’t need to optimize for this scenario. Salting the key might cause performance degradation during scans, and we must do some analysis before we decide to take this route. While writing the metadata upgrade tool we will keep this in mind and implement it in such a way that, if needed later, we can salt the metadata table with minimal or no code change.
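To make the salting trade-off concrete, here is a minimal sketch of key salting. It is an assumption-laden illustration, not the scheme CDAP Metrics uses: the class SaltedKeySketch, the one-byte hash-derived prefix, and the bucket count are all hypothetical. It shows why writes spread across buckets while a prefix scan has to fan out to one scan per bucket, which is the scan cost mentioned above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of row-key salting with a one-byte bucket prefix. */
final class SaltedKeySketch {
  private final int buckets;

  SaltedKeySketch(int buckets) {
    if (buckets < 1 || buckets > 256) {
      throw new IllegalArgumentException("buckets must be between 1 and 256");
    }
    this.buckets = buckets;
  }

  /** Prepends a deterministic bucket byte derived from a hash of the key. */
  byte[] salt(String rowKey) {
    byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
    byte[] salted = new byte[key.length + 1];
    salted[0] = (byte) Math.floorMod(Arrays.hashCode(key), buckets);
    System.arraycopy(key, 0, salted, 1, key.length);
    return salted;
  }

  /** Drops the bucket byte to recover the original row key. */
  String unsalt(byte[] salted) {
    return new String(salted, 1, salted.length - 1, StandardCharsets.UTF_8);
  }

  /** A logical prefix scan becomes one physical scan per bucket. */
  List<byte[]> scanPrefixes(String keyPrefix) {
    byte[] prefix = keyPrefix.getBytes(StandardCharsets.UTF_8);
    List<byte[]> result = new ArrayList<>();
    for (int bucket = 0; bucket < buckets; bucket++) {
      byte[] salted = new byte[prefix.length + 1];
      salted[0] = (byte) bucket;
      System.arraycopy(prefix, 0, salted, 1, prefix.length);
      result.add(salted);
    }
    return result;
  }
}
```

If the upgrade tool builds every row key through one function of this kind (an identity function today), salting could later be switched on in a single place.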
Behavior During Upgrade
Existing MetadataEntity/Resource | |
Delete all metadata | Pass |
Update Metadata | Fail (Retry requested) |
Search | Partial Result (Does not include entity/resource which are not yet migrated) |
Get | Fail (Retry requested) |
New/Migrated Metadata Entity/Resource | |
Add | Pass |
Delete | Pass |
Update | Pass |
Get | Pass |
Search | Partial Result (Only includes entity/resource which are migrated or newly created) |
We realize that the above limits the user to some extent, but we don’t expect a metadata update to be a crucial call that cannot be retried after some time. Since we expect the metadata table to be fairly limited in size, we don’t expect the upgrade to take so long that being unable to update metadata becomes a severe issue. Also, some calls internally require a metadata update (for example, redeploying an application or updating a schedule). For such calls we will need to ensure that we return an appropriate retry message to the user and that the message is propagated all the way up to the UI.
Search and Get for metadata entities that are not yet migrated will not work until they are migrated. Again, since these calls are not crucial to major CDAP functionality and the upgrade is not expected to be a very long process, we consider the unavailability of these functions for particular entities (yet to be migrated) acceptable.
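The behavior table above can be read as a small decision function. The sketch below is illustrative only: UpgradeBehaviorSketch, Operation and Outcome are hypothetical names, not CDAP classes. The table does not list Add for an existing entity that is not yet migrated; the sketch assumes it is treated like Update, which is an assumption and not something stated in this design.

```java
/** Hypothetical sketch of the call outcomes while the metadata upgrade is running. */
final class UpgradeBehaviorSketch {
  enum Operation { ADD, DELETE, UPDATE, GET, SEARCH }

  enum Outcome { PASS, RETRY_LATER, PARTIAL_RESULT }

  /**
   * @param op the metadata call being made
   * @param migratedOrNew true if the entity is already migrated or was created during the upgrade
   */
  static Outcome outcomeDuringUpgrade(Operation op, boolean migratedOrNew) {
    // Search only ever sees migrated or newly created entities, so it is partial in both cases.
    if (op == Operation.SEARCH) {
      return Outcome.PARTIAL_RESULT;
    }
    if (migratedOrNew) {
      return Outcome.PASS;
    }
    switch (op) {
      case DELETE:
        // Deleting all metadata of an existing entity must keep working.
        return Outcome.PASS;
      case UPDATE:
      case GET:
        return Outcome.RETRY_LATER;
      default:
        // ADD on a not-yet-migrated entity is not covered by the table.
        // Assumption: it behaves like UPDATE and asks the caller to retry.
        return Outcome.RETRY_LATER;
    }
  }
}
```

A RETRY_LATER outcome is what would need to surface as the retry message propagated to the UI for calls such as application redeploy.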