...
- The user is responsible for managing the lifecycle of the custom HDFS directory, Hive database, and HBase namespace. CDAP will not create or delete any custom namespaces. These custom namespaces in the underlying storage provider must be empty.
We are making this decision to keep the behavior consistent. Also, allowing users to use a namespace that already contains data can lead to various issues. For example, we create a queue table for every namespace in HBase. What if a table with the same name already exists? Furthermore, we don't see a use case where a user would want CDAP to handle the lifecycle of custom namespaces or to have external data in them.
// NoteToSelf: Sync with Ali Anwar on impersonation to figure out whether we need to pre-check that the underlying namespace exists and is empty.
- Users can provide a custom namespace for one, several, or all storage providers. CDAP will be responsible for managing the lifecycle of every storage provider namespace for which the user did not provide a custom value.
This is done to give users the flexibility to use custom namespaces only for the storage that needs them and let CDAP handle the rest.
- The custom namespace mapping is final and immutable. It can only be provided when the namespace is created and cannot be changed afterwards.
This is done to keep the design simple for now. Supporting a mutable mapping raises many other questions: What do we do with existing data? Will we need a migration tool? How do we migrate HBase, Hive, and HDFS data for CDAP?
- An underlying storage namespace can be mapped to only one CDAP namespace. Users will not be allowed to create two CDAP namespaces that use the same underlying storage namespace or a part of it (a subdirectory). During namespace creation we will explicitly check that no other CDAP namespace is using the custom storage namespace. We will also check that the directory is not a subdirectory of a directory used by another namespace.
We are making this design decision because sharing an underlying namespace would have confusing consequences: programs would end up sharing datasets, metadata, and so on. For example, deleting a dataset from one namespace would delete it from the other one too.
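The uniqueness and subdirectory checks described above could look roughly like the following sketch. It is not CDAP code: the class `CustomMappingRegistry`, its in-memory map, and the method names are hypothetical, and it only covers the HDFS directory case. It also assumes the check should reject overlap in both directions (a new directory that is a parent of an existing one, as well as a subdirectory), which the text above does not state explicitly.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory registry: CDAP namespace -> custom HDFS root directory.
final class CustomMappingRegistry {
  private final Map<String, Path> hdfsDirs = new HashMap<>();

  // True if the two directories are equal or one contains the other.
  // Path.startsWith compares whole name elements, so /data/ns1 and
  // /data/ns10 are NOT treated as overlapping.
  static boolean overlaps(Path a, Path b) {
    Path na = a.normalize();
    Path nb = b.normalize();
    return na.startsWith(nb) || nb.startsWith(na);
  }

  // Records the mapping at namespace creation time, or throws if it
  // conflicts with the rules in the list above.
  void register(String cdapNamespace, String customDir) {
    if (hdfsDirs.containsKey(cdapNamespace)) {
      // The mapping is immutable once the namespace has been created.
      throw new IllegalStateException(
          "Mapping for namespace '" + cdapNamespace + "' cannot be changed");
    }
    Path candidate = Paths.get(customDir).normalize();
    for (Map.Entry<String, Path> entry : hdfsDirs.entrySet()) {
      if (overlaps(candidate, entry.getValue())) {
        throw new IllegalArgumentException(
            "Directory " + candidate + " overlaps with " + entry.getValue()
                + " used by namespace '" + entry.getKey() + "'");
      }
    }
    hdfsDirs.put(cdapNamespace, candidate);
  }
}
```

In this sketch, registering `/data/ns1` for one namespace and then `/data/ns1/sub` for another fails with an `IllegalArgumentException`, while `/data/ns10` is accepted because the comparison is done per path element and not per character prefix.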
...