...

  1. The user is responsible for managing the lifecycle of a custom hdfs directory, hive database, or hbase namespace. CDAP will not create or delete any custom namespaces, and these custom namespaces in the underlying storage provider must be empty.
    We are making this decision for consistency of behavior. Also, allowing users to use a namespace that already contains data can lead to various issues. For example, for every namespace in hbase we create a queue table; what if a table with the same name already exists? Furthermore, we do not see a use case where a user would want CDAP to handle the lifecycle of custom namespaces or would want them to contain external data.
    // NoteToSelf: Sync with Ali Anwar for impersonation to figure out whether we need to pre-check that the underlying namespace exists and is empty.
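The pre-check discussed in the note above could be sketched as follows. This is a minimal illustration using java.nio.file as a stand-in for the HDFS FileSystem API; the class and method names are hypothetical, not CDAP code:

```java
import java.io.IOException;
import java.nio.file.*;

// Illustrative pre-check: before accepting a custom mapping, verify that the
// user-supplied directory already exists and is empty. Real CDAP code would
// run the equivalent check against HDFS (and hive/hbase) at namespace creation.
class NamespacePrecheck {
    static boolean existsAndEmpty(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false; // CDAP will not create the custom namespace itself
        }
        try (DirectoryStream<Path> contents = Files.newDirectoryStream(dir)) {
            return !contents.iterator().hasNext(); // reject non-empty directories
        }
    }
}
```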

  2. Users can provide a custom namespace for one, several, or all storage providers. CDAP will be responsible for managing the lifecycle of every storage provider namespace for which the user did not provide a custom value.
    This gives users the flexibility to use custom namespaces only for the storage providers that need them and to let CDAP handle the others.
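The per-provider defaulting rule above can be sketched like this. The provider keys and the generated default name are illustrative assumptions, not CDAP's actual configuration keys or naming scheme:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the defaulting rule: for each storage provider the user did not
// map explicitly, CDAP falls back to a namespace it generates and manages.
class StorageMapping {
    private static final String[] PROVIDERS = {"hdfs", "hive", "hbase"};

    static Map<String, String> resolve(String cdapNamespace,
                                       Map<String, String> custom) {
        Map<String, String> resolved = new HashMap<>();
        for (String provider : PROVIDERS) {
            // a user-provided value wins; otherwise CDAP owns the lifecycle
            resolved.put(provider,
                custom.getOrDefault(provider, "cdap_" + cdapNamespace));
        }
        return resolved;
    }
}
```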

  3. The custom namespace mapping is final and immutable. It can only be provided during the creation of the namespace and cannot be changed afterwards.
    This is done to keep the design simple for now. Supporting a mutable mapping would require answering a lot of other questions: What do we do with existing data? Will we need a migration tool? How would we migrate hbase, hive, and hdfs data for cdap?
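The immutability decision above amounts to capturing the mapping once at creation time and never exposing a way to change it. A minimal sketch (class and field names are illustrative, not CDAP's actual metadata classes):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch: the mapping is defensively copied at namespace creation and
// exposed read-only afterwards; no setter exists.
final class NamespaceMeta {
    private final Map<String, String> storageMapping;

    NamespaceMeta(Map<String, String> mapping) {
        // copy at creation time so later changes to the caller's map are ignored
        this.storageMapping = Collections.unmodifiableMap(new HashMap<>(mapping));
    }

    Map<String, String> getStorageMapping() {
        return storageMapping;
    }
}
```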

  4. An underlying storage namespace can be mapped to only one cdap namespace. Users will not be allowed to create two cdap namespaces that use the same underlying storage namespace or a part of it (a subdirectory). During namespace creation we will explicitly check that no other cdap namespace is using the custom storage namespace, and that the directory is not a subdirectory of a directory used by another namespace.
    We are making this design decision because sharing an underlying namespace would lead to a lot of strange consequences, since programs would be sharing datasets, metadata, etc. For example, deleting a dataset from one namespace would also delete it from the other.
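The creation-time conflict check described above can be sketched as a component-wise path prefix test in both directions (class and method names are illustrative, not CDAP code):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: a candidate custom directory is rejected if it equals, contains,
// or is contained in a directory already claimed by another cdap namespace.
class ConflictCheck {
    static boolean conflicts(String existingDir, String candidateDir) {
        Path existing = Paths.get(existingDir).normalize();
        Path candidate = Paths.get(candidateDir).normalize();
        // Path.startsWith compares whole path components (so "/data/ns1" does
        // not match "/data/ns"), and also covers the exact-match case.
        return candidate.startsWith(existing) || existing.startsWith(candidate);
    }
}
```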

...

Out-of-scope User Stories (4.0 and beyond)

  1. Support for accessing entities other than datasets/streams in a different namespace. For example, as a cdap user in namespace ns1, I should be able to create an application app1 using an artifact artifact2 that is present in namespace ns2.
  2. Cross-namespace access in explore queries using the cdap namespace. Currently, users can do cross-namespace access by providing the underlying hive database name.
  3. The Admin interface for Dataset should be able to perform cross-namespace access.


Appendix A: API changes


Code Block
languagejava
titleChanges for dataset
// Dataset Context:
<T extends Dataset> T getDataset(String namespace, String name);
<T extends Dataset> T getDataset(String namespace, String name, Map<String, String> arguments);



// Add APIs to programs to support accessing a dataset from a different namespace:

// MapReduce:
context.addInput(Input.ofDataset("myDataset").fromNamespace("ns"));

// Spark: 
public <K, V> JavaPairRDD<K, V> fromDataset(String namespace, String datasetName);

public <K, V> JavaPairRDD<K, V> fromDataset(String namespace, String datasetName, Map<String, String> arguments);

public abstract <K, V> JavaPairRDD<K, V> fromDataset(String namespace, String datasetName, Map<String, String> arguments, @Nullable Iterable<? extends Split> splits);
Code Block
languagejava
titleChanges for stream
// Add APIs for different programs to support accessing a stream from another namespace:

// MapReduce: 
context.addInput(Input.ofStream("stream").fromNamespace("ns")); 


// Flowlet: 
void connectStream(String stream, Flowlet flowlet);
void connectStream(String stream, String flowlet);


// Spark: 
JavaRDD<StreamEvent> fromStream(String namespace, String streamName, long startTime, long endTime);

JavaPairRDD<Long, V> fromStream(String namespace, String streamName, Class<V> valueType);

JavaPairRDD<Long, V> fromStream(String namespace, String streamName, long startTime, long endTime, Class<V> valueType);

JavaPairRDD<K, V> fromStream(String namespace, String streamName, long startTime, long endTime, Class<? extends StreamEventDecoder<K, V>> decoderClass, Class<K> keyType, Class<V> valueType);

JavaPairRDD<Long, GenericStreamEventData<T>> fromStream(String namespace, String streamName, FormatSpecification formatSpec, long startTime, long endTime, Class<T> dataType);