Page Comparison

Public interface to emit the preview data:

Code Block

language	java

public interface PreviewEmitter {

  /**
   * Emit the property specified by name and value for the given key.
   * Values emitted with the same key and propertyName are grouped in a list.
   *
   * @param key the key under which properties are stored
   * @param propertyName the name of the property
   * @param propertyValue the value associated with the property
   */
  void emit(String key, String propertyName, Object propertyValue);
}
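
The grouping behaviour described in the Javadoc (values emitted with the same key and propertyName end up in one list) can be illustrated with a minimal in-memory sketch. The class name InMemoryPreviewEmitter and its snapshot() method are hypothetical and not part of this design; the interface is repeated only so that the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Repeated from the design so the sketch is self-contained.
interface PreviewEmitter {
  void emit(String key, String propertyName, Object propertyValue);
}

// Hypothetical implementation for illustration only (assumption, not the CDAP class).
class InMemoryPreviewEmitter implements PreviewEmitter {
  // key -> propertyName -> values, in emission order
  private final Map<String, Map<String, List<Object>>> data = new LinkedHashMap<>();

  @Override
  public synchronized void emit(String key, String propertyName, Object propertyValue) {
    data.computeIfAbsent(key, k -> new LinkedHashMap<>())
        .computeIfAbsent(propertyName, p -> new ArrayList<>())
        .add(propertyValue);
  }

  // Hypothetical accessor: returns a copy of everything emitted so far.
  public synchronized Map<String, Map<String, List<Object>>> snapshot() {
    Map<String, Map<String, List<Object>>> copy = new LinkedHashMap<>();
    for (Map.Entry<String, Map<String, List<Object>>> byKey : data.entrySet()) {
      Map<String, List<Object>> props = new LinkedHashMap<>();
      for (Map.Entry<String, List<Object>> byName : byKey.getValue().entrySet()) {
        props.put(byName.getKey(), new ArrayList<>(byName.getValue()));
      }
      copy.put(byKey.getKey(), props);
    }
    return copy;
  }
}
```

The nested map has the same shape as the return type of PreviewManager.getData, which is why it is used here; a real implementation may store the data differently.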

2. Preview Context API

Code Block

public interface PreviewContext {
  /**
   * Returns true if preview mode is enabled, false otherwise.
   */
  boolean isPreviewEnabled();

  /**
   * Returns a PreviewEmitter, which can be used to emit objects collected by key and field name.
   */
  PreviewEmitter getPreviewEmitter(String emitterName);
}

3. How the application will get access to the PreviewEmitter

Code Block

language	java

public static class ETLMapper extends Mapper implements ProgramLifecycle<MapReduceTaskContext<Object, Object>> {
  private TransformRunner<Object, Object> transformRunner;  

  @Override
  public void initialize(MapReduceTaskContext<Object, Object> context) throws Exception {
    // get source, transform, sink ids from program properties
    Map<String, String> properties = context.getSpecification().getProperties();
    if (Boolean.valueOf(properties.get(Constants.STAGE_LOGGING_ENABLED))) {
      LogStageInjector.start();
    }
    transformRunner = new TransformRunner<>(context, mapperMetrics);

  }

  @Override
  public void map(Object key, Object value, Mapper.Context context) throws IOException, InterruptedException {
      transformRunner.transform(key, value);
  }
...
}
 
TrackedTransform.java
 
/**
 * A {@link Transformation} that delegates transform operations while emitting metrics
 * around how many records were input into the transform and output by it.
 *
 * @param <IN> Type of input object
 * @param <OUT> Type of output object
 */
public class TrackedTransform<IN, OUT> implements Transformation<IN, OUT>, Destroyable {

  private final PreviewContext previewContext;
  private final String stageName;

  public TrackedTransform(Transformation<IN, OUT> transform, StageMetrics metrics,
                          PreviewContext previewContext, String stageName,
                          @Nullable String metricInName, @Nullable String metricOutName) {
    ...
    this.previewContext = previewContext;
    this.stageName = stageName;
    ...
  }

  @Override
  public void transform(IN input, Emitter<OUT> emitter) throws Exception {
    if (metricInName != null) {
      metrics.count(metricInName, 1);
    }
    if (previewContext.isPreviewEnabled()) {
      // emit input data to preview
      previewContext.getPreviewEmitter().emit(stageName, "inputData", input);
    }
    transform.transform(input, new TrackedEmitter<>(emitter, metrics, metricOutName, stageName, previewContext));
  }
}


 
... TrackedEmitter.java


@Override
public void emit(T value) {
  delegate.emit(value);
  stageMetrics.count(emitMetricName, 1);
  if (previewContext.isPreviewEnabled()) {
    // emit output data to preview
    previewContext.getPreviewEmitter().emit(stageName, "outputData", value);
  }
}

@Override
public void emitError(InvalidEntry<T> value) {
  delegate.emitError(value);
  stageMetrics.count("records.error", 1);
  if (previewContext.isPreviewEnabled()) {
    // emit error data to preview
    previewContext.getPreviewEmitter().emit(stageName, "errorData", value);
  }
}

The PreviewContext implementation will use the previewId to create a preview emitter, which programs obtain by calling getPreviewEmitter. Programs can call isPreviewEnabled to check whether preview is enabled before emitting.
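
As a rough sketch of the two PreviewContext flavours this design mentions (a no-op one for normal runs and BasicPreviewContext for preview), the following shows how a previewId could be bound into the emitter. Everything here is an assumption made for illustration: the no-argument getPreviewEmitter() mirrors how TrackedEmitter calls it above, and the List<String> sink is a hypothetical stand-in for the real preview store.

```java
import java.util.List;

// Repeated from the design so the sketch is self-contained.
interface PreviewEmitter {
  void emit(String key, String propertyName, Object propertyValue);
}

// Assumed shape: no-argument getPreviewEmitter(), as used by TrackedEmitter above.
interface PreviewContext {
  boolean isPreviewEnabled();
  PreviewEmitter getPreviewEmitter();
}

// Used outside preview: reports disabled and silently discards anything emitted.
class NoOpPreviewContext implements PreviewContext {
  private static final PreviewEmitter DISCARD = (key, propertyName, propertyValue) -> { };

  @Override
  public boolean isPreviewEnabled() {
    return false;
  }

  @Override
  public PreviewEmitter getPreviewEmitter() {
    return DISCARD;
  }
}

// Used in preview: tags every emitted value with the preview id.
// The List<String> sink is a hypothetical placeholder for PreviewStore.
class BasicPreviewContext implements PreviewContext {
  private final String previewId;
  private final List<String> sink;

  BasicPreviewContext(String previewId, List<String> sink) {
    this.previewId = previewId;
    this.sink = sink;
  }

  @Override
  public boolean isPreviewEnabled() {
    return true;
  }

  @Override
  public PreviewEmitter getPreviewEmitter() {
    return (key, propertyName, propertyValue) ->
        sink.add(previewId + "/" + key + "/" + propertyName + "=" + propertyValue);
  }
}
```

Because the no-op context still returns a usable emitter, callers that forget the isPreviewEnabled check do not fail; the check only saves the cost of building the emitted value.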

4. How will CDAP get data from the preview?

Code Block

/**
 * Represents the state of the preview.
 */
public class PreviewStatus {
  public enum Status {
    RUNNING,
    COMPLETED,
    DEPLOY_FAILED,
    RUNTIME_FAILED
  }

  Status previewStatus;
  @Nullable
  String failureMessage;
}
 
// This is internal interface which will be used by REST handlers
// to retrieve the preview information.
public interface PreviewManager {

  /**
   * Get the status of the preview represented by previewId.
   */
  PreviewStatus getStatus(PreviewId previewId);

  /**
   * Get the data associated with the preview represented by previewId.
   */
  Map<String, Map<String, List<Object>>> getData(PreviewId previewId);

  /**
   * Get all metrics associated with the preview represented by previewId.
   */
  Collection<MetricTimeSeries> getMetrics(PreviewId previewId);

  /**
   * Get all logs associated with the preview represented by previewId.
   */
  List<LogEntry> getLogs(PreviewId previewId);
}
 
class PreviewId extends EntityId implements NamespaceId, ParentId<NamespaceId> {
	NamespaceId namespace;
    String preview;
}

...

Service	Standalone (Yes/No)	Preview (Yes/No)	Description
userInterfaceService	Yes	No	We don't want to run the UI separately.
trackerAppCreationService	Yes	No	The Tracker app is for exploring metadata; this should be done on real data (standalone), not on preview data.
router	Yes	No	We don't want to run another router; the existing router should be able to discover and route to the preview service.
streamService	Yes	No
exploreExecutorService	Yes	No	No requirement to explore data in preview.
exploreClient	Yes	No	No requirement to explore data in preview.
metadataService	Yes	No	The metadata service just starts a service with the Metadata and Lineage handlers, which users call to add user-level metadata. The CDAP system uses the Metadata Store to emit system-level metadata. Since we use the remote dataset framework for datasets in the user namespace, they should have metadata by default. For system-level datasets, we need to check whether that is enough to emit metadata in the system dataset or whether we need to share the metadata store.
serviceStore (set/get service instances)	Yes	No	The preview service runs as a single instance and works on a small input set, so it doesn't need many instances and we don't need a serviceStore to increase or decrease preview instances.
appFabricServer	Yes	No	AppFabric has many services that we don't need; PreviewServer can include just the required services.
previewServer	No	Yes	New addition.
datasetService	Yes	Yes	We have a new shared dataset framework and need the dataset service to handle dataset requests.
metricsQueryService	Yes	No	Can use MetricStore to query directly, as our requirement for metrics is straightforward: we will return all metrics emitted by a preview id.
txService	Yes	No	Can use standalone's tx service.
externalAuthenticationServer (if security enabled)	Yes	No
logAppenderInitializer	Yes	Yes
kafkaClient (if audit enabled)	Yes	No
zkClient (if audit enabled)	Yes	No
authorizerInstantiator (started by default)	Yes	No

AppFabricServer vs PreviewServer:

This is a subset of services started in app-fabric server.

Services	AppFabricServer	PreviewServer
notificationService	Yes	No
schedulerService	Yes	No
applicationLifecycleService	Yes	Yes
systemArtifactLoader	Yes	Yes
programRuntimeService	Yes	Yes
streamCoordinatorClient	Yes	Yes
programLifecycleService	Yes	Yes
pluginService	Yes	No (PluginService is needed only during config and not during preview)
handlerHttpService	Yes	Yes (but only with the preview handler). CDAP Router should route calls for preview here.
metricsCollectionService	Yes	Yes
defaultNamespaceEnsurer	Yes	No

PreviewDatasetFramework

Requirements:

1) A pipeline wants to read from a dataset source, write to a dataset sink, or use a lookup table in a transform. These datasets are in the CDAP Standalone space.

2) Pipeline run records, pipeline run metrics, program status, etc. are stored in system datasets in the Preview space.

3) Error dataset: It is not clear whether using an error dataset should cause an error dataset to be created in the CDAP Standalone space. I feel it might not need to be created in the Standalone space. In that case, if it is a dataset, it is the only user-level dataset that has to be created in the Preview space; alternatively, we could have an in-memory implementation for maintaining error records.

...

When CDAP standalone is started, it will start PreviewService (which has PreviewHttpHandler) along with the other required services. When CDAP shuts down, PreviewService will be terminated.
A no-op implementation of PreviewContext will be injected into the SDK, and BasicPreviewContext will be injected in preview.
The DatasetFramework and DiscoveryService from the SDK will be used by Preview:
1. DiscoveryService will be used to register the preview service so that it can be discovered by the router.
2. DatasetFramework will be used to access the datasets in the user namespace.
The user will submit a preview request through the preview REST endpoint.
A rule in cdap-router will forward preview requests to the PreviewHttpHandler.
PreviewHttpHandler will receive the request with the preview configuration and generate a unique preview id for it, which will be used as the app id.

When the app is configured during LocalArtifactLoaderStage, the application will replace the config object with the object updated for preview.

Code Block

public class DataPipelineApp extends AbstractApplication<ETLBatchConfig> {
  public void configure() {
    ETLBatchConfig config = getConfig();
    if (config.isPreviewMode()) {
      // This method is responsible for creating the new pipeline configuration,
      // for example replacing the source with a mock source.
      config = config.getConfigForPreview();
    }

    PipelineSpecGenerator<ETLBatchConfig, BatchPipelineSpec> specGenerator = new BatchPipelineSpecGenerator(...);
    BatchPipelineSpec spec = specGenerator.generateSpec(config);
    PipelinePlanner planner = new PipelinePlanner(...);
    PipelinePlan plan = planner.plan(spec);

    addWorkflow(new SmartWorkflow(spec, plan, getConfigurer(), config.getEngine()));
    scheduleWorkflow(...);
  }
}

There will be an inconsistency between the application's JSON configuration and the programs created for the application, because the programs are created after the app configuration has been updated with the preview config. - OPEN QUESTION

Preview application deployment pipeline

Stage Name	Regular Application	Preview Application
LocalArtifactLoaderStage	Yes	Yes
ApplicationVerificationStage	Yes	Yes
DeployDatasetModulesStage	Yes	No
CreateDatasetInstanceStage	Yes	No
CreateStreamsStage	Yes	No
DeleteProgramHandlerStage	Yes	No
ProgramGenerationStage	Yes	Yes
ApplicationRegistrationStage	Yes	Yes
CreateSchedulesStage	Yes	No
SystemMetadataWriterStage	Yes	No
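
The table above can be encoded as data so that the preview deployment pipeline is derived from the regular one rather than maintained separately. This is only a sketch under that assumption: the DeployStage enum and stagesFor method are hypothetical names, and only the stage names and Yes/No flags come from the table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical encoding of the table above; every stage runs for a regular
// application, and the flag records whether it also runs for a preview.
enum DeployStage {
  LOCAL_ARTIFACT_LOADER("LocalArtifactLoaderStage", true),
  APPLICATION_VERIFICATION("ApplicationVerificationStage", true),
  DEPLOY_DATASET_MODULES("DeployDatasetModulesStage", false),
  CREATE_DATASET_INSTANCE("CreateDatasetInstanceStage", false),
  CREATE_STREAMS("CreateStreamsStage", false),
  DELETE_PROGRAM_HANDLER("DeleteProgramHandlerStage", false),
  PROGRAM_GENERATION("ProgramGenerationStage", true),
  APPLICATION_REGISTRATION("ApplicationRegistrationStage", true),
  CREATE_SCHEDULES("CreateSchedulesStage", false),
  SYSTEM_METADATA_WRITER("SystemMetadataWriterStage", false);

  final String stageName;
  final boolean runsInPreview;

  DeployStage(String stageName, boolean runsInPreview) {
    this.stageName = stageName;
    this.runsInPreview = runsInPreview;
  }

  // Returns the stage names, in pipeline order, for the given deployment kind.
  static List<String> stagesFor(boolean preview) {
    List<String> names = new ArrayList<>();
    for (DeployStage stage : values()) {
      if (!preview || stage.runsInPreview) {
        names.add(stage.stageName);
      }
    }
    return names;
  }
}
```

Calling DeployStage.stagesFor(true) yields the four stages marked Yes in the Preview Application column, in the same order as the table.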

If there is a failure in the deploy pipeline, PreviewHttpHandler will return a 500 status code with the deploy failure reason.
Once deployment is successful, the preview handler will start the program and return the preview id as the response. Currently we will start the SmartWorkflow in DataPipelineApp; however, the preview configuration can be extended to accept the program type and program name to start.
During runtime, when a program emits preview data using PreviewContext, its implementation (BasicPreviewContext) will write that data to PreviewStore.
PreviewStore can store data in memory. It cannot be a Table dataset because we want the intermediate data even if the transaction failed. It also cannot be a FileSet dataset, because if a MapReduce program fails, it cleans up its files. (Potentially we can use a non-transactional table, like Metrics.)
Logs: TBD
Metrics for preview will be stored in the Metric dataset created for preview.
Deletion of the preview data: We can maintain an LRU cache of the preview data for different preview ids. In 3.5 we can restrict the LRU cache size to 1.
Get preview data: PreviewManager will be used by PreviewHttpHandler to query for preview data from the preview store, logs and metrics.
PreviewStore: PreviewStore will be responsible for storing the preview data. The implementation of PreviewStore will store the data in memory for 3.5. In the future we can consider storing it in a LevelDB dataset.
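
The combination of an in-memory PreviewStore and an LRU cache bounded by preview id could look roughly like the sketch below. It assumes plain String preview ids and a hypothetical InMemoryPreviewStore class with put/get methods; none of these names come from the design. Eviction relies on the standard LinkedHashMap access-order mode and its removeEldestEntry hook.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory store: previewId -> key -> propertyName -> values.
class InMemoryPreviewStore {
  private final Map<String, Map<String, Map<String, List<Object>>>> byPreview;

  InMemoryPreviewStore(final int maxPreviews) {
    // accessOrder = true makes iteration order least-recently-used first,
    // so the "eldest" entry is the one to evict.
    this.byPreview = new LinkedHashMap<String, Map<String, Map<String, List<Object>>>>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Map<String, Map<String, List<Object>>>> eldest) {
        return size() > maxPreviews;
      }
    };
  }

  // Appends a value; creating an entry for a new preview id may evict an older preview.
  synchronized void put(String previewId, String key, String propertyName, Object value) {
    byPreview.computeIfAbsent(previewId, id -> new LinkedHashMap<>())
        .computeIfAbsent(key, k -> new LinkedHashMap<>())
        .computeIfAbsent(propertyName, p -> new ArrayList<>())
        .add(value);
  }

  // Returns the data for a preview, or null if it was never stored or has been evicted.
  synchronized Map<String, Map<String, List<Object>>> get(String previewId) {
    return byPreview.get(previewId);
  }
}
```

With maxPreviews set to 1, as proposed for 3.5, starting a second preview drops the data of the first. Data is written without any transaction, so intermediate records survive a failed run, which is the property the design asks for.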

Implementation Plan:

1) Preview Service (Shankar)

2) Preview REST API (Sagar)

3) Storage (Sagar)

4) Non-static LevelDBService - Possible Test Case Failures (Done!!!)

5) Java APIs (Sagar)

6) Hydrator changes: App with MapReduce and Spark (Shankar)

---

7) Logs and Metrics

8) ETL Config update for preview

9) Mock source plugin
