JAVA API :
interface PreviewContext { // all records from the stage identified by identifier <T> Iterator<T> getOutputRecords(); // all metrics from the stage identified by identifier Collection<MetricTimeSeries> getMetrics(); // all logs from the stage identified by identifier String getLogs(); }
interface PreviewEmitter { void emit(T record); }
class DefaultPreview implements PreviewContext, PreviewEmitter { // implementation for methods, emit, getEmittedRecords, getLogs and getMetrics. }
class PreviewHelper { Map<PreviewId, DefaultPreview> previewIdToDefaultPreview; PreviewEmitter getPreviewEmitter(PreviewId identifier) { // create DefaultPreview for identifier in previewIdToDefaultPreview and return if doesn't exist already, else get and return. } PreviewContext getPreviewContext(PreviewId identifier) { // create DefaultPreview for identifier in previewIdToDefaultPreview and return if doesn't exist already, else get and return. } }
Implementation:
SDK:
Preview Execution Isolation:
Requirement:
- We want the program runs we execute, datasets created during preview for preview purpose, logs and metrics emitted during preview to be isolated from the regular Standalone execution which is used to publish and run the pipeline.
- In Preview, pipeline could have lookup datasets in a transform which reads from the datasets in Standalone. so we want a way to share datasets in preview with datasets in standalone.
- In Preview, we want to skip writing meta data and lineage information as they are unnecessary.
Preview Injector vs Standalone Injector:
Service | Standalone (Yes/No) | Preview (Yes/No) |
---|---|---|
userInterfaceService | Yes | No |
trackerAppCreationService | Yes | No |
router | Yes | No |
streamService | Yes | Yes |
exploreExecutorService | Yes | No |
exploreClient | Yes | No |
metadataService | Yes | No |
serviceStore (set/get service instances) | Yes | No |
appFabricServer | Yes | No |
previewServer | No | Yes |
datasetService | Yes | Yes |
metricsQueryService | Yes | No (Can call MetricStore query) |
txService | Yes | Yes |
externalAuthenticationServer (if security enabled) | Yes | Yes |
logAppenderInitializer | Yes | Yes |
kafkaClient(if audit enabled) | Yes | No |
zkClient (if audit enabled) | Yes | No |
authorizerInstantiator (started by default) | Yes | Yes? |
AppFabricServer vs PreviewServer :
This is a subset of services started in app-fabric server.
Services | AppFabricServer | PreviewServer |
---|---|---|
notificationService | Yes | No |
schedulerService | Yes | No |
applicationLifecycleService | Yes | Yes |
systemArtifactLoader | Yes | Yes |
programRuntimeService | Yes | Yes |
streamCoordinatorClient | Yes | Yes |
programLifecycleService | Yes | Yes |
pluginService | Yes | Yes |
httpService | Yes | Yes (but only with preview handler). |
PreviewDatasetFramework
Requirements:
1) Pipeline want's to read from a dataset source (or) pipeline wants to write to a dataset sink (or) transform uses a lookup table. These datasets are in CDAP Standalone space.
2) Pipeline run's records, Pipeline run metrics, program status, etc are stored in System datasets in Preview space.
3) Error dataset : Its not clear if using error dataset should cause creating an error dataset in CDAP standalone space. I feel it might not be required to created in Standalone space. In which case if its a dataset then it's the only user level dataset that has to be created in Preview space, we can say we would have an in-memory implementation for maintaining error records.
Assumptions :
1) All Datasets in System Namespace will be using the "LocalDatasetFramework"
2) All Datasets in User's Namespaces will be using the "RemoteDatasetFramework"
... snippet @Nullable @Override public <T extends Dataset> T getDataset(Id.DatasetInstance datasetInstanceId, @Nullable Map<String, String> arguments, @Nullable ClassLoader classLoader) throws DatasetManagementException, IOException { if (datasetInstanceId.getNamespace().equals(Id.Namespace.SYSTEM)) { return localDatasetFramework.getDataset(datasetInstanceId, arguments, classLoader); } else { return remoteDatasetFramework.getDataset(datasetInstanceId, arguments, classLoader); } }
Adapting to Cluster
Having a LocalDatasetFramework for system namespace would make it useful for adapting to cluster, where the container's local directory will be used to store the system datasets and we can use the RemoteDatasetFramework of CDAP master for datasets in other namespace.