Spark2 and Interactive Spark
Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
This is the design doc for Spark2 support and for interactive Spark programs.
Goals
We introduced complete Spark integration in CDAP 3.4, compatible with Spark 1.2.x - 1.6.x. Since then, Spark has evolved significantly, with new APIs (DataFrame / Dataset / SQL) and the new Spark2 runtime. Moreover, the current CDAP + Spark integration requires Spark developers to first create a CDAP application, add the Spark program to it, and package and deploy the application before they can run a single line of Spark code in CDAP. This is alien to most Spark developers, who expect a fast, interactive way to work with Spark: type a couple of lines of Spark code, run them, and get the result right away.
Here are the goals:
- Have CDAP compatible with Spark 1.x and 2.x
  - Both API and runtime
- Provide a fast and interactive experience for Spark developers
  - Pipeline plugin developers
  - Spark programs and Spark in workflows
User Stories
Spark2 User Stories
- As a cluster administrator, I want CDAP to automatically use whatever Spark version is installed on my cluster
- As an application developer, I want to be able to upgrade my Spark program from Spark1 to Spark2 by only changing pom dependencies
- As an application developer, I want to be able to control which Spark version is used by CDAP standalone through command-line args or configuration file edits
- As an application developer, I want a clear error message if I try to deploy or run an application compiled with the wrong Spark version
- As an application developer, I want to use the newer Spark2 APIs in my applications
- As an application developer, I want to be able to control which Spark version is used in my unit tests
- As a pipeline developer, I want to build pipelines without caring which version of Spark is on the cluster
- As a pipeline developer, I want to be able to export a pipeline built on my SDK and import it onto the cluster without caring about Spark versions
Interactive Spark User Stories
- As a pipeline developer, I want to add custom Spark code that transforms data in a realtime data pipeline (CDAP 4.2)
- As a pipeline developer, I want to add custom Spark code that transforms data in a batch data pipeline (CDAP 4.2)
- As a pipeline developer, I want to add custom Spark code that runs in a separate workflow node (CDAP Release TBD)
- As a Spark developer, I want a notebook-style experience for writing Spark programs interactively, similar to spark-shell (CDAP Release TBD)
Design
Spark2
Spark2 is built with Scala 2.11 by default, and according to the Spark documentation, as of Spark 2.1.0 Scala 2.10 is deprecated and may be removed in the next minor version of Spark. Spark1 is built with Scala 2.10 by default. Since Scala 2.10 and 2.11 are not binary compatible, CDAP components that use Spark must have a Spark1 version and a Spark2 version, with the correct version used at runtime.
Assumptions
- Users will not use a mixed setup, with some programs running Spark1 and some running Spark2
- Users will not use a mixed setup, with some programs compiled against Scala 2.11 and some against 2.10
- Anyone using Spark2 will be using Scala 2.11
- Anyone using Spark1 will be using Scala 2.10
API Design
To support Spark2, application developers can use one of two separate Spark API modules, depending on the Spark version they will run with:
- cdap-api-spark - for Spark1, Scala 2.10. Same as in previous CDAP releases
- cdap-api-spark2_2.11 - for Spark2, Scala 2.11. New in CDAP 4.2.0
Each API module contains two Java classes and two Scala classes:
- JavaSparkExecutionContext
- JavaSparkMain
- SparkExecutionContext
- SparkMain
These classes will have the same package and the same name in both API modules. To begin with, they will look exactly the same; this allows application developers to upgrade by simply changing pom dependencies. In the future, we will add Spark2-specific APIs to the classes in cdap-api-spark2_2.11 and deprecate the APIs that use Spark1 classes.
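As an illustration of that upgrade path, below is a minimal, hypothetical sketch of a program written against these APIs. The class name and body are not from this design, and it assumes the co.cask.cdap.api.spark package and the JavaSparkMain.run(JavaSparkExecutionContext) signature from CDAP 4.x; the same source would compile against either cdap-api-spark or cdap-api-spark2_2.11, with only the pom dependency changing:

```java
// Hypothetical sketch: compiles unchanged against cdap-api-spark (Spark1 / Scala 2.10)
// and cdap-api-spark2_2.11 (Spark2 / Scala 2.11), since the API classes share package and name.
import co.cask.cdap.api.spark.JavaSparkExecutionContext;
import co.cask.cdap.api.spark.JavaSparkMain;
import java.util.Arrays;
import org.apache.spark.api.java.JavaSparkContext;

public class SimpleSparkMain implements JavaSparkMain {

  @Override
  public void run(JavaSparkExecutionContext sec) throws Exception {
    // The Spark version actually used is determined by the runtime classpath, not the source.
    JavaSparkContext jsc = new JavaSparkContext();
    long count = jsc.parallelize(Arrays.asList(1, 2, 3, 4)).count();
    // A real program would read from and write to datasets through 'sec' instead.
  }
}
```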
Runtime Design
In distributed mode, we can take a similar approach to how HBase versions are handled. CDAP packages will come with two separate Spark extensions. When starting CDAP, we can detect the Spark version and set a SPARK_VERSION environment variable. Based on this value, the correct extension will be used to launch Spark programs.
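As a rough sketch of that detection step (the class and method names below are illustrative and not the actual CDAP implementation), the startup script would export SPARK_VERSION and the launcher would map it to the matching extension:

```java
// Illustrative sketch only -- not actual CDAP code.
// Maps the SPARK_VERSION environment variable set at startup to a Spark compat extension.
public final class SparkVersionDetector {

  public static String detectSparkCompat() {
    String sparkVersion = System.getenv("SPARK_VERSION"); // e.g. "1.6.1" or "2.1.0"
    if (sparkVersion == null || sparkVersion.isEmpty()) {
      throw new IllegalStateException("SPARK_VERSION is not set; cannot choose a Spark runtime extension");
    }
    // Pick the extension (and its Scala build) based on the major version.
    return sparkVersion.startsWith("2.") ? "spark2_2.11" : "spark1_2.10";
  }
}
```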
In standalone mode, we can add a configuration setting in cdap-site.xml that will control the Spark version:
    <property>
      <name>spark.compat.version</name>
      <value>2</value>
      <description>Which Spark compat version to use</description>
    </property>
Another option would be to add a flag to the 'cdap sdk' command to select the Spark version, but that approach would make it easy for the user to sometimes start CDAP with Spark1 and sometimes with Spark2, which would result in confusing behavior.
Unit Test Design
Because cdap-unit-test has a dependency on cdap-spark-core in order to launch Spark programs, we will need to have separate test modules as well. We could do this by introducing a new cdap-unit-test-spark2_2.11 module that also has a TestBase of the same package and name. This would allow people to move their unit tests from Spark1 to Spark2 by just updating their pom.
Alternatively, we could investigate whether a mirrored cdap-unit-test module can be avoided entirely by simply including the correct cdap-spark-core module as a dependency.
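Either way, the intent is that existing test code needs no source changes. For illustration, here is a hedged sketch of such a test; the application class and program name are placeholders, and it assumes the co.cask.cdap.test TestBase and ApplicationManager APIs from CDAP 4.x:

```java
// Hypothetical sketch: this test stays the same whether the pom depends on cdap-unit-test (Spark1)
// or the proposed cdap-unit-test-spark2_2.11 module, since TestBase keeps its package and name.
import co.cask.cdap.test.ApplicationManager;
import co.cask.cdap.test.TestBase;
import org.junit.Test;

public class SparkAppTest extends TestBase {

  @Test
  public void testSparkProgram() throws Exception {
    // 'SparkApp' is a placeholder application class for this sketch.
    ApplicationManager appManager = deployApplication(SparkApp.class);
    // Start the Spark program; waiting for completion and assertions are elided here.
    appManager.getSparkManager("SimpleSparkMain").start();
  }
}
```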
Pipeline Design
The pipeline applications will also require separate jars for each Spark version. Since we can't deploy both the Spark1 and Spark2 versions of the artifact, we will need to change the way we package the artifacts:
today:

    /opt/cdap/master/artifacts/cdap-data-pipeline-4.1.0.jar

new:

    /opt/cdap/master/artifacts/spark1_2.10/cdap-data-pipeline-4.2.0.jar
    /opt/cdap/master/artifacts/spark1_2.10/spark-plugins-4.2.0.jar
    /opt/cdap/master/artifacts/spark1_2.10/spark-plugins-4.2.0.json
    /opt/cdap/master/artifacts/spark2_2.11/cdap-data-pipeline-4.2.0-spark2_2.11.jar
    /opt/cdap/master/artifacts/spark2_2.11/spark-plugins-4.2.0-spark2_2.11.jar
    /opt/cdap/master/artifacts/spark2_2.11/spark-plugins-4.2.0-spark2_2.11.json
The artifact directory config setting will be changed to use multiple directories, substituting in the Spark compat version:
    <property>
      <name>app.artifact.dir</name>
      <value>/opt/cdap/master/artifacts;/opt/cdap/master/artifacts/${spark.compat.version}</value>
      <description>
        Semicolon-separated list of local directories scanned for system artifacts
        to add to the artifact repository
      </description>
    </property>
However, this makes things more complicated in the pipeline config and in the UI if we want to support exporting a pipeline from an SDK that uses Spark1 and importing it onto a cluster that uses Spark2. If we want to support that use case, we can use the new version range capability to always specify a range instead of an exact version:
{ "artifact": { "scope": "system", "name": "cdap-data-pipeline", "version": "[4.2.0, 4.2.1)" }, ... }
Â
Market Design
Every plugin in the market that uses Spark will need to include a version for Spark1 and a version for Spark2.
Every pipeline in the market will need to use a version range instead of an exact version in order to allow for different Spark runtime environments.
Interactive Spark
Approach
Spark2 Approach #1
API
We will introduce two new Spark API modules:
- cdap-api-spark-base
- cdap-api-spark2_2.11
cdap-api-spark-base will contain a base class for JavaSparkExecutionContext and SparkExecutionContext. In order to ensure that the base classes are compiled with the right Scala version, we will use the build-helper-maven-plugin to add cdap-api-spark-base as a source directory in both cdap-api-spark and cdap-api-spark2_2.11.
Though this works in Maven, it confuses IntelliJ, which does not expect two modules to share the same sources and will report compilation errors. To keep IntelliJ working, we will add a new 'cdap-dev' profile, used in cdap-api-spark and cdap-api-spark2_2.11, that adds cdap-api-spark-base as a compile dependency.
Runtime
A similar change will be required for cdap-spark-core, with modules for base, Spark1, and Spark2:
- cdap-spark-core-base
- cdap-spark-core
- cdap-spark-core2_2.11
It will use the same build-helper-maven-plugin approach to ensure that the base classes are compiled with the right Scala version.
A change will be made to ProgramRuntimeProvider to support an annotation property for the Spark version. The SparkProgramRuntimeProvider for Spark1 will be annotated slightly differently from the one for Spark2:
    @ProgramRuntimeProvider.SupportedProgramType(value = ProgramType.SPARK, sparkcompat = SparkCompat.SPARK1)
    public class SparkProgramRuntimeProvider implements ProgramRuntimeProvider {
      ...
    }

    @ProgramRuntimeProvider.SupportedProgramType(value = ProgramType.SPARK, sparkcompat = SparkCompat.SPARK2)
    public class Spark2ProgramRuntimeProvider implements ProgramRuntimeProvider {
      ...
    }
ProgramRuntimeProviderLoader will then be changed to examine the sparkcompat annotation value and compare it to the runtime Spark version, so that only the correct Spark runtime provider is loaded. The runtime Spark version will be determined by first checking 'spark.compat.version' from cdap-site.xml (for CDAP standalone and unit tests, assuming unit tests go through this code path). If it is not defined, the 'cdap.spark.compat.version' system property will be checked (for CDAP distributed).
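A simplified, self-contained sketch of that resolution logic is shown below; the class, enum, and method names are assumptions for illustration, not the actual ProgramRuntimeProviderLoader code:

```java
// Simplified sketch (not the actual CDAP implementation) of resolving the Spark compat version
// that the loader would compare against each provider's sparkcompat annotation value.
import java.util.Map;

public final class SparkCompatResolution {

  /** Mirrors the SparkCompat values used in the provider annotations. */
  enum SparkCompat { SPARK1, SPARK2 }

  /**
   * Check 'spark.compat.version' from cdap-site.xml first (standalone / unit tests),
   * then fall back to the 'cdap.spark.compat.version' system property (distributed mode).
   */
  static SparkCompat resolve(Map<String, String> cdapSiteConf) {
    String compat = cdapSiteConf.get("spark.compat.version");
    if (compat == null) {
      compat = System.getProperty("cdap.spark.compat.version");
    }
    if (compat == null) {
      throw new IllegalStateException("No Spark compat version configured");
    }
    // Only the provider whose annotation matches this value would be loaded.
    return "2".equals(compat.trim()) ? SparkCompat.SPARK2 : SparkCompat.SPARK1;
  }
}
```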
There is also a small change required when determining which jars to localize in CDAP distributed mode: in Spark1 there is just a single spark-assembly jar, but in Spark2 there are multiple jars.
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success; 404 - When application is not available; 500 - Any internal errors | |
Deprecated REST API
Path | Method | Description |
---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application |
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
Security Impact
What is the impact on Authorization, and how does the design address it?
Impact on Infrastructure Outages
System behavior (if applicable, document the impact of failures in downstream components such as YARN and HBase) and how the design handles these aspects.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
 | | |
 | | |
 | | |
 | | |
Releases
Release X.Y.Z
Release X.Y.Z
Related Work
- Work #1
- Work #2
- Work #3