Spark2 and Interactive Spark
Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
This is the design doc for Spark2 support and for interactive Spark programs.
Goals
We introduced complete Spark integration in CDAP 3.4, compatible with Spark 1.2.x - 1.6.x. Since then, Spark has evolved significantly, with new APIs (DataFrame / Dataset / SQL) and the new Spark2 runtime. Moreover, the current CDAP + Spark integration requires Spark developers to first create a CDAP application, add the Spark program to it, and package and deploy the application before they can run a single line of Spark code in CDAP. This is alien to most Spark developers, who expect a fast, interactive way to work with Spark: type a couple of lines of Spark code, run them, and get the result right away.
Here are the goals:
- Have CDAP compatible with Spark 1.x and 2.x
  - Both API and runtime
- Provide a fast and interactive experience for Spark developers
  - Pipeline plugin developers
  - Spark programs and Spark in workflows
User Stories
Spark2 User Stories
- As a cluster administrator, I want CDAP to automatically use whatever Spark version is installed on my cluster
- As an application developer, I want to be able to upgrade my Spark program from Spark1 to Spark2 by only changing pom dependencies
- As an application developer, I want to be able to control which Spark version is used by CDAP standalone through command-line args or configuration file edits
- As an application developer, I want a clear error message if I try to deploy or run an application compiled with the wrong Spark version
- As an application developer, I want to use the newer Spark2 APIs in my applications
- As an application developer, I want to be able to control which Spark version is used in my unit tests
- As a pipeline developer, I want to build pipelines without caring which version of Spark is on the cluster
- As a pipeline developer, I want to be able to export a pipeline built on my SDK and import it onto the cluster without caring about Spark versions
Interactive Spark User Stories
- As a pipeline developer, I want to add custom Spark code that transforms data in a realtime data pipeline (CDAP 4.2)
- As a pipeline developer, I want to add custom Spark code that transforms data in a batch data pipeline (CDAP 4.2)
- As a pipeline developer, I want to add custom Spark code that runs in a separate workflow node (CDAP Release TBD)
- As a Spark developer, I want a notebook-style experience for writing Spark programs interactively, similar to spark-shell (CDAP Release TBD)
Design
Spark2
Spark2 is built with Scala 2.11 by default, and according to the Spark documentation, as of Spark 2.1.0 Scala 2.10 is deprecated and may be removed in the next minor version of Spark. Spark1 is built with Scala 2.10 by default. Since Scala 2.10 and 2.11 are not binary compatible, CDAP components that use Spark must have a Spark1 version and a Spark2 version, with the correct version used at runtime.
Assumptions
- Users will not use a mixed setup, with some programs running Spark1 and some running Spark2
- Users will not use a mixed setup, with some programs compiled against Scala 2.11 and some against 2.10
- Anyone using Spark2 will be using Scala 2.11
- Anyone using Spark1 will be using Scala 2.10
API Design
To support Spark2, application developers can use one of two separate Spark API modules, depending on the Spark version they will run with:
- cdap-api-spark - for Spark1, Scala 2.10. Same as in previous CDAP releases
- cdap-api-spark2_2.11 - for Spark2, Scala 2.11. New in CDAP 4.2.0
Each API module contains two Java classes and two Scala classes:
- JavaSparkExecutionContext
- JavaSparkMain
- SparkExecutionContext
- SparkMain
These classes will have the same package and the same name in both API modules. To begin with, they will look exactly the same; this allows application developers to upgrade by simply changing pom dependencies. In the future, we will add Spark2-specific APIs to the classes in cdap-api-spark2_2.11 and deprecate the APIs that use Spark1 classes.
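As an illustration of that upgrade path, below is a minimal, hypothetical sketch of a program written against these APIs. The class name and body are not from this design, and it assumes the co.cask.cdap.api.spark package and the JavaSparkMain.run(JavaSparkExecutionContext) signature from CDAP 4.x; the same source would compile against either cdap-api-spark or cdap-api-spark2_2.11, with only the pom dependency changing:

```java
// Hypothetical sketch: compiles unchanged against cdap-api-spark (Spark1 / Scala 2.10)
// and cdap-api-spark2_2.11 (Spark2 / Scala 2.11), since the API classes share package and name.
import co.cask.cdap.api.spark.JavaSparkExecutionContext;
import co.cask.cdap.api.spark.JavaSparkMain;
import java.util.Arrays;
import org.apache.spark.api.java.JavaSparkContext;

public class SimpleSparkMain implements JavaSparkMain {

  @Override
  public void run(JavaSparkExecutionContext sec) throws Exception {
    // The Spark version actually used is determined by the runtime classpath, not the source.
    JavaSparkContext jsc = new JavaSparkContext();
    long count = jsc.parallelize(Arrays.asList(1, 2, 3, 4)).count();
    // A real program would read from and write to datasets through 'sec' instead.
  }
}
```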
Runtime Design
In distributed mode, we can take a similar approach to how HBase versions are handled. CDAP packages will come with two separate Spark extensions. When starting CDAP, we can detect the Spark version and set a SPARK_VERSION environment variable. Based on this value, the correct extension will be used to launch Spark programs.
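As a rough sketch of that detection step (the class and method names below are illustrative and not the actual CDAP implementation), the startup script would export SPARK_VERSION and the launcher would map it to the matching extension:

```java
// Illustrative sketch only -- not actual CDAP code.
// Maps the SPARK_VERSION environment variable set at startup to a Spark compat extension.
public final class SparkVersionDetector {

  public static String detectSparkCompat() {
    String sparkVersion = System.getenv("SPARK_VERSION"); // e.g. "1.6.1" or "2.1.0"
    if (sparkVersion == null || sparkVersion.isEmpty()) {
      throw new IllegalStateException("SPARK_VERSION is not set; cannot choose a Spark runtime extension");
    }
    // Pick the extension (and its Scala build) based on the major version.
    return sparkVersion.startsWith("2.") ? "spark2_2.11" : "spark1_2.10";
  }
}
```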
In standalone mode, we can add a configuration setting in cdap-site.xml that will control the Spark version:
    <property>
      <name>spark.compat.version</name>
      <value>2</value>
      <description>Which Spark compat version to use</description>
    </property>
Another option would be to add a flag to the 'cdap sdk' command to select the Spark version, but that approach would make it easy for the user to sometimes start CDAP with Spark1 and sometimes with Spark2, which would result in confusing behavior.
Unit Test Design
Because cdap-unit-test has a dependency on cdap-spark-core in order to launch Spark programs, we will need to have separate test modules as well. We could do this by introducing a new cdap-unit-test-spark2_2.11 module that also has a TestBase of the same package and name. This would allow people to move their unit tests from Spark1 to Spark2 by just updating their pom.
Alternatively, we could investigate whether a mirrored cdap-unit-test module can be avoided entirely by simply including the correct cdap-spark-core module as a dependency.
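Either way, the intent is that existing test code needs no source changes. For illustration, here is a hedged sketch of such a test; the application class and program name are placeholders, and it assumes the co.cask.cdap.test TestBase and ApplicationManager APIs from CDAP 4.x:

```java
// Hypothetical sketch: this test stays the same whether the pom depends on cdap-unit-test (Spark1)
// or the proposed cdap-unit-test-spark2_2.11 module, since TestBase keeps its package and name.
import co.cask.cdap.test.ApplicationManager;
import co.cask.cdap.test.TestBase;
import org.junit.Test;

public class SparkAppTest extends TestBase {

  @Test
  public void testSparkProgram() throws Exception {
    // 'SparkApp' is a placeholder application class for this sketch.
    ApplicationManager appManager = deployApplication(SparkApp.class);
    // Start the Spark program; waiting for completion and assertions are elided here.
    appManager.getSparkManager("SimpleSparkMain").start();
  }
}
```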
Pipeline Design
The pipeline applications will also require separate jars for each Spark version. Since we can't deploy both the Spark1 and Spark2 versions of the artifact, we will need to change the way we package the artifacts:
today:

    /opt/cdap/master/artifacts/cdap-data-pipeline-4.1.0.jar

new:

    /opt/cdap/master/artifacts/spark1_2.10/cdap-data-pipeline-4.2.0.jar
    /opt/cdap/master/artifacts/spark1_2.10/spark-plugins-4.2.0.jar
    /opt/cdap/master/artifacts/spark1_2.10/spark-plugins-4.2.0.json
    /opt/cdap/master/artifacts/spark2_2.11/cdap-data-pipeline-4.2.0-spark2_2.11.jar
    /opt/cdap/master/artifacts/spark2_2.11/spark-plugins-4.2.0-spark2_2.11.jar
    /opt/cdap/master/artifacts/spark2_2.11/spark-plugins-4.2.0-spark2_2.11.json
The artifact directory config setting will be changed to use multiple directories, substituting in the Spark compat version:
    <property>
      <name>app.artifact.dir</name>
      <value>/opt/cdap/master/artifacts;/opt/cdap/master/artifacts/${spark.compat.version}</value>
      <description>
        Semicolon-separated list of local directories scanned for system artifacts
        to add to the artifact repository
      </description>
    </property>
However, this makes things more complicated in the pipeline config and in the UI if we want to support exporting a pipeline from an SDK that uses Spark1 and importing it onto a cluster that uses Spark2. If we want to support that use case, we can use the new version range capability to always specify a range instead of an exact version:
{ "artifact": { "scope": "system", "name": "cdap-data-pipeline", "version": "[4.2.0, 4.2.1)" }, ... }
Â
Market Design
Every plugin in the market that uses Spark will need to include a version for Spark1 and a version for Spark2.
Every pipeline in the market will need to use a version range instead of an exact version in order to allow for different Spark runtime environments.
Interactive Spark
Approach
Spark2 Approach #1
API
We will introduce two new Spark API modules:
- cdap-api-spark-base
- cdap-api-spark2_2.11
cdap-api-spark-base will contain a base class for JavaSparkExecutionContext and SparkExecutionContext. In order to ensure that the base classes are compiled with the right Scala version, we will use the build-helper-maven-plugin to add cdap-api-spark-base as a source directory in both cdap-api-spark and cdap-api-spark2_2.11.
Though this works in Maven, it confuses IntelliJ, which does not expect two modules to share the same sources and will report compilation errors. To keep IntelliJ working, we will add a new 'cdap-dev' profile, used in cdap-api-spark and cdap-api-spark2_2.11, that adds cdap-api-spark-base as a compile dependency.
Runtime
A similar change will be required for cdap-spark-core, with modules for base, Spark1, and Spark2:
- cdap-spark-core-base
- cdap-spark-core
- cdap-spark-core2_2.11
It will use the same build-helper-maven-plugin approach to ensure that the base classes are compiled with the right Scala version.
A change will be made to ProgramRuntimeProvider to support an annotation property for the Spark version. The SparkProgramRuntimeProvider for Spark1 will be annotated slightly differently from the one for Spark2:
    @ProgramRuntimeProvider.SupportedProgramType(value = ProgramType.SPARK, sparkcompat = SparkCompat.SPARK1)
    public class SparkProgramRuntimeProvider implements ProgramRuntimeProvider {
      ...
    }

    @ProgramRuntimeProvider.SupportedProgramType(value = ProgramType.SPARK, sparkcompat = SparkCompat.SPARK2)
    public class Spark2ProgramRuntimeProvider implements ProgramRuntimeProvider {
      ...
    }
ProgramRuntimeProviderLoader will then be changed to examine the sparkcompat annotation value and compare it to the runtime Spark version, so that only the correct Spark runtime provider is loaded. The runtime Spark version will be determined by first checking 'spark.compat.version' from cdap-site.xml (for CDAP standalone and unit tests, assuming unit tests go through this code path). If it is not defined, the 'cdap.spark.compat.version' system property will be checked (for CDAP distributed).
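A simplified, self-contained sketch of that resolution logic is shown below; the class, enum, and method names are assumptions for illustration, not the actual ProgramRuntimeProviderLoader code:

```java
// Simplified sketch (not the actual CDAP implementation) of resolving the Spark compat version
// that the loader would compare against each provider's sparkcompat annotation value.
import java.util.Map;

public final class SparkCompatResolution {

  /** Mirrors the SparkCompat values used in the provider annotations. */
  enum SparkCompat { SPARK1, SPARK2 }

  /**
   * Check 'spark.compat.version' from cdap-site.xml first (standalone / unit tests),
   * then fall back to the 'cdap.spark.compat.version' system property (distributed mode).
   */
  static SparkCompat resolve(Map<String, String> cdapSiteConf) {
    String compat = cdapSiteConf.get("spark.compat.version");
    if (compat == null) {
      compat = System.getProperty("cdap.spark.compat.version");
    }
    if (compat == null) {
      throw new IllegalStateException("No Spark compat version configured");
    }
    // Only the provider whose annotation matches this value would be loaded.
    return "2".equals(compat.trim()) ? SparkCompat.SPARK2 : SparkCompat.SPARK1;
  }
}
```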
There is also a small change required when determining which jars to localize in CDAP distributed mode: in Spark1 there is just a single spark-assembly jar, but in Spark2 there are multiple jars.
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success; 404 - When application is not available; 500 - Any internal errors | |
Deprecated REST API
Path | Method | Description |
---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application |
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
Security Impact
What is the impact on Authorization, and how does the design address it?
Impact on Infrastructure Outages
System behavior (if applicable, document the impact of failures in downstream components such as YARN and HBase) and how the design handles these aspects.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
 | | |
 | | |
 | | |
 | | |
Releases
Release X.Y.Z
Release X.Y.Z
Related Work
- Work #1
- Work #2
- Work #3