
Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction 

This is the design doc for Spark2 support and interactive Spark programs.

Goals

We introduced complete Spark integration in CDAP 3.4, compatible with Spark 1.2.x - 1.6.x. Since then, Spark has evolved significantly, with new APIs (DataFrame / Dataset / SQL) and the new Spark2 runtime. Moreover, the current CDAP + Spark integration requires Spark developers to first create a CDAP application, add the Spark program to the application, then package and deploy the application before being able to run a single line of Spark code in CDAP. This is alien to most Spark developers, who expect a fast, interactive way to work with Spark in which they can type a couple of lines of Spark code, run them, and get the result right away.

Here are the goals:

  • Make CDAP compatible with both Spark 1.x and 2.x
    • Both API and runtime
  • Provide a fast, interactive experience for Spark developers
    • Pipeline plugin developers
    • Spark programs, both standalone and within workflows

User Stories 

Spark2 User Stories

  1. As a cluster administrator, I want CDAP to automatically use whatever Spark version is installed on my cluster.
  2. As an application developer, I want to be able to upgrade my Spark program from Spark1 to Spark2 by only changing pom dependencies.
  3. As an application developer, I want to be able to control which Spark version is used by CDAP standalone through command-line arguments or configuration file edits.
  4. As an application developer, I want a clear error message if I try to deploy or run an application compiled with the wrong Spark version.
  5. As an application developer, I want to use the newer Spark2 APIs in my applications.
  6. As an application developer, I want to be able to control which Spark version is used in my unit tests.
  7. As a pipeline developer, I want to build pipelines without caring which version of Spark is on the cluster.
  8. As a pipeline developer, I want to be able to export a pipeline built on my SDK and import it onto the cluster without caring about Spark versions.

Interactive Spark User Stories

  • User Story #2
  • User Story #3

Design

Spark2


Spark2 is built using Scala 2.11 by default. According to the Spark documentation, as of Spark 2.1.0, Scala 2.10 support is deprecated and may be removed in the next minor version of Spark. Spark1 is built using Scala 2.10 by default. Since Scala versions are not binary compatible with each other, CDAP components that use Spark must have a Spark1 version and a Spark2 version, with the correct version used at runtime.

Assumptions

  • Users will not use a mixed setup, with some programs running Spark1 and some running Spark2
  • Users will not use a mixed setup, with some programs compiled against Scala 2.11 and some against 2.10
  • Anyone using Spark2 will be using Scala 2.11
  • Anyone using Spark1 will be using Scala 2.10

API Design

To support Spark2, application developers can use one of two separate Spark API modules, depending on the Spark version they will run with:

  • cdap-api-spark - for Spark1, Scala 2.10. Same as in previous CDAP releases
  • cdap-api-spark2_2.11 - for Spark2, Scala 2.11. New in CDAP 4.2.0

Each API module contains two Java classes and two Scala classes:

  • JavaSparkExecutionContext
  • JavaSparkMain
  • SparkExecutionContext
  • SparkMain

These classes will have the same package and the same name in both API modules. To begin with, they will look exactly the same, which allows application developers to upgrade by simply changing pom dependencies. In the future, we will add Spark2-specific APIs to the classes in cdap-api-spark2_2.11 and deprecate the APIs that use Spark1 classes.
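
To illustrate, a minimal sketch of such a program is shown below. It assumes the JavaSparkMain and JavaSparkExecutionContext signatures stay identical in both modules, so the same source compiles against either cdap-api-spark or cdap-api-spark2_2.11 and only the pom dependency changes (the class below is illustrative, not part of CDAP):

import co.cask.cdap.api.spark.JavaSparkExecutionContext;
import co.cask.cdap.api.spark.JavaSparkMain;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

// Hypothetical example program, not part of the CDAP codebase.
public class SimpleSparkMain implements JavaSparkMain {

  @Override
  public void run(JavaSparkExecutionContext sec) throws Exception {
    // JavaSparkContext resolves against whichever Spark version is on the classpath at runtime.
    JavaSparkContext jsc = new JavaSparkContext();
    long count = jsc.parallelize(Arrays.asList(1, 2, 3, 4)).count();
    jsc.stop();
    // 'sec' exposes the same CDAP-side APIs in both modules.
  }
}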

Runtime Design

In distributed mode, we can take a similar approach to how HBase versions are handled. CDAP packages will come with two separate Spark extensions. When starting CDAP, we can detect the Spark version and set a SPARK_VERSION environment variable. Based on this value, the correct extension will be used to launch Spark programs.
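
As a rough sketch (class, method, and value names here are illustrative rather than final), the startup logic could map the detected Spark version to a compat value along these lines:

// Hypothetical helper; assumes the startup script exports SPARK_VERSION (e.g. "1.6.2" or "2.1.0").
public final class SparkCompatDetector {

  public static String detectCompatVersion() {
    String sparkVersion = System.getenv("SPARK_VERSION");
    if (sparkVersion == null) {
      throw new IllegalStateException("SPARK_VERSION is not set; cannot select a Spark runtime extension");
    }
    // Collapse the full version down to the compat version used to pick the extension.
    return sparkVersion.startsWith("2.") ? "2" : "1";
  }

  private SparkCompatDetector() { }
}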

In standalone mode, we can add a configuration setting in cdap-site.xml that will control the spark version:

  <property>
    <name>spark.compat.version</name>
    <value>2</value>
    <description>Which Spark compat version to use</description>
  </property>

Another option would be to add a flag to the 'cdap sdk' command to select the Spark version, but that approach would make it easy for the user to start CDAP with Spark1 at one time and Spark2 at another, which would result in confusing behavior.

Unit Test Design

TestBase supports using a special TestConfiguration when advanced users want to change the capabilities of their test. We can extend this capability with a config setting for the Spark version to be used:

@ClassRule
public static final TestConfiguration CONFIG = new TestConfiguration(Constants.Explore.EXPLORE_ENABLED, false,
                                                                     Constants.Security.Store.PROVIDER, "file",
                                                                     Constants.Spark.COMPAT_VERSION, Constants.Spark.SPARK2_COMPAT_VERSION);

A nicer user experience, though more difficult to implement, would be to detect the Spark version used by a program and use the correct Spark runtime when running those programs.

Interactive Spark

Approach

Spark2 Approach #1

API

We will introduce two new Spark API modules:

  • cdap-api-spark-base
  • cdap-api-spark2_2.11

cdap-api-spark-base will contain a base class for JavaSparkExecutionContext and SparkExecutionContext. To ensure that the base classes are compiled with the right Scala version, we will use the build-helper-maven-plugin to add cdap-api-spark-base as a source directory in both cdap-api-spark and cdap-api-spark2_2.11.
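
As a sketch of that layout (the abstract class name is hypothetical), the shared code could look like the following, with the base class compiled into each module through the source inclusion described above:

package co.cask.cdap.api.spark;

// Lives in cdap-api-spark-base; compiled once with Scala 2.10 (cdap-api-spark)
// and once with Scala 2.11 (cdap-api-spark2_2.11) via the added source directory.
abstract class AbstractJavaSparkExecutionContext {
  // Version-agnostic functionality shared by both Spark versions goes here.
}

// Exposed under the same package and name by both API modules, so applications
// only need to change their pom dependency to switch Spark versions.
public abstract class JavaSparkExecutionContext extends AbstractJavaSparkExecutionContext {
  // Spark2-specific APIs would only be added to the cdap-api-spark2_2.11 copy over time.
}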

Though this works in Maven, it confuses IntelliJ, which doesn't expect two modules to share the same source and will report compilation errors. To keep IntelliJ happy, we will add a new 'cdap-dev' profile, used in cdap-api-spark and cdap-api-spark2_2.11, that adds cdap-api-spark-base as a compile dependency.

Runtime

A similar change will be required for cdap-spark-core, with a module for base, spark1, and spark2:

  • cdap-spark-core-base
  • cdap-spark-core
  • cdap-spark-core2_2.11

It will use the same build-helper-maven-plugin approach to ensure that the base classes are compiled with the right Scala version.

A change will be made to ProgramRuntimeProvider to support an annotation property for the Spark version. The SparkProgramRuntimeProvider for Spark1 will be annotated slightly differently from the one for Spark2:

@ProgramRuntimeProvider.SupportedProgramType(value = ProgramType.SPARK, sparkcompat = SparkCompat.SPARK1)
public class SparkProgramRuntimeProvider implements ProgramRuntimeProvider {
  ...
}
 
@ProgramRuntimeProvider.SupportedProgramType(value = ProgramType.SPARK, sparkcompat = SparkCompat.SPARK2)
public class Spark2ProgramRuntimeProvider implements ProgramRuntimeProvider {
  ...
}

ProgramRuntimeProviderLoader will then be changed to examine the sparkcompat annotation value and compare it to the runtime Spark version, so that only the correct Spark runtime provider is loaded. The runtime Spark version is determined by first checking 'spark.compat.version' in cdap-site.xml (for CDAP standalone and unit tests, assuming unit tests go through this code path); if that is not defined, the system property 'cdap.spark.compat.version' is checked (for CDAP distributed).
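
A minimal sketch of this selection logic follows; the helper class and method names are illustrative, and the real change would live inside ProgramRuntimeProviderLoader:

// Hypothetical helper illustrating the selection logic; not the actual loader code.
// (CDAP imports omitted; SupportedProgramType is the annotation proposed above and
// would need runtime retention for this reflection-based check to work.)
public final class SparkCompatSelector {

  /** Resolves the runtime compat version: cdap-site.xml value first, then the system property. */
  public static String resolveCompatVersion(String cdapSiteValue) {
    if (cdapSiteValue != null) {
      // 'spark.compat.version' from cdap-site.xml (standalone / unit tests)
      return cdapSiteValue;
    }
    // 'cdap.spark.compat.version' system property (distributed)
    return System.getProperty("cdap.spark.compat.version");
  }

  /** Returns true if the provider's sparkcompat annotation matches the resolved compat version. */
  public static boolean matches(Class<?> providerClass, String compatVersion) {
    ProgramRuntimeProvider.SupportedProgramType type =
        providerClass.getAnnotation(ProgramRuntimeProvider.SupportedProgramType.class);
    // SparkCompat.SPARK1 matches "1", SparkCompat.SPARK2 matches "2".
    return type != null && type.sparkcompat().name().endsWith(compatVersion);
  }

  private SparkCompatSelector() { }
}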

There is also a small change required when determining which jars to localize in CDAP distributed mode: Spark1 has just a single spark-assembly jar, while Spark2 ships multiple jars.
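
A rough sketch of that difference (the helper class is hypothetical):

import java.io.File;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper showing which Spark framework jars would be localized.
public final class SparkFrameworkJars {

  public static List<File> getJarsToLocalize(File sparkHome, boolean isSpark2) {
    if (!isSpark2) {
      // Spark1 ships a single spark-assembly-*.jar under $SPARK_HOME/lib.
      File[] assembly = new File(sparkHome, "lib")
          .listFiles((dir, name) -> name.startsWith("spark-assembly") && name.endsWith(".jar"));
      return (assembly == null || assembly.length == 0)
          ? Collections.<File>emptyList() : Collections.singletonList(assembly[0]);
    }
    // Spark2 ships its classes as many separate jars under $SPARK_HOME/jars.
    File[] jars = new File(sparkHome, "jars").listFiles((dir, name) -> name.endsWith(".jar"));
    return jars == null ? Collections.<File>emptyList() : Arrays.asList(jars);
  }

  private SparkFrameworkJars() { }
}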

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path: /v3/apps/<app-id>
Method: GET
Description: Returns the application spec for a given application
Response Codes:
  • 200 - On success
  • 404 - When application is not available
  • 500 - Any internal errors

Deprecated REST API

Path: /v3/apps/<app-id>
Method: GET
Description: Returns the application spec for a given application

CLI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

UI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

Security Impact 

What is the impact on Authorization, and how does the design take care of this aspect?

Impact on Infrastructure Outages 

System behavior (if applicable, document the impact of downstream component failures, e.g. YARN, HBase) and how the design takes care of these aspects.

Test Scenarios

Test ID    Test Description    Expected Results

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3

 

Future work
