Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post

Introduction 

This is the design doc for Spark2 support and for interactive Spark programs.

Goals

We introduced complete Spark integration in CDAP 3.4 that is compatible with Spark 1.2.x - 1.6.x. Since then, Spark has evolved significantly, with new APIs (DataFrame / Dataset / SQL) and a new Spark2 runtime. Moreover, the current CDAP + Spark integration requires Spark developers to first create a CDAP application, add the Spark program to the application, and package and deploy the application before being able to run a single line of Spark code in CDAP. This is alien to most Spark developers, who expect a fast, interactive way to work with Spark in which they can type a couple of lines of Spark code, run them, and get results right away.

Here are the goals:

  • Make CDAP compatible with both Spark 1.x and 2.x
    • Both API and runtime
  • Provide a fast and interactive experience for Spark developers
    • Pipeline plugin developers
    • Spark programs, both standalone and inside workflows

User Stories 

Spark2 User Stories

  1. As a cluster administrator, I want CDAP to automatically use whatever Spark version is installed on my cluster.
  2. As an application developer, I want to be able to upgrade my Spark program from Spark1 to Spark2 by only changing pom dependencies.
  3. As an application developer, I want to be able to control which Spark version is used by CDAP standalone through command line args or configuration file edits.
  4. As an application developer, I want a clear error message if I try to deploy or run an application compiled with the wrong Spark version.
  5. As an application developer, I want to use the newer Spark2 APIs in my applications.
  6. As an application developer, I want to be able to control which Spark version is used in my unit tests.
  7. As a pipeline developer, I want to build pipelines without caring which version of Spark is on the cluster.
  8. As a pipeline developer, I want to be able to export a pipeline built on my SDK and import it onto the cluster without caring about Spark versions.

Interactive Spark User Stories

  1. As a pipeline developer, I want to add custom Spark code that transforms data in a realtime data pipeline (CDAP 4.2)
  2. As a pipeline developer, I want to add custom Spark code that transforms data in a batch data pipeline (CDAP 4.2)
  3. As a pipeline developer, I want to add custom Spark code that runs in a separate workflow node (CDAP Release TBD)
  4. As a Spark developer, I want a notebook experience for writing Spark programs interactively, similar to spark-shell (CDAP Release TBD)

    Design

    Spark2

    JIRA: CDAP-7875 (Cask Community Issue Tracker)

    Spark2 is built using Scala 2.11 by default. According to the Spark documentation, as of Spark 2.1.0, Scala 2.10 is deprecated and may be removed in the next minor version of Spark. Spark1 is built using Scala 2.10 by default. Since Scala 2.10 and 2.11 are not binary compatible, CDAP components that use Spark must have a Spark1 version and a Spark2 version, with the correct version used at runtime.

    Assumptions

    • Users will not use a mixed setup, with some programs running Spark1 and some running Spark2
    • Users will not use a mixed setup, with some programs compiled against Scala 2.11 and some against 2.10
    • Anyone using Spark2 will be using Scala 2.11
    • Anyone using Spark1 will be using Scala 2.10

    API Design

    To support Spark2, application developers can use one of two separate Spark API modules, depending on the Spark version they will run with:

    • cdap-api-spark - for Spark1, Scala 2.10. Same as in previous CDAP releases
    • cdap-api-spark2_2.11 - for Spark2, Scala 2.11. New in CDAP 4.2.0

    Each API module contains two Java classes and two Scala classes:

    • JavaSparkExecutionContext
    • JavaSparkMain
    • SparkExecutionContext
    • SparkMain

    These classes will have the same package and name in both API modules. To begin with, they will look exactly the same, which allows application developers to upgrade by simply changing pom dependencies. In the future, we will add Spark2-specific APIs to the classes in cdap-api-spark2_2.11 and deprecate the APIs that use Spark1 classes.
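
    For illustration, here is a minimal sketch of a Spark program that should compile unchanged against either module, as long as it only uses APIs available in both Spark1 and Spark2 (the class name and logic are hypothetical):

    Code Block
    // Hypothetical example: the same source compiles against cdap-api-spark (Spark1, Scala 2.10)
    // or cdap-api-spark2_2.11 (Spark2, Scala 2.11), since both modules expose the same
    // package and class names.
    import co.cask.cdap.api.spark.JavaSparkExecutionContext;
    import co.cask.cdap.api.spark.JavaSparkMain;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SimpleSparkProgram implements JavaSparkMain {
      @Override
      public void run(JavaSparkExecutionContext sec) {
        JavaSparkContext jsc = new JavaSparkContext();
        // Spark logic here, restricted to APIs that exist in both Spark1 and Spark2
        long count = jsc.parallelize(java.util.Arrays.asList(1, 2, 3, 4)).count();
        System.out.println("Processed " + count + " elements");
      }
    }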

    Runtime Design

    In distributed mode, we can take a similar approach to how HBase versions are handled. CDAP packages will come with two separate Spark extensions. When starting CDAP, we can detect the Spark version and set a SPARK_VERSION environment variable. Based on this value, the correct extension will be used to launch Spark programs.
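
    As a rough sketch of how the detected version could be wired through at startup (the exact mechanism may differ; 'cdap.spark.compat.version' is the system property described in the runtime provider section below):

    Code Block
    // Hypothetical startup sketch: map the detected Spark version to a compat version and
    // expose it so the matching Spark extension can be loaded.
    String sparkVersion = System.getenv("SPARK_VERSION");          // set by the CDAP startup script
    String compatVersion = (sparkVersion != null && sparkVersion.startsWith("2")) ? "2" : "1";
    System.setProperty("cdap.spark.compat.version", compatVersion);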

    In standalone mode, we can add a configuration setting in cdap-site.xml that will control the spark version:

    Code Block
      <property>
        <name>spark.compat.version</name>
        <value>2</value>
        <description>Which Spark compat version to use</description>
      </property>

    Another possibility would be to add an option to the 'cdap sdk' command to select the Spark version, but that approach would make it easy for the user to sometimes start CDAP with Spark1 and sometimes with Spark2, which would result in confusing behavior.

    Unit Test Design

    TestBase already lets advanced users pass a special TestConfiguration to change the capabilities of their test. We can extend this capability to support a config setting for the Spark version to be used:

    Code Block
    @ClassRule
    public static final TestConfiguration CONFIG = new TestConfiguration(Constants.Explore.EXPLORE_ENABLED, false,
                                                                         Constants.Security.Store.PROVIDER, "file",
                                                                         Constants.Spark.COMPAT_VERSION, Constants.Spark.SPARK2_COMPAT_VERSION);

    A nicer user experience, though more difficult to implement, would be to detect the Spark version used by a program and use the correct Spark runtime when running those programs.

    Because cdap-unit-test has a dependency on cdap-spark-core in order to launch Spark programs, we will need to have separate test modules as well. We could do this by introducing a new cdap-unit-test-spark2_2.11 module that also has a TestBase with the same package and name. This would allow people to move their unit tests from Spark1 to Spark2 by just updating their pom.

    Alternatively, we could investigate whether we can avoid introducing a mirrored cdap-unit-test module and instead simply include the right cdap-spark-core module as a dependency.
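
    With either option, an existing Spark unit test should stay source compatible. A minimal sketch, assuming the proposed Constants.Spark.COMPAT_VERSION setting and a hypothetical SparkApp application:

    Code Block
    // Hypothetical unit test: the same source should compile against the Spark1 test module or
    // the proposed cdap-unit-test-spark2_2.11 module, since TestBase keeps the same package
    // and class name in both.
    import co.cask.cdap.common.conf.Constants;
    import co.cask.cdap.test.ApplicationManager;
    import co.cask.cdap.test.TestBase;
    import co.cask.cdap.test.TestConfiguration;
    import org.junit.ClassRule;
    import org.junit.Test;

    public class SparkProgramTest extends TestBase {
      @ClassRule
      public static final TestConfiguration CONFIG =
        new TestConfiguration(Constants.Spark.COMPAT_VERSION, Constants.Spark.SPARK2_COMPAT_VERSION);

      @Test
      public void testSimpleSparkProgram() throws Exception {
        // SparkApp is a hypothetical application containing the Spark program under test
        ApplicationManager appManager = deployApplication(SparkApp.class);
        appManager.getSparkManager("SimpleSparkProgram").start();
        // ... wait for the run to finish and assert on the results ...
      }
    }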

    Pipeline Design

    The pipeline applications will also require separate jars for each Spark version. Since we can't deploy both Spark1 and Spark2 versions of the same artifact, we will need to change the way we package the artifacts:

    Code Block
    today:
    /opt/cdap/master/artifacts/cdap-data-pipeline-4.1.0.jar
      
    new:
    /opt/cdap/master/artifacts/spark1_2.10/cdap-data-pipeline-4.2.0.jar
    /opt/cdap/master/artifacts/spark1_2.10/spark-plugins-4.2.0.jar
    /opt/cdap/master/artifacts/spark1_2.10/spark-plugins-4.2.0.json
    /opt/cdap/master/artifacts/spark2_2.11/cdap-data-pipeline-4.2.0-spark2_2.11.jar
    /opt/cdap/master/artifacts/spark2_2.11/spark-plugins-4.2.0-spark2_2.11.jar
    /opt/cdap/master/artifacts/spark2_2.11/spark-plugins-4.2.0-spark2_2.11.json

    The artifact directory config setting will be changed to use multiple directories, substituting in the spark compat version:

    Code Block
    <property>
      <name>app.artifact.dir</name>
      <value>/opt/cdap/master/artifacts;/opt/cdap/master/artifacts/${spark.compat.version}</value>
      <description>
        Semicolon-separated list of local directories scanned for system artifacts to add to the artifact repository
      </description>
    </property>

    However, this makes things more complicated in the pipeline config and in the UI if we want to support exporting a pipeline from an SDK that uses Spark1 and importing it onto a cluster that uses Spark2. If we want to support that use case, we can use the new version range capability to always specify a range instead of an exact version:

    Code Block
    {
      "artifact": {
        "scope": "system",
        "name": "cdap-data-pipeline",
        "version": "[4.2.0, 4.2.1)"
      },
      ...
    }

     

    Market Design

    Every plugin in the market that uses Spark will need to include a version for Spark1 and a version for Spark2.

    Every pipeline in the market will need to use a version range instead of an exact version in order to allow for different Spark runtime environments.

    Interactive Spark

    Approach

    Spark2 Approach #1

    API

    We will introduce two new spark api modules:

    • cdap-api-spark-base
    • cdap-api-spark2_2.11

    cdap-api-spark-base will contain a base class for JavaSparkExecutionContext and SparkExecutionContext. In order to ensure that the base classes are compiled with the right Scala version, we will use the build-helper-maven-plugin to add cdap-api-spark-base as a source directory in both cdap-api-spark and cdap-api-spark2_2.11.

    Though this works in Maven, it confuses IntelliJ, which doesn't expect two modules to share the same source and will complain about compilation errors. To keep IntelliJ happy, we will add a new 'cdap-dev' profile that is used in cdap-api-spark and cdap-api-spark2_2.11 to add cdap-api-spark-base as a compile dependency.

    Runtime

    A similar change will be required for cdap-spark-core, with a module for base, spark1, and spark2:

    • cdap-spark-core-base
    • cdap-spark-core
    • cdap-spark-core2_2.11

    It will use the same build-helper-maven-plugin approach to ensure that the base classes are compiled with the right Scala version.

    A change will be made to the ProgramRuntimeProvider to support an annotation property for the Spark version. The SparkProgramRuntimeProvider for Spark1 will be annotated slightly differently from the one for Spark2:

    Code Block
    @ProgramRuntimeProvider.SupportedProgramType(value = ProgramType.SPARK, sparkcompat = SparkCompat.SPARK1)
    public class SparkProgramRuntimeProvider implements ProgramRuntimeProvider {
      ...
    }
     
    @ProgramRuntimeProvider.SupportedProgramType(value = ProgramType.SPARK, sparkcompat = SparkCompat.SPARK2)
    public class Spark2ProgramRuntimeProvider implements ProgramRuntimeProvider {
      ...
    }

    ProgramRuntimeProviderLoader will then be changed to examine the sparkcompat annotation value and compare it to the runtime Spark version, so that only the correct Spark runtime provider is loaded. The runtime Spark version will be determined by first checking 'spark.compat.version' in cdap-site.xml (for CDAP standalone and unit tests, assuming unit tests go through this code path). If it is not defined, the system property 'cdap.spark.compat.version' (set in CDAP distributed mode) will be checked.
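
    A minimal sketch of that lookup, assuming the property names above (the actual loader code may look different):

    Code Block
    // Hypothetical sketch of resolving the Spark compat version used to pick the runtime provider.
    import java.util.Map;

    public final class SparkCompatResolver {
      static String resolveSparkCompatVersion(Map<String, String> cdapSiteProps) {
        // the cdap-site.xml setting wins (CDAP standalone and unit tests)
        String compat = cdapSiteProps.get("spark.compat.version");
        if (compat == null) {
          // fall back to the system property set when starting CDAP distributed
          compat = System.getProperty("cdap.spark.compat.version");
        }
        return compat;
      }
    }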

    There is also a small change required when determining which jars to localize in CDAP distributed mode: in Spark1 there is just a single spark-assembly jar, but in Spark2 there are multiple jars.
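
    For example, jar selection could roughly follow the sketch below, assuming SPARK_HOME is set and the standard layouts of a single spark-assembly jar under lib/ for Spark1 and multiple jars under jars/ for Spark2 (the actual localization code may differ):

    Code Block
    // Hypothetical sketch: pick the Spark jars to localize based on the compat version.
    import java.io.File;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public final class SparkJarLocator {
      static List<File> getSparkJars(String sparkCompatVersion) {
        File sparkHome = new File(System.getenv("SPARK_HOME"));
        File[] jars;
        if ("1".equals(sparkCompatVersion)) {
          // Spark1: a single spark-assembly jar under $SPARK_HOME/lib
          jars = new File(sparkHome, "lib").listFiles(
            (dir, name) -> name.startsWith("spark-assembly") && name.endsWith(".jar"));
        } else {
          // Spark2: every jar under $SPARK_HOME/jars
          jars = new File(sparkHome, "jars").listFiles((dir, name) -> name.endsWith(".jar"));
        }
        if (jars == null) {
          return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(jars));
      }
    }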

    API changes

    New Programmatic APIs

    New Java APIs introduced (both user facing and internal)

    Deprecated Programmatic APIs

    New REST APIs

    Path: /v3/apps/<app-id>
    Method: GET
    Description: Returns the application spec for a given application
    Response Codes:
      200 - On success
      404 - When application is not available
      500 - Any internal errors

    Deprecated REST API

    Path: /v3/apps/<app-id>
    Method: GET
    Description: Returns the application spec for a given application

    CLI Impact or Changes

    • Impact #1
    • Impact #2
    • Impact #3

    UI Impact or Changes

    • Impact #1
    • Impact #2
    • Impact #3

    Security Impact 

    What is the impact on Authorization, and how does the design take care of this aspect?

    Impact on Infrastructure Outages 

    System behavior (if applicable, document the impact of failures in downstream components such as YARN or HBase) and how the design takes care of this aspect.

    Test Scenarios

    Test ID | Test Description | Expected Results

    Releases

    Release X.Y.Z

    Release X.Y.Z

    Related Work

    • Work #1
    • Work #2
    • Work #3

     

    Future work