PySpark Integration

This page describes the design details of the PySpark integration in CDAP.

Background

When a Spark program is executed in the CDAP runtime environment, a CDAP-specific context, SparkExecutionContext, is available to the user program for gathering runtime information and interacting with the CDAP system. The same functionality should be available to PySpark programs in CDAP.
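As a rough illustration of the goal, a CDAP PySpark program might look something like the sketch below. Every name here (the run entry point and the sec argument) is hypothetical and only indicates the kind of access a Python program should get; it is not an actual CDAP Python API.

    # Hypothetical sketch only: 'run' and 'sec' illustrate the desired contract,
    # mirroring what a Scala/Java Spark program gets from CDAP today.
    from pyspark import SparkContext

    def run(sec):
        # 'sec' stands in for the CDAP SparkExecutionContext exposed to Python.
        sc = SparkContext.getOrCreate()
        counts = (sc.parallelize(["a", "b", "a"])
                    .map(lambda x: (x, 1))
                    .reduceByKey(lambda a, b: a + b))
        # Runtime information and CDAP interactions would go through 'sec',
        # just as they do through the JVM SparkExecutionContext.
        print(counts.collect())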

Design

A typical Spark program has a Driver process and multiple Executor processes, and the same is true for a PySpark program. Standard PySpark uses py4j in the Driver process to bridge calls from Python into the JVM. Most of the PySpark API is implemented on top of this bridge, wrapping objects and calling methods that are written in Scala/Java. In the Executor processes, however, execution is driven from the JVM side, which invokes Python functions through Unix process pipes and uses the Python pickle library for data serialization. A more detailed description can be found in the now-deprecated PySpark internals design doc, and also in this blog post.
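The Driver-side bridge can be observed from a regular PySpark session. The snippet below peeks at PySpark's internal _jvm and _jsc attributes purely to show that the Python objects are thin wrappers around JVM objects reached through the py4j gateway; these attributes are PySpark internals, not a public API.

    from pyspark import SparkContext

    sc = SparkContext(appName="py4j-bridge-demo")

    # PySpark keeps a py4j view of the Driver JVM and the wrapped JavaSparkContext
    # on the SparkContext object.
    jvm = sc._jvm
    jsc = sc._jsc

    # Any call on 'jvm' is forwarded over the py4j gateway to the Driver JVM.
    print(jvm.java.lang.System.currentTimeMillis())

    sc.stop()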

For PySpark in CDAP, we need to provide access to the CDAP SparkExecutionContext both in the Driver process and in the closure functions that are executed in the Executor processes. At a high level, we can piggyback on the py4j gateway server started by Spark in the Driver process, and start our own py4j gateway server in each Executor process.
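On the Executor side, the idea would be for the Python worker to connect back to the per-executor gateway server started by CDAP. A minimal sketch of that connection is below, assuming the gateway port is handed to the Python worker through an environment variable; the variable name and the entry-point object are assumptions for illustration, not a defined CDAP interface.

    import os

    from py4j.java_gateway import JavaGateway, GatewayParameters

    # Assumed: the CDAP runtime starts a py4j GatewayServer in the Executor JVM
    # and passes its port to the Python worker; the variable name is illustrative.
    port = int(os.environ["CDAP_PY4J_GATEWAY_PORT"])

    gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port))

    # The entry point is whatever JVM-side object CDAP registers on the
    # GatewayServer; closure code could reach SparkExecutionContext
    # functionality through it.
    sec_bridge = gateway.entry_point

Because each Executor would run its own gateway server, Python closures would not need to route CDAP calls back through the Driver process.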