...
Given Spark's ability to cache datasets, a Spark program can first build up cached RDDs and then expose a network service (e.g. an HTTP service) to allow querying those RDDs interactively. This gives users a way to easily build an interactive query service over a large dataset with relatively low latency.
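As a sketch of this cache-then-serve pattern (using only the standard Spark Java API; the file path and query are placeholders, and the HTTP wiring is covered by the CDAP API below):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CachedQuerySketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local", "cached-query");

    // Build the RDD once and keep it in memory so that subsequent
    // queries avoid re-reading and re-parsing the source data.
    JavaRDD<String> logs = sc.textFile("/path/to/logs").cache();

    // Each "query" (here, a simple filter + count) then runs against
    // the cached partitions with relatively low latency.
    long errors = logs.filter(line -> line.contains("ERROR")).count();
    System.out.println("errors = " + errors);

    sc.stop();
  }
}
```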
Here is the proposed CDAP API for service integration in Spark programs:
- Introduce a new interface, SparkHttpServiceContext, which provides access to the SparkContext instance created in the Spark program:

```java
public interface SparkHttpServiceContext extends HttpServiceContext {
  SparkContext getSparkContext();
}
```
- Users can add multiple HttpServiceHandler instances to the Spark program in the Spark.configure method through the SparkConfigurer.
- CDAP will call the HttpServiceHandler.initialize method with a SparkHttpServiceContext instance.
- CDAP will provide an abstract class, AbstractSparkHttpServiceHandler, to deal with the casting in the initialize method:

```java
public abstract class AbstractSparkHttpServiceHandler extends AbstractHttpServiceHandler {

  private SparkContext sparkContext;

  @Override
  public void initialize(HttpServiceContext context) throws Exception {
    super.initialize(context);

    // Shouldn't happen. The CDAP framework guarantees it.
    if (!(context instanceof SparkHttpServiceContext)) {
      throw new IllegalArgumentException("The context type should be SparkHttpServiceContext");
    }
    this.sparkContext = ((SparkHttpServiceContext) context).getSparkContext();
  }

  protected final SparkContext getSparkContext() {
    return sparkContext;
  }
}
```
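A concrete handler built on this abstract class might look as follows. This is an illustrative sketch only: the handler name, path, and query are hypothetical, and it assumes CDAP's usual JAX-RS-style handler annotations (@GET, @Path) and the HttpServiceRequest/HttpServiceResponder types:

```java
// Hypothetical handler: answers an HTTP query using the SparkContext
// that CDAP injects through SparkHttpServiceContext.
public class LineCountHandler extends AbstractSparkHttpServiceHandler {

  @GET
  @Path("/count")
  public void count(HttpServiceRequest request, HttpServiceResponder responder) {
    // getSparkContext() is inherited from AbstractSparkHttpServiceHandler;
    // CDAP guarantees initialize() has run before any request is served.
    SparkContext sc = getSparkContext();

    // Placeholder query: count lines of a (presumably cached) RDD. How the
    // program shares its cached RDDs with handlers is up to the program.
    long total = sc.textFile("/path/to/data", 1).cache().count();
    responder.sendString(String.valueOf(total));
  }
}
```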
- Because CDAP needs to provide the SparkContext to the HTTP handler, the HTTP service and the initialization of the HttpServiceHandler will only happen after the user's Spark program has instantiated the SparkContext (see option b. above).
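Registration of handlers in Spark.configure could then look like the sketch below. The addHandlers call and the handler class are hypothetical, shown only to illustrate the proposed SparkConfigurer-based wiring:

```java
public class InteractiveQuerySpark extends AbstractSpark {

  @Override
  public void configure() {
    setName("InteractiveQuerySpark");
    setMainClass(InteractiveQueryProgram.class);
    // Hypothetical registration method on the configurer; the exact
    // SparkConfigurer API for adding handlers is part of this proposal.
    addHandlers(new MyQueryHandler());
  }
}
```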
API for Dataframe/SparkSQL
...