Note: Make this page's parent CDAP once facets work for both streams and datasets.

Requirements

  • CDAP exposes an API that developers can use to build their own plugins for parsing data in a Stream.

  • A developer should be able to build a parser for events in a Stream using the CDAP-provided API.
  • A developer or operator should then be able to deploy the implemented parser into a directory, together with a configuration.
  • The user should specify, at minimum, a name and a description for the plugin in that configuration.
  • The user should be able to list the available plugins using the REST API or CLI.
  • The user should be able to view a plugin's pre-defined schema, if the plugin defines one, using the REST API or CLI.
  • The user should be able to list the views associated with a Stream using the REST API, CLI, or UI.
  • The user should be able to apply a plugin to a Stream to create a view.
  • A user-specified view name should be registered in a catalog so that the view can be queried (SQL) by name.
  • The user should be able to apply different plugins to the same Stream, creating different views.
  • The user should be able to change the plugin associated with a view.
  • CDAP should provide a text wrangler plugin that lets users create rules for parsing data, mostly text files.

Overview

  • A facet is another source from which data can be read, like a stream or a dataset.
    • Therefore, facets are readable anywhere a stream or dataset is readable (MapReduce/Spark programs, flows, ETL).
  • A facet is a read-only view of a stream or dataset with a specific read format (a schema plus a format such as csv or avro).
  • If Explore is enabled, a Hive table will be created for each facet.
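
As a rough illustration of the last point, the sketch below derives the Hive tables that would exist for a set of facets. The `facet_` prefix is an assumption inferred from the sample queries in the CLI flow on this page (`facet_facet1`, `stream_stream1`); the function names are hypothetical and are not part of any CDAP API.

```python
def hive_table_name(kind, name):
    # Assumption: tables are named "<kind>_<name>", as in the sample
    # queries "select * from facet_facet1" and "select * from stream_stream1".
    return "{}_{}".format(kind, name)


def hive_tables_for_facets(facet_ids, explore_enabled):
    # No Hive tables are created for facets when Explore is disabled.
    if not explore_enabled:
        return []
    return [hive_table_name("facet", facet_id) for facet_id in facet_ids]
```

With Explore enabled, facets "facet1" and "facet2" would map to the tables `facet_facet1` and `facet_facet2`; with Explore disabled, the result is an empty list.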

3.2 Plan

  • Facet HTTP API, client, CLI
  • Facets can be views of streams (not of datasets yet)
  • Hive tables will be created for facets when Explore is enabled

Facet HTTP API

Path | Request | Response | Notes
PUT /v3/namespaces/<namespace>/facets/<facet> | {"stream": "stream1", "format": <same as before>} | | Creates or modifies a facet.
GET /v3/namespaces/<namespace>/facets/<facet> | | {"id": "someFacet", "stream": "stream1", "format": ..} | Gets details of an individual facet.
GET /v3/namespaces/<namespace>/facets | | | Lists all facets.
DELETE /v3/namespaces/<namespace>/facets/<facet> | | | Deletes a facet.
GET /v3/namespaces/<namespace>/stream/<stream>/facets | | [{"id": "someFacet", "stream": "stream1", "format": ..}, {"id": "otherFacet", "stream": "stream2", "format": ..}] | Lists all facets associated with a stream.

Notes

  • If Explore is disabled, then Hive tables will not be created for facets
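
A minimal client-side sketch of the facet endpoints follows. It only builds the requests and does not send them. The base URL is a placeholder, the helper names are hypothetical, and the `fmt` argument stands in for the format specification that this page leaves as "<same as before>". The sketch assumes the plural `namespaces` and `facets` path segments for every call.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Assumption: placeholder host and port; replace with the real router address.
BASE_URL = "http://localhost:10000"


def facet_path(namespace, facet=None):
    # Builds /v3/namespaces/<namespace>/facets[/<facet>]
    path = "/v3/namespaces/{}/facets".format(quote(namespace, safe=""))
    if facet is not None:
        path += "/" + quote(facet, safe="")
    return path


def put_facet_request(namespace, facet, stream, fmt):
    # Creates or modifies a facet. `fmt` is whatever format specification
    # the server expects; its exact shape is not defined on this page.
    body = json.dumps({"stream": stream, "format": fmt}).encode("utf-8")
    return Request(
        BASE_URL + facet_path(namespace, facet),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def get_facet_request(namespace, facet):
    # Gets details of an individual facet.
    return Request(BASE_URL + facet_path(namespace, facet), method="GET")


def list_facets_request(namespace):
    # Lists all facets in a namespace.
    return Request(BASE_URL + facet_path(namespace), method="GET")


def delete_facet_request(namespace, facet):
    # Deletes a facet.
    return Request(BASE_URL + facet_path(namespace, facet), method="DELETE")
```

Each returned `Request` can be passed to `urllib.request.urlopen` once a server is available.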

Sample CLI Flow

  1. The user wants to create a stream "stream1" that contains CSV data and read it using two facets, "facet1" and "facet2".
    1. create stream stream1
    2. send stream stream1 "a,b,c"
      send stream stream1 "d,e,f" 
    3. execute "select * from stream_stream1" // may be removed later, as facets already cover this

      body
      a,b,c
      d,e,f
    4. create facet facet1 stream1 format csv "ticker string, num_traded int, price double"
    5. execute "select * from facet_facet1"

      ticker | num_traded | price
      a | b | c
      d | e | f
    6. create facet facet2 stream1 format csv "ticker string, price double" "drop=$2" <-- "drop=$2" means "drop the 2nd field"
    7. execute "select * from facet_facet2"

      ticker | price
      a | c
      d | f
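
The effect of the two facets in this flow can be approximated with the sketch below. It is an illustration under assumptions, not the CDAP implementation: the function names are hypothetical, only a single `drop=$N` option is handled, and values are kept as strings rather than converted to the declared types.

```python
import csv


def parse_schema(schema):
    # "ticker string, price double" -> [("ticker", "string"), ("price", "double")]
    fields = []
    for part in schema.split(","):
        name, type_name = part.split()
        fields.append((name, type_name))
    return fields


def parse_drop(option):
    # "drop=$2" -> {1}: the zero-based index of the field to drop.
    key, _, value = option.partition("=")
    if key != "drop" or not value.startswith("$"):
        raise ValueError("unsupported option: " + option)
    return {int(value[1:]) - 1}


def read_facet(events, schema, drop=frozenset()):
    # Read-only view: parses each CSV event body, removes dropped fields,
    # then labels the remaining values with the schema's column names.
    names = [name for name, _ in parse_schema(schema)]
    for row in csv.reader(events):
        kept = [value for i, value in enumerate(row) if i not in drop]
        yield dict(zip(names, kept))


events = ["a,b,c", "d,e,f"]
facet1 = list(read_facet(events, "ticker string, num_traded int, price double"))
facet2 = list(read_facet(events, "ticker string, price double", parse_drop("drop=$2")))
```

Here `facet1` holds all three columns for each event, while `facet2` holds only `ticker` and `price`, because the second field is dropped before the schema is applied. The stream events themselves are never modified.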