  1. Operations
    1. Perform single and batch reads on one or more datasets from a script transform
    2. Perform single and batch reads on one or more DistributedCache files from a script transform
  2. Supported tables for lookup
    1. KeyValueTable dataset
    2. ObjectMappedTable dataset
    3. CSV files treated as a list of key-value pairs
  3. Optional caching with time-based expiration
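
As an illustration of the "CSV files treated as a list of key-value pairs" requirement, here is a minimal Java sketch. The class name CsvKeyValueSketch and the helper toKeyValuePairs are hypothetical and not part of the proposal. It also assumes that the first column is the key and the remainder of the line is the value, which the requirements above do not specify.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the proposed API.
class CsvKeyValueSketch {

  // Assumption: the first column is the key; everything after the first comma is the value.
  static Map<String, String> toKeyValuePairs(List<String> lines) {
    Map<String, String> pairs = new LinkedHashMap<>();
    for (String line : lines) {
      int comma = line.indexOf(',');
      if (comma < 0) {
        continue; // this sketch skips blank lines and lines without a comma
      }
      String key = line.substring(0, comma).trim();
      String value = line.substring(comma + 1).trim();
      pairs.put(key, value); // a later line with the same key overwrites an earlier one
    }
    return pairs;
  }

  public static void main(String[] args) {
    Map<String, String> table = toKeyValuePairs(Arrays.asList("10.0.0.1,US", "10.0.0.2,DE"));
    System.out.println(table.get("10.0.0.1"));
  }
}
```

A real implementation would also need to settle quoting, header rows, and multi-column values, none of which this sketch handles.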

Design

Proposed changes

  1. TransformContext changes
    1. Rename TransformContext to StageContext (since it is used as the base context in BatchRuntimeContext, not only transform contexts).
    2. Create TransformContext which extends StageContext
      1. Has a single method: Lookup getLookup(String table), which returns the Lookup for the named table
  2. Transform changes
  3. Lookup interface with lookup methods that perform read operations on datasets
  4. DefaultLookup: implementation that uses Transactional to implement Lookup, used in ETLWorker
  5. Add Lookup field to ScriptContext, so Lookup is accessible from transforms that interpret JavaScript. Sample usage: ctx.lookup.lookupKVString('purchases', 'key')

     Lookup interface 

    Code Block
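    import java.util.Map;
    import java.util.Set;

    interface Lookup<T> {
      // Single read: returns the value stored under the given key.
      T lookup(String key);
      // Batch reads: return a map from each requested key to its value.
      Map<String, T> lookup(String... keys);
      Map<String, T> lookup(Set<String> keys);
    }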
  6. Implement Lookup in KeyValueTable and ObjectMappedTable
    1. KeyValueTable implements Lookup<String>
    2. ObjectMappedTable implements Lookup<StructuredRecord>
  7. DatasetConfigurer changes
    1. Add method: void useDataset(String datasetName);
  8. ScriptTransform changes
    1. Add a configuration property that declares the lookup tables to use, along with properties for each table (e.g. dataset properties)

      Code Block
      "tables": [
        {
          "name":"purchases",
          "type":"dataset",
          "properties": {
            "dataset":"purchases",
            "properties":{.. dataset properties ..},
            "enableCache":"true",
            "cacheExpiry":1234
          }
        },
        {"name":"ip2geo", "type":"file", "properties":{"file":"/data/ip2geo.csv"}}
      ]
    2. configure(): verify tables (datasets and files) exist by calling DatasetConfigurer.useDataset()
    3. transform(): execute lookup methods in a transaction, provide Lookup instance to script
      1. Options for lookup usage: 

        Code Block
        var result = context.getLookup("purchases").lookup(user);
      2. Options for batch lookup usage:

        Code Block
        var result = context.getLookup("purchases").lookup(["alice", "bob"]);
        // do something with result["alice"]
        // do something with result["bob"]
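
The optional caching controlled by "enableCache" and "cacheExpiry" could be layered on top of any Lookup as a decorator. The Java sketch below is one possible shape, not the proposed implementation. CachingLookup, its constructor parameters, and the injectable clock are hypothetical names. Treating the expiry as milliseconds is also an assumption, since the configuration above does not state a unit.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.LongSupplier;

// Mirrors the Lookup interface proposed above.
interface Lookup<T> {
  T lookup(String key);
  Map<String, T> lookup(String... keys);
  Map<String, T> lookup(Set<String> keys);
}

// Hypothetical decorator: caches per-key results and reloads them after expiryMillis.
// Not thread-safe; a real implementation would need synchronization or a concurrent cache.
class CachingLookup<T> implements Lookup<T> {

  private static final class Entry<V> {
    final V value;
    final long loadedAt;

    Entry(V value, long loadedAt) {
      this.value = value;
      this.loadedAt = loadedAt;
    }
  }

  private final Lookup<T> delegate;
  private final long expiryMillis; // assumption: expiry is expressed in milliseconds
  private final LongSupplier clock; // injectable time source, e.g. System::currentTimeMillis
  private final Map<String, Entry<T>> cache = new HashMap<>();

  CachingLookup(Lookup<T> delegate, long expiryMillis, LongSupplier clock) {
    this.delegate = delegate;
    this.expiryMillis = expiryMillis;
    this.clock = clock;
  }

  @Override
  public T lookup(String key) {
    long now = clock.getAsLong();
    Entry<T> entry = cache.get(key);
    // Reload when the key was never read or its entry is older than the expiry.
    if (entry == null || now - entry.loadedAt >= expiryMillis) {
      entry = new Entry<>(delegate.lookup(key), now);
      cache.put(key, entry);
    }
    return entry.value;
  }

  @Override
  public Map<String, T> lookup(String... keys) {
    Set<String> unique = new LinkedHashSet<>(Arrays.asList(keys));
    return lookup(unique);
  }

  @Override
  public Map<String, T> lookup(Set<String> keys) {
    // For simplicity, batch reads go through the single-key path so each key is cached;
    // this gives up any batching the underlying table might offer.
    Map<String, T> results = new HashMap<>();
    for (String key : keys) {
      results.put(key, lookup(key));
    }
    return results;
  }
}
```

In this sketch a wrapper would be created once per configured table, for example new CachingLookup<>(tableLookup, 1234, System::currentTimeMillis), and handed to the script in place of the raw Lookup when "enableCache" is set.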