Sampling Transform
- Russ Savage
- Romy Khetan
Introduction
Users often want to sample a large dataset and pull only a few records for analysis. This transform lets them take a random sample of the data flowing through it. The sampling method should be the one described for HEDIS reporting.
Use case(s)
- I would like to sample my member database for calculating the Adult BMI HEDIS measure. In this case, I would like to build a pipeline that pulls records from my member database, sorts them alphabetically using an OrderBy plugin (in development), and then applies the following sampling methodology. The inputs are a sample size and an over-sampling percentage. The final sample size is calculated as Final Sample Size = Input Sample Size + (Input Sample Size * Over Sampling Percentage), rounded up to the next whole number. Every Nth member is then chosen, where N = Total Records / Final Sample Size. The first member is chosen at (random number between 0 and 1) * N, followed by every Nth member after that.
- As a data scientist, I would like to sample 20% of the records in the dataset for training a machine learning model. I would like to build a hydrator pipeline where I can leverage a transform where 1000 records go into the plugin, and only 200 records come out for processing.
- I have a stream of items of large and unknown length, and I would like to randomly choose items from this stream such that each item is equally likely to be selected. I would like to use this transform with a Kafka queue in a Spark Streaming pipeline. (Reservoir Sampling example)
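The systematic methodology in the first use case can be sketched as follows. This is a minimal illustration, not the plugin's implementation: the function names are hypothetical, positions are assumed to be zero-based and rounded down, and the over-sampling amount is assumed to be added to the input sample size before rounding up.

```python
import math
import random
from fractions import Fraction


def final_sample_size(sample_size, over_sampling_pct=0):
    """Input sample size plus the over-sampling share, rounded up."""
    # Fractions keep the arithmetic exact, so ceil() never sees float error.
    extra = Fraction(sample_size) * Fraction(str(over_sampling_pct)) / 100
    return sample_size + math.ceil(extra)


def systematic_sample(records, sample_size, total_records,
                      over_sampling_pct=0, rand=None):
    """Pick every Nth record, starting at rand * N (assumed zero-based)."""
    size = final_sample_size(sample_size, over_sampling_pct)
    if size <= 0 or total_records <= 0:
        return []
    if size >= total_records:
        # Asked for at least as many records as exist: keep everything.
        return list(records)

    step = total_records / size  # N = Total Records / Final Sample Size
    start = (random.random() if rand is None else rand) * step

    # Positions to keep; clamp so rand == 1.0 cannot run past the last record.
    wanted = {min(int(start + k * step), total_records - 1)
              for k in range(size)}
    return [rec for i, rec in enumerate(records) if i in wanted]


# Sample size 2, 30% over-sampling, 11 records, random = 0.2:
# final size = ceil(2 + 0.6) = 3, N = 11 / 3, start = 0.2 * N
print(systematic_sample(list(range(11)), 2, 11, 30, 0.2))  # [0, 4, 8]
```

The total record count has to be known before the first record is read, which is why this method needs it as an input.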
User Story(ies)
- As a hydrator user, I would like to sample the records in my pipeline so that a large number of records go in, but only the specified number of records plus the over-sampling percentage come out of the transform.
Plugin Type
- Aggregate (Or maybe a transform)
Configurables
This section defines properties that are configurable for this plugin.
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Input Sample Size | String | The number of records that you would like to sample from the input records. | |
Input Sample Percentage | String | The % of records that you would like to sample from the input records. | 0 - 100 |
Oversampling Percentage | String | The % of additional records you would like to include in addition to the input sample size to account for oversampling. Defaults to 0. | 0 - 100 |
Sampling Type | String | Type of the Sampling algorithm that needs to be used to sample the data. | |
Random | String | Random float value between 0 and 1 to be used in Systematic Sampling. If not provided, the plugin will internally generate a random value. | |
Total Records | String | Total number of input records to be used in Systematic Sampling. | |
Design / Implementation Tips
- One of Input Sample Size or Input Sample Percentage must be specified.
- Please follow the "Systematic Sampling Methodology" (starts on page 44) found in this document: https://drive.google.com/open?id=0B1DD6Nd_UiCZZzNBN1Z2ZHZHZUk for Input Sample Size
- Please use the Reservoir Sampling method (http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/), which may require different input values.
- This should be a single plugin that allows the user to choose the method of sampling they would like to use. We should design this in such a way that additional sampling methods can be added to the same plugin.
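For reference, a minimal sketch of reservoir sampling is shown below. It uses the classic single-pass variant (often called Algorithm R), which may differ from the variant described in the linked post. The function name and the `rng` parameter are hypothetical and only there for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep k items from a stream of unknown length, each equally likely."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 200, random.Random(42))
print(len(sample))  # 200
```

Unlike the systematic method, this needs no total record count up front, which is what makes it a fit for the streaming use case.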
Design
{
  "name": "Sampling",
  "plugin": {
    "name": "Sampling",
    "type": "batchaggregator",
    "label": "Sampling",
    "artifact": {
      "name": "sampling-aggregator-plugin",
      "version": "1.6.0",
      "scope": "SYSTEM"
    },
    "properties": {
      "samplingType": "Systematic",
      "sampleSize": "2",
      "random": "0.2",
      "overSamplingPercentage": "30",
      "totalRecords": "11"
    }
  }
}
Properties
sampleSize: The number of records to sample from the input records.
samplePercentage: The percentage of records to sample from the input records. Either 'samplePercentage' or 'sampleSize' must be specified.
overSamplingPercentage: The percentage of additional records to include on top of the input sample size to account for oversampling. Used in Systematic Sampling.
samplingType: Type of the Sampling algorithm used to sample the data. For example: Systematic or Reservoir.
random: Random float value between 0 and 1 to be used in Systematic Sampling. If not provided, the plugin will internally generate a random value.
totalRecords: Total number of input records to be used in Systematic Sampling.
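The constraints stated on this page (one of sampleSize or samplePercentage, percentages between 0 and 100, random between 0 and 1, totalRecords required for Systematic) could be checked along these lines. This is only a sketch: the function name and error messages are hypothetical, and treating Systematic and Reservoir as the only accepted sampling types is an assumption, since more methods may be added later.

```python
NUMERIC_KEYS = ("sampleSize", "samplePercentage", "overSamplingPercentage",
                "random", "totalRecords")
KNOWN_TYPES = ("Systematic", "Reservoir")  # assumption: may grow over time


def validate_sampling_properties(props):
    """Return a list of problems found in the plugin properties."""
    errors = []
    values = {}
    for key in NUMERIC_KEYS:
        # Properties arrive as strings; blank or missing means "not set".
        raw = str(props.get(key) or "").strip()
        values[key] = None
        if raw:
            try:
                values[key] = float(raw)
            except ValueError:
                errors.append(f"{key} must be numeric, got {raw!r}")

    sampling_type = str(props.get("samplingType") or "").strip()
    if sampling_type not in KNOWN_TYPES:
        errors.append(f"unknown samplingType {sampling_type!r}")

    if values["sampleSize"] is None and values["samplePercentage"] is None:
        errors.append("one of sampleSize or samplePercentage is required")

    for key in ("samplePercentage", "overSamplingPercentage"):
        if values[key] is not None and not 0 <= values[key] <= 100:
            errors.append(f"{key} must be between 0 and 100")

    if values["random"] is not None and not 0 <= values["random"] <= 1:
        errors.append("random must be between 0 and 1")

    if sampling_type == "Systematic" and values["totalRecords"] is None:
        errors.append("totalRecords is required for Systematic sampling")
    return errors


config = {"samplingType": "Systematic", "sampleSize": "2", "random": "0.2",
          "overSamplingPercentage": "30", "totalRecords": "11"}
print(validate_sampling_properties(config))  # []
```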
NFR
Only performance measurement is in scope for NFR.
Limitation(s)
The user must provide the total number of records when the Sampling Type is Systematic.
Future Work
- Some future work – HYDRATOR-99999
- Another future work – HYDRATOR-99999
Test Case(s)
- Sample records with Systematic sampling
- Sample records with Reservoir Sampling
- Sample records with Systematic sampling along with over-sampling percentage
Sample Pipeline
sampling-systematic-cdap-data-pipeline.json
sampling-systematic_samplePercentage-cdap-data-pipeline.json
sampling-reservoir-cdap-data-pipeline.json
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature