Plugin Coding Standards


This page defines best practices for pipeline plugins. Plugins should follow a consistent style and set of conventions so that users know what to expect.

Documentation

Pipelines are designed to be used by non-programmers. When writing documentation for a plugin, keep that type of user in mind. Except in the properties section, use complete, grammatically correct sentences. Here are some general guidelines to keep in mind when writing documentation.


Avoid the second person ("you") in most scenarios. For example, instead of "This plugin allows you to read from HDFS", write "This plugin reads from HDFS".


If an acronym is used, make sure the first usage spells out the entire name with the acronym in parentheses. Subsequent references can use the acronym. For example, "This plugin reads a directory from the Hadoop Distributed File System (HDFS). It will read from the same HDFS ...."  


Be mindful of vocabulary and use plain English when possible. For example, instead of "If there is an exception parsing the field...", write "If there is an error parsing the field..."

Common Terminology

It is important to use the same terminology throughout our plugin documentation.


Stage – An individual node in the pipeline directed acyclic graph (DAG). Avoid calling this a 'node' or a 'plugin'.

Plugin – Defines the type and functionality of a stage. Corresponds to a box on the left-hand side of the pipeline studio.

Partition – A single part of the entire data. Use this instead of 'split', 'shard', or 'mapper/reducer'.

Reference Doc

Reference docs should contain at least two sections: Description and Properties. Oftentimes, it is useful to have an Examples section. 

Description

The description should explain what the plugin does. It does not need to go into great detail about every property it supports. Instead, it should mention high-level use cases.

If the plugin is a source or a sink, it is often useful to include a couple of sentences about the system it reads from or writes to, so that a user who is unfamiliar with the plugin can quickly determine whether it is the right one to use. For example, a user might think the Table source reads from a relational database table and not a CDAP table. By reading the description for the Table source, the user should be able to quickly realize it is not the plugin they are looking for and move on.

Properties

Describe all the plugin properties in the order that they appear in the UI. Keep the following points in mind when writing the descriptions:

  • Names should be the names shown in the UI and not the names used in the backend.
  • Do not mention the format that the backend expects if the UI does not expose the format. For example, do not mention that 'fields' is a comma-separated list of fields if the property uses the 'csv' widget.
  • The first "sentence" is a fragment. You can think of it starting with an implicit "This property is ".
  • Always end the description with a period.
  • Mention restrictions or special values for properties. For example, document that a timeout property cannot be below 0 and that 0 means there is no timeout.
  • For numeric properties, include the unit. For example, instead of 'timestamp', use 'timestamp in seconds'. Instead of 'size', use 'size (GB)'.
  • Reference Name should always have the same description for sources and the same description for sinks – "Used to uniquely identify this <source/sink> for lineage, annotating metadata, and other governance operations."


For example:

Properties
----------
**Reference Name**: Used to uniquely identify this sink for lineage, annotating metadata, and other governance operations.

**Project ID**: The Google Cloud Project ID, which uniquely identifies a project.
It can be found on the Dashboard in the Google Cloud Platform Console.

**Service Account File Path**: Path on the local file system of the service account key used for
authorization. Does not need to be specified when running on a Dataproc cluster.
When running on other clusters, the file must be present on every node in the cluster.


Examples

Many plugin types benefit from including a couple of examples in the reference doc. Sources and sinks often do not benefit from examples, but most other plugin types do. When writing an example, include how the plugin is configured and what output it would generate for some given input.

For example:

Example
-------
In this example, the plugin is configured with the unique fields as `fname,lname` and the filter operation as `max(cost)`.


Suppose the input records are:

    +======================================+
    | fname  | lname   | cost   |  zipcode |
    +======================================+
    | bob    | smith   | 50.23  |  12345   |
    | bob    | smith   | 0.50   |  45678   |
    | bob    | jones   | 30.64  |  23456   |
    | alice  | smith   | 1.50   |  34567   |
    | alice  | smith   | 30.21  |  56789   |
    | alice  | jones   | 500.93 |  67890   |
    +======================================+

The plugin will group all the 'bob smith' records together and only output the record that has the maximum cost.
Similarly, only the 'alice smith' record with the highest cost will be included in the output:

    +======================================+
    | fname  | lname   | cost   |  zipcode |
    +======================================+
    | bob    | smith   | 50.23  |  12345   |
    | bob    | jones   | 30.64  |  23456   |
    | alice  | smith   | 30.21  |  56789   |
    | alice  | jones   | 500.93 |  67890   |
    +======================================+


Widgets

  • Capitalize each word in a label, with the exception of 'a', 'an', 'of', and 'the'.
  • If a property does not have to be a text box, it probably should not be one.
  • Most widgets should specify a placeholder:
  {
    "widget-type": "textbox",
    "label": "Bucket Name",
    "name": "bucket",
    "widget-attributes": {
      "placeholder": "The bucket to be used to create directories."
    }
  }
  • Properties should be grouped into the standard 'Basic', 'Credentials', and 'Advanced' sections when possible. 
    • 'Basic' properties generally define the "core" of what the plugin does, likely including most required properties.
    • 'Credentials' properties relate to authentication and authorization, such as usernames, passwords, and account keys.
    • 'Advanced' properties are ones that new users do not need to look at. They generally relate to error scenarios and performance rather than functionality.
  • Boolean properties should use the 'radio-group' widget type and not a select dropdown.

Naming

Coming up with a good name for a plugin and its properties can be a difficult task. The user-facing name for plugins and properties is the 'label' specified in the widget JSON.

Here are some guidelines:

  • Don't put the plugin type in the name. For example, instead of 'Table Source', just use 'Table'.
  • Use 'partition' instead of 'split' or 'shard'. These all mean the same thing, but a single standard term is needed.
  • 'Reference Name' is the standard property name for identifying external datasets.
  • Use a positive name for boolean properties. For example, 'Enable Auto Commit' instead of 'Disable Auto Commit'. 

Validation

User input should be validated as early as possible, which means in the configurePipeline() method.

Some useful things to keep in mind while validating:

  • Use the containsMacro() method to check whether a property is ready to be validated. A property that still contains a macro has no concrete value yet, so its validation has to be skipped at this stage.
  • If a property is invalid, throw an InvalidConfigPropertyException with a user-friendly message. This message will be shown in the UI and should mention which property is invalid, why it is invalid, and what action the user can take to make it valid.
  • If multiple properties are invalid, throw an InvalidStagePropertyException that contains the individual exceptions as its reasons.
  • For numeric properties, can the property be 0? Can it be negative? Does it need to be within a certain range?
  • The input schema is often required to perform validation. For example, a plugin may operate on a specific field, which can only be a specific type.
  • A property cannot be null unless it is annotated as @Nullable.
  • Don't forget to handle empty strings.
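
The checklist above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the platform API: PropertyError and StageError are local stand-ins for the two exception types named above, and ExampleConfig, its 'bucket' and 'timeout' properties, and its own containsMacro() helper are hypothetical. In a real plugin, a method like validate() would presumably be called from configurePipeline().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Local stand-in for a single-property error; NOT the platform's exception class.
class PropertyError extends RuntimeException {
  final String property;

  PropertyError(String property, String message) {
    super(message);
    this.property = property;
  }
}

// Local stand-in for an error that carries several property errors as its reasons.
class StageError extends RuntimeException {
  final List<PropertyError> reasons;

  StageError(List<PropertyError> reasons) {
    super(reasons.size() + " properties are invalid.");
    this.reasons = reasons;
  }
}

// Hypothetical config with a required 'Bucket Name' and an optional 'Timeout'.
class ExampleConfig {
  private final Map<String, String> values;  // raw values keyed by backend name
  private final Set<String> macros;          // properties still holding a macro

  ExampleConfig(Map<String, String> values, Set<String> macros) {
    this.values = values;
    this.macros = macros;
  }

  boolean containsMacro(String name) {
    return macros.contains(name);
  }

  // Collects every problem before throwing, so the user sees all of them at once.
  void validate() {
    List<PropertyError> errors = new ArrayList<>();

    // Skip properties whose value is not known until the macro is evaluated.
    if (!containsMacro("bucket")) {
      String bucket = values.get("bucket");
      if (bucket == null || bucket.trim().isEmpty()) {  // null and "" both count as missing
        errors.add(new PropertyError("bucket",
            "Bucket Name is empty. Enter the name of the bucket to write to."));
      }
    }

    if (!containsMacro("timeout")) {
      String raw = values.get("timeout");
      if (raw != null && !raw.trim().isEmpty()) {  // optional: absent means use the default
        try {
          if (Integer.parseInt(raw.trim()) < 0) {
            errors.add(new PropertyError("timeout",
                "Timeout is negative. Enter 0 for no timeout, or a positive number of seconds."));
          }
        } catch (NumberFormatException e) {
          errors.add(new PropertyError("timeout",
              "Timeout is not a whole number. Enter a value such as 30."));
        }
      }
    }

    if (errors.size() == 1) {
      throw errors.get(0);           // one invalid property: report it directly
    }
    if (errors.size() > 1) {
      throw new StageError(errors);  // several: report them together
    }
  }
}
```

Each message names the property by its UI label, says what is wrong, and suggests a fix. Collecting the errors before throwing lets a user correct every invalid property in one pass instead of discovering them one at a time.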