
The purpose of this page is to define best practices around pipeline plugins. It is important for our plugins to present a consistent style and convention so that users will know what to expect.

Documentation

Pipelines are designed to be used by non-programmers. When writing documentation for a plugin, keep that type of user in mind. Except in the properties section, use complete, grammatically correct sentences. Here are some general guidelines to keep in mind when writing documentation.


Avoid the second person ("you") in most scenarios. For example, instead of "This plugin allows you to read from HDFS", write "This plugin reads from HDFS".


If an acronym is used, make sure the first usage spells out the entire name with the acronym in parentheses. Subsequent references can use the acronym. For example, "This plugin reads a directory from the Hadoop Distributed File System (HDFS). It will read from the same HDFS ...."  


Be mindful of your vocabulary: use plain English when possible. For example, instead of "If there is an exception parsing the field...", write "If there is an error parsing the field..."

Common Terminology

It is important to use the same terminology throughout our plugin documentation:


Stage – The individual node in the pipeline DAG. Avoid calling this a 'node' or a 'plugin'.

Plugin – Defines the type and functionality of a stage. Corresponds to a box on the left-hand side of the pipeline studio.

Partition – A single part of the entire data. Use this instead of 'split' or 'shard' or 'mapper/reducer'.

Reference Doc

Reference docs should contain at least two sections: Description and Properties. Oftentimes, it is useful to have an Examples section. 

Description

The description should contain information about what the plugin does. It does not need to go into great detail about every property it supports. Instead, it should mention high level use cases.

If the plugin is a source or a sink, it is often useful to include a couple of sentences about the system it reads from or writes to, so that a user who is unfamiliar with the plugin can quickly determine whether it is the right plugin to use. For example, a user might think the Table source reads from a relational database table and not a CDAP table. By reading the description for the Table source, the user should be able to quickly realize it is not the plugin they are looking for and move on.

Properties

Describe all the plugin properties in the order that they appear in the UI. Keep the following points in mind when writing the descriptions:

  • Names should be the names shown in the UI and not the names used in the backend.
  • Do not mention the format that the backend expects if the UI does not expose the format. For example, do not mention that 'fields' is a comma-separated list of fields if the property uses the 'csv' widget.
  • The first "sentence" is a fragment. Think of it as starting with an implicit "This property is ".
  • Always end the description with a period.
  • Mention restrictions or special values for properties. For example, document that a timeout property cannot be below 0 and that 0 means there is no timeout.
  • For numeric properties, include the unit. For example, instead of 'timestamp', use 'timestamp in seconds'. Instead of 'size', use 'size (GB)'.


For example:

Properties
----------
**Reference Name**: Used to uniquely identify this sink for lineage, annotating metadata, etc.

**Project ID**: The Google Cloud Project ID, which uniquely identifies a project.
It can be found on the Dashboard in the Google Cloud Platform Console.

**Service Account File Path**: Path on the local file system of the service account key used for
authorization. Does not need to be specified when running on a Dataproc cluster.
When running on other clusters, the file must be present on every node in the cluster.
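
Some of the rules above can be checked mechanically. The following is a minimal sketch, assuming each property entry starts with a bold name as in the example; the function name `check_property_entry` and the regular expression are hypothetical and are not part of any existing plugin tooling. It checks only that a description is present and ends with a period.

```python
import re

# Hypothetical pattern: a bold property name, with the colon either inside
# or outside the bold markers, followed by the description text.
PROPERTY_RE = re.compile(
    r"^\*\*(?P<name>[^*:]+):?\*\*:?\s*(?P<desc>.+)$", re.DOTALL
)


def check_property_entry(entry):
    """Return a list of style problems found in one property entry."""
    match = PROPERTY_RE.match(entry.strip())
    if match is None:
        return ["entry does not start with a bold property name"]

    problems = []
    # Collapse line breaks so multi-line descriptions are checked as one string.
    description = " ".join(match.group("desc").split())
    if not description.endswith("."):
        problems.append("description does not end with a period")
    return problems
```

A check like this catches only the mechanical rules. Whether units and restrictions are documented still needs a human reviewer.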


Examples

Many plugin types benefit from including a couple of examples in the reference doc. Sources and sinks often do not benefit from examples, but most other plugin types do. When writing an example, include how the plugin is configured and what output it would generate for some given input.

For example:

Example
-------
In this example, the plugin is configured with the unique fields as `fname,lname` and the filter operation as `max(cost)`.


Suppose the input records are:

    +======================================+
    | fname  | lname   | cost   |  zipcode |
    +======================================+
    | bob    | smith   | 50.23  |  12345   |
    | bob    | smith   | 0.50   |  45678   |
    | bob    | jones   | 30.64  |  23456   |
    | alice  | smith   | 1.50   |  34567   |
    | alice  | smith   | 30.21  |  56789   |
    | alice  | jones   | 500.93 |  67890   |
    +======================================+

The plugin will group all the 'bob smith' records together and only output the record that has the maximum cost.
Similarly, only the 'alice smith' record with the highest cost will be included in the output:

    +======================================+
    | fname  | lname   | cost   |  zipcode |
    +======================================+
    | bob    | smith   | 50.23  |  12345   |
    | bob    | jones   | 30.64  |  23456   |
    | alice  | smith   | 30.21  |  56789   |
    | alice  | jones   | 500.93 |  67890   |
    +======================================+
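
When writing an example like the one above, it can help to verify the expected output with a short script before publishing it. The following is a minimal sketch of the grouping logic described in the example, not the plugin's actual implementation; the function name `keep_max_per_group` is hypothetical, and the output order shown here is not necessarily what a real pipeline run would produce.

```python
def keep_max_per_group(records, unique_fields, value_field):
    """Keep, for each combination of unique_fields, the record with the largest value_field."""
    best = {}
    for record in records:
        key = tuple(record[field] for field in unique_fields)
        if key not in best or record[value_field] > best[key][value_field]:
            best[key] = record
    # Dicts preserve insertion order, so groups appear in first-seen order.
    return list(best.values())


records = [
    {"fname": "bob", "lname": "smith", "cost": 50.23, "zipcode": "12345"},
    {"fname": "bob", "lname": "smith", "cost": 0.50, "zipcode": "45678"},
    {"fname": "bob", "lname": "jones", "cost": 30.64, "zipcode": "23456"},
    {"fname": "alice", "lname": "smith", "cost": 1.50, "zipcode": "34567"},
    {"fname": "alice", "lname": "smith", "cost": 30.21, "zipcode": "56789"},
    {"fname": "alice", "lname": "jones", "cost": 500.93, "zipcode": "67890"},
]

output = keep_max_per_group(records, ["fname", "lname"], "cost")
for record in output:
    print(record["fname"], record["lname"], record["cost"], record["zipcode"])
```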


Widgets

Capitalize each word in a label, with the exception of 'a', 'an', 'of', and 'the'.
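
The capitalization rule can be sketched as a small helper. The function name `format_label` is hypothetical, and the sketch makes one assumption that this page does not state: the first word of a label is always capitalized, even if it is one of the exception words.

```python
# Words that stay lowercase, as listed in the rule above.
LOWERCASE_WORDS = {"a", "an", "of", "the"}


def format_label(text):
    """Capitalize each word of a label except the listed exception words."""
    formatted = []
    for index, word in enumerate(text.split()):
        if index > 0 and word.lower() in LOWERCASE_WORDS:
            formatted.append(word.lower())
        else:
            # Uppercase only the first character so acronyms such as "ID" survive.
            formatted.append(word[:1].upper() + word[1:])
    return " ".join(formatted)


print(format_label("path of the service account file"))
```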


Naming

Coming up with a good name for a plugin and its properties can be a difficult task. The user-facing name for plugins and properties is the 'label' specified in the widget JSON.
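
To illustrate the difference between the label and the backend name, the following sketch reads a simplified widget JSON and maps each backend property name to the label a user sees. The JSON layout shown here (`configuration-groups`, `properties`, `name`, `label`) is an assumption for illustration and may not match the real widget specification exactly; the property names and the function `labels_by_backend_name` are hypothetical.

```python
import json

# Simplified, assumed widget JSON layout; check the real widget
# specification before relying on these keys.
widget_json = """
{
  "configuration-groups": [
    {
      "label": "Basic",
      "properties": [
        {"widget-type": "textbox", "label": "Reference Name", "name": "referenceName"},
        {"widget-type": "textbox", "label": "Project ID", "name": "project"}
      ]
    }
  ]
}
"""


def labels_by_backend_name(widget_text):
    """Map each backend property name to its user-facing label."""
    widget = json.loads(widget_text)
    labels = {}
    for group in widget.get("configuration-groups", []):
        for prop in group.get("properties", []):
            labels[prop["name"]] = prop["label"]
    return labels


print(labels_by_backend_name(widget_json))
```

In the reference doc, the property would be documented as "Project ID", never as "project".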



Validation


Error Handling


