...

  • Names should be the names shown in the UI and not the names used in the backend.
  • Do not mention the format that the backend expects if the UI does not expose the format. For example, do not mention that 'fields' is a comma-separated list of fields if the property uses the 'csv' widget.
  • The first "sentence" is a fragment. You can think of it starting with an implicit "This property is ".
  • Always end the description with a period.
  • Mention restrictions or special values for properties. For example, document that a timeout property cannot be below 0 and that 0 means there is no timeout.
  • For numeric properties, include the unit. For example, instead of 'timestamp', use 'timestamp in seconds'. Instead of 'size', use 'size (GB)'.
  • Reference Name should always have the same description for sources and the same description for sinks – "Used to uniquely identify this <source/sink> for lineage, annotating metadata, and other governance operations."


For example:

Properties
----------
**Reference Name:** Used to uniquely identify this sink for lineage, annotating metadata, and other governance operations.

**Project ID**: The Google Cloud Project ID, which uniquely identifies a project.
It can be found on the Dashboard in the Google Cloud Platform Console.

**Service Account File Path**: Path on the local file system of the service account key used for
authorization. Does not need to be specified when running on a Dataproc cluster.
When running on other clusters, the file must be present on every node in the cluster.

...

Example
-------
In this example, the plugin is configured with the unique fields as `fname,lname` and the filter operation as `max(cost)`.


Suppose the input records are:

    +======================================+
    | fname  | lname   | cost   |  zipcode |
    +======================================+
    | bob    | smith   | 50.23  |  12345   |
    | bob    | smith   | 0.50   |  45678   |
    | bob    | jones   | 30.64  |  23456   |
    | alice  | smith   | 1.50   |  34567   |
    | alice  | smith   | 30.21  |  56789   |
    | alice  | jones   | 500.93 |  67890   |
    +======================================+

The plugin will group all the 'bob smith' records together and only output the record that has the maximum cost.
Similarly, only the 'alice smith' record with the highest cost will be included in the output:

    +======================================+
    | fname  | lname   | cost   |  zipcode |
    +======================================+
    | bob    | smith   | 50.23  |  12345   |
    | bob    | jones   | 30.64  |  23456   |
    | alice  | smith   | 30.21  |  56789   |
    | alice  | jones   | 500.93 |  67890   |
    +======================================+
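
The grouping behavior described in the example above can be sketched in plain Java. This is only an illustration of the "unique fields plus `max(cost)`" logic, not the plugin's actual implementation; the `Purchase` and `MaxCostDeduplicator` classes are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record type mirroring the columns of the example table.
class Purchase {
    final String fname;
    final String lname;
    final double cost;
    final String zipcode;

    Purchase(String fname, String lname, double cost, String zipcode) {
        this.fname = fname;
        this.lname = lname;
        this.cost = cost;
        this.zipcode = zipcode;
    }
}

// Hypothetical helper: keeps one record per (fname, lname), the one with the highest cost.
class MaxCostDeduplicator {
    static List<Purchase> dedupe(List<Purchase> input) {
        // LinkedHashMap keeps groups in the order in which they were first seen.
        Map<String, Purchase> best = new LinkedHashMap<>();
        for (Purchase p : input) {
            // Join the unique fields with a separator that cannot appear in a name.
            String key = p.fname + "\u0000" + p.lname;
            // Keep whichever record in the group has the larger cost.
            best.merge(key, p, (current, candidate) ->
                candidate.cost > current.cost ? candidate : current);
        }
        return new ArrayList<>(best.values());
    }
}
```

Calling `MaxCostDeduplicator.dedupe` on the six input records yields the four records in the output table, in the same order.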


Widgets

  • Labels should be capitalized, with the exception of 'a', 'an', 'of', 'the'.
  • If a property does not have to be a text box, it probably should not be one.
  • Most widgets should specify a placeholder:
  {
    "widget-type": "textbox",
    "label": "Bucket Name",
    "name": "bucket",
    "widget-attributes" : {
      "placeholder": "The bucket to be used to create directories."
    }
  }
  • Properties should be grouped into the standard 'Basic', 'Credentials', and 'Advanced' sections when possible.
    • 'Basic' properties generally define the "core" of what the plugin does, likely including most required properties.
    • 'Credentials' properties relate to authentication and authorization, such as usernames, passwords, and account keys.
    • 'Advanced' properties are things new users don't need to look at. They generally relate to error scenarios and performance rather than functionality.
  • Boolean properties should use the 'radio-group' widget type and not a select dropdown.

Naming

Coming up with a good name for a plugin and its properties can be a difficult task. The user-facing name for plugins and properties is the 'label' specified in the widget JSON.

Here are some guidelines:

  • Don't put the plugin type in the name. For example, instead of 'Table Source', just use 'Table'.
  • Use 'partition' instead of 'split' or 'shard'. These all mean the same thing, but we need a single standard term.
  • 'Reference Name' is the standard name for external datasets.
  • Use a positive name for boolean properties. For example, 'Enable Auto Commit' instead of 'Disable Auto Commit'.

Validation

...

User input should be validated as early as possible, which means in the configurePipeline() method.

Some useful things to keep in mind while validating:

  • Use the containsMacro() method to check if the property is ready to be validated. A property that contains a macro has no concrete value until runtime.
  • If a property is invalid, throw an InvalidConfigPropertyException with a user-friendly message. This message will be shown in the UI and should mention which property is invalid, why it is invalid, and what action the user can take to make it valid.
  • If multiple properties are invalid, throw an InvalidStagePropertyException that contains the individual exceptions as its reasons.
  • For numeric properties, can the property be 0? Can it be negative? Does it need to be within a certain range?
  • The input schema is often required to perform validation. For example, a plugin may operate on a specific field, which can only be a specific type.
  • A property cannot be null unless it is annotated as @Nullable.
  • Don't forget to handle empty strings.
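
The checklist above can be sketched as a small, framework-free validator. This is a minimal sketch under stated assumptions, not the plugin API: `PluginConfigValidator` is a hypothetical class, the `bucketIsMacro` and `timeoutIsMacro` flags stand in for the result of containsMacro(), and the method returns messages instead of throwing the exceptions named above so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical validator; real plugins would run these checks in configurePipeline().
class PluginConfigValidator {

    // Returns one user-facing message per invalid property; an empty list means valid.
    static List<String> validate(String bucketName, boolean bucketIsMacro,
                                 Integer timeoutSeconds, boolean timeoutIsMacro) {
        List<String> failures = new ArrayList<>();

        // Skip properties set by a macro: their values are not known yet.
        if (!bucketIsMacro) {
            // A required property: reject null, empty, and blank strings alike.
            if (bucketName == null || bucketName.trim().isEmpty()) {
                failures.add("'Bucket Name' must be specified. "
                    + "Enter the name of the bucket to use.");
            }
        }

        // An optional (nullable) numeric property: 0 means no timeout, negatives are invalid.
        if (!timeoutIsMacro && timeoutSeconds != null && timeoutSeconds < 0) {
            failures.add("'Timeout' is " + timeoutSeconds + " but cannot be negative. "
                + "Use 0 for no timeout or a positive number of seconds.");
        }

        return failures;
    }
}
```

Each message names the property by its UI label, says why the value is invalid, and tells the user how to fix it. Collecting all failures before reporting lets the user correct every invalid property in one pass.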