Introduction

A separate database plugin to support MongoDB-specific features and configurations.

...

The following example uses the query filter document '{ status: { $in: [ "A", "D" ] } }' to retrieve all documents from the 'inventory' collection where 'status' equals either "A" or "D":

...

Code Block
SELECT * FROM inventory WHERE status in ("A", "D")
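
As a rough illustration of what the '$in' filter selects, the sketch below evaluates the same filter document against a few in-memory documents. The 'matches' helper and the sample 'inventory' data are hypothetical, and the helper handles only plain equality and '$in' on scalar fields; it is not how MongoDB or the plugin evaluates queries.

```python
def matches(document, query_filter):
    """Return True if `document` satisfies a filter limited to equality and $in."""
    for field, condition in query_filter.items():
        value = document.get(field)
        if isinstance(condition, dict) and "$in" in condition:
            # $in: the field value must be one of the listed values.
            if value not in condition["$in"]:
                return False
        elif value != condition:
            return False
    return True


# Hypothetical sample data standing in for the 'inventory' collection.
inventory = [
    {"item": "journal", "status": "A"},
    {"item": "notebook", "status": "B"},
    {"item": "paper", "status": "D"},
]
query_filter = {"status": {"$in": ["A", "D"]}}

selected = [doc["item"] for doc in inventory if matches(doc, query_filter)]
print(selected)
```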

Sink Properties

| User Facing Name | Type | Widget Type | Description | Constraints |
| --- | --- | --- | --- | --- |
| Label | String | textbox | Label for UI. | |
| Reference Name | String | textbox | Uniquely identified name for lineage. | |
| Host | String | textbox | Host that MongoDB is running on. | Required (defaults to localhost on UI) |
| Port | Number | number | Port that MongoDB is listening to. | Optional (default 27017) |
| Database | String | textbox | MongoDB database name. | Required |
| Collection | String | textbox | Name of the database collection to write to. | Required |
| ID Field | | textbox | Allows the user to specify which of the incoming fields should be used as an object identifier. | Optional. Object ID will be generated if no value is specified. |
| Username | String | textbox | User identity for connecting to the specified database. | |
| Password | Password | password | Password to use to connect to the specified database. | |
| Connection Arguments | Keyvalue | keyvalue | A list of arbitrary string key/value pairs as connection arguments. See Connection String Options for a full description of these arguments. | |
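
The properties above map naturally onto a standard MongoDB connection string. The following is a minimal sketch, assuming the plugin composes a 'mongodb://' URI from Host, Port, Database, Username, Password and Connection Arguments; the 'build_connection_uri' function and the sample values are hypothetical and not taken from the plugin code.

```python
from urllib.parse import quote_plus, urlencode


def build_connection_uri(host, port, database, username=None, password=None,
                         connection_arguments=None):
    """Compose a mongodb:// URI from the plugin-style properties."""
    credentials = ""
    if username:
        # Percent-encode credentials so characters such as '@' or ':' stay valid.
        credentials = quote_plus(username)
        if password:
            credentials += ":" + quote_plus(password)
        credentials += "@"
    uri = f"mongodb://{credentials}{host}:{port}/{database}"
    if connection_arguments:
        # Connection Arguments become the URI query string.
        uri += "?" + urlencode(connection_arguments)
    return uri


uri = build_connection_uri(
    "localhost", 27017, "inventory_db",
    username="admin", password="p@ss",
    connection_arguments={"authSource": "admin", "connectTimeoutMS": "3000"},
)
print(uri)
```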


Sink Data Types Mapping

To support all data types in the sink, we can use the MongoDB extended JSON format and/or infer the data type of a record field from its name.

The table below lists how CDAP data types are stored; it does not cover non-standard MongoDB data types.

The following MongoDB data types are missing: Undefined, Regular Expression, DBPointer, JavaScript, Symbol, JavaScript (with scope), Timestamp, Min key, Max key.

| CDAP Schema Data Type | MongoDB Data Types |
| --- | --- |
| boolean | Boolean |
| bytes | Binary data, ObjectId (if 'ID Field' specified) |
| date | Date |
| double | Double |
| decimal | Decimal128 |
| float | Double |
| int | 32-bit integer |
| long | 64-bit integer |
| string | String, ObjectId (if 'ID Field' specified) |
| time | String |
| timestamp | Date |
| array | Array |
| record | Object |
| enum | Array |
| map | Object |
| union | Array |
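
The 'bytes' and 'string' rows allow an ObjectId when 'ID Field' is specified. An ObjectId is 12 bytes, usually written as 24 hexadecimal characters, so the incoming value has to be checked before it can be used as an identifier. The sketch below shows one way such a check could look; the 'object_id_hex' helper is hypothetical, and a real sink would presumably construct a driver-level ObjectId instead of returning a hex string.

```python
import re

_HEX_24 = re.compile(r"[0-9a-fA-F]{24}")


def object_id_hex(id_value):
    """Normalize an ID Field value (str or bytes) to a 24-character hex string."""
    if isinstance(id_value, (bytes, bytearray)):
        if len(id_value) != 12:
            raise ValueError("an ObjectId requires exactly 12 bytes")
        return bytes(id_value).hex()
    if isinstance(id_value, str):
        if not _HEX_24.fullmatch(id_value):
            raise ValueError("an ObjectId string requires 24 hexadecimal characters")
        return id_value.lower()
    raise TypeError("ID Field must be of type string or bytes")


from_string = object_id_hex("5D3F1C2A2F547625B0BBB397")
from_bytes = object_id_hex(bytes.fromhex("5d3f1c2a2f547625b0bbb397"))
print(from_string, from_bytes)
```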


Source Properties

| User Facing Name | Type | Widget Type | Description | Constraints |
| --- | --- | --- | --- | --- |
| Label | String | textbox | Label for UI. | |
| Reference Name | String | textbox | Uniquely identified name for lineage. | |
| Host | String | textbox | Host that MongoDB is running on. | Required (defaults to localhost on UI) |
| Port | Number | number | Port that MongoDB is listening to. | Optional (default 27017) |
| Database | String | textbox | MongoDB database name. | Required |
| Collection | String | textbox | Name of the database collection to read from. | Required |
| Output Schema | Schema | schema | Specifies the schema of the documents. | Required |
| On Record Error | | select | Specifies how to handle errors in record processing. An error is thrown if a value cannot be parsed according to the provided schema. Possible values are: Skip error, Fail pipeline, Write to error dataset. | Default: 'Fail pipeline' |
| Error dataset | | textbox | Name of the dataset to store error records. | Optional. Default: Reference Name + "-error" |
| Input Query | String | json-editor | Optionally filter the input collection with a query. This query must be represented in JSON format and use the MongoDB extended JSON format to represent non-native JSON data types. | |
| Username | String | textbox | User identity for connecting to the specified database. | |
| Password | Password | password | Password to use to connect to the specified database. | |
| Authentication Connection String | | textbox | Auxiliary MongoDB connection string to authenticate against when constructing splits. | |
| Connection Arguments | Keyvalue | keyvalue | A list of arbitrary string key/value pairs as connection arguments. See Connection String Options for a full description of these arguments. | |
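
To make the Input Query property more concrete, the sketch below builds a query as a JSON string, using the extended JSON '$oid' form for an ObjectId, which plain JSON cannot represent. The field names and values are made up for illustration, and the exact form the plugin expects is an assumption.

```python
import json

# Hypothetical filter: status is "A" or "D", and _id is a specific ObjectId.
input_query = {
    "status": {"$in": ["A", "D"]},
    "_id": {"$oid": "5d3f1c2a2f547625b0bbb397"},
}

# The JSON text is what would be entered in the json-editor widget.
input_query_json = json.dumps(input_query)
print(input_query_json)
```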



Source Data Types Mapping

...

The source requires Output Schema to be set. Based on the schema, the source expects each field in a document to be of a specific MongoDB data type.

The 'On Record Error' property lets the user decide whether the pipeline should fail, the record should be skipped, or the record should be sent to the error dataset.
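
The three error-handling choices can be sketched as follows. This is an illustration only: 'OnRecordError', 'process_documents' and 'parse_quantity' are hypothetical names, and the error dataset is modelled as a plain list rather than a CDAP dataset.

```python
from enum import Enum


class OnRecordError(Enum):
    SKIP = "Skip error"
    FAIL = "Fail pipeline"
    WRITE_TO_ERROR_DATASET = "Write to error dataset"


def process_documents(documents, parse, on_error=OnRecordError.FAIL):
    """Parse each document; handle parse failures according to `on_error`."""
    records, error_dataset = [], []
    for document in documents:
        try:
            records.append(parse(document))
        except (KeyError, TypeError, ValueError) as exc:
            if on_error is OnRecordError.FAIL:
                raise  # fail the whole run on the first bad record
            if on_error is OnRecordError.WRITE_TO_ERROR_DATASET:
                error_dataset.append({"document": document, "error": str(exc)})
            # SKIP: the record is dropped without being stored anywhere
    return records, error_dataset


def parse_quantity(document):
    # Stand-in for schema-driven parsing: 'quantity' must be an integer.
    return {"quantity": int(document["quantity"])}


docs = [{"quantity": "3"}, {"quantity": "many"}, {"quantity": "7"}]
records, errors = process_documents(
    docs, parse_quantity, OnRecordError.WRITE_TO_ERROR_DATASET
)
print(records, len(errors))
```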

The following table shows what MongoDB data types can be read as CDAP types.

| MongoDB Data Type | CDAP Schema Data Type | Support | Comment |
| --- | --- | --- | --- |
| Double | Schema.Type.DOUBLE | + | |
| String | Schema.Type.STRING | + | |
| Object | Schema.Type.RECORD | + | |
| Array | Schema.Type.ARRAY | + | |
| Binary data | Schema.Type.BYTES | * | Value can be mapped to Schema.Type.BYTES, but this can lead to subtype information loss. Subtypes: generic: \x00 (0), function: \x01 (1), old: \x02 (2), uuid_old: \x03 (3), uuid: \x04 (4), md5: \x05 (5), user: \x80 (128). There are several options: 1) Support only the 'generic' subtype. 2) Map using MongoDB extended JSON format: "binary": {"$binary": "YmluYXJ5IGRhdGE=", "$type": "00"} |
| Undefined | Schema.Type.NULL | * | Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "undefined": {"$undefined": true} |
| ObjectId | | * | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink. 2) Map using MongoDB extended JSON format: {"$oid": "5d3f1c2a2f547625b0bbb397"} |
| Boolean | Schema.Type.BOOLEAN | + | |
| Date | Schema.LogicalType.TIMESTAMP_MILLIS | + | |
| Null | Schema.Type.UNION | + | A nullable version of the actual type, corresponds to Schema.nullableOf(actualTypeSchema). |
| Regular Expression | Schema.Type.STRING | * | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink. 2) Map using MongoDB extended JSON format: "regex": {"$regex": ".", "$options": ""} |
| DBPointer | Schema.Type.STRING | * | String in MongoDB extended JSON format: "dbpointer": {"$ref": "source", "$id": {"$oid": "5d079ee6d078c94008e4bb3a"}} |
| JavaScript | Schema.Type.STRING | * | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink. 2) Map using MongoDB extended JSON format: "javascript": {"$code": "var l = 1;"} |
| Symbol | Schema.Type.STRING | * | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink. 2) Map using MongoDB extended JSON format: "symbol": {"$symbol": "a"} |
| JavaScript (with scope) | Schema.Type.STRING | * | Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "javascriptwithscope": {"$code": "var l = 1;", "$scope": {"scope": "scope_val"}} |
| 32-bit integer | Schema.Type.INT | + | |
| Timestamp | | * | Special type for internal MongoDB use which is not associated with the regular Date type. Timestamp values are 64-bit values where the first 32 bits are a time_t value (seconds since the Unix epoch) and the second 32 bits are an incrementing ordinal for operations within a given second. Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "timestamp": {"$timestamp": {"t": 1564410161, "i": 1}} |
| 64-bit integer | Schema.Type.LONG | + | |
| Decimal128 | Schema.LogicalType.DECIMAL | + | |
| Min key | | * | Is less than any other value of any type. This can be useful for always returning certain documents first (or last). Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "minkey": {"$minKey": 1} |
| Max key | | * | Is greater than any other value of any type. This can be useful for always returning certain documents first (or last). Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "maxkey": {"$maxKey": 1} |
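
The extended JSON forms quoted for Binary data and Timestamp carry more information than a plain string. The sketch below decodes the two sample values from this page using only the Python standard library, to show what is preserved (the binary subtype, the ordinal within a second). The 'decode_binary' and 'decode_timestamp' helpers are hypothetical, and the key layout follows the examples on this page rather than any particular driver API.

```python
import base64
from datetime import datetime, timezone


def decode_binary(extended):
    """Return (payload bytes, subtype number) from a {"$binary", "$type"} pair."""
    payload = base64.b64decode(extended["$binary"])
    subtype = int(extended["$type"], 16)  # "00" is the 'generic' subtype
    return payload, subtype


def decode_timestamp(extended):
    """Return (UTC datetime, ordinal) from a {"$timestamp": {"t", "i"}} value."""
    value = extended["$timestamp"]
    # "t" holds seconds since the Unix epoch, "i" the ordinal within that second.
    return datetime.fromtimestamp(value["t"], tz=timezone.utc), value["i"]


payload, subtype = decode_binary({"$binary": "YmluYXJ5IGRhdGE=", "$type": "00"})
when, ordinal = decode_timestamp({"$timestamp": {"t": 1564410161, "i": 1}})
print(payload, subtype, when.isoformat(), ordinal)
```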
| CDAP Schema Data Type | MongoDB Data Types |
| --- | --- |
| boolean | Boolean |
| bytes | Binary data, ObjectId |
| date | - |
| double | Double |
| decimal | Decimal128 |
| float | - |
| int | 32-bit integer |
| long | 64-bit integer |
| string | String, Symbol, JavaScript, Boolean, Double, Decimal128, 32-bit integer, 64-bit integer, ObjectId, Regular Expression, Date, Binary data (* all data types can be read as a string) |
| time | - |
| timestamp | Date |
| array | Array |
| record | Object (see the 'record' notes below) |
| enum | - |
| map | Object (see the 'map' notes below) |
| union | - |

record

The following schema:

Code Block
{"type":"record","name":"object","fields":[{"name":"inner_field","type":"string"}]}

is used for the 'object' field:

Code Block
{
  "object" : {
    "inner_field" : "val"
  }
}

* We can map all non-standard data types to record, like JavaScript (with scope) in the example below.

The following schema:

Code Block
{
  "type":"record",
  "name":"javascriptwithscope",
  "fields":[
    {"name":"$code","type":"string"},
    {"name":"$scope","type":{"type":"record","name":"scope-object-record","fields":[{"name":"scope","type":"string"}]}}
  ]
}

is used for the 'javascriptwithscope' field:

Code Block
{
  "javascriptwithscope" : { "$code" : "var l = 1;", "$scope" : { "scope" : "scope_val" } }
}

map

The following schema:

Code Block
{"type":"map","keys":"string","values":"string"}

is used for the 'object' field:

Code Block
{
  "object" : {
    "inner_field" : "val"
  }
}
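
The record and map examples above both describe the same 'object' document. As a rough sketch of how a value could be checked against such a schema before it is read, the hypothetical 'conforms' helper below supports only the 'string', 'record' and 'map' types; the source's actual validation logic is not specified on this page.

```python
def conforms(value, schema):
    """Check a decoded document value against a small subset of schema types."""
    if schema == "string":
        return isinstance(value, str)
    if isinstance(schema, dict) and schema.get("type") == "record":
        # Every declared field must be present and match its declared type.
        return isinstance(value, dict) and all(
            field["name"] in value and conforms(value[field["name"]], field["type"])
            for field in schema["fields"]
        )
    if isinstance(schema, dict) and schema.get("type") == "map":
        # Any keys are allowed as long as keys and values match their types.
        return isinstance(value, dict) and all(
            conforms(key, schema["keys"]) and conforms(item, schema["values"])
            for key, item in value.items()
        )
    return False


record_schema = {
    "type": "record",
    "name": "object",
    "fields": [{"name": "inner_field", "type": "string"}],
}
map_schema = {"type": "map", "keys": "string", "values": "string"}
document = {"object": {"inner_field": "val"}}

print(conforms(document["object"], record_schema))
print(conforms(document["object"], map_schema))
```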


Approach

Move the existing mongodb-plugins module to the mongodb-plugins project. Add MongoDB-specific properties to the configuration and add support for MongoDB-specific data types. Update the UI widget JSON definitions.

...