Introduction

...

The proposal is to move the existing mongodb-plugins module to the mongodb-plugins repository.


MongoDB Overview

Document database

A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.

Code Block
{
    "_id" : ObjectId("5d3f1c2a2f547625b0bbb397"),
    "string" : "AAPL",
    "int32" : 10,
    "double" : 23.23,
    "array" : [ 
        "a1", 
        "a2"
    ],
    "object" : {
        "inner_field" : "val"
    },
    "binary" : { "$binary" : "YmluYXJ5IGRhdGE=", "$type" : "00" },
    "undefined" : undefined,
    "boolean" : false,
    "date" : ISODate("2019-07-29T16:17:46.109Z"),
    "null" : null,
    "regex" : /./,
    "dbpointer" : DBRef("source", "5d079ee6d078c94008e4bb3a"),
    "javascript" : { "$code" : "var l = 1;" },
    "javascriptwithscope" : { "$code" : "var l = 1;", "$scope" : { "scope" : "scope_val" } },
    "symbol" : "a",
    "timestamp" : Timestamp(1564417066, 1),
    "long" : NumberLong(9223372036854775807),
    "decimal" : NumberDecimal("3.100000"),
    "minkey" : { "$minKey" : 1 },
    "maxkey" : { "$maxKey" : 1 }
}

Document limitations

  • The maximum BSON document size is 16 megabytes.
  • In MongoDB, each document stored in a collection requires a unique _id field that acts as a primary key. If an inserted document omits the _id field, the MongoDB driver automatically generates an ObjectId for the _id field.
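
The two limitations above can be illustrated with a minimal Python sketch that uses only the standard library. It is not how any MongoDB driver is implemented: the helper names (new_object_id_hex, ensure_id, fits_size_limit) are hypothetical, and the generated value only imitates the ObjectId layout (a 4-byte timestamp, a 5-byte random value, and a 3-byte counter).

```python
import os
import struct
import time

# 16 megabytes, the maximum BSON document size stated above.
MAX_BSON_DOCUMENT_SIZE = 16 * 1024 * 1024

# Hypothetical per-process state, imitating the ObjectId layout.
_process_random = os.urandom(5)
_counter = int.from_bytes(os.urandom(3), "big")


def new_object_id_hex():
    """Return a 24-character hex string laid out like an ObjectId."""
    global _counter
    _counter = (_counter + 1) % (1 << 24)
    raw = (
        struct.pack(">I", int(time.time()))  # 4-byte timestamp in seconds
        + _process_random                    # 5-byte random value
        + _counter.to_bytes(3, "big")        # 3-byte incrementing counter
    )
    return raw.hex()


def ensure_id(document):
    """Return a copy of the document, adding a generated _id if it is absent."""
    if "_id" in document:
        return dict(document)
    return {"_id": new_object_id_hex(), **document}


def fits_size_limit(encoded_document):
    """Check an already BSON-encoded document (bytes) against the size limit."""
    return len(encoded_document) <= MAX_BSON_DOCUMENT_SIZE
```

A sink would typically rely on the driver for both steps; the sketch only makes the rules concrete.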

Flexible schema

Unlike SQL databases, where you must determine and declare a table’s schema before inserting data, MongoDB’s collections, by default, do not require their documents to have the same schema.

  • The documents in a single collection do not need to have the same set of fields, and the data type for a field can differ across documents within a collection.
  • To change the structure of the documents in a collection, such as adding new fields, removing existing fields, or changing field values to a new type, update the documents to the new structure.
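
As a rough illustration of the two points above, the following Python sketch models a collection as a list of plain dictionaries. The sample documents, the field names, and the migrate helper are made up for this example and do not come from the plugin.

```python
def migrate(document):
    """Bring one document to the new structure without touching the original."""
    updated = dict(document)
    # Change a field value to a new type: a quantity stored as text becomes an int.
    if isinstance(updated.get("qty"), str):
        updated["qty"] = int(updated["qty"])
    # Add a new field, keeping any value the document already has.
    updated.setdefault("status", "A")
    # Remove an obsolete field if it is present.
    updated.pop("legacy_code", None)
    return updated


# Documents in the same collection with different fields and field types.
inventory = [
    {"_id": 1, "item": "journal", "qty": 25},
    {"_id": 2, "item": "notebook", "qty": "50", "legacy_code": "N-50"},
    {"_id": 3, "item": "paper", "status": "D"},
]

migrated = [migrate(document) for document in inventory]
```

For a source plugin, this flexibility means that a fixed output schema may not match every document in the collection, so missing fields and differing types need to be handled explicitly.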

Query filter documents

A query filter document and query operators can be used to specify conditions.

The following example uses '{ status: { $in: [ "A", "D" ] } }' query filter document to retrieve all documents from the 'inventory' collection where 'status' equals either "A" or "D":

Code Block
db.inventory.find( { status: { $in: [ "A", "D" ] } } )

The operation corresponds to the following SQL statement:

Code Block
SELECT * FROM inventory WHERE status in ("A", "D")
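
To make the semantics of the filter document concrete, here is a small Python sketch that evaluates such a filter against in-memory dictionaries. It is a simplification under stated assumptions: the matches function is hypothetical, it supports only plain equality and $in, and it ignores array-valued fields, which MongoDB itself treats specially.

```python
def matches(document, query_filter):
    """Return True if the document satisfies every condition in the filter."""
    for field, condition in query_filter.items():
        value = document.get(field)
        if isinstance(condition, dict) and "$in" in condition:
            # $in: the field value must equal one of the listed values.
            if value not in condition["$in"]:
                return False
        elif value != condition:
            # Plain equality condition.
            return False
    return True


# Made-up sample data standing in for the 'inventory' collection.
inventory = [
    {"item": "journal", "status": "A"},
    {"item": "notebook", "status": "P"},
    {"item": "paper", "status": "D"},
    {"item": "planner"},
]

query_filter = {"status": {"$in": ["A", "D"]}}
selected = [document for document in inventory if matches(document, query_filter)]
```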

Sink Properties

| User Facing Name | Type | Description | Constraints |
|---|---|---|---|
| Label | String | Label for UI. | |
| Reference Name | String | Uniquely identified name for lineage. | |
| Host | String | Host that MongoDB is running on. | Required (defaults to localhost on UI) |
| Port | Number | Port that MongoDB is listening to. | Optional (default 27017) |
| Database | String | MongoDB database name. | Required |
| Collection | String | Name of the database collection to write to. | Required |
| Username | String | User identity for connecting to the specified database. | |
| Password | Password | Password to use to connect to the specified database. | |
| Connection Arguments | Keyvalue | A list of arbitrary string key/value pairs as connection arguments. See Connection String Options for a full description of these arguments. | |
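
One way the properties above could be combined is into a standard MongoDB connection string of the form mongodb://user:password@host:port/database?options. The Python sketch below shows that idea only; the function name and parameter names are assumptions for illustration, and the plugin may assemble its connection differently.

```python
from urllib.parse import quote_plus, urlencode

DEFAULT_PORT = 27017  # default port listed in the properties above


def build_connection_string(host, database, port=DEFAULT_PORT,
                            username=None, password=None,
                            connection_arguments=None):
    """Build a MongoDB connection URI from sink-style properties."""
    credentials = ""
    if username:
        # Percent-encode credentials so characters such as '@' or '/' are safe.
        credentials = quote_plus(username)
        if password:
            credentials += ":" + quote_plus(password)
        credentials += "@"
    uri = f"mongodb://{credentials}{host}:{port}/{database}"
    if connection_arguments:
        # Connection arguments become the query string of the URI.
        uri += "?" + urlencode(connection_arguments)
    return uri
```

For example, build_connection_string("localhost", "test") yields "mongodb://localhost:27017/test", and any connection arguments are appended as key=value pairs after a question mark.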


...

| User Facing Name | Type | Description | Constraints |
|---|---|---|---|
| Label | String | Label for UI. | |
| Reference Name | String | Uniquely identified name for lineage. | |
| Host | String | Host that MongoDB is running on. | Required (defaults to localhost on UI) |
| Port | Number | Port that MongoDB is listening to. | Optional (default 27017) |
| Database | String | MongoDB database name. | Required |
| Collection | String | Name of the database collection to write to. | Required |
| Output Schema | Schema | Specifies the schema of the documents. | Required |
| Input Query | String | Optionally filter the input collection with a query. This query must be represented in JSON format and use the MongoDB extended JSON format to represent non-native JSON data types. | |
| Input Fields | String | Projection document that can limit the fields that appear in each document. This must be represented in JSON format, and use the MongoDB extended JSON format to represent non-native JSON data types. If no projection document is provided, all fields will be read. | |
| Splitter Class | | The name of the Splitter class to use. If left empty, the MongoDB Hadoop Connector will attempt to make a best guess as to which Splitter to use. The Hadoop connector provides these Splitters: com.mongodb.hadoop.splitter.StandaloneMongoSplitter, com.mongodb.hadoop.splitter.ShardMongoSplitter, com.mongodb.hadoop.splitter.ShardChunkMongoSplitter, com.mongodb.hadoop.splitter.MultiMongoCollectionSplitter. | |
| Username | String | User identity for connecting to the specified database. | |
| Password | Password | Password to use to connect to the specified database. | |
| Authentication Connection String | | Auxiliary MongoDB connection string to authenticate against when constructing splits. | |
| Connection Arguments | Keyvalue | A list of arbitrary string key/value pairs as connection arguments. See Connection String Options for a full description of these arguments. | |
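
The constraints listed above (required properties, the default port, and JSON-formatted Input Query and Input Fields) could be checked along the following lines. This Python sketch is an assumption-laden illustration: the property keys (host, database, collection, schema, inputQuery, inputFields) and the validate_source_config function are hypothetical names, not the plugin's actual configuration API.

```python
import json

DEFAULT_PORT = 27017


def validate_source_config(config):
    """Return a normalized copy of the config, or raise ValueError."""
    normalized = dict(config)
    # Properties marked Required in the table above.
    for required in ("host", "database", "collection", "schema"):
        if not normalized.get(required):
            raise ValueError(f"'{required}' is required")
    # Port is optional and falls back to the documented default.
    normalized.setdefault("port", DEFAULT_PORT)
    # Input Query and Input Fields must be JSON documents when provided.
    for json_property in ("inputQuery", "inputFields"):
        raw = normalized.get(json_property)
        if not raw:
            continue
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as error:
            raise ValueError(f"'{json_property}' is not valid JSON: {error}") from error
        if not isinstance(parsed, dict):
            raise ValueError(f"'{json_property}' must be a JSON document")
        normalized[json_property] = parsed
    return normalized


example = validate_source_config({
    "host": "localhost",
    "database": "test",
    "collection": "inventory",
    "schema": "{}",
    "inputQuery": '{"status": {"$in": ["A", "D"]}}',
    "inputFields": '{"item": 1, "status": 1}',
})
```

Failing early on malformed JSON keeps the error close to the configuration step rather than surfacing it when splits are constructed.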



Data Types Mapping

| MongoDB Data Type | CDAP Schema Data Type | Support | Comment |
|---|---|---|---|
| Double | Schema.Type.DOUBLE | + | |
| String | Schema.Type.STRING | + | |
| Object | Schema.Type.RECORD | + | |
| Array | Schema.Type.ARRAY | + | |
| Binary data | Schema.Type.BYTES | * | Value can be mapped to Schema.Type.BYTES, but this can lead to subtype information loss. Subtypes: generic \x00 (0), function \x01 (1), old \x02 (2), uuid_old \x03 (3), uuid \x04 (4), md5 \x05 (5), user \x80 (128). There are several options: 1) Support only the 'generic' subtype. 2) Map using MongoDB extended JSON format: "binary": {"$binary": "YmluYXJ5IGRhdGE=", "$type": "00"} |
| Undefined | Schema.Type.NULL | * | Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "undefined": {"$undefined": true} |
| ObjectId | | * | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink. 2) Map using MongoDB extended JSON format: {"$oid": "5d3f1c2a2f547625b0bbb397"} |
| Boolean | Schema.Type.BOOLEAN | + | |
| Date | Schema.LogicalType.TIMESTAMP_MILLIS | + | |
| Null | Schema.Type.UNION | + | A nullable version of the actual type, corresponds to Schema.nullableOf(actualTypeSchema). |
| Regular Expression | Schema.Type.STRING | * | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink. 2) Map using MongoDB extended JSON format: "regex": {"$regex": ".", "$options": ""} |
| DBPointer | Schema.Type.STRING | * | String in MongoDB extended JSON format: "dbpointer": {"$ref": "source", "$id": {"$oid": "5d079ee6d078c94008e4bb3a"}} |
| JavaScript | Schema.Type.STRING | * | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink. 2) Map using MongoDB extended JSON format: "javascript": {"$code": "var l = 1;"} |
| Symbol | Schema.Type.STRING | * | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink. 2) Map using MongoDB extended JSON format: "symbol": {"$symbol": "a"} |
| JavaScript (with scope) | Schema.Type.STRING | * | Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "javascriptwithscope": {"$code": "var l = 1;", "$scope": {"scope": "scope_val"}} |
| 32-bit integer | Schema.Type.INT | + | |
| Timestamp | | * | Special type for internal MongoDB use which is not associated with the regular Date type. Timestamp values are 64-bit values where the first 32 bits are a time_t value (seconds since the Unix epoch) and the second 32 bits are an incrementing ordinal for operations within a given second. Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "timestamp": {"$timestamp": {"t": 1564410161, "i": 1}} |
| 64-bit integer | Schema.Type.LONG | + | |
| Decimal128 | Schema.LogicalType.DECIMAL | + | |
| Min key | | * | Is less than any other value of any type. This can be useful for always returning certain documents first (or last). Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "minkey": {"$minKey": 1} |
| Max key | | * | Is greater than any other value of any type. This can be useful for always returning certain documents first (or last). Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "maxkey": {"$maxKey": 1} |

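The mappings in the table above could be sketched in Python as a lookup table for the directly supported types plus small helpers that render the special types as extended JSON strings. All names here (DIRECT_MAPPINGS and the three helper functions) are hypothetical, and the timestamp helper assumes that "the first 32 bits" means the most significant 32 bits of the 64-bit value.

```python
import base64
import json

# Types marked "+" in the table: a direct CDAP schema type exists.
DIRECT_MAPPINGS = {
    "Double": "Schema.Type.DOUBLE",
    "String": "Schema.Type.STRING",
    "Object": "Schema.Type.RECORD",
    "Array": "Schema.Type.ARRAY",
    "Boolean": "Schema.Type.BOOLEAN",
    "Date": "Schema.LogicalType.TIMESTAMP_MILLIS",
    "Null": "Schema.Type.UNION",
    "32-bit integer": "Schema.Type.INT",
    "64-bit integer": "Schema.Type.LONG",
    "Decimal128": "Schema.LogicalType.DECIMAL",
}


def binary_to_extended_json(data, subtype=0):
    """Render binary data as an extended JSON string, keeping the subtype."""
    return json.dumps({
        "$binary": base64.b64encode(data).decode("ascii"),
        "$type": f"{subtype:02x}",  # e.g. 0 -> "00", 128 -> "80"
    })


def timestamp_to_extended_json(value):
    """Split a 64-bit timestamp into seconds (high bits) and ordinal (low bits)."""
    seconds = value >> 32
    ordinal = value & 0xFFFFFFFF
    return json.dumps({"$timestamp": {"t": seconds, "i": ordinal}})


def object_id_to_extended_json(hex_id):
    """Wrap a 24-character ObjectId hex string in its extended JSON form."""
    return json.dumps({"$oid": hex_id})
```

Keeping the wrapper object (for example "$binary" with "$type") in the string is what preserves the type information that a plain Schema.Type.STRING or Schema.Type.BYTES value would lose.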

Approach

Move the existing mongodb-plugins module to the mongodb-plugins project. Add MongoDB-specific properties to the configuration and add support for MongoDB-specific data types. Update the UI widget JSON definitions.

...