Introduction
A separate database plugin to support MongoDB-specific features and configurations.
Use-Case
- Users can choose and install MongoDB source and sink plugins.
- Users should see the MongoDB logo on the plugin configuration page for a better experience.
- Users should get relevant information from the tool tip:
  - The tool tip should accurately describe what each field is used for.
- Users should not have to specify any redundant configuration.
- Users should get field level lineage for the source and sink that is being used.
- Reference documentation should be updated to account for the changes.
- The source code for the MongoDB database plugin should be placed in a repo under the data-integrations org.
- Data pipelines using the source and sink plugins should run on both the MapReduce and Spark engines.
User Stories
- Users should be able to install MongoDB-specific database source and sink plugins from the Hub
- Users should have each tool tip accurately describe what each field does
- Users should get field level lineage information for the MongoDB source and sink
- Users should be able to set up a pipeline without specifying redundant information
- Users should get updated reference document for MongoDB source and sink
- Users should be able to read all MongoDB data types
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Design Tips
MongoDB driver reference: http://mongodb.github.io/mongo-java-driver/3.10/driver/
Design
The suggestion is to move the existing mongodb-plugins module to the mongodb-plugins repository.
MongoDB Overview
Document database
A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.
{
  "_id" : ObjectId("5d3f1c2a2f547625b0bbb397"),
  "string" : "AAPL",
  "int32" : 10,
  "double" : 23.23,
  "array" : [ "a1", "a2" ],
  "object" : { "inner_field" : "val" },
  "binary" : { "$binary" : "YmluYXJ5IGRhdGE=", "$type" : "00" },
  "undefined" : undefined,
  "boolean" : false,
  "date" : ISODate("2019-07-29T16:17:46.109Z"),
  "null" : null,
  "regex" : /./,
  "dbpointer" : DBRef("source", "5d079ee6d078c94008e4bb3a"),
  "javascript" : var l = 1;,
  "javascriptwithscope" : { "$code" : var l = 1; , "$scope" : { "scope" : "scope_val" } },
  "symbol" : "a",
  "timestamp" : Timestamp(1564417066, 1),
  "long" : NumberLong(9223372036854775807),
  "decimal" : NumberDecimal("3.100000"),
  "minkey" : { "$minKey" : 1 },
  "maxkey" : { "$maxKey" : 1 }
}
Document limitations
- The maximum BSON document size is 16 megabytes.
- In MongoDB, each document stored in a collection requires a unique _id field that acts as a primary key. If an inserted document omits the _id field, the MongoDB driver automatically generates an ObjectId for the _id field.
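The _id behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the driver's code: the helper names (new_object_id_hex, ensure_id, creation_seconds) are hypothetical, and the byte layout (4-byte timestamp in seconds, 5-byte random value, 3-byte counter) is assumed from MongoDB's published description of ObjectId.

```python
import itertools
import os
import time

# Assumed ObjectId layout: 4-byte timestamp (seconds), 5-byte random value,
# 3-byte incrementing counter. These module-level values are illustrative.
_RANDOM = os.urandom(5)
_COUNTER = itertools.count(int.from_bytes(os.urandom(3), "big"))


def new_object_id_hex():
    """Return a 24-character hex string shaped like an ObjectId."""
    seconds = int(time.time()).to_bytes(4, "big")
    counter = (next(_COUNTER) % 0x1000000).to_bytes(3, "big")
    return (seconds + _RANDOM + counter).hex()


def ensure_id(document):
    """Return the document as-is if it has _id, else a copy with a generated one."""
    if "_id" in document:
        return document
    return {"_id": new_object_id_hex(), **document}


def creation_seconds(object_id_hex):
    """Read the leading 4-byte timestamp back out of an ObjectId hex string."""
    return int(object_id_hex[:8], 16)


inserted = ensure_id({"string": "AAPL"})
```

Applied to the _id of the sample document, creation_seconds("5d3f1c2a2f547625b0bbb397") yields 1564417066, the same seconds value that appears in that document's timestamp field.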
Flexible schema
Unlike SQL databases, where you must determine and declare a table’s schema before inserting data, MongoDB collections do not, by default, require their documents to have the same schema.
- The documents in a single collection do not need to have the same set of fields and the data type for a field can differ across documents within a collection.
- To change the structure of the documents in a collection, such as add new fields, remove existing fields, or change the field values to a new type, update the documents to the new structure.
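Because field sets and types can vary from one document to the next, it can help to inspect what a collection actually contains before fixing a schema for it. The sketch below uses plain Python dictionaries as stand-ins for documents; the sample data and the observed_types helper are hypothetical and not part of the plugin.

```python
from collections import defaultdict

# Hypothetical documents from a single collection: the second one adds a
# field and stores "price" as a string instead of a number.
documents = [
    {"_id": 1, "ticker": "AAPL", "price": 23.23},
    {"_id": 2, "ticker": "GOOG", "price": "n/a", "tags": ["a1", "a2"]},
]


def observed_types(docs):
    """Map each field name to the sorted list of Python type names seen for it."""
    seen = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(names) for field, names in seen.items()}


summary = observed_types(documents)
```

Here "price" is reported with two types and "tags" appears in only one document, which is the kind of variation a fixed output schema has to account for.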
Query filter documents
A query filter document and query operators can be used to specify conditions.
The following example uses '{ status: { $in: [ "A", "D" ] } }' query filter document to retrieve all documents from the 'inventory' collection where 'status' equals either "A" or "D":
db.inventory.find( { status: { $in: [ "A", "D" ] } } )
The operation corresponds to the following SQL statement:
SELECT * FROM inventory WHERE status in ("A", "D")
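To make the semantics of the filter concrete, the following sketch evaluates the same filter document against in-memory dictionaries. It is a deliberately simplified model, assuming only equality and the $in operator on scalar fields; the matches function and the sample inventory are hypothetical, and a real MongoDB server supports many more operators and array semantics.

```python
def matches(document, query_filter):
    """Return True if the document satisfies every condition in the filter.

    Only plain equality and the $in operator are modelled here.
    """
    for field, condition in query_filter.items():
        value = document.get(field)
        if isinstance(condition, dict) and "$in" in condition:
            if value not in condition["$in"]:
                return False
        elif value != condition:
            return False
    return True


# Hypothetical contents of the 'inventory' collection.
inventory = [
    {"item": "journal", "status": "A"},
    {"item": "notebook", "status": "P"},
    {"item": "paper", "status": "D"},
]

selected = [doc for doc in inventory
            if matches(doc, {"status": {"$in": ["A", "D"]}})]
```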
Sink Properties
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Label | String | Label for UI. | |
Reference Name | String | Name that uniquely identifies this sink for lineage. | |
Host | String | Host that MongoDB is running on. | Required (defaults to localhost on UI) |
Port | Number | Port that MongoDB is listening on. | Optional (default 27017) |
Database | String | MongoDB database name. | Required |
Collection | String | Name of the database collection to write to. | Required |
Username | String | User identity for connecting to the specified database. | |
Password | Password | Password to use to connect to the specified database. | |
Connection Arguments | Keyvalue | A list of arbitrary string key/value pairs as connection arguments. See Connection String Options for a full description of these arguments. | |
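One way to see how these properties fit together is to assemble them into a standard mongodb:// connection string. The sketch below is an assumption about how the values could be combined, not the plugin's actual implementation; build_connection_uri and the sample values are hypothetical.

```python
from urllib.parse import quote, urlencode


def build_connection_uri(host, database, port=27017, username=None,
                         password=None, connection_arguments=None):
    """Assemble a mongodb:// URI from sink-style properties (illustrative helper)."""
    credentials = ""
    if username:
        # Percent-encode credentials so characters such as '@' or ':' are safe.
        credentials = quote(username, safe="")
        if password:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    uri = f"mongodb://{credentials}{host}:{port}/{database}"
    if connection_arguments:
        # Connection Arguments become the query-string options of the URI.
        uri += "?" + urlencode(connection_arguments)
    return uri


uri = build_connection_uri(
    host="localhost",
    database="analytics",
    username="etl user",
    password="p@ss",
    connection_arguments={"ssl": "true", "connectTimeoutMS": "3000"},
)
```

With these inputs the result is mongodb://etl%20user:p%40ss@localhost:27017/analytics?ssl=true&connectTimeoutMS=3000, and omitting the port falls back to the default 27017 listed in the table.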
Source Properties
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Label | String | Label for UI. | |
Reference Name | String | Name that uniquely identifies this source for lineage. | |
Host | String | Host that MongoDB is running on. | Required (defaults to localhost on UI) |
Port | Number | Port that MongoDB is listening on. | Optional (default 27017) |
Database | String | MongoDB database name. | Required |
Collection | String | Name of the database collection to read from. | Required |
Output Schema | Schema | Specifies the schema of the documents. | Required |
Input Query | String | Optionally filter the input collection with a query. This query must be represented in JSON format and use the MongoDB extended JSON format to represent non-native JSON data types. | |
Input Fields | String | Projection document that can limit the fields that appear in each document. This must be represented in JSON format, and use the MongoDB extended JSON format to represent non-native JSON data types. If no projection document is provided, all fields will be read. | |
Splitter Class | String | The name of the Splitter class to use. If left empty, the MongoDB Hadoop Connector will attempt to make a best guess as to which Splitter to use. The Hadoop connector provides several Splitter implementations. | |
Username | String | User identity for connecting to the specified database. | |
Password | Password | Password to use to connect to the specified database. | |
Authentication Connection String | String | Auxiliary MongoDB connection string to authenticate against when constructing splits. | |
Connection Arguments | Keyvalue | A list of arbitrary string key/value pairs as connection arguments. See Connection String Options for a full description of these arguments. | |
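Input Query and Input Fields are both optional and must both be JSON documents, so a configuration check could parse them up front. The following is a minimal sketch of such a check, assuming that an empty value means the property is not set; parse_json_property and the sample values are hypothetical.

```python
import json


def parse_json_property(name, raw):
    """Parse an optional JSON-object property; empty input means 'not set'."""
    if raw is None or not raw.strip():
        return None
    parsed = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(parsed, dict):
        raise ValueError(f"{name} must be a JSON object")
    return parsed


# Extended JSON keeps non-native types such as ObjectId expressible in JSON.
input_query = parse_json_property(
    "Input Query", '{"_id": {"$oid": "5d3f1c2a2f547625b0bbb397"}}')
# A projection document listing the fields to include.
input_fields = parse_json_property("Input Fields", '{"string": 1, "int32": 1}')
# No projection document: all fields will be read.
no_projection = parse_json_property("Input Fields", "")
```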
Data Types Mapping
MongoDB Data Type | CDAP Schema Data Type | Support | Comment |
---|---|---|---|
Double | Schema.Type.DOUBLE | + | |
String | Schema.Type.STRING | + | |
Object | Schema.Type.RECORD | + | |
Array | Schema.Type.ARRAY | + | |
Binary data | Schema.Type.BYTES | * | Value can be mapped to Schema.Type.BYTES, but this can lead to subtype information loss. There are several options: 1) Support only the 'generic' subtype. 2) Map using MongoDB extended JSON format: "binary": {"$binary": "YmluYXJ5IGRhdGE=", "$type": "00"} |
Undefined | Schema.Type.NULL | * | Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "undefined": {"$undefined": true} |
ObjectId | | * | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink 2) Map using MongoDB extended JSON format: {"$oid": "5d3f1c2a2f547625b0bbb397"} |
Boolean | Schema.Type.BOOLEAN | + | |
Date | Schema.LogicalType.TIMESTAMP_MILLIS | + | |
Null | Schema.Type.UNION | + | A nullable version of the actual type, corresponds to Schema.nullableOf(actualTypeSchema). |
Regular Expression | Schema.Type.STRING | * | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink 2) Map using MongoDB extended JSON format: "regex": {"$regex": ".", "$options": ""} |
DBPointer | Schema.Type.STRING | * | String in MongoDB extended JSON format: "dbpointer": {"$ref": "source", "$id": {"$oid": "5d079ee6d078c94008e4bb3a"}} |
JavaScript | Schema.Type.STRING | * | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink 2) Map using MongoDB extended JSON format: "javascript": {"$code": "var l = 1;"} |
Symbol | Schema.Type.STRING | * | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink 2) Map using MongoDB extended JSON format: "symbol": {"$symbol": "a"} |
JavaScript (with scope) | Schema.Type.STRING | * | Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "javascriptwithscope": {"$code": "var l = 1;", "$scope": {"scope": "scope_val"}} |
32-bit integer | Schema.Type.INT | + | |
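Timestamp | | * | Special type for internal MongoDB use which is not associated with the regular Date type. Timestamp values are 64-bit values where the most significant 32 bits are a time_t value (seconds since the Unix epoch) and the least significant 32 bits are an incrementing ordinal for operations within a given second. Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "timestamp": {"$timestamp": {"t": 1564410161, "i": 1}} |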
64-bit integer | Schema.Type.LONG | + | |
Decimal128 | Schema.LogicalType.DECIMAL | + | |
Min key | | * | Is less than any other value of any type. This can be useful for always returning certain documents first (or last). Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "minkey": {"$minKey": 1} |
Max key | | * | Is greater than any other value of any type. This can be useful for always returning certain documents first (or last). Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "maxkey": {"$maxKey": 1} |
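The table can be read as a lookup: rows marked "+" map directly, while many rows marked "*" fall back to a string holding the value in MongoDB extended JSON. The sketch below models that reading in Python, with the CDAP type names kept as plain strings. It is not the plugin's code; the constant and function names are hypothetical, and Binary data and Undefined are left out because the table lists other target types for them.

```python
import json

# Rows marked "+" in the table above.
DIRECT_MAPPING = {
    "Double": "Schema.Type.DOUBLE",
    "String": "Schema.Type.STRING",
    "Object": "Schema.Type.RECORD",
    "Array": "Schema.Type.ARRAY",
    "Boolean": "Schema.Type.BOOLEAN",
    "Date": "Schema.LogicalType.TIMESTAMP_MILLIS",
    "Null": "Schema.Type.UNION",
    "32-bit integer": "Schema.Type.INT",
    "64-bit integer": "Schema.Type.LONG",
    "Decimal128": "Schema.LogicalType.DECIMAL",
}

# Rows marked "*" whose comment proposes a STRING in extended JSON format.
EXTENDED_JSON_TYPES = {
    "ObjectId", "Regular Expression", "DBPointer", "JavaScript", "Symbol",
    "JavaScript (with scope)", "Timestamp", "Min key", "Max key",
}


def to_schema_type(mongo_type):
    """Return the CDAP schema type name sketched for a MongoDB type name."""
    if mongo_type in DIRECT_MAPPING:
        return DIRECT_MAPPING[mongo_type]
    if mongo_type in EXTENDED_JSON_TYPES:
        return "Schema.Type.STRING"
    raise KeyError(f"no mapping sketched for {mongo_type!r}")


# Option 2 from the table: carry the value as an extended JSON string.
object_id_field = json.dumps({"$oid": "5d3f1c2a2f547625b0bbb397"})
```

Keeping the extended JSON wrapper in the string preserves the original type information, so a sink could in principle parse it back with json.loads and restore the MongoDB type.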
Approach
Move the existing mongodb-plugins module to the mongodb-plugins project. Add MongoDB-specific properties to the configuration and add support for MongoDB-specific data types. Update the UI widget JSON definitions.