Introduction
A separate database plugin to support MongoDB-specific features and configurations.
...
The suggestion is to move the existing mongodb-plugins module to the mongodb-plugins repository.
MongoDB Overview
Document database
A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.
Code Block |
---|
{
"_id" : ObjectId("5d3f1c2a2f547625b0bbb397"),
"string" : "AAPL",
"int32" : 10,
"double" : 23.23,
"array" : [
"a1",
"a2"
],
"object" : {
"inner_field" : "val"
},
"binary" : { "$binary" : "YmluYXJ5IGRhdGE=", "$type" : "00" },
"undefined" : undefined,
"boolean" : false,
"date" : ISODate("2019-07-29T16:17:46.109Z"),
"null" : null,
"regex" : /./,
"dbpointer" : DBRef("source", "5d079ee6d078c94008e4bb3a"),
"javascript" : { "$code" : "var l = 1;" },
"javascriptwithscope" : { "$code" : "var l = 1;", "$scope" : { "scope" : "scope_val" } },
"symbol" : "a",
"timestamp" : Timestamp(1564417066, 1),
"long" : NumberLong(9223372036854775807),
"decimal" : NumberDecimal("3.100000"),
"minkey" : { "$minKey" : 1 },
"maxkey" : { "$maxKey" : 1 }
} |
BSON
BSON is a binary serialization format used to store documents and make remote procedure calls in MongoDB. The BSON specification is located at bsonspec.org.
Document limitations
- The maximum BSON document size is 16 megabytes.
- In MongoDB, each document stored in a collection requires a unique _id field that acts as a primary key. If an inserted document omits the _id field, the MongoDB driver automatically generates an ObjectId for the _id field.
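The `_id` rule above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the driver's implementation: `new_object_id_hex` and `ensure_id` are hypothetical helpers, and the 12-byte layout (4-byte timestamp, 5-byte random value, 3-byte counter) follows the ObjectId layout described in the MongoDB documentation.

```python
import itertools
import os
import time

# Hypothetical helpers for illustration; a real MongoDB driver generates ObjectIds itself.
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))
_process_random = os.urandom(5)  # 5-byte random value, fixed for this process


def new_object_id_hex():
    """Build a 12-byte ObjectId-like value and return it as 24 hex characters."""
    timestamp = int(time.time()).to_bytes(4, "big")  # seconds since the Unix epoch
    counter = (next(_counter) % 0x1000000).to_bytes(3, "big")  # 3-byte incrementing counter
    return (timestamp + _process_random + counter).hex()


def ensure_id(document):
    """Return a copy of the document that is guaranteed to carry an _id field."""
    if "_id" in document:
        return dict(document)
    return {"_id": new_object_id_hex(), **document}
```

A document that already has an `_id` is returned unchanged, which mirrors the behaviour described for the 'ID Field' sink property.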
Flexible schema
Unlike SQL databases, where you must determine and declare a table’s schema before inserting data, MongoDB’s collections, by default, do not require their documents to have the same schema.
- The documents in a single collection do not need to have the same set of fields and the data type for a field can differ across documents within a collection.
- To change the structure of the documents in a collection, such as adding new fields, removing existing fields, or changing field values to a new type, update the documents to the new structure.
Query filter documents
A query filter document and query operators can be used to specify conditions.
The following example uses '{ status: { $in: [ "A", "D" ] } }' query filter document to retrieve all documents from the 'inventory' collection where 'status' equals either "A" or "D":
Code Block |
---|
db.inventory.find( { status: { $in: [ "A", "D" ] } } ) |
The operation corresponds to the following SQL statement:
Code Block |
---|
SELECT * FROM inventory WHERE status in ("A", "D") |
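To make the filter semantics concrete, the sketch below evaluates the same `$in` filter against an in-memory list. It is an illustration only: `matches` is a hypothetical helper, the sample `item` values are invented, and real MongoDB matching covers many more operators and edge cases (arrays, missing fields, nested paths).

```python
def matches(document, query_filter):
    """Return True if the document satisfies every condition in the filter."""
    for field, condition in query_filter.items():
        value = document.get(field)
        if isinstance(condition, dict) and "$in" in condition:
            # $in: the field value must equal one of the listed values
            if value not in condition["$in"]:
                return False
        elif value != condition:
            # plain value: simple equality match
            return False
    return True


# Illustrative data; the 'item' values are not taken from the original page.
inventory = [
    {"item": "journal", "status": "A"},
    {"item": "notebook", "status": "B"},
    {"item": "paper", "status": "D"},
]

selected = [doc for doc in inventory if matches(doc, {"status": {"$in": ["A", "D"]}})]
```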
Sink Properties
User Facing Name | Widget Type | Description | Constraints |
---|---|---|---|
Label | textbox | Label for UI. | |
Reference Name | textbox | Uniquely identified name for lineage. | |
Host | textbox | Host that MongoDB is running on. | Required (defaults to localhost on UI) |
Port | number | Port that MongoDB is listening to. | Optional (default 27017) |
Database | textbox | MongoDB database name. | Required |
Collection | textbox | Name of the database collection to write to. | Required |
ID Field | textbox | Allows the user to specify which of the incoming fields should be used as an object identifier. | Optional. Object ID will be generated if no value is specified. |
Username | textbox | User identity for connecting to the specified database. | |
Password | password | Password to use to connect to the specified database. | |
Connection Arguments | keyvalue | A list of arbitrary string key/value pairs as connection arguments. See Connection String Options for a full description of these arguments. | |
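One way these properties could be combined is a standard MongoDB connection string of the form `mongodb://[username:password@]host[:port]/[database][?options]`. The sketch below is an assumption about how a plugin might assemble it, not the plugin's actual code; `build_connection_string` and its parameter names are hypothetical.

```python
from urllib.parse import quote_plus, urlencode


def build_connection_string(host, database, port=27017, username=None,
                            password=None, connection_arguments=None):
    """Assemble a MongoDB connection string from plugin-style properties."""
    credentials = ""
    if username:
        # Credentials must be percent-encoded so that ':' and '@' stay unambiguous.
        credentials = quote_plus(username)
        if password:
            credentials += ":" + quote_plus(password)
        credentials += "@"
    uri = f"mongodb://{credentials}{host}:{port}/{database}"
    if connection_arguments:
        # Connection arguments become the query string of the URI.
        uri += "?" + urlencode(connection_arguments)
    return uri
```

The port default of 27017 matches the table above; the host has no default here because the table marks it as required.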
Sink Data Types Mapping
To support all data types in the Sink, we can use the MongoDB extended JSON format and/or infer the data type of a record field from its name.
The table below does not cover non-standard MongoDB data types; it lists how CDAP data types are stored.
The following MongoDB data types are missing: Undefined, Regular Expression, DBPointer, JavaScript, Symbol, JavaScript (with scope), Timestamp, Min key, Max key.
CDAP Schema Data Type | MongoDB Data Types | Comment |
---|---|---|
boolean | Boolean | |
bytes | Binary data, ObjectId (if 'ID Field' specified) | |
date | Date | |
double | Double | |
decimal | Decimal128 | The Decimal128 type supports up to 34 digits of precision. |
float | Double | |
int | 32-bit integer | |
long | 64-bit integer | |
string | String, ObjectId (if 'ID Field' specified) | |
time | String | |
timestamp | Date | |
array | Array | |
record | Object | |
enum | String | |
map | Object | |
union | Depends on the actual value. For example, if it's a union and the value is actually a long, the mongo document will have the field as a 64-bit integer. If a different record comes in with the value as a string, the mongo document will end up with a String for that field. | |
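The mapping table above can be expressed as a simple lookup. This is a sketch only: `SINK_TYPE_MAPPING` and `mongo_type_for` are hypothetical names, the type names are plain strings rather than real CDAP or BSON classes, and a union is assumed to be resolved by passing the CDAP type of the actual value.

```python
# CDAP schema type -> MongoDB type, copied from the table above.
SINK_TYPE_MAPPING = {
    "boolean": "Boolean",
    "bytes": "Binary data",
    "date": "Date",
    "double": "Double",
    "decimal": "Decimal128",
    "float": "Double",
    "int": "32-bit integer",
    "long": "64-bit integer",
    "string": "String",
    "time": "String",
    "timestamp": "Date",
    "array": "Array",
    "record": "Object",
    "enum": "String",
    "map": "Object",
}


def mongo_type_for(cdap_type, is_id_field=False):
    """Return the MongoDB type used to store a value of the given CDAP type.

    For a union, call this with the CDAP type of the actual value.
    """
    # bytes and string become ObjectId when the field is the configured 'ID Field'
    if is_id_field and cdap_type in ("bytes", "string"):
        return "ObjectId"
    return SINK_TYPE_MAPPING[cdap_type]
```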
Source Properties
User Facing Name | Widget Type | Description | Constraints |
---|---|---|---|
Label | textbox | Label for UI. | |
Reference Name | textbox | Uniquely identified name for lineage. | |
Host | textbox | Host that MongoDB is running on. | Required (defaults to localhost on UI) |
Port | number | Port that MongoDB is listening to. | Optional (default 27017) |
Database | textbox | MongoDB database name. | Required |
Collection | textbox | Name of the database collection to read from. | Required |
Output Schema | schema | Specifies the schema of the documents. | Required |
On Record Error | radio-group | Specifies how to handle errors in record processing. An error is thrown if a value cannot be parsed according to the provided schema. | Possible values: fail the pipeline, skip the record, or send the record to the error dataset. Default: 'Fail pipeline' |
Input Query | json-editor | Optionally filter the input collection with a query. The query must be written in JSON and use the MongoDB extended JSON format to represent non-native JSON data types. | |
The Hadoop connector provides these Splitters:
com.mongodb.hadoop.splitter.StandaloneMongoSplitter
com.mongodb.hadoop.splitter.ShardMongoSplitter
com.mongodb.hadoop.splitter.ShardChunkMongoSplitter
com.mongodb.hadoop.splitter.MultiMongoCollectionSplitter
Username | textbox | User identity for connecting to the specified database. | |
Password | password | Password to use to connect to the specified database. | |
Authentication Connection String | textbox | Auxiliary MongoDB connection string to authenticate against when constructing splits. | |
Connection Arguments | keyvalue | A list of arbitrary string key/value pairs as connection arguments. See Connection String Options for a full description of these arguments. | |
Source Data Types Mapping
The source requires Output Schema to be set. Based on the schema, the source expects each document field to be of a specific MongoDB data type.
...
The 'On Record Error' error handling property lets the user decide whether the pipeline should fail, the record should be skipped, or the record should be sent to the error dataset.
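The three error-handling choices can be sketched as follows. This is a hedged illustration, not the plugin's implementation: `process_records`, the `parse` callback, and the policy names `fail-pipeline`, `skip-error`, and `send-to-error` are all hypothetical.

```python
def process_records(documents, parse, on_record_error="fail-pipeline"):
    """Parse documents, applying the chosen policy to records that fail to parse."""
    if on_record_error not in ("fail-pipeline", "skip-error", "send-to-error"):
        raise ValueError(f"unknown error policy: {on_record_error}")
    parsed, errors = [], []
    for document in documents:
        try:
            parsed.append(parse(document))
        except ValueError as exc:
            if on_record_error == "fail-pipeline":
                raise  # abort the whole run on the first bad record
            if on_record_error == "send-to-error":
                errors.append((document, str(exc)))  # keep the record for the error dataset
            # "skip-error": drop the record and continue
    return parsed, errors
```

With the default policy, the first document that does not match the schema stops processing, which corresponds to the 'Fail pipeline' default in the properties table.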
The following table shows what MongoDB data types can be read as CDAP types.
CDAP Schema Data Type | MongoDB Data Types |
---|---|
boolean | Boolean |
bytes | Binary data, ObjectId |
date | - |
double | Double |
decimal | Decimal128 |
float | - |
int | 32-bit integer |
long | 64-bit integer |
string | String, Symbol |
time | - |
timestamp | Date |
array | Array |
record | Object. A record schema is used for the 'object' field. * We can map all non-standard data types to record, like JavaScript (with scope): a record schema is used for the 'javascriptwithscope' field. |
enum | - |
map | Object. A map schema is used for the 'object' field. |
union | - |
Notes on individual MongoDB data types:
MongoDB Data Type | Notes |
---|---|
Min key | Is less than any other value of any type. This can be useful for always returning certain documents first (or last). Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "minkey": {"$minKey": 1} |
Max key | Is greater than any other value of any type. This can be useful for always returning certain documents first (or last). Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "maxkey": {"$maxKey": 1} |
Binary data | Value can be mapped to Schema.Type.BYTES, but this can lead to subtype information loss. Subtypes: generic \x00 (0), function \x01 (1), old \x02 (2), uuid_old \x03 (3), uuid \x04 (4), md5 \x05 (5), user \x80 (128). There are several options: 1) Support only the 'generic' subtype. 2) Map using MongoDB extended JSON format: "binary": {"$binary": "YmluYXJ5IGRhdGE=", "$type": "00"} |
Undefined | Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "undefined": {"$undefined": true} |
ObjectId | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink. 2) Map using MongoDB extended JSON format: {"$oid": "5d3f1c2a2f547625b0bbb397"} |
Regular Expression | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink. 2) Map using MongoDB extended JSON format: "regex": {"$regex": ".", "$options": ""} |
DBPointer | String in MongoDB extended JSON format: "dbpointer": {"$ref": "source", "$id": {"$oid": "5d079ee6d078c94008e4bb3a"}} |
JavaScript | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink. 2) Map using MongoDB extended JSON format: "javascript": {"$code": "var l = 1;"} |
Symbol | Value can be mapped to Schema.Type.STRING, but this will lead to type information loss. There are several options: 1) Do not support this data type for the Sink. 2) Map using MongoDB extended JSON format: "symbol": {"$symbol": "a"} |
JavaScript (with scope) | Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "javascriptwithscope": {"$code": "var l = 1;", "$scope": {"scope": "scope_val"}} |
Timestamp | Special type for internal MongoDB use which is not associated with the regular Date type. Timestamp values are 64-bit values where the first 32 bits are a time_t value (seconds since the Unix epoch) and the second 32 bits are an incrementing ordinal for operations within a given second. Can be mapped to Schema.Type.STRING using MongoDB extended JSON format: "timestamp": {"$timestamp": {"t": 1564410161, "i": 1}} |
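The extended JSON mapping for Timestamp and Binary data can be sketched as below. The helper names are hypothetical, and the sketch assumes that "the first 32 bits" of a Timestamp are the most significant half of the 64-bit value; it produces strings in the same form as the examples on this page.

```python
import base64
import json


def timestamp_to_extended_json(raw):
    """Split a 64-bit Timestamp value into seconds (t) and ordinal (i)."""
    seconds = (raw >> 32) & 0xFFFFFFFF  # assumed high half: time_t seconds since the Unix epoch
    ordinal = raw & 0xFFFFFFFF          # low half: incrementing ordinal within that second
    return json.dumps({"$timestamp": {"t": seconds, "i": ordinal}})


def binary_to_extended_json(data, subtype=0):
    """Encode bytes as base64 and keep the subtype as a two-digit hex string."""
    encoded = base64.b64encode(data).decode("ascii")
    return json.dumps({"$binary": encoded, "$type": f"{subtype:02x}"})
```

Keeping the subtype in the `$type` field is what avoids the subtype information loss that a plain mapping to Schema.Type.BYTES would cause.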
Approach
Move the existing mongodb-plugins module to the mongodb-plugins project. Add MongoDB-specific properties to the configuration and add support for MongoDB-specific data types. Update the UI widget JSON definitions.
...