Introduction
A separate database plugin to support Neo4j-specific features and configurations.
Use-case
- Users can choose and install Neo4j source and sink plugins.
- Users should see the Neo4j logo on the plugin configuration page for a better experience.
- Users should get relevant information from the tool tip:
- The tool tip should describe accurately what each field is used for.
- Users should not have to specify any redundant configuration.
- Users should get field level lineage for the source and sink that is being used.
- Reference documentation should be updated to account for the changes.
- The source code for the Neo4j database plugin should be placed in a repo under data-integrations.org.
- Data pipelines using the source and sink plugins should run on both MapReduce and Spark engines.
User Stories
- Users should be able to install Neo4j-specific database source and sink plugins from the Hub.
- Users should have each tool tip accurately describe what each field does.
- Users should get field level lineage information for the Neo4j source and sink.
- Users should be able to set up a pipeline without specifying redundant information.
- Users should get an updated reference document for the Neo4j source and sink.
- Users should be able to read all the DB types.
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Design Tips
- Reference to the Neo4j driver manual: https://neo4j.com/docs/driver-manual/1.7/
- Reference to the Neo4j JDBC driver (a community-contributed driver): https://neo4j.com/developer/java-third-party/
- Reference to Cypher Query Language manual: https://neo4j.com/docs/cypher-manual/current/
Design
Neo4j Overview
Neo4j is a graph database management system with native graph storage and processing. In Neo4j, everything is stored in the form of an edge, node, or attribute. Each node and edge can have any number of attributes. Both nodes and edges can be labelled. Labels can be used to narrow searches.
Cypher Query Language
Cypher is a declarative graph query language that allows for expressive and efficient querying and updating of the graph.
Cypher is inspired by a number of different approaches and builds on established practices for expressive querying. Many of the keywords, such as WHERE and ORDER BY, are inspired by SQL. Pattern matching borrows expression approaches from SPARQL. Some of the list semantics are borrowed from languages such as Haskell and Python.
Here are a few clauses used to read from the graph:
- MATCH: The graph pattern to match. This is the most common way to get data from the graph.
- WHERE: Not a clause in its own right, but rather part of MATCH, OPTIONAL MATCH and WITH. Adds constraints to a pattern, or filters the intermediate result passing through WITH.
- RETURN: What to return.
Here is an example of a simple Cypher query:
MATCH (n)
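The read clauses described in this section combine in a fixed order: MATCH, then an optional WHERE, then RETURN. As a minimal sketch of that ordering, the hypothetical helper `build_read_query` below assembles a query string from its parts. The helper name and the `Person` label are illustrative assumptions and are not part of the plugin.

```python
from typing import Optional


def build_read_query(pattern: str, returns: str, where: Optional[str] = None) -> str:
    """Assemble a read-only Cypher query (hypothetical helper, for illustration).

    pattern: graph pattern for MATCH, e.g. "(n:Person)"
    returns: expression list for RETURN, e.g. "n.name"
    where:   optional predicate; WHERE is attached to the MATCH clause
    """
    parts = [f"MATCH {pattern}"]
    if where:
        # WHERE is not a clause in its own right; it constrains the MATCH pattern.
        parts.append(f"WHERE {where}")
    parts.append(f"RETURN {returns}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_read_query("(n:Person)", "n.name", where="n.born > 1970"))
```

A query built this way could be supplied as the Input Query of the source.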
Source Properties
Section | User Facing Name | Widget Type | Description | Constraints |
---|---|---|---|---|
General | Label | textbox | Label for UI. | |
General | Reference Name | textbox | Name that uniquely identifies this source for lineage. | Required |
General | Neo4j Host | textbox | Neo4j database host. | Required |
General | Neo4j Port | number | Neo4j database port. | Required |
General | Input Query | textbox | The query to use to import data from the Neo4j database. | Required |
Credentials | Username | textbox | User identity for connecting to Neo4j. | Required |
Credentials | Password | password | Password to use to connect to Neo4j. | Required |
Advanced | Splits Number | number | The number of splits to generate. If set to one, orderBy is not needed. | |
Advanced | Order By | textbox | Field name used for ordering during splits generation. Required unless numSplits is set to one. | |
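The relationship between Splits Number and Order By can be sketched as follows. This document does not specify how splits are generated, so the SKIP/LIMIT strategy below is an assumption chosen for illustration. The names `SourceConfig` and `split_queries` are hypothetical, as is the `total_rows` argument, which stands in for a row count obtained elsewhere.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SourceConfig:
    """Hypothetical holder for the Input Query, Splits Number and Order By properties."""
    input_query: str
    splits_number: int = 1
    order_by: Optional[str] = None

    def validate(self) -> None:
        if self.splits_number < 1:
            raise ValueError("Splits Number must be at least 1")
        # Order By is required unless a single split is used.
        if self.splits_number > 1 and not self.order_by:
            raise ValueError("Order By is required when Splits Number is greater than 1")


def split_queries(config: SourceConfig, total_rows: int) -> List[str]:
    """Return one Cypher query per split (assumed SKIP/LIMIT paging strategy)."""
    config.validate()
    if config.splits_number == 1:
        # A single split reads the whole result, so no ordering is needed.
        return [config.input_query]
    # Ceiling division so that every row falls into some split.
    rows_per_split = -(-total_rows // config.splits_number)
    return [
        f"{config.input_query} ORDER BY {config.order_by} "
        f"SKIP {i * rows_per_split} LIMIT {rows_per_split}"
        for i in range(config.splits_number)
    ]


if __name__ == "__main__":
    cfg = SourceConfig("MATCH (n:Person) RETURN n", splits_number=3, order_by="n.name")
    for query in split_queries(cfg, total_rows=10):
        print(query)
```

A stable ordering matters here because each split runs its query independently; without ORDER BY, SKIP and LIMIT would not be guaranteed to partition the result consistently.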
Source Data Types Mapping
Neo4j Data Types | CDAP Schema Data Types |
---|---|
null | null |
List | array |
Map | record |
Boolean | boolean |
Integer | long |
Float | double |
String | string |
ByteArray | bytes |
Date | date |
Time | time-micros |
LocalTime | time-micros |
DateTime | timestamp-micros |
LocalDateTime | timestamp-micros |
Node | record. Schema example: {"name": "n", "type": { "type": "record", "name": "n", "fields": [ {"name": "born", "type": "long"}, {"name": "name", "type": "string"}, {"name": "_id", "type": "long"}, {"name": "_labels", "type": {"type": "array", "items": "string"}} ] }} |
Relationship | record. Schema example: {"name": "r", "type": { "type": "record", "name": "r", "fields": [ {"name": "_startId", "type": "long"}, {"name": "roles", "type": {"type": "array", "items": "string"}}, {"name": "_type", "type": "string"}, {"name": "_endId", "type": "long"}, {"name": "_id", "type": "long"} ] }} |
Duration (a temporal amount capturing the difference in time between two instants; can be negative) | record. Schema example: {"name": "dr", "type": { "type": "record", "name": "dr", "fields": [ {"name": "duration", "type": "string"}, {"name": "seconds", "type": "long"}, {"name": "months", "type": "long"}, {"name": "days", "type": "long"}, {"name": "nanoseconds", "type": "string"} ] }} |
Point | record. Schema example: {"name": "p", "type": { "type": "record", "name": "p", "fields": [ {"name": "crs", "type": "string"}, {"name": "x", "type": "double"}, {"name": "y", "type": "double"}, {"name": "srid", "type": "string"} ] }} |
Path | |
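The mapping table can be expressed as a lookup, and the Node schema example can be derived from it. The sketch below is illustrative only: `NEO4J_TO_CDAP` and `node_schema` are hypothetical names, and the helper handles only properties whose mapped type is a plain type name (not List or Map properties). Path is left out because the table gives no mapping for it.

```python
import json

# Neo4j type name -> CDAP schema type name, copied from the table above.
NEO4J_TO_CDAP = {
    "null": "null",
    "List": "array",
    "Map": "record",
    "Boolean": "boolean",
    "Integer": "long",
    "Float": "double",
    "String": "string",
    "ByteArray": "bytes",
    "Date": "date",
    "Time": "time-micros",
    "LocalTime": "time-micros",
    "DateTime": "timestamp-micros",
    "LocalDateTime": "timestamp-micros",
    "Node": "record",
    "Relationship": "record",
    "Duration": "record",
    "Point": "record",
}


def node_schema(name, properties):
    """Build a record schema for a Node, following the shape of the table's example.

    name:       field name for the node, e.g. "n"
    properties: dict of property name -> Neo4j type name (simple types only)
    """
    fields = [{"name": prop, "type": NEO4J_TO_CDAP[neo4j_type]}
              for prop, neo4j_type in properties.items()]
    # Every node record also carries its id and its labels.
    fields.append({"name": "_id", "type": "long"})
    fields.append({"name": "_labels", "type": {"type": "array", "items": "string"}})
    return {"name": name, "type": {"type": "record", "name": name, "fields": fields}}


if __name__ == "__main__":
    print(json.dumps(node_schema("n", {"born": "Integer", "name": "String"}), indent=2))
```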
Sink Properties
Section | User Facing Name | Widget Type | Description | Constraints |
---|---|---|---|---|
General | Label | textbox | Label for UI. | |
General | Reference Name | textbox | Name that uniquely identifies this sink for lineage. | Required |
General | Neo4j Host | textbox | Neo4j database host. | Required |
General | Neo4j Port | number | Neo4j database port. | Required |
General | Output Query | textbox | The query to use to export data to the Neo4j database. Query example: 'CREATE (n:<label_field>l {property_1, property_2})'. | Required |
Credentials | Username | textbox | User identity for connecting to Neo4j. | Required |
Credentials | Password | password | Password to use to connect to Neo4j. | Required |
Sink Data Types Mapping
CDAP Schema Data Types | Neo4j Data Types |
---|---|
null | null |
array | List |
boolean | Boolean |
long | Integer |
double | Float |
string | String |
bytes | ByteArray |
date | Date |
time-micros | Time |
timestamp-micros | DateTime |
Duration | |
Point |
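A sink could use this table to check an input schema before writing. The sketch below is a minimal illustration under assumptions: `CDAP_TO_NEO4J` and `neo4j_types_for` are hypothetical names, and the Duration and Point rows are omitted because their mappings are incomplete in the table.

```python
# CDAP schema type name -> Neo4j type name, copied from the complete rows of the table above.
CDAP_TO_NEO4J = {
    "null": "null",
    "array": "List",
    "boolean": "Boolean",
    "long": "Integer",
    "double": "Float",
    "string": "String",
    "bytes": "ByteArray",
    "date": "Date",
    "time-micros": "Time",
    "timestamp-micros": "DateTime",
}


def neo4j_types_for(fields):
    """Map each (field name, CDAP type) pair to a Neo4j type.

    Raises ValueError that lists every field whose type has no mapping,
    so that a pipeline can fail early instead of at write time.
    """
    unsupported = [name for name, cdap_type in fields if cdap_type not in CDAP_TO_NEO4J]
    if unsupported:
        raise ValueError("No Neo4j mapping for fields: " + ", ".join(unsupported))
    return {name: CDAP_TO_NEO4J[cdap_type] for name, cdap_type in fields}


if __name__ == "__main__":
    print(neo4j_types_for([("name", "string"), ("born", "long")]))
```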
Approach
Create a new Maven project in its own repository.
Pipeline Samples
Please attach one or more sample pipeline(s) and associated data.
Releases
Release X.Y.Z