...
The Neo4j database contains information about persons and movies and the relationships between them.
To find the movies related to the person 'Meg Ryan', the following CQL query can be used:
Code Block |
---|
MATCH (person:Person {name: "Meg Ryan"})-[rel]-(movie) RETURN person, rel, movie |
The result of this query is as follows:
Graph view | Text view | ||
---|---|---|---|
"person" | "rel" | "movie" | |
{"name":"Meg Ryan","born":1961} | {"roles":["DeDe","Angelica Graynamore","Patricia Graynamore"]} | {"title":"Joe Versus the Volcano", | |
{"name":"Meg Ryan","born":1961} | {"roles":["Sally Albright"]} | {"title":"When Harry Met Sally", | |
{"name":"Meg Ryan","born":1961} | {"roles":["Kathleen Kelly"]} | {"title":"You've Got Mail", | |
{"name":"Meg Ryan","born":1961} | {"roles":["Carole"]} | {"title":"Top Gun", | |
{"name":"Meg Ryan","born":1961} | {"roles":["Annie Reed"]} | {"title":"Sleepless in Seattle", |
Input Query example (the query used to import data from the Neo4j database): 'MATCH (n:Label) RETURN n.property_1, n.property_2'.
A Node value returned by a query is mapped to a CDAP record. Schema example for a node returned as n:
Code Block |
---|
{"name": "n", "type": {
"type": "record", "name": "n", "fields": [
{"name": "born", "type": "long"},
{"name": "name", "type": "string"},
{"name": "_id", "type": "long"},
{"name": "_labels", "type": {"type": "array", "items": "string"}}
]
}} |
Another way to get data from Neo4j with CQL is to return individual properties instead of whole nodes and relationships:
Code Block |
---|
MATCH (person:Person {name: "Meg Ryan"})-[rel]-(movie) RETURN person.name AS name, rel.roles AS roles, movie.title AS title |
The result of this query is as follows:
Text view | ||
---|---|---|
name | roles | title |
"Meg Ryan" | ["DeDe", "Angelica Graynamore", "Patricia Graynamore"] | "Joe Versus the Volcano" |
"Meg Ryan" | ["Sally Albright"] | "When Harry Met Sally" |
"Meg Ryan" | ["Kathleen Kelly"] | "You've Got Mail" |
"Meg Ryan" | ["Carole"] | "Top Gun" |
"Meg Ryan" | ["Annie Reed"] | "Sleepless in Seattle" |
Source Splitter
The proposal is to add a "Splits Number" source configuration property that specifies the desired number of splits to divide the query into when reading from Neo4j.
Fewer splits may be created if the query cannot be divided into the desired number of splits.
Alternatively, '0' can be used as the default value for this property, in which case the number of splits is determined by the number of map tasks (controlled by the "mapreduce.job.maps" property):
Code Block |
---|
public List<InputSplit> getSplits(JobContext job) throws IOException {
...
int targetNumTasks = job.getConfiguration().getInt(MRJobConfig.NUM_MAPS, 1);
... |
A 'MATCH ... RETURN COUNT(*)' CQL query can be used to get the total number of records, which are then divided between the splits using 'SKIP' and 'LIMIT'.
Example:
Input Query
Code Block |
---|
MATCH (person:Person) RETURN person |
Order By
Code Block |
---|
person.born |
In this case, each split runs the following query:
Code Block |
---|
MATCH (person:Person) RETURN person ORDER BY person.born SKIP x LIMIT y |
where 'x' and 'y' are determined for each split based on 'Splits Number' and the total number of records.
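The way 'x' and 'y' could be derived is sketched below. This is only an illustration under stated assumptions, not the plugin's implementation: the class `SplitQueryPlanner` and its methods are hypothetical names, the total record count is assumed to come from the 'MATCH ... RETURN COUNT(*)' query, and the remainder is assumed to be spread over the first splits. Plain values stand in for the job configuration so that the sketch has no Hadoop dependency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not part of the plugin): plans one SKIP/LIMIT query per split. */
final class SplitQueryPlanner {

  /** Assumption: a "Splits Number" of 0 means "use the number of map tasks". */
  static int effectiveSplits(int splitsNumber, int numMapTasks) {
    return splitsNumber > 0 ? splitsNumber : Math.max(1, numMapTasks);
  }

  /** Builds the per-split queries; fewer splits are produced when there are fewer records than splits. */
  static List<String> splitQueries(String inputQuery, String orderBy, long totalRecords, int splits) {
    if (splits < 1) {
      throw new IllegalArgumentException("splits must be >= 1");
    }
    List<String> queries = new ArrayList<>();
    if (totalRecords <= 0) {
      return queries; // nothing to read, so no splits
    }
    long actualSplits = Math.min(splits, totalRecords);
    long base = totalRecords / actualSplits;      // minimum records per split
    long remainder = totalRecords % actualSplits; // the first 'remainder' splits get one extra record
    long skip = 0;
    for (long i = 0; i < actualSplits; i++) {
      long limit = base + (i < remainder ? 1 : 0);
      queries.add(String.format(Locale.ROOT, "%s ORDER BY %s SKIP %d LIMIT %d",
          inputQuery, orderBy, skip, limit));
      skip += limit;
    }
    return queries;
  }
}
```

With 10 records and 3 splits, the sketch yields the ranges SKIP 0 LIMIT 4, SKIP 4 LIMIT 3 and SKIP 7 LIMIT 3. A stable ordering matters here: without ORDER BY, consecutive SKIP/LIMIT windows are not guaranteed to cover each record exactly once.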
Source Properties
Section | User Facing Name | Widget Type | Description | Constraints |
---|---|---|---|---|
General | Label | textbox | Label for UI. | |
General | Reference Name | textbox | Unique name used to identify this source for lineage. | Required |
General | Neo4j Host | textbox | Neo4j database host. | Required |
General | Neo4j Port | number | Neo4j database port. | Required |
General | Input Query | textbox | The query to use to import data from the Neo4j database. | Required |
Credentials | Username | textbox | User identity for connecting to Neo4j. | Required |
Credentials | Password | password | Password to use to connect to Neo4j. | Required |
Advanced | Splits Number | number | The number of splits to generate. If set to one, Order By is not needed. | |
Advanced | Order By | textbox | Field name used for ordering during splits generation. Required unless Splits Number is set to one. | |
Source Data Types Mapping
Query example: 'CREATE (n:<label_field>l {property_1, property_2})'.
Neo4j Data Types | CDAP Schema Data Types | ||
---|---|---|---|
null | null | ||
List | array | ||
Map | record | ||
Boolean | boolean | ||
Integer | long | ||
Float | double | ||
String | string | ||
ByteArray | bytes | ||
Date | date | ||
Time | time-micros | ||
LocalTime | time-micros | ||
DateTime | timestamp-micros | ||
LocalDateTime | timestamp-micros | ||
Node (https://neo4j.com/docs/cypher-manual/3.5/syntax/values/#structural-types) | record | ||
Relationship (https://neo4j.com/docs/cypher-manual/3.5/syntax/values/#structural-types) | record | ||
Duration (a temporal amount that captures the difference in time between two instants; can be negative) | record | ||
Point | record | ||
Path (https://neo4j.com/docs/cypher-manual/3.5/syntax/values/#structural-types) | | ||
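The temporal rows of the mapping above can be illustrated with a short sketch. It assumes that Neo4j temporal values reach the plugin as `java.time` objects and that the CDAP logical types follow the usual encoding: date as days since the epoch, time-micros as microseconds since midnight, timestamp-micros as microseconds since the epoch. The class `TemporalConversions` is a hypothetical name, and treating LocalDateTime as UTC and dropping the offset of Time are assumptions made only for this example.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.OffsetTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper (not part of the plugin): Neo4j temporal values to CDAP logical type values. */
final class TemporalConversions {

  /** Date -> date: days since 1970-01-01. */
  static int toDate(LocalDate date) {
    return Math.toIntExact(date.toEpochDay());
  }

  /** LocalTime -> time-micros: microseconds since midnight. */
  static long toTimeMicros(LocalTime time) {
    return time.toNanoOfDay() / 1_000L;
  }

  /** Time -> time-micros. Assumption: the zone offset is dropped. */
  static long toTimeMicros(OffsetTime time) {
    return toTimeMicros(time.toLocalTime());
  }

  /** DateTime -> timestamp-micros: microseconds since the epoch. */
  static long toTimestampMicros(ZonedDateTime dateTime) {
    return ChronoUnit.MICROS.between(Instant.EPOCH, dateTime.toInstant());
  }

  /** LocalDateTime -> timestamp-micros. Assumption: the value is interpreted as UTC. */
  static long toTimestampMicros(LocalDateTime dateTime) {
    return toTimestampMicros(dateTime.atZone(ZoneOffset.UTC));
  }
}
```

Time and LocalTime both map to time-micros in the table, so any offset information carried by a Time value cannot survive the conversion; a real implementation would need to decide whether to normalize to UTC first.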
Sink Properties
Section | User Facing Name | Widget Type | Description | Constraints |
---|---|---|---|---|
General | Label | textbox | Label for UI. | |
General | Reference Name | textbox | Unique name used to identify this sink for lineage. | Required |
General | Neo4j Host | textbox | Neo4j database host. | Required |
General | Neo4j Port | number | Neo4j database port. | Required |
General | Output Query | textbox | The query to use to export data to the Neo4j database. | Required |
Credentials | Username | textbox | User identity for connecting to Neo4j. | Required |
Credentials | Password | password | Password to use to connect to Neo4j. | Required |
Output query additional information
The output query is based on CQL syntax, but using a CQL query with CDAP has several problems:
- neo4j-jdbc-driver can process property values only if they are primitive types or arrays thereof.
- It is difficult to relate the output data to the CQL query.
To solve these problems, the following solution is proposed:
Use the $(...) construct to mark the places where properties will be inserted.
Example of using $(...):
List of output fields: ["name", "age", "profession", "company", "rating", "position"]
Output query | Expected results |
---|---|
CREATE (n:Node $(*)) | Creates a node with label Node and properties ["name", "age", "profession", "company", "rating", "position"]. |
CREATE (p:Person $(name, age, profession)), (c:Company $(company, rating)) | Creates a node with label Person and properties ["name", "age", "profession"]. Creates a node with label Company and properties ["company", "rating"]. |
CREATE (p:Person $(name, profession))-[r:WorkOn $(position)]->(c:Company $(company)) | Creates a node with label Person and properties ["name", "profession"]. Creates a relationship with type WorkOn and properties ["position"]. Creates a node with label Company and properties ["company"]. |
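One possible way to expand the $(...) construct is sketched below. The proposal does not specify the generated query text, so the expansion into a Cypher property map with named parameters (for example `{name: $name}`) is an assumption made for illustration, as are the class name `OutputQueryExpander` and its `expand` method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the plugin): expands $(...) into Cypher property maps. */
final class OutputQueryExpander {

  // Matches "$(" followed by anything up to the first ")".
  private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\(([^)]*)\\)");

  static String expand(String outputQuery, List<String> outputFields) {
    Matcher matcher = PLACEHOLDER.matcher(outputQuery);
    StringBuffer result = new StringBuffer();
    while (matcher.find()) {
      String body = matcher.group(1).trim();
      List<String> fields = new ArrayList<>();
      if (body.equals("*")) {
        fields.addAll(outputFields); // $(*) selects every output field
      } else {
        for (String part : body.split(",")) {
          String field = part.trim();
          if (!outputFields.contains(field)) {
            throw new IllegalArgumentException("Unknown output field: '" + field + "'");
          }
          fields.add(field);
        }
      }
      // Assumed target form: {field: $field, ...} with one named parameter per field.
      StringBuilder map = new StringBuilder("{");
      for (int i = 0; i < fields.size(); i++) {
        if (i > 0) {
          map.append(", ");
        }
        map.append(fields.get(i)).append(": $").append(fields.get(i));
      }
      map.append("}");
      // quoteReplacement is needed because "$" is special in replacement strings.
      matcher.appendReplacement(result, Matcher.quoteReplacement(map.toString()));
    }
    matcher.appendTail(result);
    return result.toString();
  }
}
```

Under these assumptions, `CREATE (n:Node $(*))` with output fields `["name", "age"]` becomes `CREATE (n:Node {name: $name, age: $age})`. Keeping the values as parameters rather than inlining them leaves type handling to the driver, which fits the constraint that only primitive values and arrays of them can be passed as properties.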
Sink Data Types Mapping
CDAP Schema Data Types | Neo4j Data Types |
---|---|
null | null |
array | List |
boolean | Boolean |
long | Integer |
double | Float |
string | String |
bytes | ByteArray |
date | Date |
time-micros | Time |
timestamp-micros | DateTime |
Duration | |
Point |
...