The Neo4j database contains information about Persons and Movies and the relationships between them.
To find the movies related to the person 'Meg Ryan', the following CQL query can be used:

Code Block
MATCH (person:Person {name: "Meg Ryan"})-[rel]-(movie) RETURN person, rel, movie

This query returns the following result:

Graph view	Text view
[graph image]	"person"	"rel"	"movie"
	{"name":"Meg Ryan","born":1961}	{"roles":["DeDe","Angelica Graynamore","Patricia Graynamore"]}	{"title":"Joe Versus the Volcano", "tagline":"A story of love, lava and burning desire.", "released":1990}
	{"name":"Meg Ryan","born":1961}	{"roles":["Sally Albright"]}	{"title":"When Harry Met Sally", "tagline":"Can two friends sleep together and still love each other in the morning?", "released":1998}
	{"name":"Meg Ryan","born":1961}	{"roles":["Kathleen Kelly"]}	{"title":"You've Got Mail", "tagline":"At odds in life... in love on-line.", "released":1998}
	{"name":"Meg Ryan","born":1961}	{"roles":["Carole"]}	{"title":"Top Gun", "tagline":"I feel the need, the need for speed.", "released":1986}
	{"name":"Meg Ryan","born":1961}	{"roles":["Annie Reed"]}	{"title":"Sleepless in Seattle", "tagline":"What if someone you never met, someone you never saw, someone you never knew was the only someone for you?", "released":1993}

Another example of using CQL to get data from Neo4j:

Code Block
MATCH (person:Person {name: "Meg Ryan"})-[rel]-(movie) RETURN person.name AS name, rel.roles AS roles, movie.title AS title

This query returns the following result:

Text view
name	roles	title
"Meg Ryan"	["DeDe", "Angelica Graynamore", "Patricia Graynamore"]	"Joe Versus the Volcano"
"Meg Ryan"	["Sally Albright"]	"When Harry Met Sally"
"Meg Ryan"	["Kathleen Kelly"]	"You've Got Mail"
"Meg Ryan"	["Carole"]	"Top Gun"
"Meg Ryan"	["Annie Reed"]	"Sleepless in Seattle"

Source Splitter

The proposal is to add a "Splits Number" Source configuration property, which specifies the desired number of splits to divide the query into when reading from Neo4j.
Fewer splits may be created if the query cannot be divided into the desired number of splits.
The default value of this property can be '0', in which case the number of splits is determined by the number of map tasks (controlled by the "mapreduce.job.maps" property):

Code Block
public List<InputSplit> getSplits(JobContext job) throws IOException {
  ...
  int targetNumTasks = job.getConfiguration().getInt(MRJobConfig.NUM_MAPS, 1);
  ...
}
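The fallback rule described above can be sketched as a small helper. This is only an illustration under stated assumptions: the class name SplitCountResolver and its parameters are hypothetical, and capping the split count at the number of records is one possible reading of "fewer splits may be created".

```java
// Hypothetical helper, not part of the plugin: resolves the effective number of splits.
final class SplitCountResolver {
    private SplitCountResolver() { }

    /**
     * @param configuredSplits value of the "Splits Number" property; 0 means "not set"
     * @param numMapTasks      value of the "mapreduce.job.maps" property
     * @param totalRecords     total number of records returned by the input query
     */
    static int resolve(int configuredSplits, int numMapTasks, long totalRecords) {
        if (configuredSplits < 0) {
            throw new IllegalArgumentException("Splits Number must not be negative");
        }
        // 0 is treated as "use the number of map tasks".
        int desired = configuredSplits == 0 ? numMapTasks : configuredSplits;
        desired = Math.max(desired, 1);
        // Assumption: never create more splits than there are records, but always at least one.
        long capped = Math.min((long) desired, Math.max(totalRecords, 1L));
        return (int) capped;
    }
}
```

For example, resolve(0, 4, 100) yields 4 splits, while resolve(8, 4, 3) yields 3 because only three records are available.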

A 'MATCH ... RETURN COUNT(*)' CQL query can be used to get the total number of records, which are then divided between splits using 'SKIP' and 'LIMIT'.
Example:

Input query
Code Block
MATCH (person:Person) RETURN person
Order By
Code Block
person.born

In this case, each split runs the following query:

Code Block
MATCH (person:Person) RETURN person ORDER BY person.born SKIP x LIMIT y

where 'x' and 'y' are determined for each split based on 'Splits Number' and the total record count.
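One way to derive 'x' and 'y' is to spread the records as evenly as possible across the splits. The sketch below assumes that approach; the class name SplitQueryBuilder and its parameters are hypothetical, and the actual plugin may distribute records differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the plugin: builds one query per split.
final class SplitQueryBuilder {
    private SplitQueryBuilder() { }

    static List<String> build(String inputQuery, String orderBy, long totalRecords, int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be at least 1");
        }
        List<String> queries = new ArrayList<>();
        long base = totalRecords / numSplits;       // records every split receives
        long remainder = totalRecords % numSplits;  // the first 'remainder' splits receive one extra
        long skip = 0;
        for (int i = 0; i < numSplits; i++) {
            long limit = base + (i < remainder ? 1 : 0);
            if (limit == 0) {
                break; // fewer records than splits: stop creating splits
            }
            queries.add(String.format(Locale.ROOT, "%s ORDER BY %s SKIP %d LIMIT %d",
                inputQuery, orderBy, skip, limit));
            skip += limit;
        }
        return queries;
    }
}
```

With the input query and Order By value from the example, 10 records and 3 splits, this produces queries with SKIP 0 LIMIT 4, SKIP 4 LIMIT 3 and SKIP 7 LIMIT 3. A stable ordering is what makes the SKIP/LIMIT windows non-overlapping, which is why Order By is required for more than one split.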

Source Properties

Section	User Facing Name	Widget Type	Description	Constraints
General	Label	textbox	Label for UI.
	Reference Name	textbox	Uniquely identified name for lineage.	Required
	Neo4j Host	textbox	Neo4j database host.	Required
	Neo4j Port	number	Neo4j database port.	Required
	Input Query	textbox	The query to use to import data from the Neo4j database. Query example: 'MATCH (n:Label) RETURN n.property_1, n.property_2'.	Required
Credentials	Username	textbox	User identity for connecting to Neo4j.	Required
	Password	password	Password for connecting to Neo4j.	Required
Advanced	Splits Number	number	The number of splits to generate. If set to 1, Order By is not needed.
	Order By	textbox	Name of the field used for ordering during split generation. Required unless Splits Number is set to 1.

Source Data Types Mapping

Neo4j Data Types	CDAP Schema Data Types
null	null
List	array
Map	record
Boolean	boolean
Integer	long
Float	double
String	string
ByteArray	bytes
Date	date
Time	time-micros
LocalTime	time-micros
DateTime	timestamp-micros
LocalDateTime	timestamp-micros

Node

https://neo4j.com/docs/cypher-manual/3.5/syntax/values/#structural-types

record

Schema example:

Code Block

{"name": "n", "type": {
	"type": "record", "name": "n", "fields": [
		{"name": "born", "type": "long"},
		{"name": "name", "type": "string"},
		{"name": "_id", "type": "long"},
		{"name": "_labels", "type": {"type": "array", "items": "string"}}
	]
}}
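The Node schema combines the node's own properties with two extra fields, "_id" and "_labels". A minimal sketch of that flattening is shown below; the class NodeRecordSketch and its plain-map representation are hypothetical and stand in for whatever record type the plugin actually builds.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: flattens a node into the field layout of the schema example.
final class NodeRecordSketch {
    private NodeRecordSketch() { }

    static Map<String, Object> toRecord(long id, List<String> labels, Map<String, Object> properties) {
        // Start with the node's own properties, e.g. "born" and "name".
        Map<String, Object> record = new LinkedHashMap<>(properties);
        // Add the identity and label fields named in the schema example.
        record.put("_id", id);
        record.put("_labels", labels);
        return record;
    }
}
```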

Relationship

https://neo4j.com/docs/cypher-manual/3.5/syntax/values/#structural-types

record

Schema example:

Code Block

{"name": "r", "type": {
    "type": "record", "name": "r", "fields": [
        {"name": "_startId", "type": "long"},
        {"name": "roles", "type": {"type": "array", "items": "string"}},
        {"name": "_type", "type": "string"},
        {"name": "_endId", "type": "long"},
        {"name": "_id", "type": "long"}
    ]
}}

Duration

A Duration represents a temporal amount, capturing the difference in time between two instants, and can be negative.
https://neo4j.com/docs/cypher-manual/3.5/syntax/temporal/#cypher-temporal-durations

record

Schema example:

Code Block

{"name": "dr", "type": {
    "type": "record", "name": "dr", "fields": [
        {"name": "duration", "type": "string"},
        {"name": "seconds", "type": "long"},
        {"name": "months", "type": "long"},
        {"name": "days", "type": "long"},
        {"name": "nanoseconds", "type": "int"}
    ]
}}

Point

https://neo4j.com/docs/cypher-manual/3.5/syntax/spatial/

record

Schema example:

Code Block

{"name": "p", "type": {
    "type": "record", "name": "p", "fields": [
        {"name": "crs", "type": "string"},
        {"name": "x", "type": "double"},
        {"name": "y", "type": "double"},
        {"name": "srid", "type": "string"}
    ]
}}

Path

https://neo4j.com/docs/cypher-manual/3.5/syntax/values/#structural-types

Sink Properties

Section	User Facing Name	Widget Type	Description	Constraints
General	Label	textbox	Label for UI.
	Reference Name	textbox	Uniquely identified name for lineage.	Required
	Neo4j Host	textbox	Neo4j database host.	Required
	Neo4j Port	number	Neo4j database port.	Required
	Output Query	textbox	The query to use to export data to the Neo4j database. Query example: 'CREATE (n:<label_field> $(property_1, property_2))' or 'CREATE (n:<label_field> $(*))'	Required
Credentials	Username	textbox	User identity for connecting to Neo4j.	Required
	Password	password	Password for connecting to Neo4j.	Required

Output query additional information

The output query is based on CQL syntax, but using a CQL query with CDAP poses several problems:

neo4j-jdbc-driver can process property values only if they are primitive types or arrays thereof.
It is difficult to relate the output data to the CQL query.

To solve these problems, the following solution is proposed:
The $(...) structure marks the place where properties will be inserted.
Example of using $(...):
List of output fields: ["name", "age", "profesion", "company", "rating", "position"]

Output query	Expected results
CREATE (n:Node $(*))	A node with label Node and properties ["name", "age", "profesion", "company", "rating", "position"] is created.
CREATE (p:Person $(name, age, profesion)), (c:Company $(company, rating))	A node with label Person and properties ["name", "age", "profesion"] is created. A node with label Company and properties ["company", "rating"] is created.
CREATE (p:Person $(name, profesion))-[r:WorkOn $(position)]->(c:Company $(company))	A node with label Person and properties ["name", "profesion"] is created. A relationship with type WorkOn and properties ["position"] is created. A node with label Company and properties ["company"] is created.
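The expansion of $(...) can be sketched as a simple text rewrite. The example below is an assumption, not the plugin's implementation: the class OutputQueryExpander is hypothetical, and it rewrites each placeholder into a Cypher property map with named parameters ($name), whereas the actual sink may bind values differently, for example through positional JDBC parameters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of expanding $(...) placeholders in an output query.
final class OutputQueryExpander {
    // Matches "$(" followed by anything except ')' and then ")".
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\(([^)]*)\\)");

    private OutputQueryExpander() { }

    static String expand(String outputQuery, List<String> outputFields) {
        Matcher matcher = PLACEHOLDER.matcher(outputQuery);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String body = matcher.group(1).trim();
            List<String> names = new ArrayList<>();
            if (body.equals("*")) {
                names.addAll(outputFields); // $(*) means "all output fields"
            } else {
                for (String name : body.split(",")) {
                    String trimmed = name.trim();
                    if (!outputFields.contains(trimmed)) {
                        throw new IllegalArgumentException("Unknown field: " + trimmed);
                    }
                    names.add(trimmed);
                }
            }
            // Build a property map such as {name: $name, age: $age}.
            StringBuilder map = new StringBuilder("{");
            for (int i = 0; i < names.size(); i++) {
                if (i > 0) {
                    map.append(", ");
                }
                map.append(names.get(i)).append(": $").append(names.get(i));
            }
            map.append("}");
            // quoteReplacement keeps '$' from being read as a group reference.
            matcher.appendReplacement(result, Matcher.quoteReplacement(map.toString()));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

Validating each listed name against the output fields is a design choice made here so that a misspelled field fails early instead of producing a query with an unbound parameter.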

Sink Data Types Mapping

CDAP Schema Data Types	Neo4j Data Types
null	null
array	List
boolean	Boolean
long	Integer
double	Float
string	String
bytes	ByteArray
date	Date
time-micros	Time
timestamp-micros	DateTime
	Duration
	Point
