Hi,

I wanted to give you a little update. I created a non-parallel
InputFormat which reads Cypher results from Neo4j into Tuples [1].
It can be used like the JDBCInputFormat:

String q = "MATCH (p1:Page)-[:Link]->(p2) RETURN id(p1), id(p2)";

Neo4jInputFormat<Tuple2<Integer, Integer>> neoInput = Neo4jInputFormat.buildNeo4jInputFormat()
      .setRestURI(restURI)
      .setCypherQuery(q)
      .setUsername("neo4j")
      .setPassword("test")
      .setConnectTimeout(1000)
      .setReadTimeout(1000)
      .finish();
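
For context, the resulting format would then be wired into a Flink job
roughly like this (a sketch only, assuming the usual DataSet API pattern
for input formats; the exact type hints may differ from what the
repository actually uses):

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// continues the snippet above: neoInput produces Tuple2<Integer, Integer>
DataSet<Tuple2<Integer, Integer>> edges = env.createInput(
      neoInput,
      new TupleTypeInfo<Tuple2<Integer, Integer>>(
            BasicTypeInfo.INT_TYPE_INFO,
            BasicTypeInfo.INT_TYPE_INFO));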

At the moment, to run the tests, a Neo4j instance needs to be up and running.
I tried to get neo4j-harness [2] into the project, but there are some
dependency conflicts which I need to figure out.

I ran a first benchmark on my laptop (4 cores, 12GB RAM, SSD) with Neo4j
running on the same machine. My dataset is the Polish wiki dump [3],
which consists of 430,602 pages and 2,727,302 links. The protocol is:

1) Read edge ids from cold Neo4j into Tuple2<Integer, Integer>
2) Convert Tuple2 into Tuple3<Integer, Integer, Double> for edge value
3) Create Gelly graph from Tuple3, init vertex values to 1.0
4) Run PageRank with beta=0.85 and 5 iterations
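
To make steps 3 and 4 concrete, here is a tiny self-contained sketch of
the rank update that PageRank performs (the class name and toy graph are
mine, and Gelly's library implementation differs in details such as how
the teleport term is normalized):

```java
import java.util.Arrays;

// Toy sketch of the PageRank update from step 4; not Gelly's actual code.
public class PageRankSketch {

    // out[v] lists the targets of v's out-edges.
    public static double[] run(int[][] out, double beta, int iterations) {
        int n = out.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0); // step 3: init all vertex values to 1.0
        for (int i = 0; i < iterations; i++) {
            double[] next = new double[n];
            for (int v = 0; v < n; v++) {
                // distribute v's rank evenly along its out-edges
                for (int target : out[v]) {
                    next[target] += beta * rank[v] / out[v].length;
                }
            }
            for (int v = 0; v < n; v++) {
                next[v] += 1 - beta; // teleport/damping term; keeps total mass at n
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        int[][] out = {{1, 2}, {2}, {0}}; // edges 0->1, 0->2, 1->2, 2->0
        double[] rank = run(out, 0.85, 5); // beta=0.85, 5 iterations as above
        System.out.println(Arrays.toString(rank)); // ranks sum to 3.0
    }
}
```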

This takes about 22 seconds on my machine, which is very promising.

Next steps are:
- OutputFormat
- Better Unit tests (integrate neo4j-harness)
- Bigger graphs :)

Any ideas and suggestions are of course highly appreciated :)

Best,
Martin


[1] https://github.com/s1ck/flink-neo4j
[2] http://neo4j.com/docs/stable/server-unmanaged-extensions-testing.html
[3] plwiktionary-20151002-pages-articles-multistream.xml.bz2


On 29.10.2015 14:51, Vasiliki Kalavri wrote:
Hello everyone,

Martin, Martin, Alex (cc'ed) and myself have started discussing the
implementation of a neo4j-Flink connector. I've opened a corresponding JIRA
(FLINK-2941) containing an initial document [1], but we'd also like to
share our ideas here to engage the community and get your feedback.

We had a Skype call today and I will try to summarize some of the key
points here. The main use-cases we see are the following:

- Use Flink for ETL / creating the graph and then insert it to a graph
database, like neo4j, for querying and search.
- Query neo4j on some topic or search the graph for patterns and extract a
subgraph, on which we'd then like to run some iterative graph analysis
task. This is where Flink/Gelly can help, by complementing the querying
(neo4j) with efficient iterative computation.

We all agreed that the main challenge is efficiently getting the data out
of neo4j and into Flink. There have been some attempts to do similar things
with neo4j and Spark, but the performance results are not very promising:

- Mazerunner [2] uses HDFS for communication. We don't think it's worth
going in this direction, as dumping the neo4j database to HDFS and then
reading it back into Flink would probably be terribly slow.
- In [3], you can see Michael Hunger's findings on using neo's HTTP
interface to import data into Spark, run PageRank and then put data back
into neo4j. It seems that this took > 2h for a 125m edge graph. The main
bottlenecks appear to be (1) reading the data as an RDD => this had to be
performed in small batches to avoid OOM errors and (2) the PageRank
computation itself, which seems weird to me.

We decided to experiment with neo4j HTTP and Flink and we'll report back
when we have some results.

In the meantime, if you have any ideas on how we could speed up reading
from neo4j or any suggestion on approaches that I haven't mentioned, please
feel free to reply to this e-mail or add your comment in the shared
document.

Cheers,
-Vasia.

[1]:
https://docs.google.com/document/d/13qT_e-y8aTNWQnD43jRBq1074Y1LggPNDsic_Obwc28/edit?usp=sharing
[2]: https://github.com/kbastani/neo4j-mazerunner
[3]: https://gist.github.com/jexp/0dfad34d49a16000e804
