Wow, very nice results :-) This input format alone is probably a very useful contribution, so I would open a contribution for it once you manage to get a few tests running.
I know little about Neo4j: is there a way to read Cypher query results in
parallel? (Most systems do not expose such an interface, but maybe Neo4j is
special there.) I also recall Martin Neumann asking at some point about a way
to create dense, contiguous, unique IDs for graphs that can be bulk-imported
into Neo4j. There is code for that in the DataSet utils
(DataSetUtils.zipWithIndex); this may be valuable for an output format.
Two rough sketches (the protocol below as a Gelly job, and the dense-id
assignment) are at the very bottom of this mail, below the quoted thread.

On Sat, Oct 31, 2015 at 9:51 AM, Martin Junghanns <m.jungha...@mailbox.org> wrote:

> Hi,
>
> I wanted to give you a little update. I created a non-parallel
> InputFormat which reads Cypher results from Neo4j into Tuples [1].
> It can be used like the JDBCInputFormat:
>
> String q = "MATCH (p1:Page)-[:Link]->(p2) RETURN id(p1), id(p2)";
>
> Neo4jInputFormat<Tuple2<Integer, Integer>> neoInput =
>     Neo4jInputFormat.buildNeo4jInputFormat()
>         .setRestURI(restURI)
>         .setCypherQuery(q)
>         .setUsername("neo4j")
>         .setPassword("test")
>         .setConnectTimeout(1000)
>         .setReadTimeout(1000)
>         .finish();
>
> At the moment, a Neo4j instance needs to be up and running in order to run
> the tests. I tried to get neo4j-harness [2] into the project, but there
> are some dependency conflicts which I need to figure out.
>
> I ran a first benchmark on my laptop (4 cores, 12 GB, SSD) with Neo4j
> running on the same machine. My dataset is the Polish wiki dump [3],
> which consists of 430,602 pages and 2,727,302 links. The protocol is:
>
> 1) Read edge ids from a cold Neo4j into Tuple2<Integer, Integer>
> 2) Convert each Tuple2 into a Tuple3<Integer, Integer, Double> for the
>    edge value
> 3) Create a Gelly graph from the Tuple3s, initializing vertex values to 1.0
> 4) Run PageRank with beta=0.85 and 5 iterations
>
> This takes about 22 seconds on my machine, which is very promising.
>
> Next steps are:
> - OutputFormat
> - Better unit tests (integrate neo4j-harness)
> - Bigger graphs :)
>
> Any ideas and suggestions are of course highly appreciated :)
>
> Best,
> Martin
>
> [1] https://github.com/s1ck/flink-neo4j
> [2] http://neo4j.com/docs/stable/server-unmanaged-extensions-testing.html
> [3] plwiktionary-20151002-pages-articles-multistream.xml.bz2
>
> On 29.10.2015 14:51, Vasiliki Kalavri wrote:
>
>> Hello everyone,
>>
>> Martin, Martin, Alex (cc'ed) and I have started discussing the
>> implementation of a neo4j-Flink connector. I've opened a corresponding
>> JIRA (FLINK-2941) containing an initial document [1], but we'd also like
>> to share our ideas here to engage the community and get your feedback.
>>
>> We had a Skype call today and I will try to summarize some of the key
>> points here. The main use cases we see are the following:
>>
>> - Use Flink for ETL / creating the graph and then insert it into a graph
>> database, like Neo4j, for querying and search.
>> - Query Neo4j on some topic or search the graph for patterns and extract
>> a subgraph, on which we'd then like to run some iterative graph analysis
>> task. This is where Flink/Gelly can help, by complementing the querying
>> (Neo4j) with efficient iterative computation.
>>
>> We all agreed that the main challenge is efficiently getting the data
>> out of Neo4j and into Flink. There have been some attempts to do similar
>> things with Neo4j and Spark, but the performance results are not very
>> promising:
>>
>> - Mazerunner [2] is using HDFS for communication. We think it's not
>> worth going in this direction, as dumping the Neo4j database to HDFS and
>> then reading it back into Flink would probably be terribly slow.
>> - In [3], you can see Michael Hunger's findings on using Neo4j's HTTP
>> interface to import data into Spark, run PageRank and then put the data
>> back into Neo4j. It seems that this took more than 2 hours for a
>> 125M-edge graph. The main bottlenecks appear to be (1) reading the data
>> as an RDD, which had to be done in small batches to avoid OOM errors,
>> and (2) the PageRank computation itself, which seems weird to me.
>>
>> We decided to experiment with the Neo4j HTTP interface and Flink and
>> we'll report back when we have some results.
>>
>> In the meantime, if you have any ideas on how we could speed up reading
>> from Neo4j, or any suggestions on approaches that I haven't mentioned,
>> please feel free to reply to this e-mail or add your comment in the
>> shared document.
>>
>> Cheers,
>> -Vasia.
>>
>> [1]: https://docs.google.com/document/d/13qT_e-y8aTNWQnD43jRBq1074Y1LggPNDsic_Obwc28/edit?usp=sharing
>> [2]: https://github.com/kbastani/neo4j-mazerunner
>> [3]: https://gist.github.com/jexp/0dfad34d49a16000e804
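PS: Here is a rough, untested sketch of how steps 2)-4) of Martin's protocol
above could look as a Gelly job. The fromElements source is only a stand-in
for the Neo4jInputFormat, and the library PageRank class and its
(beta, maxIterations) constructor may differ between Gelly versions:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.graph.Graph;
import org.apache.flink.graph.Vertex;
import org.apache.flink.graph.library.PageRank;

public class Neo4jPageRankSketch {

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // placeholder for step 1): in the real job these tuples would come
    // from the Neo4jInputFormat shown in Martin's mail
    DataSet<Tuple2<Integer, Integer>> edgeIds = env.fromElements(
        new Tuple2<Integer, Integer>(1, 2),
        new Tuple2<Integer, Integer>(2, 3),
        new Tuple2<Integer, Integer>(3, 1));

    // step 2): attach a dummy edge value
    DataSet<Tuple3<Integer, Integer, Double>> edges = edgeIds.map(
        new MapFunction<Tuple2<Integer, Integer>, Tuple3<Integer, Integer, Double>>() {
          @Override
          public Tuple3<Integer, Integer, Double> map(Tuple2<Integer, Integer> t) {
            return new Tuple3<Integer, Integer, Double>(t.f0, t.f1, 1.0);
          }
        });

    // step 3): build the Gelly graph, initializing every vertex value to 1.0
    Graph<Integer, Double, Double> graph = Graph.fromTupleDataSet(edges,
        new MapFunction<Integer, Double>() {
          @Override
          public Double map(Integer vertexId) {
            return 1.0;
          }
        }, env);

    // step 4): run PageRank with beta = 0.85 for 5 iterations
    DataSet<Vertex<Integer, Double>> ranks =
        graph.run(new PageRank<Integer>(0.85, 5));

    ranks.print();
  }
}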
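And the dense-id idea from the top of this mail: DataSetUtils.zipWithIndex
assigns contiguous ids 0..n-1, which could back a Neo4j bulk-import
OutputFormat. The vertex keys and the CSV layout below are placeholders,
not what neo4j-import expects verbatim:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.DataSetUtils;

public class DenseIdSketch {

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // vertices identified by arbitrary, non-contiguous keys
    // (placeholder data; in practice this would be the vertex set of the graph)
    DataSet<String> vertexKeys = env.fromElements("pageA", "pageB", "pageC");

    // zipWithIndex assigns dense, contiguous, unique ids 0..n-1
    DataSet<Tuple2<Long, String>> indexed = DataSetUtils.zipWithIndex(vertexKeys);

    // write an "id,key" CSV that a bulk importer could pick up
    indexed.writeAsCsv("/tmp/vertices.csv", "\n", ",");

    env.execute("dense id assignment");
  }
}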