Re: GraphX partition problem

2014-05-28 Thread Ankur Dave
I've been trying to reproduce this but I haven't succeeded so far. For example, on the web-Google graph, I get the expected results both on v0.9.1-handle-empty-partitions and on master:
// Load web-Google and run connected components
import org.apache

RE: GraphX partition problem

2014-05-28 Thread Zhicharevich, Alex
below? Can you advise on how to solve this issue? Thanks, Alex From: Ankur Dave [mailto:ankurd...@gmail.com] Sent: Thursday, May 22, 2014 6:59 PM To: user@spark.apache.org Subject: Re: GraphX partition problem The fix will be included in Spark 1.0, but if you jus

RE: GraphX partition problem

2014-05-26 Thread Zhicharevich, Alex
Can we do better with Bagel somehow? Control how we store the graph? From: Ankur Dave [mailto:ankurd...@gmail.com] Sent: Monday, May 26, 2014 12:13 PM To: user@spark.apache.org Subject: Re: GraphX partition problem GraphX only performs sequential scans over the edges, so we could in theory

Re: GraphX partition problem

2014-05-26 Thread Ankur Dave
GraphX only performs sequential scans over the edges, so we could in theory store them on disk and stream through them, but we haven't implemented this yet. In-memory storage is the only option for now. Ankur

RE: GraphX partition problem

2014-05-25 Thread Zhicharevich, Alex
I’m not sure about 1.2TB, but I can give it a shot. Is there some way to persist intermediate results to disk? Does the whole graph have to be in memory? Alex From: Ankur Dave [mailto:ankurd...@gmail.com] Sent: Monday, May 26, 2014 12:23 AM To: user@spark.apache.org Subject: Re: GraphX partition

Re: GraphX partition problem

2014-05-25 Thread Ankur Dave
Once the graph is built, edges are stored in parallel primitive arrays, so each edge should only take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). Unfortunately, the current implementation in EdgePartitionBuilder uses an array of Edge objects as an intermediate representation for sortin
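The 20-byte figure above follows from the field sizes: two 8-byte Longs plus one 4-byte Int. A minimal sketch of that arithmetic, assuming the edges really are held in parallel primitive arrays as described; the object name, the method name, and the edge count are hypothetical and not taken from the thread or from GraphX. It ignores array headers and the larger intermediate Edge-object representation mentioned above, so treat the result as a lower bound rather than a measurement.

```scala
// Back-of-the-envelope estimate of edge storage once the graph is built.
// Assumption: each edge is (srcId: Long, dstId: Long, attr: Int) stored in
// parallel primitive arrays. All names here are illustrative only.
object EdgeMemoryEstimate {
  val BytesPerLong: Long = 8L // a JVM Long is 64 bits
  val BytesPerInt: Long = 4L  // a JVM Int is 32 bits

  // srcId + dstId + attr
  val BytesPerEdge: Long = 2 * BytesPerLong + BytesPerInt

  // Bytes for the primitive arrays alone; array headers and any
  // intermediate Edge objects used while building are not counted.
  def edgeBytes(numEdges: Long): Long = numEdges * BytesPerEdge

  def main(args: Array[String]): Unit = {
    val numEdges = 1000000000L // hypothetical edge count, not from the thread
    val gib = edgeBytes(numEdges).toDouble / (1L << 30)
    println(f"$numEdges%d edges -> ${edgeBytes(numEdges)}%d bytes ($gib%.1f GiB)")
  }
}
```

Plugging in the real edge count of a graph gives a quick check on whether the built graph could fit in cluster memory at all, separately from the extra memory needed during construction.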

RE: GraphX partition problem

2014-05-25 Thread Zhicharevich, Alex
: user@spark.apache.org Subject: Re: GraphX partition problem The fix will be included in Spark 1.0, but if you just want to apply the fix to 0.9.1, here's a hotfixed version of 0.9.1 that only includes PR #367: https://github.com/ankurdave/spark/tree/v0.9.1-handle-empty-partitions. You can

Re: GraphX partition problem

2014-05-22 Thread Ankur Dave
The fix will be included in Spark 1.0, but if you just want to apply the fix to 0.9.1, here's a hotfixed version of 0.9.1 that only includes PR #367: https://github.com/ankurdave/spark/tree/v0.9.1-handle-empty-partitions. You can clone and build this. Ankur On Thu, Ma

GraphX partition problem

2014-05-22 Thread Zhicharevich, Alex
Hi, I'm running simple connected components code using GraphX (version 0.9.1). My input comes from an HDFS text file partitioned into 400 parts. When I run the code on a single part or a small number of files (like 20) the code runs fine. As soon as I try to read more files (more than 30) I'