Re: creating a distributed index

2015-07-15 Thread Ankur Dave
The latest version of IndexedRDD supports any key type with a defined serializer, including Strings. It's not released yet, but you can use it from the master branch i

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-22 Thread Ankur Dave
At 2015-01-22 02:06:37 -0800, NicolasC wrote: > I try to execute a simple program that runs the ShortestPaths algorithm > (org.apache.spark.graphx.lib.ShortestPaths) on a small grid graph. > I use Spark 1.2.0 downloaded from spark.apache.org. > > This program runs more than 2 hours when the grid s

Re: graph.inDegrees including zero values

2015-01-25 Thread Ankur Dave
You can do this using leftJoin, as collectNeighbors [1] does: graph.vertices.leftJoin(graph.inDegrees) { (vid, attr, inDegOpt) => inDegOpt.getOrElse(0) } [1] https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala#L145 Ankur On Sun, Jan 25, 2
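The truncated snippet above can be expanded into a self-contained sketch (here `graph` is assumed to be an existing `Graph[VD, ED]`):

```scala
import org.apache.spark.graphx._

// leftJoin keeps every vertex in graph.vertices and fills in 0 for
// vertices that have no entry in graph.inDegrees (i.e. no incoming edges).
val degrees: VertexRDD[Int] =
  graph.vertices.leftJoin(graph.inDegrees) { (vid, attr, inDegOpt) =>
    inDegOpt.getOrElse(0)
  }
```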

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-29 Thread Ankur Dave
Thanks for the reminder. I just created a PR: https://github.com/apache/spark/pull/4273 Ankur On Thu, Jan 29, 2015 at 7:25 AM, Jay Hutfles wrote: > Just curious, is this set to be merged at some point? - To unsubscribe, e-mail:

Re: Learning GraphX Questions

2015-02-13 Thread Ankur Dave
At 2015-02-13 12:19:46 -0800, Matthew Bucci wrote: > 1) How do you actually run programs in GraphX? At the moment I've been doing > everything live through the shell, but I'd obviously like to be able to work > on it by writing and running scripts. You can create your own projects that build agai

Re: Graphx gets slower as the iteration number increases

2015-03-24 Thread Ankur Dave
This might be because partitions are getting dropped from memory and needing to be recomputed. How much memory is in the cluster, and how large are the partitions? This information should be in the Executors and Storage pages in the web UI. Ankur On Tue, Mar 24, 2015 a

Re: [GraphX] aggregateMessages with active set

2015-04-07 Thread Ankur Dave
We thought it would be better to simplify the interface, since the active set is a performance optimization but the result is identical to calling subgraph before aggregateMessages. The active set option is still there in the package-private method aggregateMessagesWithActiveSet. You can actually

Re: [GraphX] aggregateMessages with active set

2015-04-09 Thread Ankur Dave
Actually, GraphX doesn't need to scan all the edges, because it maintains a clustered index on the source vertex id (that is, it sorts the edges by source vertex id and stores the offsets in a hash table). If the activeDirection is appropriately set, it can then jump only to the clusters with activ

Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Ankur Dave
I'm the primary author of IndexedRDD. To answer your questions: 1. Operations on an IndexedRDD partition can only be performed from a task operating on that partition, since doing otherwise would require decentralized coordination between workers, which is difficult in Spark. If you want to perfor

Re: How does GraphX stores the routing table?

2015-04-21 Thread Ankur Dave
On Tue, Apr 21, 2015 at 10:39 AM, mas wrote: > How does GraphX stores the routing table? Is it stored on the master node > or > chunks of the routing table is send to each partition that maintains the > record of vertices and edges at that node? > The latter: the routing table is stored alongsid

Re: Effecient way to fetch all records on a particular node/partition in GraphX

2015-05-17 Thread Ankur Dave
If you know the partition IDs, you can launch a job that runs tasks on only those partitions by calling sc.runJob. For example, we do this in IndexedRDD

Re: Bayes Net with Graphx?

2014-06-06 Thread Ankur Dave
Vertices can have arbitrary properties associated with them: http://spark.apache.org/docs/latest/graphx-programming-guide.html#the-property-graph Ankur

Re: Query on Merge Message (Graph: pregel operator)

2014-06-19 Thread Ankur Dave
Many merge operations can be broken up to work incrementally. For example, if the merge operation is to sum *n* rank updates, then you can set mergeMsg = (a, b) => a + b and this function will be applied to all *n* updates in arbitrary order to yield a final sum. Addition, multiplication, min, max,
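A minimal sketch of the point above: mergeMsg must be commutative and associative, because Pregel applies it to a vertex's incoming messages in arbitrary order.

```scala
// Summing rank updates: order of application doesn't matter.
val mergeMsg: (Double, Double) => Double = (a, b) => a + b

// math.min, math.max, and multiplication are also valid choices;
// an order-sensitive operation like string concatenation would not be.
```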

Re: BSP realization on Spark

2014-06-19 Thread Ankur Dave
Master/worker assignment and barrier synchronization are handled by Spark, so Pregel will automatically coordinate the computation, communication, and synchronization needed to run for the specified number of supersteps (or until there are no messages sent in a particular superstep). Ankur

Re: Graphx SubGraph

2014-06-24 Thread Ankur Dave
Yes, the subgraph operator takes a vertex predicate and keeps only the edges where both vertices satisfy the predicate, so it will work as long as you can express the sublist in terms of a vertex predicate. If that's not possible, you can still obtain the same effect, but you'll have to use lower-
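As a hedged sketch of the vertex-predicate approach (the attribute type and `keep` set are illustrative, assuming `graph: Graph[String, ED]`):

```scala
// Keep only vertices whose attribute is in the sublist. An edge survives
// only if BOTH of its endpoints satisfy the vertex predicate.
val keep = Set("a", "b", "c")
val sub = graph.subgraph(vpred = (id, attr) => keep.contains(attr))
```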

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-03 Thread Ankur Dave
A common reason for the "Joining ... is slow" message is that you're joining VertexRDDs without having cached them first. This will cause Spark to recompute unnecessarily, and as a side effect, the same index will get created twice and GraphX won't be able to do an efficient zip join. For example,
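The fix described above can be sketched as follows: cache the VertexRDD before joining, so its index is materialized once and a derived VertexRDD can share it for the fast zip join.

```scala
val verts = graph.vertices.cache()       // materialize the index once
val ranks = verts.mapValues(_ => 1.0)    // derived VertexRDD shares the index
val joined = verts.innerJoin(ranks) { (id, attr, rank) => (attr, rank) }
```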

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-03 Thread Ankur Dave
Oh, I just read your message more carefully and noticed that you're joining a regular RDD with a VertexRDD. In that case I'm not sure why the warning is occurring, but it might be worth caching both operands (graph.vertices and the regular RDD) just to be sure. Ankur

Re: Graphx traversal and merge interesting edges

2014-07-05 Thread Ankur Dave
Interesting problem! My understanding is that you want to (1) find paths matching a particular pattern, and (2) add edges between the start and end vertices of the matched paths. For (1), I implemented a pattern matcher for GraphX

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-05 Thread Ankur Dave
When joining two VertexRDDs with identical indexes, GraphX can use a fast code path (a zip join without any hash lookups). However, the check for identical indexes is performed using reference equality. Without caching, two copies of the index are created. Although the two indexes are structurally

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-06 Thread Ankur Dave
Well, the alternative is to do a deep equality check on the index arrays, which would be somewhat expensive since these are pretty large arrays (one element per vertex in the graph). But, in case the reference equality check fails, it actually might be a good idea to do the deep check before resort

Re: tiers of caching

2014-07-07 Thread Ankur Dave
I think tiers/priorities for caching are a very good idea and I'd be interested to see what others think. In addition to letting libraries cache RDDs liberally, it could also unify memory management across other parts of Spark. For example, small shuffles benefit from explicitly keeping the shuffle

Re: GraphX: how to specify partition strategy?

2014-07-10 Thread Ankur Dave
On Thu, Jul 10, 2014 at 8:20 AM, Yifan LI wrote: > > - how to "build the latest version of Spark from the master branch, which > contains a fix"? Instead of downloading a prebuilt Spark release from http://spark.apache.org/downloads.html, follow the instructions under "Development Version" on th

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread Ankur Dave
On Fri, Jul 11, 2014 at 2:23 PM, ShreyanshB wrote: > > -- Is it a correct way to load file to get best performance? Yes, edgeListFile should be efficient at loading the edges. -- What should be the partition size? =computing node or =cores? In general it should be a multiple of the number of

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread Ankur Dave
I don't think it should affect performance very much, because GraphX doesn't serialize ShippableVertexPartition in the "fast path" of mapReduceTriplets execution (instead it calls ShippableVertexPartition.shipVertexAttributes and serializes the result). I think it should only get serialized for spe

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread Ankur Dave
Spark just opens up inter-slave TCP connections for message passing during shuffles (I think the relevant code is in ConnectionManager). Since TCP automatically determines the optimal sending rate, Spark doesn't need any configu

Re: RACK_LOCAL Tasks Failed to finish

2014-07-14 Thread Ankur Dave
What GraphX application are you running? If it's a custom application that calls RDD.unpersist, that might cause RDDs to be recomputed. It's tricky to do unpersisting correctly, so you might try not unpersisting and see if that helps. Ankur

Re: "the default GraphX graph-partition strategy on multicore machine"?

2014-07-15 Thread Ankur Dave
On Jul 15, 2014, at 12:06 PM, Yifan LI wrote: > Btw, is there any possibility to customise the partition strategy as we > expect? I'm not sure I understand. Are you asking about defining a custom

Re: GraphX Pragel implementation

2014-07-17 Thread Ankur Dave
If your sendMsg function needs to know the incoming messages as well as the vertex value, you could define VD to be a tuple of the vertex value and the last received message. The vprog function would then store the incoming messages into the tuple, allowing sendMsg to access them. For example, if
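A hedged sketch of the tuple trick described above, assuming `graph: Graph[Double, Int]`: the vertex attribute becomes (value, lastMsg), so sendMsg can read the previous superstep's message via the triplet.

```scala
// Store the last received message alongside the vertex value.
val init: Graph[(Double, Double), Int] = graph.mapVertices((id, v) => (v, 0.0))

val result = init.pregel(0.0)(
  (id, attr, msg) => (attr._1, msg),                          // vprog: save the incoming message
  triplet => Iterator((triplet.dstId, triplet.srcAttr._2)),   // sendMsg: read last message
  (a, b) => a + b                                             // mergeMsg
)
```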

Re: "the default GraphX graph-partition strategy on multicore machine"?

2014-07-18 Thread Ankur Dave
On Fri, Jul 18, 2014 at 9:13 AM, Yifan LI wrote: > Yes, is possible to defining a custom partition strategy? Yes, you just need to create a subclass of PartitionStrategy as follows: import org.apache.spark.graphx._ object MyPartitionStrategy extends PartitionStrategy { override def getPartit
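A complete, hypothetical version of the snippet that is cut off above (the hash-by-source-id strategy is illustrative, not the one from the original mail):

```scala
import org.apache.spark.graphx._

// A custom strategy that co-locates all out-edges of a vertex by
// assigning each edge to a partition based on its source vertex id.
object MyPartitionStrategy extends PartitionStrategy {
  override def getPartition(src: VertexId, dst: VertexId,
                            numParts: PartitionID): PartitionID = {
    ((src % numParts + numParts) % numParts).toInt  // non-negative modulo
  }
}

// Usage: val partitioned = graph.partitionBy(MyPartitionStrategy)
```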

Re: "the default GraphX graph-partition strategy on multicore machine"?

2014-07-18 Thread Ankur Dave
Sorry, I didn't read your vertex replication example carefully, so my previous answer is wrong. Here's the correct one: On Fri, Jul 18, 2014 at 9:13 AM, Yifan LI wrote: > I don't understand, for instance, we have 3 edge partition tables(EA: a -> > b, a -> c; EB: a -> d, a -> e; EC: d -> c ), 2 v

Re: Graphx : Perfomance comparison over cluster

2014-07-18 Thread Ankur Dave
Thanks for your interest. I should point out that the numbers in the arXiv paper are from GraphX running on top of a custom version of Spark with an experimental in-memory shuffle prototype. As a result, if you benchmark GraphX at the current master, it's expected that it will be 2-3x slower than G

Re: Graphx : Perfomance comparison over cluster

2014-07-20 Thread Ankur Dave
On Fri, Jul 18, 2014 at 9:07 PM, ShreyanshB wrote: > > Does the suggested version with in-memory shuffle affects performance too > much? We've observed a 2-3x speedup from it, at least on larger graphs like twitter-2010 and uk-2007-05

Re: Question about initial message in graphx

2014-07-21 Thread Ankur Dave
On Mon, Jul 21, 2014 at 8:05 PM, Bin WU wrote: > I am not sure how to specify different initial values to each node in the > graph. Moreover, I am wondering why initial message is necessary. I think > we can instead initialize the graph and then pass it into Pregel interface? Indeed it's not ne

Re: the implications of some items in webUI

2014-07-22 Thread Ankur Dave
On Tue, Jul 22, 2014 at 7:08 AM, Yifan LI wrote: > 1) what is the difference between "Duration"(Stages -> Completed Stages) > and "Task Time"(Executors) ? Stages are composed of tasks that run on executors. Tasks within a stage may run concurrently, since there are multiple executors and each e

Re: Where is the "PowerGraph abstraction"

2014-07-23 Thread Ankur Dave
We removed the PowerGraph abstraction layer when merging GraphX into Spark to reduce the maintenance costs. You can still read the code

Re: GraphX Pragel implementation

2014-07-24 Thread Ankur Dave
On Thu, Jul 24, 2014 at 9:52 AM, Arun Kumar wrote: > While using pregel API for Iterations how to figure out which super step > the iteration currently in. The Pregel API doesn't currently expose this, but it's very straightforward to modify Pregel.scala

Re: VertexPartition and ShippableVertexPartition

2014-07-28 Thread Ankur Dave
On Mon, Jul 28, 2014 at 4:29 AM, Larry Xiao wrote: > On 7/28/14, 3:41 PM, shijiaxin wrote: > >> There is a VertexPartition in the EdgePartition,which is created by >> >> EdgePartitionBuilder.toEdgePartition. >> >> and There is also a ShippableVertexPartition in the VertexRDD. >> >> These two Part

Re: Spark streaming vs. spark usage

2014-07-28 Thread Ankur Dave
On Mon, Jul 28, 2014 at 12:53 PM, Nathan Kronenfeld < nkronenf...@oculusinfo.com> wrote: > But when done processing, one would still have to pull out the wrapped > object, knowing what it was, and I don't see how to do that. It's pretty tricky to get the level of type safety you're looking for.

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread Ankur Dave
Yifan LI writes: > Maybe you could get the vertex, for instance, which id is 80, by using: > > graph.vertices.filter{case(id, _) => id==80}.collect > > but I am not sure this is the exactly efficient way.(it will scan the whole > table? if it can not get benefit from index of VertexRDD table) Un

Re: the pregel operator of graphx throws NullPointerException

2014-07-29 Thread Ankur Dave
Denis RP writes: > [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to > stage failure: Task 6.0:4 failed 4 times, most recent failure: Exception > failure in TID 598 on host worker6.local: java.lang.NullPointerException > [error] scala.collection.Iterator$$anon$11.n

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread Ankur Dave
andy petrella writes: > Oh I was almost sure that lookup was optimized using the partition info It does use the partitioner to run only one task, but within that task it has to scan the entire partition: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDD

Re: the pregel operator of graphx throws NullPointerException

2014-07-29 Thread Ankur Dave
Denis RP writes: > I build it with sbt package, I run it with sbt run, and I do use > SparkConf.set for deployment options and external jars. It seems that > spark-submit can't load extra jars and will lead to noclassdeffounderror, > should I pack all the jars to a giant one and give it a try? Ye

Re: GraphX Connected Components

2014-07-30 Thread Ankur Dave
Jeffrey Picard writes: > As the program runs I’m seeing each iteration take longer and longer to > complete, this seems counter intuitive to me, especially since I am seeing > the shuffle read/write amounts decrease with each iteration. I would think > that as more and more vertices converged t

Re: GraphX Connected Components

2014-07-30 Thread Ankur Dave
Jeffrey Picard writes: > I tried unpersisting the edges and vertices of the graph by hand, then > persisting the graph with persist(StorageLevel.MEMORY_AND_DISK). I still see > the same behavior in connected components however, and the same thing you > described in the storage page. Unfortunately

Re: Graphx : Perfomance comparison over cluster

2014-07-30 Thread Ankur Dave
ShreyanshB writes: >> The version with in-memory shuffle is here: >> https://github.com/amplab/graphx2/commits/vldb. > > It'd be great if you can tell me how to configure and invoke this spark > version. Sorry for the delay on this. Assuming you're planning to launch an EC2 cluster, here's how t

Re: GraphX Connected Components

2014-07-30 Thread Ankur Dave
On Wed, Jul 30, 2014 at 11:32 PM, Jeffrey Picard wrote: > That worked! The entire thing ran in about an hour and a half, thanks! Great! > Is there by chance an easy way to build spark apps using the master branch > build of spark? I’ve been having to use the spark-shell. The easiest way is pro

Re: GraphX Pragel implementation

2014-07-31 Thread Ankur Dave
On Wed, Jul 30, 2014 at 04:55 PM, Arun Kumar wrote: > For my implementation to work the vprog function which is responsible for > handling in coming messages and the sendMsg function should be aware of > which super step they are in. > Is it possible to pass super step information in this methods

Re: configuration needed to run twitter(25GB) dataset

2014-07-31 Thread Ankur Dave
On Thu, Jul 31, 2014 at 08:28 PM, Jiaxin Shi wrote: > We have a 6-nodes cluster , each node has 64GB memory. > [...] > But it ran out of memory. I also try 2D and 1D partition. > > And I also try Giraph under the same configuration, and it runs for 10 > iterations , and then it ran out of memory a

Re: configuration needed to run twitter(25GB) dataset

2014-08-01 Thread Ankur Dave
At 2014-07-31 21:40:39 -0700, shijiaxin wrote: > Is it possible to reduce the number of edge partitions and exploit > parallelism fully at the same time? > For example, one partition per node, and the threads in the same node share > the same partition. It's theoretically possible to parallelize

Re: [GraphX] The best way to construct a graph

2014-08-01 Thread Ankur Dave
At 2014-08-01 11:23:49 +0800, Bin wrote: > I am wondering what is the best way to construct a graph? > > Say I have some attributes for each user, and specific weight for each user > pair. The way I am currently doing is first read user information and edge > triple into two arrays, then use sc.

Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Ankur Dave
Attempting to build Spark from source on EC2 using sbt gives the error "sbt.ResolveException: unresolved dependency: org.scala-lang#scala-library;2.10.2: not found". This only seems to happen on EC2, not on my local machine. To reproduce, launch a cluster using spark-ec2, clone the Spark reposi

Re: configuration needed to run twitter(25GB) dataset

2014-08-01 Thread Ankur Dave
At 2014-08-01 02:12:08 -0700, shijiaxin wrote: > When I use fewer partitions, (like 6) > It seems that all the task will be assigned to the same machine, because the > machine has more than 6 cores.But this will run out of memory. > How to set fewer partitions number and use all the machine at the

Re: creating a distributed index

2014-08-01 Thread Ankur Dave
At 2014-08-01 14:50:22 -0600, Philip Ogren wrote: > It seems that I could do this with mapPartition so that each element in a > partition gets added to an index for that partition. > [...] > Would it then be possible to take a string and query each partition's index > with it? Or better yet, take

Re: GraphX

2014-08-02 Thread Ankur Dave
At 2014-08-02 21:29:33 +0530, Deep Pradhan wrote: > How should I run graphx codes? At the moment it's a little more complicated to run the GraphX algorithms than the Spark examples due to SPARK-1986 [1]. There is a driver program in org.apache.spark.graphx.lib.Analytics which you can invoke usi

Re: [GraphX] how to compute only a subset of vertices in the whole graph?

2014-08-02 Thread Ankur Dave
At 2014-08-02 19:04:22 +0200, Yifan LI wrote: > But I am thinking of if I can compute only some selected vertexes(hubs), not > to do "update" on every vertex… > > is it possible to do this using Pregel API? The Pregel API already only runs vprog on vertices that received messages in the previou

Re: GraphX runs without Spark?

2014-08-03 Thread Ankur Dave
At 2014-08-03 13:14:52 +0530, Deep Pradhan wrote: > I have a single node cluster on which I have Spark running. I ran some > graphx codes on some data set. Now when I stop all the workers in the > cluster (sbin/stop-all.sh), the codes still run and gives the answers. Why > is it so? I mean does gr

Re: [GraphX] How spark parameters relate to Pregel implementation

2014-08-04 Thread Ankur Dave
At 2014-08-04 20:52:26 +0800, Bin wrote: > I wonder how spark parameters, e.g., number of paralellism, affect Pregel > performance? Specifically, sendmessage, mergemessage, and vertexprogram? > > I have tried label propagation on a 300,000 edges graph, and I found that no > paralellism is much f

Re: GraphX Pagerank application

2014-08-15 Thread Ankur Dave
On Wed, Aug 6, 2014 at 11:37 AM, AlexanderRiggers < alexander.rigg...@gmail.com> wrote: > To perform the page rank I have to create a graph object, adding the edges > by setting sourceID=id and distID=brand. In GraphLab there is function: g = > SGraph().add_edges(data, src_field='id', dst_field='b

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-08-18 Thread Ankur Dave
On Mon, Aug 18, 2014 at 6:29 AM, Yifan LI wrote: > I am testing our application(similar to "personalised page rank" using > Pregel, and note that each vertex property will need pretty much more space > to store after new iteration) [...] But when we ran it on larger graph(e.g. LiveJouranl), it

Re: noob: how to extract different members of a VertexRDD

2014-08-19 Thread Ankur Dave
(+user) On Tue, Aug 19, 2014 at 12:05 PM, spr wrote: > I want to assign each vertex to a community with the name of the vertex. As I understand it, you want to set the vertex attributes of a graph to the corresponding vertex ids. You can do this using Graph#mapVertices [1] as follows: val
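The suggestion above can be sketched in one line: replace each vertex attribute with the vertex's own id, so each vertex starts in a community named after itself.

```scala
// Assuming graph: Graph[VD, ED]; the new attribute is the vertex id.
val labeled: Graph[VertexId, ED] = graph.mapVertices((id, _) => id)
```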

Re: noob: how to extract different members of a VertexRDD

2014-08-19 Thread Ankur Dave
At 2014-08-19 12:47:16 -0700, spr wrote: > One follow-up question. If I just wanted to get those values into a vanilla > variable (not a VertexRDD or Graph or ...) so I could easily look at them in > the REPL, what would I do? Are the aggregate data structures inside the > VertexRDD/Graph/... Ar

Re: Personalized Page rank in graphx

2014-08-20 Thread Ankur Dave
At 2014-08-20 10:57:57 -0700, Mohit Singh wrote: > I was wondering if Personalized Page Rank algorithm is implemented in graphx. > If the talks and presentation were to be believed > (https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf) > it is.. but cant find

Re: GraphX question about graph traversal

2014-08-20 Thread Ankur Dave
At 2014-08-20 10:34:50 -0700, Cesar Arevalo wrote: > I would like to get the type B vertices that are connected through type A > vertices where the edges have a score greater than 5. So, from the example > above I would like to get V1 and V4. It sounds like you're trying to find paths in the grap

Re: The running time of spark

2014-08-23 Thread Ankur Dave
At 2014-08-23 08:33:48 -0700, Denis RP wrote: > Bottleneck seems to be I/O, the CPU usage ranges 10%~15% most time per VM. > The caching is maintained by pregel, should be reliable. Storage level is > MEMORY_AND_DISK_SER. I'd suggest trying the DISK_ONLY storage level and possibly increasing the

Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-25 Thread Ankur Dave
At 2014-08-25 06:41:36 -0700, BertrandR wrote: > Unfortunately, this works well for extremely small graphs, but it becomes > exponentially slow with the size of the graph and the number of iterations > (doesn't finish 20 iterations with graphs having 48000 edges). > [...] > It seems to me that a

Re: GraphX usecases

2014-08-25 Thread Ankur Dave
At 2014-08-25 11:23:37 -0700, Sunita Arvind wrote: > Does this "We introduce GraphX, which combines the advantages of both > data-parallel and graph-parallel systems by efficiently expressing graph > computation within the Spark data-parallel framework. We leverage new ideas > in distributed graph

Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-26 Thread Ankur Dave
At 2014-08-26 01:20:09 -0700, BertrandR wrote: > I actually tried without unpersisting, but given the performance I tryed to > add these in order to free the memory. After your anwser I tried to remove > them again, but without any change in the execution time... This is probably a related issue

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-03 Thread Ankur Dave
At 2014-09-03 17:58:09 +0200, Yifan LI wrote: > val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = > numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK) > > Error: java.lang.UnsupportedOperationException: Cannot change storage l

Re: spark 1.1.0 requested array size exceed vm limits

2014-09-05 Thread Ankur Dave
At 2014-09-05 21:40:51 +0800, marylucy wrote: > But running graphx edgeFileList ,some tasks failed > error:requested array size exceed vm limits Try passing a higher value for minEdgePartitions when calling GraphLoader.edgeListFile. Ankur --
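As a sketch, loading with more (and therefore smaller) edge partitions keeps each partition's internal arrays under the JVM's maximum array size. The path and partition count here are illustrative.

```scala
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs:///edges.txt",
  minEdgePartitions = 128)
```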

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-08 Thread Ankur Dave
At 2014-09-05 12:13:18 +0200, Yifan LI wrote: > But how to assign the storage level to a new vertices RDD that mapped from > an existing vertices RDD, > e.g. > *val newVertexRDD = > graph.collectNeighborIds(EdgeDirection.Out).map{case(id:VertexId, > a:Array[VertexId]) => (id, initialHashMap(a))}*

Re: java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2

2014-09-10 Thread Ankur Dave
On Wed, Sep 10, 2014 at 2:00 PM, Jeffrey Picard wrote: > After rebuilding from the master branch this morning, I’ve started to see > these errors that I’ve never gotten before while running connected > components. Anyone seen this before? > [...] > at > org.apache.spark.shuffle.sort.SortS

Re: vertex active/inactive feature in Pregel API ?

2014-09-15 Thread Ankur Dave
At 2014-09-15 16:25:04 +0200, Yifan LI wrote: > I am wondering if the vertex active/inactive(corresponding the change of its > value between two supersteps) feature is introduced in Pregel API of GraphX? Vertex activeness in Pregel is controlled by messages: if a vertex did not receive a messag

Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 10:55:37 +0200, Yifan LI wrote: > - from [1], and my understanding, the existing inactive feature in graphx > pregel api is “if there is no in-edges, from active vertex, to this vertex, > then we will say this one is inactive”, right? Well, that's true when messages are only sent

Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 12:23:10 +0200, Yifan LI wrote: > but I am wondering if there is a message(none?) sent to the target vertex(the > rank change is less than tolerance) in below dynamic page rank implementation, > > def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = { > if (edge.src

Re: how to group within the messages at a vertex?

2014-09-17 Thread Ankur Dave
At 2014-09-17 11:39:19 -0700, spr wrote: > I'm trying to implement label propagation in GraphX. The core step of that > algorithm is > > - for each vertex, find the most frequent label among its neighbors and set > its label to that. > > [...] > > It seems on the "broken" line above, I don't want

Re: java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2

2014-09-22 Thread Ankur Dave
I diagnosed this problem today and found that it's because the GraphX custom serializers make an assumption that is violated by sort-based shuffle. I filed SPARK-3649 explaining the problem and submitted a PR to fix it [2]. The fix removes the custom serializers, which has a 10% performance pena

Re: Pregel messages serialized in local machine?

2014-09-25 Thread Ankur Dave
At 2014-09-25 06:52:46 -0700, Cheuk Lam wrote: > This is a question on using the Pregel function in GraphX. Does a message > get serialized and then de-serialized in the scenario where both the source > and the destination vertices are in the same compute node/machine? Yes, message passing curre

Re: GraphX Java API Timeline

2014-10-02 Thread Ankur Dave
Yes, I'm working on a Java API for Spark 1.2. Here's the issue to track progress: https://issues.apache.org/jira/browse/SPARK-3665 Ankur On Thu, Oct 2, 2014 at 11:10 AM, Adams, Jeremiah wrote: > Are there any plans to create a java api for GraphX? If so, what is the

Re: How to construct graph in graphx

2014-10-13 Thread Ankur Dave
At 2014-10-13 18:22:44 -0400, Soumitra Siddharth Johri wrote: > I have a flat tab separated file like below: > > [...] > > where n1,n2,n3,n4 are the nodes of the graph and R1,P2,P3 are the > properties which should form the edges between the nodes. > > How can I construct a graph from the above f

Re: How to construct graph in graphx

2014-10-13 Thread Ankur Dave
At 2014-10-13 21:08:15 -0400, Soumitra Johri wrote: > There is no 'long' field in my file. So when I form the edge I get a type > mismatch error. Is it mandatory for GraphX that every vertex should have a > distinct id. ? > > in my case n1,n2,n3,n4 are all strings. (+user so others can see the s
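GraphX does require distinct Long vertex ids. One hedged sketch for string-named vertices: assign each distinct name a Long id with zipWithUniqueId, then build the edges against that mapping. `lines` is assumed to be an `RDD[(String, String)]` of (srcName, dstName) pairs.

```scala
import org.apache.spark.graphx._

val ids = lines.flatMap { case (s, d) => Seq(s, d) }
  .distinct()
  .zipWithUniqueId()                       // (name, uniqueLongId)

val edges = lines
  .join(ids)                               // (srcName, (dstName, srcId))
  .map { case (_, (d, sid)) => (d, sid) }
  .join(ids)                               // (dstName, (srcId, dstId))
  .map { case (_, (sid, did)) => Edge(sid, did, 1) }

// Vertex attribute is the original string name.
val graph = Graph(ids.map(_.swap), edges)
```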

Re: graphx - mutable?

2014-10-14 Thread Ankur Dave
On Tue, Oct 14, 2014 at 12:36 PM, ll wrote: > hi again. just want to check in again to see if anyone could advise on how > to implement a "mutable, growing graph" with graphx? > > we're building a graph is growing over time. it adds more vertices and > edges every iteration of our algorithm. >

Re: graphx - mutable?

2014-10-14 Thread Ankur Dave
On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh wrote: > a related question, what is the best way to update the values of existing > vertices and edges? > Many of the Graph methods deal with updating the existing values in bulk, including mapVertices, mapEdges, mapTriplets, mapReduceTriplets, and out
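Two of the bulk-update methods mentioned above, sketched with assumed types `graph: Graph[Double, Int]` and `updates: RDD[(VertexId, Double)]`:

```scala
// Transform every vertex value in place.
val bumped = graph.mapVertices((id, v) => v + 1.0)

// Merge in new values from an external RDD, keeping the old value
// for vertices that have no update.
val merged = graph.outerJoinVertices(updates) { (id, oldV, newOpt) =>
  newOpt.getOrElse(oldV)
}
```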

Re: Workaround for SPARK-1931 not compiling

2014-10-24 Thread Ankur Dave
At 2014-10-23 09:48:55 +0530, Arpit Kumar wrote: > error: value partitionBy is not a member of > org.apache.spark.rdd.RDD[(org.apache.spark.graphx.PartitionID, > org.apache.spark.graphx.Edge[ED])] Since partitionBy is a member of PairRDDFunctions, it sounds like the implicit conversion from RDD

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Ankur Dave
At 2014-10-25 08:56:34 +0530, Arpit Kumar wrote: > GraphLoader1.scala:49: error: class EdgePartitionBuilder in package impl > cannot be accessed in package org.apache.spark.graphx.impl > [INFO] val builder = new EdgePartitionBuilder[Int, Int] Here's a workaround: 1. Copy and modify the Gra

Re: GraphX StackOverflowError

2014-10-28 Thread Ankur Dave
At 2014-10-28 16:27:20 +0300, Zuhair Khayyat wrote: > I am using connected components function of GraphX (on Spark 1.0.2) on some > graph. However for some reason the fails with StackOverflowError. The graph > is not too big; it contains 1 vertices and 50 edges. > > [...] > 14/10/28 16:08:

Re: How to correctly extimate the number of partition of a graph in GraphX

2014-11-01 Thread Ankur Dave
How large is your graph, and how much memory does your cluster have? We don't have a good way to determine the *optimal* number of partitions aside from trial and error, but to get the job to at least run to completion, it might help to use the MEMORY_AND_DISK storage level and a large number of p

Re: To find distances to reachable source vertices using GraphX

2014-11-03 Thread Ankur Dave
The NullPointerException seems to be because edge.dstAttr is null, which might be due to SPARK-3936. Until that's fixed, I edited the Gist with a workaround. Does that fix the problem? Ankur On Mon, Nov 3, 2014 at 12:2

Re: Fwd: Executor Lost Failure

2014-11-10 Thread Ankur Dave
At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh wrote: > Tasks are now getting submitted, but many tasks don't happen. > Like, after opening the spark-shell, I load a text file from disk and try > printing its contentsas: > >>sc.textFile("/path/to/file").foreach(println) > > It does not give me

Re: inconsistent edge counts in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-11 01:51:43 +, "Buttler, David" wrote: > I am building a graph from a large CSV file. Each record contains a couple > of nodes and about 10 edges. When I try to load a large portion of the > graph, using multiple partitions, I get inconsistent results in the number of > edges b

Re: GraphX / PageRank with edge weights

2014-11-18 Thread Ankur Dave
At 2014-11-13 21:28:52 +, "Ommen, Jurgen" wrote: > I'm using GraphX and playing around with its PageRank algorithm. However, I > can't see from the documentation how to use edge weight when running PageRank. > Is it possible to consider edge weights and how would I do it? There's no built-

Re: Pagerank implementation

2014-11-18 Thread Ankur Dave
At 2014-11-15 18:01:22 -0700, tom85 wrote: > This line: val newPR = oldPR + (1.0 - resetProb) * msgSum > makes no sense to me. Should it not be: > val newPR = resetProb/graph.vertices.count() + (1.0 - resetProb) * msgSum > ? This is an unusual version of PageRank where the messages being passed
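For comparison, the textbook (non-delta) per-iteration update is newPR = resetProb + (1 - resetProb) * msgSum, where msgSum sums oldPR(src) / outDegree(src) over a vertex's in-edges. A pure-Scala sketch of that standard formulation on a tiny graph (this is not GraphX's delta-based code, just the plain formula):

```scala
object PageRankSketch {
  // Standard synchronous PageRank on a small directed graph.
  // Vertices are 0 until n; outEdges maps each source to its out-neighbors.
  def pagerank(outEdges: Map[Int, Seq[Int]], n: Int, iters: Int, resetProb: Double = 0.15): Vector[Double] = {
    var ranks = Vector.fill(n)(1.0)
    for (_ <- 0 until iters) {
      val msgs = Array.fill(n)(0.0)
      // Each source contributes rank/outDegree to every out-neighbor.
      for ((src, dsts) <- outEdges; dst <- dsts)
        msgs(dst) += ranks(src) / dsts.size
      // Textbook update: newPR = resetProb + (1 - resetProb) * msgSum
      ranks = Vector.tabulate(n)(v => resetProb + (1 - resetProb) * msgs(v))
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    // 3-cycle: by symmetry every vertex converges to rank 1.0
    println(pagerank(Map(0 -> Seq(1), 1 -> Seq(2), 2 -> Seq(0)), n = 3, iters = 50))
  }
}
```

The delta form discussed in the thread accumulates (1 - resetProb) * msgSum onto the old rank instead, which converges to the same fixed point when messages carry rank changes rather than full ranks.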

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-17 14:47:50 +0530, Deep Pradhan wrote: > I was going through the graphx section in the Spark API in > https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$ > > Here, I find the word "landmark". Can anyone explain to me what is landmark > me

Re: Running PageRank in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 12:02:52 +0530, Deep Pradhan wrote: > I just ran the PageRank code in GraphX with some sample data. What I am > seeing is that the total rank changes drastically if I change the number of > iterations from 10 to 100. Why is that so? As far as I understand, the total rank should asym

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-18 14:59:20 +0530, Deep Pradhan wrote: > So "landmark" can contain just one vertex right? Right. > Which algorithm has been used to compute the shortest path? It's distributed Bellman-Ford. Ankur
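The fixed-point idea behind that Bellman-Ford computation can be sketched in plain Scala: every round, each vertex adopts the best distance its neighbors offer, until nothing changes. This is a single-landmark, hop-count version run locally; GraphX's actual ShortestPaths runs the same iteration via Pregel across partitions:

```scala
object BellmanFordSketch {
  // Hop counts from every vertex to one landmark over a directed edge list.
  // Unreachable vertices keep Int.MaxValue.
  def shortestHops(edges: Seq[(Int, Int)], n: Int, landmark: Int): Vector[Int] = {
    val INF = Int.MaxValue
    var dist = Vector.tabulate(n)(v => if (v == landmark) 0 else INF)
    var changed = true
    while (changed) {
      changed = false
      val next = dist.toArray
      // src can reach the landmark in dist(dst) + 1 hops via edge src -> dst.
      for ((src, dst) <- edges if dist(dst) != INF && dist(dst) + 1 < next(src)) {
        next(src) = dist(dst) + 1
        changed = true
      }
      dist = next.toVector
    }
    dist
  }

  def main(args: Array[String]): Unit = {
    // Path 0 -> 1 -> 2 -> 3 with landmark 3: distances are 3, 2, 1, 0
    println(shortestHops(Seq((0, 1), (1, 2), (2, 3)), n = 4, landmark = 3))
  }
}
```

Termination mirrors Pregel's: iteration stops once a round produces no improved distance, i.e. every vertex has reached its fixed point.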

Re: New Codes in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 14:51:54 +0530, Deep Pradhan wrote: > I am using Spark-1.0.0. There are two GraphX directories that I can see here > > 1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx > which contains LiveJournalPageRank.scala > > 2. spark-1.0.0/graphx/src/main/scala/o

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:29:08 +0530, Deep Pradhan wrote: > Does Bellman-Ford give the best solution? It gives the same solution as any other algorithm, since there's only one correct solution for shortest paths and it's guaranteed to find it eventually. There are probably faster distributed algorithms

Re: New Codes in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:35:13 +0530, Deep Pradhan wrote: > Now, how do I run the LiveJournalPageRank.scala that is there in 1? I think it should work to use MASTER=local[*] $SPARK_HOME/bin/run-example graphx.LiveJournalPageRank /edge-list-file.txt --numEPart=8 --numIter=10 --partStrategy=EdgePart

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:44:31 +0530, Deep Pradhan wrote: > I meant to ask whether it gives the solution faster than other algorithms. No, it's just that it's much simpler and easier to implement than the others. Section 5.2 of the Pregel paper [1] justifies using it for a graph (a binary tree) with 1

Re: New Codes in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:51:52 +0530, Deep Pradhan wrote: > Yes the above command works, but there is this problem. Most of the time, > the total rank is NaN (Not a Number). Why is it so? I've also seen this, but I'm not sure why it happens. If you could find out which vertices are getting the NaN rank
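One way to narrow this down: a single NaN poisons any sum it participates in, so a NaN total almost always means some individual vertex rank went NaN first. A pure-Scala sketch of the check; the GraphX equivalent would be something like filtering graph.vertices on r.isNaN (hypothetical usage, adjust to your rank type):

```scala
object NaNRankCheck {
  // A single NaN in the inputs makes the whole sum NaN.
  def totalRank(ranks: Map[Long, Double]): Double = ranks.values.sum

  // Narrow the problem down to the offending vertex ids.
  def findNaNVertices(ranks: Map[Long, Double]): Set[Long] =
    ranks.collect { case (vid, r) if r.isNaN => vid }.toSet

  def main(args: Array[String]): Unit = {
    val ranks = Map(1L -> 0.3, 2L -> Double.NaN, 3L -> 0.5)
    println(totalRank(ranks).isNaN)  // the total is NaN...
    println(findNaNVertices(ranks))  // ...because vertex 2's rank is NaN
  }
}
```

Collecting the NaN vertices (rather than just summing) shows where the bad values originate, which is the diagnostic step suggested in the reply above.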

Re: how to force graphx to execute transfomtation

2014-11-26 Thread Ankur Dave
At 2014-11-26 05:25:10 -0800, Hlib Mykhailenko wrote: > I work with Graphx. When I call "graph.partitionBy(..)" nothing happens, > because, as I understand it, all transformations are lazy and partitionBy is > built using transformations. > Is there a way to force Spark to actually execute
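That lazy-plan behavior can be mimicked with plain Scala views: map on a view only records the operation, and nothing runs until something forces the result, analogous to calling an action such as count() on the partitioned graph's edges in Spark. This is an analogy, not Spark code:

```scala
object LazyDemo {
  // Returns (side-effect count before forcing, count after forcing, forced result).
  def run(): (Int, Int, List[Int]) = {
    var executed = 0
    // Like a Spark transformation, this only records the mapping...
    val planned = (1 to 3).view.map { x => executed += 1; x * 2 }
    val before = executed  // still 0: nothing has run yet
    // ...until something forces it, like an action (count/collect) in Spark.
    val result = planned.toList
    (before, executed, result)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The counter stays at zero after the map is declared and only reaches three once toList forces evaluation, which is why an action is needed before partitionBy's effect becomes observable.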
