Re: Join hints for the Gelly functions

Martin Junghanns Sat, 22 Aug 2015 02:31:46 -0700

Hi,

I guess enforcing a join strategy by default is not the best option, since you can't assume what the user did before calling the Gelly functions or what the data looks like (maybe it's one of the 1% of graphs where the relation is the other way around, or the vertex data set is very large); the datasets may also already be sorted or partitioned. Another solution could be to overload the Gelly functions that use joins and let users decide whether to give hints.
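The overloading idea could look roughly like the sketch below. It uses stand-in types only: `SketchGraph`, `edgesWithSources` as a method name, and the `lastHintUsed` field are hypothetical, the local `JoinHint` enum merely mirrors two values of Flink's `JoinOperatorBase.JoinHint`, and the join itself is a plain in-memory lookup rather than a Flink join.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for Flink's JoinOperatorBase.JoinHint; only two values are modelled. */
enum JoinHint { OPTIMIZER_CHOOSES, BROADCAST_HASH_SECOND }

/** Hypothetical, heavily simplified graph; NOT the real Gelly Graph class. */
class SketchGraph {
    private final Map<Long, String> vertices; // vertex id -> vertex value
    private final List<long[]> edges;         // each edge is {sourceId, targetId}
    JoinHint lastHintUsed;                    // recorded only to make the sketch observable

    SketchGraph(Map<Long, String> vertices, List<long[]> edges) {
        this.vertices = new HashMap<>(vertices);
        this.edges = new ArrayList<>(edges);
    }

    /** Existing-style signature: no hint, so the optimizer keeps deciding. */
    List<String> edgesWithSources() {
        return edgesWithSources(JoinHint.OPTIMIZER_CHOOSES);
    }

    /** Overload: the caller opts in to a specific join hint. */
    List<String> edgesWithSources(JoinHint hint) {
        lastHintUsed = hint; // a real implementation would pass the hint to join(...)
        List<String> joined = new ArrayList<>();
        for (long[] edge : edges) {
            String sourceValue = vertices.get(edge[0]);
            if (sourceValue != null) {
                joined.add(sourceValue + "->" + edge[1]);
            }
        }
        return joined;
    }
}
```

With this shape, existing callers are unaffected, and only users who know their data would pass something like `JoinHint.BROADCAST_HASH_SECOND`.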

As an example, I am currently benchmarking graphs with up to 700M vertices and 3B edges on a YARN cluster, and at one point in the job I need to join vertices and edges. I also tried giving the broadcast-hash-second (vertices) hint, and the job performed significantly slower than when letting the system decide.


Best,
Martin

On 22.08.2015 09:51, Andra Lungu wrote:

Hey everyone,

When coding for my thesis, I observed that half of the current Gelly
functions (the ones that use join operators) fail in a cluster environment
with the following exception:

java.lang.IllegalArgumentException: Too few memory segments provided. Hash Join needs at least 33 memory segments.

This is because, in 99% of the cases, the vertex data set is significantly
smaller than the edge data set. What I did to get rid of the error was the
following:

DataSet<Tuple2<Edge<K, EV>, Vertex<K, VV>>> edgesWithSources = edges
       .join(this.vertices, JoinOperatorBase.JoinHint.BROADCAST_HASH_SECOND)
       .where(0).equalTo(0)

In short, I added join hints. I believe this should also be in Gelly, in
case someone bumps into the same problem somewhere in the future.

What do you think?
