Hi,
I don't think enforcing a join strategy by default is the best option,
since you can't assume what the user did before calling the Gelly
functions or what the data looks like (maybe it's one of the 1% of
graphs where the relation is the other way around, or the vertex data
set is very large). The data sets may also already be sorted or
partitioned.
Another solution could be to overload the Gelly functions that use joins
and let users decide whether or not to give hints.
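
A minimal sketch of the overloading idea, assuming plain Java collections
as stand-ins for Flink DataSets so that it runs without a cluster. The
class name, the edgesWithSources helper and the local JoinHint enum are
hypothetical; the enum only mirrors three values of Flink's
JoinOperatorBase.JoinHint. It is not the actual Gelly API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class JoinHintOverloadSketch {

    /** Hypothetical stand-in mirroring three values of Flink's JoinOperatorBase.JoinHint. */
    public enum JoinHint { OPTIMIZER_CHOOSES, BROADCAST_HASH_FIRST, BROADCAST_HASH_SECOND }

    /** Hint-free signature: the strategy choice stays with the optimizer. */
    public static List<String> edgesWithSources(List<long[]> edges, Map<Long, String> vertices) {
        return edgesWithSources(edges, vertices, JoinHint.OPTIMIZER_CHOOSES);
    }

    /** Overload that lets the caller pass a hint explicitly. */
    public static List<String> edgesWithSources(
            List<long[]> edges, Map<Long, String> vertices, JoinHint hint) {
        // With Flink this would be roughly:
        //   edges.join(vertices, hint).where(0).equalTo(0)
        // This in-memory stand-in ignores the hint: a hint selects the
        // execution strategy, it does not change the join result.
        List<String> joined = new ArrayList<>();
        for (long[] edge : edges) {
            String sourceValue = vertices.get(edge[0]); // look up the source vertex
            if (sourceValue != null) {
                joined.add(edge[0] + "->" + edge[1] + ":" + sourceValue);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Long, String> vertices = new HashMap<>();
        vertices.put(1L, "a");
        vertices.put(2L, "b");
        List<long[]> edges = new ArrayList<>();
        edges.add(new long[] {1L, 2L});
        edges.add(new long[] {2L, 1L});

        // Default behaviour, and the same call with an explicit hint.
        System.out.println(edgesWithSources(edges, vertices));
        System.out.println(edgesWithSources(edges, vertices, JoinHint.BROADCAST_HASH_SECOND));
    }
}
```

Existing callers keep the hint-free signature and the optimizer's choice;
only users who know their data opt in to a specific strategy.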
As an example, I am currently benchmarking graphs with up to 700M
vertices and 3B edges on a YARN cluster, and at one point in the job I
need to join vertices and edges. I tried giving the
broadcast-hash-second (vertices) hint, and the job ran significantly
slower than when I let the system decide.
Best,
Martin
On 22.08.2015 09:51, Andra Lungu wrote:
Hey everyone,
When coding for my thesis, I observed that half of the current Gelly
functions (the ones that use join operators) fail in a cluster environment
with the following exception:
java.lang.IllegalArgumentException: Too few memory segments provided. Hash Join needs at least 33 memory segments.
This is because, in 99% of the cases, the vertex data set is significantly
smaller than the edge data set. What I did to get rid of the error was the
following:
DataSet<Tuple2<Edge<K, EV>, Vertex<K, VV>>> edgesWithSources = edges
    .join(this.vertices, JoinOperatorBase.JoinHint.BROADCAST_HASH_SECOND)
    .where(0)
    .equalTo(0);
In short, I added join hints. I believe this should also be in Gelly, in
case someone bumps into the same problem somewhere in the future.
What do you think?