Hey, I agree with Martin on this. It's the optimizer's job to decide the join strategy.
Maybe the join hint worked in 99% of your cases, but we can't simply
generalize this to all datasets and algorithms and hard-code a join hint
that assumes the vertex set is always much smaller than the edge set.

Cheers,
Vasia.

On 22 August 2015 at 11:28, Martin Junghanns <m.jungha...@mailbox.org> wrote:

> Hi,
>
> I guess enforcing a join strategy by default is not the best option, since
> you can't assume what the user did before actually calling the Gelly
> functions or what the data looks like (maybe it's one of the 1% of graphs
> where the relation is the other way around, or the vertex data set is very
> large); maybe the data sets are already sorted / partitioned. Another
> solution could be overloading the Gelly functions that use joins and
> letting the users decide whether to give hints or not.
>
> As an example, I am currently benchmarking graphs with up to 700M vertices
> and 3B edges on a YARN cluster, and at one point in the job I need to join
> vertices and edges. I also tried to give the broadcast-hash-second
> (vertices) hint, and the job performed significantly slower than when
> letting the system decide.
>
> Best,
> Martin
>
>
> On 22.08.2015 09:51, Andra Lungu wrote:
>
>> Hey everyone,
>>
>> When coding for my thesis, I observed that half of the current Gelly
>> functions (the ones that use join operators) fail in a cluster
>> environment with the following exception:
>>
>> java.lang.IllegalArgumentException: Too few memory segments provided.
>> Hash Join needs at least 33 memory segments.
>>
>> This is because, in 99% of the cases, the vertex data set is
>> significantly smaller than the edge data set. What I did to get rid of
>> the error was the following:
>>
>> DataSet<Tuple2<Edge<K, EV>, Vertex<K, VV>>> edgesWithSources = edges
>>     .join(this.vertices, JoinOperatorBase.JoinHint.BROADCAST_HASH_SECOND)
>>     .where(0).equalTo(0);
>>
>> In short, I added join hints. I believe this should also be in Gelly, in
>> case someone bumps into the same problem somewhere in the future.
>>
>> What do you think?
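
To make the overloading idea concrete, here is a rough sketch of what such
an overloaded helper could look like. The class and method names
(EdgeSourceJoin, joinEdgesWithSources) are hypothetical, not actual Gelly
API; the sketch only assumes Flink's DataSet API and the JoinHint enum from
JoinOperatorBase:

    import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.graph.Edge;
    import org.apache.flink.graph.Vertex;

    public class EdgeSourceJoin {

        // Default behaviour: no hint, the optimizer picks the join strategy.
        public static <K, VV, EV> DataSet<Tuple2<Edge<K, EV>, Vertex<K, VV>>>
                joinEdgesWithSources(DataSet<Edge<K, EV>> edges,
                                     DataSet<Vertex<K, VV>> vertices) {
            return joinEdgesWithSources(edges, vertices, JoinHint.OPTIMIZER_CHOOSES);
        }

        // Overload: callers who know their data (e.g. a small vertex set) can
        // pass a hint such as JoinHint.BROADCAST_HASH_SECOND themselves.
        public static <K, VV, EV> DataSet<Tuple2<Edge<K, EV>, Vertex<K, VV>>>
                joinEdgesWithSources(DataSet<Edge<K, EV>> edges,
                                     DataSet<Vertex<K, VV>> vertices,
                                     JoinHint hint) {
            return edges.join(vertices, hint)
                    .where(0)    // edge source id
                    .equalTo(0); // vertex id
        }
    }

The no-hint overload keeps today's behaviour, while a user who has
benchmarked both strategies, as Martin did, can opt in to an explicit hint
without Gelly hard-coding one for everybody.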