This is an interesting issue because, quite frankly, the join hint you passed simply reversed the sides of the join. The algorithm is still the same and has the same minimum memory requirements.
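To make that concrete, here is a minimal sketch (plain DataSet API; `edges` and `vertices` are placeholder inputs joined on field 0, using the same JoinHint enum as in your snippet below): the two broadcast variants run the same hash join and differ only in which input is broadcast and used as the build side.

// assumption: edges and vertices are DataSets keyed on the vertex id in field 0
// import org.apache.flink.api.common.operators.base.JoinOperatorBase;

// hash table built from the second input (the vertices), which is broadcast:
edges.join(vertices, JoinOperatorBase.JoinHint.BROADCAST_HASH_SECOND)
        .where(0).equalTo(0);

// same hash join algorithm with the sides reversed: the hash table is now
// built from the first input (the edges) instead:
edges.join(vertices, JoinOperatorBase.JoinHint.BROADCAST_HASH_FIRST)
        .where(0).equalTo(0);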
The fact that it made a difference is quite curious. The only thing I can imagine is that the hint changed not just this one operator, but that more operators changed as part of the holistic planning, and one memory consumer was eliminated. Other than a bug, of course ;-)

33 memory segments are, by the way, actually little more than 1 MiByte. If the join does not have that much memory, something else is probably amiss.

BTW: The optimizer's choice for the join in this case is probably very simple: make the loop-invariant part the hash table build side. I guess that is the right thing in almost all cases.

On Sat, Aug 22, 2015 at 12:57 PM, Andra Lungu <lungu.an...@gmail.com> wrote:

> Your arguments are perfectly valid. So, what I suggest is to keep the
> functions as they are now, e.g. groupReduceOnNeighbors, and to add a
> groupReduceOnNeighbors(blablaSameArguments, boolean useJoinHints). That
> way, the user can decide whether they'd like to trade speed for a program
> that actually finishes :).
>
> On Sat, Aug 22, 2015 at 11:28 AM, Martin Junghanns <m.jungha...@mailbox.org>
> wrote:
>
> > Hi,
> >
> > I guess enforcing a join strategy by default is not the best option,
> > since you can't assume what the user did before actually calling the
> > Gelly functions or what the data looks like (maybe it's one of the 1% of
> > graphs where the relation is the other way around, or the vertex data
> > set is very large); maybe the data sets are already sorted /
> > partitioned. Another solution could be overloading the Gelly functions
> > that use joins and letting the users decide whether to give hints or not.
> >
> > As an example, I am currently benchmarking graphs with up to 700M
> > vertices and 3B edges on a YARN cluster, and at one point in the job I
> > need to join vertices and edges. I also tried to give the
> > broadcast-hash-second (vertices) hint and the job performed
> > significantly slower than letting the system decide.
> >
> > Best,
> > Martin
> >
> >
> > On 22.08.2015 09:51, Andra Lungu wrote:
> >
> >> Hey everyone,
> >>
> >> When coding for my thesis, I observed that half of the current Gelly
> >> functions (the ones that use join operators) fail in a cluster
> >> environment with the following exception:
> >>
> >> java.lang.IllegalArgumentException: Too few memory segments provided.
> >> Hash Join needs at least 33 memory segments.
> >>
> >> This is because, in 99% of the cases, the vertex data set is
> >> significantly smaller than the edge data set. What I did to get rid of
> >> the error was the following:
> >>
> >> DataSet<Tuple2<Edge<K, EV>, Vertex<K, VV>>> edgesWithSources = edges
> >>     .join(this.vertices, JoinOperatorBase.JoinHint.BROADCAST_HASH_SECOND)
> >>     .where(0).equalTo(0)
> >>
> >> In short, I added join hints. I believe this should also be in Gelly,
> >> in case someone bumps into the same problem somewhere in the future.
> >>
> >> What do you think?
> >>
> >>
> >
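PS: For what it's worth, the boolean variant Andra suggests could boil down to something as small as this inside the join-based Gelly functions (a rough sketch only; `useJoinHints` stands for the proposed method parameter and `edges`/`vertices` for the graph's data sets, this is not actual Gelly code):

// pick the strategy depending on the flag the user passed in
JoinOperatorBase.JoinHint hint = useJoinHints
        ? JoinOperatorBase.JoinHint.BROADCAST_HASH_SECOND  // assume a small vertex set
        : JoinOperatorBase.JoinHint.OPTIMIZER_CHOOSES;     // today's default behaviour

DataSet<Tuple2<Edge<K, EV>, Vertex<K, VV>>> edgesWithSources =
        edges.join(vertices, hint).where(0).equalTo(0);

That way the default stays with the optimizer, which would also address Martin's concern about graphs where the size relation is the other way around.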