Hi Andra,

CoGroup is not implemented as a Cartesian product, so O(V*E) is not an accurate approximation.
All of this depends on what you count. Let's assume single-node execution and that everything fits in memory, and let's count comparisons and UDF calls on groups. coGroup sorts both inputs, so the sort phase costs O(V log V + E log E) comparisons. The merge phase is O(V + E), so sorting dominates.

During the merge, you apply some operation to each group (say there are G groups). If that operation depends only on the group size, and we assume a constant group size that is very small compared to V and E (if this is not true, the grouping does not really make sense), you end up with O(V log V + E log E + V + E + G). How many groups do you have? Assuming you coGroup on, say, the "from" vertex of each edge, the worst case is V groups, so the worst-case complexity is O(V log V + E log E + V + E + V) = O(V log V + E log E).

In a parallel scenario, you might instead want to count the number of records transferred over the network. And of course, all of this is, in general, not a good predictor of actual performance.

Hopefully I didn't screw this up; it's been a while since I thought about complexity :-D

Kostas

On Wed, Jul 22, 2015 at 10:56 AM, Andra Lungu <lungu.an...@gmail.com> wrote:
> Hi everyone,
>
> I am not 100% sure about this one, so I thought that I could set my
> thoughts straight via the mailing list.
>
> Here's the use case. You coGroup a data set of vertices with a data set of
> edges. That gives you a complexity of* O(|V| * |E|)*, where |V| is the
> total number of vertices and |E| the total number of edges. You however do
> a vertex value update in the coGroup function. Say you intersect some hash
> sets and assign that value to the vertices. Does that give you an extra
> O(|V|) [making the total O(|V|^2*|E|)] or is that covered by the initial
> O(|V|*|E|)? I guess this has to do with the way a coGroup is internally
> built...
>
> Thanks a lot!
> Andra
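To make the sort-merge structure concrete, here is a minimal single-node sketch in Python. This is an illustration of the general sort-merge coGroup strategy described above, not Flink's actual implementation; the function name, tuple layout, and UDF signature are my own invention:

```python
def sort_merge_cogroup(vertices, edges, udf):
    """Single-node sort-merge coGroup sketch.

    vertices, edges: lists of (key, value) pairs.
    udf: called once per key with the vertex group and edge group.
    """
    # Sort phase: O(V log V) + O(E log E) comparisons -- this dominates.
    vs = sorted(vertices, key=lambda kv: kv[0])
    es = sorted(edges, key=lambda kv: kv[0])

    # Merge phase: one linear pass over both sorted inputs, O(V + E).
    out = []
    i = j = 0
    while i < len(vs) or j < len(es):
        # Pick the smallest key still present on either side.
        if j >= len(es) or (i < len(vs) and vs[i][0] <= es[j][0]):
            key = vs[i][0]
        else:
            key = es[j][0]
        # Collect the (typically small, constant-size) groups for this key.
        vgroup = []
        while i < len(vs) and vs[i][0] == key:
            vgroup.append(vs[i][1])
            i += 1
        egroup = []
        while j < len(es) and es[j][0] == key:
            egroup.append(es[j][1])
            j += 1
        # One UDF call per group: O(G) calls, at most V groups when
        # grouping edges by their "from" vertex.
        out.append(udf(key, vgroup, egroup))
    return out


# Example: group edges by source vertex and count group sizes.
vertices = [(1, "a"), (2, "b"), (3, "c")]
edges = [(1, "x"), (1, "y"), (3, "z")]
result = sort_merge_cogroup(
    vertices, edges, lambda k, vg, eg: (k, len(vg), len(eg))
)
# Vertex 2 has no edges, so its edge group is simply empty --
# no V*E pairing ever happens.
```

Note that each input record is touched once in the merge, which is why the per-group work (e.g. updating a vertex value) adds O(V + E + G) rather than multiplying into the bound.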