Note that this problem is probably NOT caused directly by GraphX, but GraphX reveals it because as you go further down the iterations, you get further and further away of a shuffle you can rely on.
On Thu, Jun 25, 2015 at 7:43 PM, Thomas Gerber <[email protected]> wrote: > Hello, > > We run GraphX ConnectedComponents, and we notice that there is a time gap > that becomes larger and larger during Jobs, that is not accounted for. > > In the screenshot attached, you will notice that each job only takes > around 2 1/2min. At first, the next job/iteration starts immediately after > the previous one. But as we go through iterations, there is a gap (time > where job N+1 starts - time where job N finishes) that grows, reaching > ultimately 6 minutes around the 30th iteration . > > I suspect it has to do with DAG computation on the driver, as evidenced by > the very large (and getting larger at every iteration) of pending stages > that are ultimately skipped. > > So, > 1. is there anything obvious we can do to make that "gap" between > iterations shorter? > 2. would dividing the number of partitions in the input RDD per 2 divide > the gap by 2 as well? > > I ask because 3 min gap on average for a job length of 2 1/2 min => we are > "wasting" 50% of CPU time on the Executors. > > Thanks! > Thomas >
