Re: computation slows down 10x because of cached RDDs

2014-03-11 Thread Koert Kuipers
hey matei, ok when i switch to java 7 with G1 the GC time for all the "quick" tasks goes from 150ms to 10ms, but the slow ones stay just as slow. all i did was add -XX:+UseG1GC so maybe thats wrong, i still have to read up on G1. an example of GC in a slow task is below. best, koert [GC pause (y

Re: computation slows down 10x because of cached RDDs

2014-03-11 Thread Matei Zaharia
Yeah, System.gc() is a suggestion but in practice it does invoke full GCs on the Sun JVM.

Matei

On Mar 11, 2014, at 12:35 PM, Koert Kuipers wrote:
> hey matei,
> ha i will definitely try that one! looks like a total hack... i might just
> schedule it after the precaching of rdds defensively.
>
>

Re: computation slows down 10x because of cached RDDs

2014-03-11 Thread Koert Kuipers
hey matei,
ha i will definitely try that one! looks like a total hack... i might just schedule it after the precaching of rdds defensively.

also trying java 7 with g1

On Tue, Mar 11, 2014 at 3:17 PM, Matei Zaharia wrote:
> Right, that's it. I think what happened is the following: all the nodes
> ge

Re: computation slows down 10x because of cached RDDs

2014-03-11 Thread Andrew Ash
Note that calling System.gc() is just a suggestion to the JVM that it should run a garbage collection and doesn't force it right then 100% of the time.

http://stackoverflow.com/questions/1481178/forcing-garbage-collection-in-java

On Tue, Mar 11, 2014 at 12:17 PM, Matei Zaharia wrote:
> Right, t
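One way to check whether the hint is honored on a given JVM is to compare the collection counters before and after the call. This is only a minimal sketch using the standard java.lang.management beans; the class and method names (GcHintCheck, totalCollections, collectionsTriggeredByHint) are made up for illustration and are not from this thread or from Spark.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Spark: observes whether System.gc()
// led to any recorded collections on the current JVM.
public class GcHintCheck {

    // Sum of collection counts over all collectors the JVM reports.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the count is undefined
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Number of collections recorded across one System.gc() call.
    // Collections caused by other threads in the same window are counted too.
    static long collectionsTriggeredByHint() {
        long before = totalCollections();
        System.gc(); // a request only; ignored e.g. under -XX:+DisableExplicitGC
        return totalCollections() - before;
    }

    public static void main(String[] args) {
        System.out.println("collections recorded around System.gc(): "
                + collectionsTriggeredByHint());
    }
}
```

A result of 0 would suggest the JVM ignored the request; a positive number does not by itself prove that a full collection ran, only that some collection was recorded.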

Re: computation slows down 10x because of cached RDDs

2014-03-11 Thread Matei Zaharia
Right, that’s it. I think what happened is the following: all the nodes generated some garbage that put them very close to the threshold for a full GC in the first few runs of the program (when you cached the RDDs), but on the subsequent queries, only a few nodes are hitting full GC per query, s

Re: computation slows down 10x because of cached RDDs

2014-03-11 Thread Koert Kuipers
hey matei,
most tasks have GC times of 200ms or less, and then a few tasks take many seconds. example GC activity for a slow one:

[GC [PSYoungGen: 1051814K->262624K(1398144K)] 3789259K->3524429K(5592448K), 0.0986800 secs] [Times: user=1.53 sys=0.01, real=0.10 secs]
[GC [PSYoungGen: 786935K->524512
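To compare slow and quick tasks it can help to pull the numbers out of -XX:+PrintGCDetails lines like the one quoted above. The following is a rough sketch, assuming only the young-collection line shape shown in this message; the class name GcLogLine and its fields are hypothetical, and other collectors or Full GC lines use different layouts that this pattern does not cover.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for one line shape, e.g.
// [GC [PSYoungGen: 1051814K->262624K(1398144K)] 3789259K->3524429K(5592448K), 0.0986800 secs]
public class GcLogLine {
    // Groups: whole-heap before, after, capacity (all in K), then pause in seconds.
    private static final Pattern LINE = Pattern.compile(
            "\\[GC \\[.*?\\] (\\d+)K->(\\d+)K\\((\\d+)K\\), ([0-9.]+) secs\\]");

    final long heapBeforeK;
    final long heapAfterK;
    final long heapCapacityK;
    final double pauseSeconds;

    private GcLogLine(long before, long after, long capacity, double pause) {
        this.heapBeforeK = before;
        this.heapAfterK = after;
        this.heapCapacityK = capacity;
        this.pauseSeconds = pause;
    }

    // Returns null when the line does not have the expected shape.
    static GcLogLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new GcLogLine(
                Long.parseLong(m.group(1)),
                Long.parseLong(m.group(2)),
                Long.parseLong(m.group(3)),
                Double.parseDouble(m.group(4)));
    }

    // Fraction of the whole heap still occupied after the collection.
    double occupancyAfter() {
        return (double) heapAfterK / heapCapacityK;
    }

    public static void main(String[] args) {
        GcLogLine r = parse("[GC [PSYoungGen: 1051814K->262624K(1398144K)] "
                + "3789259K->3524429K(5592448K), 0.0986800 secs] "
                + "[Times: user=1.53 sys=0.01, real=0.10 secs]");
        System.out.println(r.heapAfterK + "K of " + r.heapCapacityK + "K in use, pause "
                + r.pauseSeconds + "s");
    }
}
```

For the quoted line, the heap stays at roughly 63% of capacity after the young collection, so most of what is live sits outside the young generation.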

Re: computation slows down 10x because of cached RDDs

2014-03-10 Thread Koert Kuipers
hey matei,
it happens repeatedly. we are currently running on java 6 with spark 0.9. i will add -XX:+PrintGCDetails and collect details, and also look into java 7 G1.

thanks

On Mon, Mar 10, 2014 at 6:27 PM, Matei Zaharia wrote:
> Does this happen repeatedly if you keep running the computa

Re: computation slows down 10x because of cached RDDs

2014-03-10 Thread Matei Zaharia
Does this happen repeatedly if you keep running the computation, or just the first time? It may take time to move these Java objects to the old generation the first time you run queries, which could lead to a GC pause that also slows down the small queries. If you can run with -XX:+PrintGCDetai

computation slows down 10x because of cached RDDs

2014-03-10 Thread Koert Kuipers
hello all, i am observing a strange result. i have a computation that i run on a cached RDD in spark-standalone. it typically takes about 4 seconds. but when other RDDs that are not relevant to the computation at hand are cached in memory (in same spark context), the computation takes 40 seconds o