Hi,
This was all my fault. It turned out I had a line of code buried in a
library that called repartition. I used this library to wrap an RDD and
present it to legacy code through a different interface. That repartition
was what was causing the data to spill to disk.
The really stupid thing is it took me the better
Jim Carroll wrote
> Okay,
>
> I have an RDD that I want to run an aggregate over, but it insists on
> spilling to disk even though I structured the processing to require only
> a single pass.
>
> In other words, I can do all of my processing one entry in the RDD at a
> time without persisting anyt
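For reference, the single-pass shape described above can be sketched without
Spark at all: rdd.aggregate(zero)(seqOp, combOp) streams each partition's
entries through seqOp and merges the partial per-partition results with
combOp. Here is a plain-Scala analogue of that contract (the names and
numbers are mine, purely illustrative):

```scala
// Minimal sketch (plain Scala, not Spark): single-pass aggregation in the
// shape of rdd.aggregate(zero)(seqOp, combOp). Each element is folded into
// the accumulator as it streams past; nothing is buffered.
type Acc = (Long, Long) // running (sum, count)

// seqOp: fold one element into a partition's accumulator
def seqOp(acc: Acc, x: Long): Acc = (acc._1 + x, acc._2 + 1)

// combOp: merge the per-partition accumulators
def combOp(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

// two "partitions", each consumed one entry at a time
val p1 = Iterator(1L, 2L, 3L).foldLeft((0L, 0L): Acc)(seqOp)
val p2 = Iterator(4L, 5L).foldLeft((0L, 0L): Acc)(seqOp)
val result = combOp(p1, p2)
println(result) // (15,5): sum and count
```

Nothing in this sketch buffers the input, which is why an aggregate by itself
shouldn't need to spill; a shuffle introduced upstream (e.g. a hidden
repartition) is a different story.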
Never mind. I'm going to post another question, since this has to do with
the way Spark handles sc.textFile with a file:// URI to a .gz file.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20725.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
In case a little more information is helpful:
the RDD is constructed using sc.textFile(fileUri), where fileUri points to a
".gz" file (that's too big to fit on my disk).
I do an rdd.persist(StorageLevel.NONE) and it seems to have no effect.
This rdd is what I'm calling aggregate on and I expect t
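A couple of hedged notes on that setup, for anyone searching the archive
later: StorageLevel.NONE means "don't store," which as far as I can tell is
already an RDD's default, so persist(StorageLevel.NONE) would be expected to
change nothing. And since gzip is not a splittable format, sc.textFile on a
.gz gives you a single partition that can only be read as one sequential
stream. That stream can still be consumed one line at a time without
materializing anything, as this plain-JVM sketch shows (a tiny in-memory
gzip payload stands in for the big file; all names here are mine):

```scala
// Sketch (plain JVM, no Spark): a .gz stream consumed line by line without
// ever being materialized -- the property the single-pass aggregate relies on.
import java.io._
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.io.Source

// build a tiny gzip payload in memory (stand-in for the big .gz file)
val bytes = {
  val buf = new ByteArrayOutputStream()
  val gz = new OutputStreamWriter(new GZIPOutputStream(buf), "UTF-8")
  gz.write("one\ntwo\nthree\n")
  gz.close() // closing flushes the gzip trailer
  buf.toByteArray
}

// stream it: each line is seen exactly once, nothing is buffered or persisted
val lines = Source.fromInputStream(
  new GZIPInputStream(new ByteArrayInputStream(bytes)), "UTF-8").getLines()
val (chars, count) = lines.foldLeft((0L, 0L)) {
  case ((c, n), line) => (c + line.length, n + 1)
}
println((chars, count)) // (11,3): 3+3+5 characters across 3 lines
```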