Hi,
This was all my fault. It turned out I had a line of code buried in a
library that called repartition. I used this library to wrap an RDD and
present it to legacy code through a different interface. That repartition
was what was causing the data to spill to disk.
The really stupid thing is it took me the better
Jim Carroll wrote
> Okay,
>
> I have an RDD that I want to run an aggregate over, but it insists on
> spilling to disk even though I structured the processing to require only
> a single pass.
>
> In other words, I can do all of my processing one entry in the RDD at a
> time without persisting anyt
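For reference, the single-pass shape described above can be sketched without
Spark at all: rdd.aggregate(zero)(seqOp, combOp) streams each partition's
entries through seqOp and merges the partial per-partition results with
combOp. Here is a plain-Scala analogue of that contract (the names and
numbers are mine, purely illustrative):

```scala
// Minimal sketch (plain Scala, not Spark): single-pass aggregation in the
// shape of rdd.aggregate(zero)(seqOp, combOp). Each element is folded into
// the accumulator as it streams past; nothing is buffered.
type Acc = (Long, Long) // running (sum, count)

// seqOp: fold one element into a partition's accumulator
def seqOp(acc: Acc, x: Long): Acc = (acc._1 + x, acc._2 + 1)

// combOp: merge the per-partition accumulators
def combOp(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

// two "partitions", each consumed one entry at a time
val p1 = Iterator(1L, 2L, 3L).foldLeft((0L, 0L): Acc)(seqOp)
val p2 = Iterator(4L, 5L).foldLeft((0L, 0L): Acc)(seqOp)
val result = combOp(p1, p2)
println(result) // (15,5): sum and count
```

Nothing in this sketch buffers the input, which is why an aggregate by itself
shouldn't need to spill; a shuffle introduced upstream (e.g. a hidden
repartition) is a different story.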
Never mind. I'm going to post another question, since this has to do with
the way Spark handles sc.textFile with a file:// URI to a .gz file.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20725.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
In case a little more information is helpful:
the RDD is constructed using sc.textFile(fileUri), where fileUri points to a
".gz" file (that's too big to fit on my disk).
I do an rdd.persist(StorageLevel.NONE) and it seems to have no effect.
This rdd is what I'm calling aggregate on and I expect t
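A couple of hedged notes on that setup, for anyone searching the archive
later: StorageLevel.NONE means "don't store," which as far as I can tell is
already an RDD's default, so persist(StorageLevel.NONE) would be expected to
change nothing. And since gzip is not a splittable format, sc.textFile on a
.gz gives you a single partition that can only be read as one sequential
stream. That stream can still be consumed one line at a time without
materializing anything, as this plain-JVM sketch shows (a tiny in-memory
gzip payload stands in for the big file; all names here are mine):

```scala
// Sketch (plain JVM, no Spark): a .gz stream consumed line by line without
// ever being materialized -- the property the single-pass aggregate relies on.
import java.io._
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.io.Source

// build a tiny gzip payload in memory (stand-in for the big .gz file)
val bytes = {
  val buf = new ByteArrayOutputStream()
  val gz = new OutputStreamWriter(new GZIPOutputStream(buf), "UTF-8")
  gz.write("one\ntwo\nthree\n")
  gz.close() // closing flushes the gzip trailer
  buf.toByteArray
}

// stream it: each line is seen exactly once, nothing is buffered or persisted
val lines = Source.fromInputStream(
  new GZIPInputStream(new ByteArrayInputStream(bytes)), "UTF-8").getLines()
val (chars, count) = lines.foldLeft((0L, 0L)) {
  case ((c, n), line) => (c + line.length, n + 1)
}
println((chars, count)) // (11,3): 3+3+5 characters across 3 lines
```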