Re: What is the location in the source code of the computation of the elements in a map transformation?

2015-05-18 Thread Tom Hubregtsen
Hi Patrick, Thank you very much for your response. I am almost there, but am not sure about my conclusion. Let me try to approach it from a different angle. I would like to time the impact of a particular lambda function, or if possible, more broadly measure the impact of any map function. I
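
The measurement idea in this post can be illustrated outside Spark. The following is a minimal plain-Python sketch, not Spark code: `timed` and `totals` are hypothetical names introduced here, and the built-in `map` stands in for an RDD map transformation.

```python
import time

def timed(fn, totals):
    """Wrap fn so each call adds its wall-clock time to totals (hypothetical helper)."""
    def wrapper(x):
        start = time.perf_counter()
        try:
            return fn(x)
        finally:
            # Record the call even if fn raises.
            totals["calls"] += 1
            totals["seconds"] += time.perf_counter() - start
    return wrapper

totals = {"calls": 0, "seconds": 0.0}

# The built-in map is a stand-in for a map transformation; the lambda is the function under test.
squared = list(map(timed(lambda x: x * x, totals), range(1000)))

print("calls:", totals["calls"], "total seconds:", totals["seconds"])
```

In a real Spark job the wrapped function runs on the executors, so a driver-side dictionary like `totals` would not be updated; an accumulator would be one way to collect the per-call timings instead.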

What is the location in the source code of the computation of the elements in a map transformation?

2015-05-02 Thread Tom Hubregtsen
I am trying to understand the data and computation flow in Spark, and believe I understand the shuffle fairly well (both map and reduce side), but I do not get what happens to the computation from the map stages. I know all maps get pipelined on the shuffle (when there is no other action in bet

Re: Spilling when not expected

2015-04-29 Thread Tom Hubregtsen
Hi Reynold, It took me some time, but I've finally found that there is a difference between spilling on the map-side and spilling on the reduce-side for a shuffle. Spilling to disk on the map-side happens by default (with the spillToPartitionFiles call from insertAll in ExternalSorter; don't know

Re: Spilling when not expected

2015-03-13 Thread Tom Hubregtsen
apply? How much memory does the web ui say is available?
>
> BTW - I don't think any JVM can actually handle 700G heap ... (maybe Zing).
>
> On Thu, Mar 12, 2015 at 4:09 PM, Tom Hubregtsen wrote:
>
>> Hi all,
>>
>> I'm running the teraSort benchmark

Spilling when not expected

2015-03-12 Thread Tom Hubregtsen
Hi all, I'm running the teraSort benchmark with a relatively small input set: 5GB. During profiling, I can see I am using a total of 68GB. I've got a terabyte of memory in my system, and set spark.executor.memory 900g and spark.driver.memory 900g. I use the default for spark.shuffle.memoryFraction spar
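
To see why spilling looks unexpected with the settings quoted above, here is a rough back-of-the-envelope sketch. It assumes that the default spark.shuffle.memoryFraction is 20% and that a further 80% safety margin is applied on top; both values are assumptions for illustration and are not confirmed by the post, and `shuffle_budget_gb` is a hypothetical helper.

```python
# Rough shuffle-memory budget for the configuration described in the post.
EXECUTOR_MEMORY_GB = 900    # spark.executor.memory 900g (from the post)
SHUFFLE_FRACTION_PCT = 20   # assumption: default spark.shuffle.memoryFraction
SAFETY_FRACTION_PCT = 80    # assumption: extra safety margin on the shuffle region
INPUT_GB = 5                # teraSort input size (from the post)

def shuffle_budget_gb(heap_gb, fraction_pct, safety_pct):
    """Return the approximate shuffle budget in whole GB (hypothetical helper)."""
    # Integer percentages avoid floating-point rounding in this estimate.
    return heap_gb * fraction_pct * safety_pct // (100 * 100)

budget = shuffle_budget_gb(EXECUTOR_MEMORY_GB, SHUFFLE_FRACTION_PCT, SAFETY_FRACTION_PCT)
print("approximate shuffle budget (GB):", budget)
print("input fits in budget:", INPUT_GB < budget)
```

Under these assumptions the budget is far larger than the 5GB input, which is why memory pressure alone would not be expected to cause spilling.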

Memory

2014-10-23 Thread Tom Hubregtsen
Hi all, I would like to validate my understanding of memory regions in Spark. Any comments on my description below would be appreciated! Execution is split up into stages, based on wide dependencies between RDDs and actions such as save. All transformations involving narrow dependencies before th

Impact of input format on timing

2014-10-05 Thread Tom Hubregtsen
Hi, I ran the same version of a program with two different types of input containing equivalent information. Program 1: 10,000 files with 50 IDs on average, one per line. Program 2: 1 file containing 10,000 lines, with 50 IDs per line on average. My program takes the input, creates key/value pairs of

Re: memory size for caching RDD

2014-09-27 Thread Tom Hubregtsen
Use unpersist(), even when not persisted before.

RE: spark.local.dir and spark.worker.dir not used

2014-09-27 Thread Tom Hubregtsen
Also, if I am not mistaken, this data is automatically removed after your run. Be sure to check it while running your program.

Spark memory regions

2014-09-27 Thread Tom Hubregtsen
As I've mentioned before, I am currently writing my master's thesis on storage and memory usage in Spark. I am currently looking specifically at the different fractions of memory: I was able to find 3 memory regions, but that seems to leave some memory unaccounted for: 1. spark.shuffle.memoryFraction: 20% 2. sp

Spark spilling location

2014-09-18 Thread Tom Hubregtsen
Hi all, Just one line of context, since the last post mentioned this would help: I'm currently writing my master's thesis (Computer Engineering) on storage and memory in both Spark and Hadoop. Right now I'm trying to analyze the spilling behavior of Spark, and I do not see what I expect. Therefore, I w