On Thu, Sep 11, 2014 at 10:17 PM, Tom <thubregt...@gmail.com> wrote:
> If I set SPARK_DRIVER_MEMORY to x GB, Spark reports
> /14/09/11 15:36:41 INFO MemoryStore: MemoryStore started with capacity
> ~0.55*x GB/
> *Question:*
> Does this relate to spark.storage.memoryFraction (default 0.6), and is the
> other 0.4 used by spark.shuffle.memoryFraction (default 0.2) and Spark's
> general usage (0.2?)?

Yes, that matches my understanding.
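
The ~0.55 factor is also consistent with the safety margin Spark applies
on top of the storage fraction. If I remember right, in 1.x the
MemoryStore capacity works out to roughly (assuming the default
spark.storage.safetyFraction of 0.9):

  capacity ~= heap size * spark.storage.memoryFraction * spark.storage.safetyFraction
           ~= x * 0.6 * 0.9
           ~= 0.54 * x

which matches the ~0.55*x you are seeing.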


> Say I have two different programs:
> a)
> JavaPairRDD<String, String> a = file.flatMapToPair(new SomeFlatMapFunction());
> JavaPairRDD<String, String> b = a.reduceByKey(new SomeReduceFunction());
> JavaPairRDD<String, String> c = b.mapValues(new SomeMapFunction());
>
> b)
> file.flatMapToPair(new SomeFlatMapFunction())
>     .reduceByKey(new SomeReduceFunction())
>     .mapValues(new SomeMapFunction());
>
> I am now wondering which RDDs are actually created, and if they are the
> same in both situations.
> I could see a scenario in a) in which lazy evaluation leads to a similar
> situation to
> int a, b, c;
> a = 0;
> b = a;
> c = b;
> where the compiler removes a and b, and only stores c.

These programs create the same RDDs. How references to the JavaPairRDDs
are saved or passed around is immaterial. This is not equivalent to the
assignments you mention, since these aren't assignments but method
calls, each of which creates a distinct RDD and has side effects.
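
If you want to convince yourself, print the lineage of the final RDD in
both versions with toDebugString(); you should see the same chain of
parent RDDs either way. (SomeFlatMapFunction etc. are placeholders for
your own function classes.)

JavaPairRDD<String, String> c = file
    .flatMapToPair(new SomeFlatMapFunction())
    .reduceByKey(new SomeReduceFunction())
    .mapValues(new SomeMapFunction());

// Shows the chain of parent RDDs backing c, regardless of whether the
// intermediate references were ever stored in variables.
System.out.println(c.toDebugString());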


> Now when I look into the output, I see
> /MappedRDD[37]/
> But I only defined 18 RDDs in my code with JavaPairRDD.
>
> *Question:*
> When are RDDs actually created? Can I trace them one-to-one in the output?

I would not necessarily take the [37] to be a count of RDDs created so
far, and RDDs are not necessarily 1-1 with your method calls either:
some transformations create more than one RDD internally. It shouldn't
really matter. They are handles on stages of computation, and are
materialized or computed as needed for you.
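
To see the laziness concretely, here's a minimal sketch (the path and
the function are placeholders):

// Transformations only build up a lineage graph; nothing is read or
// computed at this point.
JavaRDD<String> lines = sc.textFile("hdfs:///some/path");
JavaRDD<Integer> lengths = lines.map(new Function<String, Integer>() {
  public Integer call(String s) { return s.length(); }
});

// Only an action forces evaluation; this is when the MemoryStore /
// stage / task INFO lines for the RDDs involved will appear.
long n = lengths.count();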


> When I run a program, I see the following lines:
> /14/09/11 15:36:44 INFO MemoryStore: ensureFreeSpace(3760) called with
> curMem=360852, maxMem=2899102924
> 14/09/11 15:36:44 INFO MemoryStore: Block broadcast_9 stored as values in
> memory (estimated size 3.7 KB, free 2.7 GB)/
> But also
> /14/09/11 12:57:08 INFO ExternalAppendOnlyMap: Thread 239 spilling in-memory
> map of 493 MB to disk (7 times so far)
> 14/09/11 12:57:09 INFO ExternalAppendOnlyMap: Thread 239 spilling in-memory
> map of 493 MB to disk (8 times so far)/
>
> I could see a scenario in which a shuffle uses more than an actual RDD store
> needs, but this seems disproportionate to me.
> *Question:*
> Where can I see the actual size of an individual RDD? Or is there a way to
> calculate it?

Look at the "Storage" tab in the UI to see persisted RDDs. For RDDs
that aren't persisted, I'm not sure you can meaningfully say what their
actual resource consumption is; it could be nothing, if the RDD has not
yet been computed, for example. Memory used for shuffling is not the
same as memory used for persisting RDDs, and yes, you can certainly
imagine situations in which shuffles are much larger than the RDDs they
operate on.
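
If you want a number for a specific RDD, one approach is to persist it,
force it to be computed, and then read the estimated size off the
Storage tab. You can also get at the same figures programmatically;
getRDDStorageInfo() is a developer API on the underlying SparkContext,
so treat this as a sketch:

// Mark the RDD to be persisted, then force it to be computed and stored.
c.persist(StorageLevel.MEMORY_ONLY());
c.count();

// Prints the same estimated sizes shown on the Storage tab.
for (RDDInfo info : sc.sc().getRDDStorageInfo()) {
  System.out.println(info.name() + ": " + info.memSize() + " bytes in memory");
}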

I think the answers may depend more specifically on what you're getting at.
