Hi,
This is on version 1.1.0.
I did a simple test with the MEMORY_AND_DISK storage level.
> import org.apache.spark.storage.StorageLevel
> val file = sc.textFile("file:///path/to/file.txt").persist(StorageLevel.MEMORY_AND_DISK)
> file.count()
The file is 1.5 GB and there is only one worker. I requested 1 GB of
worker memory per node:
ID:               app-20141120193912-0002
Name:             Spark shell
Cores:            64
Memory per Node:  1024.0 MB
Submitted Time:   2014/11/20 19:39:12
User:             root
State:            RUNNING
Duration:         6.0 min
After a simple count, the job web UI indicates that the entire file is stored
on disk:
RDD Name:           file:///path/to/file.txt
Storage Level:      Disk Serialized 1x Replicated
Cached Partitions:  46
Fraction Cached:    100%
Size in Memory:     0.0 B
Size in Tachyon:    0.0 B
Size on Disk:       1476.5 MB
1. Shouldn’t some partitions be saved into memory?
2. If I run with the MEMORY_ONLY option, some partitions are cached in memory,
but according to the executor page there is still space left (220.6 MB of
530.3 MB used). Why is the remaining space not used? Each partition is about
73 MB.
RDD Name:           file:///path/to/file.txt
Storage Level:      Memory Deserialized 1x Replicated
Cached Partitions:  3
Fraction Cached:    7%
Size in Memory:     220.6 MB
Size in Tachyon:    0.0 B
Size on Disk:       0.0 B
Executor ID:     0
Address:         foo.co:48660
RDD Blocks:      3
Memory Used:     220.6 MB / 530.3 MB
Disk Used:       0.0 B
Active Tasks:    0
Failed Tasks:    0
Complete Tasks:  46
Total Tasks:     46
Task Time:       14.2 m
Input:           1457.4 MB
Shuffle Read:    0.0 B
Shuffle Write:   0.0 B
14/11/20 19:53:22 INFO BlockManagerInfo: Added rdd_1_22 in memory on
foo.co:48660 (size: 73.6 MB, free: 309.6 MB)
14/11/20 19:53:22 INFO TaskSetManager: Finished task 22.0 in stage 0.0 (TID 22)
in 29833 ms on foo.co (43/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 33.0 in stage 0.0 (TID 33)
in 31502 ms on foo.co (44/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 24.0 in stage 0.0 (TID 24)
in 31651 ms on foo.co (45/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14)
in 31782 ms on foo.co (46/46)
14/11/20 19:53:24 INFO DAGScheduler: Stage 0 (count at <console>:16) finished
in 31.818 s
14/11/20 19:53:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have
all completed, from pool
14/11/20 19:53:24 INFO SparkContext: Job finished: count at <console>:16, took
31.926585742 s
res0: Long = 10000000
Is this correct?
3. I can’t work out how the 530 MB made available to the executor is derived.
I expected 1024 MB * memoryFraction (0.6) = 614.4 MB.
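
To make the arithmetic in questions 2 and 3 concrete, here is a minimal sketch in plain Scala (no Spark dependency). It assumes the Spark 1.x behaviour where storage memory is the JVM's reported max heap multiplied by spark.storage.memoryFraction (default 0.6) and spark.storage.safetyFraction (default 0.9). The 982 MB heap figure is an assumption chosen so the result matches the 530.3 MB shown on the executor page; the real value of Runtime.getRuntime.maxMemory for a 1024 MB heap depends on the JVM and GC settings. The object and method names are hypothetical.

```scala
// Sketch of the storage-memory arithmetic. Names are illustrative only.
object StorageMemoryEstimate {
  // Spark 1.x defaults for the storage region of the heap.
  val memoryFraction = 0.6 // spark.storage.memoryFraction
  val safetyFraction = 0.9 // spark.storage.safetyFraction

  // Storage memory = reported JVM max heap * memoryFraction * safetyFraction.
  def storageMemoryMb(jvmMaxHeapMb: Double): Double =
    jvmMaxHeapMb * memoryFraction * safetyFraction

  // Number of whole blocks of the given size that fit in the storage memory.
  def blocksThatFit(storageMb: Double, blockMb: Double): Int =
    math.floor(storageMb / blockMb).toInt

  def main(args: Array[String]): Unit = {
    // ASSUMPTION: a 1024 MB heap reports roughly 982 MB as max memory.
    val assumedMaxHeapMb = 982.0
    val storageMb = storageMemoryMb(assumedMaxHeapMb)
    println(f"Estimated storage memory: $storageMb%.1f MB")
    // 73.6 MB is the block size reported in the BlockManagerInfo log line.
    println(s"73.6 MB blocks that would fit: ${blocksThatFit(storageMb, 73.6)}")
  }
}
```

Under these assumptions the estimate comes to roughly 530 MB, which would hold 7 blocks of 73.6 MB. This accounts for the 530.3 MB figure, but it does not by itself explain why only 3 blocks were cached.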
Thanks!