RDD memory and storage level option

Tsai Li Ming Thu, 20 Nov 2014 04:13:54 -0800

Hi,

This is on version 1.1.0.


I did a simple test on the MEMORY_AND_DISK storage level.

> import org.apache.spark.storage.StorageLevel
> val file = sc.textFile("file:///path/to/file.txt").persist(StorageLevel.MEMORY_AND_DISK)
> file.count()

The file is 1.5GB and there is only 1 worker. I have requested 1GB of worker 
memory per node:
                                                                                
                                              
ID                       Name         Cores  Memory per Node  Submitted Time       User  State    Duration
app-20141120193912-0002  Spark shell  64     1024.0 MB        2014/11/20 19:39:12  root  RUNNING  6.0 min


After a simple count, the web UI indicates that the entire file is stored on 
disk:

RDD Name                  Storage Level                  Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
file:///path/to/file.txt  Disk Serialized 1x Replicated  46                 100%             0.0 B           0.0 B            1476.5 MB
                                                 
1. Shouldn’t some partitions be saved into memory? 




2. If I run with the MEMORY_ONLY option, some partitions are cached in memory, 
but according to the executor page there is still space left (220.6 MB / 
530.3 MB used) and it is not fully used up. Each partition is about 73MB.

RDD Name                  Storage Level                       Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
file:///path/to/file.txt  Memory Deserialized 1x Replicated  3                  7%               220.6 MB        0.0 B            0.0 B
                                              
Executor ID  Address       RDD Blocks  Memory Used          Disk Used  Active Tasks  Failed Tasks  Complete Tasks  Total Tasks  Task Time  Input      Shuffle Read  Shuffle Write
0            foo.co:48660  3           220.6 MB / 530.3 MB  0.0 B      0             0             46              46           14.2 m     1457.4MB   0.0 B         0.0 B

14/11/20 19:53:22 INFO BlockManagerInfo: Added rdd_1_22 in memory on foo.co:48660 (size: 73.6 MB, free: 309.6 MB)
14/11/20 19:53:22 INFO TaskSetManager: Finished task 22.0 in stage 0.0 (TID 22) in 29833 ms on foo.co (43/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 33.0 in stage 0.0 (TID 33) in 31502 ms on foo.co (44/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 24.0 in stage 0.0 (TID 24) in 31651 ms on foo.co (45/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14) in 31782 ms on foo.co (46/46)
14/11/20 19:53:24 INFO DAGScheduler: Stage 0 (count at <console>:16) finished in 31.818 s
14/11/20 19:53:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
14/11/20 19:53:24 INFO SparkContext: Job finished: count at <console>:16, took 31.926585742 s
res0: Long = 10000000

Is this correct?
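
For reference, the arithmetic behind this question can be sketched as below. It only uses the figures reported in the UI and the log; the variable names are made up for illustration, and it makes no claim about how Spark decides what to cache.

```scala
// Figures copied from the executor page and the BlockManagerInfo log line.
val storageCapacityMB = 530.3  // total storage memory shown on the executor page
val usedMB            = 220.6  // memory used by the 3 cached partitions
val partitionMB       = 73.6   // size of rdd_1_22 as reported in the log

// Space left after caching, and how many more partitions of that size would fit.
val freeMB = storageCapacityMB - usedMB
val additionalPartitions = math.floor(freeMB / partitionMB).toInt

println(f"free: $freeMB%.1f MB, room for $additionalPartitions more partitions")
```

By size alone, roughly four more partitions of about 73MB would fit in the remaining space, which matches the "free: 309.6 MB" in the log to within rounding.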



3. I can’t work out the math that yields the 530MB made available in the 
executor. I expected 1024MB * memoryFraction (0.6) = 614.4MB.
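
A minimal sketch of the arithmetic, under two assumptions that are not stated in this post: that the storage pool is computed as the JVM-reported max heap times spark.storage.memoryFraction times an additional safety fraction (taken here as 0.9), and that the JVM-reported max heap can be smaller than the requested 1024MB. The helper `storageMemoryMB` is hypothetical, not a Spark API.

```scala
// Hypothetical helper: heap * memoryFraction * safetyFraction (assumed formula).
def storageMemoryMB(heapMB: Double, memoryFraction: Double, safetyFraction: Double): Double =
  heapMB * memoryFraction * safetyFraction

// The calculation from the question (no safety fraction applied).
val naiveMB = storageMemoryMB(1024.0, 0.6, 1.0)

// The same calculation with the assumed safety fraction of 0.9.
val withSafetyMB = storageMemoryMB(1024.0, 0.6, 0.9)

// Heap size that would produce the 530.3 MB shown in the UI under the assumed formula.
val impliedHeapMB = 530.3 / (0.6 * 0.9)

// Max heap the current JVM actually reports, for comparison with impliedHeapMB.
val reportedHeapMB = Runtime.getRuntime.maxMemory / (1024.0 * 1024.0)

println(f"naive: $naiveMB%.1f MB, with safety: $withSafetyMB%.2f MB, implied heap: $impliedHeapMB%.1f MB")
println(f"JVM-reported max heap here: $reportedHeapMB%.1f MB")
```

If the assumed formula holds, 530.3MB would correspond to a reported heap of about 982MB rather than 1024MB; running the last line on the executor's JVM settings is one way to check that.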

Thanks!




