Hi, 
  Patrick said: "The intermediate shuffle output gets written to disk, but it
  often hits the OS-buffer cache since it's not explicitly fsync'ed, so in
  many cases it stays entirely in memory. The behavior of the shuffle is
  agnostic to whether the base RDD is in cache or in disk."

  I ran a test with a single groupBy action and found that the intermediate
  shuffle files are written to disk even though there is sufficient free
  memory. The shuffle size is about 500 MB and there is 1.5 GB of free
  memory, and I noticed that disk usage increased by about 500 MB during the
  job.

  Here is the vmstat log. The cache column increases when reading from disk,
  but the buff column is essentially unchanged, so the data written to disk
  does not appear to be buffered.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0  10256 1616852   6664 557344    0    0     0 51380  972 2852 88  7  0  5
 1  0  10256 1592636   6664 580676    0    0     0     0  949 3777 91  9  0  0
 1  0  10256 1568228   6672 604016    0    0     0   576  923 3640 94  6  0  0
 2  0  10256 1545836   6672 627348    0    0     0     0  893 3261 95  5  0  0
 1  0  10256 1521552   6672 650668    0    0     0     0  884 3401 89 11  0  0
 2  0  10256 1497144   6672 674012    0    0     0     0  911 3275 91  9  0  0
 1  0  10256 1469260   6676 700728    0    0     4 60668 1044 3366 85 15  0  0
 1  0  10256 1453076   6684 702464    0    0     0   924  853 2596 97  3  0  0
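
  The buff and cache columns of vmstat correspond to the Buffers and Cached
  fields of /proc/meminfo, so the same comparison can be made around a single
  write. Below is a minimal Python sketch, not Spark code: it writes a file
  without calling fsync and prints Buffers, Cached and Dirty before and after.
  The file name, the 64 MiB size and the use of a temporary directory are
  assumptions made for illustration, and the numbers will also move with
  whatever else is running on the machine.

```python
import os
import tempfile


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field name: value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def snapshot(path="/proc/meminfo"):
    """Return the Buffers, Cached and Dirty values (kB) on a Linux host."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return {key: info.get(key, 0) for key in ("Buffers", "Cached", "Dirty")}


def write_without_fsync(path, size_mb):
    """Write size_mb MiB to path and close it without calling fsync."""
    chunk = b"\0" * (1024 * 1024)
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    before = snapshot()
    # Assumption: the temporary directory lives on a regular disk-backed
    # filesystem; on tmpfs the data never goes to a block device at all.
    with tempfile.TemporaryDirectory() as tmp:
        write_without_fsync(os.path.join(tmp, "shuffle_like.bin"), 64)
        after = snapshot()
    for key in before:
        print(f"{key}: {before[key]} kB -> {after[key]} kB")
```

  If Cached and Dirty grow while Buffers stays flat, that would suggest the
  written data is being held in the page cache rather than in the buffers
  that the buff column reports.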

  Is the buffer cache operating in write-through mode? Is there something I
  need to configure? My OS is Ubuntu 13.10, 64-bit.
  Thanks!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/os-buffer-cache-does-not-cache-shuffle-output-file-tp5478.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
