Hi,

Patrick said: "The intermediate shuffle output gets written to disk, but it often hits the OS-buffer cache since it's not explicitly fsync'ed, so in many cases it stays entirely in memory. The behavior of the shuffle is agnostic to whether the base RDD is in cache or in disk."

I ran a test with a single groupBy action and found that the intermediate shuffle files are written to disk even though there is enough free memory. The shuffle size is about 500 MB and there is about 1.5 GB of free memory, and I see disk usage increase by about 500 MB during the job.

Here is the vmstat output. The cache column increases when reading from disk, but the buff column stays essentially unchanged (6664 to 6684), so it looks to me as though the data written to disk is not buffered:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0  10256 1616852   6664 557344    0    0     0 51380  972 2852 88  7  0  5
 1  0  10256 1592636   6664 580676    0    0     0     0  949 3777 91  9  0  0
 1  0  10256 1568228   6672 604016    0    0     0   576  923 3640 94  6  0  0
 2  0  10256 1545836   6672 627348    0    0     0     0  893 3261 95  5  0  0
 1  0  10256 1521552   6672 650668    0    0     0     0  884 3401 89 11  0  0
 2  0  10256 1497144   6672 674012    0    0     0     0  911 3275 91  9  0  0
 1  0  10256 1469260   6676 700728    0    0     4 60668 1044 3366 85 15  0  0
 1  0  10256 1453076   6684 702464    0    0     0   924  853 2596 97  3  0  0

Is the buffer cache in write-through mode? Is there something I need to configure? My OS is Ubuntu 13.10, 64-bit.

Thanks!
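
To make the vmstat readings above easier to compare, here is a minimal Python sketch that totals the relevant columns over the eight samples. It assumes vmstat's default units (memory columns in KiB, bi/bo in blocks), which the post does not state. The names parse_vmstat and summarise are made up for this illustration.

```python
# Sketch: summarise the vmstat samples quoted in this post.
# Assumption (not stated in the post): default vmstat units,
# i.e. memory columns in KiB and bi/bo in blocks.

COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
           "bi", "bo", "in", "cs", "us", "sy", "id", "wa"]

# The eight sample rows, copied from the vmstat output in the post.
SAMPLES = """\
2 0 10256 1616852 6664 557344 0 0 0 51380 972 2852 88 7 0 5
1 0 10256 1592636 6664 580676 0 0 0 0 949 3777 91 9 0 0
1 0 10256 1568228 6672 604016 0 0 0 576 923 3640 94 6 0 0
2 0 10256 1545836 6672 627348 0 0 0 0 893 3261 95 5 0 0
1 0 10256 1521552 6672 650668 0 0 0 0 884 3401 89 11 0 0
2 0 10256 1497144 6672 674012 0 0 0 0 911 3275 91 9 0 0
1 0 10256 1469260 6676 700728 0 0 4 60668 1044 3366 85 15 0 0
1 0 10256 1453076 6684 702464 0 0 0 924 853 2596 97 3 0 0
"""


def parse_vmstat(text):
    """Turn vmstat data rows into dicts keyed by column name.

    Header lines are skipped because their first field is not a number
    or their field count does not match.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) or not fields[0].isdigit():
            continue
        rows.append(dict(zip(COLUMNS, map(int, fields))))
    return rows


def summarise(rows):
    """Compare the first and last sample and total the I/O columns."""
    first, last = rows[0], rows[-1]
    return {
        "cache_growth": last["cache"] - first["cache"],
        "buff_growth": last["buff"] - first["buff"],
        "free_drop": first["free"] - last["free"],
        "total_bi": sum(r["bi"] for r in rows),
        "total_bo": sum(r["bo"] for r in rows),
    }


if __name__ == "__main__":
    for name, value in summarise(parse_vmstat(SAMPLES)).items():
        print(f"{name}: {value}")
```

Over these samples the cache column grows by 145120 and the buff column by 20, while bi totals 4 and bo totals 113548. Nearly all of the bo total comes from two samples (51380 and 60668); the other samples are at or near zero.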