Hi Todd, Howard,

Thanks for your reply; I may not have presented my question clearly.
What I mean is: if I call rdd.persist(StorageLevel.MEMORY_AND_DISK), the BlockManager caches the RDD in the MemoryStore, and partitions that cannot fit in memory are migrated to the DiskStore. I think this migration does require serializing the data, and compressing it if spark.rdd.compress is set to true. So the data on disk is serialized even though I didn't choose a serialized storage level, am I right?

Thanks,
Zhipeng

From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Thursday, May 21, 2015 8:49 PM
To: Jiang, Zhipeng
Cc: user@spark.apache.org
Subject: Re: Question about Serialization in Storage Level

From the docs, https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence:

Storage Level - Meaning

MEMORY_ONLY
  Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK
  Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER
  Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer (https://spark.apache.org/docs/latest/tuning.html), but more CPU-intensive to read.

MEMORY_AND_DISK_SER
  Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

On Thu, May 21, 2015 at 3:52 AM, Jiang, Zhipeng <zhipeng.ji...@intel.com> wrote:

Hi there,

This question may seem kind of naïve, but what's the difference between MEMORY_AND_DISK and MEMORY_AND_DISK_SER? If I call rdd.persist(StorageLevel.MEMORY_AND_DISK), does that mean the BlockManager won't serialize the rdd?

Thanks,
Zhipeng
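
For reference, a minimal sketch (Scala, Spark 1.x-style API) showing the two storage levels side by side; the app name, the sample data, and the explicit spark.rdd.compress setting are illustrative assumptions, not taken from the original thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Illustrative configuration; spark.rdd.compress only affects
    // serialized blocks (including partitions spilled to disk).
    val conf = new SparkConf()
      .setAppName("StorageLevelExample")
      .set("spark.rdd.compress", "true")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 1000000)

    // MEMORY_AND_DISK: kept as deserialized Java objects while in memory;
    // partitions that don't fit are spilled to disk.
    val deser = data.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)

    // MEMORY_AND_DISK_SER: stored as serialized byte arrays (one per
    // partition) even in memory, trading CPU for a smaller footprint.
    val ser = data.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK_SER)

    deser.count()  // materialize and cache
    ser.count()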