Hi Todd, Howard,

Thanks for your reply; I may not have presented my question clearly.
What I mean is: if I call rdd.persist(StorageLevel.MEMORY_AND_DISK), the BlockManager caches the RDD in the MemoryStore, and partitions that cannot fit in memory are migrated to the DiskStore. I think this migration does require serializing the data, and compressing it if spark.rdd.compress is set to true. So the data on disk is serialized even though I didn't choose a serialized storage level, am I right?

Thanks,
Zhipeng

From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Thursday, May 21, 2015 8:49 PM
To: Jiang, Zhipeng
Cc: user@spark.apache.org
Subject: Re: Question about Serialization in Storage Level

From the docs, https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence:

Storage Level - Meaning

MEMORY_ONLY
  Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK
  Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER
  Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer (https://spark.apache.org/docs/latest/tuning.html), but more CPU-intensive to read.

MEMORY_AND_DISK_SER
  Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

On Thu, May 21, 2015 at 3:52 AM, Jiang, Zhipeng <zhipeng.ji...@intel.com> wrote:

Hi there,

This question may seem kind of naïve, but what's the difference between MEMORY_AND_DISK and MEMORY_AND_DISK_SER? If I call rdd.persist(StorageLevel.MEMORY_AND_DISK), does that mean the BlockManager won't serialize the rdd?

Thanks,
Zhipeng
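
For reference, a minimal sketch (Scala, Spark 1.x-style API) showing the two storage levels side by side; the app name, the sample data, and the explicit spark.rdd.compress setting are illustrative assumptions, not taken from the original thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Illustrative configuration; spark.rdd.compress only affects
    // serialized blocks (including partitions spilled to disk).
    val conf = new SparkConf()
      .setAppName("StorageLevelExample")
      .set("spark.rdd.compress", "true")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 1000000)

    // MEMORY_AND_DISK: kept as deserialized Java objects while in memory;
    // partitions that don't fit are spilled to disk.
    val deser = data.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)

    // MEMORY_AND_DISK_SER: stored as serialized byte arrays (one per
    // partition) even in memory, trading CPU for a smaller footprint.
    val ser = data.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK_SER)

    deser.count()  // materialize and cache
    ser.count()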