I read the Spark code a little bit, trying to understand my own question.
It looks like the difference is really between
org.apache.spark.serializer.JavaSerializer and
org.apache.spark.serializer.KryoSerializer, both of which have a method
named writeObject.
In my test case, each line of my text file is about 140 bytes of String.
When I call either JavaSerializer.writeObject(140 bytes of String) or
KryoSerializer.writeObject(140 bytes of String), I don't see a difference
in the underlying OutputStream space usage.
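Roughly, the check I ran looks like this (a from-memory sketch, one layer
below Spark's wrapper classes, against the kryo 2.x API that Spark bundles;
the 140-byte String is just a stand-in for one line of my file):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

val s = "x" * 140  // stand-in for one ~140-byte line

// Plain Java serialization: writeObject into a byte buffer, measure it
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(s)
oos.close()
println(s"Java: ${bos.size()} bytes")

// Kryo: write the same String, measure the bytes produced
val kryo = new Kryo()
val out = new Output(new ByteArrayOutputStream())
kryo.writeClassAndObject(out, s)
out.flush()
println(s"Kryo: ${out.total()} bytes")

Both come out close to the raw 140 bytes, since a String is basically its
UTF-8 bytes plus a small header under either serializer.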
Does this mean that KryoSerializer really doesn't give us any benefit for
the String type? I understand that for primitive types it shouldn't have
any benefit, but how about String?
When we talk about lowering memory usage with KryoSerializer in Spark, in
what cases can it bring significant benefits? This is my first experience
with KryoSerializer, so maybe I am totally wrong about its usage.
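For example, my naive guess is that the savings show up for custom classes,
where Java serialization writes a full class descriptor ahead of the data
while a registered Kryo class is only a small integer id. A sketch with the
same harness as above (again, just my assumption, with a made-up LogLine
class):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

// A made-up record type, just for comparing the two formats
case class LogLine(ts: Long, level: Int, msg: String)

val rec = LogLine(1426608095000L, 2, "x" * 100)

// Java serialization: the full class descriptor (class name, field
// names/types) is written ahead of the field values
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(rec)
oos.close()
println(s"Java: ${bos.size()} bytes")

// Kryo: after registration the class is identified by a small int id,
// and only the field values follow
val kryo = new Kryo()
kryo.register(classOf[LogLine])
val out = new Output(new ByteArrayOutputStream())
kryo.writeClassAndObject(out, rec)
out.flush()
println(s"Kryo: ${out.total()} bytes")

If that guess is right, a file of plain text lines cached as Strings would
be close to the worst case for showing Kryo's benefit.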
Thanks
Yong
From: [email protected]
To: [email protected]
Subject: Why I didn't see the benefits of using KryoSerializer
Date: Tue, 17 Mar 2015 12:01:35 -0400
Hi, I am new to Spark. I am trying to understand the memory benefits of
using KryoSerializer.
I have a one-box standalone test environment with 24 cores and 24G of
memory. I installed Hadoop 2.2 plus Spark 1.2.0.
I put one text file of about 1.2G in HDFS. Here are the settings in
spark-env.sh:

export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"
export SPARK_WORKER_MEMORY=32g
export SPARK_DRIVER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=4g
First test case:

val log = sc.textFile("hdfs://namenode:9000/test_1g/")
log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
log.count()
log.count()
The data is about 3M rows. For this first test case, in the "Storage" page
of the web UI I can see "Size in Memory" is 1787M and "Fraction Cached" is
70%, with 7 cached partitions. This matches what I expected; the first
count finished in about 17s and the second count in about 6s.
Second test case, after restarting the spark-shell:

val log = sc.textFile("hdfs://namenode:9000/test_1g/")
log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
log.count()
log.count()
Now from the web UI I can see "Size in Memory" is 1231M and "Fraction
Cached" is 100%, with 10 cached partitions. It looks like caching in the
default Java-serialized format reduces the memory usage, but at a cost:
the first count finished in around 39s and the second count in around 9s.
So the job runs slower, with less memory usage.
So far I can understand everything that happened, and the tradeoff.
Now the problem comes when I try to test with KryoSerializer:

SPARK_JAVA_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" /opt/spark/bin/spark-shell
val log = sc.textFile("hdfs://namenode:9000/test_1g/")
log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
log.count()
log.count()
First, I saw that the new serializer setting was passed in: the Spark
Properties section of the "Environment" page shows

spark.driver.extraJavaOptions  -Dspark.serializer=org.apache.spark.serializer.KryoSerializer

which is not there for the first 2 test cases. But in the "Storage" page
of the web UI, the "Size in Memory" is 1234M, with 100% "Fraction Cached"
and 10 cached partitions. The first count took 46s and the second count
took 23s.
I don't get the much smaller memory size I expected, and I get a longer
run time for both counts. Did I do anything wrong? Why is the memory
footprint of "MEMORY_ONLY_SER" with KryoSerializer still the same size as
with the default Java serializer, with worse duration?
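In case it is just my way of passing the setting: if I read the docs
right, the same property could also go in conf/spark-defaults.conf, or,
in a standalone app, be set on a SparkConf before the context is created,
e.g.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-test")  // app name is just for this example
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

Would either of those behave differently from SPARK_JAVA_OPTS here?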
Thanks
Yong