One thing to remember is that Strings are composed of chars in Java,
which take 2 bytes each. The encoding of the text on disk on S3 is
probably something like UTF-8, which takes much closer to 1 byte per
character for English text. This might explain the factor of ~2
difference.
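
As a rough illustration of that factor of ~2, here is a minimal sketch comparing the size of the same text as 2-byte chars and as UTF-8 bytes. The object and method names are made up for this example. It ignores per-object overhead and assumes a JVM that stores String data as 2-byte chars, so treat the numbers as a ballpark only.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper, only for illustrating the size difference.
object StringSizeSketch {
  // Character data of a String held as 2-byte (UTF-16) chars.
  def utf16Bytes(s: String): Int = s.length * 2

  // The same text encoded as UTF-8, as it might be stored on disk.
  def utf8Bytes(s: String): Int = s.getBytes(StandardCharsets.UTF_8).length

  def main(args: Array[String]): Unit = {
    val sample = "hello spark" // plain ASCII, so 1 byte per character in UTF-8
    println(s"as chars in memory: ${utf16Bytes(sample)} bytes")
    println(s"as UTF-8 on disk:   ${utf8Bytes(sample)} bytes")
  }
}
```

For ASCII-only text the first number is exactly twice the second; text with many non-ASCII characters would narrow the gap.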
On Wed, Oct 22, 2
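You can enable RDD compression (spark.rdd.compress), and you can also
persist with MEMORY_ONLY_SER, e.g.
sc.sequenceFile[String,String]("s3n://somebucket/part-0").persist(StorageLevel.MEMORY_ONLY_SER),
to reduce the size of the RDD in memory. Note that spark.rdd.compress
only applies to serialized partitions, so it has an effect with
MEMORY_ONLY_SER but not with plain MEMORY_ONLY.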
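
Putting the two suggestions together, a minimal sketch might look like the following. The app name and object name are placeholders, the path is the one from the example above, and it assumes the job is launched with spark-submit so the master is supplied externally. How much memory this actually saves depends on the data and the serializer, so it is worth checking the Storage tab in the web UI.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Brings the implicit Writable converters into scope on older Spark 1.x
// releases; believed to be harmless on newer ones.
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel

object CompressedCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("compressed-cache-sketch") // placeholder name
      .set("spark.rdd.compress", "true")     // compress serialized partitions
    val sc = new SparkContext(conf)

    // Cache the records as serialized bytes instead of Java objects.
    val rdd = sc
      .sequenceFile[String, String]("s3n://somebucket/part-0")
      .persist(StorageLevel.MEMORY_ONLY_SER)

    // Any action materializes the cache; count() is just a convenient one.
    println(rdd.count())
    sc.stop()
  }
}
```

The trade-off is extra CPU time to deserialize (and decompress) partitions each time they are read.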
Thanks
Best Regards
On Wed, Oct 22, 2014 at 7:51 PM, Darin McBeath