Re: RDD size in memory - Array[String] vs. case classes

Sean Owen Sat, 11 Oct 2014 02:22:46 -0700

Yes, of course. If your number is "123456", then this takes 4 bytes as
an int. But as a String in a 64-bit JVM you have an 8-byte reference,
4-byte object overhead, a char count of 4 bytes, and 6 2-byte chars.
Maybe more that I'm not thinking of.
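
To make the arithmetic concrete, here is a rough sketch in Java that
just adds up the figures quoted above. The class and method names are
made up for illustration, and the constants are the ballpark numbers
from this message, not measured values. Real JVM layouts depend on the
JVM version and settings, and the total is probably an underestimate,
since a String's characters live in a separate backing array that has
its own object header.

```java
// Back-of-the-envelope estimate only: the constants below are the rough
// figures from the message, not values measured on a real JVM.
final class StringVsInt {
    static final int REFERENCE_BYTES = 8;       // reference to the String (64-bit JVM)
    static final int OBJECT_OVERHEAD_BYTES = 4; // object overhead as quoted; assumed
    static final int CHAR_COUNT_BYTES = 4;      // stored character count
    static final int BYTES_PER_CHAR = 2;        // UTF-16 code unit

    // Rough size of a value held as a String, using the figures above.
    static int roughStringBytes(String s) {
        return REFERENCE_BYTES + OBJECT_OVERHEAD_BYTES + CHAR_COUNT_BYTES
                + BYTES_PER_CHAR * s.length();
    }

    // Size of the same value held as a primitive int.
    static int intBytes() {
        return Integer.BYTES;
    }

    public static void main(String[] args) {
        String value = "123456";
        System.out.println("as int:    " + intBytes() + " bytes");
        System.out.println("as String: " + roughStringBytes(value) + " bytes");
    }
}
```

With these figures, "123456" comes to 28 bytes as a String against 4
bytes as an int, so a record with many numeric fields can shrink a lot
when the fields are stored as numbers.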


On Sat, Oct 11, 2014 at 6:29 AM, Liam Clarke-Hutchinson wrote:
> Hi all,
>
> I'm playing with Spark currently as a possible solution at work, and I've
> been recently working out a rough correlation between our input data size
> and RAM needed to cache an RDD that will be used multiple times in a job.
>
> As part of this I've been trialling different methods of representing the
> data, and I came across a result that surprised me, so I just wanted to
> check what I was seeing.
>
> My data set consists of CSV records with approx. 17 fields. When I load my
> sample data set locally, and cache it after splitting on the comma as an
> RDD[Array[String]], the Spark UI shows 8% of the RDD can be cached in
> available RAM.
>
> When I cache it as an RDD of a case class, 11% of the RDD is cacheable, so
> case classes are actually taking up less serialized space than an array.
>
> Is it because the case class represents numbers as numbers, as opposed to the
> string array keeping them as strings?
>
> Cheers,
>
> Liam Clarke
