Yes, of course. If your number is "123456", then this takes 4 bytes as an int. But as a String in a 64-bit JVM you have an 8-byte reference, 4-byte object overhead, a char count of 4 bytes, and six 2-byte chars. Maybe more that I'm not thinking of.
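To make the arithmetic concrete, here is a rough sketch in Java that just adds up the figures from the estimate above. The class name, constants, and helper method are made up for illustration, and the byte counts are the back-of-the-envelope numbers from this message, not measurements. Real JVM layouts vary by version and flags, and actual per-object overhead is usually larger, so treat the result as a lower bound.

```java
public class StringVsIntFootprint {
    // Assumed figures, taken from the rough estimate in this message.
    // They are illustrative only; a real JVM will typically use more.
    static final int REFERENCE_BYTES = 8;       // reference to the String on a 64-bit JVM
    static final int OBJECT_OVERHEAD_BYTES = 4; // per-object overhead as estimated above
    static final int CHAR_COUNT_BYTES = 4;      // stored character count
    static final int BYTES_PER_CHAR = 2;        // UTF-16 code unit
    static final int INT_BYTES = Integer.BYTES; // a primitive int is always 4 bytes

    // Rough footprint of a numeric field kept as a String, under the assumptions above.
    static int estimateStringBytes(String s) {
        return REFERENCE_BYTES
                + OBJECT_OVERHEAD_BYTES
                + CHAR_COUNT_BYTES
                + BYTES_PER_CHAR * s.length();
    }

    public static void main(String[] args) {
        String field = "123456";
        int parsed = Integer.parseInt(field); // the same value held as a primitive int

        System.out.println("value:      " + parsed);
        System.out.println("as String: ~" + estimateStringBytes(field) + " bytes");
        System.out.println("as int:     " + INT_BYTES + " bytes");
    }
}
```

With these assumed figures, the six-digit field comes to roughly 28 bytes as a String against 4 bytes as an int. Multiplied across approximately 17 fields per row, that is consistent with the case class caching more compactly than the string array.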

On Sat, Oct 11, 2014 at 6:29 AM, Liam Clarke-Hutchinson <[email protected]> wrote:
> Hi all,
>
> I'm playing with Spark currently as a possible solution at work, and I've
> been recently working out a rough correlation between our input data size
> and RAM needed to cache an RDD that will be used multiple times in a job.
>
> As part of this I've been trialling different methods of representing the
> data, and I came across a result that surprised me, so I just wanted to
> check what I was seeing.
>
> So my data set is comprised of CSV with appx. 17 fields. When I load my
> sample data set locally, and cache it after splitting on the comma as an
> RDD[Array[String]], the Spark UI shows 8% of the RDD can be cached in
> available RAM.
>
> When I cache it as an RDD of a case class, 11% of the RDD is cacheable, so
> case classes are actually taking up less serialized space than an array.
>
> Is it because case class represents numbers as numbers, as opposed to the
> string array keeping them as strings?
>
> Cheers,
>
> Liam Clarke
