Take a look at UnsafeArrayData and UnsafeMapData.
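Those are the internal Catalyst classes that back ArrayType and MapType values at runtime, not public API. If I recall correctly, UnsafeMapData is essentially a pair of UnsafeArrayData instances (one for keys, one for values), which would line up with the "same size as two arrays" observation below. As a quick pointer (worth double-checking against your Spark version), they live roughly here:

    import org.apache.spark.sql.catalyst.expressions.{UnsafeArrayData, UnsafeMapData}
    import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}  // the abstract interfaces they implement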
On Sun, Sep 18, 2016 at 9:06 AM, assaf.mendelson <assaf.mendel...@rsa.com> wrote:
> Hi,
>
> I am trying to understand how Spark types are kept in memory and accessed.
> I tried to look at the code at the definition of MapType and ArrayType, for
> example, and I can't seem to find the relevant code for their actual
> implementation.
>
> I am trying to figure out how these two types are implemented to understand
> how they match my needs.
>
> In general, it appears the size of a map is the same as two arrays, which is
> about double the naïve array implementation: if I have 1000 rows, each with
> a map from 10K integers to 10K integers, I find through caching the
> dataframe that the total is ~150MB (the naïve implementation of two arrays
> would cost 1000*10000*(4+4) bytes, or a total of ~80MB). I see the same size
> if I use two arrays. Second, what would be the performance of updating the
> map/arrays, as they are immutable (i.e. some copying is required)?
>
> The reason I am asking is that I wanted to write an aggregate function which
> calculates a variation of a histogram.
> The most naïve solution would be to have a map from the bin to the count.
> But since we are talking about an immutable map, wouldn't that cost a lot
> more?
> A further optimization would be to use a mutable array where we combine the
> key and value into a single value (key and value are both int in my case).
> Assuming the maximum number of bins is small (e.g. fewer than 10), it is
> often cheaper to just search the array for the right key (and in this case
> the size of the data is expected to be significantly smaller than a map).
> In my case, most of the time (90%) there are fewer than 3 elements in the
> bin, and if I have more than 10 bins I basically combine bins to reduce the
> number.
>
> For few elements, a map becomes very inefficient: if I create 10M rows with
> one map from int to int each, I get an overall of ~380MB, meaning ~38 bytes
> per element (instead of just 8). For an array, again it is too large
> (~229MB, i.e. ~23 bytes per element).
>
> Is there a way to implement a simple mutable array type to use in the
> aggregation buffer? Where is the portion of the code that handles the
> actual type handling?
>
> Thanks,
> Assaf.
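For what it's worth, the "combine the key and value into a single value" idea from the question is easy to sketch with plain bit twiddling. This is a hypothetical helper (nothing of the sort exists in Spark) that could sit behind an ArrayType(LongType) aggregation buffer:

    // Hypothetical sketch: pack a (bin, count) pair of ints into one long so the
    // aggregation buffer can be a flat Array[Long] instead of a map.
    object PackedBins {
      def pack(bin: Int, count: Int): Long =
        (bin.toLong << 32) | (count.toLong & 0xFFFFFFFFL)

      def bin(packed: Long): Int   = (packed >>> 32).toInt
      def count(packed: Long): Int = packed.toInt            // low 32 bits

      // Linear search is fine for the small bin counts described above (< 10 bins).
      def increment(bins: Array[Long], b: Int): Array[Long] = {
        var i = 0
        while (i < bins.length) {
          if (bin(bins(i)) == b) {
            bins(i) = pack(b, count(bins(i)) + 1)
            return bins
          }
          i += 1
        }
        bins :+ pack(b, 1)                                   // new bin with count 1
      }
    }

In a UserDefinedAggregateFunction the buffer field could then be declared as ArrayType(LongType, containsNull = false), though as far as I know each update still copies the array when it is written back into the buffer row, so it does not remove the copying cost, only the per-entry map overhead.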