[ https://issues.apache.org/jira/browse/HIVE-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907758#comment-13907758 ]
Sergey Shelukhin commented on HIVE-6430:
----------------------------------------

Here's the summary of the overhead per entry /after both of the above patches go in/ (before them, the overhead in the key and value is significantly bigger).

HashTable:
* Entry array: 8+ bytes
* Entry: 32 bytes
* Key and value objects: 32 bytes

Key:
* Byte array object + length: 20 bytes
* Field count and null mask: 1 byte
* Rounding to 8 bytes: 0-7 bytes

Row:
* Fields: 8 bytes
* Object array object + length: 24 bytes
* Per-column writable object: 16 bytes (assuming all the other fields in the writables are useful data)

"Guaranteed" overhead per entry: 125 bytes, plus writables for the row values and padding on the key.
Example, double key, row with one field: additional 21 bytes per entry, ~146 total.
Example, int key, row with 5 fields: additional 87 bytes per entry, ~212 total.
Plus some overhead depending on HashMap fullness.

So that's a lot of overhead (it depends on the data, of course - if the row contains cat photos in binary then 150-200 bytes is not much). The general approach to getting rid of the per-entry overhead is a hashtable implemented on top of an array, with open addressing, storing the actual variable-length keys and rows in big flat array(s) of byte[]-s or objects. That would get rid of the key and row object overhead, most of the hashmap overhead, most of the key overhead, and most/some (see below) of the row overhead. The good thing about this table is that it's read-only after initial creation and we never delete, so we don't have to worry about many scenarios.

*Details (scroll down for estimates)*

Simple case, assuming we can convert both key and row into bytes: allocate largish fixed-size byte arrays to have an effectively infinite write buffer (or the array can be reallocated if needed, or a combination). Have a flat, custom-made hash table similar to the HPPC one that stores offsets into that buffer in its key array (of longs) and has no value or state arrays. Some additional stuff, for example lengths or null bitmasks, can also be fit into the key array values.

When loading, incoming writables would write the keys and values into the write buffer. We know the schema, so we don't have to worry about storing types, field offsets, etc. Then write a fixed-size tail with e.g. the length of the key and value, to know what to compare and where the value starts. Because there's no requirement to allocate a fixed number of bytes like there is now, a variable-length format can be used if needed to save space... but it shouldn't be too complicated. Probably shouldn't use ORC there :) Then the key array uses a standard hashtable put to store the offset of that tail.

When getting, the key can still be compared the same way as now, as a byte array, with one extra "dereference" from the key array to get to the actual key by index. For values, writables will have to be re-created when the row is requested, because everything depends on writables now. The writables will trivially read from the byte array at the stored offset. Obviously this has a performance cost. Note that this is not like the current lazy deserialization:
1) We do not deserialize on demand - the final writables are just written to/read from the byte array, so creating them should be cheaper than deserializing.
2) Writables are not preserved for future use and are created every time the row is accessed, which has a perf cost but saves memory.

Total overhead per entry would be around 14-16 bytes, plus some fixed or semi-fixed overhead depending on the write buffer allocation scheme. In the above examples the overhead would go from 146 and 212 bytes to 16 and 16.
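To make the simple case concrete, here's a minimal sketch of that kind of table. This is not Hive code - the class and method names are made up, it assumes unique keys, a single reallocating buffer, and that an upper bound on the entry count is known up front; the real MapJoin table would also need to chain multiple rows per key and pick a smarter buffer scheme.

{code:java}
import java.util.Arrays;

// Hypothetical sketch, not Hive code: a flat, open-addressing table whose only per-entry
// state is one long in refs[] -- the offset of a record's fixed-size tail inside a shared
// byte[] write buffer.  Record layout: [key bytes][value bytes][keyLen int][valueLen int].
// The table is write-once and never deletes, so linear probing needs no tombstones.
public class FlatJoinTableSketch {
  private static final long EMPTY = -1L;

  private final long[] refs;   // slot -> tail offset into buffer, or EMPTY
  private byte[] buffer;       // shared write buffer for all serialized keys and values
  private int writePos;

  public FlatJoinTableSketch(int maxEntries) {
    // Power-of-two capacity at roughly 50% max load so probe chains stay short.
    int cap = Integer.highestOneBit(Math.max(4, maxEntries) * 2 - 1) << 1;
    refs = new long[cap];
    Arrays.fill(refs, EMPTY);
    buffer = new byte[1024];
  }

  // Loading path: append key and value bytes plus the fixed-size tail, then index the key.
  public void put(byte[] key, byte[] value) {
    ensure(key.length + value.length + 8);
    System.arraycopy(key, 0, buffer, writePos, key.length);
    writePos += key.length;
    System.arraycopy(value, 0, buffer, writePos, value.length);
    writePos += value.length;
    long tail = writePos;
    writeInt(key.length);    // tail: key length
    writeInt(value.length);  // tail: value length

    int slot = hash(key) & (refs.length - 1);
    while (refs[slot] != EMPTY) {
      slot = (slot + 1) & (refs.length - 1);   // linear probing; never deletes
    }
    refs[slot] = tail;
  }

  // Probe path: compare probe key bytes against stored key bytes; returns tail offset or -1.
  public long find(byte[] key) {
    int slot = hash(key) & (refs.length - 1);
    while (refs[slot] != EMPTY) {
      if (keyEquals(refs[slot], key)) {
        return refs[slot];
      }
      slot = (slot + 1) & (refs.length - 1);
    }
    return -1L;
  }

  // Where the value bytes live for a record found by find(); writables would be
  // re-created by reading the buffer from this offset when the row is requested.
  public int valueOffset(long tail) { return (int) tail - readInt((int) tail + 4); }
  public int valueLength(long tail) { return readInt((int) tail + 4); }

  private boolean keyEquals(long tail, byte[] probe) {
    int keyLen = readInt((int) tail);
    if (keyLen != probe.length) {
      return false;
    }
    int keyOffset = (int) tail - readInt((int) tail + 4) - keyLen;
    for (int i = 0; i < keyLen; i++) {
      if (buffer[keyOffset + i] != probe[i]) {
        return false;
      }
    }
    return true;
  }

  private int hash(byte[] key) {
    int h = 1;
    for (byte b : key) {
      h = 31 * h + b;        // any reasonable byte-array hash would do here
    }
    return h & 0x7fffffff;
  }

  private void ensure(int extra) {
    if (writePos + extra > buffer.length) {
      buffer = Arrays.copyOf(buffer, Math.max(buffer.length * 2, writePos + extra));
    }
  }

  private void writeInt(int v) {
    buffer[writePos++] = (byte) (v >>> 24);
    buffer[writePos++] = (byte) (v >>> 16);
    buffer[writePos++] = (byte) (v >>> 8);
    buffer[writePos++] = (byte) v;
  }

  private int readInt(int pos) {
    return ((buffer[pos] & 0xff) << 24) | ((buffer[pos + 1] & 0xff) << 16)
        | ((buffer[pos + 2] & 0xff) << 8) | (buffer[pos + 3] & 0xff);
  }
}
{code}

The per-entry cost in a structure like this is just the long in refs[] plus the 8-byte tail, which is where the 14-16 byte estimate above comes from.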
Another alternative is similar, but with only the keys in a byte array, and the values in a separate large Object array operating on the same principles, as writables in all their glory. The key array can store indices and lengths for both, probably 2-3 longs per entry depending on what limitations we can accept. So the total overhead would be around 16-24 bytes + 16 per field in the row, but the writables wouldn't need to be re-created. In the above examples the overhead would go from 146 and 212 bytes to 32 and 96.

*Tl;dr and estimates*

The bad thing, obviously, is that without key and row objects all the interfaces around them would cease to exist. This is especially bad for MR due to the convoluted HashTable path with write and read, so in the first cut I think we should go Tez-only and preserve the legacy path with objects for MR. There are several good things...
* We can essentially copy-paste the HPPC long-long hashmap. It probably doesn't fit by itself and we don't need all the features, but it must be simple to convert to the above. So we don't need to code up the open-addressing hashmap ourselves.
* W.r.t. the interface differences, I looked at the divergent paths; the Tez HT loader obviously would be able to do whatever is needed. MapJoinOperator is the only place where there will be problems - it currently creates the key and then calls get(key). Get can be changed to take the row instead, so that it would create the key for the get as necessary.
* Code for byte key creation, comparison, validation, etc., and some other code from the above two patches can be reused; plus I know all I need to know about what needs to be done with writables and the bytes for them.

> MapJoin hash table has large memory overhead
> --------------------------------------------
>
>                 Key: HIVE-6430
>                 URL: https://issues.apache.org/jira/browse/HIVE-6430
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>
> Right now, in some queries, I see that storing e.g. 4 ints (2 for key and 2
> for row) can take several hundred bytes, which is ridiculous. I am reducing
> the size of MJKey and MJRowContainer in other jiras, but in general we don't
> need to have a Java hash table there. We can either use a primitive-friendly
> hashtable like the one from HPPC (Apache-licensed), or some variation, to map
> primitive keys to a single row storage structure without an object per row
> (similar to vectorization).