[ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13727102#comment-13727102 ]
Brock Noland commented on HIVE-4838:
------------------------------------

I guess we could go that route. My thought was that the memory consumption was monitored in order to be conservative? I've always wondered about this. I mean, if an admin marks mapred.child.java.opts and io.sort.mb as final on the cluster, the settings we see from the client's perspective could be completely different, so it's possible that it "works" locally but fails on the cluster.

Another question I had is that ORC has a memory manager that assumes it can use a certain percentage of RAM, which could conflict with our work here? That is, the ORC memory manager could use memory while creating the hash table that we won't use when reading the hash table?

Additionally, I thought it might make sense to store only offsets into a side file in the hash map to reduce memory consumption, and then put, say, a 25MB LRU cache in front of lookups into the file. Since the file is small, it should be in the OS buffer cache when not in the LRU cache.

Maybe we should take up memory management during map joins in another JIRA?

> Refactor MapJoin HashMap code to improve testability and readability
> --------------------------------------------------------------------
>
>                 Key: HIVE-4838
>                 URL: https://issues.apache.org/jira/browse/HIVE-4838
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Brock Noland
>            Assignee: Brock Noland
>         Attachments: HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch,
> HIVE-4838.patch, HIVE-4838.patch
>
>
> MapJoin is an essential component for high performance joins in Hive and the
> current code has done great service for many years. However, the code is
> showing its age and currently suffers from the following issues:
> * Uses static state via the MapJoinMetaData class to pass serialization
> metadata to the Key, Row classes.
> * The API of a logical "Table Container" is not defined and therefore it's
> unclear what APIs HashMapWrapper needs to publicize. Additionally
> HashMapWrapper has many used public methods.
> * HashMapWrapper contains logic to serialize, test memory bounds, and
> implement the table container. Ideally these logical units could be separated.
> * HashTableSinkObjectCtx has unused fields and unused methods.
> * CommonJoinOperator and children use ArrayList on the left-hand side when
> only List is required.
> * There are unused classes (MRU, DCLLItem) and classes which duplicate
> functionality (MapJoinSingleKey and MapJoinDoubleKeys).
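
For reference, the side-file idea floated in the comment (keep only offsets in the hash map, read rows from a file, and front the reads with a small LRU cache) could look roughly like the sketch below. This is only an illustration under stated assumptions, not code from any HIVE-4838 patch: the class name SideFileTable, the String keys, the length-prefixed row layout, and the diskReads counter are all hypothetical, and the cache limit is counted in row bytes only.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: the map holds file offsets only; an LRU cache fronts file reads. */
final class SideFileTable {
    private final RandomAccessFile file;
    private final Map<String, Long> offsets = new HashMap<>();
    // accessOrder = true: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, byte[]> cache = new LinkedHashMap<>(16, 0.75f, true);
    private final long cacheLimitBytes;
    private long cachedBytes = 0;
    long diskReads = 0; // instrumentation for the example only

    SideFileTable(long cacheLimitBytes) {
        RandomAccessFile raf;
        try {
            File f = Files.createTempFile("mapjoin-side", ".bin").toFile();
            f.deleteOnExit();
            raf = new RandomAccessFile(f, "rw");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.file = raf;
        this.cacheLimitBytes = cacheLimitBytes;
    }

    /** Appends the row to the side file and remembers only its offset. */
    void put(String key, byte[] row) {
        try {
            long offset = file.length();
            file.seek(offset);
            file.writeInt(row.length); // length prefix, then the row bytes
            file.write(row);
            offsets.put(key, offset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] stale = cache.remove(key); // drop any cached copy of an overwritten row
        if (stale != null) {
            cachedBytes -= stale.length;
        }
    }

    /** Returns the row for the key, or null if the key was never stored. */
    byte[] get(String key) {
        Long offset = offsets.get(key);
        if (offset == null) {
            return null;
        }
        byte[] cached = cache.get(key); // a hit also marks the entry as recently used
        if (cached != null) {
            return cached;
        }
        byte[] row;
        try {
            file.seek(offset);
            row = new byte[file.readInt()];
            file.readFully(row);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        diskReads++;
        cache.put(key, row);
        cachedBytes += row.length;
        // Evict least recently used rows until the cache fits its byte budget.
        Iterator<Map.Entry<String, byte[]>> it = cache.entrySet().iterator();
        while (cachedBytes > cacheLimitBytes && it.hasNext()) {
            cachedBytes -= it.next().getValue().length;
            it.remove();
        }
        return row;
    }

    void close() {
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the 25MB figure from the comment, this would be constructed as new SideFileTable(25L * 1024 * 1024). Rows that miss the LRU cache go through RandomAccessFile, so whether they are served from the OS buffer cache is up to the operating system and is not something the sketch controls.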