[ 
https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13727102#comment-13727102
 ] 

Brock Noland commented on HIVE-4838:
------------------------------------

I guess we could go that route. My understanding was that memory consumption 
is monitored in order to be conservative, but I've always wondered about this. 
If an admin marks mapred.child.java.opts and io.sort.mb as final on the 
cluster, the settings we use on the client side could be completely different 
from the ones in effect on the cluster, so it's possible that it "works" 
locally but fails on the cluster. Another question I had is that ORC has a 
memory manager that assumes it can use a certain percentage of RAM. Could that 
conflict with our work here? That is, the ORC memory manager could use memory 
while the hash table is being created that we won't use when reading the hash 
table?

Additionally, I thought it might make sense to store only offsets into a side 
file in the hash map, to reduce memory consumption, and then put, say, a 25 MB 
LRU cache in front of lookups into the file. Since the file is small, it 
should be in the OS buffer cache whenever a row is not in the LRU cache.
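
To make the side-file idea a bit more concrete, here is a rough sketch. It is 
not code from the patch: SideFileTable, its methods, the String keys, the 
single row per key, and the length-prefixed record layout are all assumptions 
made up for illustration. The hash map holds only a long offset per key, and a 
byte-bounded, access-ordered LinkedHashMap acts as the LRU cache in front of 
the file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only; none of these names exist in Hive.
class SideFileTable implements AutoCloseable {
  private final RandomAccessFile sideFile;
  private final long maxCacheBytes;
  // key -> offset of the row's length-prefixed record in the side file
  private final Map<String, Long> offsets = new HashMap<>();
  // accessOrder=true: iteration runs from least to most recently used
  private final LinkedHashMap<String, byte[]> cache =
      new LinkedHashMap<>(16, 0.75f, true);
  private long cachedBytes = 0;
  private long fileReads = 0;

  SideFileTable(String path, long maxCacheBytes) {
    try {
      this.sideFile = new RandomAccessFile(path, "rw");
      this.sideFile.setLength(0); // start from an empty side file
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    this.maxCacheBytes = maxCacheBytes;
  }

  // Appends the row to the side file and keeps only its offset on the heap.
  void put(String key, byte[] row) {
    try {
      long offset = sideFile.length();
      sideFile.seek(offset);
      sideFile.writeInt(row.length);
      sideFile.write(row);
      offsets.put(key, offset);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // Drop any cached copy of an older row stored under the same key.
    byte[] stale = cache.remove(key);
    if (stale != null) {
      cachedBytes -= stale.length;
    }
  }

  // Returns the row for key, or null if the key was never stored.
  byte[] get(String key) {
    byte[] row = cache.get(key); // a hit also marks the entry as recently used
    if (row != null) {
      return row;
    }
    Long offset = offsets.get(key);
    if (offset == null) {
      return null;
    }
    try {
      sideFile.seek(offset);
      row = new byte[sideFile.readInt()];
      sideFile.readFully(row);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    fileReads++;
    cache.put(key, row);
    cachedBytes += row.length;
    // Evict least recently used rows until the cache fits its byte budget.
    Iterator<byte[]> lru = cache.values().iterator();
    while (cachedBytes > maxCacheBytes && lru.hasNext()) {
      cachedBytes -= lru.next().length;
      lru.remove();
    }
    return row;
  }

  // Number of lookups that had to go to the side file.
  long fileReads() {
    return fileReads;
  }

  @Override
  public void close() {
    try {
      sideFile.close();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With a budget of 25L * 1024 * 1024 this would match the 25 MB figure above. A 
real table container would need multiple rows per key and proper key/row 
serialization, which the sketch leaves out.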

Maybe we should take up memory management during map joins in another JIRA?
                
> Refactor MapJoin HashMap code to improve testability and readability
> --------------------------------------------------------------------
>
>                 Key: HIVE-4838
>                 URL: https://issues.apache.org/jira/browse/HIVE-4838
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Brock Noland
>            Assignee: Brock Noland
>         Attachments: HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, 
> HIVE-4838.patch, HIVE-4838.patch
>
>
> MapJoin is an essential component for high performance joins in Hive and the 
> current code has done great service for many years. However, the code is 
> showing its age and currently suffers from the following issues:
> * Uses static state via the MapJoinMetaData class to pass serialization 
> metadata to the Key and Row classes.
> * The API of a logical "Table Container" is not defined and therefore it's 
> unclear what APIs HashMapWrapper 
> needs to publicize. Additionally HashMapWrapper has many unused public methods.
> * HashMapWrapper contains logic to serialize, test memory bounds, and 
> implement the table container. Ideally these logical units could be separated
> * HashTableSinkObjectCtx has unused fields and unused methods
> * CommonJoinOperator and children use ArrayList on left hand side when only 
> List is required
> * There are unused classes (MRU, DCLLItem) and classes which duplicate 
> functionality (MapJoinSingleKey and MapJoinDoubleKeys)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
