Optimize JDBM to make mapjoin faster
------------------------------------

                 Key: HIVE-1700
                 URL: https://issues.apache.org/jira/browse/HIVE-1700
             Project: Hadoop Hive
          Issue Type: Improvement
            Reporter: He Yongqiang


copied from email:


From: Joydeep Sen Sarma
Sent: Tuesday, October 12, 2010 11:11 AM
To: Yongqiang He; Liyin Tang; Namit Jain
Subject: RE: Optimize jdbm

Seems like we should move all deserialization to Hive land. JDBM should just 
work on byte arrays for both keys and values. (Since the output of the 
serializer used by Hive is byte-comparable, that seems to suffice.)
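
A minimal sketch of what working on raw bytes could look like, assuming the serialized keys really are byte-comparable: the store would only need an unsigned lexicographic comparison and would never deserialize a key. The class ByteKeys and its methods are hypothetical names for illustration, not JDBM or Hive APIs.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of JDBM or Hive.
final class ByteKeys {
    private ByteKeys() {}

    // Unsigned lexicographic comparison of two serialized keys.
    // Returns -1, 0, or 1. A shorter key that is a prefix of a longer
    // key sorts first.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff; // treat bytes as unsigned
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Equality check for hash-style lookups, again without deserializing.
    static boolean matches(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }
}
```

The masking with 0xff matters because Java bytes are signed; without it, a key byte of 0x80 would sort before 0x7f.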
________________________________________
From: Yongqiang He
Sent: Tuesday, October 12, 2010 10:22 AM
To: Liyin Tang; Namit Jain
Cc: Joydeep Sen Sarma
Subject: Optimize jdbm

  1.  HTree.get() accounts for 70% of total time. A bloom filter would help a 
lot here by avoiding unneeded get() calls when we know for sure the given key is 
not in JDBM. (We can generate the bloom filter when doing the JDBM sink, and 
read it into memory when doing the read.)
  2.  HTree.get() deserializes both key and value for each entry until it finds 
a matching key. We could deserialize only the key, and defer deserializing the 
value until the key matches.
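
The two ideas above can be sketched together as follows. This is an illustration under stated assumptions, not JDBM code: FilteredByteStore, its fields, and the flat entry list standing in for an HTree bucket are all hypothetical. The bloom filter is filled on put() (the "sink" side) and consulted first on get(); a value is deserialized only after its key bytes match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; none of these names are JDBM or Hive APIs.
final class FilteredByteStore<V> {
    private static final int NUM_HASHES = 3;

    private final int numBits;
    private final BitSet filter;
    // Each entry is {serializedKey, serializedValue}; stands in for a bucket.
    private final List<byte[][]> entries = new ArrayList<>();
    private final Function<byte[], V> valueDeserializer;

    int scans = 0;              // lookups that got past the bloom filter
    int deserializedValues = 0; // values actually deserialized

    FilteredByteStore(int numBits, Function<byte[], V> valueDeserializer) {
        this.numBits = numBits;
        this.filter = new BitSet(numBits);
        this.valueDeserializer = valueDeserializer;
    }

    // Sink side: store the raw bytes and record the key in the filter.
    void put(byte[] key, byte[] value) {
        for (int i = 0; i < NUM_HASHES; i++) {
            filter.set(bitIndex(key, i));
        }
        entries.add(new byte[][] {key.clone(), value.clone()});
    }

    // Read side: returns null if the key is absent.
    V get(byte[] key) {
        for (int i = 0; i < NUM_HASHES; i++) {
            if (!filter.get(bitIndex(key, i))) {
                return null; // definitely absent, skip the scan entirely
            }
        }
        scans++;
        for (byte[][] e : entries) {
            if (Arrays.equals(e[0], key)) { // compare key bytes only
                deserializedValues++;
                return valueDeserializer.apply(e[1]);
            }
        }
        return null; // bloom filter false positive
    }

    // Double hashing: derive the i-th bit index from two base hashes.
    private int bitIndex(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = fnv1a(key);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // 32-bit FNV-1a hash, used as the second base hash.
    private static int fnv1a(byte[] data) {
        int h = 0x811c9dc5;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }
}
```

A bloom filter can report false positives but never false negatives, so a miss in the filter is safe to trust; the filter size and hash count here are arbitrary and would need tuning against the real key count.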

Any others?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
