Hi Michael,
To add to Bejoy's answer:

As far as I understand, the hashing function used in bucketing is not 
persistently stored, so as per your example (SELECT * from T where user_id = 
X), given X, Hive couldn't figure out what bucket for table T should it be 
looking in.

As Bejoy pointed out, bucketing is used in sampling but another place it is 
used in joins (like a Bucketed map join). If the value X is coming from another 
table that is bucketed on the same column (user_id) and the number of buckets 
in that table and table T are the same or a multiple of another, the join would 
happen on corresponding buckets instead of the entire tables, considerably 
improving the performance of the query in some cases.

Not sure if that fits your bill.

Good luck!
Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgro...@oanda.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: "Bejoy Ks" <bejoy...@yahoo.com>
To: user@hive.apache.org
Sent: Friday, March 2, 2012 5:58:56 AM
Subject: Re: Where clause and bucketized table



Michael 
Unlike partitions, hive doesn't choose buckets by itself based on a where 
clause in HQL. You need to specify the number of buckets that the query should 
be executing on using TABLE SAMPLE clause . 
SELECT * from T TABLESAMPLE (m OUT OF n BUCKETS ON id) where user_id = X; 



https://cwiki.apache.org/Hive/languagemanual-sampling.html 



Buckets is mostly used for sampling and your query doesn't look like one, if 
you are looking for grouping of data into subsets and processing only the same 
intead of the whole data set then may be you should consider using Partitions. 



Regards 
Bejoy.K.S 





From: "mdefoinplatel....@orange.com" <mdefoinplatel....@orange.com> 
To: "user@hive.apache.org" <user@hive.apache.org> 
Sent: Friday, March 2, 2012 3:15 PM 
Subject: Where clause and bucketized table 





Hi folks, 

I have a table T bucketized on user_id and I am surprised to see that all the 
buckets are read during the execution of the following query: 

SELECT * from T where user_id = X 

What should I do to make sure hive will account for the bucket structure to run 
this query ? 

Cheers, 
Michael 
_________________________________________________________________________________________________________________________

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete 
altere, deforme ou falsifie. Merci

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorization.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, France Telecom - Orange shall not be liable if this 
message was modified, changed or falsified.
Thank you. 

Reply via email to