Hi Michael, To add to Bejoy's answer: As far as I understand, the hashing function used in bucketing is not persistently stored, so as per your example (SELECT * from T where user_id = X), given X, Hive couldn't figure out what bucket for table T should it be looking in.
As Bejoy pointed out, bucketing is used in sampling but another place it is used in joins (like a Bucketed map join). If the value X is coming from another table that is bucketed on the same column (user_id) and the number of buckets in that table and table T are the same or a multiple of another, the join would happen on corresponding buckets instead of the entire tables, considerably improving the performance of the query in some cases. Not sure if that fits your bill. Good luck! Mark Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: mgro...@oanda.com "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. ----- Original Message ----- From: "Bejoy Ks" <bejoy...@yahoo.com> To: user@hive.apache.org Sent: Friday, March 2, 2012 5:58:56 AM Subject: Re: Where clause and bucketized table Michael Unlike partitions, hive doesn't choose buckets by itself based on a where clause in HQL. You need to specify the number of buckets that the query should be executing on using TABLE SAMPLE clause . SELECT * from T TABLESAMPLE (m OUT OF n BUCKETS ON id) where user_id = X; https://cwiki.apache.org/Hive/languagemanual-sampling.html Buckets is mostly used for sampling and your query doesn't look like one, if you are looking for grouping of data into subsets and processing only the same intead of the whole data set then may be you should consider using Partitions. Regards Bejoy.K.S From: "mdefoinplatel....@orange.com" <mdefoinplatel....@orange.com> To: "user@hive.apache.org" <user@hive.apache.org> Sent: Friday, March 2, 2012 3:15 PM Subject: Where clause and bucketized table Hi folks, I have a table T bucketized on user_id and I am surprised to see that all the buckets are read during the execution of the following query: SELECT * from T where user_id = X What should I do to make sure hive will account for the bucket structure to run this query ? Cheers, Michael _________________________________________________________________________________________________________________________ Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees et ne doivent donc pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez le signaler a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant susceptibles d'alteration, France Telecom - Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci This message and its attachments may contain confidential or privileged information that may be protected by law; they should not be distributed, used or copied without authorization. If you have received this email in error, please notify the sender and delete this message and its attachments. As emails may be altered, France Telecom - Orange shall not be liable if this message was modified, changed or falsified. Thank you.