These two serve the same purpose and are logically very much alike. The difference is that partitioning may be explicit (partitioning, in pretty much all solid RDBMSs, and in Hive too) or implicit (hashing/bucketing, just Hive?). In Hive, for some reason, they come with different, mutually exclusive sets of optimisations.
Pruning is a good example: it is available for partitioned tables but not for bucketed ones. You can track the full list here:
https://issues.apache.org/jira/browse/HIVE-9523

Thank you,
Kind Regards
~Maciek

On 26 January 2016 at 21:44, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Hi,
>
> A number of questions have been brought up about Hive bucketing. As I
> see it, bucketing is another name for hash partitioning (assuming that
> Hive partitioning is effectively range partitioning). I borrow these
> terms (range and hash partitioning) from the industry standard, as they
> are commonly used among RDBMSs.
>
> Excuse my ignorance, but I am at a loss to know why hash partitioning
> is called bucketing in Hive. Someone may throw light on what the main
> differences are, if any.
>
> As I see it, in an RDBMS partitioning has these benefits:
>
> 1. Availability -- each partition can reside on a different
>    segment/device. Hence a problem with a device will take out a slice
>    of the table's data instead of the whole thing.
>
> 2. Manageability -- partitioning provides a mechanism for splitting
>    whole-table jobs into clear batches. Partition exchange can make it
>    easier to bulk load data. It also helps with getting rid of
>    fragmentation, moving older partitions to lower-tier storage,
>    updating stats etc.
>
> 3. Performance -- partition elimination.
>
> Hash partitioning is where a hashing function is applied. An RDBMS will
> apply a linear hashing algorithm f(x) like mod(x) *to prevent data from
> clustering within specific partitions*. Hashing is very effective if
> the column selected for partitioning has very high selectivity, like an
> ID column, where selectivity
> (*select count(distinct(column))/count(column)*) = 1. In this case, the
> created partitions will be as evenly sized as possible. In a nutshell,
> hash partitioning is a method to get data evenly distributed over many
> files. One should define the number of hash partitions as a power of
> two -- 2^n, like 2, 4, 8, 16 etc. -- to achieve best results.
> *I am pretty sure this definition applies to Hive bucketing, although
> the hashing is far simpler.*
>
> As for performance, physical co-location of records can speed up some
> queries: those which search for records by a defined range of keys.
> However, any queries which do not match the grain of the partitioning
> will not perform faster (and may even perform slower) than on a
> non-hash-partitioned (read: non-bucketed) table.
>
> *IMO, hash partitioning is unlikely to provide performance benefits,
> precisely because it shuffles the keys across the whole table. It will
> provide the availability and manageability benefits of partitioning.
> Unlike standard range partitioning, the number of buckets is fixed, so
> it does not fluctuate with the data. It may even allow a partition-wise
> join, i.e. a join between two tables that are hash partitioned
> (bucketed) on the same column with the same number of partitions
> (buckets), thus helping certain queries.*
>
> HTH
>
> Dr Mich Talebzadeh
>
> http://talebzadehmich.wordpress.com
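
To make the even-distribution and partition-wise join points from the quoted message concrete, here is a minimal Python sketch. It is an illustration only: the plain modulo in `bucket_of`, the helper names, and the sample `orders`/`customers` data are all assumptions made up for this example, and none of it is Hive's actual implementation.

```python
from collections import defaultdict

NUM_BUCKETS = 8  # a power of two, as the quoted message suggests


def bucket_of(key: int, num_buckets: int) -> int:
    """Hypothetical bucket function: plain modulo on an integer key."""
    return key % num_buckets


def bucketize(rows, num_buckets):
    """Distribute (key, value) rows over num_buckets lists by hashed key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucket_wise_join(left_buckets, right_buckets):
    """Join two bucketed tables, comparing only buckets with the same number."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both sides need the same number of buckets")
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        # Build a lookup for one bucket only, never for the whole table.
        lookup = defaultdict(list)
        for key, value in right:
            lookup[key].append(value)
        for key, value in left:
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined


# Made-up sample data: a unique ID column, i.e. selectivity = 1.
orders = [(i, f"order-{i}") for i in range(1000)]
customers = [(i, f"customer-{i}") for i in range(400)]

order_buckets = bucketize(orders, NUM_BUCKETS)
customer_buckets = bucketize(customers, NUM_BUCKETS)

bucket_sizes = [len(b) for b in order_buckets]
joined = bucket_wise_join(order_buckets, customer_buckets)

print(bucket_sizes)  # every bucket holds the same number of rows
print(len(joined))   # one joined row per customer key
```

Because both sides are bucketed on the same key with the same bucket count, a key can only ever match inside the bucket with the same number, so each bucket pair can be joined independently of the others.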