+1, quite well explained. Liked it very much. Regards, Dev
On Mon, Apr 13, 2015 at 1:34 AM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> Hi,
>
> I will try to have a go at your points, but I am sure there are many experts around.
>
> As you may already know, in an RDBMS partitioning (conceptually dividing a very large table into sub-tables) is deployed to address three areas:
>
> 1. Availability -- each partition can reside on a different tablespace/device, so a problem with one tablespace/device takes out a slice of the table's data instead of the whole thing. This does not really apply to Hive, which has 3x block replication as standard.
>
> 2. Manageability -- partitioning provides a mechanism for splitting whole-table jobs into clear batches. Partition exchange can make it easier to bulk load data, and defragmenting, moving older partitions to lower-tier storage, updating stats and so on also become easier. Most of these benefits apply to Hive as well; please check the docs.
>
> 3. Performance -- partition elimination.
>
> In its simplest form (excluding composite partitioning), Hive partitioning is similar to "range partitioning" in an RDBMS. One can partition a table (say *partitioned_table* below, which is batch loaded from *non_partitioned_table*) by country, year, month and so on. Each partition is stored in Hive under its own sub-directory of the table directory, like this:
>
> /user/hive/warehouse/scratchpad.db/partitioned_table/country=Italy/year=2014/month=Feb
>
> Hive does not yet have the concept of local or global indexes. So without partitioning, a simple query in Hive has to read the entire table even if the WHERE clause filters down to a small result set. This becomes a bottleneck when running multiple MapReduce jobs over a large table. Partitioning localises the query by hitting only the relevant sub-directory or sub-directories. There is another important aspect in Hive as well: the locking granularity is determined by the lowest slice in the file system (the sub-directory). So entering data into the partition above takes an exclusive lock on that partition/file, but crucially the rest of the partitions remain available (assuming concurrency in Hive is enabled), as the lock listing below shows:
>
> +----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+
> | Lock ID  | Database    | Table                  | Partition                          | Lock State  | Lock Type    | Transaction ID  | Last Heartbeat  | Acquired At    | User    | Hostname  |
> +----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+
> | 1711     | scratchpad  | non_partitioned_table  | NULL                               | ACQUIRED    | SHARED_READ  | NULL            | 1428862154670   | 1428862151904  | hduser  | rhes564   |
> | 1711     | scratchpad  | partitioned_table      | country=Italy/year=2014/month=Feb  | ACQUIRED    | EXCLUSIVE    | NULL            | 1428862154670   | 1428862151905  | hduser  | rhes564   |
> +----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+
>
> Now to your point 2: bucketing in Hive refers to hash partitioning, where a hashing function is applied. Like an RDBMS, Hive applies a hashing algorithm to prevent data from clustering within specific partitions. Hashing is very effective if the column selected for bucketing has very high selectivity, such as an ID column where selectivity (*select count(distinct(column))/count(column)*) = 1. In that case the created partitions/files will be as evenly sized as possible. In a nutshell, bucketing is a method to get data evenly distributed over many partitions/files. One should define the number of buckets as a power of two (2^n: 2, 4, 8, 16 and so on) to achieve the best results. Again, bucketing helps concurrency in Hive. It may even allow a *partition-wise join*, i.e. a join between two tables that are bucketed on the same column with the same number of buckets (has anyone tried this?).
>
> One more thing: when you define the number of buckets at table creation time in Hive, the number of partitions/files is fixed. In contrast, with partitioning you do not have this limitation.
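>
> To make the partitioning example above concrete, here is a minimal HiveQL sketch. The partition columns (country, year, month) match the path shown earlier; the data columns (id, amount) and the ORC storage format are placeholders only, not taken from the real tables:
>
>     CREATE TABLE partitioned_table (
>       id      INT,
>       amount  DECIMAL(10,2)
>     )
>     PARTITIONED BY (country STRING, year INT, month STRING)
>     STORED AS ORC;
>
>     -- dynamic partitioning lets the INSERT create the sub-directories itself
>     SET hive.exec.dynamic.partition = true;
>     SET hive.exec.dynamic.partition.mode = nonstrict;
>
>     INSERT OVERWRITE TABLE partitioned_table PARTITION (country, year, month)
>     SELECT id, amount, country, year, month
>     FROM   non_partitioned_table;
>
>     -- only the country=Italy/year=2014/month=Feb sub-directory is read here
>     SELECT COUNT(*)
>     FROM   partitioned_table
>     WHERE  country = 'Italy' AND year = 2014 AND month = 'Feb';
>
> With the WHERE clause restricted to the partition columns, Hive scans only the matching sub-directories instead of the whole table, which is the partition elimination mentioned under point 3.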
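>
> For the bucketing side, a similar sketch, again with placeholder table and column names; the CLUSTERED BY clause is what fixes the number of buckets at table creation time:
>
>     CREATE TABLE bucketed_table (
>       id      INT,
>       amount  DECIMAL(10,2)
>     )
>     CLUSTERED BY (id) INTO 8 BUCKETS
>     STORED AS ORC;
>
>     -- on older Hive releases this must be set so inserts honour the bucket count
>     SET hive.enforce.bucketing = true;
>
>     INSERT OVERWRITE TABLE bucketed_table
>     SELECT id, amount
>     FROM   non_partitioned_table;
>
>     -- a join of two tables bucketed on the join key with the same number of
>     -- buckets can be optimised into a bucket map join
>     SET hive.optimize.bucketmapjoin = true;
>
> Here id plays the role of the high-selectivity column described above, and 8 is just one power-of-two choice; Hive hashes each id into one of the 8 bucket files.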
>
> HTH
>
> Mich
>
> *From:* Ashok Kumar [mailto:ashok34...@yahoo.com]
> *Sent:* 10 April 2015 17:46
> *To:* user@hive.apache.org
> *Subject:* partition and bucket
>
> Greetings all,
>
> Glad to join the user group. I am from a DBA background (Oracle/Sybase/MSSQL).
>
> I would like to understand partitioning and bucketing in Hive and the difference between them.
>
> I shall be grateful if someone explains where I should use a partition or a bucket for best practice.
>
> thanks

--
Devopam Mittra
Life and Relations are not binary