+1, quite well explained. Liked it very much. Regards, Dev
On Mon, Apr 13, 2015 at 1:34 AM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> Hi,
>
> I will try to have a go at your points, but I am sure there are many experts around.
>
> As you may already know, in an RDBMS partitioning (conceptually dividing a very large table into sub-tables) is deployed to address three areas:
>
> 1. Availability -- each partition can reside on a different tablespace/device, so a problem with one tablespace/device takes out a slice of the table's data instead of the whole thing. This does not really apply to Hive, which has 3x block replication as standard.
>
> 2. Manageability -- partitioning provides a mechanism for splitting whole-table jobs into clear batches. Partition exchange can make it easier to bulk load data, and defragmenting, moving older partitions to lower-tier storage, updating stats and so on also become easier. Most of these benefits apply to Hive as well; please check the docs.
>
> 3. Performance -- partition elimination.
>
> In its simplest form (excluding composite partitioning), Hive partitioning is similar to "range partitioning" in an RDBMS. One can partition a table (say *partitioned_table* below, which is batch loaded from *non_partitioned_table*) by country, year, month and so on. Each partition is stored in Hive under its own sub-directory of the table directory, like this:
>
> /user/hive/warehouse/scratchpad.db/partitioned_table/country=Italy/year=2014/month=Feb
>
> Hive does not yet have the concept of local or global indexes. So without partitioning, a simple query in Hive has to read the entire table even if the WHERE clause filters down to a small result set. This becomes a bottleneck when running multiple MapReduce jobs over a large table. Partitioning localises the query by hitting only the relevant sub-directory or sub-directories. There is another important aspect in Hive as well: the locking granularity is determined by the lowest slice in the file system (the sub-directory). So entering data into the partition above takes an exclusive lock on that partition/file, but crucially the rest of the partitions remain available (assuming concurrency in Hive is enabled), as the lock listing below shows:
>
> +----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+
> | Lock ID  | Database    | Table                  | Partition                          | Lock State  | Lock Type    | Transaction ID  | Last Heartbeat  | Acquired At    | User    | Hostname  |
> +----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+
> | 1711     | scratchpad  | non_partitioned_table  | NULL                               | ACQUIRED    | SHARED_READ  | NULL            | 1428862154670   | 1428862151904  | hduser  | rhes564   |
> | 1711     | scratchpad  | partitioned_table      | country=Italy/year=2014/month=Feb  | ACQUIRED    | EXCLUSIVE    | NULL            | 1428862154670   | 1428862151905  | hduser  | rhes564   |
> +----------+-------------+------------------------+------------------------------------+-------------+--------------+-----------------+-----------------+----------------+---------+-----------+
>
> Now to your point 2: bucketing in Hive refers to hash partitioning, where a hashing function is applied. Like an RDBMS, Hive applies a hashing algorithm to prevent data from clustering within specific partitions. Hashing is very effective if the column selected for bucketing has very high selectivity, such as an ID column where selectivity (*select count(distinct(column))/count(column)*) = 1. In that case the created partitions/files will be as evenly sized as possible. In a nutshell, bucketing is a method to get data evenly distributed over many partitions/files. One should define the number of buckets as a power of two (2^n: 2, 4, 8, 16 and so on) to achieve the best results. Again, bucketing helps concurrency in Hive. It may even allow a *partition-wise join*, i.e. a join between two tables that are bucketed on the same column with the same number of buckets (has anyone tried this?).
>
> One more thing: when you define the number of buckets at table creation time in Hive, the number of partitions/files is fixed. In contrast, with partitioning you do not have this limitation.
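>
> To make the partitioning example above concrete, here is a minimal HiveQL sketch. The partition columns (country, year, month) match the path shown earlier; the data columns (id, amount) and the ORC storage format are placeholders only, not taken from the real tables:
>
>     CREATE TABLE partitioned_table (
>       id      INT,
>       amount  DECIMAL(10,2)
>     )
>     PARTITIONED BY (country STRING, year INT, month STRING)
>     STORED AS ORC;
>
>     -- dynamic partitioning lets the INSERT create the sub-directories itself
>     SET hive.exec.dynamic.partition = true;
>     SET hive.exec.dynamic.partition.mode = nonstrict;
>
>     INSERT OVERWRITE TABLE partitioned_table PARTITION (country, year, month)
>     SELECT id, amount, country, year, month
>     FROM   non_partitioned_table;
>
>     -- only the country=Italy/year=2014/month=Feb sub-directory is read here
>     SELECT COUNT(*)
>     FROM   partitioned_table
>     WHERE  country = 'Italy' AND year = 2014 AND month = 'Feb';
>
> With the WHERE clause restricted to the partition columns, Hive scans only the matching sub-directories instead of the whole table, which is the partition elimination mentioned under point 3.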
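>
> For the bucketing side, a similar sketch, again with placeholder table and column names; the CLUSTERED BY clause is what fixes the number of buckets at table creation time:
>
>     CREATE TABLE bucketed_table (
>       id      INT,
>       amount  DECIMAL(10,2)
>     )
>     CLUSTERED BY (id) INTO 8 BUCKETS
>     STORED AS ORC;
>
>     -- on older Hive releases this must be set so inserts honour the bucket count
>     SET hive.enforce.bucketing = true;
>
>     INSERT OVERWRITE TABLE bucketed_table
>     SELECT id, amount
>     FROM   non_partitioned_table;
>
>     -- a join of two tables bucketed on the join key with the same number of
>     -- buckets can be optimised into a bucket map join
>     SET hive.optimize.bucketmapjoin = true;
>
> Here id plays the role of the high-selectivity column described above, and 8 is just one power-of-two choice; Hive hashes each id into one of the 8 bucket files.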
>
> HTH
>
> Mich
>
> *From:* Ashok Kumar [mailto:ashok34...@yahoo.com]
> *Sent:* 10 April 2015 17:46
> *To:* user@hive.apache.org
> *Subject:* partition and bucket
>
> Greetings all,
>
> Glad to join the user group. I am from a DBA background (Oracle/Sybase/MSSQL).
>
> I would like to understand partitioning and bucketing in Hive and the difference between them.
>
> I shall be grateful if someone explains where I should use a partition or a bucket for best practice.
>
> thanks

--
Devopam Mittra
Life and Relations are not binary