Are there any effective limits on the number of partitions? We chose partitions because they make logical sense for the data: I have days, and on a given day I have a number of sources. Sometimes I want to query by day and search all sources; other times I want to focus on specific sources. With bucketing, will Hive automatically prune on the bucketed column the way it does with partitions? (Remember, I am working specifically with ORC files here.)
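
To make the bucketing trade-off concrete, here is a rough Python sketch of how hashing a source value into a fixed number of buckets caps the number of files per day, however many sources show up. The bucket count, the source names, and the 31-based string hash are all assumptions for illustration; Hive's real hash function and file layout may differ.

```python
from collections import defaultdict

NUM_BUCKETS = 8  # hypothetical bucket count, fixed when the table is created


def java_style_hash(value: str) -> int:
    """31-based rolling hash over UTF-8 bytes, kept to 32 bits (an assumption)."""
    h = 0
    for b in value.encode("utf-8"):
        h = (31 * h + b) & 0xFFFFFFFF
    return h


def bucket_for(source: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Clear the sign bit, then take the remainder, so the result
    # always falls in the range [0, num_buckets).
    return (java_style_hash(source) & 0x7FFFFFFF) % num_buckets


def files_per_day(sources, num_buckets=NUM_BUCKETS):
    """Group one day's sources by the bucket they would hash into."""
    layout = defaultdict(list)
    for s in sources:
        layout[bucket_for(s, num_buckets)].append(s)
    return dict(layout)


# Hypothetical days: one with 4 sources, one with 14.
quiet_day = [f"source_{i}" for i in range(4)]
busy_day = [f"source_{i}" for i in range(14)]

for name, day in (("quiet", quiet_day), ("busy", busy_day)):
    layout = files_per_day(day)
    print(name, "sources:", len(day), "non-empty buckets:", len(layout))
```

With day/source partitions the busy day would need 14 sub-partitions; with bucketing it never needs more than NUM_BUCKETS groups. The catch is that several sources can share a bucket, and a filter on source only helps if the engine actually skips the other buckets.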

On Sat, Aug 10, 2013 at 12:19 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Bucketing does deal with that: if you bucket on a column, you always get
> the bucket number of files, because you're hashing the value into a bucket.
>
> A query scanning many partitions and files is needlessly slow from MR
> overhead.
>
> On Sat, Aug 10, 2013 at 12:58 PM, John Omernik <j...@omernik.com> wrote:
>
>> One issue with the bucketing is that the number of sources on any given
>> day is dynamic. On some days it's 4, on others it's 14, and it's also
>> constantly changing. I am hoping to use some of the features of the ORC
>> files to almost make virtual partitions, but apparently I am going to run
>> into issues either way.
>>
>> On another note, is there a limit to Hive and partitions? I am hovering
>> around 10k partitions on one table right now. It's still working, but some
>> metadata operations can take a long time. The sub-partitions are going to
>> hurt me here going forward, I am guessing, so it may be worth flattening out
>> to only days, even at the expense of read queries... thoughts?
>>
>> On Sat, Aug 10, 2013 at 11:46 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>>> Agree with Edward.
>>>
>>> The whole purpose of bucketing, for me, is to prune the data in the where
>>> clause. Otherwise it totally defeats the purpose of splitting data into a
>>> finite number of identifiable distributions to improve performance.
>>>
>>> But is my understanding correct that it helps reduce the number of
>>> sub-partitions we create at the bottom of the table, which can be limited
>>> if we identify that the pattern does not exceed a finite number of values
>>> on those partitions? (Even if it crosses this limit, bucketing does take
>>> care of it up to some volume.)
>>>
>>> On Sat, Aug 10, 2013 at 10:09 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>
>>>> So there is one thing to be really careful about with bucketing. Say you
>>>> bucket a table into 10 buckets: a select with a where clause does not
>>>> actually prune the input buckets, so many queries scan all the buckets.
>>>>
>>>> On Sat, Aug 10, 2013 at 12:34 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>
>>>>> Will bucketing help, if you know there is a finite # of partitions?
>>>>>
>>>>> On Sat, Aug 10, 2013 at 9:26 PM, John Omernik <j...@omernik.com> wrote:
>>>>>
>>>>>> I have a table that currently uses RC files and has two levels of
>>>>>> partitions: day and source. The table is first partitioned by day, then
>>>>>> within each day there are 6-15 source partitions. This makes for a lot
>>>>>> of crazy partitions, and I was wondering if there'd be a way to optimize
>>>>>> this with ORC files and some sorting.
>>>>>>
>>>>>> Specifically, would there be a way in a new table to make source a
>>>>>> field (removing the partition) and somehow, as I am inserting into this
>>>>>> new setup, sort by source in such a way that will help separate the
>>>>>> files/indexes in a way that gives me almost the same performance as ORC
>>>>>> with the two-level partitions? Just trying to optimize here and curious
>>>>>> what people think.
>>>>>>
>>>>>> John
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>
>>> --
>>> Nitin Pawar