Re: Optimizing ORC Sorting - Replace two level Partitions with one?

John Omernik Sat, 10 Aug 2013 10:37:05 -0700

Are there any effective limits on the number of partitions? Partitions are
the answer we chose because they make logical sense: I have days, and on a
given day I have a number of sources. Sometimes I want to query by day and
search all sources; other times I want to focus on specific sources. With
bucketing, will it prune on the column automatically, like partitions do?
(Remember, this is specific to the ORC files that I am working with here.)



On Sat, Aug 10, 2013 at 12:19 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Bucketing does deal with that: if you bucket on a column, you always get
> as many files as there are buckets, because you're hashing each value into
> a bucket.
>
> A query scanning many partitions and files is needlessly slow because of
> MapReduce overhead.
>
>
> On Sat, Aug 10, 2013 at 12:58 PM, John Omernik <j...@omernik.com> wrote:
>
>> One issue with the bucketing is that the number of sources on any given
>> day is dynamic. On some days it's 4, others it's 14 and it's also
>> constantly changing.  I am hoping to use some of the features of the ORC
>> files to almost make virtual partitions, but apparently I am going to run
>> into issues either way.
>>
>> On another note, is there a limit on the number of partitions in Hive? I
>> am hovering around 10k partitions on one table right now. It's still
>> working, but some metadata operations can take a long time. I am guessing
>> the sub-partitions are going to hurt me here going forward, so it may be
>> worth flattening out to only days, even at the expense of read queries...
>> thoughts?
>>
>>
>>
>> On Sat, Aug 10, 2013 at 11:46 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>>> Agree with Edward.
>>>
>>> The whole purpose of bucketing, for me, is to prune the data in the WHERE
>>> clause. Otherwise it totally defeats the purpose of splitting data into a
>>> finite number of identifiable distributions to improve performance.
>>>
>>> But is my understanding correct that bucketing does help reduce the
>>> number of sub-partitions we create at the bottom of the table, provided
>>> we know that the values in that partition column do not exceed a finite
>>> number? (Even if they cross this limit, bucketing does take care of it up
>>> to some volume.)
>>>
>>>
>>> On Sat, Aug 10, 2013 at 10:09 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>
>>>> So there is one thing to be really careful about with bucketing. Say
>>>> you bucket a table into 10 buckets: a SELECT with a WHERE clause does
>>>> not actually prune the input buckets, so many queries scan all the
>>>> buckets.
>>>>
>>>>
>>>> On Sat, Aug 10, 2013 at 12:34 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>
>>>>> Will bucketing help, if you know there is a finite number of partitions?
>>>>>
>>>>>
>>>>> On Sat, Aug 10, 2013 at 9:26 PM, John Omernik <j...@omernik.com> wrote:
>>>>>
>>>>>> I have a table that currently uses RC files and has two levels of
>>>>>> partitions: day and source. The table is first partitioned by day, then
>>>>>> within each day there are 6-15 source partitions. This makes for a lot
>>>>>> of crazy partitions, and I was wondering if there'd be a way to
>>>>>> optimize this with ORC files and some sorting.
>>>>>>
>>>>>> Specifically, would there be a way in a new table to make source a
>>>>>> field (removing the partition) and somehow, as I am inserting into this
>>>>>> new setup, sort by source in such a way that it helps separate the
>>>>>> files/indexes and gives me almost the same performance as ORC with the
>>>>>> two-level partitions? Just trying to optimize here and curious what
>>>>>> people think.
>>>>>>
>>>>>> John
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
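
To illustrate Edward's point that bucketing on a column always yields a fixed
number of files, here is a minimal Python sketch. It is not Hive code: the
names bucket_for and bucket_layout are hypothetical, and zlib.crc32 only
stands in for whatever hash Hive applies to the bucketing column. It shows
that the bucket count stays the same whether a day has 4 sources or 14,
whereas a source sub-partition scheme would create one directory per source.

```python
import zlib


def bucket_for(value: str, num_buckets: int) -> int:
    """Map a value to a bucket id via a stable hash modulo the bucket count.

    crc32 is an assumption used for illustration; it is not Hive's hash.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def bucket_layout(sources, num_buckets):
    """Return {bucket_id: [sources]} with every bucket present, even if empty."""
    layout = {b: [] for b in range(num_buckets)}
    for source in sorted(set(sources)):
        layout[bucket_for(source, num_buckets)].append(source)
    return layout


if __name__ == "__main__":
    quiet_day = [f"source_{i}" for i in range(4)]   # a day with 4 sources
    busy_day = [f"source_{i}" for i in range(14)]   # a day with 14 sources

    for name, sources in (("quiet", quiet_day), ("busy", busy_day)):
        layout = bucket_layout(sources, num_buckets=8)
        # Sub-partitions would grow with len(set(sources)); buckets do not.
        print(name, "sub-partitions:", len(set(sources)), "buckets:", len(layout))
```

Note that one bucket can hold rows from several sources, so a filter on a
single source still reads the other sources that hashed to the same bucket,
and only if the engine prunes buckets at all, which is the caveat raised
earlier in the thread.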
