Hey Dean,
I am not a power user of the sampling feature but my understanding was that
sampling in Hive only works on bucketed tables. I am happy to be corrected
though.

Mark

On Wed, Mar 20, 2013 at 12:20 PM, Dean Wampler <
dean.wamp...@thinkbiganalytics.com> wrote:

> Mark,
>
> Aside from what might be wrong here, isn't it true that sampling with the
> bucket clause still works on non-bucketed tables; it's just inefficient
> because it still scans the whole table? Or am I an idiot? ;)
>
> dean
>
> On Wed, Mar 20, 2013 at 2:17 PM, Mark Grover 
> <grover.markgro...@gmail.com>wrote:
>
>> Hi Robert,
>> Sampling in Hive is based on buckets. Therefore, you table needs to be
>> appropriately bucketed.
>>
>> I would recommend storing the results of your inner query in a bucketed
>> table. See how to populate a bucketed table at
>> https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html
>>
>> Then you will be able to be sample through it.
>>
>> If the predicate is on partition column, you may be able to get around
>> the intermediate table requirement but in general, as far as I know,
>> intermediate bucketed table might be the only choice.
>>
>> Mark
>>
>> On Wed, Mar 20, 2013 at 9:56 AM, Robert Li <robert...@kontagent.com>wrote:
>>
>>> Hi Everyone
>>>
>>> I'm trying to use the TABLESAMPLE function to sample data, however it's
>>> a little more complicated and I am having trouble getting it to run.
>>>
>>> I know that this works fine and it will give me about 25% of the whole
>>> dataset
>>>
>>> select distinct s
>>> from testtable TABLESAMPLE(BUCKET 1 OUT OF 4 ON s)
>>> where month <= 201211
>>>
>>> However, in my situation I need to do a TABLESAMPLE on an outer query, a
>>> simple example is
>>>
>>> *select mytest.s *
>>> *from *
>>> * (select distinct s from testtable where month <= 201211)mytest*
>>>
>>> or something like
>>>
>>> *select table1.s*
>>> *from *
>>> * (select distinct s from testtable)table1*
>>> * join*
>>> * (select distinct s from test2table)table2*
>>> * on table1.s=table2.s*
>>>
>>>
>>> How do I use TABLESAMPLE in this case to sample the results of the outer
>>> query? I tried placing TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) in various
>>> places of my query but it always returns some sort of syntax error and thus
>>> not allowing the query to run.
>>>
>>> Any help is appreciated.
>>>
>>> Robert
>>> **
>>>
>>
>>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>

Reply via email to