Hey Dean, I am not a power user of the sampling feature but my understanding was that sampling in Hive only works on bucketed tables. I am happy to be corrected though.
Mark On Wed, Mar 20, 2013 at 12:20 PM, Dean Wampler < dean.wamp...@thinkbiganalytics.com> wrote: > Mark, > > Aside from what might be wrong here, isn't it true that sampling with the > bucket clause still works on non-bucketed tables; it's just inefficient > because it still scans the whole table? Or am I an idiot? ;) > > dean > > On Wed, Mar 20, 2013 at 2:17 PM, Mark Grover > <grover.markgro...@gmail.com>wrote: > >> Hi Robert, >> Sampling in Hive is based on buckets. Therefore, you table needs to be >> appropriately bucketed. >> >> I would recommend storing the results of your inner query in a bucketed >> table. See how to populate a bucketed table at >> https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html >> >> Then you will be able to be sample through it. >> >> If the predicate is on partition column, you may be able to get around >> the intermediate table requirement but in general, as far as I know, >> intermediate bucketed table might be the only choice. >> >> Mark >> >> On Wed, Mar 20, 2013 at 9:56 AM, Robert Li <robert...@kontagent.com>wrote: >> >>> Hi Everyone >>> >>> I'm trying to use the TABLESAMPLE function to sample data, however it's >>> a little more complicated and I am having trouble getting it to run. >>> >>> I know that this works fine and it will give me about 25% of the whole >>> dataset >>> >>> select distinct s >>> from testtable TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) >>> where month <= 201211 >>> >>> However, in my situation I need to do a TABLESAMPLE on an outer query, a >>> simple example is >>> >>> *select mytest.s * >>> *from * >>> * (select distinct s from testtable where month <= 201211)mytest* >>> >>> or something like >>> >>> *select table1.s* >>> *from * >>> * (select distinct s from testtable)table1* >>> * join* >>> * (select distinct s from test2table)table2* >>> * on table1.s=table2.s* >>> >>> >>> How do I use TABLESAMPLE in this case to sample the results of the outer >>> query? I tried placing TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) in various >>> places of my query but it always returns some sort of syntax error and thus >>> not allowing the query to run. >>> >>> Any help is appreciated. >>> >>> Robert >>> ** >>> >> >> > > > -- > *Dean Wampler, Ph.D.* > thinkbiganalytics.com > +1-312-339-1330 > >