Re: ORC file tuning

2013-12-30 Thread Yin Huai
Hi Avrilia, In org.apache.hadoop.hive.ql.io.orc.WriterImpl, the block size is determined by Math.min(1.5GB, 2 * stripeSize). Also, you can use "orc.block.padding" as a table property to control whether the writer pads HDFS blocks to prevent stripes from straddling blocks. The default value of
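A minimal sketch of setting that property at table creation time; only the property name comes from the message above, while the table, columns, and value are illustrative assumptions:

    -- hypothetical table; the value "false" is just an example, not the default
    CREATE TABLE orders_orc (o_orderkey BIGINT, o_totalprice DOUBLE)
    STORED AS ORC
    TBLPROPERTIES ("orc.block.padding" = "false");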

Re: Question on correlation optimizer

2013-12-10 Thread Yin Huai
Hi Avrilia, It is caused by the distinct aggregations in TPC-H Q21. Because Hive adds those distinct columns to the key columns of the ReduceSinkOperators, and the correlation optimizer only checks for exactly the same key columns right now, this query will not be optimized. The jira of this issue is https://issues.apach

Re: TPC-H queries on Hive 0.12

2013-11-22 Thread Yin Huai
I remember that textfiles are used in those scripts. With 0.12, I think ORC should be used. Also, I think those sub-queries should be merged into a single query. With a single query, if a reduce join is converted to a map join, this map join can be merged into its child job. But, if this join is eval
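As a hedged sketch of the first suggestion, converting an existing text-format TPC-H table to ORC could look like the following (the table names are assumed):

    -- assuming a lineitem table already loaded as TEXTFILE
    CREATE TABLE lineitem_orc STORED AS ORC
    AS SELECT * FROM lineitem;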

Re: ORC Tuning - Examples?

2013-11-13 Thread Yin Huai
That is exactly the type of explanation of settings I'd like to see. More than just what it does, but the tradeoffs, and how things are applied in the real world. Have you played with the stride length at all? On Wed, Nov 13, 2013 at 1:13 PM, Yin Huai wrote:

Re: ORC Tuning - Examples?

2013-11-13 Thread Yin Huai
Hi John, Here is my experience on the stripe size. For a given table, when the stripe size is increased, the size of a column in a stripe increases, which means the ORC reader can read a column from disk in a more efficient way, because the reader can sequentially read more data (assuming the read
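For anyone wanting to experiment, the stripe size can be set per table; a sketch, assuming the standard "orc.stripe.size" table property (the table and the 256 MB value are only examples):

    -- larger stripes mean longer sequential reads per column, at the cost of more writer memory
    CREATE TABLE lineitem_orc (l_orderkey BIGINT, l_extendedprice DOUBLE)
    STORED AS ORC
    TBLPROPERTIES ("orc.stripe.size" = "268435456");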

Re: [ANNOUNCE] New Hive PMC Members - Thejas Nair and Brock Noland

2013-10-24 Thread Yin Huai
Congratulations, Brock and Thejas! On Thu, Oct 24, 2013 at 6:36 PM, Prasanth Jayachandran < pjayachand...@hortonworks.com> wrote: > Congrats Thejas and Brock!! > > Thanks > Prasanth Jayachandran > > On Oct 24, 2013, at 3:29 PM, Vaibhav Gumashta > wrote: > > Congrats Brock and Thejas! > > > On T

Re: ArrayIndexOutOfBoundsException while writing MapReduce output as RCFile

2013-10-21 Thread Yin Huai
Seems you did not set the number of columns (RCFileOutputFormat.setColumnNumber(Configuration conf, int columnNum)). Can you set it in your main method and see if your MR program works? Thanks, Yin On Mon, Oct 21, 2013 at 2:38 PM, Krishnan K wrote: > Hi All, > > I have a scenario where I've t

Re: Custom SerDe: Initialize() passes a null configuration to my Custom SerDe

2013-10-14 Thread Yin Huai
Can you try to set serde properties? https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddSerDeProperties I have not tried it, but it seems to be the right way to pass configurations to a SerDe class. Thanks, Yin On Mon, Oct 14, 2013 at 8:20 AM, Rui Martins wrot
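The linked wiki section covers ALTER TABLE ... SET SERDEPROPERTIES; a sketch of passing a setting through it, where the table name and property key are hypothetical:

    -- "my.custom.key" is a made-up property that the custom SerDe would read in initialize()
    ALTER TABLE my_table
    SET SERDEPROPERTIES ("my.custom.key" = "some-value");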

Re: NPE org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable

2013-10-11 Thread Yin Huai
Hello Xinyang, Can you attach the query plan (the output of "EXPLAIN")? I think a bad plan caused the error. Also, can you try hive trunk? It looks like a bug that was fixed after the 0.11 release. Thanks, Yin On Fri, Oct 11, 2013 at 9:21 AM, xinyan Yang wrote: > Development environment, hive 0

Re: Use distribute to spread across reducers

2013-10-03 Thread Yin Huai
Hello Keith, Hive will not launch an MR job for your query because it basically reads all columns from a table; Hive will fetch the data for you directly from the underlying filesystem. Thanks, Yin On Wed, Oct 2, 2013 at 2:48 PM, Keith Wiley wrote: > I'm trying to create a subset of a large
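For illustration, a query of the first shape below is answered by a fetch task rather than MapReduce, so there are no reducers to distribute over; the table and column names are assumed:

    -- served directly from the filesystem, no MR job is launched
    SELECT * FROM big_table;

    -- an INSERT ... DISTRIBUTE BY does launch an MR job, so rows are spread across reducers
    INSERT OVERWRITE TABLE subset_table
    SELECT * FROM big_table
    DISTRIBUTE BY some_col;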

Re: Duplicate rows when using group by in subquery

2013-09-19 Thread Yin Huai
eKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 Using set hive.optimize.reducededuplication=false; I get 2 mapreduce jobs and the correct number of rows (24). Can I verify somehow, maybe through looking in the source code, tha

Re: Duplicate rows when using group by in subquery

2013-09-17 Thread Yin Huai
incorrectly assumes one job is enough? I will get back with results from your suggestions ASAP; unfortunately I don't have the machines available until Thursday. / Sincerely Mikael *From:* Yin Huai *To:* user@hive.apache.org; Mikael Öhman *Sent:*

Re: Duplicate rows when using group by in subquery

2013-09-16 Thread Yin Huai
Hello Mikael, It seems your case is related to the bug reported in https://issues.apache.org/jira/browse/HIVE-5149. Basically, when hive uses a single MapReduce job to evaluate your query, "c.Symbol" and "c.catid" are used to partition data, and thus, rows with the same value of "c.Symbol" are not
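The workaround that comes up later in this thread can be applied per session before re-running the query; a small sketch:

    -- turn off ReduceSinkDeDuplication so the GROUP BY subquery gets its own MapReduce job
    set hive.optimize.reducededuplication=false;
    -- then re-run the affected query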

Re: UNION ALL query behaving strangely; WHERE CLAUSE is also not being honored

2013-09-12 Thread Yin Huai
Hi, Can you also attach the query plan (the result of EXPLAIN)? It may help to find where the problem is. Thanks, Yin On Thu, Sep 12, 2013 at 1:00 PM, Chuck Hardin wrote: > Please bear with me, because this is a pretty large query. > > TL;DR: I'm doing a UNION ALL on a bunch of subqueries.

Re: Problems with 0.11, count(DISTINCT), and NPE

2013-09-05 Thread Yin Huai
set hive.auto.convert.join.noconditionaltask=false; makes it work (though it does way more map reduce jobs than it should). When I get some time I will test against the latest trunk. Thanks, Nate On Sep 3, 2013, at 6:09 PM, Yin Huai wrote:

Re: Problems with 0.11, count(DISTINCT), and NPE

2013-09-03 Thread Yin Huai
Based on the log, it may also be related to https://issues.apache.org/jira/browse/HIVE-4927. To make it work (in a not very optimized way), can you try "set hive.auto.convert.join.noconditionaltask=false;"? If you still get the error, give "set hive.auto.convert.join=false;" a try (it will turn of
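The two fallbacks suggested above, in the order to try them:

    -- first fallback: keep automatic map joins but disable the noconditionaltask optimization
    set hive.auto.convert.join.noconditionaltask=false;
    -- if the NPE persists, disable automatic map-join conversion entirely
    set hive.auto.convert.join=false;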

Re: DISTRIBUTE BY works incorrectly in Hive 0.11 in some cases

2013-08-26 Thread Yin Huai
Forgot to add in my last reply: to generate correct results, you can set hive.optimize.reducededuplication to false to turn off ReduceSinkDeDuplication. On Sun, Aug 25, 2013 at 9:35 PM, Yin Huai wrote: > Created a jira https://issues.apache.org/jira/browse/HIVE-5149 > > > On

Re: DISTRIBUTE BY works incorrectly in Hive 0.11 in some cases

2013-08-25 Thread Yin Huai
Created a jira https://issues.apache.org/jira/browse/HIVE-5149 On Sun, Aug 25, 2013 at 9:11 PM, Yin Huai wrote: > Seems ReduceSinkDeDuplication picked the wrong partitioning columns. > > > On Fri, Aug 23, 2013 at 9:15 PM, Shahansad KP wrote: > >> I think the problem lies

Re: DISTRIBUTE BY works incorrectly in Hive 0.11 in some cases

2013-08-25 Thread Yin Huai
Seems ReduceSinkDeDuplication picked the wrong partitioning columns. On Fri, Aug 23, 2013 at 9:15 PM, Shahansad KP wrote: > I think the problem lies within the group by operation. For this optimization to work, the group by's partitioning should be on column 1 only. It won't affect th

Re: single MR stage for join and group by

2013-08-01 Thread Yin Huai
If the join is a reduce-side join, https://issues.apache.org/jira/browse/HIVE-2206 will optimize this query and generate a single MR job. The optimizer introduced by HIVE-2206 is in trunk. Currently, it only handles the case where the join and the group by use the same column(s). If the join is a MapJoin, hive 0.11 can generate a single MR j
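For reference, a sketch of enabling the HIVE-2206 optimizer on trunk and the kind of query it targets; the flag name hive.optimize.correlation and the tables are assumptions here, not taken from the message above:

    -- assumed flag for the correlation optimizer
    set hive.optimize.correlation=true;
    -- join and group by share the same key, so both can be evaluated in one MR job
    SELECT t1.key, count(*)
    FROM t1 JOIN t2 ON (t1.key = t2.key)
    GROUP BY t1.key;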

Re: Re: BUG IN HIVE-4650 seems not fixed

2013-07-31 Thread Yin Huai
I just uploaded a patch to https://issues.apache.org/jira/browse/HIVE-4968. You can try it and see if the problem has been resolved for your query. On Wed, Jul 31, 2013 at 11:21 AM, Yin Huai wrote: > Seems it is another problem. > Can you try > > > SELECT * > FROM

Re: Re: BUG IN HIVE-4650 seems not fixed

2013-07-31 Thread Yin Huai
My hadoop version is 1.0.1. I use the default hive configuration. -- wzc1...@gmail.com Sent with Sparrow <http://www.sparrowmailapp.com/?sig> On Monday, July 29, 2013 at 1:08 PM, Yin Huai wrote: Hi, Can yo

Re: Re: BUG IN HIVE-4650 seems not fixed

2013-07-28 Thread Yin Huai
Hi, Can you also post the output of EXPLAIN? The execution plan may be helpful to locate the problem. Thanks, Yin On Sun, Jul 28, 2013 at 8:06 PM, wrote: > What I mean by "not pass the testcase in HIVE-4650" is that I compile the > trunk code and run the query in HIVE-4650: > SELECT * > FROM

Re: ColumnarSerDe and LazyBinaryColumnarSerDe

2012-03-07 Thread Yin Huai
ces, but is CPU efficient. Your tests align with our internal tests from a long time ago. On Tue, Mar 6, 2012 at 8:58 AM, Yin Huai wrote: Hi, Is LazyBinaryColumnarSerDe more space efficient than ColumnarSerDe in general? Let me make my

ColumnarSerDe and LazyBinaryColumnarSerDe

2012-03-06 Thread Yin Huai
Hi, Is LazyBinaryColumnarSerDe more space efficient than ColumnarSerDe in general? Let me make my question more specific. I generated two tables from the table lineitem of TPC-H using ColumnarSerDe and LazyBinaryColumnarSerDe as follows... CREATE TABLE lineitem_rcfile_lazybinary ROW FORMAT SERDE
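The statement is cut off above; a hedged reconstruction of the pair of tables being compared, using the standard SerDe class names (the second table name and the CTAS form are assumptions):

    CREATE TABLE lineitem_rcfile_lazybinary
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
    STORED AS RCFILE
    AS SELECT * FROM lineitem;

    CREATE TABLE lineitem_rcfile_columnar
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
    STORED AS RCFILE
    AS SELECT * FROM lineitem;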

Re: RCFile in java MapReduce

2012-01-09 Thread Yin Huai
I have some experience using RCFile with the new MapReduce API from the HCatalog project ( http://incubator.apache.org/hcatalog/ ). For the output part, in your main, you need ... job.setOutputFormatClass(RCFileMapReduceOutputFormat.class); RCFileMapReduceOutputFormat.setColumnNumber(job.getCo