Re: Optimized Hive query

2016-06-14 Thread Gopal Vijayaraghavan
> So I was hoping of using internal Hive CBO to somehow change the AST >generated for the query somehow. Hive does have an "explain rewrite" but that prints out the query before CBO runs. For CBO, you need to dig all the way down to the ASTBuilder class and work upwards from there. Perhaps add

Re: column statistics for non-primitive types

2016-06-14 Thread Mich Talebzadeh
hm, I am on Hive 2 and still the same hive> create table testme as select * from oraclehadoop.sales_staging where 1 = 2; hive> insert into testme select * from sales_staging limit 10; hive> desc formatted testme; OK # col_name data_type comment prod_id

Re: column statistics for non-primitive types

2016-06-14 Thread Michael Häusler
Hi Mich, as we are still on Hive 1.2.1, it is only working like this for basic stats. I would welcome it though, if it would work for column statistics as well - and it seems this feature is coming via HIVE-11160. Best Michael > On 2016-06-14, at 23:42, Mich Talebzadeh wrote: > > Hi, > > Is

Re: column statistics for non-primitive types

2016-06-14 Thread Michael Häusler
Hi Pengcheng, (1) Filtering on IS NULL / IS NOT NULL is imho a very frequent use case in all kind of queries, including joins. We also see it frequently in queries like SELECT COUNT(*) FROM foo WHERE foo.someComplex IS NULL; SELECT COUNT(*) FROM foo WHERE foo.someComplex IS NOT NULL; We also se

Re: column statistics for non-primitive types

2016-06-14 Thread Mich Talebzadeh
Hi, Is this automatic stats update is basic statistics or for all columns? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://taleb

Re: column statistics for non-primitive types

2016-06-14 Thread Mich Talebzadeh
Hi, there is another approach to reduce time for analyzing stats and that is sampling, i.e. looking at the fraction of data. For example in Oracle one can do that EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname=>user,tabname=>'T1',*estimate_percent=> 50%*) In general Hive statistics is pretty straig

Re: column statistics for non-primitive types

2016-06-14 Thread Pengcheng Xiong
Hi Michael, Thanks for your reply. (1) We have considered your first point before as well. However, join reordering is now done by Calcite in Hive optimizer. Calcite can not deal with complex/structure types. Thus, we may need to first make improvement in Calcite and then make the column

Re: column statistics for non-primitive types

2016-06-14 Thread Michael Häusler
Hi Mich, I agree with Pengcheng here. Automatic stats gathering can be extremely useful - and it is configurable. E.g., initial import into Hive happens as CSV or AVRO. Then you might want to do a conversion to ORC within Hive via create-table-as-select. At that point Hive is reading all the r

Re: column statistics for non-primitive types

2016-06-14 Thread Pengcheng Xiong
Hi Mich, I agree with you that column stats gathering in Hive is not cheap and comes with overheads. This is due to the large volume of data that Hive has to process. However, this is the price you have to pay anyway even with current "analyze table" solution. The new feature not only prov

Re: column statistics for non-primitive types

2016-06-14 Thread Michael Häusler
Hi there, > On 2016-06-14, at 22:42, Michael Häusler wrote: > > (2) > Do you happen to know, whether HIVE-11160 also works for CTAS? > Because a quick test of the configuration property did not work for me: My bad - HIVE-11160 has target version 2.2.0, so it won't work yet on my installation

Re: column statistics for non-primitive types

2016-06-14 Thread Michael Häusler
Hi Pengcheng, (1) statistics on non-primitive columns can be just as useful as on primitive columns, e.g., DROP TABLE IF EXISTS foo; CREATE TABLE foo (id BIGINT, someArray ARRAY, someStruct STRUCT); a) query optimization Let foo be a huge table that needs to be joined with another huge table ba

Re: column statistics for non-primitive types

2016-06-14 Thread Mich Talebzadeh
hi, (2) There is a configuration "hive.stats.fetch.column.stats". If you set it to true, it will automatically collect column stats for you when you insert into/overwrite a new table. You can refer to HIVE-11160 for more details. Not without its overheads. Automatic gather stats is not new. Has

Re: column statistics for non-primitive types

2016-06-14 Thread Pengcheng Xiong
Exactly, "the useful or meaningful these statistics is going to be" (The motivation behind). Best Pengcheng On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh wrote: > > Hi, > > My point was we are where we are and in this juncture there is no > collection of statistics for complex columns. That

Re: column statistics for non-primitive types

2016-06-14 Thread Pengcheng Xiong
Hi Michael, (1) We collect columns stats for the following purpose (a) Query optimization, esp. join reordering and big/small table size estimation. More recently, we also use it to remove filters. You can refer to Calcite rules. (b) Answer query directly through metaStore. You can refer to th

Re: column statistics for non-primitive types

2016-06-14 Thread Mich Talebzadeh
Hi, My point was we are where we are and in this juncture there is no collection of statistics for complex columns. That may be a future enhancement. But then the obvious question is how useful or meaningful these statistics is going to be? HTH Dr Mich Talebzadeh LinkedIn * https://www.link

Re: column statistics for non-primitive types

2016-06-14 Thread Michael Häusler
Hi there, there might be two topics here: 1) feasibility of stats for non-primitive columns 2) ease of use 1) feasibility of stats for non-primitive columns: Hive currently collects different kind of statistics for different kind of types: numeric values: min, max, #nulls, #distincts boolean

Re: Optimized Hive query

2016-06-14 Thread Mich Talebzadeh
Thank you for cut and pace monologue. very impressive. I will try to remember it Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebza

RE: Optimized Hive query

2016-06-14 Thread Markovitz, Dudu
1) Cost-based optimization in Hive https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive Calcite is an open source, Apache Licensed, query planning and execution framework. Many pieces of

Re: ORC does not support type conversion from INT to STRING.

2016-06-14 Thread Mich Talebzadeh
you must excuse my ignorance can you please elaborate on this as there seems something has gone wrong somewhere? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: ORC does not support type conversion from INT to STRING.

2016-06-14 Thread Mahender Sarangam
Yes Mich. We have restored cluster from metastore. On 6/14/2016 11:35 AM, Mich Talebzadeh wrote: Hi Mahendar, Did you load the meta-data DB/schema from backup and now seeing this error Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOAB

Re: ORC does not support type conversion from INT to STRING.

2016-06-14 Thread Mich Talebzadeh
Hi Mahendar, Did you load the meta-data DB/schema from backup and now seeing this error Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http

Re: ORC does not support type conversion from INT to STRING.

2016-06-14 Thread Mahender Sarangam
ping. On 6/13/2016 1:19 PM, Mahender Sarangam wrote: Hi, We are facing issue while reading data from ORC table. We have created ORC table and dumped data into it. We have deleted cluster due to some reason. When we recreated cluster (using Metastore) and table pointing to same location. When

Re: Optimized Hive query

2016-06-14 Thread Mich Talebzadeh
Amazing. that is the first time I have heard that an optimizer does not have the concept of flattened query? So what is the definition of syntax tree? Are you referring to the industry notation "access path". This is the first time I have heard of such notation called syntax tree. Are you stating

RE: Optimized Hive query

2016-06-14 Thread Markovitz, Dudu
It’s not the query that is being optimized but the syntax tree that is created upon the query (execute “explain extended select …”) In no point do we have a “flattened query” Dudu From: Aviral Agarwal [mailto:aviral12...@gmail.com] Sent: Tuesday, June 14, 2016 10:37 AM To: user@hive.apache.org S

Re: column statistics for non-primitive types

2016-06-14 Thread Mich Talebzadeh
Hi Michael, Statistics for columns in Hive are kept in Hive metadata table tab_col_stats. When I am looking at this table in Oracle, I only see statistics for primitives columns here. STRUCT columns do not have it as a STRUCT column will have to be broken into its primitive columns. I don't thin

Re: column statistics for non-primitive types

2016-06-14 Thread Michael Häusler
Hi there, you can reproduce the messages below with Hive 1.2.1. Best regards Michael > On 2016-06-13, at 22:21, Mich Talebzadeh wrote: > > which version of Hive are you using? > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOAB

Re: Optimized Hive query

2016-06-14 Thread Mich Talebzadeh
I presume the user is concerned with performance? The whole use case of a CBO is to take care of queries by finding the optimum access path. otherwise we would have a RBO as is in the old days of Hive. If you are in the more recent version of Hive CBO does the job. However, you may think of mov

Re: Optimized Hive query

2016-06-14 Thread Aviral Agarwal
Hi, Thanks for the replies. I already knew that the optimizer already does that. My usecase is a bit different though. I want to display the flattened query back to the user. So I was hoping of using internal Hive CBO to somehow change the AST generated for the query somehow. Thanks, Aviral On Tu

Re: Optimized Hive query

2016-06-14 Thread Gopal Vijayaraghavan
> You can see that you get identical execution plans for the nested query >and the flatten one. Wasn't that always though. Back when I started with Hive, before Stinger, it didn't have the identity project remover. To know if your version has this fix, try looking at hive> set hive.optimize.rem

RE: Issue in Insert Overwrite directory operation

2016-06-14 Thread Markovitz, Dudu
There seems to be a known bug fixed on version 1.3 https://issues.apache.org/jira/browse/HIVE-12364 Dudu From: Udit Mehta [mailto:ume...@groupon.com] Sent: Tuesday, June 14, 2016 2:55 AM To: user@hive.apache.org Subject: Issue in Insert Overwrite directory operation Hi All, I see a weird issue

RE: same hdfs location with different schema exception

2016-06-14 Thread Markovitz, Dudu
Hi Can you please share the query? Thanks Dudu From: 赵升/赵荣生 [mailto:roncenz...@qq.com] Sent: Tuesday, June 14, 2016 5:26 AM To: user Subject: same hdfs location with different schema exception Hi all: I have a question when using hive. It's described as follows: Firstly, I create two tab