Hi Mich,

I agree with Pengcheng here. Automatic stats gathering can be extremely useful, and it is configurable.

For example, an initial import into Hive often happens as CSV or Avro. You might then want to convert to ORC within Hive via CREATE TABLE ... AS SELECT. At that point Hive is reading all the records anyway, so we might just as well gather as much useful information as possible for stats.
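For illustration, a minimal sketch of that flow (the table names are made up; as far as I know the auto-gather switch introduced by HIVE-11160 is hive.stats.column.autogather, while hive.stats.fetch.column.stats controls whether the optimizer reads the stats back; exact property names depend on your Hive version):

SET hive.stats.autogather=true;           -- basic table stats gathered on write (default)
SET hive.stats.column.autogather=true;    -- column stats gathered on write (HIVE-11160; name may differ by version)
SET hive.stats.fetch.column.stats=true;   -- let the optimizer use the gathered column stats

-- convert the staging data to ORC; the same scan that rewrites the data
-- also populates table and column statistics, so no separate
-- ANALYZE TABLE ... COMPUTE STATISTICS pass is needed afterwards
CREATE TABLE events_orc STORED AS ORC
AS SELECT * FROM events_csv;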
Best
Michael


> On 2016-06-14, at 23:05, Pengcheng Xiong <pxi...@apache.org> wrote:
>
> Hi Mich,
>
> I agree with you that column stats gathering in Hive is not cheap and comes with overheads. This is due to the large volume of data that Hive has to process. However, this is the price you have to pay anyway even with the current "analyze table" solution.
>
> The new feature not only provides a way to give users column stats automatically, but also saves overhead for the "insert into" case. In this case, the new column stats are generated incrementally, i.e., by merging with the existing stats. Without this feature, you have to scan the whole table and compute stats.
>
> In conclusion, this new feature should not have any more overhead than the current solution.
>
> Best
> Pengcheng
>
> On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Hi,
>
> (2) There is a configuration "hive.stats.fetch.column.stats". If you set it to true, it will automatically collect column stats for you when you insert into/overwrite a new table. You can refer to HIVE-11160 for more details.
>
> Not without its overheads.
>
> Automatic stats gathering is not new. It has been around for a good while in RDBMSs and can impact the performance of other queries running. So I am not sure it can be considered a blessing.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 14 June 2016 at 21:25, Pengcheng Xiong <pxi...@apache.org> wrote:
> Exactly, "how useful or meaningful these statistics are going to be" (that is the motivation behind it).
>
> Best
> Pengcheng
>
>
> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Hi,
>
> My point was that we are where we are, and at this juncture there is no collection of statistics for complex columns. That may be a future enhancement.
>
> But then the obvious question is how useful or meaningful these statistics are going to be?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
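On the question of how useful these statistics are: one quick way to see what Hive actually stores per column is DESCRIBE FORMATTED with a column name. A sketch using the foo table defined at the bottom of this thread (the exact output layout varies by Hive version, and only primitive columns will have stats to show):

DESCRIBE FORMATTED foo foo;
-- with column stats present, this typically lists for the BIGINT column foo:
-- min, max, num_nulls, distinct_count, avg_col_len, max_col_len, num_trues, num_falses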
> On 14 June 2016 at 21:03, Michael Häusler <mich...@akatose.de> wrote:
> Hi there,
>
> there might be two topics here:
>
> 1) feasibility of stats for non-primitive columns
> 2) ease of use
>
>
> 1) feasibility of stats for non-primitive columns:
>
> Hive currently collects different kinds of statistics for different kinds of types:
>
> numeric values: min, max, #nulls, #distincts
> boolean values: #nulls, #trues, #falses
> string values:  #nulls, #distincts, avgLength, maxLength
>
> So it seems quite possible to also collect at least partial stats for top-level non-primitive columns, e.g.:
>
> array values:  #nulls, #distincts, avgLength, maxLength
> map values:    #nulls, #distincts, avgLength, maxLength
> struct values: #nulls, #distincts
> union values:  #nulls, #distincts
>
>
> 2) ease of use
>
> The presence of a single non-primitive column currently breaks the use of the convenience shorthand to gather statistics for all columns (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;). IMHO, this slows down adoption of column statistics for Hive users.
>
> Best regards
> Michael
>
>
>> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Hi Michael,
>>
>> Statistics for columns in Hive are kept in the Hive metadata table tab_col_stats.
>>
>> When I am looking at this table in Oracle, I only see statistics for primitive columns here. STRUCT columns do not have them, as a STRUCT column would have to be broken into its primitive columns. I don't think Hive has the means to do that.
>>
>> desc tab_col_stats;
>> Name                        Null?     Type
>> --------------------------- --------- ----------------
>> CS_ID                       NOT NULL  NUMBER
>> DB_NAME                     NOT NULL  VARCHAR2(128)
>> TABLE_NAME                  NOT NULL  VARCHAR2(128)
>> COLUMN_NAME                 NOT NULL  VARCHAR2(1000)
>> COLUMN_TYPE                 NOT NULL  VARCHAR2(128)
>> TBL_ID                      NOT NULL  NUMBER
>> LONG_LOW_VALUE                        NUMBER
>> LONG_HIGH_VALUE                       NUMBER
>> DOUBLE_LOW_VALUE                      NUMBER
>> DOUBLE_HIGH_VALUE                     NUMBER
>> BIG_DECIMAL_LOW_VALUE                 VARCHAR2(4000)
>> BIG_DECIMAL_HIGH_VALUE                VARCHAR2(4000)
>> NUM_NULLS                   NOT NULL  NUMBER
>> NUM_DISTINCTS                         NUMBER
>> AVG_COL_LEN                           NUMBER
>> MAX_COL_LEN                           NUMBER
>> NUM_TRUES                             NUMBER
>> NUM_FALSES                            NUMBER
>> LAST_ANALYZED               NOT NULL  NUMBER
>>
>>
>> So in summary, although the column type STRUCT does exist, I don't think Hive can cater for its statistics. Actually, I don't think Oracle itself does it.
>>
>> HTH
>>
>> P.S. I am on Hive 2 and it does not:
>>
>> hive> analyze table foo compute statistics for columns;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.
>>
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
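Regarding that ANALYZE failure: listing only the primitive columns explicitly does seem to work as a workaround. A sketch for the foo table defined at the bottom of this thread (only the BIGINT column foo is analyzed; the ARRAY and STRUCT columns bar and foobar are simply left out):

ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foo;

For wide tables this gets tedious, which is part of the ease-of-use point above.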
>> On 14 June 2016 at 09:57, Michael Häusler <mich...@akatose.de> wrote:
>> Hi there,
>>
>> you can reproduce the messages below with Hive 1.2.1.
>>
>> Best regards
>> Michael
>>
>>
>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> which version of Hive are you using?
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> On 13 June 2016 at 16:00, Michael Häusler <mich...@akatose.de> wrote:
>>> Hi there,
>>>
>>> when testing column statistics I stumbled upon the following error messages:
>>>
>>> DROP TABLE IF EXISTS foo;
>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar STRUCT<key:STRING,value:STRING>);
>>>
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed.
>>>
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but struct<key:string,value:string> is passed.
>>>
>>>
>>> 1) Basically, it seems that column statistics don't work for non-primitive types. Are there any workarounds or any plans to change this?
>>>
>>> 2) Furthermore, the convenience syntax to compute statistics for all columns does not work as soon as there is a non-supported column. Are there any plans to change this, so it is easier to compute statistics for all supported columns?
>>>
>>> 3) ANALYZE TABLE will only report the first failing *type* in the error message. Especially for wide tables it would be much easier if all non-supported column *names* were printed.
>>>
>>>
>>> Best regards
>>> Michael
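Regarding question 1: until Hive supports this natively, a manual workaround might be to compute rough figures for a non-primitive column with an ordinary query. A sketch for the array column bar above (size() works on arrays and maps; this is a full table scan and the result is not stored in the metastore):

SELECT COUNT(*)                                          AS num_rows,
       SUM(CASE WHEN bar IS NULL THEN 1 ELSE 0 END)      AS num_nulls,
       AVG(CASE WHEN bar IS NOT NULL THEN size(bar) END) AS avg_length,
       MAX(CASE WHEN bar IS NOT NULL THEN size(bar) END) AS max_length
FROM foo;
-- the CASE guards keep NULL arrays out of the length figures,
-- since size(NULL) returns -1 in Hive rather than NULL or 0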