hi, (2) There is a configuration "hive.stats.fetch.column.stats". If you set it to true, it will automatically collect column stats for you when you insert into/overwrite a new table. You can refer to HIVE-11160 for more details.
Not without its overheads. Automatic gather stats is not new. Has been around for a good time in RDBMS and can impact the performance of other queries running. So I am not sure it can be considered as blessing. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 14 June 2016 at 21:25, Pengcheng Xiong <pxi...@apache.org> wrote: > Exactly, "the useful or meaningful these statistics is going to be" (The > motivation behind). > > Best > Pengcheng > > > On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> >> Hi, >> >> My point was we are where we are and in this juncture there is no >> collection of statistics for complex columns. That may be a future >> enhancement. >> >> But then the obvious question is how useful or meaningful these >> statistics is going to be? >> >> HTH >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> >> On 14 June 2016 at 21:03, Michael Häusler <mich...@akatose.de> wrote: >> >>> Hi there, >>> >>> there might be two topics here: >>> >>> 1) feasibility of stats for non-primitive columns >>> 2) ease of use >>> >>> >>> 1) feasibility of stats for non-primitive columns: >>> >>> Hive currently collects different kind of statistics for different kind >>> of types: >>> numeric values: min, max, #nulls, #distincts >>> boolean values: #nulls, #trues, #falses >>> string values: #nulls, #distincts, avgLength, maxLength >>> >>> So, it seems quite possible to also collect at least partial stats for >>> top-level non-primitive columns, e.g.: >>> array values: #nulls, #distincts, avgLength, maxLength >>> map values: #nulls, #distincts, avgLength, maxLength >>> struct values: #nulls, #distincts >>> union values: #nulls, #distincts >>> >>> >>> 2) ease of use >>> >>> The presence of a single non-primitive column currently breaks the use >>> of the convenience shorthand to gather statistics for all columns (ANALYZE >>> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption >>> of column statistics for hive users. >>> >>> Best regards >>> Michael >>> >>> >>> >>> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebza...@gmail.com> >>> wrote: >>> >>> Hi Michael, >>> >>> Statistics for columns in Hive are kept in Hive metadata table >>> tab_col_stats. >>> >>> When I am looking at this table in Oracle, I only see statistics for >>> primitives columns here. STRUCT columns do not have it as a STRUCT column >>> will have to be broken into its primitive columns. I don't think Hive has >>> the means to do that. >>> >>> desc tab_col_stats; >>> Name >>> Null? Type >>> ------------------------------------------------------------------------ >>> -------- ------------------------------------------------- >>> CS_ID >>> NOT NULL NUMBER >>> DB_NAME >>> NOT NULL VARCHAR2(128) >>> TABLE_NAME >>> NOT NULL VARCHAR2(128) >>> COLUMN_NAME >>> NOT NULL VARCHAR2(1000) >>> COLUMN_TYPE >>> NOT NULL VARCHAR2(128) >>> TBL_ID >>> NOT NULL NUMBER >>> LONG_LOW_VALUE >>> NUMBER >>> LONG_HIGH_VALUE >>> NUMBER >>> DOUBLE_LOW_VALUE >>> NUMBER >>> DOUBLE_HIGH_VALUE >>> NUMBER >>> BIG_DECIMAL_LOW_VALUE >>> VARCHAR2(4000) >>> BIG_DECIMAL_HIGH_VALUE >>> VARCHAR2(4000) >>> NUM_NULLS >>> NOT NULL NUMBER >>> NUM_DISTINCTS >>> NUMBER >>> AVG_COL_LEN >>> NUMBER >>> MAX_COL_LEN >>> NUMBER >>> NUM_TRUES >>> NUMBER >>> NUM_FALSES >>> NUMBER >>> LAST_ANALYZED >>> NOT NULL NUMBER >>> >>> >>> >>> So in summary although column type STRUCT do exit, I don't think Hive >>> can cater for their statistics. Actually I don't think Oracle itself does >>> it. >>> >>> HTH >>> >>> P.S. I am on Hive 2 and it does not. >>> >>> hive> analyze table foo compute statistics for columns; >>> FAILED: UDFArgumentTypeException Only primitive type arguments are >>> accepted but array<bigint> is passed. >>> >>> >>> Dr Mich Talebzadeh >>> >>> >>> LinkedIn * >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>> >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> >>> On 14 June 2016 at 09:57, Michael Häusler <mich...@akatose.de> wrote: >>> >>>> Hi there, >>>> >>>> you can reproduce the messages below with Hive 1.2.1. >>>> >>>> Best regards >>>> Michael >>>> >>>> >>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebza...@gmail.com> >>>> wrote: >>>> >>>> which version of Hive are you using? >>>> >>>> Dr Mich Talebzadeh >>>> >>>> >>>> LinkedIn * >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>>> >>>> >>>> http://talebzadehmich.wordpress.com >>>> >>>> >>>> >>>> On 13 June 2016 at 16:00, Michael Häusler <mich...@akatose.de> wrote: >>>> >>>>> Hi there, >>>>> >>>>> >>>>> when testing column statistics I stumbled upon the following error >>>>> message: >>>>> >>>>> DROP TABLE IF EXISTS foo; >>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar >>>>> STRUCT<key:STRING,value:STRING>); >>>>> >>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS; >>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are >>>>> accepted but array<bigint> is passed. >>>>> >>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar; >>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are >>>>> accepted but struct<key:string,value:string> is passed. >>>>> >>>>> >>>>> 1) Basically, it seems that column statistics don't work for >>>>> non-primitive types. Are there any workarounds or any plans to change >>>>> this? >>>>> >>>>> 2) Furthermore, the convenience syntax to compute statistics for all >>>>> columns does not work as soon as there is a non-supported column. Are >>>>> there >>>>> any plans to change this, so it is easier to compute statistics for all >>>>> supported columns? >>>>> >>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the >>>>> error message. Especially for wide tables it would be much easier if all >>>>> non-supported column *names* would be printed. >>>>> >>>>> >>>>> Best regards >>>>> Michael >>>>> >>>>> >>>> >>>> >>> >>> >> >