Re: column statistics for non-primitive types

Mich Talebzadeh Tue, 14 Jun 2016 13:42:35 -0700

hi,

(2) There is a configuration "hive.stats.fetch.column.stats". If you set it
to true, it will automatically collect column stats for you when you insert
into/overwrite a new table. You can refer to HIVE-11160 for more details.


Not without its overheads.

Automatic gather stats is not new. Has been around for a good time in
RDBMS and can impact the performance of other queries running. So I am not
sure it can be considered as blessing.

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 21:25, Pengcheng Xiong <pxi...@apache.org> wrote:

> Exactly, "the useful or meaningful these statistics is going to be" (The
> motivation behind).
>
> Best
> Pengcheng
>
>
> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>> Hi,
>>
>> My point was we are where we are and in this juncture there is no
>> collection of statistics for complex columns. That may be a future
>> enhancement.
>>
>> But then the obvious question is how useful or meaningful these
>> statistics is going to be?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 21:03, Michael Häusler <mich...@akatose.de> wrote:
>>
>>> Hi there,
>>>
>>> there might be two topics here:
>>>
>>> 1) feasibility of stats for non-primitive columns
>>> 2) ease of use
>>>
>>>
>>> 1) feasibility of stats for non-primitive columns:
>>>
>>> Hive currently collects different kind of statistics for different kind
>>> of types:
>>> numeric values: min, max, #nulls, #distincts
>>> boolean values: #nulls, #trues, #falses
>>> string values: #nulls, #distincts, avgLength, maxLength
>>>
>>> So, it seems quite possible to also collect at least partial stats for
>>> top-level non-primitive columns, e.g.:
>>> array values: #nulls, #distincts, avgLength, maxLength
>>> map values: #nulls, #distincts, avgLength, maxLength
>>> struct values: #nulls, #distincts
>>> union values: #nulls, #distincts
>>>
>>>
>>> 2) ease of use
>>>
>>> The presence of a single non-primitive column currently breaks the use
>>> of the convenience shorthand to gather statistics for all columns (ANALYZE
>>> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
>>> of column statistics for hive users.
>>>
>>> Best regards
>>> Michael
>>>
>>>
>>>
>>> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> Hi Michael,
>>>
>>> Statistics for columns in Hive are kept in Hive metadata table
>>> tab_col_stats.
>>>
>>> When I am looking at this table in Oracle, I only see statistics for
>>> primitives columns here. STRUCT columns do not have it as a STRUCT column
>>> will have to be broken into its primitive columns.  I don't think Hive has
>>> the means to do that.
>>>
>>> desc tab_col_stats;
>>>  Name
>>> Null?    Type
>>>  ------------------------------------------------------------------------
>>> -------- -------------------------------------------------
>>>  CS_ID
>>> NOT NULL NUMBER
>>>  DB_NAME
>>> NOT NULL VARCHAR2(128)
>>>  TABLE_NAME
>>> NOT NULL VARCHAR2(128)
>>>  COLUMN_NAME
>>> NOT NULL VARCHAR2(1000)
>>>  COLUMN_TYPE
>>> NOT NULL VARCHAR2(128)
>>>  TBL_ID
>>> NOT NULL NUMBER
>>>  LONG_LOW_VALUE
>>> NUMBER
>>>  LONG_HIGH_VALUE
>>> NUMBER
>>>  DOUBLE_LOW_VALUE
>>> NUMBER
>>>  DOUBLE_HIGH_VALUE
>>> NUMBER
>>>  BIG_DECIMAL_LOW_VALUE
>>> VARCHAR2(4000)
>>>  BIG_DECIMAL_HIGH_VALUE
>>> VARCHAR2(4000)
>>>  NUM_NULLS
>>> NOT NULL NUMBER
>>>  NUM_DISTINCTS
>>> NUMBER
>>>  AVG_COL_LEN
>>> NUMBER
>>>  MAX_COL_LEN
>>> NUMBER
>>>  NUM_TRUES
>>> NUMBER
>>>  NUM_FALSES
>>> NUMBER
>>>  LAST_ANALYZED
>>> NOT NULL NUMBER
>>>
>>>
>>>
>>>  So in summary although column type STRUCT do exit, I don't think Hive
>>> can cater for their statistics. Actually I don't think Oracle itself does
>>> it.
>>>
>>> HTH
>>>
>>> P.S. I am on Hive 2 and it does not.
>>>
>>> hive> analyze table foo compute statistics for columns;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>> accepted but array<bigint> is passed.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 14 June 2016 at 09:57, Michael Häusler <mich...@akatose.de> wrote:
>>>
>>>> Hi there,
>>>>
>>>> you can reproduce the messages below with Hive 1.2.1.
>>>>
>>>> Best regards
>>>> Michael
>>>>
>>>>
>>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> which version of Hive are you using?
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 13 June 2016 at 16:00, Michael Häusler <mich...@akatose.de> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>>
>>>>> when testing column statistics I stumbled upon the following error
>>>>> message:
>>>>>
>>>>> DROP TABLE IF EXISTS foo;
>>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>>>> STRUCT<key:STRING,value:STRING>);
>>>>>
>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>> accepted but array<bigint> is passed.
>>>>>
>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>> accepted but struct<key:string,value:string> is passed.
>>>>>
>>>>>
>>>>> 1) Basically, it seems that column statistics don't work for
>>>>> non-primitive types. Are there any workarounds or any plans to change 
>>>>> this?
>>>>>
>>>>> 2) Furthermore, the convenience syntax to compute statistics for all
>>>>> columns does not work as soon as there is a non-supported column. Are 
>>>>> there
>>>>> any plans to change this, so it is easier to compute statistics for all
>>>>> supported columns?
>>>>>
>>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the
>>>>> error message. Especially for wide tables it would be much easier if all
>>>>> non-supported column *names* would be printed.
>>>>>
>>>>>
>>>>> Best regards
>>>>> Michael
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: column statistics for non-primitive types

Reply via email to