Hi Mich,

I agree with Pengcheng here. Automatic stats gathering can be extremely useful 
- and it is configurable.

E.g., an initial import into Hive often happens as CSV or Avro. Then you might 
want to convert to ORC within Hive via create-table-as-select. At that point 
Hive is reading all the records anyway, so we might just as well gather as much 
useful information for stats as possible.
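
A minimal sketch of that flow, assuming the auto-gather switch from HIVE-11160 
(hive.stats.column.autogather); the table names and staging location are made up:

SET hive.stats.column.autogather=true;

-- hypothetical external table over the raw CSV import
CREATE EXTERNAL TABLE events_csv (id BIGINT, category STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/staging/events';

-- convert to ORC via create-table-as-select; Hive reads every record here
-- anyway, so column stats for the new table can be gathered in the same pass
CREATE TABLE events STORED AS ORC
AS SELECT * FROM events_csv;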

Best
Michael


> On 2016-06-14, at 23:05, Pengcheng Xiong <pxi...@apache.org> wrote:
> 
> Hi Mich,
> 
>     I agree with you that column stats gathering in Hive is not cheap and 
> comes with overheads. This is due to the large volume of data that Hive has 
> to process. However, this is the price you have to pay anyway, even with the 
> current "analyze table" solution.
> 
>    The new feature not only gives users column stats automatically, but also 
> saves overhead in the "insert into" case. There, the new column stats are 
> generated incrementally, i.e., by merging them with the existing stats. 
> Without this feature, you would have to scan the whole table and recompute 
> the stats. 
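> 
> A hypothetical illustration of the difference (table names made up; the 
> auto-gather switch is hive.stats.column.autogather from HIVE-11160):
> 
> -- with auto-gathering on, each insert merges the new column stats
> -- into the existing ones instead of forcing a full rescan
> SET hive.stats.column.autogather=true;
> INSERT INTO TABLE events SELECT * FROM events_staging;
> 
> -- without it, refreshing column stats means scanning the whole table again
> ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS;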
> 
>    In conclusion, this new feature should not have any more overhead than the 
> current solution.  
> 
> Best
> Pengcheng
> 
> On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> hi,
> 
> (2) There is a configuration "hive.stats.column.autogather". If you set it 
> to true, Hive will automatically collect column stats for you when you insert 
> into/overwrite a table. You can refer to HIVE-11160 for more details.
> 
> Not without its overheads.
> 
> Automatic stats gathering is not new. It has been around for a long time in 
> RDBMSs and can impact the performance of other running queries. So I am not 
> sure it can be considered a blessing.
> 
> HTH
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
> 
> On 14 June 2016 at 21:25, Pengcheng Xiong <pxi...@apache.org> wrote:
> Exactly, "how useful or meaningful these statistics are going to be" (that is 
> the motivation behind this).
> 
> Best
> Pengcheng
> 
> 
> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi,
> 
> My point was that we are where we are, and at this juncture there is no 
> collection of statistics for complex columns. That may be a future enhancement.
> 
> But then the obvious question is how useful or meaningful these statistics are 
> going to be?
> 
> HTH 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
> 
> On 14 June 2016 at 21:03, Michael Häusler <mich...@akatose.de> wrote:
> Hi there,
> 
> there might be two topics here:
> 
> 1) feasibility of stats for non-primitive columns
> 2) ease of use
> 
> 
> 1) feasibility of stats for non-primitive columns:
> 
> Hive currently collects different kinds of statistics for different kinds of 
> types:
> numeric values:   min, max, #nulls, #distincts
> boolean values:   #nulls, #trues, #falses
> string values:    #nulls, #distincts, avgLength, maxLength
> 
> So, it seems quite possible to also collect at least partial stats for 
> top-level non-primitive columns, e.g.:
> array values:     #nulls, #distincts, avgLength, maxLength
> map values:       #nulls, #distincts, avgLength, maxLength
> struct values:    #nulls, #distincts
> union values:     #nulls, #distincts
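> 
> For an array column, a rough manual approximation of such partial stats is 
> already possible today (foo/bar are the example table and column from the 
> original mail further down; "length" is taken as the number of elements):
> 
> SELECT count(*) - count(bar)                 AS num_nulls,
>        avg(IF(bar IS NULL, NULL, size(bar))) AS avg_length,
>        max(IF(bar IS NULL, NULL, size(bar))) AS max_length
> FROM   foo;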
> 
> 
> 2) ease of use
> 
> The presence of a single non-primitive column currently breaks the use of the 
> convenience shorthand to gather statistics for all columns (ANALYZE TABLE foo 
> COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption of column 
> statistics for Hive users.
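> 
> Until that changes, the workaround seems to be listing the supported columns 
> explicitly, which gets tedious for wide tables (again using the example table 
> from the original mail):
> 
> -- works: only the primitive column is named
> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foo;
> 
> -- fails as soon as a non-primitive column is included
> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foo, bar;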
> 
> Best regards
> Michael
> 
> 
> 
>> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> 
>> Hi Michael,
>> 
>> Statistics for columns in Hive are kept in the Hive metastore table 
>> tab_col_stats.
>> 
>> When I am looking at this table in Oracle, I only see statistics for 
>> primitive columns there. STRUCT columns do not have them, as a STRUCT column 
>> would have to be broken into its primitive columns. I don't think Hive has 
>> the means to do that.
>> 
>> desc tab_col_stats;
>>  Name                     Null?    Type
>>  ------------------------ -------- --------------
>>  CS_ID                    NOT NULL NUMBER
>>  DB_NAME                  NOT NULL VARCHAR2(128)
>>  TABLE_NAME               NOT NULL VARCHAR2(128)
>>  COLUMN_NAME              NOT NULL VARCHAR2(1000)
>>  COLUMN_TYPE              NOT NULL VARCHAR2(128)
>>  TBL_ID                   NOT NULL NUMBER
>>  LONG_LOW_VALUE                    NUMBER
>>  LONG_HIGH_VALUE                   NUMBER
>>  DOUBLE_LOW_VALUE                  NUMBER
>>  DOUBLE_HIGH_VALUE                 NUMBER
>>  BIG_DECIMAL_LOW_VALUE             VARCHAR2(4000)
>>  BIG_DECIMAL_HIGH_VALUE            VARCHAR2(4000)
>>  NUM_NULLS                NOT NULL NUMBER
>>  NUM_DISTINCTS                     NUMBER
>>  AVG_COL_LEN                       NUMBER
>>  MAX_COL_LEN                       NUMBER
>>  NUM_TRUES                         NUMBER
>>  NUM_FALSES                        NUMBER
>>  LAST_ANALYZED            NOT NULL NUMBER
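>> 
>> As a side note, one way to peek at what the metastore actually holds is to 
>> query that table directly, e.g. (against the Oracle metastore schema; the 
>> db/table names are just an example):
>> 
>> SELECT column_name, column_type, num_nulls, num_distincts
>> FROM   tab_col_stats
>> WHERE  db_name = 'default' AND table_name = 'foo';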
>> 
>> 
>> 
>>  So in summary, although the column type STRUCT does exist, I don't think 
>> Hive can cater for its statistics. Actually, I don't think Oracle itself 
>> does it either.
>> 
>> HTH
>> 
>> P.S. I am on Hive 2 and it does not.
>> 
>> hive> analyze table foo compute statistics for columns;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted 
>> but array<bigint> is passed.
>> 
>> 
>> Dr Mich Talebzadeh
>> 
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> 
>> http://talebzadehmich.wordpress.com
>> 
>> On 14 June 2016 at 09:57, Michael Häusler <mich...@akatose.de> wrote:
>> Hi there,
>> 
>> you can reproduce the messages below with Hive 1.2.1.
>> 
>> Best regards
>> Michael
>> 
>> 
>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> 
>>> which version of Hive are you using?
>>> 
>>> Dr Mich Talebzadeh
>>> 
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> 
>>> http://talebzadehmich.wordpress.com
>>> 
>>> On 13 June 2016 at 16:00, Michael Häusler <mich...@akatose.de> wrote:
>>> Hi there,
>>> 
>>> 
>>> when testing column statistics, I stumbled upon the following error messages:
>>> 
>>> DROP TABLE IF EXISTS foo;
>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar 
>>> STRUCT<key:STRING,value:STRING>);
>>> 
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted 
>>> but array<bigint> is passed.
>>> 
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted 
>>> but struct<key:string,value:string> is passed.
>>> 
>>> 
>>> 1) Basically, it seems that column statistics don't work for non-primitive 
>>> types. Are there any workarounds or any plans to change this?
>>> 
>>> 2) Furthermore, the convenience syntax to compute statistics for all 
>>> columns does not work as soon as there is an unsupported column. Are there 
>>> any plans to change this, so it is easier to compute statistics for all 
>>> supported columns?
>>> 
>>> 3) ANALYZE TABLE only reports the first failing *type* in the error 
>>> message. Especially for wide tables, it would be much easier if all 
>>> unsupported column *names* were printed.
>>> 
>>> 
>>> Best regards
>>> Michael
>>> 
>>> 
>> 
>> 
> 
> 