Hm,

I am on Hive 2 and it is still the same:

hive> create table testme as select * from oraclehadoop.sales_staging where
1 = 2;

hive> insert into testme select * from sales_staging limit 100000;

hive> desc formatted testme;
OK
# col_name              data_type               comment
prod_id                 bigint
cust_id                 bigint
time_id                 timestamp
channel_id              bigint
promo_id                bigint
quantity_sold           decimal(10,0)
amount_sold             decimal(10,0)
# Detailed Table Information
Database:               test
Owner:                  hduser
CreateTime:             Tue Jun 14 23:00:55 BST 2016
LastAccessTime:         UNKNOWN
Retention:              0
Location:
hdfs://rhes564:9000/user/hive/warehouse/test.db/testme
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   {"BASIC_STATS":"true"}
        numFiles                2
        numRows                 100000
        rawDataSize             3848068
        totalSize               3948068
        transient_lastDdlTime   1465941690

hive> analyze table testme compute statistics for columns;
OK
hive> desc formatted testme;
OK
# col_name              data_type               comment
prod_id                 bigint
cust_id                 bigint
time_id                 timestamp
channel_id              bigint
promo_id                bigint
quantity_sold           decimal(10,0)
amount_sold             decimal(10,0)
# Detailed Table Information
Database:               test
Owner:                  hduser
CreateTime:             Tue Jun 14 23:00:55 BST 2016
LastAccessTime:         UNKNOWN
Retention:              0
Location:
hdfs://rhes564:9000/user/hive/warehouse/test.db/testme
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE
{"BASIC_STATS":"true","COLUMN_STATS":{"prod_id":"true","cust_id":"true","time_id":"true","channel_id":"true","promo_id":"true","quantity_sold":"true","amount_sold":"true"}}
        numFiles                2
        numRows                 100000
        rawDataSize             3848068
        totalSize               3948068
        transient_lastDdlTime   1465941690
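
Incidentally, once column stats have been gathered they can be inspected per
column; recent Hive releases accept a column name after the table name in
DESCRIBE FORMATTED and print the collected min/max/#nulls/#distincts. A
minimal sketch against the testme table above (output layout varies by Hive
version):

```sql
-- Show the column-level statistics gathered by the analyze above.
desc formatted testme prod_id;
```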


Although there is a gain to be made by having up-to-date stats, your quickest
performance buck is going to come from running Hive on the Spark engine
(an order of magnitude) or from using Spark on Hive tables. As ever, your
mileage varies depending on the availability of RAM on your cluster. Having
external indexes visible to the Hive optimizer would also help, but I suppose
that is another discussion.
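
For anyone trying the Hive-on-Spark route, the switch is made per session via
the execution engine setting (this assumes a Spark build compatible with your
Hive version is installed and visible on the cluster; the query below just
reuses the testme table from the transcript):

```sql
-- Run this session's queries on the Spark engine instead of MapReduce.
-- Valid values are mr, tez and spark.
set hive.execution.engine=spark;
select count(1) from testme;
```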

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 14 June 2016 at 22:51, Michael Häusler <mich...@akatose.de> wrote:

> Hi Mich,
>
> as we are still on Hive 1.2.1, it only works like this for basic stats.
> I would welcome it, though, if it worked for column statistics as well -
> and it seems this feature is coming via HIVE-11160.
>
> Best
> Michael
>
> On 2016-06-14, at 23:42, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Hi,
>
> Is this automatic stats update for basic statistics only, or for all columns?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 22:10, Michael Häusler <mich...@akatose.de> wrote:
>
>> Hi Mich,
>>
>> I agree with Pengcheng here. Automatic stats gathering can be extremely
>> useful - and it is configurable.
>>
>> E.g., initial import into Hive happens as CSV or AVRO. Then you might
>> want to do a conversion to ORC within Hive via create-table-as-select. At
>> that point Hive is reading all the records anyway and we might just as well
>> get as much useful information as possible for stats.
>>
>> Best
>> Michael
>>
>>
>> On 2016-06-14, at 23:05, Pengcheng Xiong <pxi...@apache.org> wrote:
>>
>> Hi Mich,
>>
>>     I agree with you that column stats gathering in Hive is not cheap and
>> comes with overheads. This is due to the large volume of data that Hive has
>> to process. However, this is the price you have to pay anyway even with
>> current "analyze table" solution.
>>
>>    The new feature not only provides a way to make users have column
>> stats automatically, but also saves overhead for the "insert into" case. In
>> this case, the new column stats are generated incrementally, i.e., by
>> merging with the existing stats. Without this feature, you have to scan the
>> whole table and compute stats.
>>
>>    In conclusion, this new feature should not have any more overhead than
>> the current solution.
>>
>> Best
>> Pengcheng
>>
>> On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> hi,
>>>
>>> (2) There is a configuration "hive.stats.fetch.column.stats". If you set
>>> it to true, it will automatically collect column stats for you when you
>>> insert into/overwrite a new table. You can refer to HIVE-11160 for more
>>> details.
>>>
>>> Not without its overheads.
>>>
>>> Automatic stats gathering is not new. It has been around for a good while
>>> in RDBMSs and can impact the performance of other queries running, so I am
>>> not sure it can be considered a blessing.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 14 June 2016 at 21:25, Pengcheng Xiong <pxi...@apache.org> wrote:
>>>
>>>> Exactly, "the useful or meaningful these statistics is going to be"
>>>> (The motivation behind).
>>>>
>>>> Best
>>>> Pengcheng
>>>>
>>>>
>>>> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> My point was that we are where we are, and at this juncture there is no
>>>>> collection of statistics for complex columns. That may be a future
>>>>> enhancement.
>>>>>
>>>>> But then the obvious question is: how useful or meaningful are these
>>>>> statistics going to be?
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 14 June 2016 at 21:03, Michael Häusler <mich...@akatose.de> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> there might be two topics here:
>>>>>>
>>>>>> 1) feasibility of stats for non-primitive columns
>>>>>> 2) ease of use
>>>>>>
>>>>>>
>>>>>> 1) feasibility of stats for non-primitive columns:
>>>>>>
>>>>>> Hive currently collects different kind of statistics for different
>>>>>> kind of types:
>>>>>> numeric values: min, max, #nulls, #distincts
>>>>>> boolean values: #nulls, #trues, #falses
>>>>>> string values: #nulls, #distincts, avgLength, maxLength
>>>>>>
>>>>>> So, it seems quite possible to also collect at least partial stats
>>>>>> for top-level non-primitive columns, e.g.:
>>>>>> array values: #nulls, #distincts, avgLength, maxLength
>>>>>> map values: #nulls, #distincts, avgLength, maxLength
>>>>>> struct values: #nulls, #distincts
>>>>>> union values: #nulls, #distincts
>>>>>>
>>>>>>
>>>>>> 2) ease of use
>>>>>>
>>>>>> The presence of a single non-primitive column currently breaks the
>>>>>> use of the convenience shorthand to gather statistics for all columns
>>>>>> (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;). IMHO, this slows
>>>>>> down adoption of column statistics for Hive users.
>>>>>>
>>>>>> Best regards
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> Statistics for columns in Hive are kept in the Hive metastore table
>>>>>> tab_col_stats.
>>>>>>
>>>>>> When I look at this table in Oracle, I only see statistics for
>>>>>> primitive columns here. STRUCT columns do not have them, as a STRUCT
>>>>>> column would have to be broken into its primitive fields, and I don't
>>>>>> think Hive has the means to do that.
>>>>>>
>>>>>> desc tab_col_stats;
>>>>>>  Name                     Null?    Type
>>>>>>  ------------------------ -------- ----------------
>>>>>>  CS_ID                    NOT NULL NUMBER
>>>>>>  DB_NAME                  NOT NULL VARCHAR2(128)
>>>>>>  TABLE_NAME               NOT NULL VARCHAR2(128)
>>>>>>  COLUMN_NAME              NOT NULL VARCHAR2(1000)
>>>>>>  COLUMN_TYPE              NOT NULL VARCHAR2(128)
>>>>>>  TBL_ID                   NOT NULL NUMBER
>>>>>>  LONG_LOW_VALUE                    NUMBER
>>>>>>  LONG_HIGH_VALUE                   NUMBER
>>>>>>  DOUBLE_LOW_VALUE                  NUMBER
>>>>>>  DOUBLE_HIGH_VALUE                 NUMBER
>>>>>>  BIG_DECIMAL_LOW_VALUE             VARCHAR2(4000)
>>>>>>  BIG_DECIMAL_HIGH_VALUE            VARCHAR2(4000)
>>>>>>  NUM_NULLS                NOT NULL NUMBER
>>>>>>  NUM_DISTINCTS                     NUMBER
>>>>>>  AVG_COL_LEN                       NUMBER
>>>>>>  MAX_COL_LEN                       NUMBER
>>>>>>  NUM_TRUES                         NUMBER
>>>>>>  NUM_FALSES                        NUMBER
>>>>>>  LAST_ANALYZED            NOT NULL NUMBER
>>>>>>
>>>>>>
>>>>>>
>>>>>> So in summary, although columns of type STRUCT do exist, I don't think
>>>>>> Hive can cater for their statistics. Actually, I don't think Oracle
>>>>>> itself does it.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> P.S. I am on Hive 2 and it does not.
>>>>>>
>>>>>> hive> analyze table foo compute statistics for columns;
>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>> accepted but array<bigint> is passed.
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 14 June 2016 at 09:57, Michael Häusler <mich...@akatose.de> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> you can reproduce the messages below with Hive 1.2.1.
>>>>>>>
>>>>>>> Best regards
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> which version of Hive are you using?
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 13 June 2016 at 16:00, Michael Häusler <mich...@akatose.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>>
>>>>>>>> when testing column statistics I stumbled upon the following error
>>>>>>>> message:
>>>>>>>>
>>>>>>>> DROP TABLE IF EXISTS foo;
>>>>>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>>>>>>> STRUCT<key:STRING,value:STRING>);
>>>>>>>>
>>>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>>>> accepted but array<bigint> is passed.
>>>>>>>>
>>>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>>>> accepted but struct<key:string,value:string> is passed.
>>>>>>>>
>>>>>>>>
>>>>>>>> 1) Basically, it seems that column statistics don't work for
>>>>>>>> non-primitive types. Are there any workarounds or any plans to change 
>>>>>>>> this?
>>>>>>>>
>>>>>>>> 2) Furthermore, the convenience syntax to compute statistics for
>>>>>>>> all columns does not work as soon as there is a non-supported column. 
>>>>>>>> Are
>>>>>>>> there any plans to change this, so it is easier to compute statistics 
>>>>>>>> for
>>>>>>>> all supported columns?
>>>>>>>>
>>>>>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the
>>>>>>>> error message. Especially for wide tables it would be much easier if 
>>>>>>>> all
>>>>>>>> non-supported column *names* would be printed.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>
