Hi Pau and Sungwoo,

Thanks for sharing the information.

We tested a set of simple queries which just referenced the Hive table and
didn't execute any Hive jobs. The result is below.

No. Version rawstore.impl connectionPoolingType HIVE-14187 QueryTime
--------------------------------------------------
1   1.2.1   ObjectStore    None                Not Applied 11:38
2   3.1.3   ObjectStore    None                Applied     34:00
3   3.1.3   CachedStore    None                Applied     25:00
4   3.1.3   ObjectStore    HikariCP            Applied     21:10
5   3.1.3   CachedStore    HikariCP            Applied     14:30
6   3.1.3   ObjectStore    None                Reverted    13:00
7   3.1.3   ObjectStore    HikariCP            Reverted    11:23
--------------------------------------------------

Initially, we encountered an issue of Hive MetaStore slowness when we
upgraded from environment No.1 to No.2. As shown in the table, environment
No.2 showed the worst test results.

A unique aspect of our environment is that we don't use connection pooling.
After some investigation, we thought that the combination of HIVE-14187 and
connectionPoolingType=None was negatively impacting performance.
The fastest case in our tests was when we reverted HIVE-14187 and set
connectionPoolingType=HikariCP (see No.7). Even with connectionPoolingType
set to None, the environment where we reverted HIVE-14187 still performed
reasonably well (see No.6).

Please note our investigation is still ongoing and we haven't yet come to a
conclusion.

Regards,
- Takanobu

2024年2月29日(木) 12:18 Sungwoo Park <glap...@gmail.com>:

> We didn't make any other attempt to fix the problem and just decided not
> to use CachedStore. However, I think our installation of Metastore based on
> Hive 3.1.3 is running without any serious problems.
>
> Could you share how long it takes to compile typical queries in your
> environment (with Hive 1 and with Hive 3)?
>
> FYI, in our environment, sometimes it takes about 10 seconds to compile a
> query on TPC-DS 10TB datasets. Specifically, the average compilation time
> of 103 queries is 1.7 seconds (as reported by Hive), and the longest
> compilation time is 9.6 seconds (query 49). The compilation time includes
> the time for accessing Metastore.
>
> Thanks,
>
> --- Sungwoo
>
>
> On Wed, Feb 28, 2024 at 9:59 PM Takanobu Asanuma <tasan...@apache.org>
> wrote:
>
>> Thanks for your detailed answer!
>>
>> In the original email, you reported "the query compilation takes long" in
>> Hive 3.0, but has this issue been resolved in your fork of Hive 3.1.3?
>> Thank you for sharing the issue with CachedStore and the JIRA tickets.
>> I will also try out metastore.stats.fetch.bitvector=true.
>>
>> Regards,
>> - Takanobu
>>
>> 2024年2月28日(水) 18:49 Sungwoo Park <glap...@gmail.com>:
>>
>>> Hello Takanobu,
>>>
>>> We did not test with vanilla Hive 3.1.3 and Metastore databases can be
>>> different, so I don't know why Metastore responses are very slow. I can
>>> only share some results of testing CachedStore in Metastore. Please note
>>> that we did not use vanilla Hive 3.1.3 and instead used our own fork of
>>> Hive 3.1.3 (which applies many additional patches).
>>>
>>> 1.
>>> When CachedStore is enabled, column stats are not computed. As a result,
>>> some queries generate very inefficient plans because of wrong/inaccurate
>>> stats.
>>>
>>> Perhaps this is because not all patches for CachedStore have been merged
>>> to Hive 3.1.3. For example, these patches are not merged. Or, there might
>>> be some way to properly configure CachedStore so that it correctly computes
>>> column stats.
>>>
>>> HIVE-20896: CachedStore fail to cache stats in multiple code paths
>>> HIVE-21063: Support statistics in cachedStore for transactional table
>>> HIVE-24258: Data mismatch between CachedStore and ObjectStore for
>>> constraint
>>>
>>> So, we decided that CachedStore should not be enabled in Hive 3.1.3.
>>>
>>> (If anyone is running Hive Metastore 3.1.3 in production with
>>> CachedStore enabled, please let us know how you configure it.)
>>>
>>> 2.
>>> Setting metastore.stats.fetch.bitvector=true can also help generate more
>>> efficient query plans.
>>>
>>> --- Sungwoo
>>>
>>>
>>> On Wed, Feb 28, 2024 at 1:40 PM Takanobu Asanuma <tasan...@apache.org>
>>> wrote:
>>>
>>>> Hi Sungwoo Park,
>>>>
>>>> I'm sorry for the late reply to this old email.
>>>> We are attempting to upgrade Hive MetaStore from Hive1 to Hive3, and
>>>> noticed that the response of the Hive3 MetaStore is very slow.
>>>> We suspect that HIVE-14187 might be causing this slowness.
>>>> Could you tell me if you have resolved this problem? Are there still
>>>> any problems when you enable CachedStore?
>>>>
>>>> Regards,
>>>> - Takanobu
>>>>
>>>> 2018年6月13日(水) 0:37 Sungwoo Park <glap...@gmail.com>:
>>>>
>>>>> Hello Hive users,
>>>>>
>>>>> I am experience a problem with MetaStore in Hive 3.0.
>>>>>
>>>>> 1. Start MetaStore
>>>>> with 
>>>>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.ObjectStore.
>>>>>
>>>>> 2. Generate TPC-DS data.
>>>>>
>>>>> 3. TPC-DS queries run okay and produce correct results. E.g., from
>>>>> query 1:
>>>>> +-------------------+
>>>>> |   c_customer_id   |
>>>>> +-------------------+
>>>>> | AAAAAAAAAAAACHAA  |
>>>>> | AAAAAAAAAAAADCAA  |
>>>>> | AAAAAAAAAAAADDAA  |
>>>>> ...
>>>>> | AAAAAAAAAAAILIAA  |
>>>>> +-------------------+
>>>>> 100 rows selected (69.901 seconds)
>>>>>
>>>>> However, the query compilation takes long (
>>>>> https://issues.apache.org/jira/browse/HIVE-16520).
>>>>>
>>>>> 4. Now, restart MetaStore with
>>>>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.cache.CachedStore.
>>>>>
>>>>> 5. TPC-DS queries run okay, but produce wrong results. E.g, from query
>>>>> 1:
>>>>> +----------------+
>>>>> | c_customer_id  |
>>>>> +----------------+
>>>>> +----------------+
>>>>> No rows selected (37.448 seconds)
>>>>>
>>>>> What I noticed is that with hive.metastore.rawstore.impl=CachedStore,
>>>>> HiveServer2 produces such log messages:
>>>>>
>>>>> 2018-06-12T23:50:04,223  WARN [b3041385-0290-492f-aef8-c0249de328ad
>>>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
>>>>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
>>>>> 2018-06-12T23:50:04,223  INFO [b3041385-0290-492f-aef8-c0249de328ad
>>>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>>>>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
>>>>> 2018-06-12T23:50:04,225  WARN [b3041385-0290-492f-aef8-c0249de328ad
>>>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
>>>>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
>>>>> 2018-06-12T23:50:04,225  INFO [b3041385-0290-492f-aef8-c0249de328ad
>>>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>>>>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
>>>>> 2018-06-12T23:50:04,226  WARN [b3041385-0290-492f-aef8-c0249de328ad
>>>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
>>>>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
>>>>> c_customer_id
>>>>> 2018-06-12T23:50:04,226  INFO [b3041385-0290-492f-aef8-c0249de328ad
>>>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>>>>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
>>>>> c_customer_id
>>>>>
>>>>> 2018-06-12T23:50:05,158 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>>>>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>>>>> Invalid column stats: No of nulls > cardinality
>>>>> 2018-06-12T23:50:05,159 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>>>>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>>>>> Invalid column stats: No of nulls > cardinality
>>>>> 2018-06-12T23:50:05,160 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>>>>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>>>>> Invalid column stats: No of nulls > cardinality
>>>>>
>>>>> However, even after computing column stats, queries still return wrong
>>>>> results, despite the fact that the above log messages disappear.
>>>>>
>>>>> I guess I am missing some configuration parameters (because I imported
>>>>> hive-site.xml from Hive 2). Any suggestion would be appreciated.
>>>>>
>>>>> Thanks a lot,
>>>>>
>>>>> --- Sungwoo Park
>>>>>
>>>>>

Reply via email to