I should have mentioned earlier, but we encountered our problem with
queries between Trino and Hive3 MetaStore.
The tests I reported were also querying Hive1/3 MetaStore using Trino. The
problem might only exist between Trino and Hive3 MetaStore.

- Takanobu

2024年3月1日(金) 14:53 Takanobu Asanuma <tasan...@apache.org>:

> Yes, for now, we believe that HIVE-14187 has caused performance
> degradation in Hive3 MetaStore.
>
> We also use HiveServer2, but our HiveServer2 directly accesses the backend
> DB without going through the Hive MetaStore, because it enhances
> performance to directly access the DB in a heavily loaded cluster.
> Therefore, we might not encounter the issue of HIVE-20600. We provide the
> Hive MetaStore only for Trino/SparkSQL/etc.
>
> - Takanobu
>
> 2024年3月1日(金) 11:32 Sungwoo Park <glap...@gmail.com>:
>
>> Thank you for sharing the result. (Does your result imply that HIVE-14187
>> is introducing an intended bug?)
>>
>> Another issue that could be of your interest is the connection leak
>> problem reported in HIVE-20600. Do you see the connection leak problem, or
>> is it not relevant to your environment (e.g., because you don't use
>> HiveServer2)?
>>
>> --- Sungwoo
>>
>> On Fri, Mar 1, 2024 at 9:45 AM Takanobu Asanuma <
>> takanobu.asan...@gmail.com> wrote:
>>
>>> Hi Pau and Sungwoo,
>>>
>>> Thanks for sharing the information.
>>>
>>> We tested a set of simple queries which just referenced the Hive table
>>> and didn't execute any Hive jobs. The result is below.
>>>
>>> No. Version rawstore.impl connectionPoolingType HIVE-14187 QueryTime
>>> --------------------------------------------------
>>> 1   1.2.1   ObjectStore    None                Not Applied 11:38
>>> 2   3.1.3   ObjectStore    None                Applied     34:00
>>> 3   3.1.3   CachedStore    None                Applied     25:00
>>> 4   3.1.3   ObjectStore    HikariCP            Applied     21:10
>>> 5   3.1.3   CachedStore    HikariCP            Applied     14:30
>>> 6   3.1.3   ObjectStore    None                Reverted    13:00
>>> 7   3.1.3   ObjectStore    HikariCP            Reverted    11:23
>>> --------------------------------------------------
>>>
>>> Initially, we encountered an issue of Hive MetaStore slowness when we
>>> upgraded from environment No.1 to No.2. As shown in the table, environment
>>> No.2 showed the worst test results.
>>>
>>> A unique aspect of our environment is that we don't use connection
>>> pooling. After some investigation, we thought that the combination of
>>> HIVE-14187 and connectionPoolingType=None was negatively impacting
>>> performance.
>>> The fastest case in our tests was when we reverted HIVE-14187 and set
>>> connectionPoolingType=HikariCP (see No.7). Even with connectionPoolingType
>>> set to None, the environment where we reverted HIVE-14187 still performed
>>> reasonably well (see No.6).
>>>
>>> Please note our investigation is still ongoing and we haven't yet come
>>> to a conclusion.
>>>
>>> Regards,
>>> - Takanobu
>>>
>>> 2024年2月29日(木) 12:18 Sungwoo Park <glap...@gmail.com>:
>>>
>>>> We didn't make any other attempt to fix the problem and just decided
>>>> not to use CachedStore. However, I think our installation of Metastore
>>>> based on Hive 3.1.3 is running without any serious problems.
>>>>
>>>> Could you share how long it takes to compile typical queries in your
>>>> environment (with Hive 1 and with Hive 3)?
>>>>
>>>> FYI, in our environment, sometimes it takes about 10 seconds to compile
>>>> a query on TPC-DS 10TB datasets. Specifically, the average compilation time
>>>> of 103 queries is 1.7 seconds (as reported by Hive), and the longest
>>>> compilation time is 9.6 seconds (query 49). The compilation time includes
>>>> the time for accessing Metastore.
>>>>
>>>> Thanks,
>>>>
>>>> --- Sungwoo
>>>>
>>>>
>>>> On Wed, Feb 28, 2024 at 9:59 PM Takanobu Asanuma <tasan...@apache.org>
>>>> wrote:
>>>>
>>>>> Thanks for your detailed answer!
>>>>>
>>>>> In the original email, you reported "the query compilation takes long"
>>>>> in Hive 3.0, but has this issue been resolved in your fork of Hive 3.1.3?
>>>>> Thank you for sharing the issue with CachedStore and the JIRA tickets.
>>>>> I will also try out metastore.stats.fetch.bitvector=true.
>>>>>
>>>>> Regards,
>>>>> - Takanobu
>>>>>
>>>>> 2024年2月28日(水) 18:49 Sungwoo Park <glap...@gmail.com>:
>>>>>
>>>>>> Hello Takanobu,
>>>>>>
>>>>>> We did not test with vanilla Hive 3.1.3 and Metastore databases can
>>>>>> be different, so I don't know why Metastore responses are very slow. I 
>>>>>> can
>>>>>> only share some results of testing CachedStore in Metastore. Please note
>>>>>> that we did not use vanilla Hive 3.1.3 and instead used our own fork of
>>>>>> Hive 3.1.3 (which applies many additional patches).
>>>>>>
>>>>>> 1.
>>>>>> When CachedStore is enabled, column stats are not computed. As a
>>>>>> result, some queries generate very inefficient plans because of
>>>>>> wrong/inaccurate stats.
>>>>>>
>>>>>> Perhaps this is because not all patches for CachedStore have been
>>>>>> merged to Hive 3.1.3. For example, these patches are not merged. Or, 
>>>>>> there
>>>>>> might be some way to properly configure CachedStore so that it correctly
>>>>>> computes column stats.
>>>>>>
>>>>>> HIVE-20896: CachedStore fail to cache stats in multiple code paths
>>>>>> HIVE-21063: Support statistics in cachedStore for transactional table
>>>>>> HIVE-24258: Data mismatch between CachedStore and ObjectStore for
>>>>>> constraint
>>>>>>
>>>>>> So, we decided that CachedStore should not be enabled in Hive 3.1.3.
>>>>>>
>>>>>> (If anyone is running Hive Metastore 3.1.3 in production with
>>>>>> CachedStore enabled, please let us know how you configure it.)
>>>>>>
>>>>>> 2.
>>>>>> Setting metastore.stats.fetch.bitvector=true can also help generate
>>>>>> more efficient query plans.
>>>>>>
>>>>>> --- Sungwoo
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 28, 2024 at 1:40 PM Takanobu Asanuma <tasan...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Sungwoo Park,
>>>>>>>
>>>>>>> I'm sorry for the late reply to this old email.
>>>>>>> We are attempting to upgrade Hive MetaStore from Hive1 to Hive3, and
>>>>>>> noticed that the response of the Hive3 MetaStore is very slow.
>>>>>>> We suspect that HIVE-14187 might be causing this slowness.
>>>>>>> Could you tell me if you have resolved this problem? Are there still
>>>>>>> any problems when you enable CachedStore?
>>>>>>>
>>>>>>> Regards,
>>>>>>> - Takanobu
>>>>>>>
>>>>>>> 2018年6月13日(水) 0:37 Sungwoo Park <glap...@gmail.com>:
>>>>>>>
>>>>>>>> Hello Hive users,
>>>>>>>>
>>>>>>>> I am experience a problem with MetaStore in Hive 3.0.
>>>>>>>>
>>>>>>>> 1. Start MetaStore
>>>>>>>> with 
>>>>>>>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.ObjectStore.
>>>>>>>>
>>>>>>>> 2. Generate TPC-DS data.
>>>>>>>>
>>>>>>>> 3. TPC-DS queries run okay and produce correct results. E.g., from
>>>>>>>> query 1:
>>>>>>>> +-------------------+
>>>>>>>> |   c_customer_id   |
>>>>>>>> +-------------------+
>>>>>>>> | AAAAAAAAAAAACHAA  |
>>>>>>>> | AAAAAAAAAAAADCAA  |
>>>>>>>> | AAAAAAAAAAAADDAA  |
>>>>>>>> ...
>>>>>>>> | AAAAAAAAAAAILIAA  |
>>>>>>>> +-------------------+
>>>>>>>> 100 rows selected (69.901 seconds)
>>>>>>>>
>>>>>>>> However, the query compilation takes long (
>>>>>>>> https://issues.apache.org/jira/browse/HIVE-16520).
>>>>>>>>
>>>>>>>> 4. Now, restart MetaStore with
>>>>>>>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.cache.CachedStore.
>>>>>>>>
>>>>>>>> 5. TPC-DS queries run okay, but produce wrong results. E.g, from
>>>>>>>> query 1:
>>>>>>>> +----------------+
>>>>>>>> | c_customer_id  |
>>>>>>>> +----------------+
>>>>>>>> +----------------+
>>>>>>>> No rows selected (37.448 seconds)
>>>>>>>>
>>>>>>>> What I noticed is that with hive.metastore.rawstore.impl=CachedStore,
>>>>>>>> HiveServer2 produces such log messages:
>>>>>>>>
>>>>>>>> 2018-06-12T23:50:04,223  WARN [b3041385-0290-492f-aef8-c0249de328ad
>>>>>>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats 
>>>>>>>> for
>>>>>>>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
>>>>>>>> 2018-06-12T23:50:04,223  INFO [b3041385-0290-492f-aef8-c0249de328ad
>>>>>>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>>>>>>>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
>>>>>>>> 2018-06-12T23:50:04,225  WARN [b3041385-0290-492f-aef8-c0249de328ad
>>>>>>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats 
>>>>>>>> for
>>>>>>>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
>>>>>>>> 2018-06-12T23:50:04,225  INFO [b3041385-0290-492f-aef8-c0249de328ad
>>>>>>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>>>>>>>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
>>>>>>>> 2018-06-12T23:50:04,226  WARN [b3041385-0290-492f-aef8-c0249de328ad
>>>>>>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats 
>>>>>>>> for
>>>>>>>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
>>>>>>>> c_customer_id
>>>>>>>> 2018-06-12T23:50:04,226  INFO [b3041385-0290-492f-aef8-c0249de328ad
>>>>>>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>>>>>>>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
>>>>>>>> c_customer_id
>>>>>>>>
>>>>>>>> 2018-06-12T23:50:05,158 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>>>>>>>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>>>>>>>> Invalid column stats: No of nulls > cardinality
>>>>>>>> 2018-06-12T23:50:05,159 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>>>>>>>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>>>>>>>> Invalid column stats: No of nulls > cardinality
>>>>>>>> 2018-06-12T23:50:05,160 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>>>>>>>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>>>>>>>> Invalid column stats: No of nulls > cardinality
>>>>>>>>
>>>>>>>> However, even after computing column stats, queries still return
>>>>>>>> wrong results, despite the fact that the above log messages disappear.
>>>>>>>>
>>>>>>>> I guess I am missing some configuration parameters (because I
>>>>>>>> imported hive-site.xml from Hive 2). Any suggestion would be 
>>>>>>>> appreciated.
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>>
>>>>>>>> --- Sungwoo Park
>>>>>>>>
>>>>>>>>

Reply via email to