[ https://issues.apache.org/jira/browse/HIVE-28094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Soumyakanti Das updated HIVE-28094:
-----------------------------------
    Description:
Currently, we cache calls to the {{getTableInternal}} method in the HMS client cache and the query cache. We also cache table ids in the query cache, but not in the HMS client cache.

To cache {{getTableInternal}}, we create a CacheKey containing the {{GetTableRequest}} object. However, we do not check whether all the necessary fields are set in the key. This results in a lot of cache misses, especially because we rely on {{validWriteIdList}} not being null and {{tableId}} not being -1. The {{GetTableRequest}} object also contains {{catName}}, which is not always set. Together, these issues create duplicate keys and prevent the caches from being used efficiently.

Moreover, {{getTableInternal}} is called from other APIs that are also cached, e.g. {{getPartitionsByExprInternal}}, so improving its performance will benefit those APIs too.

*RESULTS:*
I ran all TPCDS explain cbo queries on my local machine, after cherry-picking [HIVE-28083: Enable HMS client cache and HMS query cache for Explain plans|https://github.com/apache/hive/pull/5092/commits/41a766d6a51480edb505fd53661a03c63ef3937a]. Then I analyzed the logs with a simple python script to get the min, 25th percentile, median, 75th percentile, and max for PERFLOG logs matching this pattern:
{code:java}
</PERFLOG method=(\w+) start=\d+ end=\d+ duration=(\d+) from=.* HS2-cache>
{code}
Here are the results.
*WITHOUT the improvements to {{getTableInternal}} method:*
|*API*|*MIN*|*25th*|*MEDIAN*|*75th*|*MAX*|
|*getTable*|2|3|3|4|233|
|*getTableConstraints*|2|4|4|5|22|
|*getPartitionsByExpr*|19|22|25|27|2396|
|*getAggrColStatsFor*|0|125.5|186|284|910|
|*getTableColumnStatistics*|4|6|7|8|454|

Cache Stats:
{code:java}
CacheStats{hitCount=77464, missCount=11919, loadSuccessCount=0, loadFailureCount=0, totalLoadTime=0, evictionCount=0, evictionWeight=0}
{code}

*WITH the improvements to {{getTableInternal}} method:*
|*API*|*MIN*|*25th*|*MEDIAN*|*75th*|*MAX*|
|*getTable*|0|0|0|0|33|
|*getTableConstraints*|3|4|4|5|20|
|*getPartitionsByExpr*|14|16|19|21|2247|
|*getAggrColStatsFor*|0|124.5|187|272.5|936|
|*getTableColumnStatistics*|0|0|0|1|16|

Cache Stats:
{code:java}
CacheStats{hitCount=81044, missCount=11943, loadSuccessCount=0, loadFailureCount=0, totalLoadTime=0, evictionCount=0, evictionWeight=0}
{code}

We can see that latency improves for several APIs (most clearly {{getTable}}, {{getPartitionsByExpr}}, and {{getTableColumnStatistics}}) and that the cache {{hitCount}} increases from 77464 to 81044 with this patch.

> Improve HMS client cache and query cache performance for getTableInternal
> -------------------------------------------------------------------------
>
>                 Key: HIVE-28094
>                 URL: https://issues.apache.org/jira/browse/HIVE-28094
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive
>    Affects Versions: 4.0.0-beta-1
>            Reporter: Soumyakanti Das
>            Assignee: Soumyakanti Das
>            Priority: Major
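
The Python script used for the log analysis under RESULTS is not attached to the issue. As a rough illustration only, the sketch below shows one way such a script might work: it applies the PERFLOG pattern from the description, groups durations by method name, and reports min, 25th percentile, median, 75th percentile, and max. The function names, the sample log lines, and the use of the "inclusive" interpolation method are assumptions made for this example, not the author's actual code.

```python
import re
import statistics
from collections import defaultdict

# Pattern taken from the issue description. Group 1 is the API (method) name,
# group 2 is the duration reported in the PERFLOG line.
PERFLOG_RE = re.compile(
    r"</PERFLOG method=(\w+) start=\d+ end=\d+ duration=(\d+) from=.* HS2-cache>"
)


def collect_durations(lines):
    """Group PERFLOG durations by method name; non-matching lines are skipped."""
    durations = defaultdict(list)
    for line in lines:
        match = PERFLOG_RE.search(line)
        if match:
            durations[match.group(1)].append(int(match.group(2)))
    return durations


def summarize(values):
    """Return (min, 25th percentile, median, 75th percentile, max)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        # Avoid calling quantiles() on a single data point.
        only = ordered[0]
        return (only, only, only, only, only)
    # Assumption: linear interpolation between closest ranks ("inclusive").
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return (ordered[0], q1, q2, q3, ordered[-1])


# Hypothetical sample lines, invented for this example (not real HS2 logs).
SAMPLE_LINES = [
    "</PERFLOG method=getTable start=100 end=101 duration=1 from=some.Client HS2-cache>",
    "</PERFLOG method=getTable start=200 end=202 duration=2 from=some.Client HS2-cache>",
    "</PERFLOG method=getTable start=300 end=303 duration=3 from=some.Client HS2-cache>",
    "</PERFLOG method=getTable start=400 end=404 duration=4 from=some.Client HS2-cache>",
    "</PERFLOG method=getTable start=500 end=505 duration=5 from=some.Client HS2-cache>",
    "</PERFLOG method=getTableConstraints start=600 end=607 duration=7 from=some.Client HS2-cache>",
    "an unrelated log line that should be ignored",
]

if __name__ == "__main__":
    for method, values in sorted(collect_durations(SAMPLE_LINES).items()):
        print(method, summarize(values))
```

To analyze a real log, the lines of the file could be passed to collect_durations in place of SAMPLE_LINES. A different percentile definition would give slightly different 25th and 75th values, so the choice of method matters when comparing against the tables above.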