[ 
https://issues.apache.org/jira/browse/HIVE-29312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061014#comment-18061014
 ] 

Stamatis Zampetakis commented on HIVE-29312:
--------------------------------------------

In general, it is best to avoid tables with that many partitions since they 
tend to cause issues in all kind of places but let's leave this aside for the 
moment.

I did some experiments with MySQL 8.4.3 and I do see the impact on the plan 
with the proposed transformation. However, in order to see it I used the JSON 
formatting for displaying the explain (using 
{{{}explain_json_format_version=2{}}}) .
{code:sql}
explain format=json select * from PARTITIONS where TBL_ID=1616 and PART_NAME 
like 'appid=0/%' and PART_NAME like '%/usageid=0/%' and PART_NAME like 
'%/grassregion=AR/%' and PART_NAME like '%/grassdate=2026-02-24/%';
{code}
{code:json}
{
  "query": "...",
  "ranges": [
    "(TBL_ID = 1616 AND 'appid=0/' <= PART_NAME <= 
'appid=0/???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????')"
  ],
  "covering": false,
  "operation": "Index range scan on PARTITIONS using UNIQUEPARTITION over 
(TBL_ID = 1616 AND 'appid=0/' <= PART_NAME <= 
'appid=0/???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????'),
 with index condition: ((`PARTITIONS`.TBL_ID = 1616) and 
(`PARTITIONS`.PART_NAME like 'appid=0/%') and (`PARTITIONS`.PART_NAME like 
'%/usageid=0/%') and (`PARTITIONS`.PART_NAME like '%/grassregion=AR/%') and 
(`PARTITIONS`.PART_NAME like '%/grassdate=2026-02-24/%'))",
  "index_name": "UNIQUEPARTITION",
  "query_type": "select",
  "table_name": "PARTITIONS",
  "access_type": "index",
  "schema_name": "hivedb",
  "used_columns": [
    "PART_ID",
    "CREATE_TIME",
    "LAST_ACCESS_TIME",
    "PART_NAME",
    "SD_ID",
    "TBL_ID",
    "WRITE_ID"
  ],
  "estimated_rows": 1.0,
  "index_access_type": "index_range_scan",
  "estimated_total_cost": 0.71,
  "pushed_index_condition": "((`PARTITIONS`.TBL_ID = 1616) and 
(`PARTITIONS`.PART_NAME like 'appid=0/%') and (`PARTITIONS`.PART_NAME like 
'%/usageid=0/%') and (`PARTITIONS`.PART_NAME like '%/grassregion=AR/%') and 
(`PARTITIONS`.PART_NAME like '%/grassdate=2026-02-24/%'))"
}

{code}
{code:sql}
explain format=json select * from PARTITIONS where TBL_ID=1616 and PART_NAME 
like 'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/%';
{code}
{code:json}
{
  "query": "...",
  "ranges": [
    "(TBL_ID = 1616 AND 
'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/' <= PART_NAME <= 
'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????')"
  ],
  "covering": false,
  "operation": "Index range scan on PARTITIONS using UNIQUEPARTITION over 
(TBL_ID = 1616 AND 'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/' <= 
PART_NAME <= 
'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????'),
 with index condition: ((`PARTITIONS`.TBL_ID = 1616) and 
(`PARTITIONS`.PART_NAME like 
'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/%'))",
  "index_name": "UNIQUEPARTITION",
  "query_type": "select",
  "table_name": "PARTITIONS",
  "access_type": "index",
  "schema_name": "hivedb",
  "used_columns": [
    "PART_ID",
    "CREATE_TIME",
    "LAST_ACCESS_TIME",
    "PART_NAME",
    "SD_ID",
    "TBL_ID",
    "WRITE_ID"
  ],
  "estimated_rows": 1.0,
  "index_access_type": "index_range_scan",
  "estimated_total_cost": 0.71,
  "pushed_index_condition": "((`PARTITIONS`.TBL_ID = 1616) and 
(`PARTITIONS`.PART_NAME like 
'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/%'))"
}
{code}
It becomes evident that when there is a single LIKE predicate the range scan 
boundaries are more specific so it can have an impact in performance.

After checking the plans in MySQL and Postgres it seems that the impact of the 
optimization is somewhat database specific. It does have an effect on MySQL but 
not sure if the same holds for Postgres or other systems.

Another aspect that affects the efficiency of the optimization is the ordering 
of the partitioning columns and their respective NDVs. It seems that if the 
first partitioning column has a high NDV then the transformation proposed here 
is not gonna have a big impact on performance while the for low NDV it really 
matters.

In addition, if the filter condition does not contain a predicate for the first 
partitioning column the whole optimization will not make much of a difference 
even when the underlying DBMS is MySQL. If the LIKE predicate starts with a 
wildcard then the index range scan will not be usable. Similar observations 
hold when the filter condition does not have predicates on every partitioning 
column.

Consider for example a table partitioned by (a,b,c,d) and partitioning filter 
that says (a=1 and d=1); the generated SQL query will be PART_NAME LIKE 'a=1/%' 
AND PART_NAME LIKE '%/d=1' and the only way to collapse this is PART_NAME LIKE 
'a=1%d=1' but the latter is not gonna make any difference for the index range 
scan.

I am now starting to understand better the impact of the proposed optimization 
and indeed there seem to exist some use-cases that can benefit from it. 
Nevertheless, there are many preconditions that need to hold in order to 
observe a noticable performance gain so I haven't fully made up my mind on how 
we should proceed. My initial feeling is that this kind of optimizations are 
pretty low-level and we shouldn't worry too much at the Hive/HMS level.
There is a trade-off between code simplicity/maintenance & performance so we 
need to carefully weight the pros & cons of each side.

[~wechar] It would be nice to share a few insights on the use-case that 
motivated this work in order to understand better how many end-users can 
benefit from it.

> Concatenate equality conditions in AND nodes
> --------------------------------------------
>
>                 Key: HIVE-29312
>                 URL: https://issues.apache.org/jira/browse/HIVE-29312
>             Project: Hive
>          Issue Type: Improvement
>          Components: Standalone Metastore
>    Affects Versions: 4.1.0
>            Reporter: Wechar
>            Assignee: Wechar
>            Priority: Major
>              Labels: pull-request-available
>
> Current partition filtert tree is a binary tree with single condition in the 
> leaf node, actually the adjacent AND leaf nodes can be combined together to 
> match longer index prefix. For example, a table has partition key: a, b, c, 
> d, and a filter is {{a=1 and b=1 and c=1}}, when visit as a binary tree 
> *AND(AND(a=1, b=1), c=1)*,  the filter is:
> {code:sql}
> PART_NAME like 'a=1/%' and PART_NAME like '%/b=2/%' and PART_NAME like 
> '%/c=3/%'
> {code}
> If we combine the adjacent AND conditions, the tree will be *MultiAnd(a=1, 
> b=2, c=3)*, aand the filte can be:
> {code:bash}
> PART_NAME like 'a=1/b=2/c=3/%'
> {code}
> Obviously, the latter could match a longer index prefix and is more efficient.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to