On Fri, 20 Dec 2024 03:22:26 +0900
Yugo Nagata <nag...@sraoss.co.jp> wrote:

> Hi,
> 
> Currently, btree indexes cannot used for ILIKE (~~*) operator if the pattern
> has case-varying characters although LIKE (~~) expression can be converted
> to indexable clauses by the planner support function (if the collation
> is "C" or operator class 'text_pattern_ops' is used).
> 
> For example, "t ~~ '123foo%'" is converted to "(t >= '123foo' AND t < 
> '123fop')" 
> and index scan can be used for this condition. On the other hand, "t ~~* 
> '123foo'"
> cannot be converted and sequential scan is used. 
> 
> Even in this case, we can use a bitmap index scan for the condition 
> "(t >= '123f' AND t < '123g') OR "(t >= '123F' AND t < '123G')" followed by
> recheck by the original condition "t ~~* '123foo'", and this could be faster
> than seqscan. 
> 
> However, even if the support function would return OR clauses, the current
> planner implementation cannot not build bitmap scan paths using them. 
> 
> 
> The attached patches allow to ILIKE (~~*) forward matching to use btree 
> index. 
> 
> The patch 0001 enables get_index_clause_from_support to receive OR clauses
> from support functions and use them to build bitmap index scan later. OR 
> clauses
> returned by support functions are collected in the new argument 'orclauses" of
> match_restriction_clauses_to_index(), and they are passed to
> generate_bitmap_or_paths() later to build bitmap scan paths.
> 
> The patch 0002 modifies the support function to return OR clauses as an 
> example
> above when ILIKE's pattern has case-varying characters in forward matching. 
> The
> OR clauses contains two index conditions for the upper and the lower letter of
> the first case-varying character in the pattern respectively. Although the
> subsequent characters are not considered in the index scans, it could enough 
> be
> faster then sequent scan.
> 
> Here is an example. 
> 
> 1. Create a table with random text records
> 
> =# CREATE TABLE tbl (t text);
> =# INSERT INTO tbl SELECT CASE WHEN i%2=1 THEN upper(x) ELSE x END
>                   FROM (SELECT i, md5(i::text) x FROM 
> generate_series(1,5000000) i);
> 
> 2. Create an index
> =# CREATE INDEX ON tbl (t text_pattern_ops);
> 
> 3. Before applying patch: Seq Scan was selected. It takes about 4 sec.
> 
> =# EXPLIN ANALYZE SELECT * FROM tbl WHERE t ILIKE '12ab%';

I am sorry, the example in my previous main was wrong. I showed the plan
with enable_index_scan = off. Indead, if the pattern starts with
case-insensitive characters like '12', an index scan can be used even with
ILIKE. 

postgres=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE 'abc%';
                                                       QUERY PLAN               
                                         
-------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..69129.61 rows=500 width=33) (actual 
time=36.317..4034.770 rows=1188 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=99 read=41939
   ->  Parallel Seq Scan on tbl  (cost=0.00..68079.61 rows=208 width=33) 
(actual time=19.908..4023.668 rows=396 loops=3)
         Filter: (t ~~* 'abc%'::text)
         Rows Removed by Filter: 1666271
         Buffers: shared hit=99 read=41939
 Planning Time: 0.726 ms
 Execution Time: 4035.101 ms
(10 rows)

4. After applying patch: Bitmap Index Scan was selected. 

postgres=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE 'abc%';
                                                                          QUERY 
PLAN                                                                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=12563.66..58314.53 rows=500 width=33) (actual 
time=156.485..1266.789 rows=1188 loops=1)
   Recheck Cond: (((t ~>=~ 'A'::text) AND (t ~<~ 'B'::text) AND (t ~~* 
'abc%'::text)) OR ((t ~>=~ 'a'::text) AND (t ~<~ 'b'::text) AND (t ~~* 
'abc%'::text)))
   Filter: (t ~~* 'abc%'::text)
   Rows Removed by Filter: 311473
   Heap Blocks: exact=42010
   Buffers: shared hit=1 read=44600
   ->  BitmapOr  (cost=12563.66..12563.66 rows=297029 width=0) (actual 
time=136.979..136.980 rows=0 loops=1)
         Buffers: shared hit=1 read=2590
         ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..6281.71 rows=148515 
width=0) (actual time=80.548..80.549 rows=158375 loops=1)
               Index Cond: ((t ~>=~ 'A'::text) AND (t ~<~ 'B'::text))
               Buffers: shared read=1272
         ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..6281.71 rows=148515 
width=0) (actual time=56.427..56.427 rows=157042 loops=1)
               Index Cond: ((t ~>=~ 'a'::text) AND (t ~<~ 'b'::text))
               Buffers: shared hit=1 read=1318
 Planning Time: 0.592 ms
 Execution Time: 1267.162 ms
(16 rows)

 
> What do you think about it?
> 
> I think another application of OR-clause returning support function might be
> allowing to use an index for NOT LIKE (!~~) expression  because, for example, 
> "t !~~ '123foo%'" could be converted to "(t < '123foo' OR t >= '123fooz')". 
> (The second condition is a bit loose but this would be safe and not problem
> since tuples are filtered by the original condition after the index scan.) 
> However, it would not very useful unless the distribution is much skew so that
> NOT LIKE expression's selectivity is enough small.


Regards,
Yugo Nagata

-- 
Yugo Nagata <nag...@sraoss.co.jp>


Reply via email to