optimisation? collation "C" sorting for GroupAggregate for all deterministic collations
Hi All,

It is known that collation "C" significantly speeds up string comparisons and, as a result, sorting. I was wondering whether it is possible to use it, regardless of the collation set on a column, in sorts that are not visible to users. The example I have in mind is the sort performed for GroupAggregate. The purpose of that sort is to bring equal values next to each other, so as long as:

1) the user didn't request ORDER BY in addition to GROUP BY
2) the source column has any deterministic collation (as per the docs, all builtin collations are deterministic)

it seems to be possible to do the sorting with any deterministic collation, regardless of what the user specified for the column being sorted. The "C" collation is deterministic and fastest.

In other words, couldn't PostgreSQL convert this:

  -> GroupAggregate (cost=15726557.87..22944558.69 rows=721 width=176) (actual time=490103.209..771536.389 rows=3600 loops=1)
       Group Key: ec_180days.msn, ec_180days.to_date_time
       -> Sort (cost=15726557.87..15906557.89 rows=7208 width=113) (actual time=490094.849..524854.662 rows=7200 loops=1)
            Sort Key: ec_180days.msn, ec_180days.to_date_time
            Sort Method: external merge Disk: 7679136kB

to this:

  -> GroupAggregate (cost=14988274.87..22206275.69 rows=721 width=155) (actual time=140497.729..421510.001 rows=3600 loops=1)
       Group Key: ec_180days.msn, ec_180days.to_date_time
       -> Sort (cost=14988274.87..15168274.89 rows=7208 width=92) (actual time=140489.807..174228.722 rows=7200 loops=1)
            Sort Key: ec_180days.msn COLLATE "C", ec_180days.to_date_time
            Sort Method: external merge Disk: 7679136kB

which is 3 times faster in my tests.
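The reasoning above can be sketched with a toy Python illustration (not PostgreSQL internals): as long as equality is byte-wise (deterministic), any total order that is consistent with that equality puts equal keys next to each other, so grouping after a byte-order ("C"-style) sort produces the same groups as grouping after a sort in the column's own collation.

```python
# Toy model: grouping after two different sort orders yields the same
# groups, because both orders place byte-equal keys adjacently.
from itertools import groupby

rows = ["Äpfel", "zebra", "Äpfel", "apple", "zebra", "apple"]

codepoint_sorted = sorted(rows)  # stand-in for the column's collation
byte_sorted = sorted(rows, key=lambda s: s.encode("utf-8"))  # "C"-style byte order

def groups(seq):
    # GroupAggregate only needs equal keys to be adjacent; the relative
    # order of distinct groups is irrelevant when there is no ORDER BY.
    return sorted((k, len(list(g))) for k, g in groupby(seq))

assert groups(codepoint_sorted) == groups(byte_sorted)
```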
Re: optimisation? collation "C" sorting for GroupAggregate for all deterministic collations
> Perhaps this is what you mean by "deterministic", but isn't it
> possible for some collations to treat multiple byte sequences as equal
> values? And those multiple byte sequences wouldn't necessarily occur
> sequentially in C collation, so it wouldn't be possible to work around
> that by having the grouping node use one collation but the sorting
> node use the C one.
>
> If my memory is incorrect, then this sounds like an intriguing idea.

Yes, as per the docs (https://www.postgresql.org/docs/12/collation.html#COLLATION-NONDETERMINISTIC) some collations can consider symbols (chars? codes? runes?) equal even though their byte representations are not. This optimisation should check the source table's collation and not change the sorting collation if the columns being sorted use a non-deterministic collation. Luckily, in practice this is probably very rare: all builtin collations are deterministic.
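The hazard is easy to show with a toy Python counterexample (illustrative only, not PostgreSQL internals): under a case-insensitive "collation", "ABC" and "abc" are equal, yet byte order does not place them next to each other, so a GroupAggregate over the byte-ordered input would emit an extra group.

```python
# Toy model of a non-deterministic collation (case-insensitive equality):
# sorting in byte order splits a group that the collation considers equal.
from itertools import groupby

rows = ["ABC", "ab", "abc"]

ci_sorted = sorted(rows, key=str.casefold)  # the non-deterministic "collation"
c_sorted = sorted(rows)                     # "C"-style byte/codepoint order

ci_groups = [k for k, _ in groupby(ci_sorted, key=str.casefold)]
c_groups = [k for k, _ in groupby(c_sorted, key=str.casefold)]

assert ci_groups == ["ab", "abc"]        # 2 groups: "ABC" and "abc" merged
assert c_groups == ["abc", "ab", "abc"]  # 3 groups: "ABC" and "abc" split apart
```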
Use of additional index columns in rows filtering
Hi All,

I'd like to report what seems to be a missed optimization opportunity, or understand why it is not possible to achieve.

TLDR; an additional index column B specified in CREATE INDEX .. (A) INCLUDE (B) is not used to filter rows in queries like WHERE B = $1 ORDER BY A during an IndexScan. https://dbfiddle.uk/iehtq44L

Take the following database:

  CREATE TABLE t(
    a integer NOT NULL,
    b integer NOT NULL,
    d integer NOT NULL
  );

  CREATE UNIQUE INDEX t_a_include_b ON t (a) INCLUDE (b);
  -- I'd expect the index above to behave as the index below for the
  -- purpose of this query
  -- CREATE UNIQUE INDEX ON t(a,b);

  INSERT INTO t(
    SELECT random() * 1 as a, random() * 3 as b, generate_series as d
    FROM generate_series(1,20)
  ) ON CONFLICT DO NOTHING;

If we filter on the `a` and `b` columns while scanning the index created as `(a) INCLUDE (b)`, it seems to be fetching tuples from the heap to check the condition `b = 4`, despite both columns being available in the index:

  SELECT * FROM t WHERE a > 100 and b = 4 ORDER BY a ASC LIMIT 10;

Here is the plan (notice the high "shared hit"):

  Limit (cost=0.42..10955.01 rows=1 width=12) (actual time=84.283..84.284 rows=0 loops=1)
    Output: a, b, d
    Buffers: shared hit=198307
    -> Index Scan using t_a_include_b on public.t (cost=0.42..10955.01 rows=1 width=12) (actual time=84.280..84.281 rows=0 loops=1)
         Output: a, b, d
         Index Cond: (t.a > 100)
         Filter: (t.b = 4)
         Rows Removed by Filter: 197805
         Buffers: shared hit=198307
  Planning:
    Buffers: shared hit=30
  Planning Time: 0.201 ms
  Execution Time: 84.303 ms

And here is the plan with an index on (a,b).
  Limit (cost=0.42..4447.90 rows=1 width=12) (actual time=6.883..6.884 rows=0 loops=1)
    Output: a, b, d
    Buffers: shared hit=613
    -> Index Scan using t_a_b_idx on public.t (cost=0.42..4447.90 rows=1 width=12) (actual time=6.880..6.881 rows=0 loops=1)
         Output: a, b, d
         Index Cond: ((t.a > 100) AND (t.b = 4))
         Buffers: shared hit=613
  Planning:
    Buffers: shared hit=41
  Planning Time: 0.314 ms
  Execution Time: 6.910 ms

Because the query doesn't sort on `b`, only filters on it while sorting on `a`, I'd expect the indexes `(a) INCLUDE (b)` and `(a,b)` to behave exactly the same with this particular query.

Interestingly, an IndexOnlyScan is capable of using additional columns to filter rows without fetching them from the heap, but only for visible tuples:

  VACUUM FREEZE t;
  SELECT a,b FROM t WHERE a > 100 and b = 4 ORDER BY a ASC LIMIT 10;

  Limit (cost=0.42..6619.76 rows=1 width=8) (actual time=18.479..18.480 rows=0 loops=1)
    Output: a, b
    Buffers: shared hit=662
    -> Index Only Scan using t_a_include_b on public.t (cost=0.42..6619.76 rows=1 width=8) (actual time=18.477..18.477 rows=0 loops=1)
         Output: a, b
         Index Cond: (t.a > 100)
         Filter: (t.b = 4)
         Rows Removed by Filter: 197771
         Heap Fetches: 0
         Buffers: shared hit=662

Removing the VACUUM makes it behave like an IndexScan and fetch candidate tuples from the heap, all while returning zero rows in the result.

To make the query plans comparable I had to force an index scan on both with:

  SET enable_bitmapscan to off;
  SET enable_seqscan to off;
  SET max_parallel_workers_per_gather = 0;

A self-contained, fully reproducible example is at https://dbfiddle.uk/iehtq44L

Regards,
Maxim
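The access-pattern difference can be modeled with a toy Python sketch (illustrative only; it ignores visibility checks and everything else that makes the real question subtle): an index entry carrying (a, b, tid) lets the scan reject non-matching rows before touching the heap, while filtering after the fetch touches the heap for every candidate.

```python
# Toy model of the two scan strategies over an index "(a) INCLUDE (b)".
# heap: rows (a, b, d); index: entries (a, b, tid). b is always in 0..2,
# so the filter "b == 4" matches nothing, as in the example query.
heap = [(a, a % 3, a) for a in range(1000)]
index = [(a, b, tid) for tid, (a, b, _) in enumerate(heap)]

def scan_filter_after_fetch(index, heap):
    """Current IndexScan behaviour: fetch the heap tuple, then filter."""
    fetches, out = 0, []
    for a, b, tid in index:
        if a > 100:
            fetches += 1          # heap access happens before the filter
            row = heap[tid]
            if row[1] == 4:
                out.append(row)
    return out, fetches

def scan_filter_on_index(index, heap):
    """Proposed behaviour: check the included column on the index tuple."""
    fetches, out = 0, []
    for a, b, tid in index:
        if a > 100 and b == 4:    # filter using the included column b
            fetches += 1
            out.append(heap[tid])
    return out, fetches

r1, f1 = scan_filter_after_fetch(index, heap)
r2, f2 = scan_filter_on_index(index, heap)
assert r1 == r2 == []             # same (empty) result either way
assert f1 == 899 and f2 == 0      # but only one variant skips the heap
```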
Re: Use of additional index columns in rows filtering
> This isn't a problem for operators found in operator families, because
> we trust those to not have undesirable side effects like raising
> data-dependent errors. But it'd be an issue if we started to apply
> quals that aren't index quals directly to index rows before doing
> the heap liveness check. (And, of course, once you've fetched the
> heap row there's no point in having a special path for columns
> available from the index.)

Assuming operators are pure and don't have global side effects, is it possible to ignore any error during that check? If the tuple is not visible, the error shouldn't matter; if it is visible, the error will be reported by the same routine which does the filtering now (ExecQual?). If not, then limiting this optimization to builtin ops is something I can live with :)
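The proposal can be sketched in toy Python (names and control flow are illustrative, not PostgreSQL internals): evaluate a possibly-erroring qual speculatively against the index tuple, swallow any error there, and let the normal post-fetch filter re-evaluate it (and report the error) only for tuples that turn out to be visible.

```python
# Toy model: a data-dependent error raised against a dead index tuple is
# ignored; the same error on a visible tuple would surface via the normal
# filter path, exactly as it does today.
def qual(b):
    return 10 // b > 1           # raises ZeroDivisionError for b == 0

entries = [(5, True), (0, False), (2, True)]   # (included column, visible?)

matched = []
for b, visible in entries:
    try:
        if not qual(b):          # speculative check on the index tuple
            continue             # definitely filtered out: skip heap fetch
    except ZeroDivisionError:
        pass                     # ignore; fall through to the normal path
    if visible:                  # heap fetch + visibility check
        if qual(b):              # normal filter; re-raises for visible rows
            matched.append(b)

assert matched == [5, 2]         # the dead b == 0 tuple raised no error
```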