Index-only scan for "f(x)" without "x"

2020-01-22 Thread Malthe
Referencing the example given in the documentation for index-only
scans [0], we consider an index:

CREATE INDEX tab_f_x ON tab (f(x));

This index currently will not be used for an index-scan for the
following query since the planner isn't smart enough to know that "x"
is not needed:

SELECT f(x) FROM tab WHERE f(x) < 1;

However, any function applied to a column for an index expression is
required to be immutable so as far as I can tell the planner doesn't
have to be very smart to know that the index can indeed be used for an
index-only scan (without having "x" included).

One interesting use-case for this is to be able to create
space-efficient indexes for raw log data. For example, for each type
of message (which might be encoded as JSON), one could create a
partial index with the relevant fields extracted and converted into
native data types and use index-only scanning to query. This is not
particularly attractive today because the message itself would need to
be added to the index effectively duplicating the log data.

In the same vein, being able to add this auxiliary data (which is
basically immutable expressions on one or more columns) explicitly
using INCLUDE would make the technique actually reliable. This is not
possible right now since expression are not supported as included
columns.

What's required in order to move forward on these capabilities?

[0] https://www.postgresql.org/docs/current/indexes-index-only-scans.html




Re: Index-only scan for "f(x)" without "x"

2020-01-22 Thread Malthe
On Thu, 23 Jan 2020 at 00:00, Tom Lane  wrote:
> The problem is that the planner's initial analysis of the query tree
> concludes that the scan of "tab" has to return "x", because it looks
> through the tree for plain Vars, and "x" is what it's going to find.
> This conclusion is in fact true for any plan that doesn't involve
> scanning a suitable index on f(x), so it's not something we can just
> dispense with.

If f(x) was actually added to the table as a virtual generated column
V then the planner could perhaps reasonably find that a particular
expression depends in part on V (rather than the set of real columns
underlying the virtual generated column).

That is, the planner now knows that it needs to return V = "f(x)"
rather than "x" and so it can match the requirement to our index and
decide to use an index-only scan.

> To back up from that and conclude that the indexscan doesn't really
> need to return "x" because every use of it is in the context "f(x)"
> seems like a pretty expensive proposition, especially once you start
> considering index expressions that are more complex than a single
> function call.  A brute-force matching search could easily be O(N^2)
> or worse in the size of the query tree, which is a lot to pay when
> there's a pretty high chance of failure.  (IOW, we have to expect
> that most queries on "tab" are in fact not going to be able to use
> that index, so we can't afford to spend a lot of planning time just
> because the index exists.)

In the approach outlined above, the set of expressions to match on
would be limited by the set of virtual generated columns defined on
the table — some sort of prefix tree that allows efficient matching on
a given expression syntax tree? There would be no cost to this on a
table that had no virtual generated columns.

The question is whether this is a reasonable take on virtual generated columns.

   ---regards




Re: regexp_replace weirdness amounts to a bug?

2023-08-16 Thread Malthe
Hi Erik,

The regexp doesn't match your string because you're not allowing for
any repeat characters, try adding a '+'.

On Wed, 16 Aug 2023 at 09:07, Erik Rijkers  wrote:
>
> Hello,
>
> The following surprised me enough to think it might be a bug:
> (17devel)
>
> select
> regexp_replace('Abc Def'
>, '([a-z]) ([A-Z])'
>, '\1 ' || lower('\2')  );
>
> regexp_replace
> 
>   Abc Def
> (1 row)
>
> -- 'Abc Def' got
> -- 'Abc def' expected
>
> What do you think?
>
> Thanks,
>
> Erik Rijkers
>
>
>