domain cast in parameterized vs. non-parameterized query

2017-12-19 Thread David Kamholz
I've recently come across a query that produces different plans depending
on whether it's parameterized or not. The parameterized query takes ~50ms
while the non-parameterized query takes ~4s. The issue seems to be that the
query contains a STABLE function (uid_langvar) whose parameter is a domain
over text (uid).

The parameterized query is able to use the return value of uid_langvar to
choose a better plan:

2017-12-19 23:13:21 GMT LOG:  duration: 0.063 ms  plan:
Query Text:
SELECT id FROM langvar WHERE uid(lang_code, var_code) = uid

Index Scan using langvar_uid_idx on langvar  (cost=0.29..2.59
rows=1 width=4) (actual time=0.058..0.058 rows=1 loops=1)
  Index Cond: ((lang_code)::text || '-'::text) ||
lpad((var_code)::text, 3, '0'::text)))::uid)::text = ($1)::text)
2017-12-19 23:13:21 GMT CONTEXT:  SQL function "uid_langvar" statement 1
2017-12-19 23:13:21 GMT LOG:  duration: 150.634 ms  plan:
Query Text: select expr.id, expr.langvar, expr.txt, expr.txt_degr,
exprx.score as expr_score
from expr
inner join exprx on expr.id = exprx.id
where expr.langvar = uid_langvar($1)
order by exprx.score desc
limit 10

Limit  (cost=0.88..426.76 rows=10 width=45) (actual
time=133.378..150.611 rows=10 loops=1)
  ->  Nested Loop  (cost=0.88..23706962.21 rows=556656 width=45)
(actual time=133.376..150.598 rows=10 loops=1)
->  Index Scan Backward using exprx_score_langvar_idx on
exprx  (cost=0.44..2973934.39 rows=25583602 width=8) (actual
time=0.052..13.479 rows=5589 loops=1)
->  Index Scan using expr_pkey on expr  (cost=0.44..0.81
rows=1 width=41) (actual time=0.023..0.023 rows=0 loops=5589)
  Index Cond: (id = exprx.id)
  Filter: (langvar = uid_langvar('spa-000'::uid))
  Rows Removed by Filter: 1

Note that "SELECT id FROM langvar..." is the body of the uid_langvar
function. Also note that in the filter condition, 'spa-000' is cast
directly to uid. However, the non-parameterized query, where 'spa-000' is
passed directly, produces a different plan:

2017-12-19 23:18:01 GMT LOG:  duration: 0.066 ms  plan:
Query Text:
SELECT id FROM langvar WHERE uid(lang_code, var_code) = uid

Index Scan using langvar_uid_idx on langvar  (cost=0.29..2.59
rows=1 width=4) (actual time=0.062..0.062 rows=1 loops=1)
  Index Cond: ((lang_code)::text || '-'::text) ||
lpad((var_code)::text, 3, '0'::text)))::uid)::text = ($1)::text)
2017-12-19 23:18:01 GMT CONTEXT:  SQL function "uid_langvar" statement 1
2017-12-19 23:18:05 GMT LOG:  duration: 3950.817 ms  plan:
Query Text: select expr.id, expr.langvar, expr.txt, expr.txt_degr,
exprx.score as expr_score
from expr
inner join exprx on expr.id = exprx.id
where expr.langvar = uid_langvar('spa-000')
order by exprx.score desc
limit 10
;
Limit  (cost=12842.48..12842.51 rows=10 width=45) (actual
time=3950.777..3950.793 rows=10 loops=1)
  ->  Sort  (cost=12842.48..12850.44 rows=3182 width=45) (actual
time=3950.775..3950.780 rows=10 loops=1)
Sort Key: exprx.score DESC
Sort Method: top-N heapsort  Memory: 26kB
->  Nested Loop  (cost=1.13..12773.72 rows=3182 width=45)
(actual time=1.524..3541.873 rows=561076 loops=1)
  ->  Index Scan using expr_langvar_id_idx on expr
(cost=0.69..3823.68 rows=3183 width=41) (actual time=1.480..717.547
rows=561293 loops=1)
Index Cond: (langvar =
uid_langvar(('spa-000'::text)::uid))
  ->  Index Scan using exprx_id_idx on exprx
(cost=0.44..2.71 rows=1 width=8) (actual time=0.002..0.003 rows=1
loops=561293)
Index Cond: (id = expr.id)

Note that in the above plan, 'spa-000' is cast to text before it's cast to
uid. This is apparently connected to why postgresql can't choose the better
plan.

This difference between plans of parameterized and non-parameterized
queries seems strange to me. Is it actually expected/correct or is it a bug?

Here's the definition of the domain and the functions, in case it's
relevant:

CREATE DOMAIN uid AS text
CONSTRAINT uid_check CHECK ((VALUE ~ '^[a-z]{3}-\d{3}$'::text));

CREATE FUNCTION uid_langvar(uid uid) RETURNS integer
LANGUAGE sql STABLE PARALLEL SAFE
AS $$
SELECT id FROM langvar WHERE uid(lang_code, var_code) = uid
$$;

CREATE FUNCTION uid(lang_code alpha3, var_code smallint) RETURNS uid
LANGUAGE sql IMMUTABLE PARALLEL SAFE
AS $$select (lang_code || '-' || lpad(var_code::text, 3, '0'))::uid;$$;


Re: domain cast in parameterized vs. non-parameterized query

2017-12-20 Thread David Kamholz
>
> That's not too surprising.  PostgreSQL can't choose a plan based on
> the parameter value when it doesn't know the parameter value


I thought that since 9.2, postgresql could "generate plans based on the
parameter value even when using prepared statements" (paraphrase of 9.2
release notes). I'm running version 10. That's why I was surprised to find
the different plans. If you're right that taking the value into account
causes the planner to guess wrong, I agree that's a separate issue -- but
is that really what's going on?


> > Note that in the above plan, 'spa-000' is cast to text before it's cast
> to uid. This
> > is apparently connected to why postgresql can't choose the better plan.
>
> It's slightly hard for me to follow what's going on with the
> auto_explain output you provided because you didn't specify what SQL
> you ran to produce that output, but I suspect that's not the case.


The queries included in the output (after "Query Text:"), which is why I
didn't include them separately.


> I think the deeper problem here may be that the planner has no idea
> what value uid_langvar() will return, so its selectivity estimates are
> probably fairly bogus.  If you looked up that id first and then
> searched for the resulting value, it might do better.
>

I was under the impression, possibly incorrect, that the planner would
sometimes or always call a stable/immutable function in the planning stage
in order to consider its return value for planning. RhodiumToad on
#postgresql mentioned that functions returning constant values will be
folded in. He thought the planner should call uid_langvar() even though it
wasn't constant. Changing it from stable to immutable makes no difference,
and neither does reducing the cost to 10. Looking up the id first is an
obvious option but I thought there was a way to do this within a single
query. I guess not?

In any case, I still don't understand why prepared vs. not makes a
difference.

Dave


Re: domain cast in parameterized vs. non-parameterized query

2017-12-20 Thread David Kamholz
>
> You might consider whether you can write 'spa-000'::uid explicitly in your
> query; that results in immediate application of the domain coercion, so
> that the planner no longer sees that as a run-time operation it has to
> avoid.
>

I should have mentioned that I tried an explicit cast and found that
'spa-000' and 'spa-000'::uid produce identical results. As far as I can
tell, there is *no* way to get the planner to constant-fold in this case
without using prepared statements.


> It's tempting to wonder whether we could somehow constant-fold
> CoerceToDomain, at least in the estimation case, but I'm afraid that
> would lead to domain constraint failures that would not necessarily occur
> at runtime.  Or we could skip the domain check for estimation purposes,
> but then we're possibly feeding a value that fails the domain check to a
> function that might not work right under such conditions.  So on the
> whole I'm afraid to monkey with that decision.


OK, I can see how that makes sense. But shouldn't an explicit cast still
work?


Re: domain cast in parameterized vs. non-parameterized query

2017-12-21 Thread David Kamholz
>
> > Why the rewrite doesn't reduce it? Or why parser does it?
>
> Because ALTER DOMAIN can change what would be a valid value.


In the view case that makes sense, but I don't see how ALTER DOMAIN is
relevant to my original example. You can't alter a domain between the time
the query is parsed and executed, can you?

I also don't understand why prepared statements are different. The future
values are not known to pass the domain check at the time the statement is
prepared, and at the time it's executed, I'd think the value is known
exactly to the extent that a literal (non-parameterized) value is known.
That is, I'd expect the ability to run CoerceToDomain and fold to a
constant to be identical in both cases -- either possible in both or
impossible in both. Why the difference?

Dave