Different results from identical matviews

2020-07-01 Thread Anders Steinlein
Hi folks,

We have a materialized view from which a customer reported some
confusing/invalid results, leading us to inspect the query and not finding
anything wrong. Running the query defining the matview manually, or
creating a new (identical) materialized view returns the correct result.
Obviously, we've done REFRESH MATERIALIZED VIEW just before doing the
comparison, and all runs are in the same schema.

It's a pretty big query, but let's describe the two matviews to see that
they are identical. The first is the original returning invalid results,
the one with _2 name postfix is the re-created one.

mm_prod=> \d+ segments_with_contacts
   Materialized view "aakpnews.segments_with_contacts"
 Column | Type  | Collation | Nullable | Default | Storage  | Stats
target | Description
+---+---+--+-+--+--+-
 lid| integer   |   |  | | plain|
   |
 sid| integer   |   |  | | plain|
   |
 email  | public.citext |   |  | | extended |
   |
Indexes:
"segments_with_contacts_sid_lid_email_idx" UNIQUE, btree (sid, lid,
email)
View definition:
 WITH tagged_contacts AS (
 SELECT cl.lid,
cl.email,
cl.skip_preexisting_campaigns AS skip_subscribed,
ct.skip_preexisting_campaigns AS skip_tags,
cl.ladded,
ct.tagname,
ct.created
   FROM contacts_lists cl
 LEFT JOIN contacts_tags ct USING (email)
  WHERE cl.lstatus::bpchar = 'a'::bpchar
), tagged_segments AS (
 SELECT s.lid,
s.cid,
s.sid,
sp.type,
sp.mid,
sp.matchdelay,
sp.tagname,
sp.event,
count(*) OVER (PARTITION BY s.sid) AS requirements,
campaigns.activated_at
   FROM segments s
 LEFT JOIN campaigns USING (cid)
 JOIN segments_predicates sp USING (sid)
  WHERE s.archived_at IS NULL AND (s.cid IS NULL OR
campaigns.activated_at IS NOT NULL)
), segments_contacts AS (
 SELECT s.lid,
s.sid,
s.requirements,
CASE
WHEN s.type = 'subscribed'::public.predicate THEN (
SELECT array_agg(DISTINCT tagged_contacts.email::public.citext) AS array_agg
   FROM tagged_contacts
  WHERE tagged_contacts.lid = s.lid AND
tagged_contacts.ladded >= s.activated_at AND (s.matchdelay IS NULL OR
(tagged_contacts.ladded + s.matchdelay) < now()) AND NOT
COALESCE(tagged_contacts.skip_subscribed, false))
WHEN s.type = 'has_tag'::public.predicate THEN ( SELECT
array_agg(DISTINCT tagged_contacts.email::public.citext) AS array_agg
   FROM tagged_contacts
  WHERE tagged_contacts.lid = s.lid AND
tagged_contacts.tagname OPERATOR(public.=) s.tagname AND (s.matchdelay IS
NULL OR (tagged_contacts.created + s.matchdelay) < now()) AND (s.cid IS
NULL OR tagged_contacts.created >= s.activated_at AND NOT
COALESCE(tagged_contacts.skip_tags, false)))
WHEN s.type = 'not_has_tag'::public.predicate THEN (
SELECT array_agg(aggregated_tags.email::public.citext) AS array_agg
   FROM ( SELECT tagged_contacts.email,
array_agg(tagged_contacts.tagname) AS tags
   FROM tagged_contacts
  WHERE tagged_contacts.lid = s.lid AND (s.cid
IS NULL OR tagged_contacts.ladded >= s.activated_at AND NOT
COALESCE(tagged_contacts.skip_subscribed, false))
  GROUP BY tagged_contacts.email)
aggregated_tags
  WHERE NOT aggregated_tags.tags @> ARRAY[s.tagname])
WHEN s.type = 'received'::public.predicate THEN (
SELECT array_agg(DISTINCT mails_contacts_sent.email::public.citext) AS
array_agg
   FROM mails_contacts_sent
  WHERE mails_contacts_sent.mid = s.mid AND
(s.matchdelay IS NULL OR (mails_contacts_sent.senttime + s.matchdelay) <
now()))
WHEN s.type = 'not_received'::public.predicate THEN (
SELECT array_agg(x.email::public.citext) AS array_agg
   FROM ( SELECT tagged_contacts.email
   FROM tagged_contacts
  WHERE tagged_contacts.lid = s.lid
EXCEPT
 SELECT DISTINCT mails_contacts_sent.email
   FROM mails_contacts_sent
  WHERE mails_contacts_sent.mid = s.mid AND
(s.matchdelay IS NULL OR (mails_contacts_sent.senttime + s.matchdelay) <
now())) x)
WHEN s.type = 'opened'::public.predicate THEN ( SELECT
array_agg(DISTINCT mails_contacts_opens.email::public.ci

Re: Different results from identical matviews

2020-07-02 Thread Anders Steinlein
On Thu, Jul 2, 2020 at 2:02 AM Tom Lane  wrote:

> Anders Steinlein  writes:
> > We have a materialized view from which a customer reported some
> > confusing/invalid results, leading us to inspect the query and not
> finding
> > anything wrong. Running the query defining the matview manually, or
> > creating a new (identical) materialized view returns the correct result.
> > Obviously, we've done REFRESH MATERIALIZED VIEW just before doing the
> > comparison, and all runs are in the same schema.
>
> I suspect the query underlying the matviews is less deterministic than
> you think it is.  I did not study that query in any detail, but just
> from a quick eyeball: the array_agg() calls with no attempt to enforce a
> particular aggregation order are concerning, and so is grouping by
> a citext column (where you'll get some case-folding of a common value,
> but who knows which).


Thanks for the tip, but I'm having a hard time thinking that's the case,
seeing as I'm unable to trigger the wrong result no matter how hard I try
with a new definition/manual query. I've introduced random ordering to the
first CTE-clause (where the initial citext values comes from, and casing
thus could differ in some order) which doesn't change the result.

When the citext type is used throughout the query, shouldn't the grouping
result be deterministic? The citext values are first "rolled up" with
array_agg() and later unnested and finally grouped. Shouldn't the end
result be the same, regardless of what particular case-folded version of
the value it chooses to group on?

I've simplified the query for this particular customer case that, again,
always returns the correct result no matter how often I try:

mm_prod=> SELECT sid, count(*) FROM (
WITH tagged_contacts AS (
SELECT lid, email, cl.skip_preexisting_campaigns AS skip_subscribed,
ct.skip_preexisting_campaigns AS skip_tags, ladded,
tagname, created
FROM contacts_lists cl
LEFT JOIN contacts_tags ct USING (email)
WHERE lstatus = 'a'
ORDER BY random()
),
tagged_segments AS (
SELECT s.lid, cid, sid, sp.type, sp.mid, matchdelay, tagname, event,
count(*) OVER (PARTITION BY sid) AS requirements,
activated_at
FROM segments s
LEFT JOIN campaigns USING (cid)
INNER JOIN segments_predicates sp USING (sid)
WHERE
s.archived_at IS NULL
AND (cid IS NULL OR activated_at IS NOT NULL)
),
segments_contacts AS (
SELECT lid, sid, requirements,
CASE
WHEN type = 'has_tag' THEN (
SELECT array_agg(DISTINCT email::citext)
FROM tagged_contacts
WHERE
lid = s.lid
AND tagname = s.tagname
AND (matchdelay IS NULL OR created + matchdelay
< now())
AND (
cid IS NULL
OR (
created >= activated_at
AND NOT COALESCE(skip_tags, false)
)
)
)
END AS emails
FROM tagged_segments s
),
unnested AS (
SELECT lid, sid, requirements, unnest(emails) AS email
FROM segments_contacts
)
SELECT lid, sid, email
FROM unnested
GROUP BY lid, sid, email, requirements
HAVING count(email) = requirements
) x
WHERE sid = 42259
GROUP BY sid;
  sid  | count
---+---
 42259 |98
(1 row)

This is stale data for this customer, so no data changes are occurring to
change the results. I can REFRESH MATERIALIZED VIEW as many times I was on
the original segments_with_contacts matview, and I never see different
results. If it were not deterministic, shouldn't I expect to see different
results one in at least 100 times tried?

Thanks again for any insight to try and figure this out. Again, I could
just re-create the matview we use in production and it would likely work
(since I'm unable to get wrong results with a newly created case), but I
would rather try to find out the root cause here first.

Best,
-- a.


Re: Different results from identical matviews

2020-07-02 Thread Anders Steinlein
On Thu, Jul 2, 2020 at 12:12 PM Magnus Hagander  wrote:

>
> On Thu, Jul 2, 2020 at 2:02 AM Tom Lane  wrote:
>
>> Anders Steinlein  writes:
>> > We have a materialized view from which a customer reported some
>> > confusing/invalid results, leading us to inspect the query and not
>> finding
>> > anything wrong. Running the query defining the matview manually, or
>> > creating a new (identical) materialized view returns the correct result.
>> > Obviously, we've done REFRESH MATERIALIZED VIEW just before doing the
>> > comparison, and all runs are in the same schema.
>>
>> I suspect the query underlying the matviews is less deterministic than
>> you think it is.  I did not study that query in any detail, but just
>> from a quick eyeball: the array_agg() calls with no attempt to enforce a
>> particular aggregation order are concerning, and so is grouping by
>> a citext column (where you'll get some case-folding of a common value,
>> but who knows which).
>>
>
> Also not having looked at the query in detail -- but are there concurrent
> changes in the database?
>

Yes, the database is in production so changes do occur, however the data
involved for this particular customer and the tables involved in this query
is not changing while we've looked into the case these days. It's SELECTED
quite a bit though.


> Because since you're creating your transaction in READ COMMITTED, other
> transactions finishing in between your two REFRESH commands can alter the
> data. To make sure that's not what's happening, you may want to try doing
> the same thing with a BEGIN  TRANSACTION ISOLATION LEVEL SERIALIZABLE
> instead, and see if the problem still occurs.
>

Thanks for the suggestion, but the issue remains. The results are the same
(that is, invalid for the one matview and correct for new matviews or
manual queries) whether I run the queries in the same transaction or
separate transactions. After many many many attempts. I'm quite baffled,
really...

Best,
-- a.

>


Re: Different results from identical matviews

2020-07-02 Thread Anders Steinlein
On Thu, Jul 2, 2020 at 1:26 PM Anders Steinlein  wrote:

> On Thu, Jul 2, 2020 at 2:02 AM Tom Lane  wrote:
>
>> Anders Steinlein  writes:
>> > We have a materialized view from which a customer reported some
>> > confusing/invalid results, leading us to inspect the query and not
>> finding
>> > anything wrong. Running the query defining the matview manually, or
>> > creating a new (identical) materialized view returns the correct result.
>> > Obviously, we've done REFRESH MATERIALIZED VIEW just before doing the
>> > comparison, and all runs are in the same schema.
>>
>> I suspect the query underlying the matviews is less deterministic than
>> you think it is.  I did not study that query in any detail, but just
>> from a quick eyeball: the array_agg() calls with no attempt to enforce a
>> particular aggregation order are concerning, and so is grouping by
>> a citext column (where you'll get some case-folding of a common value,
>> but who knows which).
>
>
> Thanks for the tip, but I'm having a hard time thinking that's the case,
> seeing as I'm unable to trigger the wrong result no matter how hard I try
> with a new definition/manual query. I've introduced random ordering to the
> first CTE-clause (where the initial citext values comes from, and casing
> thus could differ in some order) which doesn't change the result.
>

I just wanted to add that we're on Postgres 12.3. This matview has been
with us since 9.4 days, and we have not experienced any such issues before
(could be customers who haven't noticed or reported it to us, of course...).
 version

-
 PostgreSQL 12.3 (Ubuntu 12.3-1.pgdg18.04+1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

Best,
-- a.

>


Re: Different results from identical matviews

2020-07-02 Thread Anders Steinlein
On Thu, Jul 2, 2020 at 3:01 PM Jeremy Smith  wrote:

> It looks like you are using now() fairly often in that query.  That would,
> of course, give different results in different transactions, but it could
> also give different results if a) the things you are comparing now() to are
> timestamp without time zone and b) the session time zone of the user doing
> the refresh is different from the session time zone of the user running the
> query.  I'd also be suspicious about any other timestamp comparisons that
> aren't comparing timestamp to timestamp or timestamptz to timestamptz.
>

Thanks, but as mentioned in another part of the thread this data is stale
(unmodified during these days we're looking into this). And the result is
different between the erroneous matview and identical manual query
regardless, whether I run them in the same transaction or not. And
the results were the same yesterday, same days before that when the issue
came up, and same today. So I can't see how times are the issue,
unfortunately. :-/

Best,
-- a.


Re: Different results from identical matviews

2020-07-02 Thread Anders Steinlein
On Thu, Jul 2, 2020 at 3:44 PM Tom Lane  wrote:

> Anders Steinlein  writes:
> > On Thu, Jul 2, 2020 at 2:02 AM Tom Lane  wrote:
> >> I suspect the query underlying the matviews is less deterministic than
> >> you think it is.
>
> > Thanks for the tip, but I'm having a hard time thinking that's the case,
> > seeing as I'm unable to trigger the wrong result no matter how hard I try
> > with a new definition/manual query.
>
> Well, another line of thought is that there actually is some difference
> between the stored query for the original matview and the ones you enter
> afresh.  You said they were the same, but I surely didn't attempt to
> verify that.  Comparing pg_get_viewdef() output for equality would be
> a good first step.


I used a manual `diff` earlier, but this sure was easier. But yes, the
stored queries are identical:

mm_prod=> select pg_get_viewdef('aakpnews.segments_with_contacts') =
pg_get_viewdef('aakpnews.segments_with_contacts_2');
 ?column?
--
 t
(1 row)

Even that perhaps isn't conclusive, so you could
> also try comparing the pg_rewrite.ev_action fields for the views'
> ON SELECT rules.  (That might be a bit frustrating because of likely
> inconsistencies in node "location" fields; but any other difference
> is cause for suspicion.)
>

You're right, ev_action is indeed different:

mm_prod=> select x1.ev_type = x2.ev_type as ev_type_equal, x1.ev_enabled =
x2.ev_enabled as enabled_equal, x1.is_instead = x2.is_instead as
is_instead_equal, x1.ev_qual = x2.ev_qual as ev_qual_equal, x1.ev_action =
x2.ev_action as ev_action_equal
from
(select pr.* from pg_namespace pn inner join pg_class pc on pc.relnamespace
= pn.oid inner join pg_rewrite pr on pr.ev_class = pc.oid where pn.nspname
= 'aakpnews' and pc.relname = 'segments_with_contacts') x1,
(select pr.* from pg_namespace pn inner join pg_class pc on pc.relnamespace
= pn.oid inner join pg_rewrite pr on pr.ev_class = pc.oid where pn.nspname
= 'aakpnews' and pc.relname = 'segments_with_contacts_2') x2;
 ev_type_equal | enabled_equal | is_instead_equal | ev_qual_equal |
ev_action_equal
---+---+--+---+-
 t | t | t| t | f
(1 row)

Is there somehow I can format them to make it easier to compare? My basic
attempts didn't help me much. I put them up in all their glories in
pastebins, since they are rather large. Please let me know if there is
somehow I can make this easier to look into.

ev_action for segments_with_contacts - the origial matview:
https://pastebin.com/MBJ45prC
ev_action for segments_with_contacts_2 - the similar newly created matview:
https://pastebin.com/sL4WjzBj

Best,
-- a.


Re: Different results from identical matviews

2020-07-02 Thread Anders Steinlein
On Thu, Jul 2, 2020 at 3:55 PM David G. Johnston 
wrote:

> On Thursday, July 2, 2020, Anders Steinlein  wrote:
>>
>>
>> I just wanted to add that we're on Postgres 12.3. This matview has been
>> with us since 9.4 days, and we have not experienced any such issues before
>> (could be customers who haven't noticed or reported it to us, of course...).
>>  version
>>
>>
>> -
>>  PostgreSQL 12.3 (Ubuntu 12.3-1.pgdg18.04+1) on x86_64-pc-linux-gnu,
>> compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
>>
>
>
>  I concur that the determinism doesn’t seem like a problem - but not much
> else does either.  As a shot in the dark does pg_depend show any
> differences between the dependencies for the two views?
>

Could be worth checking, yes. Could you give me any guidance as to how to
compare this? Never looked at pg_depend before -- which of the columns
should have the oid for the matview I want to look up dependencies for?

How did this migrate from 9.4 to 12?
>

pg_dump and pg_restore. It's been a few months, so unfortunately I can't
recall which pg_dump version was used. Another thing to possibly note is
that the citext extension has subsequently been updated as well; I'm unsure
if the matview has been recreated after that (if that could have any
effect).


> It would be helpful if “Explain analyze refresh materialized view” were a
> thing (is it?)
>

Yeah, I was looking for that too initially when investigating.
Unfortunately, "Utility statements have no plan structure" is the response
given.

If you can backup and restore the existing database (basebackup is more
> likely, but pg_dump would be more useful) and still observe the problem
> then maybe I see hope for digging down into the cause.  Otherwise I’d limit
> my decision to testing for the symptom with the solution being to rebuild
> any problem views.
>

I //think// I have the dump laying around, but if simply rebuilding the
view fixes the problem I'm inclined to just do that, although the issue is
a bit concerning. If anyone here suspects an actual bug with a possible
avenue for further investigation, in which case I would be happy to help.

Best,
-- a


Re: Different results from identical matviews

2020-07-02 Thread Anders Steinlein
On Thu, Jul 2, 2020 at 5:43 PM Tom Lane  wrote:

> Anders Steinlein  writes:
> >> Even that perhaps isn't conclusive, so you could
> >> also try comparing the pg_rewrite.ev_action fields for the views'
> >> ON SELECT rules.  (That might be a bit frustrating because of likely
> >> inconsistencies in node "location" fields; but any other difference
> >> is cause for suspicion.)
>
> > You're right, ev_action is indeed different:
> > ...
> > Is there somehow I can format them to make it easier to compare? My basic
> > attempts didn't help me much. I put them up in all their glories in
> > pastebins, since they are rather large. Please let me know if there is
> > somehow I can make this easier to look into.
>
> Yeah, expression trees are pretty unreadable :-(.  I downloaded these,
> changed all the "location" fields to -1 to make them more comparable,
> and behold there are still a bunch of diffs.  Here's one:
>
> [...]
>
> This is the internal form of a "JOIN ... USING (email)" construct.
> I didn't try to trace this back to exactly where it was in the source
> queries.  The important thing here is that we have a couple of Vars
> of type 106893, which I gather must be citext or a domain over it.
>

Yes, we have a domain called `email` over citext. The reason for the
multiple casts to citext, IIRC, was that i.e. array_agg() didn't accept the
email domain directly.


> In the first tree, those are coerced via a no-op RelabelType operation
> into plain text (type OID 25) and then compared with the built-in texteq
> operator.  In the second tree, they are coerced to some other non-built-in
> type (maybe plain citext?) and then compared with operator 106108.
>
> I am betting that 106084 is citext, 106108 is citext's equality operator,
> and the net implication of all this is that the original matview is doing
> the JOIN using case-sensitive equality whereas the new one is using
> case-insensitive equality.
>

This makes sense. We've narrowed the different results down to this exact
case; that is, for the rows that are missing in the old matview, they have
"email" entries in different case between some of the tables. So
case-sensitive equality checks on these will naturally be lost somewhere.
How this came to be is another question...

A plausible explanation for how things got that way is that citext's
> equality operator wasn't in your search_path when you created the original
> matview, but it is in view when you make the new one, allowing that
> equality operator to capture the interpretation of USING.


Possibly, since this view has existed for many years. However in general,
our multi-tenant migration system does SET search_path TO , public for
all DDL we do, so unless we had an issue whey back when it should've been
present.


> Unfortunately,
> since the reverse-listing of this join is just going to say "USING
> (email)", there's no way to detect from human-readable output that the
> interpretation of the USING clauses is different.  (We've contemplated
> introducing not-SQL-standard syntax to allow flagging such cases, but
> haven't pulled the trigger on that.)
>

If I'm reading this correctly, would this be a "reason" to be more explicit
when doing joins involving non-standard data types? I.e. would it be
"safer" to do ON x1.email::citext == x2.email::citext instead of USING (if
there is any difference at all...)?

I count five places in the query with similar operator substitutions.
> There are some other diffs in the trees that are a bit odd, but might be
> explained if the new view was made by dump/reload rather than from the
> identical SQL text the original view was made from; they all look like
> they are references to JOIN output columns rather than the underlying
> table columns or vice versa.  That's probably harmless, but the different
> join operators certainly are not.
>

I looked over the procedure we used for the upgrade, and it was this:
Postgres 9.4 server backup using WAL-E, restored into 9.4 on a new box,
where we did some data cleanup and upgraded extensions including citext.
Then used pg_dump from 12, directory format, and restored into 12. Since
that time no DDL or extension changes have been made on the tables/view
involved here. Could the citext upgrade have had any effect here, messing
with dependencies somehow?

Anyway, I think I'll just recreate these views for all tenants to be on the
safe side. Still curious about how this came to be, but since it's about 10
years of history in this database I guess it will be hard to figure
anything out for sure.

Thanks a lot for looking into this, let me know if there's any reason to
dig further.

Best,
-- a.


Re: Different results from identical matviews

2020-07-02 Thread Anders Steinlein
On Thu, Jul 2, 2020 at 11:44 PM Tom Lane  wrote:

> Anders Steinlein  writes:
> > I'm reading this correctly, would this be a "reason" to be more explicit
> > when doing joins involving non-standard data types? I.e. would it be
> > "safer" to do ON x1.email::citext == x2.email::citext instead of USING
> (if
> > there is any difference at all...)?
>
> Yes, it would be.  Of course then you don't get the "merging" of the two
> join output columns into one, so you might have to qualify references a
> bit more.
>
> You might find this thread interesting:
>
>
> https://www.postgresql.org/message-id/flat/ffefc172-a487-aa87-a0e7-472bf29735c8%40gmail.com


Indeed interesting, thanks!

Am I right in thinking that we should actually go over (i.e. re-create) all
functions and views defined before this dump/restore where we're using JOIN
... USING (citext_column)? We most definitely have many more such cases,
since this is the common (perhaps naive) way we've written joins (unless
there are obvious reasons to be explicit). :-/

Best,
-- a.