Dean Rasheed <dean.a.rash...@gmail.com> writes:
> Yeah, that makes sense. I still can't see what might be causing those
> failures. The tests that were doing an ALTER COLUMN and then expecting
> to see the results of a non-analysed table ought to be fixed by
> 0936d1b6f, but that doesn't match the buildfarm failures. Possibly
> 0936d1b6f will help with those anyway, but if so, it'll be annoying
> not understanding why.
Quite :-(.  While it's too early to declare victory, we've seen no more
failures of this ilk since 0936d1b6f, so it's sure looking like autovacuum
did have something to do with it.

Just to save people repeating the search I did, these are the buildfarm
failures of interest so far:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-03-28%2006%3A33%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2009%3A20%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2013%3A20%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-03-28%2020%3A03%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2020-03-28%2022%3A00%3A19
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-03-29%2006%3A45%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-30%2002%3A20%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2020-03-30%2006%3A00%3A06
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2020-03-30%2006%3A10%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2020-03-30%2023%3A10%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2020-03-31%2005%3A00%3A35

The first of those is unlike the rest, and I'm not trying to account for it
here.  In the rest, what we see is that sometimes the estimates are off by
a little bit from what's expected, up or down just a percent or two.  And
those deltas kick in at inconsistent spots partway through a series of
similar tests, so it's hard to deny that *something* asynchronous to the
test script is causing it.

After contemplating the failures for a while, I have a theory that at
least partially matches the data.  What I think is happening is that
autovacuum (NOT auto-analyze) launches on the table, and since it is
running concurrently with the foreground test script, it fails to
immediately acquire buffer lock on one or more of the table pages.  Since
this isn't an aggressive vacuum scan, it just politely backs off and
doesn't scan those pages.  And that translates to not getting a perfectly
accurate reltuples estimate at the end of the vacuum.

On my x86_64 machine, which matches the buildfarm critters having trouble,
the actual contents of both of the troublesome tables will be 5000 tuples
in 31 pages --- which works out to 30 pages with 162 tuples each and then
140 tuples in the last page.  Working through the math in
vac_estimate_reltuples (and assuming that the "old" values were accurate
numbers from the test script's own ANALYZE), what I find is that
autovacuum will conclude there are 4999 tuples if it misses scanning one
of the first 30 pages, or 5021 tuples if it misses scanning the last page,
because its interpolation from the old tuple density figure will
underestimate or overestimate the number of missed tuples accordingly.
(A standalone sketch of that arithmetic is below.)  Once that slightly-off
number gets pushed into pg_class, we start to get slightly-off rowcount
estimates in the test cases.

So what I'm hypothesizing is that the pg_statistic data is perfectly fine
but pg_class.reltuples goes off a little bit after autovacuum.  The
percentage changes in reltuples that I predict this way don't quite square
with the percentage changes we see in the overall rowcount estimates,
which is a problem for this theory.
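For concreteness, here is a little standalone sketch of that
interpolation.  This is just the normal-path formula as described above,
not the real vac_estimate_reltuples (which has extra corner cases and may
differ in rounding details); the 31/162/140 figures are the table layout
on my machine.

#include <math.h>
#include <stdio.h>

/*
 * Normal-path estimate: interpolate the tuples in the pages we did not
 * scan from the old tuple density, then add the tuples actually counted
 * in the pages we did scan, rounding to the nearest integer.
 */
static double
estimate_reltuples(double old_tuples, double old_pages,
				   double total_pages, double scanned_pages,
				   double scanned_tuples)
{
	double		old_density = old_tuples / old_pages;
	double		unscanned = total_pages - scanned_pages;

	return floor(old_density * unscanned + scanned_tuples + 0.5);
}

int
main(void)
{
	/* "old" figures, accurate after the test script's own ANALYZE */
	double		old_tuples = 5000.0;
	double		old_pages = 31.0;

	/* skip one of the 30 full pages: count 29*162 + 140 tuples */
	printf("skip a full page: %.0f\n",
		   estimate_reltuples(old_tuples, old_pages, 31, 30,
							  29 * 162 + 140));

	/* skip the partly-full last page: count 30*162 tuples */
	printf("skip the last page: %.0f\n",
		   estimate_reltuples(old_tuples, old_pages, 31, 30,
							  30 * 162));
	return 0;
}

Built with "cc sketch.c -lm", that prints 4999 and 5021, matching the
predictions above: the old density is about 161.3 tuples/page, so skipping
a 162-tuple page undercounts slightly while skipping the 140-tuple last
page overcounts by rather more.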
But the test cases are exercising some fairly complex estimation logic,
and it wouldn't surprise me much if the estimates aren't linearly affected
by reltuples.  (Tomas, do you want to comment further on that point?)

			regards, tom lane