Hi hackers,

Attached script reproduces the problem with incorrect results of `select count(*)` (it returns larger number of records than really available in the table). It is not always reproduced, so you may need to repeat it multiple times - at my system it failed 3 times from 10.

The problem takes place with pg16/17/18 (other versions I have not checked).

The test is called `test_ios` (index-only-scan), but it is not correct. Index-only scan is not used in this case. And this is actually the first question to PG17/18: IOS is not used when number of records is less than 100k (for this particular table):

postgres=# create table t(pk integer primary key); CREATE TABLE postgres=# set enable_seqscan = off; SET postgres=# set enable_indexscan = off; SET postgres=# insert into t values (generate_series(1,1000)); INSERT 0 1000 postgres=# vacuum t; VACUUM postgres=# explain select count(*) from t; QUERY PLAN ------------------------------------------------------------------------------- Aggregate (cost=43.02..43.03 rows=1 width=8) -> Bitmap Heap Scan on t (cost=25.52..40.52 rows=1000 width=0) -> Bitmap Index Scan on t_pkey (cost=0.00..25.27 rows=1000 width=0) (3 rows) postgres=# set enable_bitmapscan = off; SET postgres=# explain select count(*) from t; QUERY PLAN ------------------------------------------------------------ Aggregate (cost=17.50..17.51 rows=1 width=8) -> Seq Scan on t (cost=0.00..15.00 rows=1000 width=0) Disabled: true (3 rows)

So, as you can see, Postgres prefers to use disabled seqscan, but not IOS. It is different from pg16 where disabling bitmap scan makes optimizer to choose index-only scan:

postgres=# explain select count(*) from t; QUERY PLAN ----------------------------------------------------------- Aggregate (cost=41.88..41.88 rows=1 width=8) -> Seq Scan on t (cost=0.00..35.50 rows=2550 width=0) (2 rows) postgres=# set enable_seqscan = off; SET postgres=# explain select count(*) from t; QUERY PLAN ------------------------------------------------------------------------------- Aggregate (cost=75.54..75.55 rows=1 width=8) -> Bitmap Heap Scan on t (cost=33.67..69.17 rows=2550 width=0) -> Bitmap Index Scan on t_pkey (cost=0.00..33.03 rows=2550 width=0) (3 rows) postgres=# set enable_bitmapscan = off; SET postgres=# explain select count(*) from t; QUERY PLAN ------------------------------------------------------------------------------- Aggregate (cost=45.77..45.78 rows=1 width=8) -> Index Only Scan using t_pkey on t (cost=0.28..43.27 rows=1000 width=0) (2 rows)

This is strange behavior of pg17 which for some reasons rejects IOS (but it is used if number of records in the table is 100k or more). But the main problem is that used plan Bitmap Heap Scan + Bitmap Index Scan may return incorrect result.

Replacing `select count(*)` with `select count(pk)` eliminates the problem, as well as disabling of autovacuum. It seems to be clear that the problem is with visibility map.

We have the following code in heap bitmap scan: /* * We can skip fetching the heap page if we don't need any fields from the * heap, the bitmap entries don't need rechecking, and all tuples on the * page are visible to our transaction. */ if (!(scan->rs_flags & SO_NEED_TUPLES) && !tbmres->recheck && VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &hscan->rs_vmbuffer)) { /* can't be lossy in the skip_fetch case */ Assert(tbmres->ntuples >= 0); Assert(hscan->rs_empty_tuples_pending >= 0); hscan->rs_empty_tuples_pending += tbmres->ntuples; return true; }

So if we do not need tuples (|count(*)|case) and page is marked as all-visible in VM, then we just count|tbmres->ntuples|elements without extra checks. I almost not so familiar with internals of executor, but it is not clear to me how we avoid race condition between VM update and heap bitmap scan?

Assume that bitmap scan index marks all tids available in index. Some elements in this bitmap can refer old (invisible) versions. Then vacuum comes, removes dead elements and mark page as all-visible. After it we start heap bitmap scan, see that page is all-visible and count all marked elements on this page including dead (which are not present in the page any more).
Which lock or check should prevent such scenario?
import random
import threading
import time
import psycopg2

def test_ios():
    con = psycopg2.connect(database="postgres")
    n_records = 1000
    n_oltp_writers = 10
    n_oltp_readers = 5
    n_olap_readers = 1

    con.autocommit = True
    cur = con.cursor()

    cur.execute("DROP table if exists t")
    cur.execute("CREATE TABLE t(pk bigint not null)")
    cur.execute(f"insert into t values (generate_series(1,{n_records}))")
    cur.execute("create index on t(pk)")
    cur.execute("vacuum t")

    running = True

    def oltp_writer():
        con = psycopg2.connect(database="postgres")
        con.autocommit = True
        cur = con.cursor()
        while running:
            pk = random.randrange(1, n_records)
            cur.execute(f"update t set pk={n_records-pk+1} where pk={pk}")

    def oltp_reader():
        con = psycopg2.connect(database="postgres")
        con.autocommit = True
        cur = con.cursor()
        cur.execute("set effective_io_concurrency=0")
        cur.execute("set max_parallel_workers_per_gather=0")
        cur.execute("set enable_seqscan=off")
        cur.execute("set enable_indexscan=off")
        while running:
            pk = random.randrange(1, n_records)
            cur.execute(f"select count(*) from t where pk between {pk} and {pk+1000}")

    def olap_reader():
        con = psycopg2.connect(database="postgres")
        con.autocommit = True
        cur = con.cursor()
        cur.execute("set effective_io_concurrency=0")
        cur.execute("set max_parallel_workers_per_gather=0")
        cur.execute("set enable_seqscan=off")
        cur.execute("set enable_indexscan=off")
        while running:
            cur.execute("select count(*) from t")
            count = cur.fetchall()[0][0]
            assert count == n_records

    oltp_writers = []
    for _ in range(n_oltp_writers):
        t = threading.Thread(target=oltp_writer)
        oltp_writers.append(t)
        t.start()

    oltp_readers = []
    for _ in range(n_oltp_readers):
        t = threading.Thread(target=oltp_reader)
        oltp_readers.append(t)
        t.start()

    olap_readers = []
    for _ in range(n_olap_readers):
        t = threading.Thread(target=olap_reader)
        olap_readers.append(t)
        t.start()

    time.sleep(100)
    running = False

    for t in oltp_writers:
        t.join()
    for t in oltp_readers:
        t.join()
    for t in olap_readers:
        t.join()

if __name__=="__main__":
    test_ios()

Reply via email to