ERROR: could not open relation with OID XXXX

2024-08-25 Thread Marcelo Zabani
Hi all,

I can reproduce the error in the subject from time to time when querying
catalog tables while DDL is happening concurrently. Here's a bash script
that reproduces it (not always, you might have to run it many times until
you see ERROR:  could not open relation with OID ):

#!/usr/bin/env bash
psql -c "create table test(x serial primary key); select oid, relname from
pg_class where relname='test'"
# The next two queries will run concurrently
psql -c "select oid, relname, pg_sleep(3), pg_get_indexdef(oid) from
pg_class join pg_index on indexrelid=pg_class.oid WHERE
relname='test_pkey';" 2>&1 1>/tmp/pgbug.log &
sleep 1
psql -c "drop table test"
cat /tmp/pgbug.log
wait

I am confused as to how this is possible. I assume if the row with the
test_pkey index exists in the pg_index catalog table, that the snapshot of
the catalog tables contains the test table itself and is generally
consistent, so querying the catalog should not run into such errors.

I've seen this happen in Production without pg_sleep in the mix, too. I
added pg_sleep to the example above only because it makes the error easier
to reproduce.

Is there something I can do to avoid this? Is my understanding of how the
catalog tables work wrong?

Thanks,
Marcelo.


Re: ERROR: could not open relation with OID XXXX

2024-08-25 Thread Marcelo Zabani
> What version (including minor number)?

On my computer I can reproduce the error with v16.1. At my job I know it's
v15 but I can't access it right now so don't know the minor version. In any
case, at my job it's much bigger queries we run; I'll elaborate why below.

> Just out of curiosity, *WHY* do you do this?  It's never occurred to me
to do that; maybe it's something useful that I've been overlooking.

A project developed by me, codd - https://github.com/mzabani/codd - applies
postgresql migrations and checks tables, columns names/order/types,
indexes, etc. ("the full schema") to ensure databases in every environment
match what developers have on their computers. This is why the queries are
bigger, e.g.
https://github.com/mzabani/codd/blob/master/src/Codd/Representations/Database/Pg12.hs#L523-L540

One yet untested hypothesis is that codd is picking up temporary tables
when we deploy that are soon destroyed by the application, and that maybe
postgres is running into this error because it executes some of these
functions _before_ filtering out temporary relations (WHERE relpersistence
<> 't'). This might be possible depending on query plan, I believe. But
again, an untested hypothesis.

On Sun, Aug 25, 2024 at 11:31 AM Ron Johnson 
wrote:

> On Sun, Aug 25, 2024 at 9:42 AM Marcelo Zabani  wrote:
>
>> Hi all,
>>
>> I can reproduce the error in the subject from time to time when querying
>> catalog tables while DDL is happening concurrently. Here's a bash script
>> that reproduces it (not always, you might have to run it many times until
>> you see ERROR:  could not open relation with OID ):
>>
> [snip]
>
>> I've seen this happen in Production without pg_sleep in the mix, too. I
>> added pg_sleep to the example above only because it makes the error easier
>> to reproduce.
>>
>
> What version (including minor number)?
>
>
>> Is there something I can do to avoid this? Is my understanding of how the
>> catalog tables work wrong?
>>
>
> Just out of curiosity, *WHY* do you do this?  It's never occurred to me
> to do that; maybe it's something useful that I've been overlooking.
>
> --
> Death to America, and butter sauce.
> Iraq lobster!
>


Re: ERROR: could not open relation with OID XXXX

2024-08-25 Thread Marcelo Zabani
> we do some special stuff for catalogs

That is good to know, thanks!

> I believe you could actually lock the pg_class rows for update. Just add
FOR UPDATE at the end of the query.

Thanks, but I tried that and got "ERROR:  permission denied for table
pg_class", even if I try it only for tables the user owns.

At least considering the use-case of avoiding this error due to temporary
tables/indexes (which are a part of normal application execution), I was
thinking of using materialized CTEs that filters those out, and only after
that using other functions that for example take OIDs and return
definitions. Other kinds of DDL that create non-temporary tables can be
"blamed" on developers in my case.
Do you think using those materialized CTEs could help? And do you think
this can be considered a bug that I should report or is it just too
edge-casey to consider?

Regards.

On Sun, Aug 25, 2024 at 12:06 PM Tomas Vondra  wrote:

> On 8/25/24 15:42, Marcelo Zabani wrote:
> > Hi all,
> >
> > I can reproduce the error in the subject from time to time when querying
> > catalog tables while DDL is happening concurrently. Here's a bash script
> > that reproduces it (not always, you might have to run it many times
> > until you see ERROR:  could not open relation with OID ):
> >
> > #!/usr/bin/env bash
> > psql -c "create table test(x serial primary key); select oid, relname
> > from pg_class where relname='test'"
> > # The next two queries will run concurrently
> > psql -c "select oid, relname, pg_sleep(3), pg_get_indexdef(oid) from
> > pg_class join pg_index on indexrelid=pg_class.oid WHERE
> > relname='test_pkey';" 2>&1 1>/tmp/pgbug.log &
> > sleep 1
> > psql -c "drop table test"
> > cat /tmp/pgbug.log
> > wait
> >
> > I am confused as to how this is possible. I assume if the row with the
> > test_pkey index exists in the pg_index catalog table, that the snapshot
> > of the catalog tables contains the test table itself and is generally
> > consistent, so querying the catalog should not run into such errors.
> >
>
> I think you're assuming the whole query runs with a single snapshot, and
> AFAIK that's not quite accurate - we do some special stuff for catalogs,
> for example. There's also the additional complexity of maintaining a
> cache on catalogs, invalidating it, etc.
>
> I don't have a great simple "this happens because X" explanation, but a
> lot of this relies on proper locking - in particular, that we lock all
> the objects before execution, which also invalidates all the caches etc.
>
> But that can't happen here, because we only realize we need to access
> the OID very late in the execution, when we get to pg_get_indexdef.
>
> > I've seen this happen in Production without pg_sleep in the mix, too. I
> > added pg_sleep to the example above only because it makes the error
> > easier to reproduce.
> >
>
> It's a race condition, essentially. The sleep just makes it easier to
> hit, but it can happen without it.
>
> > Is there something I can do to avoid this? Is my understanding of how
> > the catalog tables work wrong?
> >
>
> I believe you could actually lock the pg_class rows for update. Just add
> FOR UPDATE at the end of the query.
>
>
> regards
>
> --
> Tomas Vondra
>


Re: ERROR: could not open relation with OID XXXX

2024-08-25 Thread Marcelo Zabani
Thank you all for your comments.

> I would think that the queries in that case would be running as a
superuser in order to do the migrations.

Users of codd can choose the role that applies their migrations. Codd even
supports individual migrations running with ad-hoc users (so that a
migration can use the _postgres_ user to create the application's database,
for example) and users are free to add statements like `SET ROLE` inside
their migrations, too. So it's sadly not possible AFAICT to force superuser
onto them.

But I think I have plenty of things to try to avoid this problem, from
retrying like Tomas suggested to materialized CTEs that filter out
temporary relations before functions like pg_get_indexdef are called.

I will give these things a shot.

Regards.

On Sun, Aug 25, 2024 at 1:13 PM Adrian Klaver 
wrote:

> On 8/25/24 08:36, Marcelo Zabani wrote:
> >  > we do some special stuff for catalogs
> >
> > That is good to know, thanks!
> >
> >  > I believe you could actually lock the pg_class rows for update. Just
> > add FOR UPDATE at the end of the query.
> >
> > Thanks, but I tried that and got "ERROR:  permission denied for table
> > pg_class", even if I try it only for tables the user owns.
> >
>
> As I understand it this issue came up in:
>
> https://github.com/mzabani/codd
>
> I would think that the queries in that case would be running as a
> superuser in order to do the migrations.
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>
>


Re: RESET, NULL and empty-string valued settings and transaction isolation

2024-10-19 Thread Marcelo Zabani
Thank you both for the explanations and the link to the discussion of
Pavel's patches.


RESET, NULL and empty-string valued settings and transaction isolation

2024-10-19 Thread Marcelo Zabani
Hi, according to the docs, RESET sets the value of a setting to
"The default value is defined as the value that the parameter would have
had, if no SET had ever been issued for it in the current session"

Which confuses me given that the value starts being NULL in the session and
then turns into an empty string:
$ psql
=> select current_setting('my.test', true) is null; -- true
=> set my.test = 'abc';
=> reset my.test;
=> select current_setting('my.test', true) is null; -- false
=> select current_setting('my.test', true)=''; -- true

A similar effect happens with transactions and SET LOCAL:
=> begin;
=*> set local my.test='abc';
=*> rollback; -- commit works too
=> select current_setting('my.test', true) = ''; -- true

Is this expected? I thought even if I misunderstand the docs, the effect
isn't very nice because SQL like
current_setting('my.some_boolean_setting')::boolean will fail after a
transaction with SET LOCAL sets it, a side-effect that can be particularly
confusing and basically requires usage of nullif(.., '') or other explicit
checks around every current_setting call-site in practice.

Thanks in advance,
Marcelo.