On Sun, Jul 3, 2022 at 12:59 AM Andres Freund <and...@anarazel.de> wrote:
> Hm. Now that I think about it, isn't the XlogFlush() in
> XLogPutNextRelFileNumber() problematic performance wise? Yes, we'll spread the
> cost across a number of GetNewRelFileNumber() calls, but still, an additional
> f[data]sync for every 64 relfilenodes assigned isn't cheap - today there's
> zero fsyncs when creating a sequence or table inside a transaction (there are
> some for indexes, but there's patches to fix that).
>
> Not that I really see an obvious alternative.

I think to see the impact we need a workload that creates relfilenodes
frequently; for example, we could create relations/indexes in parallel with a
pgbench run and check how much of a performance hit we see.  And if we do see
an impact, increasing the VAR_RFN_PREFETCH value should help with that.

> I guess we could try to invent a flush-log-before-write type logic for
> relfilenodes somehow? So that the first block actually written to a file needs
> to ensure the WAL record that created the relation is flushed. But getting
> that to work reliably seems nontrivial.
>
> One thing that would be good is to add an assertion to a few places ensuring
> that relfilenodes aren't above ->nextRelFileNumber, most importantly somewhere
> in the recovery path.

Yes, it makes sense.
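To make sure I'm reading the suggestion correctly, here is a rough sketch of
the kind of check I have in mind.  The helper and lock names
(AssertRelFileNumberAllocated, RelFileNumberGenLock) are just placeholders I'm
using for illustration, not anything that exists in the patch yet:

#include "postgres.h"

#include "access/transam.h"
#include "storage/lwlock.h"

/*
 * Sketch only: verify that a relfilenumber we are about to use (or see
 * during redo) is below the next value the counter would hand out.
 */
static inline void
AssertRelFileNumberAllocated(RelFileNumber rfn)
{
#ifdef USE_ASSERT_CHECKING
	RelFileNumber	next;

	LWLockAcquire(RelFileNumberGenLock, LW_SHARED);
	next = ShmemVariableCache->nextRelFileNumber;
	LWLockRelease(RelFileNumberGenLock);

	/* every relfilenumber in use must be below the next one to assign */
	Assert(rfn < next);
#endif
}

We could call something like this from the storage-create path and from the
corresponding redo routines, so that a nextRelFileNumber which went backwards
after a crash trips the assert right away.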
> Why did you choose a quite small value for VAR_RFN_PREFETCH? VAR_OID_PREFETCH
> is 8192, but you chose 64 for VAR_RFN_PREFETCH?

Earlier it was 8192, then there was a comment from Robert that we can use an
Oid for many other things, like an identifier for a catalog tuple or a TOAST
chunk, but a RelFileNumber requires a filesystem operation, so the amount of
work needed to use up 8192 RelFileNumbers is a lot bigger than the amount of
work required to use up 8192 OIDs.  I think that makes sense, so I reduced it
to 64.  But now I tend to think we also need to consider that after consuming
VAR_RFN_PREFETCH values we are going to do an XLogFlush(), so it's better to
keep it high.  And as Robert said upthread, even at 8192 we can still crash
2^41 (about 2 trillion) times before we completely run out of numbers.  So I
think we can easily keep it at 8192, and I don't think we really need to worry
much about the performance impact of XLogFlush().

> I'd spell out RFN in VAR_RFN_PREFETCH btw, it took me a bit to expand RFN to
> relfilenode.

Okay.

> This is embarassing. Trying to keep alignment match between C and catalog
> alignment on AIX, without actually making the system understand the alignment
> rules, is a remarkably shortsighted approach.
>
> I started a separate thread about it, since it's not really relevant to this
> thread:
> https://postgr.es/m/20220702183354.a6uhja35wta7agew%40alap3.anarazel.de
>
> Maybe we could at least make the field order to be something like
> oid, relam, relfilenode, relname

Yeah, we can do that.

> that should be ok alignment wise, keep oid first, and seems to make sense from
> an "importance" POV? Can't really interpret later fields without knowing relam
> etc.

Right.

> > I think due to wraparound if relfilenode gets reused by another
> > relation in the same checkpoint then there was an issue in crash
> > recovery with wal level minimal. But the root of the issue is a
> > wraparound, right?
>
> I'm not convinced the tombstones were required solely in the oid wraparound
> case before, despite the comment saying so, with wal_level=minimal. I gotta do
> some non-work stuff for a bit, so I need to stop pondering this now :)
>
> I think it might be a good idea to have a few weeks in which we do *not*
> remove the tombstones, but have assertion checks against such files existing
> when we don't expect them to. I.e. commit 1-3, add the asserts, then commit 4
> a bit later.

I think this is a good idea.

> > Okay, so you mean to say that we can first drop the remaining segment
> > and at last we drop the segment 0 right?
>
> I'd use the approach Robert suggested and delete from the end, going down.

Yeah, I got it, thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com