On Fri, Jan 27, 2023 at 9:24 AM Melanie Plageman <melanieplage...@gmail.com> wrote: > I have taken a stab at doing some of the tasks listed in this email.
Cool. > I have made the new files rmgr_utils.c/h. > > I have come up with a standard format that I like for the output and > used it in all the heap record types. > > Examples below: That seems like a reasonable approach. > I started documenting it in the rmgr_utils.h header file in a comment, > however it may be worth a README? > > I haven't polished this description of the format (or added examples, > etc) or used it in the btree-related functions because I assume the > format and helper function API will need more discussion. I think that standardization is good, but ISTM that we need clarity on what the scope is -- what is *not* being standardized? It may or may not be useful to call the end result an API. Or it may not make sense to do so in the first committed version, even though we may ultimately end up as something that deserves to be called an API. The obligation to not break tools that are scraping the output in whatever way seems kind of onerous right now -- just not having any gratuitous inconsistencies (e.g., fixing totally inconsistent punctuation, making the names for fields across WAL records consistent when they serve exactly the same purpose) would be a big improvement. As I mentioned in passing already, I actually don't think that the B-Tree WAL records are all that special, as far as this stuff goes. For example, the DELETE Btree record type is very similar to heapam's PRUNE record type, and practically identical to Btree's VACUUM record type. All of these record types use the same basic conventions, like a snapshotConflictHorizon field for recovery conflicts (which is generated in a very similar way during original execution, and processed in precisely the same way during REDO), and arrays of page offset numbers sorted in ascending order. There are some remaining details where things from an index AM WAL record aren't directly analogous (or pretty much identical) to some other heapam WAL records, such as the way that the DELETE Btree record type deals with deleting a subset of TIDs from a posting list index tuple (generated by B-Tree deduplication). But even these exceptions don't require all that much discussion. You could either choose to only display the array of deleted index tuple page offset numbers, as well as the similar array of "updated" index tuple page offset numbers from xl_btree_delete, in which case you just display two arrays of page offset numbers, in the same standard way. You may or may not want to also show each individual xl_btree_update entry -- doing so would be kinda like showing the details of individual freeze plans, except that you'd probably display something very similar to the page offset number display here too (even though these aren't page offset numbers, they're 0-based offsets into the posting list's item pointer data array). BTW, there is also a tendency for non-btree index AM WAL records to be fairly similar or even near-identical to the B-Tree WAL records. While Hash indexes are very different to B-Tree indexes at a high level, it is nevertheless the case that xl_hash_vacuum_one_page is directly based on xl_btree_delete/xl_btree_vacuum, and that xl_hash_insert is directly based on xl_btree_insert. There are some other WAL record types that are completely different across hash and B-Tree, which is a reflection of the fact that the index grows using a totally different approach in each AM -- but that doesn't seem like something that throws up any roadblocks for you (these can all be displayed as simple structs anyway). Speaking with my B-Tree hat on, I'd just be happy to be able to see both of the page offset number arrays (the deleted + updated offset number arrays from xl_btree_delete/xl_btree_vacuum), without also being able to\ see output for each individual xl_btree_update item-pointer-array-offset arrays -- just seeing that much is already a huge improvement. That's why I'm a bit hesitant to use the term API just yet, because an obligation to be consistent in whatever way seems like it might block incremental progress. > Perhaps there should also be example output of the offset arrays in > pgwalinspect docs? That would definitely make sense. > I've changed the array format helper functions that Andres added to be a > single function with an additional layer of indirection so that any > record with an array can use it regardless of type and format of the > individual elements. The signature is based somewhat off of qsort_r() > and allows the user to pass a function with the the desired format of > the elements. That's handy. > Personally, I like having the infomasks for the freeze plans. If we > someday have a more structured input to rmgr_desc, we could then easily > have them in their own column and use functions like > heap_tuple_infomask_flags() on them. I agree, in general, though long term the best approach is one that has a configurable level of verbosity, with some kind of roughly uniform definition of verbosity (kinda like DEBUG1 - DEBUG5, though probably with only 2 or 3 distinct levels). Obviously what you're doing here will lead to a significant increase in the verbosity of the output for affected WAL records. I don't feel too bad about that, though. It's really an existing problem, and one that should be fixed either way. You kind of have to deal with this already, by having a good psql pager, since record types such as COMMIT_PREPARED, INVALIDATIONS, and RUNNING_XACTS are already very verbose in roughly the same way. You only need to have one of these record types output by a function like pg_get_wal_records_info() to get absurdly wide output -- it hardly matters that most individual WAL record types have terse output at that point. > > > How hard would it be to invent a general mechanism to control the > > > verbosity > > > of what we'll show for each WAL record? > > > > Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc: > > void (*rm_desc) (StringInfo buf, XLogReaderState > > *record); > > > > so we'd need to patch all of them. That might be worth doing at some point, > > but I don't want to tackle it right now. > > In terms of a more structured format, it seems like it would make the > most sense to pass a JSON or composite datatype structure to rm_desc > instead of that StringInfo. > > I would also like to see functions like XLogRecGetBlockRefInfo() pass > something more useful than a stringinfo buffer so that we could easily > extract out the relfilenode in pgwalinspect. That does seem particularly important. It's a pain to do this from SQL. In general I'm okay with focussing on pg_walinspect over pg_waldump, since it'll become more important over time. Obviously pg_waldump needs to still work, but I think it's okay to care less about pg_waldump usability. > > BTW, while playing around with this patch today, I noticed that it > > won't display the number of elements in each offset array directly. > > Perhaps it's worth including that, too? > > I believe I have addressed this in the attached patch. Thanks for taking care of that. -- Peter Geoghegan