On Wed, Aug 21, 2019 at 3:55 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
> I have already attempted that part and I feel it is not making code
> any simpler than what we have today. For packing, it's fine because I
> can process all the member once and directly pack it into one memory
> chunk and I can insert that to the buffer by one call of
> InsertUndoBytes and that will make the code simpler.
OK...

> But, while unpacking if I directly unpack to the UnpackUndoRecord then
> there are few complexities. I am not saying those are difficult to
> implement but code may not look better.
>
> a) First, we need to add extra stages for unpacking as we need to do
> field by field.
>
> b) Some of the members like uur_payload and uur_tuple are not the same
> type in the UnpackUndoRecord compared to how it is stored in the page.
> In UnpackUndoRecord those are StringInfoData whereas on the page we
> store it as UndoRecordPayload header followed by the actual data. I
> am not saying we can not unpack this directly we can do it like,
> first read the payload length from the page in uur_payload.len then
> read tuple length in uur_tuple.len then read both the data. And, for
> that, we will have to add extra stages.

I don't think that this is true; or at least, I think it's avoidable.
The idea of an unpacking stage is that you refuse to advance to the
next stage until you've got a certain number of bytes of data, and
then you unpack everything that pertains to that stage. If you have
two 4-byte fields that you want to unpack together, you can just wait
until you've got 8 bytes of data and then unpack both; you don't
really need two separate stages. (Similarly, your concern about
fields being in a different order seems like it should be resolved by
agreeing on one ordering and having everything use it; I don't know
why there should be one order that is better in memory and another
order that is better on disk.)

The bigger issue here is that we don't seem to be making very much
progress toward improving the overall design. Several significant
improvements have been suggested:

1. Thomas suggested quite some time ago that we should make sure that
the transaction header is the first optional header. If we do that,
then I'm not clear on why we even need this incremental unpacking
stuff any more. The only reason for all of this machinery was so that
we could find the transaction header at an unknown offset inside a
complex record format; there is, if I understand correctly, no other
case in which we want to incrementally decode a record. But if the
transaction header is at a fixed offset, then there seems to be no
need to even have incremental decoding at all. Or it can be very
simple, with three stages: (1) we don't yet have enough bytes to
figure out how big the record is; (2) we have enough bytes to figure
out how big the record is and we have figured that out, but we don't
yet have all of those bytes; and (3) we have the whole record, so we
can decode the whole thing and we're done.

2. Based on a discussion with Thomas, I suggested the GHOB stuff,
which gets rid of the idea of transaction headers inside individual
records altogether; instead, there is one undo blob per transaction
(or maybe several, if we overflow to another undo log) which begins
with a sentinel byte that identifies it as owned by a transaction,
and then the transaction header immediately follows that without
being part of any record, and the records follow that data. As with
the previous idea, this gets rid of the need for incremental decoding
because it gets rid of the need to find the transaction header inside
of a bigger record. As Thomas put it to me off-list, it puts the
records inside of a larger chunk of space owned by the transaction
instead of putting the transaction header inside of some particular
record; that seems more correct than the way we have it now.
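To make (2) a bit more concrete, here is a rough sketch of the kind of
layout I have in mind. Every name in it is invented for illustration
(it is not the patch's actual code), and UndoRecPtr is just assumed to
be the patch set's 64-bit undo pointer; the point is only that the
transaction-level data sits at a fixed place at the start of a chunk
the transaction owns, with the records packed after it, so nothing
ever has to hunt for a transaction header inside an individual record:

    /* Illustrative sketch only -- not the patch's actual definitions. */
    #include "postgres.h"
    #include "access/transam.h"

    typedef uint64 UndoRecPtr;          /* assumed, per the patch set */

    #define UNDO_CHUNK_OWNED_BY_XACT    0x01    /* sentinel byte */

    typedef struct UndoXactChunkHeader
    {
        uint8       chunk_type;     /* UNDO_CHUNK_OWNED_BY_XACT */
        FullTransactionId fxid;     /* transaction owning this chunk */
        UndoRecPtr  next_chunk;     /* continuation if we overflow into
                                     * another undo log, else invalid */
        Size        chunk_size;     /* total space reserved for the chunk */
    } UndoXactChunkHeader;

    /*
     * On disk, per transaction (and per undo log, if it overflows):
     *
     *    +---------------------+----------+----------+-----+
     *    | UndoXactChunkHeader | record 1 | record 2 | ... |
     *    +---------------------+----------+----------+-----+
     *
     * Discard processing reads the chunk header; record decoding never
     * needs to know it exists.
     */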
3. Heikki suggested that the format should be much more generic and
that more should be left up to the AM. While neither Andres nor I are
convinced that we should go as far in that direction as Heikki is
proposing, the idea clearly has some merit, and we should probably be
moving that way to some degree. For instance, the idea that we should
store a block number and TID is a bit sketchy when you consider that a
multi-insert operation really wants to store a TID list. The zheap
tree has a ridiculous kludge to work around that problem; clearly we
need something better. We've also mentioned that, in the future, we
might want to support TIDs that are not 6 bytes, and that, even just
looking at what's currently under development, zedstore wants to treat
TIDs as 48-bit opaque quantities, not a 4-byte block number and a
2-byte item pointer offset. So, there is clearly a need to go through
the whole way we're doing this and rethink which parts are generic and
which parts are AM-specific.

4. A related problem, which has been mentioned or at least alluded to
both by Heikki and by me, is that we need a better way of handling the
AM-specific data. Right now, the zheap code packs fixed-size things
into the payload data and then finds them by knowing the offset where
any particular piece of data is within that field, but that's an
unmaintainable mess. The zheap code could be improved by at least
defining those offsets as constants someplace and adding some comments
explaining the payload formats of various undo records, but even if we
do that, it's not going to generalize very well to anything more
complicated than a few fixed-size bits of data. I suggested using the
pqformat stuff to try to structure that -- a suggestion to which
Heikki has unfortunately not responded, because I'd really like to get
his thoughts on it -- but whether we do that particular thing or not,
I think we need to do something (a rough sketch of the kind of pattern
I mean follows at the end of this list). In addition to wanting a
better way of handling packing and unpacking for payload data, there's
also a desire to have it participate in record compression, for which
we don't seem to have any kind of plan.

5. Andres suggested multiple ideas for cleaning up and improving this
code in
https://www.postgresql.org/message-id/20190814065745.2faw3hirvfhbrdwe%40alap3.anarazel.de
- which include the idea currently under discussion, several of the
same ideas that I mentioned above, and a number of other things, such
as making packing serialize to a char * rather than some ad-hoc
intermediate format, and having a metadata array over which we can
loop rather than having multiple places where there's a separate bit
of code for every field type. I don't think those suggestions are
entirely unproblematic; for instance, the metadata array would
probably work a lot better if we moved the transaction and log-switch
headers outside of individual records as suggested in (2) above.
Otherwise, the metadata would have to include not only data-structure
offsets but some kind of a flag indicating which of several data
structures ought to contain the relevant information, which would make
the whole thing a lot messier. And depending on what we do about (4),
this might become moot, or the details might change quite a bit,
because if we no longer have a long list of "generic" fields, then we
also won't have a bunch of places in the code that deal with that long
list of generic fields, which means the metadata array might not be
necessary, or might be simpler or smaller or designed differently.
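Here is the sketch promised under (4), just to illustrate the sort of
thing I mean by the pqformat suggestion. The two payload fields shown
(a previous xid and an offset number) are made up for the example --
they are not the actual zheap payload layout -- but the pattern is the
point: the AM serializes and deserializes its private data through the
StringInfo/pqformat routines in exactly one pack function and one
unpack function per record type, instead of computing byte offsets
into uur_payload by hand:

    #include "postgres.h"
    #include "lib/stringinfo.h"
    #include "libpq/pqformat.h"
    #include "storage/off.h"

    /* Hypothetical example fields, not the real zheap payload. */
    static void
    sample_pack_payload(StringInfo payload, TransactionId prev_xid,
                        OffsetNumber offnum)
    {
        pq_sendint32(payload, prev_xid);
        pq_sendint16(payload, offnum);
    }

    static void
    sample_unpack_payload(char *data, int len,
                          TransactionId *prev_xid, OffsetNumber *offnum)
    {
        StringInfoData buf;

        /* Wrap the raw payload bytes so pq_getmsg* can read them. */
        buf.data = data;
        buf.len = len;
        buf.maxlen = len;
        buf.cursor = 0;

        *prev_xid = (TransactionId) pq_getmsgint(&buf, 4);
        *offnum = (OffsetNumber) pq_getmsgint(&buf, 2);
    }

Whether we use pqformat itself or something home-grown, the gain is
that the payload's structure is written down in one place per record
type, and anything we later do about compression can treat the packed
payload as an opaque byte string.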
All of which is to make the point that responding to Andres's feedback
will require a bunch of decisions about which parts of the feedback to
take (because some of them are mutually exclusive, as he acknowledges
himself) and what to do about them (because some of them are vague);
yet, taken together, they seem to amount to the need for significant
design changes, as do (1)-(4).

Now, just to be clear, the code we're talking about here is mostly
based on an original design by me, and whatever defects were present
in that original design are nobody's fault but mine. And that list of
defects includes pretty much everything in the above list. But what we
need to figure out at this point is how we're going to get those
things fixed, and it seems to me that we're going to need a pretty
substantial redesign, while this discussion is kind of down in the
weeds. I mean, what are we gaining by arguing about how many stages we
need for incremental unpacking if the real thing we need to do is get
rid of that concept altogether?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company