On Wed, Aug 21, 2019 at 3:55 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
> I have already attempted that part and I feel it is not making code
> any simpler than what we have today. For packing, it's fine because I
> can process all the member once and directly pack it into one memory
> chunk and I can insert that to the buffer by one call of
> InsertUndoBytes and that will make the code simpler.
OK...

> But, while unpacking if I directly unpack to the UnpackUndoRecord then
> there are few complexities. I am not saying those are difficult to
> implement but code may not look better.
>
> a) First, we need to add extra stages for unpacking as we need to do
> field by field.
>
> b) Some of the members like uur_payload and uur_tuple are not the same
> type in the UnpackUndoRecord compared to how it is stored in the page.
> In UnpackUndoRecord those are StringInfoData whereas on the page we
> store it as UndoRecordPayload header followed by the actual data. I
> am not saying we can not unpack this directly we can do it like,
> first read the payload length from the page in uur_payload.len then
> read tuple length in uur_tuple.len then read both the data. And, for
> that, we will have to add extra stages.

I don't think that this is true; or at least, I think it's avoidable.
The idea of an unpacking stage is that you refuse to advance to the
next stage until you've got a certain number of bytes of data, and
then you unpack everything that pertains to that stage. If you have
two 4-byte fields that you want to unpack together, you can just wait
until you've got 8 bytes of data and then unpack both; you don't
really need two separate stages. (Similarly, your concern about
fields being in a different order seems like it should be resolved by
agreeing on one ordering and having everything use it; I don't know
why there should be one order that is better in memory and another
order that is better on disk.)

The bigger issue here is that we don't seem to be making very much
progress toward improving the overall design. Several significant
improvements have been suggested:

1. Thomas suggested quite some time ago that we should make sure that
the transaction header is the first optional header. If we do that,
then I'm not clear on why we even need this incremental unpacking
stuff any more. The only reason for all of this machinery was so that
we could find the transaction header at an unknown offset inside a
complex record format; there is, if I understand correctly, no other
case in which we want to incrementally decode a record. But if the
transaction header is at a fixed offset, then there seems to be no
need to even have incremental decoding at all. Or it can be very
simple, with three stages: (1) we don't yet have enough bytes to
figure out how big the record is; (2) we have enough bytes to figure
out how big the record is and we have figured that out, but we don't
yet have all of those bytes; and (3) we have the whole record, so we
can decode the whole thing and we're done.

2. Based on a discussion with Thomas, I suggested the GHOB stuff,
which gets rid of the idea of transaction headers inside individual
records altogether; instead, there is one undo blob per transaction
(or maybe several, if we overflow to another undo log) which begins
with a sentinel byte that identifies it as owned by a transaction,
and then the transaction header immediately follows that without
being part of any record, and the records follow that data. As with
the previous idea, this gets rid of the need for incremental decoding
because it gets rid of the need to find the transaction header inside
of a bigger record. As Thomas put it to me off-list, it puts the
records inside of a larger chunk of space owned by the transaction
instead of putting the transaction header inside of some particular
record; that seems more correct than the way we have it now.
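To make (2) a bit more concrete, here is a rough sketch of the kind of
layout I have in mind. Every name in it is invented for illustration
(it is not the patch's actual code), and UndoRecPtr is just assumed to
be the patch set's 64-bit undo pointer; the point is only that the
transaction-level data sits at a fixed place at the start of a chunk
the transaction owns, with the records packed after it, so nothing
ever has to hunt for a transaction header inside an individual record:

    /* Illustrative sketch only -- not the patch's actual definitions. */
    #include "postgres.h"
    #include "access/transam.h"

    typedef uint64 UndoRecPtr;          /* assumed, per the patch set */

    #define UNDO_CHUNK_OWNED_BY_XACT    0x01    /* sentinel byte */

    typedef struct UndoXactChunkHeader
    {
        uint8       chunk_type;     /* UNDO_CHUNK_OWNED_BY_XACT */
        FullTransactionId fxid;     /* transaction owning this chunk */
        UndoRecPtr  next_chunk;     /* continuation if we overflow into
                                     * another undo log, else invalid */
        Size        chunk_size;     /* total space reserved for the chunk */
    } UndoXactChunkHeader;

    /*
     * On disk, per transaction (and per undo log, if it overflows):
     *
     *    +---------------------+----------+----------+-----+
     *    | UndoXactChunkHeader | record 1 | record 2 | ... |
     *    +---------------------+----------+----------+-----+
     *
     * Discard processing reads the chunk header; record decoding never
     * needs to know it exists.
     */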
3. Heikki suggested that the format should be much more generic and
that more should be left up to the AM. While neither Andres nor I are
convinced that we should go as far in that direction as Heikki is
proposing, the idea clearly has some merit, and we should probably be
moving that way to some degree. For instance, the idea that we should
store a block number and TID is a bit sketchy when you consider that a
multi-insert operation really wants to store a TID list. The zheap
tree has a ridiculous kludge to work around that problem; clearly we
need something better. We've also mentioned that, in the future, we
might want to support TIDs that are not 6 bytes, and that, even just
looking at what's currently under development, zedstore wants to treat
TIDs as 48-bit opaque quantities, not a 4-byte block number and a
2-byte item pointer offset. So, there is clearly a need to go through
the whole way we're doing this and rethink which parts are generic and
which parts are AM-specific.

4. A related problem, which has been mentioned or at least alluded to
both by Heikki and by me, is that we need a better way of handling the
AM-specific data. Right now, the zheap code packs fixed-size things
into the payload data and then finds them by knowing the offset where
any particular piece of data is within that field, but that's an
unmaintainable mess. The zheap code could be improved by at least
defining those offsets as constants someplace and adding some comments
explaining the payload formats of various undo records, but even if we
do that, it's not going to generalize very well to anything more
complicated than a few fixed-size bits of data. I suggested using the
pqformat stuff to try to structure that -- a suggestion to which
Heikki has unfortunately not responded, because I'd really like to get
his thoughts on it -- but whether we do that particular thing or not,
I think we need to do something (a rough sketch of the kind of pattern
I mean follows at the end of this list). In addition to wanting a
better way of handling packing and unpacking for payload data, there's
also a desire to have it participate in record compression, for which
we don't seem to have any kind of plan.

5. Andres suggested multiple ideas for cleaning up and improving this
code in
https://www.postgresql.org/message-id/20190814065745.2faw3hirvfhbrdwe%40alap3.anarazel.de
- which include the idea currently under discussion, several of the
same ideas that I mentioned above, and a number of other things, such
as making packing serialize to a char * rather than some ad-hoc
intermediate format, and having a metadata array over which we can
loop rather than having multiple places where there's a separate bit
of code for every field type. I don't think those suggestions are
entirely unproblematic; for instance, the metadata array would
probably work a lot better if we moved the transaction and log-switch
headers outside of individual records as suggested in (2) above.
Otherwise, the metadata would have to include not only data-structure
offsets but some kind of a flag indicating which of several data
structures ought to contain the relevant information, which would make
the whole thing a lot messier. And depending on what we do about (4),
this might become moot, or the details might change quite a bit,
because if we no longer have a long list of "generic" fields, then we
also won't have a bunch of places in the code that deal with that long
list of generic fields, which means the metadata array might not be
necessary, or might be simpler or smaller or designed differently.
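Here is the sketch promised under (4), just to illustrate the sort of
thing I mean by the pqformat suggestion. The two payload fields shown
(a previous xid and an offset number) are made up for the example --
they are not the actual zheap payload layout -- but the pattern is the
point: the AM serializes and deserializes its private data through the
StringInfo/pqformat routines in exactly one pack function and one
unpack function per record type, instead of computing byte offsets
into uur_payload by hand:

    #include "postgres.h"
    #include "lib/stringinfo.h"
    #include "libpq/pqformat.h"
    #include "storage/off.h"

    /* Hypothetical example fields, not the real zheap payload. */
    static void
    sample_pack_payload(StringInfo payload, TransactionId prev_xid,
                        OffsetNumber offnum)
    {
        pq_sendint32(payload, prev_xid);
        pq_sendint16(payload, offnum);
    }

    static void
    sample_unpack_payload(char *data, int len,
                          TransactionId *prev_xid, OffsetNumber *offnum)
    {
        StringInfoData buf;

        /* Wrap the raw payload bytes so pq_getmsg* can read them. */
        buf.data = data;
        buf.len = len;
        buf.maxlen = len;
        buf.cursor = 0;

        *prev_xid = (TransactionId) pq_getmsgint(&buf, 4);
        *offnum = (OffsetNumber) pq_getmsgint(&buf, 2);
    }

Whether we use pqformat itself or something home-grown, the gain is
that the payload's structure is written down in one place per record
type, and anything we later do about compression can treat the packed
payload as an opaque byte string.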
All of which is to make the point that responding to Andres's feedback
will require a bunch of decisions about which parts of the feedback to
take (because some of them are mutually exclusive, as he acknowledges
himself) and what to do about them (because some of them are vague);
yet, taken together, they seem to amount to the need for significant
design changes, as do (1)-(4).

Now, just to be clear, the code we're talking about here is mostly
based on an original design by me, and whatever defects were present
in that original design are nobody's fault but mine. And that list of
defects includes pretty much everything in the above list. But what we
need to figure out at this point is how we're going to get those
things fixed, and it seems to me that we're going to need a pretty
substantial redesign, while this discussion is kind of down in the
weeds. I mean, what are we gaining by arguing about how many stages we
need for incremental unpacking if the real thing we need to do is get
rid of that concept altogether?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company