On 8/7/24 21:39, Melanie Plageman wrote:
> On Wed, Aug 7, 2024 at 1:06 PM Robert Haas <robertmh...@gmail.com> wrote:
>>
>> As I mentioned to you off-list, I feel like this needs some sort of
>> recency bias. Certainly vacuum, and really almost any conceivable user
>> of this facility, is going to care more about accurate answers for new
>> data than for old data. If there's no recency bias, then I think that
>> eventually answers for more recent LSNs will start to become less
>> accurate, since they've got to share the data structure with more and
>> more time from long ago. I don't think you've done anything about this
>> in this version of the patch, but I might be wrong.
>
> That makes sense. This version of the patch set doesn't have a recency
> bias implementation. I plan to work on it but will need to do the
> testing like you mentioned.
>

I agree we probably want more accurate results for recent data, so some
recency bias makes sense - for the eager vacuuming that's definitely
true, for example. But this was initially presented as a somewhat
universal LSN/timestamp mapping, and in that case it might make sense
to minimize the average error - which I think is what lsntime_to_drop()
currently does, by calculating the "area" etc.

Maybe it'd be good to approach this from the opposite direction: say
what "accuracy guarantees" we want to provide, and then design the
structure / algorithm to ensure that. Otherwise we may end up with an
infinite discussion about algorithms with no clear idea which one is
the best choice. And I'm sure "users" of the LSN/timestamp mapping
would get confused about what to expect without reasonably clear
guarantees.

For example, it seems to me a "good" accuracy guarantee would be:

  Given a LSN, the age of the returned timestamp is less than 10% off
  the actual age. The timestamp precision is in seconds.

This means that if the LSN was written 100 seconds ago, it would be OK
to get an answer in the 90-110 second range. For an LSN from 1h ago,
the acceptable range would be 3600s +/- 360s. And so on. The 10% is
just arbitrary, maybe it should be lower - it doesn't matter much.
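Expressed as a predicate, just to pin down what the guarantee means
(the name is made up for illustration, not from the patch):

#include <stdbool.h>
#include <stdint.h>

/*
 * The proposed guarantee: the age implied by the returned timestamp
 * may differ from the actual age by at most 10%.
 */
static bool
age_within_guarantee(int64_t actual_age_s, int64_t returned_age_s)
{
    int64_t diff = actual_age_s - returned_age_s;

    if (diff < 0)
        diff = -diff;

    return diff * 10 <= actual_age_s;
}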
How could we do this? We have 1s precision, so we start with buckets
for each second. And we want to allow merging stuff nicely. The
smallest merges we could do would be 1s -> 2s -> 4s -> 8s -> ..., but
let's say we do 1s -> 10s -> 100s -> 1000s instead. So we start with
100 one-second buckets and add coarser levels on top:

  [A_0, A_1, ..., A_99] -> 100 x 1s buckets
  [B_0, B_1, ..., B_99] -> 100 x 10s buckets
  [C_0, C_1, ..., C_99] -> 100 x 100s buckets
  [D_0, D_1, ..., D_99] -> 100 x 1000s buckets

We start by adding data into the A_k buckets. After filling all 100 of
them, we grab the 10 oldest buckets and combine/move them into B_k. And
so on, until B gets full too. Then we grab the 10 oldest B_k entries
and move them into C. And so on. For D the oldest entries would get
discarded, or we could add another layer with each bucket representing
10k seconds. (There's a rough code sketch of this below.)

A-D is already enough to cover ~30h, with A-E it'd be ~300h. Do we need
(or want) to keep a longer history?

These arrays are larger than what the current patch uses, of course.
That has 64 x 16B entries, so 1kB. These arrays are ~6kB - but I'm
pretty sure they could be made more compact by growing the buckets
slower. The 10x steps are just simpler to think about, and also - 6kB
seems pretty decent.

Note: I just realized the patch does LOG_STREAM_INTERVAL_MS = 30s, so
the 1s accuracy seems like overkill, and the structure could be much
smaller.
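To make the merge scheme concrete, here's a minimal sketch in C (the
names and layout are made up for illustration, not taken from the
patch):

#include <stdint.h>
#include <string.h>

#define NLEVELS       4     /* A..D: 1s, 10s, 100s, 1000s buckets */
#define NBUCKETS      100   /* buckets per level */
#define MERGE_FANOUT  10    /* 10 buckets merge into 1 one level up */

typedef struct TimeBucket
{
    uint64_t    lsn;        /* first LSN in this bucket's interval */
    int64_t     start_ts;   /* start of the interval (unix seconds) */
} TimeBucket;

typedef struct BucketLevel
{
    TimeBucket  buckets[NBUCKETS];  /* ordered oldest-first */
    int         nused;
} BucketLevel;

/* 4 levels x 100 buckets x 16B = ~6kB, matching the estimate above */
static BucketLevel levels[NLEVELS];

/*
 * Merge the oldest MERGE_FANOUT buckets of level "lvl" into one bucket
 * of the next level. The merged bucket keeps the oldest bucket's
 * (lsn, start_ts) pair, since that marks the start of the wider
 * interval. At the last level the oldest data is simply discarded.
 */
static void
merge_oldest(int lvl)
{
    BucketLevel *src = &levels[lvl];

    if (lvl + 1 < NLEVELS)
    {
        BucketLevel *dst = &levels[lvl + 1];

        if (dst->nused == NBUCKETS)
            merge_oldest(lvl + 1);  /* make room one level up */

        dst->buckets[dst->nused++] = src->buckets[0];
    }

    /* shift out the 10 oldest buckets */
    memmove(&src->buckets[0], &src->buckets[MERGE_FANOUT],
            (NBUCKETS - MERGE_FANOUT) * sizeof(TimeBucket));
    src->nused -= MERGE_FANOUT;
}

/* Record one (lsn, timestamp) observation in the finest level. */
void
bucket_insert(uint64_t lsn, int64_t ts)
{
    BucketLevel *a = &levels[0];

    if (a->nused == NBUCKETS)
        merge_oldest(0);

    a->buckets[a->nused].lsn = lsn;
    a->buckets[a->nused].start_ts = ts;
    a->nused++;
}

A lookup could then walk the levels for the bucket whose interval
contains the target LSN and interpolate within it.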
>> One way to make the standby more accurately mimic the primary would be
>> to base entries on the timestamp-LSN data that is already present in
>> the WAL, i.e. {COMMIT|ABORT} [PREPARED] records. If you only added or
>> updated entries on the primary when logging those records, the standby
>> could redo exactly what the primary did. A disadvantage of this
>> approach is that if there are no commits for a while then your mapping
>> gets out of date, but that might be something we could just tolerate.
>> Another possible solution is to log the changes you make on the
>> primary and have the standby replay those changes. Perhaps I'm wrong
>> to advocate for such solutions, but it feels error-prone to have one
>> algorithm for the primary and a different algorithm for the standby.
>> You now basically have two things that can break and you have to debug
>> what went wrong instead of just one.
>
> Your point about maintaining two different systems for creating the
> time stream being error prone makes sense. Honestly, logging the
> contents of the LSNTimeStream seems like it will be the simplest to
> maintain and understand. I was a bit apprehensive to WAL-log one part
> of a single stats structure (since the other stats aren't logged), but
> I think explaining why that's done is easier than explaining separate
> LSNTimeStream creation code for replicas.

Isn't this a sign this does not quite fit into pgstats? Even if this
happens to deal with unsafe restarts, replica promotions and so on,
what if the user just does pg_stat_reset? That already causes trouble
because we simply forget deleted/updated/inserted tuples. If we also
forget the data used for freezing heuristics, that does not seem
great ...

Wouldn't it be better to write this into WAL as part of a checkpoint
(or something like that), and make bgwriter not only add the
LSN/timestamp pair into the stream, but also write it into WAL? It's
already waking up anyway; on idle systems ~32B written to WAL does not
matter, and on a busy system it's just noise.
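Just to illustrate, logging one stream entry from bgwriter could be as
simple as the sketch below. The record struct and the
XLOG_LSNTIME_ENTRY info flag are hypothetical; the xloginsert calls are
the stock API:

#include "postgres.h"
#include "access/xloginsert.h"
#include "datatype/timestamp.h"

/* hypothetical WAL payload for one LSNTimeStream entry (16 bytes) */
typedef struct xl_lsntime_entry
{
    XLogRecPtr  lsn;    /* LSN observed by bgwriter */
    TimestampTz time;   /* wall-clock time of the observation */
} xl_lsntime_entry;

/* called from bgwriter next to where the entry is added to the stream */
static void
log_lsntime_entry(XLogRecPtr lsn, TimestampTz time)
{
    xl_lsntime_entry xlrec;

    xlrec.lsn = lsn;
    xlrec.time = time;

    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, sizeof(xlrec));
    (void) XLogInsert(RM_XLOG_ID, XLOG_LSNTIME_ENTRY);
}

The standby would then rebuild exactly the same stream during redo,
which would address the "two algorithms" concern.

regards

-- 
Tomas Vondra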