On Fri, Jan 10, 2020 at 1:36 PM Peter Geoghegan <p...@bowt.ie> wrote: > * HEIKKI: Do we only generate one posting list in one WAL record? I > would assume it's better to deduplicate everything on the page, since > we're modifying it anyway.
Still thinking about this one. > * HEIKKI: Does xl_btree_vacuum WAL record store a whole copy of the > remaining posting list on an updated tuple? Wouldn't it be simpler and > more space-efficient to store just the deleted TIDs? This probably makes sense. The btreevacuumposting() code that generates "updated" index tuples (tuples that VACUUM uses to replace existing ones when some but not all of the TIDs need to be removed) was derived from GIN's ginVacuumItemPointers(). That approach works well enough, but we can do better now. It shouldn't be that hard. My preferred approach is a little different to your suggested approach of storing the deleted TIDs directly. I would like to make it work by storing an array of uint16 offsets into a posting list, one array per "updated" tuple (with one offset per deleted TID within each array). These arrays (which must include an array size indicator at the start) can appear in the xl_btree_vacuum record, at the same place the patch currently puts a raw IndexTuple. They'd be equivalent to a raw IndexTuple -- the REDO routine would reconstruct the same raw posting list tuple on its own. This approach seems simpler, and is clearly very space efficient. This approach is similar to the approach used by REDO routines to handle posting list splits. Posting list splits must call _bt_swap_posting() on the primary, while the corresponding REDO routines also call _bt_swap_posting(). For space efficient "updates", we'd have to invent a sibling utility function -- we could call it _bt_delete_posting(), and put it next to _bt_swap_posting() within nbtdedup.c. How do you feel about that approach? (And how do you feel about the existing "REDO routines call _bt_swap_posting()" business that it's based on?) > * HEIKKI: Would it be more clear to have a separate struct for the > posting list split case? (i.e. don't reuse xl_btree_insert) I've concluded that this one probably isn't worthwhile. We'd have to carry a totally separate record on the stack within _bt_insertonpg(). If you feel strongly about it, I will reconsider. -- Peter Geoghegan