My ongoing project to make VACUUM more predictable over time via proactive freezing [1] will significantly increase the overall number of tuples frozen by VACUUM (at least in larger tables). It's important that we avoid any new user-visible impact from the extra freezing, though. I recently spent a lot of time adding high-level techniques that aim to avoid extra freezing (e.g. by being lazy about freezing) when that makes sense. Low-level techniques aimed at making the mechanical process of freezing cheaper might also help. (In any case it's well worth optimizing.)
I'd like to talk about one such technique on this thread. The attached WIP patch reduces the size of xl_heap_freeze_page records by applying a simple deduplication process. This can be treated as independent work (I think so, at least). The patch doesn't change anything about the conceptual model used by VACUUM's lazy_scan_prune function to build xl_heap_freeze_page records for a page, and yet still manages to make freeze WAL records over 5x smaller in many important cases. They'll be ~4x-5x smaller with *most* workloads, even.

Each individual tuple entry (each xl_heap_freeze_tuple) adds a full 12 bytes to the WAL record right now -- no matter what. So the existing approach is rather space inefficient by any standard (perhaps because it was developed under time pressure while fixing bugs in Postgres 9.3). More importantly, there is a lot of natural redundancy among the xl_heap_freeze_tuple entries for a page -- each tuple's details tend to match those of the other tuples. We can usually get away with storing each unique combination of values from xl_heap_freeze_tuple only once per xl_heap_freeze_page record, while storing the associated page offset numbers in a separate area, grouped by their canonical freeze plan (a normalized version of the information currently stored in xl_heap_freeze_tuple). A rough sketch of what that layout could look like appears below.

In practice most individual tuples that undergo any kind of freezing only need to have their xmin field frozen. And when xmax is affected at all, it'll usually just get set to InvalidTransactionId. So the actual low-level processing steps for xmax have a high chance of being shared by other tuples on the page, even in ostensibly tricky cases. While there are quite a few paths that lead to VACUUM setting a tuple's xmax to InvalidTransactionId, they all end up with the same instructional state in the final xl_heap_freeze_tuple entry.

Note that there is a small chance that the patch will be less space efficient, by up to 2 bytes per tuple frozen per page, in cases where we're allocating new Multis during VACUUM. I think that this should be acceptable on its own: even in rare bad cases we'll usually still come out ahead. What are the chances that we won't make up the difference on the same page, or at least within the same VACUUM? And that's before we talk about a future world in which freezing will batch tuples together at the page level (you don't have to bring the other VACUUM work into this discussion, I think, but it's not *completely* unrelated either).

[1] https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3j0aad6fxk93u-xq...@mail.gmail.com

--
Peter Geoghegan
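PS To make the proposed layout a bit more concrete, here is a rough sketch of the idea. The existing xl_heap_freeze_tuple struct is shown as it appears in heapam_xlog.h; the "freeze plan" struct and its field names are invented here for illustration, and aren't necessarily what the attached patch does. TransactionId, OffsetNumber, uint16, and uint8 are the usual Postgres typedefs.

    /*
     * Existing format: one of these per frozen tuple, ~12 bytes each
     * once alignment padding is included.
     */
    typedef struct xl_heap_freeze_tuple
    {
        TransactionId xmax;         /* uint32 */
        OffsetNumber offset;        /* uint16 */
        uint16      t_infomask2;
        uint16      t_infomask;
        uint8       frz_flags;
    } xl_heap_freeze_tuple;

    /*
     * Sketch of a deduplicated format (hypothetical struct/field names):
     * each distinct freeze plan is stored once, with the offset numbers
     * moved out into a trailing array that is grouped by plan.
     */
    typedef struct xl_heap_freeze_plan
    {
        TransactionId xmax;
        uint16      t_infomask2;
        uint16      t_infomask;
        uint8       frz_flags;
        uint16      ntuples;        /* offsets covered by this plan */
    } xl_heap_freeze_plan;

    /*
     * The xl_heap_freeze_page body then consists of nplans plan entries,
     * followed by the per-plan OffsetNumber arrays (2 bytes per frozen
     * tuple).  With a single plan covering all N frozen tuples on a page
     * that's roughly 12 + 2*N bytes rather than 12*N bytes, which is
     * where the ~4x-5x space saving described above comes from.
     */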
v1-0001-Shrink-freeze-WAL-records-via-deduplication.patch