On Fri, Aug 17, 2018 at 9:55 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> But then you are injecting bad pages into the shared buffer arena.
> In any case, depending on that behavior seems like a bad idea, because
> it's a pretty questionable kluge in itself.
>
> Another point is that the truncation code attempts to remove all
> to-be-truncated-away pages from the shared buffer arena, but that only
> works if nobody else is loading such pages into shared buffers
> concurrently.  In the presence of concurrent scans, we might be left
> with valid-looking buffers for pages that have been truncated away
> on-disk.  That could cause all sorts of fun later.  Yeah, the buffers
> should contain only dead tuples ... but, for example, they might not
> be hinted dead.  If somebody sets one of those hint bits and then
> writes the buffer back out to disk, you've got real problems.
I don't yet see how that's possible... count_nondeletable_pages() ensures that no to-be-truncated-away page contains any used item. From reading [1] I gather that we might even have live tuples left behind if the truncation failed. But if the truncation didn't fail, we shouldn't get this hint-bit problem. Please correct me if I'm wrong.

After reading [1] and [2] I understand that there are at least three different issues with heap truncation:
1) Data corruption on file truncation error (explained in [1]).
2) Expensive scanning of the whole shared buffers before file truncation.
3) Cancellation of read-only queries on standbys, even with hot_standby_feedback on, caused by replication of the AccessExclusiveLock.

It seems that fixing any of these issues requires a redesign of heap truncation. So, ideally, the redesign should fix all of the issues above, or at least it should be clear how the remaining issues could be fixed later on top of the new design.

I would like to share some sketchy thoughts of mine about a new heap truncation design. Imagine we introduce a dirty_barrier buffer flag, which prevents a dirty buffer from being written out (and, consequently, from being evicted). Then the truncation algorithm could look like this (a rough code sketch is given further below):

1) Acquire ExclusiveLock on the relation.
2) Calculate the truncation point using count_nondeletable_pages(), while simultaneously setting the dirty_barrier flag on dirty buffers and saving their block numbers to an array. Assuming no writes are running concurrently, no to-be-truncated-away pages should be written out from this point on.
3) Truncate the data files.
4) Iterate over the past-truncation-point buffers and clear the dirty and dirty_barrier flags from them (using the numbers we saved to the array in step #2).
5) Release the relation lock.
*) If an error occurs after step #2, iterate over the past-truncation-point buffers and clear only the dirty_barrier flags from them (using the numbers we saved to the array in step #2).

After heap truncation using this algorithm, shared buffers may still contain past-EOF buffers. But those buffers are empty (no used items) and clean, so read-only queries can't hint them dirty because there are no used items to set hint bits on. Normally these buffers will simply be evicted from the shared buffer arena. If a relation extension happens shortly after the heap truncation, some of those buffers could be found again after the extension. I think this situation can be handled; for instance, we could teach vacuum to claim a page as new once all of its tuples are gone.

We take only ExclusiveLock here, and assuming we teach our scans to treat a page-past-EOF situation as no-visible-tuples-found, read-only queries can run concurrently with the heap truncation. Also, we don't have to scan the whole shared buffers: only past-truncation-point buffers are scanned in step #2, and later the flags are cleared only from the dirty buffers past the truncation point. Data corruption on truncation error shouldn't happen either, because we never drop the dirty flag from any buffer before making sure that the data files were successfully truncated.

The problem I see with this approach so far is that setting too many dirty_barrier flags can affect concurrent activity. To cope with that we could, for instance, truncate the relation in multiple iterations when we find too many dirty buffers past the truncation point.

Any thoughts?
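To make steps 1-5 and the error path a bit more concrete, here is a rough, hypothetical C sketch. count_nondeletable_pages_set_barrier() and clear_dirty_barrier() do not exist; they stand in for the buffer-manager changes the dirty_barrier flag would require. LockRelation()/UnlockRelation(), RelationTruncate() and PG_TRY() are the existing APIs, and the function as a whole corresponds to what lazy_truncate_heap() does today.

#include "postgres.h"
#include "catalog/storage.h"   /* RelationTruncate */
#include "storage/bufmgr.h"
#include "storage/lmgr.h"      /* LockRelation, ExclusiveLock */
#include "utils/rel.h"

static void
heap_truncate_sketch(Relation onerel)
{
    BlockNumber new_rel_pages;
    BlockNumber *barrier_blocks = NULL; /* blocks flagged in step 2 */
    int         nbarrier = 0;

    /* Step 1: ExclusiveLock blocks writers but still allows reads. */
    LockRelation(onerel, ExclusiveLock);

    /*
     * Step 2: find the truncation point.  This hypothetical variant of
     * count_nondeletable_pages() would, while scanning backwards, set the
     * dirty_barrier flag on every dirty to-be-truncated buffer and record
     * its block number, so that none of them can be written out or
     * evicted from this point on.
     */
    new_rel_pages = count_nondeletable_pages_set_barrier(onerel,
                                                         &barrier_blocks,
                                                         &nbarrier);

    PG_TRY();
    {
        /* Step 3: truncate the data files (existing API). */
        RelationTruncate(onerel, new_rel_pages);

        /*
         * Step 4: truncation succeeded -- the flagged buffers are now past
         * EOF, contain no used items and can be treated as clean: drop
         * both the dirty and the dirty_barrier flags.
         */
        clear_dirty_barrier(onerel, barrier_blocks, nbarrier,
                            true /* also clear dirty flag */ );
    }
    PG_CATCH();
    {
        /*
         * *) Error after step 2: the files may not have been truncated,
         * so clear only the dirty_barrier flags; the buffers stay dirty
         * and will be written out as usual.
         */
        clear_dirty_barrier(onerel, barrier_blocks, nbarrier,
                            false /* keep dirty flag */ );
        PG_RE_THROW();
    }
    PG_END_TRY();

    /* Step 5: release the relation lock. */
    UnlockRelation(onerel, ExclusiveLock);
}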
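And a similarly rough sketch of the reader-side rule mentioned above, i.e. that a scan which finds its target page past EOF, or finds a leftover empty past-EOF buffer, simply reports no visible tuples. page_has_visible_work() is a made-up illustration, not the real heapgetpage(); the buffer and page calls are existing APIs.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

static bool
page_has_visible_work(Relation rel, BlockNumber blkno)
{
    Buffer  buf;
    Page    page;
    bool    result;

    /* Block is beyond the (possibly just truncated) physical EOF. */
    if (blkno >= RelationGetNumberOfBlocks(rel))
        return false;

    buf = ReadBuffer(rel, blkno);
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    page = BufferGetPage(buf);

    /*
     * A leftover past-EOF buffer re-found after a later extension would
     * have no used items (or would have been claimed as new by vacuum),
     * so a new or empty page also means "no visible tuples".
     */
    result = !PageIsNew(page) && !PageIsEmpty(page);

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);

    return result;
}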
Links.
1. https://www.postgresql.org/message-id/flat/5BBC590AE8DF4ED1A170E4D48F1B53AC%40tunaPC
2. https://www.postgresql.org/message-id/flat/CAHGQGwE5UqFqSq1%3DkV3QtTUtXphTdyHA-8rAj4A%3DY%2Be4kyp3BQ%40mail.gmail.com

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company