Hi all,

Index vacuuming is one of the most time-consuming processes in lazy vacuuming, and lazy_tid_reaped() is a large part of it. The attached flame graph shows a profile of a vacuum on a table that has one index, 80 million live rows, and 20 million dead rows, where lazy_tid_reaped() accounts for about 47% of the total vacuum execution time.
lazy_tid_reaped() is essentially an existence check: for every index tuple, it checks whether the heap TID it points to exists in the set of dead tuple TIDs. The maximum amount of memory for dead tuple TIDs is limited by maintenance_work_mem, and if that limit is reached, the heap scan is suspended, index vacuum and heap vacuum are performed, and then the heap scan is resumed. Therefore, in terms of index vacuuming performance, there are two important factors: the lookup performance of TIDs in the set of dead tuples, and its memory usage. The former is obvious, whereas the latter affects the number of index vacuuming passes. In many index AMs, index vacuuming (i.e., ambulkdelete) performs a full scan of the index, so it is important for performance to avoid index vacuuming being executed more than once during a lazy vacuum.

Currently, the TIDs of dead tuples are stored in an array that is allocated in one piece at the start of lazy vacuum, and TID lookup uses bsearch() (a simplified sketch follows the list below). This has the following challenges and limitations:

1. It cannot allocate more than 1GB. There was a discussion about eliminating this limitation by using MemoryContextAllocHuge(), but there were concerns about point 2[1].
2. The whole memory space is allocated at once.
3. Lookup performance is slow, O(log N).
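For illustration, here is a minimal standalone sketch of that current approach: the dead tuple TIDs are kept in one big sorted array and each lookup is a bsearch(). This is not the actual vacuumlazy.c code (which works on ItemPointerData); the struct and function names below are simplified stand-ins.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for ItemPointerData: (block number, offset number) */
typedef struct DeadTid
{
    uint32_t    block;
    uint16_t    offset;
} DeadTid;

/* Compare two TIDs by block number first, then by offset number */
static int
tid_comparator(const void *a, const void *b)
{
    const DeadTid *lhs = (const DeadTid *) a;
    const DeadTid *rhs = (const DeadTid *) b;

    if (lhs->block != rhs->block)
        return (lhs->block < rhs->block) ? -1 : 1;
    if (lhs->offset != rhs->offset)
        return (lhs->offset < rhs->offset) ? -1 : 1;
    return 0;
}

/* Roughly what lazy_tid_reaped() does: binary search in the sorted array */
static bool
tid_is_dead(const DeadTid *tid, const DeadTid *dead_tids, size_t ndead)
{
    return bsearch(tid, dead_tids, ndead, sizeof(DeadTid),
                   tid_comparator) != NULL;
}

int
main(void)
{
    /* dead tuple TIDs are collected in block/offset order during the heap scan */
    DeadTid     dead[] = {{10, 1}, {10, 5}, {42, 3}};
    DeadTid     probe = {42, 3};

    printf("dead? %d\n", tid_is_dead(&probe, dead, 3));    /* prints 1 */
    return 0;
}

Each index tuple pays the O(log N) cost of this search, which is where the profile time goes.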
I've done some experiments in this area and would like to share the results and discuss ideas.

Possible Solutions
===============
Firstly, I've considered using existing data structures: IntegerSet (src/backend/lib/integerset.c) and TIDBitmap (src/backend/nodes/tidbitmap.c). Those address point 1, but each addresses only one of points 2 and 3. IntegerSet uses less memory thanks to simple-8b encoding, but lookup is still slow, O(log N), since it is a tree structure. On the other hand, TIDBitmap has good lookup performance, O(1), but can unnecessarily use more memory in some cases since it always allocates enough bitmap space to store all possible offsets. With 8kB blocks, the maximum number of line pointers in a heap page is 291 (cf. MaxHeapTuplesPerPage), so the bitmap is 40 bytes long and we always need 46 bytes in total per block including other meta information.

So I prototyped a new data structure dedicated to storing dead tuples during lazy vacuum, borrowing the idea from Roaring Bitmap[2]. The authors provide an implementation of Roaring Bitmap[3] (Apache 2.0 license), but I've implemented this idea from scratch because we need to integrate it with Dynamic Shared Memory/Area to support parallel vacuum and need to support ItemPointerData, a 6-byte integer in total, whereas that implementation supports only 4-byte integers. Also, when it comes to vacuum, we need neither the intersection, the union, nor the difference between sets; we only need an existence check.

The data structure is somewhat similar to TIDBitmap. It consists of a hash table and a container area; the hash table has an entry per block, and each block entry allocates its memory space, called a container, in the container area to store its offset numbers. The container area is simply an array of bytes and can be enlarged as needed. In the container area, the data representation of offset numbers varies depending on their cardinality. There are three container types: array, bitmap, and run.

For example, if there are two dead tuples at offsets 1 and 150, it uses an array container that has an array of two 2-byte integers representing 1 and 150, using 4 bytes in total. If we used a bitmap container in this case, we would need 20 bytes instead. On the other hand, if there are 20 consecutive dead tuples from offset 1 to 20, it uses a run container that has an array of pairs of 2-byte integers; the first value in each pair represents a starting offset number, whereas the second value represents the run length. Therefore, in this case, the run container uses only 4 bytes in total. Finally, if there are dead tuples at every other offset from 1 to 100, it uses a bitmap container that has an uncompressed bitmap, using 13 bytes. We need another 16 bytes per block for the hash table entry.

The lookup complexity of a bitmap container is O(1), whereas that of an array or a run container is O(N) or O(log N); but since the number of elements in those two containers should not be large, this would not be a problem.
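To make the container layout more concrete, below is a minimal, single-backend sketch of the per-block existence check over the three container types. All names (RTbmEntry, rtbm_entry_lookup, the ContainerType enum) are hypothetical illustrations of the design described above, not the actual prototype code; DSM/DSA integration, container selection, and growth of the container area are omitted.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Container types described above (names here are illustrative only) */
typedef enum
{
    CONTAINER_ARRAY,    /* 2-byte offset numbers */
    CONTAINER_BITMAP,   /* uncompressed bitmap of offsets */
    CONTAINER_RUN       /* (start offset, run length) pairs of 2-byte integers */
} ContainerType;

/*
 * Hypothetical per-block entry. In the prototype this would live in a hash
 * table keyed by block number and reference a chunk in the shared container
 * area; here a plain pointer stands in for that reference.
 */
typedef struct RTbmEntry
{
    uint32_t        blockno;
    ContainerType   type;
    uint32_t        len;        /* container size in bytes */
    const uint8_t  *container;  /* start of this block's container data */
} RTbmEntry;

/* Check whether 'off' (1-based offset number) is set in the block's container */
static bool
rtbm_entry_lookup(const RTbmEntry *entry, uint16_t off)
{
    switch (entry->type)
    {
        case CONTAINER_ARRAY:
        {
            /* scan the 2-byte offsets; only a few elements are expected */
            uint32_t    nitems = entry->len / sizeof(uint16_t);

            for (uint32_t i = 0; i < nitems; i++)
            {
                uint16_t    val;

                memcpy(&val, entry->container + i * sizeof(uint16_t), sizeof(val));
                if (val == off)
                    return true;
            }
            return false;
        }
        case CONTAINER_BITMAP:
        {
            /* O(1) bit test; bit 0 corresponds to offset number 1 */
            uint32_t    bit = off - 1;

            if (bit / 8 >= entry->len)
                return false;
            return (entry->container[bit / 8] & (1 << (bit % 8))) != 0;
        }
        case CONTAINER_RUN:
        {
            /* scan the (start, length) pairs of 2-byte integers */
            uint32_t    npairs = entry->len / (2 * sizeof(uint16_t));

            for (uint32_t i = 0; i < npairs; i++)
            {
                uint16_t    start, length;

                memcpy(&start, entry->container + i * 4, sizeof(start));
                memcpy(&length, entry->container + i * 4 + 2, sizeof(length));
                if (off >= start && off < start + length)
                    return true;
            }
            return false;
        }
    }
    return false;
}

int
main(void)
{
    /* run container for 20 consecutive dead tuples at offsets 1..20: one pair */
    uint8_t     run[] = {1, 0, 20, 0};  /* little-endian (start=1, length=20) */
    RTbmEntry   entry = {.blockno = 0, .type = CONTAINER_RUN,
                         .len = sizeof(run), .container = run};

    printf("%d %d\n", rtbm_entry_lookup(&entry, 20), rtbm_entry_lookup(&entry, 21));
    return 0;
}

The bitmap branch is the O(1) case mentioned above, while the array and run branches are the small scans whose element counts are expected to stay low.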
Evaluation
========
Before implementing this idea and integrating it with the lazy vacuum code, I've implemented a benchmark tool dedicated to evaluating lazy_tid_reaped() performance[4]. It has several functions: generating TIDs for both index tuples and dead tuples, loading the dead tuples into the data structure, and simulating lazy_tid_reaped() using those virtual heap tuples and dead tuples. The code lacks many features such as iteration and DSM/DSA support, but it makes testing of the data structures easier.

FYI, I've confirmed the validity of this tool: when I ran a vacuum on a 3GB table, index vacuuming took 12.3 sec and lazy_tid_reaped() took approximately 8.5 sec. Simulating a similar situation with the tool, the lookup benchmark with the array data structure took approximately 8.0 sec. Given that the tool doesn't simulate the cost of function calls, it seems to simulate the real workload reasonably well.

I've evaluated the lookup performance and memory footprint of four types of data structure: array, integerset (intset), tidbitmap (tbm), and roaring tidbitmap (rtbm), while changing the distribution of dead tuples in blocks. Since tbm doesn't have a function for existence checks, I've added one, and I allocated enough memory to make sure that tbm never becomes lossy during the evaluation. In all test cases, I simulated a table that has 1,000,000 blocks where every block has at least one dead tuple. The benchmark scenario is that for each virtual heap tuple we check whether its TID exists in the dead tuple storage. Here are the results, with execution time in milliseconds and memory usage in bytes:

* Test-case 1 (10 dead tuples at intervals of 20 offsets)

An array container is selected in this test case, using 20 bytes for each block.

          Execution Time    Memory Usage
array          14,140.91      60,008,248
intset          9,350.08      50,339,840
tbm             1,299.62     100,671,544
rtbm            1,892.52      58,744,944

* Test-case 2 (10 consecutive dead tuples from offset 1)

A bitmap container is selected in this test case, using 2 bytes for each block.

          Execution Time    Memory Usage
array           1,056.60      60,008,248
intset            650.85      50,339,840
tbm               194.61     100,671,544
rtbm              154.57      27,287,664

* Test-case 3 (2 dead tuples at offsets 1 and 100)

An array container is selected in this test case, using 4 bytes for each block. Since the 'array' data structure (not the array container of rtbm) uses only 12 bytes for each block, and rtbm additionally needs a hash table entry per block, the 'array' data structure uses less memory here.

          Execution Time    Memory Usage
array           6,054.22      12,008,248
intset          4,203.41      16,785,408
tbm               759.17     100,671,544
rtbm              750.08      29,384,816

* Test-case 4 (100 consecutive dead tuples from offset 1)

A run container is selected in this test case, using 4 bytes for each block.

          Execution Time    Memory Usage
array           8,883.03     600,008,248
intset          7,358.23     100,671,488
tbm               758.81     100,671,544
rtbm              764.33      29,384,816

Overall, 'rtbm' has much better lookup performance and good memory usage, especially if there are relatively many dead tuples. However, in some cases, 'intset' and 'array' have better memory usage.

Feedback is very welcome. Thank you for reading the email through to the end.

Regards,

[1] https://www.postgresql.org/message-id/CAGTBQpbDCaR6vv9%3DscXzuT8fSbckf%3Da3NgZdWFWZbdVugVht6Q%40mail.gmail.com
[2] http://roaringbitmap.org/
[3] https://github.com/RoaringBitmap/CRoaring
[4] https://github.com/MasahikoSawada/pgtools/tree/master/bdbench

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/