Hi,

On Wed, Mar 22, 2023 at 12:15 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> Hi all,
>
> I started this new thread from another thread[1] where we're discussing
> a new storage for TIDs, TidStore, since we found a difficulty with the
> memory usage limit for TidStores on DSA.
>
> TidStore is a new data structure to efficiently store TIDs, backed by a
> radix tree. In the patch series proposed on that thread, in addition to
> the radix tree and TidStore, there is another patch for lazy (parallel)
> vacuum to replace the array of dead tuple TIDs with a TidStore. To
> support parallel vacuum, the radix tree (and TidStore) can be created
> in local memory as well as on DSA. It also has memory usage limit
> functionality; we can pass a memory limit (e.g., maintenance_work_mem)
> to the TidStoreCreate() function. Once the total DSA segment size
> (area->control->total_segment_size) exceeds the limit, TidStoreIsFull()
> returns true. Lazy vacuum can continue scanning heap blocks to collect
> dead tuple TIDs until TidStoreIsFull() returns true. Currently lazy
> vacuum is the sole user of TidStore, but it could also be used by other
> code such as tidbitmap.c, where memory usage would be limited by
> work_mem.
>
> During development, we found that DSA memory growth is unpredictable,
> leading to inefficient memory limitation.
>
> DSA is built on top of DSM segments: it manages a set of DSM segments,
> adding new segments as required and detaching them when they are no
> longer needed. The first DSA segment is 1MB in size, and each new
> segment is at least big enough to follow a geometric series that
> approximately doubles the total storage each time a new segment is
> created. Because of this, it's not efficient to simply compare the
> memory limit to the total segment size. For example, if
> maintenance_work_mem is 512MB, the total segment size grows like:
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) = 510MB -> less than the
> limit, continue heap scan.
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) + 256 = 766MB -> stop (exceeds
> the limit by 254MB).
>
> One might think we could use dsa_set_size_limit(), but we cannot; lazy
> vacuum would end up with an error. If we set DSA_ALLOC_NO_OOM, we might
> end up stopping the insertion halfway.
>
> Besides allocating memory excessively, since the initial DSM segment
> size is fixed at 1MB, the memory usage of a shared TidStore starts from
> 1MB+. This is higher than the minimum values of both work_mem and
> maintenance_work_mem, 64kB and 1MB respectively. Increasing the minimum
> m_w_m to 2MB might be acceptable, but not for work_mem.
>
> Researching possible solutions, we found that aset.c has a similar
> characteristic: it allocates an 8kB block (by default) upon the first
> allocation in a context and doubles that size for each successive block
> request, but it lets the caller specify both the initial block size and
> the maximum block size. This suggested the idea of specifying both for
> DSA as well, with both values calculated from m_w_m. I've attached the
> patch for this idea. The changes to dsa.c are straightforward since
> dsa.c already uses the macros DSA_INITIAL_SEGMENT_SIZE and
> DSA_MAX_SEGMENT_SIZE; I just made these values configurable.
>
> FYI, with this patch we can create a DSA in parallel_vacuum_init() with
> initial and maximum block sizes as follows:
>
> initial block size = min(m_w_m / 4, 1MB)
> max block size = max(m_w_m / 8, 8MB)
>
> In most cases, we can start with a 1MB initial segment, the same as
> before.
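
To make the proposed sizing concrete, here is a minimal sketch (not the
attached patch) of how parallel_vacuum_init() could derive the two sizes
from maintenance_work_mem. The dsa_create_ext() call and the helper name
are hypothetical, standing in for whatever interface the patch uses to
pass the configurable initial and maximum segment sizes:

    /*
     * Illustrative sketch only: derive DSA segment-size bounds from
     * maintenance_work_mem (a GUC measured in kilobytes) following the
     * formulas above.  dsa_create_ext() is a hypothetical variant of
     * dsa_create() that accepts the two configurable sizes.
     */
    #include "postgres.h"
    #include "miscadmin.h"
    #include "utils/dsa.h"

    static dsa_area *
    create_dead_items_dsa(int tranche_id)
    {
        Size    limit = (Size) maintenance_work_mem * 1024;
        Size    init_seg_size;
        Size    max_seg_size;

        /* initial block size = min(m_w_m / 4, 1MB) */
        init_seg_size = Min(limit / 4, 1024 * 1024);

        /* max block size = max(m_w_m / 8, 8MB) */
        max_seg_size = Max(limit / 8, 8 * 1024 * 1024);

        /* create the area with the derived segment-size bounds */
        return dsa_create_ext(tranche_id, init_seg_size, max_seg_size);
    }
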
> For larger memory, the heap scan stops after DSA has allocated about
> 1.25 times m_w_m at most. For example, if m_w_m = 512MB, the initial
> and maximum segment sizes are 1MB and 64MB respectively, and DSA
> allocates segments as follows until heap scanning stops:
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 4) = 510MB -> less than the
> limit, continue heap scan.
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 5) = 574MB -> stop (62MB
> allocated beyond the limit).
>
> It also works with smaller memory: if the limit is 1MB, we start with a
> 256kB initial segment, and heap scanning stops after DSA has allocated
> 1.5MB (= 256kB + 256kB + 512kB + 512kB).
>
> There is room for better formulas for the initial and maximum block
> sizes, but making both values configurable is a promising idea, and the
> behavior analogous to aset.c should be good for readability and
> maintainability. There is another test result where I used this idea on
> top of a radix tree[2].
>
> We need to keep an eye on the total number of allocated DSA segments,
> as the total number of DSM segments available on the system is
> fixed[3]. But it does not seem problematic even with this patch, since
> we allocate only a few additional segments (17 vs. 19 segments in the
> examples above). There was also no big difference in performance[2].
>
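
As a sanity check on the arithmetic above, here is a small stand-alone
sketch (plain C, not PostgreSQL code) that replays the allocation
pattern: each segment size is used twice before doubling, as the
2 * (...) terms above suggest, capped at the maximum segment size, and
scanning stops once the running total exceeds the limit:

    #include <stdio.h>

    #define KB(x) ((x) * 1024UL)
    #define MB(x) (KB(x) * 1024UL)

    static void
    simulate(unsigned long limit, unsigned long init_seg, unsigned long max_seg)
    {
        unsigned long seg_size = init_seg;
        unsigned long total = 0;
        int     nsegs = 0;

        while (total <= limit)
        {
            total += seg_size;
            nsegs++;

            /* double the segment size after every two segments, up to the cap */
            if (nsegs % 2 == 0 && seg_size < max_seg)
                seg_size *= 2;
            if (seg_size > max_seg)
                seg_size = max_seg;
        }

        printf("limit %.0f MB: stopped after %d segments, %.2f MB allocated\n",
               (double) limit / MB(1), nsegs, (double) total / MB(1));
    }

    int
    main(void)
    {
        simulate(MB(512), MB(1), MB(64));   /* m_w_m = 512MB case */
        simulate(MB(1), KB(256), MB(8));    /* limit = 1MB case */
        return 0;
    }

With a 512MB limit it reports 19 segments and 574MB allocated, and with
a 1MB limit it reports 4 segments and 1.5MB, matching the numbers above.
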
The last time I posted this, the timing was not good since it was close
to the feature freeze, and the email was very long. The tidstore and
radix tree development is still in progress[1], and this change is still
necessary. I'd like to summarize the problem and proposal:

* Both the initial DSA segment size and the maximum DSA segment size are
  fixed values: 1MB and 1TB respectively.
* The total size of allocated DSA segments follows a geometric series.
* The patch makes both the initial and maximum DSA segment sizes
  configurable, which helps to:
  * minimize wasted memory when the total DSA segment size reaches the
    limit set by the caller.
  * create a data structure with a small amount of memory, for example
    64kB, the minimum value of work_mem.

According to the recent discussion, it might be sufficient to make only
the maximum DSA segment size configurable. I'll register this item for
the next commit fest.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoDjTbp2SHn4hRzAxWNeYArn4Yd4UdH9XRoNzdrYWNgExw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com