On Tue, Apr 11, 2017 at 4:17 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Apr 11, 2017 at 2:59 PM, Claudio Freire <klaussfre...@gmail.com>
> wrote:
>> On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
>>> maintenance_work_mem
>>>
>>> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
>>> enough so we'll allocate another 8GB, for a total of 16256MB, but more
>>> than three-quarters of that last allocation ends up being wasted.
>>> I've been told on this list before that doubling is the one true way
>>> of increasing the size of an allocated chunk of memory, but I'm still
>>> a bit unconvinced.
>>
>> There you're wrong. The allocation is capped to 1GB, so wastage has an
>> upper bound of 1GB.
>
> Ah, OK. Sorry, didn't really look at the code. I stand corrected,
> but then it seems a bit strange to me that the largest and smallest
> allocations are only 8x different. I still don't really understand
> what that buys us.
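To put numbers on the capped growth first, here's a toy illustration of
what happens with your 9216MB example (this is just the growth rule, not
the patch code; sizes are hard-coded for the example):

/*
 * Toy illustration only (not the patch code): segments start at 128MB,
 * double each time, and are capped at 1GB, so the last segment can waste
 * at most 1GB.
 */
#include <stdio.h>

int
main(void)
{
	long		needed = 9216;	/* MB of dead-tuple storage required */
	long		seg = 128;		/* first segment size, MB */
	long		cap = 1024;		/* per-segment cap, MB */
	long		total = 0;

	while (total < needed)
	{
		total += seg;
		printf("allocate %4ldMB, running total %5ldMB\n", seg, total);
		if (seg < cap)
			seg *= 2;			/* exponential growth, until the cap */
	}
	printf("wasted in last segment: %ldMB\n", total - needed);
	return 0;
}

With the cap, the sequence is 128+256+512+1024 and then 1GB segments
until the 9216MB is covered, topping out at 10112MB total: 896MB wasted,
never more than one segment's worth.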
Basically, attacking the problem (that, I think, you mentioned) of very
small systems in which overallocation for small vacuums was an issue. The
"slow start" behavior of starting with smaller segments tries to improve
the situation for small vacuums, not big ones.

By starting at 128M and growing up to 1GB, overallocation is bounded to
the range 128M-1GB and is proportional to the number of dead tuples, not
to table size as it was before. Starting at 128M helps the initial
segment search, but I could readily go for starting at 64M; I don't think
it would make a huge difference. Removing the exponential growth,
however, would.

As the patch stands, small systems (say 32-bit systems) without
overcommit and with slowly-changing data can now set a high m_w_m without
running into overallocation issues from autovacuum reserving too much
virtual space, since it will reserve memory only in proportion to the
number of dead tuples. Previously, it would reserve all of m_w_m whether
it was needed or not, with the only exception being really small tables,
so m_w_m=1GB was unworkable in those cases. Now it should be fine.

> What would we lose if we just made 'em all 128MB?

TBH, not that much. We'd need 8x the compares to find the segment, which
forces a switch to binary search over the segments, and that's less
cache-friendly. So it's more complex code and less cache locality, and
I'm just not sure what the benefit is given the current limits.

The only aim of this multiarray approach was making *virtual address
space reservations* proportional to the amount of memory actually needed,
as opposed to the configured limits. It doesn't need to be a tight fit,
because calling palloc on its own doesn't actually use that memory, at
least not for big allocations like these: the OS will not map the memory
pages until they're first touched. That's true on most modern systems,
and many ancient ones too.

In essence, the patch as proposed doesn't *need* a binary search, because
the segment list can only grow to 15 segments at its biggest, and that's
small enough that linear search will outperform (or at least match)
binary search. Reducing the initial segment size wouldn't change that.
Lifting the 12GB limit, or reducing the maximum segment size (from 1GB to
128MB, for example), would. (There's a toy sketch of the lookup in the PS
below.)

I'd be more in favor of lifting the 12GB limit than of reducing the
maximum segment size, for the reasons above. Raising the 12GB limit has
concrete and readily apparent benefits, whereas using bigger (or smaller)
segments is far more debatable. Yes, lifting it will need a binary
search, but I was hoping that could be a second (or third) patch, to keep
things simple and the benefits measurable.

Also, the plan discussed in this very long thread was to eventually turn
segments into bitmaps when dead tuple density is high enough. That
benefits considerably from big segments, since lookup within a bitmap is
O(1): the bigger the segments, the faster the overall lookup, because the
search over the segment list becomes the dominant cost.

So... what shall we do? At this point, I've given all my arguments for
the current design. If the more senior developers don't agree, I'll be
happy to try your way.
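PS: since the 15-segment argument is easier to see with code in front of
you, here's a minimal sketch of the lookup (made-up names, not the actual
patch structures): a linear scan over the segments' boundary TIDs picks
the segment, then a binary search runs inside it. With at most ~15
segments, the linear part is just a handful of cache-friendly compares.

#include "postgres.h"
#include "storage/itemptr.h"

/* made-up structure, for illustration only */
typedef struct DeadTupleSegment
{
	ItemPointerData last_dead;		/* largest TID stored in this segment */
	int			num_dead;			/* TIDs currently stored */
	ItemPointerData *dead_tuples;	/* sorted array of dead TIDs */
} DeadTupleSegment;

static bool
tid_is_dead(DeadTupleSegment *segs, int nsegs, ItemPointer tid)
{
	int			i;

	/*
	 * Linear scan: find the first segment whose last TID is >= tid.
	 * With at most ~15 segments this beats (or matches) binary search.
	 */
	for (i = 0; i < nsegs; i++)
	{
		if (ItemPointerCompare(tid, &segs[i].last_dead) <= 0)
			break;
	}
	if (i == nsegs)
		return false;			/* past the last segment */

	/* binary search within the chosen (sorted) segment */
	{
		DeadTupleSegment *seg = &segs[i];
		int			lo = 0;
		int			hi = seg->num_dead - 1;

		while (lo <= hi)
		{
			int			mid = lo + (hi - lo) / 2;
			int			cmp = ItemPointerCompare(tid, &seg->dead_tuples[mid]);

			if (cmp == 0)
				return true;
			if (cmp < 0)
				hi = mid - 1;
			else
				lo = mid + 1;
		}
	}
	return false;
}

If the 12GB limit were lifted, only that small linear loop would need to
become a binary search; the per-segment part stays the same.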