On Nov 12, 2012, at 7:37 AM, Shaun Thomas wrote:

> Hey everyone,
>
> We recently got bit by this, and I wanted to make sure it was known to the
> general community.
>
> In new(er) Linux kernels, including late versions of the 2.6 tree, XFS has
> introduced dynamic speculative preallocation. What does this do? It was added
> to prevent filesystem fragmentation by preallocating a large chunk of memory
> to files so extensions to those files can go on the same allocation. The
> "dynamic" part just means it adjusts the size of this preallocation based on
> internal heuristics.
>
> Unfortunately, they also changed the logic in how this extra space is
> tracked. At least in previous kernels, this space would eventually be
> deallocated. Now, it survives as long as there are any in-memory references
> to a file, such as in a busy PG database. The filesystem itself sees this
> space as "used" and will be reported as such with tools such as df or du.
>
> How do you check if this is affecting you?
>
> du -sm --apparent-size /your/pg/dir; du -sm /your/pg/dir
>
> If you're using XFS, and there is a large difference in these numbers, you've
> been bitten by the speculative preallocation system.
>
> But where does it go while allocated? Why, to your OS system cache, of
> course. Systems with several GB of RAM may experience extreme phantom
> database "bloat", because of the dynamic aspect of the preallocation system.
> So there are actually two problems:
>
> 1. Data files are reported as larger than their actual size and have extra
> space around "just in case". Since PG has a maximum file size of 1GB, this is
> basically pointless.
> 2. Blocks that could be used for inode caching to improve query performance
> are reserved instead for caching empty segments for XFS.
>
> The first can theoretically exhaust the free space on a file system. We were
> seeing 45GB(!) of bloat on one of our databases caused directly by this. The
> second, due to the new and improved PG planner, can result in terrible query
> performance and high system load since the OS cache does not match
> assumptions.
>
> So how is this fixed? Luckily, the dynamic allocator can be disabled by
> choosing an allocation size. Add "allocsize" to your mount options. We used a
> size of 1m (for 1 megabyte) to retain some of the defragmentation benefits,
> while still blocking the dynamic allocator. The minimum size is 64k, so some
> experimentation is probably warranted.
>
> This mount option *is not compatible* with the "remount" mount option, so
> you'll need to completely shut everything down and unmount the filesystem to
> apply.
>
> We spent days trying to track down the reason our systems were reporting a
> load of 20-30 after a recent OS upgrade. I figured it was only fair to share
> this to save others the same effort.
>
> Good luck!
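
For anyone who wants per-file numbers rather than the two du totals above, here is a rough sketch of the same comparison in Python. The function names are made up for illustration, and it assumes st_blocks is counted in 512-byte units, which holds on Linux but is not guaranteed elsewhere. Unlike du, it only looks at regular files and does not de-duplicate hard links, so the totals may differ slightly from what du prints.

```python
import os
import stat

# Assumption: st_blocks is reported in 512-byte units (true on Linux).
BLOCK_UNIT = 512


def file_overheads(root):
    """Return (path, apparent_bytes, allocated_bytes) for each regular file
    under root, sorted so the largest allocated-minus-apparent gap is first."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file disappeared between listing and stat
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, and other non-regular files
            rows.append((path, st.st_size, st.st_blocks * BLOCK_UNIT))
    rows.sort(key=lambda r: r[2] - r[1], reverse=True)
    return rows


def totals(rows):
    """Sum apparent and allocated bytes over the rows from file_overheads()."""
    return sum(r[1] for r in rows), sum(r[2] for r in rows)
```

Calling file_overheads("/your/pg/dir") and printing the first few rows should show which relation files carry the most extra allocation. Small positive gaps are just block rounding, and sparse files show up with allocated below apparent, so only large positive gaps point at preallocation.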

Oh hey, I've been wondering for a while why our master dbs seem to be using so much more space than their slaves. This appears to be the reason. Thanks for the work in tracking it down!

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general