On Nov 12, 2012, at 7:37 AM, Shaun Thomas wrote:

> Hey everyone,
> 
> We recently got bit by this, and I wanted to make sure it was known to the 
> general community.
> 
> In new(er) Linux kernels, including late versions of the 2.6 tree, XFS has 
> introduced dynamic speculative preallocation. What does this do? It was added 
> to prevent filesystem fragmentation by preallocating a large chunk of space 
> for files so extensions to those files can go in the same allocation. The 
> "dynamic" part just means it adjusts the size of this preallocation based on 
> internal heuristics.
> 
> Unfortunately, they also changed the logic in how this extra space is 
> tracked. At least in previous kernels, this space would eventually be 
> deallocated. Now, it survives as long as there are any in-memory references 
> to a file, such as in a busy PG database. The filesystem itself sees this 
> space as "used", and tools such as df or du report it as such.
> 
> How do you check if this is affecting you?
> 
> du -sm --apparent-size /your/pg/dir; du -sm /your/pg/dir
> 
> If you're using XFS, and there is a large difference in these numbers, you've 
> been bitten by the speculative preallocation system.
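
In case it helps anyone, here is a rough Python sketch of the same comparison. It is not from Shaun's post: the function name size_gap is made up for illustration, and it assumes Linux, where st_blocks is counted in 512-byte units. It only sums regular files, so the totals will not match du exactly (du also counts directories and handles hard links differently).

```python
import os
import stat


def size_gap(root):
    """Return (apparent_bytes, allocated_bytes) for regular files under root.

    apparent_bytes sums st_size (what `du --apparent-size` looks at);
    allocated_bytes sums st_blocks * 512 (the blocks actually reserved,
    assuming Linux's 512-byte st_blocks unit).
    """
    apparent = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file disappeared while we were walking
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, and so on
            apparent += st.st_size
            allocated += st.st_blocks * 512
    return apparent, allocated
```

Calling size_gap("/your/pg/dir") and comparing the two numbers should show roughly the same gap as the two du commands above.
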
> 
> But where does it go while allocated? Why, to your OS cache, of course. 
> Systems with several GB of RAM may experience extreme phantom database 
> "bloat" because of the dynamic aspect of the preallocation system. So there 
> are actually two problems:
> 
> 1. Data files are reported as larger than their actual size and have extra 
> space around "just in case". Since PG has a maximum file size of 1GB, this is 
> basically pointless.
> 2. Blocks that could be used for inode caching to improve query performance 
> are reserved instead for caching empty segments for XFS.
> 
> The first can theoretically exhaust the free space on a file system. We were 
> seeing 45GB(!) of bloat on one of our databases caused directly by this. The 
> second, due to the new and improved PG planner, can result in terrible query 
> performance and high system load since the OS cache does not match 
> assumptions.
> 
> So how is this fixed? Luckily, the dynamic allocator can be disabled by 
> choosing an allocation size. Add "allocsize" to your mount options. We used a 
> size of 1m (for 1 megabyte) to retain some of the defragmentation benefits, 
> while still blocking the dynamic allocator. The minimum size is 64k, so some 
> experimentation is probably warranted.
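
A small Python sketch for checking which XFS mounts already have a fixed allocsize, in case it is useful. This is my own illustration, not part of the original post: the function name xfs_allocsize is hypothetical, and it assumes the kernel lists allocsize in /proc/mounts when the option is set, which is worth verifying on your own kernel.

```python
def xfs_allocsize(mounts_text):
    """Map each XFS mount point to its allocsize value, or None if unset.

    mounts_text is the content of /proc/mounts (or /etc/mtab): one mount
    per line, with fields device, mount point, fs type, options, ...
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "xfs":
            continue  # malformed line or not an XFS filesystem
        mount_point = fields[1]
        value = None
        for option in fields[3].split(","):
            if option.startswith("allocsize="):
                value = option.split("=", 1)[1]
        result[mount_point] = value
    return result
```

Something like xfs_allocsize(open("/proc/mounts").read()) would then return a dict where a value of None means that mount is still using the dynamic allocator.
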
> 
> This mount option *is not compatible* with the "remount" mount option, so 
> you'll need to completely shut everything down and unmount the filesystem to 
> apply.
> 
> We spent days trying to track down the reason our systems were reporting a 
> load of 20-30 after a recent OS upgrade. I figured it was only fair to share 
> this to save others the same effort.
> 
> Good luck!


Oh hey, I've been wondering for a while why our master dbs seem to be using so 
much more space than their slaves. This appears to be the reason. Thanks for 
the work in tracking it down!



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
