On Thu, Apr 23, 2015 at 01:50:28PM +0200, Alberto Garcia wrote:
> On Thu 23 Apr 2015 12:15:04 PM CEST, Stefan Hajnoczi wrote:
> 
> >> For a cache size of 128MB, the PSS is actually ~10MB larger without
> >> the patch, which seems to come from posix_memalign().
> >
> > Do you mean RSS or are you using a tool that reports a "PSS" number
> > that I don't know about?
> >
> > We should understand what is going on instead of moving the code
> > around to hide/delay the problem.
> 
> Both RSS and PSS ("proportional set size", also reported by the kernel).
> 
> I'm not an expert in memory allocators, but I measured the overhead like
> this:
> 
> An L2 cache of 128MB implies a refcount cache of 32MB, in total 160MB.
> With a default cluster size of 64k, that's 2560 cache entries.
> 
> So I wrote a test case that allocates 2560 blocks of 64k each using
> posix_memalign and mmap, and here's how their /proc/<pid>/smaps compare:
> 
> -Size:           165184 kB
> -Rss:             10244 kB
> -Pss:             10244 kB
> +Size:           161856 kB
> +Rss:                 0 kB
> +Pss:                 0 kB
>  Shared_Clean:        0 kB
>  Shared_Dirty:        0 kB
>  Private_Clean:       0 kB
> -Private_Dirty:   10244 kB
> -Referenced:      10244 kB
> -Anonymous:       10244 kB
> +Private_Dirty:       0 kB
> +Referenced:          0 kB
> +Anonymous:           0 kB
>  AnonHugePages:       0 kB
>  Swap:                0 kB
>  KernelPageSize:      4 kB
> 
> Those are the 10MB I saw. For the record I also tried with malloc() and
> the results are similar to those of posix_memalign().
The posix_memalign() call wastes memory.  I compared:

  posix_memalign(&memptr, 65536, 2560 * 65536);
  memset(memptr, 0, 2560 * 65536);

with:

  for (i = 0; i < 2560; i++) {
      posix_memalign(&memptr, 65536, 65536);
      memset(memptr, 0, 65536);
  }

Here are the results:

-Size:           163920 kB
-Rss:            163860 kB
-Pss:            163860 kB
+Size:           337800 kB
+Rss:            183620 kB
+Pss:            183620 kB

Note the memset() simulates a fully occupied cache.  The 19 MB RSS
difference between the two seems wasteful.  The large "Size" difference
hints that the mmap pattern is very different when posix_memalign() is
called multiple times.

We could avoid the 19 MB overhead by switching to a single allocation.

What's more, dropping the memset() to simulate no cache entry usage
(like your example) gives us a grand total of 20 kB RSS.  There is no
point in delaying allocations if we do a single big upfront allocation.

I'd prefer a patch that replaces the small allocations with a single
big one.  That's a win in both the empty and the full cache cases.
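For reference, here is a minimal standalone sketch of the comparison
above (my own reconstruction, not the exact test program used in this
thread): it does either one 160 MB posix_memalign() allocation or 2560
separate 64 KiB ones, optionally touches the memory with memset(), and
then sleeps so /proc/<pid>/smaps can be inspected.

  /* Rough reconstruction of the allocation comparison, not the
   * original test program.  Build with: gcc -O2 -o memtest memtest.c
   * Usage: ./memtest {single|many} [touch]
   * Then inspect /proc/<pid>/smaps or "grep VmRSS /proc/<pid>/status".
   */
  #define _POSIX_C_SOURCE 200112L
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  
  #define ENTRY_SIZE   65536   /* one 64 KiB cluster */
  #define NUM_ENTRIES  2560    /* 128 MB L2 cache + 32 MB refcount cache */
  
  int main(int argc, char **argv)
  {
      int touch = argc > 2 && !strcmp(argv[2], "touch");
      void *p;
      int i;
  
      if (argc < 2) {
          fprintf(stderr, "usage: %s {single|many} [touch]\n", argv[0]);
          return 1;
      }
  
      if (!strcmp(argv[1], "single")) {
          /* One big upfront allocation */
          if (posix_memalign(&p, ENTRY_SIZE,
                             (size_t)NUM_ENTRIES * ENTRY_SIZE)) {
              fprintf(stderr, "posix_memalign failed\n");
              return 1;
          }
          if (touch) {
              memset(p, 0, (size_t)NUM_ENTRIES * ENTRY_SIZE);
          }
      } else {
          /* 2560 separate 64 KiB allocations */
          for (i = 0; i < NUM_ENTRIES; i++) {
              if (posix_memalign(&p, ENTRY_SIZE, ENTRY_SIZE)) {
                  fprintf(stderr, "posix_memalign failed\n");
                  return 1;
              }
              if (touch) {
                  memset(p, 0, ENTRY_SIZE);
              }
          }
      }
  
      printf("pid %d ready, inspect /proc/%d/smaps\n",
             (int)getpid(), (int)getpid());
      pause();  /* keep the mappings alive for inspection */
      return 0;
  }

Running "./memtest many touch" versus "./memtest single touch" should
show a gap along the lines of the numbers above; leaving off "touch"
illustrates that the single big allocation stays essentially unmapped
until the cache entries are actually used.

Stefan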