On 20.07.21 16:45, Daniel P. Berrangé wrote:
> On Wed, Jul 14, 2021 at 01:23:03PM +0200, David Hildenbrand wrote:
>> #1 adds support for MADV_POPULATE_WRITE, #2 cleans up the code to avoid
>> global variables and prepare for concurrency and #3 makes os_mem_prealloc()
>> safe to be called from multiple threads concurrently.
>> Details regarding MADV_POPULATE_WRITE can be found in the introducing upstream
>> Linux commit 4ca9b3859dac ("mm/madvise: introduce
>> MADV_POPULATE_(READ|WRITE) to prefault page tables") and in the latest man
>> page patch [1].
> Looking at that commit message, I see your caveat about POPULATE_WRITE
> used together with shared file mappings, causing an undesirable glut
> of dirty pages that need to be flushed back to the underlying storage.
> Is this something we need to be concerned with for the hostmem-file.c
> implementation? While it is mostly used to point to files on tmpfs
> or hugetlbfs, I think users do sometimes point it to a plain file
> on a normal filesystem. So will we need to optimize to use the
> fallocate+POPULATE_READ combination at some point?
In the future, it might make sense to use only fallocate() when it comes
to shared file mappings.
AFAIKS os_mem_prealloc() currently serves the following purposes:
1) Preallocate anonymous memory or backend storage (file, hugetlbfs, ...)
2) Apply mbind() policy, preallocating it from the right node when
applicable.
3) Prefault page tables
For shared mappings, it's a little more difficult, though: mbind() does
not seem to work on shared mappings (which makes sense logically to
some degree, but I don't think QEMU users are aware of that). The
mbind(2) man page says: "The specified policy will be ignored for any MAP_SHARED
mappings in the specified memory range. Rather the pages will be
allocated according to the memory policy of the thread that caused the
page to be allocated. Again, this may not be the thread that called
mbind()."
So 2) does not apply. A simple fallocate() can get 1) done more efficiently.
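To illustrate 1), here is a minimal sketch of preallocating the backing storage with fallocate() alone. prealloc_backing_file() is a made-up helper name for this mail, not existing QEMU code, and the sketch assumes a Linux filesystem that supports fallocate():

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not QEMU code): reserve backing storage for the
 * first 'len' bytes of 'fd' without writing any data through a mapping.
 * Returns 0 on success, -errno on failure.
 */
static int prealloc_backing_file(int fd, off_t len)
{
    int ret;

    do {
        /* mode 0: allocate blocks and extend the file size if needed */
        ret = fallocate(fd, 0, 0, len);
    } while (ret < 0 && errno == EINTR);

    return ret < 0 ? -errno : 0;
}
```

A caller would have to handle -EOPNOTSUPP on filesystems without fallocate() support, for example by falling back to touching the pages.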
So whether we want to use MADV_POPULATE_READ depends entirely on whether we
want 3). It can make sense to prefault page tables for RT workloads;
however, there is usually nothing stopping the OS from dropping the page
cache and requiring a refault later, except with mlock().
So whether we want fallocate() or fallocate()+MADV_POPULATE_READ for
shared file mappings really depends on the use case, and on the system
setup. If the system won't immediately free up the page cache and undo
what MADV_POPULATE_READ did, it might make sense to use it.
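The fallocate()+MADV_POPULATE_READ combination could look roughly like the sketch below. prealloc_shared_mapping() and its "prefault" parameter are hypothetical names for illustration only; the fallback value 22 for MADV_POPULATE_READ is the asm-generic uapi value from Linux 5.14 and is only used when the libc headers do not define it yet:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22   /* asm-generic value, Linux >= 5.14 */
#endif

/*
 * Hypothetical helper (not QEMU code): preallocate a shared file mapping.
 * 'addr'/'len' describe a page-aligned MAP_SHARED mapping of 'fd' that
 * starts at file offset 0. Returns 0 on success, -errno on failure.
 */
static int prealloc_shared_mapping(int fd, void *addr, size_t len,
                                   bool prefault)
{
    /* 1) Allocate backing storage without touching the mapping. */
    if (fallocate(fd, 0, 0, (off_t)len) < 0) {
        return -errno;
    }

    if (!prefault) {
        return 0;
    }

    /*
     * 3) Prefault page tables readable, so no pages get dirtied.
     * EINVAL most likely means the kernel lacks MADV_POPULATE_READ;
     * the storage is allocated already, so treat this as best effort.
     */
    if (madvise(addr, len, MADV_POPULATE_READ) < 0 && errno != EINVAL) {
        return -errno;
    }
    return 0;
}
```

Without an additional mlock() on the range, the kernel remains free to drop the populated pages again, so "prefault" only pays off on setups where that is unlikely to happen right away.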
Long story short: it's complicated :)
--
Thanks,
David / dhildenb