Hey Tony,

On Wed, 2021-03-31 at 19:43 -0700, Tony Ambardar wrote:
> You're right that text segments are fairly small and shared; memory usage
> was dominated by storage for blocklists read from file. This makes the
> problem more general than just tiny systems, since people tend to size
> their blocklists proportional to system memory size.

I wouldn't say this. Users also try to squeeze in files that are too large
when they do not have enough memory for them...

On Wed, 2021-03-31 at 19:43 -0700, Tony Ambardar wrote:
> You're also right that actual memory footprint increases only minimally
> with each fork() thanks to copy-on-write; I'm certain these OOM systems
> aren't really exhausting memory. But I do think there's confusion around
> memory usage optimizations like COW vs. memory accounting used for OOM.

OOM is just severely broken IMO. As a concept. Linux should likely not
allow overcommitment at all; there is simply no way for software to
account for memory becoming unavailable after it was successfully
allocated some time ago.

On Wed, 2021-03-31 at 19:43 -0700, Tony Ambardar wrote:
> I recall looking at dnsmasq process statistics on OOM invocation, and
> noticed their VM set sizes were usually close to total system memory,
> i.e. COW wasn't relevant. And from a dnsmasq proc memory map, the large
> segment storing the blocklist was marked read-write. I suspect that
> despite COW, since that memory is *potentially* writable it's being
> accounted for at fork() time.

The fork technically needs to allocate as much memory as the program is
currently using, but /proc/[pid]/maps won't tell you whether the memory is
copy-on-write or not. It is for sure read-write as, otherwise, the fork
would be sent SIGSEGV when it writes to it. Instead, when trying to write
to a copy-on-write page, you will trigger a page fault, the page will be
duplicated and you can continue happily as if nothing had happened. Also,
the "p" (private) flag doesn't help much here because it only
distinguishes the mapping from "s" (shared) at this point.

It *should* be possible to extract the relevant information from
/proc/[pid]/pagemap and then check the details of the page(s) in
/proc/kpageflags for KPF_SWAPBACKED (page is backed by swap/RAM). This is
the only way I'm aware of to check if this is a copy-on-write page
existing in multiple places.

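For illustration, a rough and untested-in-dnsmasq sketch of that lookup in
C. The bit positions follow the kernel's pagemap documentation; the helper
names (pm_present, pm_pfn, kpf_isset, read_u64_at, page_info) are made up
for this mail and exist nowhere in dnsmasq. Reading /proc/kpagecount for
the map count is my own addition on top of the kpageflags check, on the
assumption that a count above 1 is what we are really after.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700        /* for pread() */
#endif
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64     /* PFN * 8 can exceed 2 GiB on 32-bit */
#endif

#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* /proc/[pid]/pagemap: one 64-bit entry per virtual page. */
#define PM_PRESENT      (1ULL << 63)          /* page is in RAM */
#define PM_PFN_MASK     ((1ULL << 55) - 1)    /* bits 0-54: PFN if present */

/* /proc/kpageflags: bit numbers, one 64-bit flag word per PFN. */
#define KPF_ANON        12
#define KPF_SWAPBACKED  14

int pm_present(uint64_t entry)
{
    return (entry & PM_PRESENT) != 0;
}

/* Returns the page frame number, or 0 if the page is not present. */
uint64_t pm_pfn(uint64_t entry)
{
    return pm_present(entry) ? (entry & PM_PFN_MASK) : 0;
}

int kpf_isset(uint64_t flags, unsigned bit)
{
    return (int)((flags >> bit) & 1);
}

/* Read the idx-th 64-bit record from a /proc file made of such records. */
int read_u64_at(const char *path, uint64_t idx, uint64_t *out)
{
    assert(out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, out, sizeof(*out), (off_t)(idx * sizeof(*out)));
    close(fd);
    return n == (ssize_t)sizeof(*out) ? 0 : -1;
}

/*
 * Look up the physical page behind vaddr in process pid and fetch its
 * map count and flag word. Returns 0 on success, -1 if the page is not
 * present, the PFN is hidden from us, or any of the files is unreadable.
 */
int page_info(pid_t pid, uintptr_t vaddr, uint64_t *mapcount, uint64_t *flags)
{
    char path[64];
    uint64_t entry, pfn;
    uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);

    snprintf(path, sizeof(path), "/proc/%ld/pagemap", (long)pid);
    if (read_u64_at(path, vaddr / page_size, &entry) != 0)
        return -1;
    pfn = pm_pfn(entry);
    if (pfn == 0)
        return -1;
    if (read_u64_at("/proc/kpagecount", pfn, mapcount) != 0)
        return -1;
    return read_u64_at("/proc/kpageflags", pfn, flags);
}
```

One would call page_info() for an address inside the blocklist storage of
a forked child and treat mapcount > 1 together with
kpf_isset(flags, KPF_SWAPBACKED) as a hint that the page is still shared.
Note that this needs root: on recent kernels the PFN field in pagemap
reads as zero without CAP_SYS_ADMIN, and the kpage* files are only
readable by root.
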
If you know a simpler way to do this, I'd be happy to learn.

On Wed, 2021-03-31 at 19:43 -0700, Tony Ambardar wrote:
> A possible fix I'd suggest is to update dnsmasq's memory handling. IIRC,
> we use the same cache structure and memory allocation for both DNS cache
> and storing static server lists read from file. Perhaps use a separate,
> page-aligned memory pool to store these lists, then after initialization
> (and before forking) use mprotect() to set the region as read-only.
>
> Assuming it works, this would have the advantage of being a no-knobs
> solution vs. setting kludgey process or connection limits.

I like the idea of splitting the cache in two parts. Say a static and a
dynamic cache. Using mprotect() shouldn't even be necessary but helps to
ensure we're not writing to the static part of the cache anywhere in the
code.

KSM (kernel samepage merging) comes to my mind as well, but this seems to
be the wrong tool for the job. Figured I should mention it nonetheless.

On Wed, 2021-03-31 at 19:43 -0700, Tony Ambardar wrote:
> One other thing I saw while testing with large blocklists was a
> noticeable latency increase, likely related to lookup times. I recall
> some discussion on the ML where you mentioned work on a hash/tree
> solution was in progress. Were those changes completed?

Yes, dnsmasq uses hash buckets to minimize the amount of memory it has to
loop over when trying to find a name.

Best,
Dominik