Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
It's really nice if these kinds of things are broken up: first, replace
the old want_memblock parameter, then add the parameter to the
__add_page() calls.

> +/*
> + * NONE:     No memory block is to be created (e.g. device memory).
> + * NORMAL:   Memory block that represents normal (boot or hotplugged)
> + *           memory (e.g. ACPI DIMMs) that should be onlined either
> + *           automatically (memhp_auto_online) or manually by user
> + *           space to select a specific zone.
> + *           Applicable to memhp_auto_online.
> + * STANDBY:  Memory block that represents standby memory that should
> + *           only be onlined on demand by user space (e.g. standby
> + *           memory on s390x), but never automatically by the kernel.
> + *           Not applicable to memhp_auto_online.
> + * PARAVIRT: Memory block that represents memory added by
> + *           paravirtualized mechanisms (e.g. hyper-v, xen) that will
> + *           always automatically get onlined.  Memory will be
> + *           unplugged using ballooning, not by relying on the
> + *           MOVABLE ZONE.
> + *           Not applicable to memhp_auto_online.
> + */
> +enum {
> +	MEMORY_BLOCK_NONE,
> +	MEMORY_BLOCK_NORMAL,
> +	MEMORY_BLOCK_STANDBY,
> +	MEMORY_BLOCK_PARAVIRT,
> +};

This does not seem like the best way to expose these.

STANDBY, for instance, seems to be essentially a replacement for a
check against running on s390 in userspace to implement a _typical_
s390 policy.  It seems rather weird to try to make the userspace policy
determination easier by telling userspace about the typical s390 policy
via the kernel.

As for the OOM issues, that sounds like something we need to fix by
refusing to do (or delaying) hot-add operations once we consume too
much ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
userspace to hurry things along.

So, to my eye, we need:

+enum {
+	MEMORY_BLOCK_NONE,
+	MEMORY_BLOCK_STANDBY,	/* the default */
+	MEMORY_BLOCK_AUTO_ONLINE,
+};

and we can probably collapse NONE into AUTO_ONLINE because userspace
ends up doing the same thing for both: nothing.

> struct memory_block {
> 	unsigned long start_section_nr;
> 	unsigned long end_section_nr;
> @@ -34,6 +58,7 @@ struct memory_block {
> 	int (*phys_callback)(struct memory_block *);
> 	struct device dev;
> 	int nid;	/* NID for this memory block */
> +	int type;	/* type of this memory block */
> };

Shouldn't we just be creating and using an actual named enum type?
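For illustration only (the enumerators mirror the suggestion above; the
enum tag itself is hypothetical), a named type could look like:

	enum memory_block_type {
		MEMORY_BLOCK_NONE,
		MEMORY_BLOCK_STANDBY,		/* the default */
		MEMORY_BLOCK_AUTO_ONLINE,
	};

struct memory_block would then carry an 'enum memory_block_type type;'
member instead of a bare int, so the compiler can catch bogus
assignments and warn about switch statements that miss a case.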
Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
> How should a policy in user space look like when new memory gets added
> - on s390x? Not onlining paravirtualized memory is very wrong.

Because we're going to balloon it away in a moment anyway?

We have auto-onlining.  Why isn't that being used on s390?

> So the type of memory is very important here to have in user space.
> Relying on checks like "isS390()", "isKVMGuest()" or "isHyperVGuest()"
> to decide whether to online memory and how to online memory is wrong.
> Only some specific memory types (which I call "normal") are to be
> handled by user space.
>
> For the other ones, we exactly know what to do:
> - standby? don't online

I think you're horribly conflating the software desire for what the
state should be and the hardware itself.

>> As for the OOM issues, that sounds like something we need to fix by
>> refusing to do (or delaying) hot-add operations once we consume too
>> much ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
>> userspace to hurry things along.
>
> That is a moving target and doing that automatically is basically
> impossible.

Nah.  We know how much metadata we've allocated.  We know how much
ZONE_NORMAL we are eating.  We can *easily* add something to
add_memory() that just sleeps until the ratio is not out-of-whack.

> You can add a lot of memory to the movable zone and
> everything is fine. Suddenly a lot of processes are started - boom.
> MOVABLE should only ever be used if you expect an unplug. And for
> paravirtualized devices, a "typical" unplug does not exist.

No, it's more complicated than that.  People use MOVABLE, for instance,
to allow more consistent huge page allocations.  It's certainly not
just hot-remove.
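To make the "sleep until the ratio is sane" idea concrete, here is a
rough sketch.  memhp_metadata_pages() is a made-up helper standing in
for the metadata accounting described above, and the 10% threshold is
arbitrary; only msleep() and global_zone_page_state() are existing
kernel APIs.

	#include <linux/delay.h>
	#include <linux/vmstat.h>

	/*
	 * Hypothetical throttle for add_memory():
	 * memhp_metadata_pages() would return the number of ZONE_NORMAL
	 * pages consumed by hotplug metadata (memmap[] etc.).  Block
	 * further hot-adds until onlining has caught up enough that
	 * metadata stays under ~10% of free memory.
	 */
	static void memhp_wait_for_sane_ratio(void)
	{
		while (memhp_metadata_pages() >
		       global_zone_page_state(NR_FREE_PAGES) / 10)
			msleep(100);
	}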
Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
> It is more than just memmaps (e.g. forking udev process doing memory
> onlining also needs memory) but yes, the main idea is to make the
> onlining synchronous with hotplug.

That's a good theoretical concern.

But, is it a problem we need to solve in practice?
Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()
On 10/12/20 9:19 AM, Eric Biggers wrote:
> On Sun, Oct 11, 2020 at 11:56:35PM -0700, Ira Weiny wrote:
>>> And I still don't really understand.  After this patchset, there is
>>> still code nearly identical to the above (doing a temporary mapping
>>> just for a memcpy) that would still be using kmap_atomic().
>>
>> I don't understand.  You mean there would be other call sites calling:
>>
>> kmap_atomic()
>> memcpy()
>> kunmap_atomic()
>
> Yes, there are tons of places that do this.  Try 'git grep -A6 kmap_atomic'
> and look for memcpy().
>
> Hence why I'm asking what will be the "recommended" way to do this...
> kunmap_thread() or kmap_atomic()?

kmap_atomic() is always preferred over kmap()/kmap_thread().
kmap_atomic() is _much_ more lightweight since its TLB invalidation is
always CPU-local and never broadcast.

So, basically, unless you *must* sleep while the mapping is in place,
kmap_atomic() is preferred.
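For reference, the non-sleeping pattern in question looks like this
(copy_from_page_atomic() is a hypothetical helper, not something from
the patchset):

	#include <linux/highmem.h>
	#include <linux/string.h>

	/* Copy out of a (possibly highmem) page without sleeping while
	 * the temporary mapping is live. */
	static void copy_from_page_atomic(void *dst, struct page *page,
					  unsigned int offset, size_t len)
	{
		/* CPU-local mapping; tearing it down never broadcasts a
		 * TLB invalidation to other CPUs. */
		void *vaddr = kmap_atomic(page);

		memcpy(dst, vaddr + offset, len);
		kunmap_atomic(vaddr);	/* no sleeping between map/unmap */
	}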
Re: [PATCH V2] staging: android: lowmemorykiller: imporve lmk to avoid deadlock issue
On 08/02/2015 10:53 PM, Wang, Biao wrote:
> Consider the following case:
> Task A trigger lmk with a lock held, while task B try to
> get this lock, but unfortunately B is the very culprit task lmk select
> to kill. Then B will never be killed, and A will forever select B to
> kill. Such dead lock will trigger softlock up issue.

It would be interesting to have some actual data about where this
helps.  For instance, which locks does this happen on?  What kind of
allocation?

Also, we apparently _do_ mark a lowmemorykiller victim as an oom victim
and let them use memory reserves.  Why does that not allow the
allocation to complete at least long enough to get the kill signal to
the victim?
Re: [RFC 0/6] Module for tracking/accounting shared memory buffers
On 10/11/2016 04:50 PM, Ruchi Kandoi wrote:
> Any process holding a reference to these buffers will keep the kernel
> from reclaiming its backing pages. mm counters don't provide a
> complete picture of these allocations, since they only account for
> pages that are mapped into a process's address space. This problem is
> especially bad for systems like Android that use dma-buf fds to share
> graphics and multimedia buffers between processes: these allocations
> are often large, have complex sharing patterns, and are rarely mapped
> into every process that holds a reference to them.

What do you end up _doing_ with all this new information that you have
here?  You know which processes have "pinned" these shared buffers, and
exported that information in /proc.  But, then what?
Re: [PATCH] staging: lustre: fix GFP_ATOMIC macro usage
On 01/21/2014 12:02 PM, Dilger, Andreas wrote:
> The Lustre allocation macros track the memory usage across the whole
> filesystem, not just of a single structure that a
> mempool/slab/whatever would do.  This is useful to know for debugging
> purposes (e.g. user complains about not having enough RAM for their
> highly-tuned application, or to check for leaks at unmount).

Urg, it does this with a global variable.  If we did this kind of thing
generically, we'd get eaten alive by all of the cacheline bouncing from
that atomic.  It's also a 32-bit atomic.  Guess it's never had more
than 4GB of memory tracked in there. :)

This also doesn't track overhead from things that *are* in slabs like
the inodes, or the 14 kmem_caches that lustre has, so it's far from a
complete picture of how lustre is using memory.

> It can also log the alloc/free calls and post-process them to find
> leaks easily, or find pieces of code that are allocating too much
> memory and are not using dedicated slabs.  This also works if you
> encounter a system with a lot of allocated memory, enable "free"
> logging, and then unmount the filesystem.  The logs will show which
> structures are being freed (assuming they are not leaked completely)
> and point you to whatever is not being shrunk properly.

This isn't perfect, but it does cover most of the ext4 call sites in my
kernel.  It would work better for a module, I'd imagine:

	cd /sys/kernel/debug/tracing/events/kmem
	echo -n 'call_site < 0x81e0af00 && call_site >= 0x81229cf0' > kmalloc/filter
	echo 1 > kmalloc/enable
	cat /sys/kernel/debug/tracing/trace_pipe

It will essentially log all the kmalloc() calls from the ext4 code.  I
got the call site locations from grepping System.map.  It would be
_really_ nice if we were able to do something like:

	echo 'call_site =~ *ext4*'

but there's no way to do that which I know of.  You could probably rig
something up with the function graph tracer by triggering only on entry
to the lustre code.

> I don't know if there is any way to track this with regular kmalloc(),
> and creating separate slabs for every data structure would be ugly.
> The generic /proc/meminfo data doesn't really tell you what is using
> all the memory, and the size-<n> slabs give some information, but are
> used all over the kernel.
>
> I'm pretty much resigned to losing all of this functionality, but it
> definitely has been very useful for finding problems.

Yeah, it is hard to find out who is responsible for leaking pages or
kmalloc()s, especially after the fact.  But, we seem to limp along just
fine.  If lustre is that bad of a memory leaker that it *NEEDS* this
feature, we have bigger problems on our hands. :)
[staging] unused rtl8192u/ieee80211/digest.c ?
I was doing some code audits looking at scatter-gather uses, and
noticed that update() in drivers/staging/rtl8192u/ieee80211/digest.c
uses sg.page, which doesn't exist any longer, so this can't possibly
compile.

Turns out that digest.c is actually unused.  It doesn't get referenced
in a Makefile or get compiled, and doesn't get used as far as I can
see.

This driver looks pretty dead.  The original author hasn't touched it
in the 4 years since it got into staging/.  Shouldn't we be kicking
stuff like this out of staging?
Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks
On 07/23/2013 07:52 AM, KY Srinivasan wrote:
> The current scheme of involving user
> level code to close this loop obviously does not perform well under
> high memory pressure.

Adding memory usually requires allocating some large, contiguous areas
of memory for use as mem_map[] and other VM structures.  That's really
hard to do under heavy memory pressure.  How are you accomplishing
this?
Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks
On 07/23/2013 08:54 AM, KY Srinivasan wrote:
>> Adding memory usually requires allocating some large, contiguous
>> areas of memory for use as mem_map[] and other VM structures.  That's
>> really hard to do under heavy memory pressure.  How are you
>> accomplishing this?
>
> I cannot avoid failures because of lack of memory. In this case I
> notify the host of the failure and also tag the failure as transient.
> Host retries the operation after some delay. There is no guarantee it
> will succeed though.

You didn't really answer the question.

You have allocated some large, physically contiguous areas of memory
under heavy pressure.  But you also contend that there is too much
memory pressure to run a small userspace helper.  Under heavy memory
pressure, I'd expect large, kernel allocations to fail much more often
than running a small userspace helper.

It _sounds_ like you really want to be able to have the host retry the
operation if it fails, and you return success/failure from inside the
kernel.  It's hard for you to tell if running the userspace helper
failed, so your solution is to move what was previously done in
userspace into the kernel so that you can more easily tell if it failed
or succeeded.

Is that right?
Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks
On 07/23/2013 10:21 AM, KY Srinivasan wrote:
>> You have allocated some large, physically contiguous areas of memory
>> under heavy pressure.  But you also contend that there is too much
>> memory pressure to run a small userspace helper.  Under heavy memory
>> pressure, I'd expect large, kernel allocations to fail much more
>> often than running a small userspace helper.
>
> I am only reporting what I am seeing. Broadly, I have two main failure
> conditions to deal with: (a) resource related failure (add_memory()
> returning -ENOMEM) and (b) not being able to online a segment that has
> been successfully hot-added. I have seen both these failures under
> high memory pressure. By supporting "in context" onlining, we can
> eliminate one failure case. Our inability to online is not a
> recoverable failure from the host's point of view - the memory is
> committed to the guest (since hot add succeeded) but is not usable
> since it is not onlined.

Could you please report precisely what you are seeing, in detail?
Where are the -ENOMEMs coming from?  Which allocation site?  Are you
seeing OOMs or page allocation failure messages on the console?

The operation was split up into two parts for good reason.  It's
actually for your _precise_ use case.

A system under memory pressure is going to have troubles doing a
hot-add.  You need memory to add memory.  Of the two operations ("add"
and "online"), "add" is the one vastly more likely to fail.  It has to
allocate several large swaths of contiguous physical memory.  For that
reason, the system was designed so that you could "add" and "online"
separately.  The intention was that you could "add" far in advance and
then "online" under memory pressure, with the "online" having *VASTLY*
smaller memory requirements and being much more likely to succeed.

You're lumping the "allocate several large swaths of contiguous
physical memory" failures into the same class as "run a small userspace
helper".  They are _really_ different problems.  Both prone to
allocation failures for sure, but _very_ separate problems.  Please
don't conflate them.

>> It _sounds_ like you really want to be able to have the host retry
>> the operation if it fails, and you return success/failure from inside
>> the kernel.  It's hard for you to tell if running the userspace
>> helper failed, so your solution is to move what was previously done
>> in userspace into the kernel so that you can more easily tell if it
>> failed or succeeded.
>>
>> Is that right?
>
> No; I am able to get the proper error code for recoverable failures
> (hot add failures because of lack of memory). By doing what I am
> proposing here, we can avoid one class of failures completely and I
> think this is what resulted in a better "hot add" experience in the
> guest.

I think you're taking a huge leap here: "We could not online memory,
thus we must take userspace out of the loop."

You might be right.  There might be only one way out of this situation.
But you need to provide a little more supporting evidence before we all
arrive at the same conclusion.

BTW, it doesn't _require_ udev.  There could easily be another listener
for hotplug events.
Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks
On 07/24/2013 12:45 PM, KY Srinivasan wrote:
> All I am saying is that I see two classes of failures: (a) our
> inability to allocate memory to manage the memory that is being hot
> added and (b) our inability to bring the hot added memory online
> within a reasonable amount of time. I am not sure of the cause for
> (b) and I was just speculating that this could be memory related.
> What is interesting is that I have seen failures related to our
> inability to online the memory after having succeeded in hot adding
> the memory.

I think we should hold off on this patch and others like it until we've
been able to sufficiently explain how (b) happens.
Re: [PATCH 2/2] Drivers: hv: balloon: Online the hot-added memory "in context"
On 07/24/2013 02:29 PM, K. Y. Srinivasan wrote:
> 	/*
> -	 * Wait for the memory block to be onlined.
> -	 * Since the hot add has succeeded, it is ok to
> -	 * proceed even if the pages in the hot added region
> -	 * have not been "onlined" within the allowed time.
> +	 * Before proceeding to hot add the next segment,
> +	 * online the segment that has been hot added.
> 	 */
> -	wait_for_completion_timeout(&dm_device.ol_waitevent, 5*HZ);
> +	online_memory_block(start_pfn);

Ah.  You've got a timeout in the code in order to tell the hypervisor
that you were successfully able to add the memory?  The userspace
onlining code probably wasn't running within this timeout period,
right?
Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks
On 07/25/2013 04:14 AM, KY Srinivasan wrote:
> As promised, I have sent out the patches for (a) an implementation of
> an in-kernel API for onlining and a consumer for this API. While I
> don't know the exact reason why the user mode code is delayed (under
> some low memory conditions), what is the harm in having a mechanism to
> online memory that has been hot added without involving user space
> code?

KY, your potential problem (not being able to online more memory
because of a shortage of memory) is a serious one.  However, this
potential issue exists in long-standing code, and potentially affects
all users of memory hotplug.

The problem has not been described in sufficient detail for the rest of
the developers to tell if you are facing a new problem, or whether
*any* proposed solution will help the problem you face.

Your proposed solution changes the semantics of existing user/kernel
interfaces, duplicates existing functionality, and adds code complexity
to the kernel.
Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks
On 07/25/2013 08:15 AM, Kay Sievers wrote:
> Complexity, well, it's just a bit of code which belongs in the kernel.
> The mentioned unconditional hotplug loop through userspace is
> absolutely pointless. Such defaults never belong in userspace tools if
> they do not involve data that is only available in userspace and
> something would make a decision about that. Saying "hello" to
> userspace and userspace has a hardcoded "yes" in return makes no sense
> at all. The kernel can just go ahead and do its job, like it does for
> all other devices it finds too.

Sorry, but memory is different from all other devices.  You never need
a mouse in order to add another mouse to the kernel.

I'll repaste something I said earlier in this thread:

> A system under memory pressure is going to have troubles doing a
> hot-add.  You need memory to add memory.  Of the two operations ("add"
> and "online"), "add" is the one vastly more likely to fail.  It has to
> allocate several large swaths of contiguous physical memory.  For that
> reason, the system was designed so that you could "add" and "online"
> separately.  The intention was that you could "add" far in advance and
> then "online" under memory pressure, with the "online" having *VASTLY*
> smaller memory requirements and being much more likely to succeed.

So, no, it makes no sense to just have userspace always unconditionally
online all the memory that the kernel adds.  But, the way it's set up,
we _have_ a method that can work under lots of memory pressure, and it
is available for users that want it.

It was designed 10 years ago, and maybe it's outdated, or history has
proved that nobody is going to use it the way it was designed.  If I
had it to do over again, I'd probably set up configurable per-node sets
of spare kernel metadata.  That way, you could say "make sure we have
enough memory reserved to add $FOO sections to node $BAR".  Use that
for the largest allocations, then depend on PF_MEMALLOC to get us
enough for the other little bits along the way.

Also, if this is a problem, it's going to be a problem for *EVERY* user
of memory hotplug, not just hyperv.  So, let's see it fixed generically
for *EVERY* user.  Something along the lines of:

1. Come up with an interface that specifies a default policy for
   newly-added memory sections.  Currently, added memory gets "added",
   but not "onlined", and the default should stay doing that.
2. Make sure that we at least WARN_ONCE() if someone tries to online an
   already-kernel-onlined memory section (a rough sketch follows below).
   That way, if someone trips over this new policy, we have a _chance_
   of explaining to them what is going on.
3. Think about what we do in the failure case where we are able to
   "add", but fail to "online" in the kernel.  Do we tear the
   newly-added structures down and back out of the operation, or do we
   leave the memory added, but offline (what happens in the normal case
   now)?
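A rough sketch of the WARN_ONCE() check from point 2.  The function
name is hypothetical; struct memory_block and MEM_ONLINE are the
existing definitions used by the memory sysfs code:

	#include <linux/memory.h>
	#include <linux/bug.h>

	/*
	 * Hypothetical guard for the online path: complain once if user
	 * space tries to online a section the kernel already onlined
	 * under the new default policy.
	 */
	static int check_not_kernel_onlined(struct memory_block *mem)
	{
		if (WARN_ONCE(mem->state == MEM_ONLINE,
			      "memory%lu is already online\n",
			      mem->start_section_nr))
			return -EINVAL;
		return 0;
	}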