Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-09-28 Thread Dave Hansen
It's really nice if these kinds of things are broken up.  First, replace
the old want_memblock parameter, then add the parameter to the
__add_pages() calls.

> +/*
> + * NONE: No memory block is to be created (e.g. device memory).
> + * NORMAL:   Memory block that represents normal (boot or hotplugged) memory
> + *   (e.g. ACPI DIMMs) that should be onlined either automatically
> + *   (memhp_auto_online) or manually by user space to select a
> + *   specific zone.
> + *   Applicable to memhp_auto_online.
> + * STANDBY:  Memory block that represents standby memory that should only
> + *   be onlined on demand by user space (e.g. standby memory on
> + *   s390x), but never automatically by the kernel.
> + *   Not applicable to memhp_auto_online.
> + * PARAVIRT: Memory block that represents memory added by
> + *   paravirtualized mechanisms (e.g. hyper-v, xen) that will
> + *   always automatically get onlined. Memory will be unplugged
> + *   using ballooning, not by relying on the MOVABLE ZONE.
> + *   Not applicable to memhp_auto_online.
> + */
> +enum {
> + MEMORY_BLOCK_NONE,
> + MEMORY_BLOCK_NORMAL,
> + MEMORY_BLOCK_STANDBY,
> + MEMORY_BLOCK_PARAVIRT,
> +};

This does not seem like the best way to expose these.

STANDBY, for instance, seems to be essentially a replacement for a check
against running on s390 in userspace to implement a _typical_ s390
policy.  It seems rather weird to try to make the userspace policy
determination easier by telling userspace about the typical s390 policy
via the kernel.

As for the OOM issues, that sounds like something we need to fix by
refusing to do (or delaying) hot-add operations once we consume too much
ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
userspace to hurry things along.

So, to my eye, we need:

 +enum {
 +  MEMORY_BLOCK_NONE,
 +  MEMORY_BLOCK_STANDBY, /* the default */
 +  MEMORY_BLOCK_AUTO_ONLINE,
 +};

and we can probably collapse NONE into AUTO_ONLINE because userspace
ends up doing the same thing for both: nothing.

>  struct memory_block {
>   unsigned long start_section_nr;
>   unsigned long end_section_nr;
> @@ -34,6 +58,7 @@ struct memory_block {
>   int (*phys_callback)(struct memory_block *);
>   struct device dev;
>   int nid;    /* NID for this memory block */
> + int type;   /* type of this memory block */
>  };

Shouldn't we just be creating and using an actual named enum type?
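
E.g., a sketch of what I mean (if the types survive at all):

    enum memory_block_type {
        MEMORY_BLOCK_NONE,
        MEMORY_BLOCK_NORMAL,
        MEMORY_BLOCK_STANDBY,
        MEMORY_BLOCK_PARAVIRT,
    };

    struct memory_block {
        ...
        enum memory_block_type type;
    };

Then a switch() over the type can at least warn about missed cases, and
the field documents itself.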


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-01 Thread Dave Hansen
> What should a policy in user space look like when new memory gets added
> - on s390x? Not onlining paravirtualized memory is very wrong.

Because we're going to balloon it away in a moment anyway?

We have auto-onlining.  Why isn't that being used on s390?
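
(For reference, that knob is a one-line sysfs write -- a usage sketch;
the default is typically "offline":

    echo online > /sys/devices/system/memory/auto_online_blocks

after which the kernel onlines every newly-added block on its own.)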


> So the type of memory is very important here to have in user space.
> Relying on checks like "isS390()", "isKVMGuest()" or "isHyperVGuest()"
> to decide whether to online memory and how to online memory is wrong.
> Only some specific memory types (which I call "normal") are to be
> handled by user space.
> 
> For the other ones, we know exactly what to do:
> - standby? don't online

I think you're horribly conflating the software desire for what the state
should be and the hardware itself.

>> As for the OOM issues, that sounds like something we need to fix by
>> refusing to do (or delaying) hot-add operations once we consume too much
>> ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
>> userspace to hurry things along.
> 
> That is a moving target and doing that automatically is basically
> impossible.

Nah.  We know how much metadata we've allocated.  We know how much
ZONE_NORMAL we are eating.  We can *easily* add something to
add_memory() that just sleeps until the ratio is not out-of-whack.
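
Roughly this -- a sketch only, where memmap_ratio_ok() is a made-up
helper standing in for "metadata vs. ZONE_NORMAL consumption is
acceptable":

    int add_memory(int nid, u64 start, u64 size)
    {
        /* Throttle hot-add while metadata overhead is out of whack. */
        while (!memmap_ratio_ok(nid))
            schedule_timeout_interruptible(HZ);
        ...
    }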

> You can add a lot of memory to the movable zone and
> everything is fine. Suddenly a lot of processes are started - boom.
> MOVABLE should only ever be used if you expect an unplug. And for
> paravirtualized devices, a "typical" unplug does not exist.

No, it's more complicated than that.  People use MOVABLE, for instance,
to allow more consistent huge page allocations.  It's certainly not just
hot-remove.


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Dave Hansen
On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
> It is more than just memmaps (e.g. forking udev process doing memory
> onlining also needs memory) but yes, the main idea is to make the
> onlining synchronous with hotplug.

That's a good theoretical concern.

But, is it a problem we need to solve in practice?



Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-12 Thread Dave Hansen
On 10/12/20 9:19 AM, Eric Biggers wrote:
> On Sun, Oct 11, 2020 at 11:56:35PM -0700, Ira Weiny wrote:
>>> And I still don't really understand.  After this patchset, there is
>>> still code nearly identical to the above (doing a temporary mapping
>>> just for a memcpy) that would still be using kmap_atomic().
>> I don't understand.  You mean there would be other call sites calling:
>>
>> kmap_atomic()
>> memcpy()
>> kunmap_atomic()
> Yes, there are tons of places that do this.  Try 'git grep -A6 kmap_atomic'
> and look for memcpy().
> 
> Hence why I'm asking what will be the "recommended" way to do this...
> kmap_thread() or kmap_atomic()?

kmap_atomic() is always preferred over kmap()/kmap_thread().
kmap_atomic() is _much_ more lightweight since its TLB invalidation is
always CPU-local and never broadcast.

So, basically, unless you *must* sleep while the mapping is in place,
kmap_atomic() is preferred.
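
In other words, for the plain memcpy case the pattern stays what it is
today (a sketch; page/buf/off/len are whatever the call site already
has):

    void *vaddr = kmap_atomic(page);
    memcpy(buf, vaddr + off, len);
    kunmap_atomic(vaddr);

That is fine as long as nothing between the map and unmap can sleep.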



Re: [PATCH V2] staging: android: lowmemorykiller: imporve lmk to avoid deadlock issue

2015-08-03 Thread Dave Hansen
On 08/02/2015 10:53 PM, Wang, Biao wrote:
> Consider the following case:
> Task A triggers lmk with a lock held, while task B tries to
> get this lock, but unfortunately B is the very culprit lmk selects to
> kill. Then B will never be killed, and A will forever select B to kill.
> Such a deadlock will trigger a soft lockup.

It would be interesting to have some actual data about where this helps.
For instance, which locks does this happen on?  What kind of
allocation?  Also, we apparently _do_ mark a lowmemorykiller victim as
an oom victim and let them use memory reserves.  Why does that not allow
the allocation to complete at least long enough to get the kill signal
to the victim?


Re: [RFC 0/6] Module for tracking/accounting shared memory buffers

2016-10-12 Thread Dave Hansen
On 10/11/2016 04:50 PM, Ruchi Kandoi wrote:
> Any process holding a reference to these buffers will keep the kernel from
> reclaiming its backing pages.  mm counters don't provide a complete picture of
> these allocations, since they only account for pages that are mapped into a
> process's address space.  This problem is especially bad for systems like
> Android that use dma-buf fds to share graphics and multimedia buffers between
> processes: these allocations are often large, have complex sharing patterns,
> and are rarely mapped into every process that holds a reference to them.

What do you end up _doing_ with all this new information that you have
here?  You know which processes have "pinned" these shared buffers, and
exported that information in /proc.  But, then what?


Re: [PATCH] staging: lustre: fix GFP_ATOMIC macro usage

2014-01-21 Thread Dave Hansen
On 01/21/2014 12:02 PM, Dilger, Andreas wrote:
> The Lustre allocation macros track the memory usage across the whole
> filesystem, not just of a single structure that a mempool/slab/whatever
> would do.  This is useful to know for debugging purposes (e.g. user
> complains about not having enough RAM for their highly-tuned
> application, or to check for leaks at unmount).

Urg, it does this with a global variable.  If we did this kind of thing
generically, we'd get eaten alive by all of the cacheline bouncing from
that atomic.  It's also a 32-bit atomic.  Guess it's never had more than
4GB of memory tracked in there. :)
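
(If a filesystem-wide counter is really wanted, a percpu_counter is the
usual way to avoid that bouncing.  A sketch, assuming
percpu_counter_init() was called once at mount:

    #include <linux/percpu_counter.h>

    static struct percpu_counter lustre_mem_used;

    percpu_counter_add(&lustre_mem_used, size);    /* fast: CPU-local */
    total = percpu_counter_sum(&lustre_mem_used);  /* slow: exact sum */

Writers stay CPU-local; only the rare reader pays for the full sum.)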

This also doesn't track overhead from things that *are* in slabs like
the inodes, or the 14 kmem_caches that lustre has, so it's far from a
complete picture of how lustre is using memory.

> It can also log the alloc/free calls and post-process them to find leaks
> easily, or find pieces of code that are allocating too much memory and
> not using dedicated slabs.  This also works if you encounter a system
> with a lot of allocated memory, enable "free" logging, and then unmount
> the filesystem.  The logs will show which structures are being freed
> (assuming they are not leaked completely) and point you to whatever is
> not being shrunk properly.

This isn't perfect, but it does cover most of the ext4 call sites in my
kernel.  It would work better for a module, I'd imagine:

cd /sys/kernel/debug/tracing/events/kmem
echo -n 'call_site < 0x81e0af00 && call_site >= 0x81229cf0' > kmalloc/filter
echo 1 > kmalloc/enable
cat /sys/kernel/debug/tracing/trace_pipe

It will essentially log all the kmalloc() calls from the ext4 code.  I
got the call site locations from grepping System.map.  It would be
_really_ nice if we were able to do something like:

echo 'call_site =~ *ext4*'

but there's no way to do that which I know of.  You could probably rig
something up with the function graph tracer by triggering only on entry
to the lustre code.
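
One way to get most of the effect is the function tracer's module
filter -- a sketch, assuming the code of interest is in a module named
"lustre":

    cd /sys/kernel/debug/tracing
    echo ':mod:lustre' > set_ftrace_filter
    echo function > current_tracer
    cat trace_pipe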

> I don't know if there is any way to track this with regular kmalloc(),
> and creating separate slabs for every data structure would be ugly.
> The generic /proc/meminfo data doesn't really tell you what is using
> all the memory, and the size-<n> slabs give some information, but are
> used all over the kernel.
> 
> I'm pretty much resigned to losing all of this functionality, but it
> definitely has been very useful for finding problems.

Yeah, it is hard to find out who is responsible for leaking pages or
kmalloc()s, especially after the fact.  But, we seem to limp along just
fine.  If lustre is that bad of a memory leaker that it *NEEDS* this
feature, we have bigger problems on our hands. :)





[staging] unused rtl8192u/ieee80211/digest.c ?

2014-02-03 Thread Dave Hansen
I was doing some code audits looking at scattergather uses, and noticed
that update() in drivers/staging/rtl8192u/ieee80211/digest.c uses
sg.page which doesn't exist any longer, so this can't possibly compile.

Turns out that digest.c is actually unused.  It doesn't get referenced
in a Makefile or get compiled and doesn't get used as far as I can see.

This driver looks pretty dead.  The original author hasn't touched it in
the 4 years since it got into staging/.  Shouldn't we be kicking stuff
like this out of staging?


Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

2013-07-23 Thread Dave Hansen
On 07/23/2013 07:52 AM, KY Srinivasan wrote:
> The current scheme of involving user level code to close this loop
> obviously does not perform well under high memory pressure.

Adding memory usually requires allocating some large, contiguous areas
of memory for use as mem_map[] and other VM structures.  That's really
hard to do under heavy memory pressure.  How are you accomplishing this?



Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

2013-07-23 Thread Dave Hansen
On 07/23/2013 08:54 AM, KY Srinivasan wrote:
>> > Adding memory usually requires allocating some large, contiguous areas
>> > of memory for use as mem_map[] and other VM structures.  That's really
>> > hard to do under heavy memory pressure.  How are you accomplishing this?
> I cannot avoid failures because of lack of memory. In this case I notify
> the host of the failure and also tag the failure as transient. Host
> retries the operation after some delay. There is no guarantee it will
> succeed though.

You didn't really answer the question.

You have allocated some large, physically contiguous areas of memory
under heavy pressure.  But you also contend that there is too much
memory pressure to run a small userspace helper.  Under heavy memory
pressure, I'd expect large, kernel allocations to fail much more often
than running a small userspace helper.

It _sounds_ like you really want to be able to have the host retry the
operation if it fails, and you return success/failure from inside the
kernel.  It's hard for you to tell if running the userspace helper
failed, so your solution is to move what was previously done in
userspace into the kernel so that you can more easily tell if it failed
or succeeded.

Is that right?


Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

2013-07-24 Thread Dave Hansen
On 07/23/2013 10:21 AM, KY Srinivasan wrote:
>> You have allocated some large, physically contiguous areas of memory
>> under heavy pressure.  But you also contend that there is too much
>> memory pressure to run a small userspace helper.  Under heavy memory
>> pressure, I'd expect large, kernel allocations to fail much more often
>> than running a small userspace helper.
> 
> I am only reporting what I am seeing. Broadly, I have two main failure
> conditions to deal with: (a) resource related failure (add_memory()
> returning -ENOMEM) and (b) not being able to online a segment that has
> been successfully hot-added. I have seen both these failures under high
> memory pressure. By supporting "in context" onlining, we can eliminate
> one failure case. Our inability to online is not a recoverable failure
> from the host's point of view - the memory is committed to the guest
> (since hot add succeeded) but is not usable since it is not onlined.

Could you please precisely report on what you are seeing in detail?
Where are the -ENOMEMs coming from?  Which allocation site?  Are you
seeing OOMs or page allocation failure messages on the console?

The operation was split up in to two parts for good reason.  It's
actually for your _precise_ use case.

A system under memory pressure is going to have troubles doing a
hot-add.  You need memory to add memory.  Of the two operations ("add"
and "online"), "add" is the one vastly more likely to fail.  It has to
allocate several large swaths of contiguous physical memory.  For that
reason, the system was designed so that you could "add" and "online"
separately.  The intention was that you could "add" far in advance and
then "online" under memory pressure, with the "online" having *VASTLY*
smaller memory requirements and being much more likely to succeed.
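
(Concretely, the "online" half is a single sysfs write -- block number
made up here:

    echo online > /sys/devices/system/memory/memory42/state

which needs almost nothing compared to what "add" has to allocate.)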

You're lumping the "allocate several large swaths of contiguous physical
memory" failures into the same class as "run a small userspace helper".
They are _really_ different problems.  Both prone to allocation
failures for sure, but _very_ separate problems.  Please don't conflate
them.

>> It _sounds_ like you really want to be able to have the host retry the
>> operation if it fails, and you return success/failure from inside the
>> kernel.  It's hard for you to tell if running the userspace helper
>> failed, so your solution is to move what was previously done in
>> userspace into the kernel so that you can more easily tell if it failed
>> or succeeded.
>>
>> Is that right?
> 
> No; I am able to get the proper error code for recoverable failures
> (hot add failures because of lack of memory). By doing what I am
> proposing here, we can avoid one class of failures completely and I
> think this is what resulted in a better "hot add" experience in the
> guest.

I think you're taking a huge leap here: "We could not online memory,
thus we must take userspace out of the loop."

You might be right.  There might be only one way out of this situation.
But you need to provide a little more supporting evidence before we all
arrive at the same conclusion.

BTW, it doesn't _require_ udev.  There could easily be another listener
for hotplug events.


Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

2013-07-24 Thread Dave Hansen
On 07/24/2013 12:45 PM, KY Srinivasan wrote:
> All I am saying is that I see two classes of failures: (a) Our
> inability to allocate memory to manage the memory that is being hot
> added and (b) Our inability to bring the hot added memory online within
> a reasonable amount of time. I am not sure the cause for (b) and I was
> just speculating that this could be memory related. What is interesting
> is that I have seen failure related to our inability to online the
> memory after having succeeded in hot adding the memory.

I think we should hold off on this patch and other like it until we've
been sufficiently able to explain how (b) happens.



Re: [PATCH 2/2] Drivers: hv: balloon: Online the hot-added memory "in context"

2013-07-24 Thread Dave Hansen
On 07/24/2013 02:29 PM, K. Y. Srinivasan wrote:
>   /*
> -  * Wait for the memory block to be onlined.
> -  * Since the hot add has succeeded, it is ok to
> -  * proceed even if the pages in the hot added region
> -  * have not been "onlined" within the allowed time.
> +  * Before proceeding to hot add the next segment,
> +  * online the segment that has been hot added.
>   */
> - wait_for_completion_timeout(&dm_device.ol_waitevent, 5*HZ);
> + online_memory_block(start_pfn);

Ah...  You've got a timeout in the code in order to tell the
hypervisor that you were successfully able to add the memory?  The
userspace addition code probably wasn't running within this timeout
period, right?


Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

2013-07-25 Thread Dave Hansen
On 07/25/2013 04:14 AM, KY Srinivasan wrote:
> As promised, I have sent out the patches for (a) an implementation of
> an in-kernel API for onlining and a consumer for this API. While I
> don't know the exact reason why the user mode code is delayed (under
> some low memory conditions), what is the harm in having a mechanism to
> online memory that has been hot added without involving user space
> code.

KY, your potential problem, not being able to online more memory because
of a shortage of memory, is a serious one.

However, this potential issue exists in long-standing code, and
potentially affects all users of memory hotplug.  The problem has not
been described in sufficient detail for the rest of the developers to
tell if you are facing a new problem, or whether *any* proposed solution
will help the problem you face.

Your proposed solution changes the semantics of existing user/kernel
interfaces, duplicates existing functionality, and adds code complexity
to the kernel.


Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

2013-07-25 Thread Dave Hansen
On 07/25/2013 08:15 AM, Kay Sievers wrote:
> Complexity, well, it's just a bit of code which belongs in the kernel.
> The mentioned unconditional hotplug loop through userspace is
> absolutely pointless. Such defaults never belong in userspace tools if
> they do not involve data that is only available in userspace and
> something would make a decision about that. Saying "hello" to
> userspace and usrspace has a hardcoded "yes" in return makes no sense
> at all. The kernel can just go ahead and do its job, like it does for
> all other devices it finds too.

Sorry, but memory is different than all other devices.  You never need a
mouse in order to add another mouse to the kernel.

I'll repaste something I said earlier in this thread:

> A system under memory pressure is going to have troubles doing a
> hot-add.  You need memory to add memory.  Of the two operations ("add"
> and "online"), "add" is the one vastly more likely to fail.  It has to
> allocate several large swaths of contiguous physical memory.  For that
> reason, the system was designed so that you could "add" and "online"
> separately.  The intention was that you could "add" far in advance and
> then "online" under memory pressure, with the "online" having *VASTLY*
> smaller memory requirements and being much more likely to succeed.

So, no, it makes no sense to just have userspace always unconditionally
online all the memory that the kernel adds.  But, the way it's set up,
we _have_ a method that can work under lots of memory pressure, and it is
available for users that want it.  It was designed 10 years ago, and
maybe it's outdated, or history has proved that nobody is going to use
it the way it was designed.

If I had it to do over again, I'd probably set up configurable per-node
sets of spare kernel metadata.  That way, you could say "make sure we
have enough memory reserved to add $FOO sections to node $BAR".  Use
that for the largest allocations, then depend on PF_MEMALLOC to get us
enough for the other little bits along the way.

Also, if this is a problem, it's going to be a problem for *EVERY* user
of memory hotplug, not just hyperv.  So, let's see it fixed generically
for *EVERY* user.  Something along the lines of:

1. Come up with an interface that specifies a default policy for
   newly-added memory sections.  Currently, added memory gets "added",
   but not "onlined", and the default should stay doing that.
2. Make sure that we at least WARN_ONCE() if someone tries to online an
   already-kernel-onlined memory section.  That way, if someone trips
   over this new policy, we have a _chance_ of explaining to them what
   is going on.  (A sketch of such a check follows below.)
3. Think about what we do in the failure case where we are able to
   "add", but fail to "online" in the kernel.  Do we tear the
   newly-added structures down and back out of the operation, or do
   we leave the memory added, but offline (what happens in the normal
   case now)?
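
For (2), a minimal sketch of the check in the online path (MEM_ONLINE is
the existing memory_block state; the exact spot is illustrative):

    if (WARN_ONCE(mem->state == MEM_ONLINE,
                  "memory section was already onlined by the kernel\n"))
        return -EBUSY;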


