find customers

2015-05-14 Thread David

How are you doing?
We can bring you more business.
We can find new customers for you from our email marketing.

Just reply back and I can go over options for you.

Thanks,
David
Email: vee...@aliyun.com


___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v2 05/17] compat_ioctl: move more drivers to generic_compat_ioctl_ptrarg

2018-09-14 Thread David Sterba
On Wed, Sep 12, 2018 at 05:08:52PM +0200, Arnd Bergmann wrote:
> The .ioctl and .compat_ioctl file operations have the same prototype so
> they can both point to the same function, which works great almost all
> the time when all the commands are compatible.
> 
> One exception is the s390 architecture, where a compat pointer is only
> 31 bit wide, and converting it into a 64-bit pointer requires calling
> compat_ptr(). Most drivers here will ever run in s390, but since we now
> have a generic helper for it, it's easy enough to use it consistently.
> 
> I double-checked all these drivers to ensure that all ioctl arguments
> are used as pointers or are ignored, but are not interpreted as integer
> values.
> 
> Signed-off-by: Arnd Bergmann 
> ---

>  fs/btrfs/super.c| 2 +-

Acked-by: David Sterba 
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFCv2 3/6] mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock

2018-09-17 Thread David Hildenbrand
Am 03.09.18 um 02:36 schrieb Rashmica:
> Hi David,
> 
> 
> On 21/08/18 20:44, David Hildenbrand wrote:
> 
>> There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
>> fix concurrent memory hot-add deadlock"), which tried to fix a possible
>> lock inversion reported and discussed in [1] due to the two locks
>>  a) device_lock()
>>  b) mem_hotplug_lock
>>
>> While add_memory() first takes b), followed by a) during
>> bus_probe_device(), onlining of memory from user space first took b),
>> followed by a), exposing a possible deadlock.
> 
> Do you mean "onlining of memory from user space first took a),
> followed by b)"? 

Very right, thanks.

> 
>> In [1], and it was decided to not make use of device_hotplug_lock, but
>> rather to enforce a locking order.
>>
>> The problems I spotted related to this:
>>
>> 1. Memory block device attributes: While .state first calls
>>mem_hotplug_begin() and the calls device_online() - which takes
>>device_lock() - .online does no longer call mem_hotplug_begin(), so
>>effectively calls online_pages() without mem_hotplug_lock.
>>
>> 2. device_online() should be called under device_hotplug_lock, however
>>onlining memory during add_memory() does not take care of that.
>>
>> In addition, I think there is also something wrong about the locking in
>>
>> 3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
>>without locks. This was introduced after 30467e0b3be. And skimming over
>>the code, I assume it could need some more care in regards to locking
>>(e.g. device_online() called without device_hotplug_lock - but I'll
>>not touch that for now).
> 
> Can you mention that you fixed this in later patches?

Sure!

> 
> 
> The series looks good to me. Feel free to add my reviewed-by:
> 
> Reviewed-by: Rashmica Gupta 
> 

Thanks, r-b only for this patch or all of the series?

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v2 0/2] hv_netvsc: associate VF and PV device by serial number

2018-09-17 Thread David Miller
From: Stephen Hemminger 
Date: Fri, 14 Sep 2018 12:54:55 -0700

> The Hyper-V implementation of PCI controller has concept of 32 bit serial 
> number
> (not to be confused with PCI-E serial number).  This value is sent in the 
> protocol
> from the host to indicate SR-IOV VF device is attached to a synthetic NIC.
> 
> Using the serial number (instead of MAC address) to associate the two devices
> avoids lots of potential problems when there are duplicate MAC addresses from
> tunnels or layered devices.
> 
> The patch set is broken into two parts, one is for the PCI controller
> and the other is for the netvsc device. Normally, these go through different
> trees but sending them together here for better review. The PCI changes
> were submitted previously, but the main review comment was "why do you
> need this?". This is why.
> 
> v2 - slot name can be shorter.
>  remove locking when creating pci_slots; see comment for explaination

Series applied, thanks.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v1 1/6] mm/memory_hotplug: make remove_memory() take the device_hotplug_lock

2018-09-18 Thread David Hildenbrand
remove_memory() is exported right now but requires the
device_hotplug_lock, which is not exported. So let's provide a variant
that takes the lock and only export that one.

The lock is already held in
arch/powerpc/platforms/pseries/hotplug-memory.c
drivers/acpi/acpi_memhotplug.c
So, let's use the locked variant.

The lock is not held (but taken in)
arch/powerpc/platforms/powernv/memtrace.c
So let's keep using the (now) locked variant.

Apart from that, there are not other users in the tree.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Rashmica Gupta 
Cc: Michael Neuling 
Cc: Balbir Singh 
Cc: Nathan Fontenot 
Cc: John Allen 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Joonsoo Kim 
Cc: Vlastimil Babka 
Cc: Pavel Tatashin 
Cc: Greg Kroah-Hartman 
Cc: Oscar Salvador 
Cc: YASUAKI ISHIMATSU 
Cc: Mathieu Malaterre 
Reviewed-by: Pavel Tatashin 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/platforms/powernv/memtrace.c   | 2 --
 arch/powerpc/platforms/pseries/hotplug-memory.c | 6 +++---
 drivers/acpi/acpi_memhotplug.c  | 2 +-
 include/linux/memory_hotplug.h  | 3 ++-
 mm/memory_hotplug.c | 9 -
 5 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index 51dc398ae3f7..8f1cd4f3bfd5 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -90,9 +90,7 @@ static bool memtrace_offline_pages(u32 nid, u64 start_pfn, 
u64 nr_pages)
walk_memory_range(start_pfn, end_pfn, (void *)MEM_OFFLINE,
  change_memblock_state);
 
-   lock_device_hotplug();
remove_memory(nid, start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
-   unlock_device_hotplug();
 
return true;
 }
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c1578f54c626..b3f54466e25f 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -334,7 +334,7 @@ static int pseries_remove_memblock(unsigned long base, 
unsigned int memblock_siz
nid = memory_add_physaddr_to_nid(base);
 
for (i = 0; i < sections_per_block; i++) {
-   remove_memory(nid, base, MIN_MEMORY_BLOCK_SIZE);
+   __remove_memory(nid, base, MIN_MEMORY_BLOCK_SIZE);
base += MIN_MEMORY_BLOCK_SIZE;
}
 
@@ -423,7 +423,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
block_sz = pseries_memory_block_size();
nid = memory_add_physaddr_to_nid(lmb->base_addr);
 
-   remove_memory(nid, lmb->base_addr, block_sz);
+   __remove_memory(nid, lmb->base_addr, block_sz);
 
/* Update memory regions for memory remove */
memblock_remove(lmb->base_addr, block_sz);
@@ -710,7 +710,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
rc = dlpar_online_lmb(lmb);
if (rc) {
-   remove_memory(nid, lmb->base_addr, block_sz);
+   __remove_memory(nid, lmb->base_addr, block_sz);
dlpar_remove_device_tree_lmb(lmb);
} else {
lmb->flags |= DRCONF_MEM_ASSIGNED;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 6b0d3ef7309c..811148415993 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct 
acpi_memory_device *mem_device)
nid = memory_add_physaddr_to_nid(info->start_addr);
 
acpi_unbind_memory_blocks(info);
-   remove_memory(nid, info->start_addr, info->length);
+   __remove_memory(nid, info->start_addr, info->length);
list_del(&info->list);
kfree(info);
}
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 34a28227068d..1f096852f479 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -301,6 +301,7 @@ extern bool is_mem_section_removable(unsigned long pfn, 
unsigned long nr_pages);
 extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern void remove_memory(int nid, u64 start, u64 size);
+extern void __remove_memory(int nid, u64 start, u64 size);
 
 #else
 static inline bool is_mem_section_removable(unsigned long pfn,
@@ -317,6 +318,7 @@ static inline int offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
 }
 
 static inline void remove_memory(int nid, u64 start, u64 size) {}
+static inline void __remove_memory(int nid, u64 start, u64 size) {}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 exte

[PATCH v1 2/6] mm/memory_hotplug: make add_memory() take the device_hotplug_lock

2018-09-18 Thread David Hildenbrand
add_memory() currently does not take the device_hotplug_lock, however
is aleady called under the lock from
arch/powerpc/platforms/pseries/hotplug-memory.c
drivers/acpi/acpi_memhotplug.c
to synchronize against CPU hot-remove and similar.

In general, we should hold the device_hotplug_lock when adding memory
to synchronize against online/offline request (e.g. from user space) -
which already resulted in lock inversions due to device_lock() and
mem_hotplug_lock - see 30467e0b3be ("mm, hotplug: fix concurrent memory
hot-add deadlock"). add_memory()/add_memory_resource() will create memory
block devices, so this really feels like the right thing to do.

Holding the device_hotplug_lock makes sure that a memory block device
can really only be accessed (e.g. via .online/.state) from user space,
once the memory has been fully added to the system.

The lock is not held yet in
drivers/xen/balloon.c
arch/powerpc/platforms/powernv/memtrace.c
drivers/s390/char/sclp_cmd.c
drivers/hv/hv_balloon.c
So, let's either use the locked variants or take the lock.

Don't export add_memory_resource(), as it once was exported to be used
by XEN, which is never built as a module. If somebody requires it, we
also have to export a locked variant (as device_hotplug_lock is never
exported).

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Nathan Fontenot 
Cc: John Allen 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Joonsoo Kim 
Cc: Vlastimil Babka 
Cc: Oscar Salvador 
Cc: Mathieu Malaterre 
Cc: Pavel Tatashin 
Cc: YASUAKI ISHIMATSU 
Reviewed-by: Pavel Tatashin 
Signed-off-by: David Hildenbrand 
---
 .../platforms/pseries/hotplug-memory.c|  2 +-
 drivers/acpi/acpi_memhotplug.c|  2 +-
 drivers/base/memory.c |  9 ++--
 drivers/xen/balloon.c |  3 +++
 include/linux/memory_hotplug.h|  1 +
 mm/memory_hotplug.c   | 22 ---
 6 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index b3f54466e25f..2e6f41dc103a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -702,7 +702,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
nid = memory_add_physaddr_to_nid(lmb->base_addr);
 
/* Add the memory */
-   rc = add_memory(nid, lmb->base_addr, block_sz);
+   rc = __add_memory(nid, lmb->base_addr, block_sz);
if (rc) {
dlpar_remove_device_tree_lmb(lmb);
return rc;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 811148415993..8fe0960ea572 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,7 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
 
-   result = add_memory(node, info->start_addr, info->length);
+   result = __add_memory(node, info->start_addr, info->length);
 
/*
 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 817320c7c4c1..40cac122ec73 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -519,15 +519,20 @@ memory_probe_store(struct device *dev, struct 
device_attribute *attr,
if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
return -EINVAL;
 
+   ret = lock_device_hotplug_sysfs();
+   if (ret)
+   goto out;
+
nid = memory_add_physaddr_to_nid(phys_addr);
-   ret = add_memory(nid, phys_addr,
-MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+   ret = __add_memory(nid, phys_addr,
+  MIN_MEMORY_BLOCK_SIZE * sections_per_block);
 
if (ret)
goto out;
 
ret = count;
 out:
+   unlock_device_hotplug();
return ret;
 }
 
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index e12bb256036f..6bab019a82b1 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -395,7 +395,10 @@ static enum bp_state reserve_additional_memory(void)
 * callers drop the mutex before trying again.
 */
mutex_unlock(&balloon_mutex);
+   /* add_memory_resource() requires the device_hotplug lock */
+   lock_device_hotplug();
rc = add_memory_resource(nid, resource, memhp_auto_online);
+   unlock_device_hotplug();
mutex_lock(&balloon_mutex);
 
if (rc) {
diff --git a/include

[PATCH v1 0/6] mm: online/offline_pages called w.o. mem_hotplug_lock

2018-09-18 Thread David Hildenbrand
Reading through the code and studying how mem_hotplug_lock is to be used,
I noticed that there are two places where we can end up calling
device_online()/device_offline() - online_pages()/offline_pages() without
the mem_hotplug_lock. And there are other places where we call
device_online()/device_offline() without the device_hotplug_lock.

While e.g.
echo "online" > /sys/devices/system/memory/memory9/state
is fine, e.g.
echo 1 > /sys/devices/system/memory/memory9/online
Will not take the mem_hotplug_lock. However the device_lock() and
device_hotplug_lock.

E.g. via memory_probe_store(), we can end up calling
add_memory()->online_pages() without the device_hotplug_lock. So we can
have concurrent callers in online_pages(). We e.g. touch in online_pages()
basically unprotected zone->present_pages then.

Looks like there is a longer history to that (see Patch #2 for details),
and fixing it to work the way it was intended is not really possible. We
would e.g. have to take the mem_hotplug_lock in device/base/core.c, which
sounds wrong.

Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
More details can be found in patch 3 and patch 6.

I propose the general rules (documentation added in patch 6):

1. add_memory/add_memory_resource() must only be called with
   device_hotplug_lock.
2. remove_memory() must only be called with device_hotplug_lock. This is
   already documented and holds for all callers.
3. device_online()/device_offline() must only be called with
   device_hotplug_lock. This is already documented and true for now in core
   code. Other callers (related to memory hotplug) have to be fixed up.
4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
   online_pages/offline_pages.

To me, this looks way cleaner than what we have right now (and easier to
verify). And looking at the documentation of remove_memory, using
lock_device_hotplug also for add_memory() feels natural.


RFCv2 -> v1:
- Dropped an unnecessary _ref from remove_memory() in patch #1
- Minor patch description fixes.
- Added rb's

RFC -> RFCv2:
- Don't export device_hotplug_lock, provide proper remove_memory/add_memory
  wrappers.
- Split up the patches a bit.
- Try to improve powernv memtrace locking
- Add some documentation for locking that matches my knowledge

David Hildenbrand (6):
  mm/memory_hotplug: make remove_memory() take the device_hotplug_lock
  mm/memory_hotplug: make add_memory() take the device_hotplug_lock
  mm/memory_hotplug: fix online/offline_pages called w.o.
mem_hotplug_lock
  powerpc/powernv: hold device_hotplug_lock when calling device_online()
  powerpc/powernv: hold device_hotplug_lock in memtrace_offline_pages()
  memory-hotplug.txt: Add some details about locking internals

 Documentation/memory-hotplug.txt  | 39 +++-
 arch/powerpc/platforms/powernv/memtrace.c | 14 +++--
 .../platforms/pseries/hotplug-memory.c|  8 +--
 drivers/acpi/acpi_memhotplug.c|  4 +-
 drivers/base/memory.c | 22 +++
 drivers/xen/balloon.c |  3 +
 include/linux/memory_hotplug.h|  4 +-
 mm/memory_hotplug.c   | 59 +++
 8 files changed, 115 insertions(+), 38 deletions(-)

-- 
2.17.1

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v1 3/6] mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock

2018-09-18 Thread David Hildenbrand
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock

While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.

In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.

The problems I spotted related to this:

1. Memory block device attributes: While .state first calls
   mem_hotplug_begin() and the calls device_online() - which takes
   device_lock() - .online does no longer call mem_hotplug_begin(), so
   effectively calls online_pages() without mem_hotplug_lock.

2. device_online() should be called under device_hotplug_lock, however
   onlining memory during add_memory() does not take care of that.

In addition, I think there is also something wrong about the locking in

3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
   without locks. This was introduced after 30467e0b3be. And skimming over
   the code, I assume it could need some more care in regards to locking
   (e.g. device_online() called without device_hotplug_lock. This will
   be addressed in the following patches.

Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()

We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().

Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).

[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html

This patch is partly based on a patch by Vitaly Kuznetsov.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Rashmica Gupta 
Cc: Michael Neuling 
Cc: Balbir Singh 
Cc: Kate Stewart 
Cc: Thomas Gleixner 
Cc: Philippe Ombredanne 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Vlastimil Babka 
Cc: Dan Williams 
Cc: Oscar Salvador 
Cc: YASUAKI ISHIMATSU 
Cc: Mathieu Malaterre 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rashmica Gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/base/memory.c | 13 +
 mm/memory_hotplug.c   | 28 
 2 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 40cac122ec73..0e5985682642 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -228,7 +228,6 @@ static bool pages_correctly_probed(unsigned long start_pfn)
 /*
  * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
  * OK to have direct references to sparsemem variables in here.
- * Must already be protected by mem_hotplug_begin().
  */
 static int
 memory_block_action(unsigned long phys_index, unsigned long action, int 
online_type)
@@ -294,7 +293,6 @@ static int memory_subsys_online(struct device *dev)
if (mem->online_type < 0)
mem->online_type = MMOP_ONLINE_KEEP;
 
-   /* Already under protection of mem_hotplug_begin() */
ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
 
/* clear online_type */
@@ -341,19 +339,11 @@ store_mem_state(struct device *dev,
goto err;
}
 
-   /*
-* Memory hotplug needs to hold mem_hotplug_begin() for probe to find
-* the correct memory block to online before doing device_online(dev),
-* which will take dev->mutex.  Take the lock early to prevent an
-* inversion, memory_subsys_online() callbacks will be implemented by
-* assuming it's already protected.
-*/
-   mem_hotplug_begin();
-
switch (online_type) {
case MMOP_ONLINE_KERNEL:
case MMOP_ONLINE_MOVABLE:
case MMOP_ONLINE_KEEP:
+   /* mem->online_type is protected by device_hotplug_lock */
mem->online_type = online_type;
ret = device_online(&mem->dev);
break;
@@ -364,7 +354,6 @@ store_mem_state(struct device *dev,
ret = -EINVAL; /* should never happen */
}
 
-   mem_hotplug_done();
 err:
unlock_device_hotplug();
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ef5444145c88..497e9315ca6f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -881,7 +881,6 @@ static 

[PATCH v1 6/6] memory-hotplug.txt: Add some details about locking internals

2018-09-18 Thread David Hildenbrand
Let's document the magic a bit, especially why device_hotplug_lock is
required when adding/removing memory and how it all play together with
requests to online/offline memory from user space.

Cc: Jonathan Corbet 
Cc: Michal Hocko 
Cc: Andrew Morton 
Reviewed-by: Pavel Tatashin 
Signed-off-by: David Hildenbrand 
---
 Documentation/memory-hotplug.txt | 39 +++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 7f49ebf3ddb2..03aaad7d7373 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -3,7 +3,7 @@ Memory Hotplug
 ==
 
 :Created:  Jul 28 2007
-:Updated: Add description of notifier of memory hotplug:   Oct 11 2007
+:Updated: Add some details about locking internals:Aug 20 2018
 
 This document is about memory hotplug including how-to-use and current status.
 Because Memory Hotplug is still under development, contents of this text will
@@ -495,6 +495,43 @@ further processing of the notification queue.
 
 NOTIFY_STOP stops further processing of the notification queue.
 
+
+Locking Internals
+=
+
+When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
+the device_hotplug_lock should be held to:
+
+- synchronize against online/offline requests (e.g. via sysfs). This way, 
memory
+  block devices can only be accessed (.online/.state attributes) by user
+  space once memory has been fully added. And when removing memory, we
+  know nobody is in critical sections.
+- synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
+
+Especially, there is a possible lock inversion that is avoided using
+device_hotplug_lock when adding memory and user space tries to online that
+memory faster than expected:
+
+- device_online() will first take the device_lock(), followed by
+  mem_hotplug_lock
+- add_memory_resource() will first take the mem_hotplug_lock, followed by
+  the device_lock() (while creating the devices, during bus_add_device()).
+
+As the device is visible to user space before taking the device_lock(), this
+can result in a lock inversion.
+
+onlining/offlining of memory should be done via device_online()/
+device_offline() - to make sure it is properly synchronized to actions
+via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type)
+
+When adding/removing/onlining/offlining memory or adding/removing
+heterogeneous/device memory, we should always hold the mem_hotplug_lock to
+serialise memory hotplug (e.g. access to global/zone variables).
+
+In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) allows
+for a quite efficient get_online_mems/put_online_mems implementation.
+
+
 Future Work
 ===
 
-- 
2.17.1

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v1 5/6] powerpc/powernv: hold device_hotplug_lock in memtrace_offline_pages()

2018-09-18 Thread David Hildenbrand
Let's perform all checking + offlining + removing under
device_hotplug_lock, so nobody can mess with these devices via
sysfs concurrently.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Rashmica Gupta 
Cc: Balbir Singh 
Cc: Michael Neuling 
Reviewed-by: Pavel Tatashin 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/platforms/powernv/memtrace.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index ef7181d4fe68..473e59842ec5 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -74,9 +74,13 @@ static bool memtrace_offline_pages(u32 nid, u64 start_pfn, 
u64 nr_pages)
 {
u64 end_pfn = start_pfn + nr_pages - 1;
 
+   lock_device_hotplug();
+
if (walk_memory_range(start_pfn, end_pfn, NULL,
-   check_memblock_online))
+   check_memblock_online)) {
+   unlock_device_hotplug();
return false;
+   }
 
walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE,
  change_memblock_state);
@@ -84,14 +88,16 @@ static bool memtrace_offline_pages(u32 nid, u64 start_pfn, 
u64 nr_pages)
if (offline_pages(start_pfn, nr_pages)) {
walk_memory_range(start_pfn, end_pfn, (void *)MEM_ONLINE,
  change_memblock_state);
+   unlock_device_hotplug();
return false;
}
 
walk_memory_range(start_pfn, end_pfn, (void *)MEM_OFFLINE,
  change_memblock_state);
 
-   remove_memory(nid, start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
+   __remove_memory(nid, start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
 
+   unlock_device_hotplug();
return true;
 }
 
-- 
2.17.1

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v1 4/6] powerpc/powernv: hold device_hotplug_lock when calling device_online()

2018-09-18 Thread David Hildenbrand
device_online() should be called with device_hotplug_lock() held.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Rashmica Gupta 
Cc: Balbir Singh 
Cc: Michael Neuling 
Reviewed-by: Pavel Tatashin 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/platforms/powernv/memtrace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index 8f1cd4f3bfd5..ef7181d4fe68 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -229,9 +229,11 @@ static int memtrace_online(void)
 * we need to online the memory ourselves.
 */
if (!memhp_auto_online) {
+   lock_device_hotplug();
walk_memory_range(PFN_DOWN(ent->start),
  PFN_UP(ent->start + ent->size - 1),
  NULL, online_mem_block);
+   unlock_device_hotplug();
}
 
/*
-- 
2.17.1

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v1 0/6] mm: online/offline_pages called w.o. mem_hotplug_lock

2018-09-19 Thread David Hildenbrand
Am 19.09.18 um 03:22 schrieb Balbir Singh:
> On Tue, Sep 18, 2018 at 01:48:16PM +0200, David Hildenbrand wrote:
>> Reading through the code and studying how mem_hotplug_lock is to be used,
>> I noticed that there are two places where we can end up calling
>> device_online()/device_offline() - online_pages()/offline_pages() without
>> the mem_hotplug_lock. And there are other places where we call
>> device_online()/device_offline() without the device_hotplug_lock.
>>
>> While e.g.
>>  echo "online" > /sys/devices/system/memory/memory9/state
>> is fine, e.g.
>>  echo 1 > /sys/devices/system/memory/memory9/online
>> Will not take the mem_hotplug_lock. However the device_lock() and
>> device_hotplug_lock.
>>
>> E.g. via memory_probe_store(), we can end up calling
>> add_memory()->online_pages() without the device_hotplug_lock. So we can
>> have concurrent callers in online_pages(). We e.g. touch in online_pages()
>> basically unprotected zone->present_pages then.
>>
>> Looks like there is a longer history to that (see Patch #2 for details),
>> and fixing it to work the way it was intended is not really possible. We
>> would e.g. have to take the mem_hotplug_lock in device/base/core.c, which
>> sounds wrong.
>>
>> Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
>> More details can be found in patch 3 and patch 6.
>>
>> I propose the general rules (documentation added in patch 6):
>>
>> 1. add_memory/add_memory_resource() must only be called with
>>device_hotplug_lock.
>> 2. remove_memory() must only be called with device_hotplug_lock. This is
>>already documented and holds for all callers.
>> 3. device_online()/device_offline() must only be called with
>>device_hotplug_lock. This is already documented and true for now in core
>>code. Other callers (related to memory hotplug) have to be fixed up.
>> 4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
>>online_pages/offline_pages.
>>
>> To me, this looks way cleaner than what we have right now (and easier to
>> verify). And looking at the documentation of remove_memory, using
>> lock_device_hotplug also for add_memory() feels natural.
>>
> 
> That seems reasonable, but also implies that device_online() would hold
> back add/remove memory, could you please also document what mode
> read/write the locks need to be held? For example can the device_hotplug_lock
> be held in read mode while add/remove memory via (mem_hotplug_lock) is held
> in write mode?

device_hotplug_lock is an ordinary mutex. So no option there.

Only mem_hotplug_lock is a per CPU RW mutex. And as of now it only
exists to not require get_online_mems()/put_online_mems() to take the
device_hotplug_lock. Which is perfectly valid, because these users only
care about memory (not any other devices) not suddenly vanish. And that
RW lock makes things fast.

Any modifications (online/offline/add/remove) require the
mem_hotplug_lock in write.

I can add some more details to documentation in patch #6.

"... we should always hold the mem_hotplug_lock (via
mem_hotplug_begin/mem_hotplug_done) in write mode to serialize memory
hotplug" ..."

"In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in
read mode allows for a quite efficient get_online_mems/put_online_mems
implementation, so code accessing memory can protect from that memory
vanishing."

Would that work for you?

Thanks!

> 
> Balbir Singh.
>  
> 


-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH net-next 00/22] net: fix return type of ndo_start_xmit function

2018-09-20 Thread David Miller
From: YueHaibing 
Date: Thu, 20 Sep 2018 20:32:44 +0800

> The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
> which is a typedef for an enum type, so make sure the implementation in
> this driver has returns 'netdev_tx_t' value, and change the function
> return type to netdev_tx_t.

I would advise you not to send so many of these changes as a group.

If one of the patches needs feedback addressed, which is already the
case, you will have to resubmit the entire series all over again with
the fixes.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH net-next,v2,0/3] hv_netvsc: Support LRO/RSC in the vSwitch

2018-09-22 Thread David Miller
From: Haiyang Zhang 
Date: Fri, 21 Sep 2018 18:20:34 +

> From: Haiyang Zhang 
> 
> The patch adds support for LRO/RSC in the vSwitch feature. It reduces
> the per packet processing overhead by coalescing multiple TCP segments
> when possible. The feature is enabled by default on VMs running on
> Windows Server 2019 and later.
> 
> The patch set also adds ethtool command handler and documents.

Series applied, thank you.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v2 1/6] mm/memory_hotplug: make remove_memory() take the device_hotplug_lock

2018-09-25 Thread David Hildenbrand
remove_memory() is exported right now but requires the
device_hotplug_lock, which is not exported. So let's provide a variant
that takes the lock and only export that one.

The lock is already held in
arch/powerpc/platforms/pseries/hotplug-memory.c
drivers/acpi/acpi_memhotplug.c
arch/powerpc/platforms/powernv/memtrace.c

Apart from that, there are not other users in the tree.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Rashmica Gupta 
Cc: Michael Neuling 
Cc: Balbir Singh 
Cc: Nathan Fontenot 
Cc: John Allen 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Joonsoo Kim 
Cc: Vlastimil Babka 
Cc: Pavel Tatashin 
Cc: Greg Kroah-Hartman 
Cc: Oscar Salvador 
Cc: YASUAKI ISHIMATSU 
Cc: Mathieu Malaterre 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rafael J. Wysocki 
Reviewed-by: Rashmica Gupta 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/platforms/powernv/memtrace.c   | 2 +-
 arch/powerpc/platforms/pseries/hotplug-memory.c | 6 +++---
 drivers/acpi/acpi_memhotplug.c  | 2 +-
 include/linux/memory_hotplug.h  | 3 ++-
 mm/memory_hotplug.c | 9 -
 5 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index a29fdf8a2e56..773623f6bfb1 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -121,7 +121,7 @@ static u64 memtrace_alloc_node(u32 nid, u64 size)
lock_device_hotplug();
end_pfn = base_pfn + nr_pages;
for (pfn = base_pfn; pfn < end_pfn; pfn += bytes>> 
PAGE_SHIFT) {
-   remove_memory(nid, pfn << PAGE_SHIFT, bytes);
+   __remove_memory(nid, pfn << PAGE_SHIFT, bytes);
}
unlock_device_hotplug();
return base_pfn << PAGE_SHIFT;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 9a15d39995e5..dd0264c43f3e 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -305,7 +305,7 @@ static int pseries_remove_memblock(unsigned long base, 
unsigned int memblock_siz
nid = memory_add_physaddr_to_nid(base);
 
for (i = 0; i < sections_per_block; i++) {
-   remove_memory(nid, base, MIN_MEMORY_BLOCK_SIZE);
+   __remove_memory(nid, base, MIN_MEMORY_BLOCK_SIZE);
base += MIN_MEMORY_BLOCK_SIZE;
}
 
@@ -394,7 +394,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
block_sz = pseries_memory_block_size();
nid = memory_add_physaddr_to_nid(lmb->base_addr);
 
-   remove_memory(nid, lmb->base_addr, block_sz);
+   __remove_memory(nid, lmb->base_addr, block_sz);
 
/* Update memory regions for memory remove */
memblock_remove(lmb->base_addr, block_sz);
@@ -681,7 +681,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
rc = dlpar_online_lmb(lmb);
if (rc) {
-   remove_memory(nid, lmb->base_addr, block_sz);
+   __remove_memory(nid, lmb->base_addr, block_sz);
invalidate_lmb_associativity_index(lmb);
} else {
lmb->flags |= DRCONF_MEM_ASSIGNED;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 6b0d3ef7309c..811148415993 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct 
acpi_memory_device *mem_device)
nid = memory_add_physaddr_to_nid(info->start_addr);
 
acpi_unbind_memory_blocks(info);
-   remove_memory(nid, info->start_addr, info->length);
+   __remove_memory(nid, info->start_addr, info->length);
list_del(&info->list);
kfree(info);
}
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 34a28227068d..1f096852f479 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -301,6 +301,7 @@ extern bool is_mem_section_removable(unsigned long pfn, 
unsigned long nr_pages);
 extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern void remove_memory(int nid, u64 start, u64 size);
+extern void __remove_memory(int nid, u64 start, u64 size);
 
 #else
 static inline bool is_mem_section_removable(unsigned long pfn,
@@ -317,6 +318,7 @@ static inline int offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
 }
 
 static inline void remove_memory(int nid, u64 start, u

[PATCH v2 0/6] mm: online/offline_pages called w.o. mem_hotplug_lock

2018-09-25 Thread David Hildenbrand
Reading through the code and studying how mem_hotplug_lock is to be used,
I noticed that there are two places where we can end up calling
device_online()/device_offline() - online_pages()/offline_pages() without
the mem_hotplug_lock. And there are other places where we call
device_online()/device_offline() without the device_hotplug_lock.

While e.g.
echo "online" > /sys/devices/system/memory/memory9/state
is fine, e.g.
echo 1 > /sys/devices/system/memory/memory9/online
Will not take the mem_hotplug_lock. However the device_lock() and
device_hotplug_lock.

E.g. via memory_probe_store(), we can end up calling
add_memory()->online_pages() without the device_hotplug_lock. So we can
have concurrent callers in online_pages(). We e.g. touch in online_pages()
basically unprotected zone->present_pages then.

Looks like there is a longer history to that (see Patch #2 for details),
and fixing it to work the way it was intended is not really possible. We
would e.g. have to take the mem_hotplug_lock in device/base/core.c, which
sounds wrong.

Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
More details can be found in patch 3 and patch 6.

I propose the general rules (documentation added in patch 6):

1. add_memory/add_memory_resource() must only be called with
   device_hotplug_lock.
2. remove_memory() must only be called with device_hotplug_lock. This is
   already documented and holds for all callers.
3. device_online()/device_offline() must only be called with
   device_hotplug_lock. This is already documented and true for now in core
   code. Other callers (related to memory hotplug) have to be fixed up.
4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
   online_pages/offline_pages.

To me, this looks way cleaner than what we have right now (and easier to
verify). And looking at the documentation of remove_memory, using
lock_device_hotplug also for add_memory() feels natural.


v1 -> v2:
- Upstream changes in powerpc/powernv code required modifications to
  patch #1, #4 and #5.
- Minor patch description changes.
- Added more locking details in patch #6.
- Added rb's

RFCv2 -> v1:
- Dropped an unnecessary _ref from remove_memory() in patch #1
- Minor patch description fixes.
- Added rb's

RFC -> RFCv2:
- Don't export device_hotplug_lock, provide proper remove_memory/add_memory
  wrappers.
- Split up the patches a bit.
- Try to improve powernv memtrace locking
- Add some documentation for locking that matches my knowledge

David Hildenbrand (6):
  mm/memory_hotplug: make remove_memory() take the device_hotplug_lock
  mm/memory_hotplug: make add_memory() take the device_hotplug_lock
  mm/memory_hotplug: fix online/offline_pages called w.o.
mem_hotplug_lock
  powerpc/powernv: hold device_hotplug_lock when calling device_online()
  powerpc/powernv: hold device_hotplug_lock when calling
memtrace_offline_pages()
  memory-hotplug.txt: Add some details about locking internals

 Documentation/memory-hotplug.txt  | 42 -
 arch/powerpc/platforms/powernv/memtrace.c |  8 ++-
 .../platforms/pseries/hotplug-memory.c|  8 +--
 drivers/acpi/acpi_memhotplug.c|  4 +-
 drivers/base/memory.c | 22 +++
 drivers/xen/balloon.c |  3 +
 include/linux/memory_hotplug.h|  4 +-
 mm/memory_hotplug.c   | 59 +++
 8 files changed, 114 insertions(+), 36 deletions(-)

-- 
2.17.1

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v2 2/6] mm/memory_hotplug: make add_memory() take the device_hotplug_lock

2018-09-25 Thread David Hildenbrand
add_memory() currently does not take the device_hotplug_lock, however
is aleady called under the lock from
arch/powerpc/platforms/pseries/hotplug-memory.c
drivers/acpi/acpi_memhotplug.c
to synchronize against CPU hot-remove and similar.

In general, we should hold the device_hotplug_lock when adding memory
to synchronize against online/offline request (e.g. from user space) -
which already resulted in lock inversions due to device_lock() and
mem_hotplug_lock - see 30467e0b3be ("mm, hotplug: fix concurrent memory
hot-add deadlock"). add_memory()/add_memory_resource() will create memory
block devices, so this really feels like the right thing to do.

Holding the device_hotplug_lock makes sure that a memory block device
can really only be accessed (e.g. via .online/.state) from user space,
once the memory has been fully added to the system.

The lock is not held yet in
drivers/xen/balloon.c
arch/powerpc/platforms/powernv/memtrace.c
drivers/s390/char/sclp_cmd.c
drivers/hv/hv_balloon.c
So, let's either use the locked variants or take the lock.

Don't export add_memory_resource(), as it once was exported to be used
by XEN, which is never built as a module. If somebody requires it, we
also have to export a locked variant (as device_hotplug_lock is never
exported).

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Nathan Fontenot 
Cc: John Allen 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Joonsoo Kim 
Cc: Vlastimil Babka 
Cc: Oscar Salvador 
Cc: Mathieu Malaterre 
Cc: Pavel Tatashin 
Cc: YASUAKI ISHIMATSU 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rafael J. Wysocki 
Reviewed-by: Rashmica Gupta 
Signed-off-by: David Hildenbrand 
---
 .../platforms/pseries/hotplug-memory.c|  2 +-
 drivers/acpi/acpi_memhotplug.c|  2 +-
 drivers/base/memory.c |  9 ++--
 drivers/xen/balloon.c |  3 +++
 include/linux/memory_hotplug.h|  1 +
 mm/memory_hotplug.c   | 22 ---
 6 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index dd0264c43f3e..d26a771d985e 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -673,7 +673,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
nid = memory_add_physaddr_to_nid(lmb->base_addr);
 
/* Add the memory */
-   rc = add_memory(nid, lmb->base_addr, block_sz);
+   rc = __add_memory(nid, lmb->base_addr, block_sz);
if (rc) {
invalidate_lmb_associativity_index(lmb);
return rc;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 811148415993..8fe0960ea572 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,7 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
 
-   result = add_memory(node, info->start_addr, info->length);
+   result = __add_memory(node, info->start_addr, info->length);
 
/*
 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 817320c7c4c1..40cac122ec73 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -519,15 +519,20 @@ memory_probe_store(struct device *dev, struct 
device_attribute *attr,
if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
return -EINVAL;
 
+   ret = lock_device_hotplug_sysfs();
+   if (ret)
+   goto out;
+
nid = memory_add_physaddr_to_nid(phys_addr);
-   ret = add_memory(nid, phys_addr,
-MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+   ret = __add_memory(nid, phys_addr,
+  MIN_MEMORY_BLOCK_SIZE * sections_per_block);
 
if (ret)
goto out;
 
ret = count;
 out:
+   unlock_device_hotplug();
return ret;
 }
 
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index a3f5cbfcd4a1..fdfc64f5acea 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -395,7 +395,10 @@ static enum bp_state reserve_additional_memory(void)
 * callers drop the mutex before trying again.
 */
mutex_unlock(&balloon_mutex);
+   /* add_memory_resource() requires the device_hotplug lock */
+   lock_device_hotplug();
rc = add_memory_resource(nid, resource, memhp_auto_online);
+   unlock_device_hotplug();
mut

[PATCH v2 6/6] memory-hotplug.txt: Add some details about locking internals

2018-09-25 Thread David Hildenbrand
Let's document the magic a bit, especially why device_hotplug_lock is
required when adding/removing memory and how it all play together with
requests to online/offline memory from user space.

Cc: Jonathan Corbet 
Cc: Michal Hocko 
Cc: Andrew Morton 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rashmica Gupta 
Signed-off-by: David Hildenbrand 
---
 Documentation/memory-hotplug.txt | 42 +++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 7f49ebf3ddb2..ce4faa5530fa 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -3,7 +3,7 @@ Memory Hotplug
 ==
 
 :Created:  Jul 28 2007
-:Updated: Add description of notifier of memory hotplug:   Oct 11 2007
+:Updated: Add some details about locking internals:Aug 20 2018
 
 This document is about memory hotplug including how-to-use and current status.
 Because Memory Hotplug is still under development, contents of this text will
@@ -495,6 +495,46 @@ further processing of the notification queue.
 
 NOTIFY_STOP stops further processing of the notification queue.
 
+
+Locking Internals
+=
+
+When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
+the device_hotplug_lock should be held to:
+
+- synchronize against online/offline requests (e.g. via sysfs). This way, 
memory
+  block devices can only be accessed (.online/.state attributes) by user
+  space once memory has been fully added. And when removing memory, we
+  know nobody is in critical sections.
+- synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
+
+Especially, there is a possible lock inversion that is avoided using
+device_hotplug_lock when adding memory and user space tries to online that
+memory faster than expected:
+
+- device_online() will first take the device_lock(), followed by
+  mem_hotplug_lock
+- add_memory_resource() will first take the mem_hotplug_lock, followed by
+  the device_lock() (while creating the devices, during bus_add_device()).
+
+As the device is visible to user space before taking the device_lock(), this
+can result in a lock inversion.
+
+onlining/offlining of memory should be done via device_online()/
+device_offline() - to make sure it is properly synchronized to actions
+via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type)
+
+When adding/removing/onlining/offlining memory or adding/removing
+heterogeneous/device memory, we should always hold the mem_hotplug_lock in
+write mode to serialise memory hotplug (e.g. access to global/zone
+variables).
+
+In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
+mode allows for a quite efficient get_online_mems/put_online_mems
+implementation, so code accessing memory can protect from that memory
+vanishing.
+
+
 Future Work
 ===
 
-- 
2.17.1

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v2 5/6] powerpc/powernv: hold device_hotplug_lock when calling memtrace_offline_pages()

2018-09-25 Thread David Hildenbrand
Let's perform all checking + offlining + removing under
device_hotplug_lock, so nobody can mess with these devices via
sysfs concurrently.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Rashmica Gupta 
Cc: Balbir Singh 
Cc: Michael Neuling 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rashmica Gupta 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/platforms/powernv/memtrace.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index fdd48f1a39f7..d84d09c56af9 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -70,6 +70,7 @@ static int change_memblock_state(struct memory_block *mem, 
void *arg)
return 0;
 }
 
+/* called with device_hotplug_lock held */
 static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
 {
u64 end_pfn = start_pfn + nr_pages - 1;
@@ -111,6 +112,7 @@ static u64 memtrace_alloc_node(u32 nid, u64 size)
end_pfn = round_down(end_pfn - nr_pages, nr_pages);
 
for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -= nr_pages) {
+   lock_device_hotplug();
if (memtrace_offline_pages(nid, base_pfn, nr_pages) == true) {
/*
 * Remove memory in memory block size chunks so that
@@ -118,7 +120,6 @@ static u64 memtrace_alloc_node(u32 nid, u64 size)
 * we never try to remove memory that spans two iomem
 * resources.
 */
-   lock_device_hotplug();
end_pfn = base_pfn + nr_pages;
for (pfn = base_pfn; pfn < end_pfn; pfn += bytes>> 
PAGE_SHIFT) {
__remove_memory(nid, pfn << PAGE_SHIFT, bytes);
@@ -126,6 +127,7 @@ static u64 memtrace_alloc_node(u32 nid, u64 size)
unlock_device_hotplug();
return base_pfn << PAGE_SHIFT;
}
+   unlock_device_hotplug();
}
 
return 0;
-- 
2.17.1

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v2 3/6] mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock

2018-09-25 Thread David Hildenbrand
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock

While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.

In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.

The problems I spotted related to this:

1. Memory block device attributes: While .state first calls
   mem_hotplug_begin() and the calls device_online() - which takes
   device_lock() - .online does no longer call mem_hotplug_begin(), so
   effectively calls online_pages() without mem_hotplug_lock.

2. device_online() should be called under device_hotplug_lock, however
   onlining memory during add_memory() does not take care of that.

In addition, I think there is also something wrong about the locking in

3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
   without locks. This was introduced after 30467e0b3be. And skimming over
   the code, I assume it could need some more care in regards to locking
   (e.g. device_online() called without device_hotplug_lock. This will
   be addressed in the following patches.

Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()

We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().

Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).

[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html

This patch is partly based on a patch by Vitaly Kuznetsov.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Rashmica Gupta 
Cc: Michael Neuling 
Cc: Balbir Singh 
Cc: Kate Stewart 
Cc: Thomas Gleixner 
Cc: Philippe Ombredanne 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Vlastimil Babka 
Cc: Dan Williams 
Cc: Oscar Salvador 
Cc: YASUAKI ISHIMATSU 
Cc: Mathieu Malaterre 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rashmica Gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/base/memory.c | 13 +
 mm/memory_hotplug.c   | 28 
 2 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 40cac122ec73..0e5985682642 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -228,7 +228,6 @@ static bool pages_correctly_probed(unsigned long start_pfn)
 /*
  * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
  * OK to have direct references to sparsemem variables in here.
- * Must already be protected by mem_hotplug_begin().
  */
 static int
 memory_block_action(unsigned long phys_index, unsigned long action, int 
online_type)
@@ -294,7 +293,6 @@ static int memory_subsys_online(struct device *dev)
if (mem->online_type < 0)
mem->online_type = MMOP_ONLINE_KEEP;
 
-   /* Already under protection of mem_hotplug_begin() */
ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
 
/* clear online_type */
@@ -341,19 +339,11 @@ store_mem_state(struct device *dev,
goto err;
}
 
-   /*
-* Memory hotplug needs to hold mem_hotplug_begin() for probe to find
-* the correct memory block to online before doing device_online(dev),
-* which will take dev->mutex.  Take the lock early to prevent an
-* inversion, memory_subsys_online() callbacks will be implemented by
-* assuming it's already protected.
-*/
-   mem_hotplug_begin();
-
switch (online_type) {
case MMOP_ONLINE_KERNEL:
case MMOP_ONLINE_MOVABLE:
case MMOP_ONLINE_KEEP:
+   /* mem->online_type is protected by device_hotplug_lock */
mem->online_type = online_type;
ret = device_online(&mem->dev);
break;
@@ -364,7 +354,6 @@ store_mem_state(struct device *dev,
ret = -EINVAL; /* should never happen */
}
 
-   mem_hotplug_done();
 err:
unlock_device_hotplug();
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index affb03e0dfef..d4c7e42e46f3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -860,7 +860,6 @@ static 

[PATCH v2 4/6] powerpc/powernv: hold device_hotplug_lock when calling device_online()

2018-09-25 Thread David Hildenbrand
device_online() should be called with device_hotplug_lock() held.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Rashmica Gupta 
Cc: Balbir Singh 
Cc: Michael Neuling 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rashmica Gupta 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/platforms/powernv/memtrace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index 773623f6bfb1..fdd48f1a39f7 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -242,9 +242,11 @@ static int memtrace_online(void)
 * we need to online the memory ourselves.
 */
if (!memhp_auto_online) {
+   lock_device_hotplug();
walk_memory_range(PFN_DOWN(ent->start),
  PFN_UP(ent->start + ent->size - 1),
  NULL, online_mem_block);
+   unlock_device_hotplug();
}
 
/*
-- 
2.17.1

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v2 5/6] powerpc/powernv: hold device_hotplug_lock when calling memtrace_offline_pages()

2018-09-26 Thread David Hildenbrand
On 25/09/2018 14:15, Balbir Singh wrote:
> On Tue, Sep 25, 2018 at 11:14:56AM +0200, David Hildenbrand wrote:
>> Let's perform all checking + offlining + removing under
>> device_hotplug_lock, so nobody can mess with these devices via
>> sysfs concurrently.
>>
>> Cc: Benjamin Herrenschmidt 
>> Cc: Paul Mackerras 
>> Cc: Michael Ellerman 
>> Cc: Rashmica Gupta 
>> Cc: Balbir Singh 
>> Cc: Michael Neuling 
>> Reviewed-by: Pavel Tatashin 
>> Reviewed-by: Rashmica Gupta 
>> Signed-off-by: David Hildenbrand 
>> ---
>>  arch/powerpc/platforms/powernv/memtrace.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
>> b/arch/powerpc/platforms/powernv/memtrace.c
>> index fdd48f1a39f7..d84d09c56af9 100644
>> --- a/arch/powerpc/platforms/powernv/memtrace.c
>> +++ b/arch/powerpc/platforms/powernv/memtrace.c
>> @@ -70,6 +70,7 @@ static int change_memblock_state(struct memory_block *mem, 
>> void *arg)
>>  return 0;
>>  }
>>  
>> +/* called with device_hotplug_lock held */
>>  static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
>>  {
>>  u64 end_pfn = start_pfn + nr_pages - 1;
>> @@ -111,6 +112,7 @@ static u64 memtrace_alloc_node(u32 nid, u64 size)
>>  end_pfn = round_down(end_pfn - nr_pages, nr_pages);
>>  
>>  for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -= nr_pages) {
>> +lock_device_hotplug();
> 
> Why not grab the lock before the for loop? That way we can avoid bad cases 
> like a
> large node being scanned for a small number of pages (nr_pages). Ideally we 
> need
> a cond_resched() in the loop, but I guess offline_pages() has one.

Yes, it does.

I can move it out of the loop, thanks!

> 
> Acked-by: Balbir Singh 
> 


-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v3 3/6] mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock

2018-09-27 Thread David Hildenbrand
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock

While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.

In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.

The problems I spotted related to this:

1. Memory block device attributes: While .state first calls
   mem_hotplug_begin() and the calls device_online() - which takes
   device_lock() - .online does no longer call mem_hotplug_begin(), so
   effectively calls online_pages() without mem_hotplug_lock.

2. device_online() should be called under device_hotplug_lock, however
   onlining memory during add_memory() does not take care of that.

In addition, I think there is also something wrong about the locking in

3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
   without locks. This was introduced after 30467e0b3be. And skimming over
   the code, I assume it could need some more care in regards to locking
   (e.g. device_online() called without device_hotplug_lock. This will
   be addressed in the following patches.

Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()

We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().

Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).

[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html

This patch is partly based on a patch by Vitaly Kuznetsov.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Rashmica Gupta 
Cc: Michael Neuling 
Cc: Balbir Singh 
Cc: Kate Stewart 
Cc: Thomas Gleixner 
Cc: Philippe Ombredanne 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Vlastimil Babka 
Cc: Dan Williams 
Cc: Oscar Salvador 
Cc: YASUAKI ISHIMATSU 
Cc: Mathieu Malaterre 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rashmica Gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/base/memory.c | 13 +
 mm/memory_hotplug.c   | 28 
 2 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 40cac122ec73..0e5985682642 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -228,7 +228,6 @@ static bool pages_correctly_probed(unsigned long start_pfn)
 /*
  * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
  * OK to have direct references to sparsemem variables in here.
- * Must already be protected by mem_hotplug_begin().
  */
 static int
 memory_block_action(unsigned long phys_index, unsigned long action, int 
online_type)
@@ -294,7 +293,6 @@ static int memory_subsys_online(struct device *dev)
if (mem->online_type < 0)
mem->online_type = MMOP_ONLINE_KEEP;
 
-   /* Already under protection of mem_hotplug_begin() */
ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
 
/* clear online_type */
@@ -341,19 +339,11 @@ store_mem_state(struct device *dev,
goto err;
}
 
-   /*
-* Memory hotplug needs to hold mem_hotplug_begin() for probe to find
-* the correct memory block to online before doing device_online(dev),
-* which will take dev->mutex.  Take the lock early to prevent an
-* inversion, memory_subsys_online() callbacks will be implemented by
-* assuming it's already protected.
-*/
-   mem_hotplug_begin();
-
switch (online_type) {
case MMOP_ONLINE_KERNEL:
case MMOP_ONLINE_MOVABLE:
case MMOP_ONLINE_KEEP:
+   /* mem->online_type is protected by device_hotplug_lock */
mem->online_type = online_type;
ret = device_online(&mem->dev);
break;
@@ -364,7 +354,6 @@ store_mem_state(struct device *dev,
ret = -EINVAL; /* should never happen */
}
 
-   mem_hotplug_done();
 err:
unlock_device_hotplug();
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index affb03e0dfef..d4c7e42e46f3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -860,7 +860,6 @@ static 

[PATCH v3 2/6] mm/memory_hotplug: make add_memory() take the device_hotplug_lock

2018-09-27 Thread David Hildenbrand
add_memory() currently does not take the device_hotplug_lock, however
is aleady called under the lock from
arch/powerpc/platforms/pseries/hotplug-memory.c
drivers/acpi/acpi_memhotplug.c
to synchronize against CPU hot-remove and similar.

In general, we should hold the device_hotplug_lock when adding memory
to synchronize against online/offline request (e.g. from user space) -
which already resulted in lock inversions due to device_lock() and
mem_hotplug_lock - see 30467e0b3be ("mm, hotplug: fix concurrent memory
hot-add deadlock"). add_memory()/add_memory_resource() will create memory
block devices, so this really feels like the right thing to do.

Holding the device_hotplug_lock makes sure that a memory block device
can really only be accessed (e.g. via .online/.state) from user space,
once the memory has been fully added to the system.

The lock is not held yet in
drivers/xen/balloon.c
arch/powerpc/platforms/powernv/memtrace.c
drivers/s390/char/sclp_cmd.c
drivers/hv/hv_balloon.c
So, let's either use the locked variants or take the lock.

Don't export add_memory_resource(), as it once was exported to be used
by XEN, which is never built as a module. If somebody requires it, we
also have to export a locked variant (as device_hotplug_lock is never
exported).

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Nathan Fontenot 
Cc: John Allen 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Joonsoo Kim 
Cc: Vlastimil Babka 
Cc: Oscar Salvador 
Cc: Mathieu Malaterre 
Cc: Pavel Tatashin 
Cc: YASUAKI ISHIMATSU 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rafael J. Wysocki 
Reviewed-by: Rashmica Gupta 
Signed-off-by: David Hildenbrand 
---
 .../platforms/pseries/hotplug-memory.c|  2 +-
 drivers/acpi/acpi_memhotplug.c|  2 +-
 drivers/base/memory.c |  9 ++--
 drivers/xen/balloon.c |  3 +++
 include/linux/memory_hotplug.h|  1 +
 mm/memory_hotplug.c   | 22 ---
 6 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index dd0264c43f3e..d26a771d985e 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -673,7 +673,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
nid = memory_add_physaddr_to_nid(lmb->base_addr);
 
/* Add the memory */
-   rc = add_memory(nid, lmb->base_addr, block_sz);
+   rc = __add_memory(nid, lmb->base_addr, block_sz);
if (rc) {
invalidate_lmb_associativity_index(lmb);
return rc;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 811148415993..8fe0960ea572 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,7 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
 
-   result = add_memory(node, info->start_addr, info->length);
+   result = __add_memory(node, info->start_addr, info->length);
 
/*
 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 817320c7c4c1..40cac122ec73 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -519,15 +519,20 @@ memory_probe_store(struct device *dev, struct 
device_attribute *attr,
if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
return -EINVAL;
 
+   ret = lock_device_hotplug_sysfs();
+   if (ret)
+   goto out;
+
nid = memory_add_physaddr_to_nid(phys_addr);
-   ret = add_memory(nid, phys_addr,
-MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+   ret = __add_memory(nid, phys_addr,
+  MIN_MEMORY_BLOCK_SIZE * sections_per_block);
 
if (ret)
goto out;
 
ret = count;
 out:
+   unlock_device_hotplug();
return ret;
 }
 
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index a3f5cbfcd4a1..fdfc64f5acea 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -395,7 +395,10 @@ static enum bp_state reserve_additional_memory(void)
 * callers drop the mutex before trying again.
 */
mutex_unlock(&balloon_mutex);
+   /* add_memory_resource() requires the device_hotplug lock */
+   lock_device_hotplug();
rc = add_memory_resource(nid, resource, memhp_auto_online);
+   unlock_device_hotplug();
mut

[PATCH v3 1/6] mm/memory_hotplug: make remove_memory() take the device_hotplug_lock

2018-09-27 Thread David Hildenbrand
remove_memory() is exported right now but requires the
device_hotplug_lock, which is not exported. So let's provide a variant
that takes the lock and only export that one.

The lock is already held in
arch/powerpc/platforms/pseries/hotplug-memory.c
drivers/acpi/acpi_memhotplug.c
arch/powerpc/platforms/powernv/memtrace.c

Apart from that, there are not other users in the tree.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Rashmica Gupta 
Cc: Michael Neuling 
Cc: Balbir Singh 
Cc: Nathan Fontenot 
Cc: John Allen 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Joonsoo Kim 
Cc: Vlastimil Babka 
Cc: Pavel Tatashin 
Cc: Greg Kroah-Hartman 
Cc: Oscar Salvador 
Cc: YASUAKI ISHIMATSU 
Cc: Mathieu Malaterre 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rafael J. Wysocki 
Reviewed-by: Rashmica Gupta 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/platforms/powernv/memtrace.c   | 2 +-
 arch/powerpc/platforms/pseries/hotplug-memory.c | 6 +++---
 drivers/acpi/acpi_memhotplug.c  | 2 +-
 include/linux/memory_hotplug.h  | 3 ++-
 mm/memory_hotplug.c | 9 -
 5 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index a29fdf8a2e56..773623f6bfb1 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -121,7 +121,7 @@ static u64 memtrace_alloc_node(u32 nid, u64 size)
lock_device_hotplug();
end_pfn = base_pfn + nr_pages;
for (pfn = base_pfn; pfn < end_pfn; pfn += bytes>> 
PAGE_SHIFT) {
-   remove_memory(nid, pfn << PAGE_SHIFT, bytes);
+   __remove_memory(nid, pfn << PAGE_SHIFT, bytes);
}
unlock_device_hotplug();
return base_pfn << PAGE_SHIFT;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 9a15d39995e5..dd0264c43f3e 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -305,7 +305,7 @@ static int pseries_remove_memblock(unsigned long base, 
unsigned int memblock_siz
nid = memory_add_physaddr_to_nid(base);
 
for (i = 0; i < sections_per_block; i++) {
-   remove_memory(nid, base, MIN_MEMORY_BLOCK_SIZE);
+   __remove_memory(nid, base, MIN_MEMORY_BLOCK_SIZE);
base += MIN_MEMORY_BLOCK_SIZE;
}
 
@@ -394,7 +394,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
block_sz = pseries_memory_block_size();
nid = memory_add_physaddr_to_nid(lmb->base_addr);
 
-   remove_memory(nid, lmb->base_addr, block_sz);
+   __remove_memory(nid, lmb->base_addr, block_sz);
 
/* Update memory regions for memory remove */
memblock_remove(lmb->base_addr, block_sz);
@@ -681,7 +681,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
rc = dlpar_online_lmb(lmb);
if (rc) {
-   remove_memory(nid, lmb->base_addr, block_sz);
+   __remove_memory(nid, lmb->base_addr, block_sz);
invalidate_lmb_associativity_index(lmb);
} else {
lmb->flags |= DRCONF_MEM_ASSIGNED;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 6b0d3ef7309c..811148415993 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct 
acpi_memory_device *mem_device)
nid = memory_add_physaddr_to_nid(info->start_addr);
 
acpi_unbind_memory_blocks(info);
-   remove_memory(nid, info->start_addr, info->length);
+   __remove_memory(nid, info->start_addr, info->length);
list_del(&info->list);
kfree(info);
}
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 34a28227068d..1f096852f479 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -301,6 +301,7 @@ extern bool is_mem_section_removable(unsigned long pfn, 
unsigned long nr_pages);
 extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern void remove_memory(int nid, u64 start, u64 size);
+extern void __remove_memory(int nid, u64 start, u64 size);
 
 #else
 static inline bool is_mem_section_removable(unsigned long pfn,
@@ -317,6 +318,7 @@ static inline int offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
 }
 
 static inline void remove_memory(int nid, u64 start, u

[PATCH v3 0/6] mm: online/offline_pages called w.o. mem_hotplug_lock

2018-09-27 Thread David Hildenbrand
@Andrew, Only patch #5 changed (see change notes below). Thanks!


Reading through the code and studying how mem_hotplug_lock is to be used,
I noticed that there are two places where we can end up calling
device_online()/device_offline() - online_pages()/offline_pages() without
the mem_hotplug_lock. And there are other places where we call
device_online()/device_offline() without the device_hotplug_lock.

While e.g.
echo "online" > /sys/devices/system/memory/memory9/state
is fine, e.g.
echo 1 > /sys/devices/system/memory/memory9/online
Will not take the mem_hotplug_lock. However the device_lock() and
device_hotplug_lock.

E.g. via memory_probe_store(), we can end up calling
add_memory()->online_pages() without the device_hotplug_lock. So we can
have concurrent callers in online_pages(). We e.g. touch in online_pages()
basically unprotected zone->present_pages then.

Looks like there is a longer history to that (see Patch #2 for details),
and fixing it to work the way it was intended is not really possible. We
would e.g. have to take the mem_hotplug_lock in device/base/core.c, which
sounds wrong.

Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
More details can be found in patch 3 and patch 6.

I propose the general rules (documentation added in patch 6):

1. add_memory/add_memory_resource() must only be called with
   device_hotplug_lock.
2. remove_memory() must only be called with device_hotplug_lock. This is
   already documented and holds for all callers.
3. device_online()/device_offline() must only be called with
   device_hotplug_lock. This is already documented and true for now in core
   code. Other callers (related to memory hotplug) have to be fixed up.
4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
   online_pages/offline_pages.

To me, this looks way cleaner than what we have right now (and easier to
verify). And looking at the documentation of remove_memory, using
lock_device_hotplug also for add_memory() feels natural.


v2 -> v3:
- Take device_hotplug_lock outside of loop in patch #5
- Added Ack to patch #5

v1 -> v2:
- Upstream changes in powerpc/powernv code required modifications to
  patch #1, #4 and #5.
- Minor patch description changes.
- Added more locking details in patch #6.
- Added rb's

RFCv2 -> v1:
- Dropped an unnecessary _ref from remove_memory() in patch #1
- Minor patch description fixes.
- Added rb's

RFC -> RFCv2:
- Don't export device_hotplug_lock, provide proper remove_memory/add_memory
  wrappers.
- Split up the patches a bit.
- Try to improve powernv memtrace locking
- Add some documentation for locking that matches my knowledg

David Hildenbrand (6):
  mm/memory_hotplug: make remove_memory() take the device_hotplug_lock
  mm/memory_hotplug: make add_memory() take the device_hotplug_lock
  mm/memory_hotplug: fix online/offline_pages called w.o.
mem_hotplug_lock
  powerpc/powernv: hold device_hotplug_lock when calling device_online()
  powerpc/powernv: hold device_hotplug_lock when calling
memtrace_offline_pages()
  memory-hotplug.txt: Add some details about locking internals

 Documentation/memory-hotplug.txt  | 42 -
 arch/powerpc/platforms/powernv/memtrace.c |  8 ++-
 .../platforms/pseries/hotplug-memory.c|  8 +--
 drivers/acpi/acpi_memhotplug.c|  4 +-
 drivers/base/memory.c | 22 +++
 drivers/xen/balloon.c |  3 +
 include/linux/memory_hotplug.h|  4 +-
 mm/memory_hotplug.c   | 59 +++
 8 files changed, 114 insertions(+), 36 deletions(-)

-- 
2.17.1

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v3 5/6] powerpc/powernv: hold device_hotplug_lock when calling memtrace_offline_pages()

2018-09-27 Thread David Hildenbrand
Let's perform all checking + offlining + removing under
device_hotplug_lock, so nobody can mess with these devices via
sysfs concurrently.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Rashmica Gupta 
Cc: Balbir Singh 
Cc: Michael Neuling 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rashmica Gupta 
Acked-by: Balbir Singh 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/platforms/powernv/memtrace.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index fdd48f1a39f7..84d038ed3882 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -70,6 +70,7 @@ static int change_memblock_state(struct memory_block *mem, 
void *arg)
return 0;
 }
 
+/* called with device_hotplug_lock held */
 static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
 {
u64 end_pfn = start_pfn + nr_pages - 1;
@@ -110,6 +111,7 @@ static u64 memtrace_alloc_node(u32 nid, u64 size)
/* Trace memory needs to be aligned to the size */
end_pfn = round_down(end_pfn - nr_pages, nr_pages);
 
+   lock_device_hotplug();
for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -= nr_pages) {
if (memtrace_offline_pages(nid, base_pfn, nr_pages) == true) {
/*
@@ -118,7 +120,6 @@ static u64 memtrace_alloc_node(u32 nid, u64 size)
 * we never try to remove memory that spans two iomem
 * resources.
 */
-   lock_device_hotplug();
end_pfn = base_pfn + nr_pages;
for (pfn = base_pfn; pfn < end_pfn; pfn += bytes>> 
PAGE_SHIFT) {
__remove_memory(nid, pfn << PAGE_SHIFT, bytes);
@@ -127,6 +128,7 @@ static u64 memtrace_alloc_node(u32 nid, u64 size)
return base_pfn << PAGE_SHIFT;
}
}
+   unlock_device_hotplug();
 
return 0;
 }
-- 
2.17.1

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v3 4/6] powerpc/powernv: hold device_hotplug_lock when calling device_online()

2018-09-27 Thread David Hildenbrand
device_online() should be called with device_hotplug_lock() held.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Rashmica Gupta 
Cc: Balbir Singh 
Cc: Michael Neuling 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rashmica Gupta 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/platforms/powernv/memtrace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index 773623f6bfb1..fdd48f1a39f7 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -242,9 +242,11 @@ static int memtrace_online(void)
 * we need to online the memory ourselves.
 */
if (!memhp_auto_online) {
+   lock_device_hotplug();
walk_memory_range(PFN_DOWN(ent->start),
  PFN_UP(ent->start + ent->size - 1),
  NULL, online_mem_block);
+   unlock_device_hotplug();
}
 
/*
-- 
2.17.1

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v3 6/6] memory-hotplug.txt: Add some details about locking internals

2018-09-27 Thread David Hildenbrand
Let's document the magic a bit, especially why device_hotplug_lock is
required when adding/removing memory and how it all play together with
requests to online/offline memory from user space.

Cc: Jonathan Corbet 
Cc: Michal Hocko 
Cc: Andrew Morton 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rashmica Gupta 
Signed-off-by: David Hildenbrand 
---
 Documentation/memory-hotplug.txt | 42 +++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 7f49ebf3ddb2..ce4faa5530fa 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -3,7 +3,7 @@ Memory Hotplug
 ==
 
 :Created:  Jul 28 2007
-:Updated: Add description of notifier of memory hotplug:   Oct 11 2007
+:Updated: Add some details about locking internals:Aug 20 2018
 
 This document is about memory hotplug including how-to-use and current status.
 Because Memory Hotplug is still under development, contents of this text will
@@ -495,6 +495,46 @@ further processing of the notification queue.
 
 NOTIFY_STOP stops further processing of the notification queue.
 
+
+Locking Internals
+=
+
+When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
+the device_hotplug_lock should be held to:
+
+- synchronize against online/offline requests (e.g. via sysfs). This way, 
memory
+  block devices can only be accessed (.online/.state attributes) by user
+  space once memory has been fully added. And when removing memory, we
+  know nobody is in critical sections.
+- synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
+
+Especially, there is a possible lock inversion that is avoided using
+device_hotplug_lock when adding memory and user space tries to online that
+memory faster than expected:
+
+- device_online() will first take the device_lock(), followed by
+  mem_hotplug_lock
+- add_memory_resource() will first take the mem_hotplug_lock, followed by
+  the device_lock() (while creating the devices, during bus_add_device()).
+
+As the device is visible to user space before taking the device_lock(), this
+can result in a lock inversion.
+
+onlining/offlining of memory should be done via device_online()/
+device_offline() - to make sure it is properly synchronized to actions
+via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type)
+
+When adding/removing/onlining/offlining memory or adding/removing
+heterogeneous/device memory, we should always hold the mem_hotplug_lock in
+write mode to serialise memory hotplug (e.g. access to global/zone
+variables).
+
+In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
+mode allows for a quite efficient get_online_mems/put_online_mems
+implementation, so code accessing memory can protect from that memory
+vanishing.
+
+
 Future Work
 ===
 
-- 
2.17.1

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-09-28 Thread David Hildenbrand
How to/when to online hotplugged memory is hard to manage for
distributions because different memory types are to be treated differently.
Right now, we need complicated udev rules that e.g. check if we are
running on s390x, on a physical system or on a virtualized system. But
there is also sometimes the demand to really online memory immediately
while adding in the kernel and not to wait for user space to make a
decision. And on virtualized systems there might be different
requirements, depending on "how" the memory was added (and if it will
eventually get unplugged again - DIMM vs. paravirtualized mechanisms).

On the one hand, we have physical systems where we sometimes
want to be able to unplug memory again - e.g. a DIMM - so we have to online
it to the MOVABLE zone optionally. That decision is usually made in user
space.

On the other hand, we have memory that should never be onlined
automatically, only when asked for by an administrator. Such memory only
applies to virtualized environments like s390x, where the concept of
"standby" memory exists. Memory is detected and added during boot, so it
can be onlined when requested by the admininistrator or some tooling.
Only when onlining, memory will be allocated in the hypervisor.

But then, we also have paravirtualized devices (namely xen and hyper-v
balloons), that hotplug memory that will never ever be removed from a
system right now using offline_pages/remove_memory. If at all, this memory
is logically unplugged and handed back to the hypervisor via ballooning.

For paravirtualized devices it is relevant that memory is onlined as
quickly as possible after adding - and that it is added to the NORMAL
zone. Otherwise, it could happen that too much memory in a row is added
(but not onlined), resulting in out-of-memory conditions due to the
additional memory for "struct pages" and friends. MOVABLE zone as well
as delays might be very problematic and lead to crashes (e.g. zone
imbalance).

Therefore, introduce memory block types and online memory depending on
it when adding the memory. Expose the memory type to user space, so user
space handlers can start to process only "normal" memory. Other memory
block types can be ignored. One thing less to worry about in user space.

Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: "Jérôme Glisse" 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Stephen Rothwell 
Cc: Michal Hocko 
Cc: "Kirill A. Shutemov" 
Cc: David Hildenbrand 
Cc: Nicholas Piggin 
Cc: "Jonathan Neuschäfer" 
Cc: Joe Perches 
Cc: Michael Neuling 
Cc: Mauricio Faria de Oliveira 
Cc: Balbir Singh 
Cc: Rashmica Gupta 
Cc: Pavel Tatashin 
Cc: Rob Herring 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "mike.tra...@hpe.com" 
Cc: Joonsoo Kim 
Cc: Oscar Salvador 
Cc: Mathieu Malaterre 
Signed-off-by: David Hildenbrand 
---

This patch is based on the current mm-tree, where some related
patches from me are currently residing that touched the add_memory()
functions.

 arch/ia64/mm/init.c   |  4 +-
 arch/powerpc/mm/mem.c |  4 +-
 arch/powerpc/platforms/powernv/memtrace.c |  3 +-
 arch/s390/mm/init.c   |  4 +-
 arch/sh/mm/init.c |  4 +-
 arch/x86/mm/init_32.c |  4 +-
 arch/x86/mm/init_64.c |  8 +--
 drivers/acpi/acpi_memhotplug.c|  3 +-
 drivers/base/memory.c | 63 ---
 drivers/hv/hv_balloon.c   | 33 ++--
 drivers/s390/char/sclp_cmd.c  |  3 +-
 drivers/xen/balloon.c |  2 +-
 include/linux/memory.h| 28 +-
 include/linux/memory_hotplug.h| 17 +++---
 mm/hmm.c  |  6 ++-
 mm/memory_hotplug.c   | 31 ++-
 16 files changed, 139 insertions(+), 78 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index d5e12ff1d73c..813d1d86bf95 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -646,13 +646,13 @@ mem_init (void)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-   bool want_memblock)
+   int memory_block_type)
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
 
-   ret 

Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-01 Thread David Hildenbrand
On 28/09/2018 19:02, Dave Hansen wrote:
> It's really nice if these kinds of things are broken up.  First, replace
> the old want_memblock parameter, then add the parameter to the
> __add_page() calls.

Definitely, once we agree that is is not nuts, I will split it up for
the next version :)

> 
>> +/*
>> + * NONE: No memory block is to be created (e.g. device memory).
>> + * NORMAL:   Memory block that represents normal (boot or hotplugged) memory
>> + *   (e.g. ACPI DIMMs) that should be onlined either automatically
>> + *   (memhp_auto_online) or manually by user space to select a
>> + *   specific zone.
>> + *   Applicable to memhp_auto_online.
>> + * STANDBY:  Memory block that represents standby memory that should only
>> + *   be onlined on demand by user space (e.g. standby memory on
>> + *   s390x), but never automatically by the kernel.
>> + *   Not applicable to memhp_auto_online.
>> + * PARAVIRT: Memory block that represents memory added by
>> + *   paravirtualized mechanisms (e.g. hyper-v, xen) that will
>> + *   always automatically get onlined. Memory will be unplugged
>> + *   using ballooning, not by relying on the MOVABLE ZONE.
>> + *   Not applicable to memhp_auto_online.
>> + */
>> +enum {
>> +MEMORY_BLOCK_NONE,
>> +MEMORY_BLOCK_NORMAL,
>> +MEMORY_BLOCK_STANDBY,
>> +MEMORY_BLOCK_PARAVIRT,
>> +};
> 
> This does not seem like the best way to expose these.
> 
> STANDBY, for instance, seems to be essentially a replacement for a check
> against running on s390 in userspace to implement a _typical_ s390
> policy.  It seems rather weird to try to make the userspace policy
> determination easier by telling userspace about the typical s390 policy
> via the kernel.

Now comes the fun part: I am working on another paravirtualized memory
hotplug way for KVM guests, based on virtio ("virtio-mem").

These devices can potentially be used concurrently with
- s390x standby memory
- DIMMs

How should a policy in user space look like when new memory gets added
- on s390x? Not onlining paravirtualized memory is very wrong.
- on e.g. x86? Onlining memory to the MOVABLE zone is very wrong.

So the type of memory is very important here to have in user space.
Relying on checks like "isS390()", "isKVMGuest()" or "isHyperVGuest()"
to decide whether to online memory and how to online memory is wrong.
Only some specific memory types (which I call "normal") are to be
handled by user space.

For the other ones, we exactly know what to do:
- standby? don't online
- paravirt? always online to normal zone

I will add some more details as reply to Michal.

> 
> As for the OOM issues, that sounds like something we need to fix by
> refusing to do (or delaying) hot-add operations once we consume too much
> ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
> userspace to hurry thing along.

That is a moving target and doing that automatically is basically
impossible. You can add a lot of memory to the movable zone and
everything is fine. Suddenly a lot of processes are started - boom.
MOVABLE should only every be used if you expect an unplug. And for
paravirtualized devices, a "typical" unplug does not exist.

> 
> So, to my eye, we need:
> 
>  +enum {
>  +MEMORY_BLOCK_NONE,
>  +MEMORY_BLOCK_STANDBY, /* the default */
>  +MEMORY_BLOCK_AUTO_ONLINE,
>  +};

auto-online is strongly misleading, that's why I called it "normal", but
I am open for suggestions. The information about devices handles fully
in the kernel - "paravirt" is key for me.

> 
> and we can probably collapse NONE into AUTO_ONLINE because userspace
> ends up doing the same thing for both: nothing.

For external reasons, yes, for internal reasons no (see hmm/device
memory). In user space, we will never end up with MEMORY_BLOCK_NONE,
because there is no memory block.

> 
>>  struct memory_block {
>>  unsigned long start_section_nr;
>>  unsigned long end_section_nr;
>> @@ -34,6 +58,7 @@ struct memory_block {
>>  int (*phys_callback)(struct memory_block *);
>>  struct device dev;
>>  int nid;/* NID for this memory block */
>> +int type;   /* type of this memory block */
>>  };
> 
> Shouldn't we just be creating and using an actual named enum type?
> 

That makes sense.

Thanks!

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-01 Thread David Hildenbrand
On 01/10/2018 10:40, Michal Hocko wrote:
> On Fri 28-09-18 17:03:57, David Hildenbrand wrote:
> [...]
> 
> I haven't read the patch itself but I just wanted to note one thing
> about this part
> 
>> For paravirtualized devices it is relevant that memory is onlined as
>> quickly as possible after adding - and that it is added to the NORMAL
>> zone. Otherwise, it could happen that too much memory in a row is added
>> (but not onlined), resulting in out-of-memory conditions due to the
>> additional memory for "struct pages" and friends. MOVABLE zone as well
>> as delays might be very problematic and lead to crashes (e.g. zone
>> imbalance).
> 
> I have proposed (but haven't finished this due to other stuff) a
> solution for this. Newly added memory can host memmaps itself and then
> you do not have the problem in the first place. For vmemmap it would
> have an advantage that you do not really have to beg for 2MB pages to
> back the whole section but you would get it for free because the initial
> part of the section is by definition properly aligned and unused.

So the plan is to "host metadata for new memory on the memory itself".
Just want to note that this is basically impossible for s390x with the
current mechanisms. (added memory is dead, until onlining notifies the
hypervisor and memory is allocated). It will also be problematic for
paravirtualized memory devices (e.g. XEN's "not backed by the
hypervisor" hacks).

This would only be possible for memory DIMMs, memory that is completely
accessible as far as I can see. Or at least, some specified "first part"
is accessible.

Other problems are other metadata like extended struct pages and friends.

(I really like the idea of adding memory without allocating memory in
the hypervisor in the first place, please keep me tuned).

And please note: This solves some problematic part ("adding too much
memory to the movable zone or not onlining it"), but not the issue of
zone imbalance in the first place. And not one issue I try to tackle
here: don't add paravirtualized memory to the movable zone.

> 
> I yet have to think about the whole proposal but I am missing the most
> important part. _Who_ is going to use the new exported information and
> for what purpose. You said that distributions have hard time to
> distinguish different types of onlinining policies but isn't this
> something that is inherently usecase specific?
> 

Let's think about a distribution. We have a clash of use cases here
(just what you describe). What I propose solves one part of it ("handle
what you know how to handle right in the kernel").

1. Users of DIMMs usually expect that they can be unplugged again. That
is why you want to control how to online memory in user space (== add it
to the movable zone).

2. Users of standby memory (s390) expect that memory will never be
onlined automatically. It will be onlined manually.

3. Users of paravirtualized devices (esp. Hyper-V) don't care about
memory unplug in the sense of MOVABLE at all. They (or Hyper-V!) will
add a whole bunch of memory and expect that everything works fine. So
that memory is onlined immediately and that memory is added to the
NORMAL zone. Users never want the MOVABLE zone.

1. is a reason why distributions usually don't configure
"MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
MOVABLE zone. That however implies, that e.g. for x86, you have to
handle all new memory in user space, especially also HyperV memory.
There, you then have to check for things like "isHyperV()" to decide
"oh, yes, this should definitely not go to the MOVABLE zone".

As you know, I am working on virtio-mem, which can basically be combined
with 1 or 2. And user space has no idea about the difference between
added memory blocks. Was it memory from a DIMM (== ZONE_MOVABLE)? Was it
memory from a paravirtualized device (== ZONE_NORMAL)? Was it standby
memory? (don't online)


That part, I try to solve with this interface.

To answer your question: User space will only care about "normal" memory
and then decide how to online it (for now, usually MOVABLE, because
that's what customers expect with DIMMs). The use case of DIMMS, we
don't know and therefore we can't expose. The use case of the other
cases, we know exactly already in the kernel.

Existing user space hacks will continue to work but can be replaces by a
new check against "normal" memory block types.

Thanks for looking into this!

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH net-next] hv_netvsc: Fix rndis_per_packet_info internal field initialization

2018-10-01 Thread David Miller
From: Haiyang Zhang 
Date: Fri, 28 Sep 2018 14:41:23 +

> From: Haiyang Zhang 
> 
> The RSC feature -- a bit field "internal" was added here with total
> size unchanged:
> struct rndis_per_packet_info {
>   u32 size;
>   u32 type:31;
>   u32 internal:1;
>   u32 ppi_offset;
> };
> 
> On TX path, we put rndis msg into skb head room, which is not zeroed
> before passing to us. We do not use the "internal" field in TX path,
> but it may impact older hosts which use the entire 32 bits as "type".
> 
> To fix the bug, this patch sets the field "internal" to zero.
> 
> Fixes: c8e4eff4675f ("hv_netvsc: Add support for LRO/RSC in the vSwitch")
> Signed-off-by: Haiyang Zhang 

Applied.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-02 Thread David Hildenbrand
On 02/10/2018 15:47, Michal Hocko wrote:
> On Mon 01-10-18 11:34:25, David Hildenbrand wrote:
>> On 01/10/2018 10:40, Michal Hocko wrote:
>>> On Fri 28-09-18 17:03:57, David Hildenbrand wrote:
>>> [...]
>>>
>>> I haven't read the patch itself but I just wanted to note one thing
>>> about this part
>>>
>>>> For paravirtualized devices it is relevant that memory is onlined as
>>>> quickly as possible after adding - and that it is added to the NORMAL
>>>> zone. Otherwise, it could happen that too much memory in a row is added
>>>> (but not onlined), resulting in out-of-memory conditions due to the
>>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>>> as delays might be very problematic and lead to crashes (e.g. zone
>>>> imbalance).
>>>
>>> I have proposed (but haven't finished this due to other stuff) a
>>> solution for this. Newly added memory can host memmaps itself and then
>>> you do not have the problem in the first place. For vmemmap it would
>>> have an advantage that you do not really have to beg for 2MB pages to
>>> back the whole section but you would get it for free because the initial
>>> part of the section is by definition properly aligned and unused.
>>
>> So the plan is to "host metadata for new memory on the memory itself".
>> Just want to note that this is basically impossible for s390x with the
>> current mechanisms. (added memory is dead, until onlining notifies the
>> hypervisor and memory is allocated). It will also be problematic for
>> paravirtualized memory devices (e.g. XEN's "not backed by the
>> hypervisor" hacks).
> 
> OK, I understand that not all usecases can use self memmap hosting
> others do not have much choice left though. You have to allocate from
> somewhere. Well and alternative would be to have no memmap until
> onlining but I am not sure how much work that would be.
> 
>> This would only be possible for memory DIMMs, memory that is completely
>> accessible as far as I can see. Or at least, some specified "first part"
>> is accessible.
>>
>> Other problems are other metadata like extended struct pages and friends.
> 
> I wouldn't really worry about extended struct pages. Those should be
> used for debugging purposes mostly. Ot at least that was the case last
> time I've checked.

Yes, I guess that is true. Being able to add and online memory without
the need for additional (external) memory would be the ultimate goal,
but highly complicated. But steps into that direction is a good idea.

> 
>> (I really like the idea of adding memory without allocating memory in
>> the hypervisor in the first place, please keep me tuned).
>>
>> And please note: This solves some problematic part ("adding too much
>> memory to the movable zone or not onlining it"), but not the issue of
>> zone imbalance in the first place. And not one issue I try to tackle
>> here: don't add paravirtualized memory to the movable zone.
> 
> Zone imbalance is an inherent problem of the highmem zone. It is
> essentially the highmem zone we all loved so much back in 32b days.
> Yes the movable zone doesn't have any addressing limitations so it is a
> bit more relaxed but considering the hotplug scenarios I have seen so
> far people just want to have full NUMA nodes movable to allow replacing
> DIMMs. And then we are back to square one and the zone imbalance issue.
> You have those regardless where memmaps are allocated from.

Unfortunately yes. And things get more complicated as you are adding a
whole DIMMs and get notifications in the granularity of memory blocks.
Usually you are not interested in onlining any memory block of that DIMM
as MOVABLE as soon as you would have to online one memory block of that
DIMM as NORMAL - because that can already block the whole DIMM.

> 
>>> I yet have to think about the whole proposal but I am missing the most
>>> important part. _Who_ is going to use the new exported information and
>>> for what purpose. You said that distributions have hard time to
>>> distinguish different types of onlinining policies but isn't this
>>> something that is inherently usecase specific?
>>>
>>
>> Let's think about a distribution. We have a clash of use cases here
>> (just what you describe). What I propose solves one part of it ("handle
>> what you know how to handle right in the kernel").
>>
>> 1. Users of DIMMs usually expect that they can be unplugged again. That
>> is why you want

Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread David Hildenbrand
On 03/10/2018 15:54, Michal Hocko wrote:
> On Tue 02-10-18 17:25:19, David Hildenbrand wrote:
>> On 02/10/2018 15:47, Michal Hocko wrote:
> [...]
>>> Zone imbalance is an inherent problem of the highmem zone. It is
>>> essentially the highmem zone we all loved so much back in 32b days.
>>> Yes the movable zone doesn't have any addressing limitations so it is a
>>> bit more relaxed but considering the hotplug scenarios I have seen so
>>> far people just want to have full NUMA nodes movable to allow replacing
>>> DIMMs. And then we are back to square one and the zone imbalance issue.
>>> You have those regardless where memmaps are allocated from.
>>
>> Unfortunately yes. And things get more complicated as you are adding a
>> whole DIMMs and get notifications in the granularity of memory blocks.
>> Usually you are not interested in onlining any memory block of that DIMM
>> as MOVABLE as soon as you would have to online one memory block of that
>> DIMM as NORMAL - because that can already block the whole DIMM.
> 
> For the purpose of the hotremove, yes. But as Dave has noted people are
> (ab)using zone movable for other purposes - e.g. large pages.

That might be right for some very special use cases. For most of users
this is not the case (meaning it should be the default but if the user
wants to change it, he should be allowed to change it).

>  
> [...]
>>> Then the immediate question would be why to use memory hotplug for that
>>> at all? Why don't you simply start with a huge pre-allocated physical
>>> address space and balloon memory in an out per demand. Why do you want
>>> to inject new memory during the runtime?
>>
>> Let's assume you have a guest with 20GB size and eventually want to
>> allow to grow it to 4TB. You would have to allocate metadata for 4TB
>> right from the beginning. That's definitely now what we want. That is
>> why memory hotplug is used by e.g. XEN or Hyper-V. With Hyper-V, the
>> hypervisor even tells you at which places additional memory has been
>> made available.
> 
> Then you have to live with the fact that your hot added memory will be
> self hosted and find a way for ballooning to work with that. The price
> would be that some part of the memory is not really balloonable in the
> end.
> 
>>>> 1. is a reason why distributions usually don't configure
>>>> "MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
>>>> MOVABLE zone. That however implies, that e.g. for x86, you have to
>>>> handle all new memory in user space, especially also HyperV memory.
>>>> There, you then have to check for things like "isHyperV()" to decide
>>>> "oh, yes, this should definitely not go to the MOVABLE zone".
>>>
>>> Why do you need a generic hotplug rule in the first place? Why don't you
>>> simply provide different set of rules for different usecases? Let users
>>> decide which usecase they prefer rather than try to be clever which
>>> almost always hits weird corner cases.
>>>
>>
>> Memory hotplug has to work as reliable as we can out of the box. Letting
>> the user make simple decisions like "oh, I am on hyper-V, I want to
>> online memory to the normal zone" does not feel right.
> 
> Users usually know what is their usecase and then it is just a matter of
> plumbing (e.g. distribution can provide proper tools to deploy those
> usecases) to chose the right and for user obscure way to make it work.

I disagree. If we can ship sane defaults, we should do that and allow to
make changes later on. This is how distributions have been working for
ever. But yes, allowing to make modifications is always a good idea to
tailor it to some special case user scenarios. (tuned or whatever we
have in place).

> 
>> But yes, we
>> should definitely allow to make modifications. So some sane default rule
>> + possible modification is usually a good idea.
>>
>> I think Dave has a point with using MOVABLE for huge page use cases. And
>> there might be other corner cases as you correctly state.
>>
>> I wonder if this patch itself minus modifying online/offline might make
>> sense. We can then implement simple rules in user space
>>
>> if (normal) {
>>  /* customers expect hotplugged DIMMs to be unpluggable */
>>  online_movable();
>> } else if (paravirt) {
>>  /* paravirt memory should as default always go to the NORMAL */
>>  online();
>> } else {
>>  /* standby memory will never get onlined automatically */
>>

Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread David Hildenbrand
On 03/10/2018 16:24, Michal Hocko wrote:
> On Wed 03-10-18 15:52:24, Vitaly Kuznetsov wrote:
> [...]
>>> As David said some of the memory cannot be onlined without further steps
>>> (e.g. when it is standby as David called it) and then I fail to see how
>>> eBPF help in any way.
>>
>> and also, we can fight till the end of days here trying to come up with
>> an onlining solution which would work for everyone and eBPF would move
>> this decision to distro level.
> 
> The point is that there is _no_ general onlining solution. This is
> basically policy which belongs to the userspace.
> 

As already stated, I guess we should then provide user space with
sufficient information to make a good decision (to implement rules).

The eBPF is basically the same idea, only the rules are formulated
differently and directly handle in the kernel. Still it might be e.e.
relevant if memory is standby memory (that's what I remember the
official s390x name), or something else.

Right now, the (udev) rules we have make assumptions based on general
system properties (s390x, HyperV ...).

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread David Hildenbrand
On 03/10/2018 16:34, Vitaly Kuznetsov wrote:
> Dave Hansen  writes:
> 
>> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
>>> It is more than just memmaps (e.g. forking udev process doing memory
>>> onlining also needs memory) but yes, the main idea is to make the
>>> onlining synchronous with hotplug.
>>
>> That's a good theoretical concern.
>>
>> But, is it a problem we need to solve in practice?
> 
> Yes, unfortunately. It was previously discovered that when we try to
> hotplug tons of memory to a low memory system (this is a common scenario
> with VMs) we end up with OOM because for all new memory blocks we need
> to allocate page tables, struct pages, ... and we need memory to do
> that. The userspace program doing memory onlining also needs memory to
> run and in case it prefers to fork to handle hundreds of notfifications
> ... well, it may get OOMkilled before it manages to online anything.
> 
> Allocating all kernel objects from the newly hotplugged blocks would
> definitely help to manage the situation but as I said this won't solve
> the 'forking udev' problem completely (it will likely remain in
> 'extreme' cases only. We can probably work around it by onlining with a
> dedicated process which doesn't do memory allocation).
> 

I guess the problem is even worse. We always have two phases

1. add memory - requires memory allocation
2. online memory - might require memory allocations e.g. for slab/slub

So if we just added memory but don't have sufficient memory to start a
user space process to trigger onlining, then we most likely also don't
have sufficient memory to online the memory right away (in some scenarios).

We would have to allocate all new memory for 1 and 2 from the memory to
be onlined. I guess the latter part is less trivial.

So while onlining the memory from the kernel might make things a little
more robust, we would still have the chance for OOM / onlining failing.

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-04 Thread David Hildenbrand
On 04/10/2018 08:28, Michal Hocko wrote:
> On Wed 03-10-18 19:00:29, David Hildenbrand wrote:
> [...]
>> Let me rephrase: You state that user space has to make the decision and
>> that user should be able to set/reconfigure rules. That is perfectly fine.
>>
>> But then we should give user space access to sufficient information to
>> make a decision. This might be the type of memory as we learned (what
>> some part of this patch proposes), but maybe later more, e.g. to which
>> physical device memory belongs (e.g. to hotplug it all movable or all
>> normal) ...
> 
> I am pretty sure that user knows he/she wants to use ballooning in
> HyperV or Xen, or that the memory hotplug should be used as a "RAS"
> feature to allow add and remove DIMMs for reliability. Why shouldn't we
> have a package to deploy an appropriate set of udev rules for each of
> those usecases? I am pretty sure you need some other plumbing to enable
> them anyway (e.g. RAS would require to have movable_node kernel
> parameters, ballooning a kernel module etc.).
> 
> Really, one udev script to rule them all will simply never work.
> 

I am on your side. We will need multiple ones. But we need sane
defaults. And a default rule will always exist. And users will expect
that the defaults somewhat match their expectation unless they really
have some special use cases.

All I am saying is, again, that if user space is to make decisions, it
should get sufficient information to make sane decision. And in my point
of view, the type of memory allows us to make these decision and to
provide a "single udev script to rule them all" with sane defaults.

I at least think the distinction between "auto-online" and "standby" is
required (what Dave suggested).

The we can make a simple rule

if (auto-online memory) {
if (virtual environment) {
"online"
} else {
    "online_movable"
}
}
/* standby memory not onlined as default */

We are able to provide sane defaults.

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-04 Thread David Hildenbrand
On 01/10/2018 18:24, Dave Hansen wrote:
>> How should a policy in user space look like when new memory gets added
>> - on s390x? Not onlining paravirtualized memory is very wrong.
> 
> Because we're going to balloon it away in a moment anyway?

No, rether somebody wanted this VM to have more memory, so it should use
it - basically what HyperV or XEN also do. (in contrast to the concept
of standby memory on s390).

> > We have auto-onlining.  Why isn't that being used on s390?

Do you mean the sys parameter? How would that help?

> 
> 
>> So the type of memory is very important here to have in user space.
>> Relying on checks like "isS390()", "isKVMGuest()" or "isHyperVGuest()"
>> to decide whether to online memory and how to online memory is wrong.
>> Only some specific memory types (which I call "normal") are to be
>> handled by user space.
>>
>> For the other ones, we exactly know what to do:
>> - standby? don't online
> 
> I think you're horribly conflating the software desire for what the stae
> should be and the hardware itself.

Agreed, user space should be able to configure it.

> 
>>> As for the OOM issues, that sounds like something we need to fix by
>>> refusing to do (or delaying) hot-add operations once we consume too much
>>> ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
>>> userspace to hurry thing along.
>>
>> That is a moving target and doing that automatically is basically
>> impossible.
> 
> Nah.  We know how much metadata we've allocated.  We know how much
> ZONE_NORMAL we are eating.  We can *easily* add something to
> add_memory() that just sleeps until the ratio is not out-of-whack.
> 
>> You can add a lot of memory to the movable zone and
>> everything is fine. Suddenly a lot of processes are started - boom.
>> MOVABLE should only every be used if you expect an unplug. And for
>> paravirtualized devices, a "typical" unplug does not exist.
> 
> No, it's more complicated than that.  People use MOVABLE, for instance,
> to allow more consistent huge page allocations.  It's certainly not just
> hot-remove.
> 

As noted in the other thread, that's a good point. We have to allow to
make a decision in user space.

I agree to your initial proposal to distinguish "standby" from
"auto-online". It would allow to have sane defaults in user space.

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-04 Thread David Hildenbrand
On 03/10/2018 16:24, Michal Hocko wrote:
> On Wed 03-10-18 15:52:24, Vitaly Kuznetsov wrote:
> [...]
>>> As David said some of the memory cannot be onlined without further steps
>>> (e.g. when it is standby as David called it) and then I fail to see how
>>> eBPF help in any way.
>>
>> and also, we can fight till the end of days here trying to come up with
>> an onlining solution which would work for everyone and eBPF would move
>> this decision to distro level.
> 
> The point is that there is _no_ general onlining solution. This is
> basically policy which belongs to the userspace.
> 

Vitalys point was that this policy is to be formulated by user space via
eBPF and handled by the kernel. So the work of formulating the policy
would still have to be done by user space.

I guess the only problem this would partially solve is onlining of
memory failing as we can no longer fork in user space. Just as you said,
I also think this is rather some serious misconfiguration. We will
already most probably have other applications triggering OOM already.

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-04 Thread David Hildenbrand
On 04/10/2018 08:19, Michal Hocko wrote:
> On Wed 03-10-18 19:14:05, David Hildenbrand wrote:
>> On 03/10/2018 16:34, Vitaly Kuznetsov wrote:
>>> Dave Hansen  writes:
>>>
>>>> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
>>>>> It is more than just memmaps (e.g. forking udev process doing memory
>>>>> onlining also needs memory) but yes, the main idea is to make the
>>>>> onlining synchronous with hotplug.
>>>>
>>>> That's a good theoretical concern.
>>>>
>>>> But, is it a problem we need to solve in practice?
>>>
>>> Yes, unfortunately. It was previously discovered that when we try to
>>> hotplug tons of memory to a low memory system (this is a common scenario
>>> with VMs) we end up with OOM because for all new memory blocks we need
>>> to allocate page tables, struct pages, ... and we need memory to do
>>> that. The userspace program doing memory onlining also needs memory to
>>> run and in case it prefers to fork to handle hundreds of notfifications
>>> ... well, it may get OOMkilled before it manages to online anything.
>>>
>>> Allocating all kernel objects from the newly hotplugged blocks would
>>> definitely help to manage the situation but as I said this won't solve
>>> the 'forking udev' problem completely (it will likely remain in
>>> 'extreme' cases only. We can probably work around it by onlining with a
>>> dedicated process which doesn't do memory allocation).
>>>
>>
>> I guess the problem is even worse. We always have two phases
>>
>> 1. add memory - requires memory allocation
>> 2. online memory - might require memory allocations e.g. for slab/slub
>>
>> So if we just added memory but don't have sufficient memory to start a
>> user space process to trigger onlining, then we most likely also don't
>> have sufficient memory to online the memory right away (in some scenarios).
>>
>> We would have to allocate all new memory for 1 and 2 from the memory to
>> be onlined. I guess the latter part is less trivial.
>>
>> So while onlining the memory from the kernel might make things a little
>> more robust, we would still have the chance for OOM / onlining failing.
> 
> Yes, _theoretically_. Is this a practical problem for reasonable
> configurations though? I mean, this will never be perfect and we simply
> cannot support all possible configurations. We should focus on
> reasonable subset of them. From my practical experience the vast
> majority of memory is consumed by memmaps (roughly 1.5%). That is not a
> lot but I agree that allocating that from the zone normal and off node
> is not great. Especially the second part which is noticeable for whole
> node hotplug.
> 
> I have a feeling that arguing about fork not able to proceed or OOMing
> for the memory hotplug is a bit of a stretch and a sign a of
> misconfiguration.
> 

Just to rephrase, I have the same opinion. Something is already messed
up if we cannot even fork anymore. We will have OOM already all over the
place before/during/after forking.

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-04 Thread David Hildenbrand
On 04/10/2018 17:28, Michal Suchánek wrote:
> On Thu, 4 Oct 2018 10:13:48 +0200
> David Hildenbrand  wrote:
> 
> ok, so what is the problem here?
> 
> Handling the hotplug in userspace through udev may be suboptimal and
> kernel handling might be faster but that's orthogonal to the problem at
> hand.

Yes, that one to solve is a different story.

> 
> The state of the art is to determine what to do with hotplugged memory
> in userspace based on platform and virtualization type.

Exactly.

> 
> Changing the default to depend on the driver that added the memory
> rather than platform type should solve the issue of VMs growing
> different types of memory device emulation.

Yes, my original proposal (this patch) was to handle it in the kernel
for known types. But as we learned, there might be some use cases that
might still require to make a decision in user space.

So providing the user space either with some type hint (auto-online vs.
standby) or the driver that added it (system vs. hyper-v ...) would
solve the issue.

> 
> Am I missing something?
> 

No, that's it. Thanks!

> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-05 Thread David Hildenbrand
On 04/10/2018 19:50, Michal Suchánek wrote:
> On Thu, 4 Oct 2018 17:45:13 +0200
> David Hildenbrand  wrote:
> 
>> On 04/10/2018 17:28, Michal Suchánek wrote:
> 
>>>
>>> The state of the art is to determine what to do with hotplugged
>>> memory in userspace based on platform and virtualization type.  
>>
>> Exactly.
>>
>>>
>>> Changing the default to depend on the driver that added the memory
>>> rather than platform type should solve the issue of VMs growing
>>> different types of memory device emulation.  
>>
>> Yes, my original proposal (this patch) was to handle it in the kernel
>> for known types. But as we learned, there might be some use cases that
>> might still require to make a decision in user space.
>>
>> So providing the user space either with some type hint (auto-online
>> vs. standby) or the driver that added it (system vs. hyper-v ...)
>> would solve the issue.
> 
> Is that not available in the udev event?
> 

Not that I am aware. Memory blocks "devices" have no drivers.

ls -la /sys/devices/system/memory/memory0/subsystem/drivers
total 0

(add_memory()/add_memory_resource() creates the memory block devices
when called from a driver)


> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [v3, 2/6] MAINTAINERS: update files maintained under DPAA2 PTP/ETHERNET

2018-10-08 Thread David Miller
From: Yangbo Lu 
Date: Mon,  8 Oct 2018 15:44:26 +0800

> The files maintained under DPAA2 PTP/ETHERNET needs to
> be updated since dpaa2 ptp driver had been moved into
> drivers/net/ethernet/freescale/dpaa2/.
> 
> Signed-off-by: Yangbo Lu 

Applied.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [v3, 1/6] net: dpaa2: move DPAA2 PTP driver out of staging/

2018-10-08 Thread David Miller
From: Yangbo Lu 
Date: Mon,  8 Oct 2018 15:44:25 +0800

> This patch is to move DPAA2 PTP driver out of staging/
> since the dpaa2-eth had been moved out.
> 
> Signed-off-by: Yangbo Lu 

Applied.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [v3, 4/6] net: dpaa2: rename rtc as ptp in dpaa2-ptp driver

2018-10-08 Thread David Miller
From: Yangbo Lu 
Date: Mon,  8 Oct 2018 15:44:28 +0800

> In dpaa2-ptp driver, it's odd to use rtc in names of
> some functions and structures except these dprtc APIs.
> This patch is to use ptp instead of rtc in names.
> 
> Signed-off-by: Yangbo Lu 

Applied.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [v3, 3/6] net: dpaa2: fix dependency of config FSL_DPAA2_ETH

2018-10-08 Thread David Miller
From: Yangbo Lu 
Date: Mon,  8 Oct 2018 15:44:27 +0800

> The NETDEVICES dependency and ETHERNET dependency hadn't
> been required since dpaa2-eth was moved out of staging.
> Also allowed COMPILE_TEST for dpaa2-eth.
> 
> Signed-off-by: Yangbo Lu 

Applied.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [v3, 5/6] net: dpaa2: remove unused code for dprtc

2018-10-08 Thread David Miller
From: Yangbo Lu 
Date: Mon,  8 Oct 2018 15:44:29 +0800

> This patch is to removed unused code for dprtc.
> This code will be re-added along with more features
> of dpaa2-ptp added.
> 
> Signed-off-by: Yangbo Lu 

Applied.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [v3, 6/6] net: dpaa2: fix and improve dpaa2-ptp driver

2018-10-08 Thread David Miller
From: Yangbo Lu 
Date: Mon,  8 Oct 2018 15:44:30 +0800

> This patch is to fix and improve dpaa2-ptp driver
> in some places.
> 
> - Fixed the return for some functions.
> - Replaced kzalloc with devm_kzalloc.
> - Removed dev_set_drvdata(dev, NULL).
> - Made ptp_dpaa2_caps const.
> 
> Signed-off-by: Yangbo Lu 

Applied.

You can make things much easier for me and everyone else if you provide
a proper "[PATCH 0/N] ..." header posting with your patch series.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH net-next,v3] hv_netvsc: fix vf serial matching with pci slot info

2018-10-15 Thread David Miller
From: Haiyang Zhang 
Date: Mon, 15 Oct 2018 19:06:15 +

> From: Haiyang Zhang 
> 
> The VF device's serial number is saved as a string in PCI slot's
> kobj name, not the slot->number. This patch corrects the netvsc
> driver, so the VF device can be successfully paired with synthetic
> NIC.
> 
> Fixes: 00d7ddba1143 ("hv_netvsc: pair VF based on serial number")
> Reported-by: Vitaly Kuznetsov 
> Signed-off-by: Haiyang Zhang 

Applied.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


RE: [PATCH V2 5/5] Tools: hv: kvp: Fix a warning of buffer overflow with gcc 8.0.1

2018-10-18 Thread David Laight
From: Dan Carpenter
> Sent: 18 October 2018 07:33
> 
> On Thu, Oct 18, 2018 at 05:09:32AM +, k...@linuxonhyperv.com wrote:
> > From: Dexuan Cui 
> >
> > The patch fixes:
> >
> > hv_kvp_daemon.c: In function 'kvp_set_ip_info':
> > hv_kvp_daemon.c:1305:2: note: 'snprintf' output between 41 and 4136 bytes
> > into a destination of size 4096
> >
> > The "(unsigned int)str_len" is to avoid:
> >
> > hv_kvp_daemon.c:1309:30: warning: comparison of integer expressions of
> > different signedness: 'int' and 'long unsigned int' [-Wsign-compare]

I usually use 'str_len + 0u' rather than a cast.

> Ugh...  Any tool with the most basic flow analysis would realize this
> was a false positive.  We use at least three static analyzers which
> catch signedness bugs.  Can we turn off GCC's warning on this until they
> improve it a bit?

Yes, would be nice if it attempted to follow the valid domain of variables.

I recently had to change:
unsigned char a, b;
unsigned int c;
...
if (a + b < c)
To stop a 'signedness' warning.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH RFC 0/6] mm/kdump: allow to exclude pages that are logically offline

2018-11-14 Thread David Hildenbrand
Right now, pages inflated as part of a balloon driver will be dumped
by dump tools like makedumpfile. While XEN is able to check in the
crash kernel whether a certain pfn is actuall backed by memory in the
hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
virtio-balloon and hv-balloon inflated memory will essentially result in
zero pages getting allocated by the hypervisor and the dump getting
filled with this data.

The allocation and reading of zero pages can directly be avoided if a
dumping tool could know which pages only contain stale information not to
be dumped.

Also for XEN, calling into the kernel and asking the hypervisor if a
pfn is backed can be avoided if the duming tool would skip such pages
right from the beginning.

Dumping tools have no idea whether a given page is part of a balloon driver
and shall not be dumped. Esp. PG_reserved cannot be used for that purpose
as all memory allocated during early boot is also PG_reserved, see
discussion at [1]. So some other way of indication is required and a new
page flag is frowned upon.

We have PG_balloon (MAPCOUNT value), which is essentially unused now. I
suggest renaming it to something more generic (PG_offline) to mark pages as
logically offline. This flag can than e.g. also be used by virtio-mem in
the future to mark subsections as offline. Or by other code that wants to
put pages logically offline (e.g. later maybe poisoned pages that shall
no longer be used).

This series converts PG_balloon to PG_offline, allows dumping tools to
query the value to detect such pages and marks pages in the hv-balloon
and XEN balloon properly as PG_offline. Note that virtio-balloon already
set pages to PG_balloon (and now PG_offline).

Please note that this is also helpful for a problem we were seeing under
Hyper-V: Dumping logically offline memory (pages kept fake offline while
onlining a section via online_page_callback) would under some condicions
result in a kernel panic when dumping them.

As I don't have access to neither XEN nor Hyper-V installation, this was
not tested yet (and a makedumpfile change will be required to skip
dumping these pages).

[1] https://lkml.org/lkml/2018/7/20/566

David Hildenbrand (6):
  mm: balloon: update comment about isolation/migration/compaction
  mm: convert PG_balloon to PG_offline
  kexec: export PG_offline to VMCOREINFO
  xen/balloon: mark inflated pages PG_offline
  hv_balloon: mark inflated pages PG_offline
  PM / Hibernate: exclude all PageOffline() pages

 Documentation/admin-guide/mm/pagemap.rst |  6 +
 drivers/hv/hv_balloon.c  | 14 --
 drivers/xen/balloon.c|  3 +++
 fs/proc/page.c   |  4 +--
 include/linux/balloon_compaction.h   | 34 +---
 include/linux/page-flags.h   | 11 +---
 include/uapi/linux/kernel-page-flags.h   |  1 +
 kernel/crash_core.c  |  2 ++
 kernel/power/snapshot.c  |  5 +++-
 tools/vm/page-types.c|  1 +
 10 files changed, 51 insertions(+), 30 deletions(-)

-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH RFC 1/6] mm: balloon: update comment about isolation/migration/compaction

2018-11-14 Thread David Hildenbrand
Commit b1123ea6d3b3 ("mm: balloon: use general non-lru movable page
feature") reworked balloon handling to make use of the general
non-lru movable page feature. The big comment block in
balloon_compaction.h contains quite some outdated information. Let's fix
this.

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 include/linux/balloon_compaction.h | 26 +-
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index 53051f3d8f25..cbe50da5a59d 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -4,15 +4,18 @@
  *
  * Common interface definitions for making balloon pages movable by compaction.
  *
- * Despite being perfectly possible to perform ballooned pages migration, they
- * make a special corner case to compaction scans because balloon pages are not
- * enlisted at any LRU list like the other pages we do compact / migrate.
+ * Balloon page migration makes use of the general non-lru movable page
+ * feature.
+ *
+ * page->private is used to reference the responsible balloon device.
+ * page->mapping is used in context of non-lru page migration to reference
+ * the address space operations for page isolation/migration/compaction.
  *
  * As the page isolation scanning step a compaction thread does is a lockless
  * procedure (from a page standpoint), it might bring some racy situations 
while
  * performing balloon page compaction. In order to sort out these racy 
scenarios
  * and safely perform balloon's page compaction and migration we must, always,
- * ensure following these three simple rules:
+ * ensure following these simple rules:
  *
  *   i. when updating a balloon's page ->mapping element, strictly do it under
  *  the following lock order, independently of the far superior
@@ -21,19 +24,8 @@
  *   +--spin_lock_irq(&b_dev_info->pages_lock);
  * ... page->mapping updates here ...
  *
- *  ii. before isolating or dequeueing a balloon page from the balloon device
- *  pages list, the page reference counter must be raised by one and the
- *  extra refcount must be dropped when the page is enqueued back into
- *  the balloon device page list, thus a balloon page keeps its reference
- *  counter raised only while it is under our special handling;
- *
- * iii. after the lockless scan step have selected a potential balloon page for
- *  isolation, re-test the PageBalloon mark and the PagePrivate flag
- *  under the proper page lock, to ensure isolating a valid balloon page
- *  (not yet isolated, nor under release procedure)
- *
- *  iv. isolation or dequeueing procedure must clear PagePrivate flag under
- *  page lock together with removing page from balloon device page list.
+ *  ii. isolation or dequeueing procedure must remove the page from balloon
+ *  device page list under b_dev_info->pages_lock.
  *
  * The functions provided by this interface are placed to help on coping with
  * the aforementioned balloon page corner case, as well as to ensure the simple
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH RFC 3/6] kexec: export PG_offline to VMCOREINFO

2018-11-14 Thread David Hildenbrand
Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
makedumpfile can directly skip pages that are logically offline and the
content therefore stale.

Cc: Andrew Morton 
Cc: Dave Young 
Cc: "Kirill A. Shutemov" 
Cc: Baoquan He 
Cc: Omar Sandoval 
Cc: Arnd Bergmann 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 kernel/crash_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 933cb3e45b98..093c9f917ed0 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
 #ifdef CONFIG_HUGETLB_PAGE
VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
+#define PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
+   VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
 #endif
 
arch_crash_save_vmcoreinfo();
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH RFC 2/6] mm: convert PG_balloon to PG_offline

2018-11-14 Thread David Hildenbrand
PG_balloon was introduced to implement page migration/compaction for pages
inflated in virtio-balloon. Nowadays, it is only a marker that a page is
part of virtio-balloon and therefore logically offline.

We also want to make use of this flag in other balloon drivers - for
inflated pages or when onlining a section but keeping some pages offline
(e.g. used right now by XEN and Hyper-V via set_online_page_callback()).

We are going to expose this flag to dump tools like makedumpfile. But
instead of exposing PG_balloon, let's generalize the concept of marking
pages as logically offline, so it can be reused for other purposes
later on.

Rename PG_balloon to PG_offline. This is an indicator that the page is
logically offline, the content stale and that it should not be touched
(e.g. a hypervisor would have to allocate backing storage in order for the
guest to dump an unused page).  We can then e.g. exclude such pages from
dumps.

In following patches, we will make use of this bit also in other balloon
drivers.  While at it, document PGTABLE.

Cc: Jonathan Corbet 
Cc: Alexey Dobriyan 
Cc: Mike Rapoport 
Cc: Andrew Morton 
Cc: Christian Hansen 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Stephen Rothwell 
Cc: Matthew Wilcox 
Cc: "Michael S. Tsirkin" 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Alexander Duyck 
Cc: Naoya Horiguchi 
Cc: Miles Chen 
Cc: David Rientjes 
Signed-off-by: David Hildenbrand 
---
 Documentation/admin-guide/mm/pagemap.rst |  6 ++
 fs/proc/page.c   |  4 ++--
 include/linux/balloon_compaction.h   |  8 
 include/linux/page-flags.h   | 11 +++
 include/uapi/linux/kernel-page-flags.h   |  1 +
 tools/vm/page-types.c|  1 +
 6 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/mm/pagemap.rst 
b/Documentation/admin-guide/mm/pagemap.rst
index 3f7bade2c231..9afd6bdc424b 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -78,6 +78,8 @@ number of times a page is mapped.
 23. BALLOON
 24. ZERO_PAGE
 25. IDLE
+26. PGTABLE
+27. OFFLINE
 
  * ``/proc/kpagecgroup``.  This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -128,6 +130,10 @@ Short descriptions to the page flags
 Note that this flag may be stale in case the page was accessed via
 a PTE. To make sure the flag is up-to-date one has to read
 ``/sys/kernel/mm/page_idle/bitmap`` first.
+26 - PGTABLE
+page is in use as a page table
+27 - OFFLINE
+page is logically offline
 
 IO related page flags
 -
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 6c517b11acf8..378401af4d9d 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -152,8 +152,8 @@ u64 stable_page_flags(struct page *page)
else if (page_count(page) == 0 && is_free_buddy_page(page))
u |= 1 << KPF_BUDDY;
 
-   if (PageBalloon(page))
-   u |= 1 << KPF_BALLOON;
+   if (PageOffline(page))
+   u |= 1 << KPF_OFFLINE;
if (PageTable(page))
u |= 1 << KPF_PGTABLE;
 
diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index cbe50da5a59d..f111c780ef1d 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -95,7 +95,7 @@ extern int balloon_page_migrate(struct address_space *mapping,
 static inline void balloon_page_insert(struct balloon_dev_info *balloon,
   struct page *page)
 {
-   __SetPageBalloon(page);
+   __SetPageOffline(page);
__SetPageMovable(page, balloon->inode->i_mapping);
set_page_private(page, (unsigned long)balloon);
list_add(&page->lru, &balloon->pages);
@@ -111,7 +111,7 @@ static inline void balloon_page_insert(struct 
balloon_dev_info *balloon,
  */
 static inline void balloon_page_delete(struct page *page)
 {
-   __ClearPageBalloon(page);
+   __ClearPageOffline(page);
__ClearPageMovable(page);
set_page_private(page, 0);
/*
@@ -141,13 +141,13 @@ static inline gfp_t balloon_mapping_gfp_mask(void)
 static inline void balloon_page_insert(struct balloon_dev_info *balloon,
   struct page *page)
 {
-   __SetPageBalloon(page);
+   __SetPageOffline(page);
list_add(&page->lru, &balloon->pages);
 }
 
 static inline void balloon_page_delete(struct page *page)
 {
-   __ClearPageBalloon(page);
+   __ClearPageOffline(page);
list_del(&page->lru);
 }
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 50ce1bddaf56..f91da3d0a67e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -670,7 +670,7 @@ PAGEFLAG_FALSE(DoubleMap)
 #define

[PATCH RFC 4/6] xen/balloon: mark inflated pages PG_offline

2018-11-14 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 drivers/xen/balloon.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 12148289debd..14dd6b814db3 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -425,6 +425,7 @@ static int xen_bring_pgs_online(struct page *pg, unsigned 
int order)
for (i = 0; i < size; i++) {
p = pfn_to_page(start_pfn + i);
__online_page_set_limits(p);
+   __SetPageOffline(p);
__balloon_append(p);
}
mutex_unlock(&balloon_mutex);
@@ -493,6 +494,7 @@ static enum bp_state increase_reservation(unsigned long 
nr_pages)
xenmem_reservation_va_mapping_update(1, &page, &frame_list[i]);
 
/* Relinquish the page back to the allocator. */
+   __ClearPageOffline(page);
free_reserved_page(page);
}
 
@@ -519,6 +521,7 @@ static enum bp_state decrease_reservation(unsigned long 
nr_pages, gfp_t gfp)
state = BP_EAGAIN;
break;
}
+   __SetPageOffline(page);
adjust_managed_page_count(page, -1);
xenmem_reservation_scrub_page(page);
list_add(&page->lru, &pages);
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH RFC 6/6] PM / Hibernate: exclude all PageOffline() pages

2018-11-14 Thread David Hildenbrand
The content of pages that are marked PG_offline is not of interest
(e.g. inflated by a balloon driver), let's skip these pages.

Cc: "Rafael J. Wysocki" 
Cc: Pavel Machek 
Cc: Len Brown 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 kernel/power/snapshot.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index b0308a2c6000..01db1d13481a 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone 
*zone, unsigned long pfn)
BUG_ON(!PageHighMem(page));
 
if (swsusp_page_is_forbidden(page) ||  swsusp_page_is_free(page) ||
-   PageReserved(page))
+   PageReserved(page) || PageOffline(page))
return NULL;
 
if (page_is_guard(page))
@@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, 
unsigned long pfn)
if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
return NULL;
 
+   if (PageOffline(page))
+   return NULL;
+
if (PageReserved(page)
&& (!kernel_page_present(page) || pfn_is_nosave(pfn)))
return NULL;
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH RFC 5/6] hv_balloon: mark inflated pages PG_offline

2018-11-14 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Kairui Song 
Cc: Vitaly Kuznetsov 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 drivers/hv/hv_balloon.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 5728dc470eeb..778b6f879d1c 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -681,8 +681,13 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 {
-   if (!has_pfn_is_backed(has, page_to_pfn(pg)))
+   if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
+   if (!PageOffline(pg))
+   __SetPageOffline(pg);
return;
+   }
+   if (PageOffline(pg))
+   __ClearPageOffline(pg);
 
/* This frame is currently backed; online the page. */
__online_page_set_limits(pg);
@@ -1200,6 +1205,7 @@ static void free_balloon_pages(struct hv_dynmem_device 
*dm,
 
for (i = 0; i < num_pages; i++) {
pg = pfn_to_page(i + start_frame);
+   __ClearPageOffline(pg);
__free_page(pg);
dm->num_pages_ballooned--;
}
@@ -1212,7 +1218,7 @@ static unsigned int alloc_balloon_pages(struct 
hv_dynmem_device *dm,
struct dm_balloon_response *bl_resp,
int alloc_unit)
 {
-   unsigned int i = 0;
+   unsigned int i, j;
struct page *pg;
 
if (num_pages < alloc_unit)
@@ -1244,6 +1250,10 @@ static unsigned int alloc_balloon_pages(struct 
hv_dynmem_device *dm,
if (alloc_unit != 1)
split_page(pg, get_order(alloc_unit << PAGE_SHIFT));
 
+   /* mark all pages offline */
+   for (j = 0; j < (1 << get_order(alloc_unit << PAGE_SHIFT)); j++)
+   __SetPageOffline(pg + j);
+
bl_resp->range_count++;
bl_resp->range_array[i].finfo.start_page =
page_to_pfn(pg);
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 2/6] mm: convert PG_balloon to PG_offline

2018-11-14 Thread David Hildenbrand
On 14.11.18 23:23, Matthew Wilcox wrote:
> On Wed, Nov 14, 2018 at 10:17:00PM +0100, David Hildenbrand wrote:
>> Rename PG_balloon to PG_offline. This is an indicator that the page is
>> logically offline, the content stale and that it should not be touched
>> (e.g. a hypervisor would have to allocate backing storage in order for the
>> guest to dump an unused page).  We can then e.g. exclude such pages from
>> dumps.
>>
>> In following patches, we will make use of this bit also in other balloon
>> drivers.  While at it, document PGTABLE.
> 
> Thank you for documenting PGTABLE.  I didn't realise I also had this
> document to update when I added PGTABLE.

Thank you for looking into this :)

> 
>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>> @@ -78,6 +78,8 @@ number of times a page is mapped.
>>  23. BALLOON
>>  24. ZERO_PAGE
>>  25. IDLE
>> +26. PGTABLE
>> +27. OFFLINE
> 
> So the offline *user* bit is new ... even though the *kernel* bit
> just renames the balloon bit.  I'm not sure how I feel about this.
> I'm going to think about it some more.  Could you share your decision
> process with us?

BALLOON was/is documented as

"23 - BALLOON
balloon compaction page
"

and only includes all virtio-ballon pages after the non-lru migration
feature has been implemented for ballooned pages. Since then, this flag
does basically no longer stands for what it actually was supposed to do.

To not break uapi I decided to not rename it but instead to add a new flag.

> 
> I have no objection to renaming the balloon bit inside the kernel; I
> think that's a wise idea.  I'm just not sure whether we should rename
> the user balloon bit rather than adding a new bit.
> 

Can we rename without breaking uapi?

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 0/6] mm/kdump: allow to exclude pages that are logically offline

2018-11-14 Thread David Hildenbrand
On 14.11.18 23:57, Nadav Amit wrote:
> From: David Hildenbrand
> Sent: November 14, 2018 at 9:16:58 PM GMT
>> Subject: [PATCH RFC 0/6] mm/kdump: allow to exclude pages that are logically 
>> offline
>>
>>
>> Right now, pages inflated as part of a balloon driver will be dumped
>> by dump tools like makedumpfile. While XEN is able to check in the
>> crash kernel whether a certain pfn is actuall backed by memory in the
>> hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
>> virtio-balloon and hv-balloon inflated memory will essentially result in
>> zero pages getting allocated by the hypervisor and the dump getting
>> filled with this data.
> 
> Is there any reason that VMware balloon driver is not mentioned?

Definitely ...

... not ;) . I haven't looked at vmware's balloon driver yet (I only saw
that there was quite some activity recently). I guess it should have
similar problems. (I mean reading and dumping data nobody cares about is
certainly not desired)

Can you share if something like this is also desired for vmware's
implementation? (I tagged this as RFC to get some more feedback)

It should in theory be as simple as adding a handful of
_SetPageOffline()/_ClearPageOffline() at the right spots.

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 2/6] mm: convert PG_balloon to PG_offline

2018-11-15 Thread David Hildenbrand
On 15.11.18 03:07, Mike Rapoport wrote:
> On Wed, Nov 14, 2018 at 11:49:15PM +0100, David Hildenbrand wrote:
>> On 14.11.18 23:23, Matthew Wilcox wrote:
>>> On Wed, Nov 14, 2018 at 10:17:00PM +0100, David Hildenbrand wrote:
>>>> Rename PG_balloon to PG_offline. This is an indicator that the page is
>>>> logically offline, the content stale and that it should not be touched
>>>> (e.g. a hypervisor would have to allocate backing storage in order for the
>>>> guest to dump an unused page).  We can then e.g. exclude such pages from
>>>> dumps.
>>>>
>>>> In following patches, we will make use of this bit also in other balloon
>>>> drivers.  While at it, document PGTABLE.
>>>
>>> Thank you for documenting PGTABLE.  I didn't realise I also had this
>>> document to update when I added PGTABLE.
>>
>> Thank you for looking into this :)
>>
>>>
>>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>>> @@ -78,6 +78,8 @@ number of times a page is mapped.
>>>>  23. BALLOON
>>>>  24. ZERO_PAGE
>>>>  25. IDLE
>>>> +26. PGTABLE
>>>> +27. OFFLINE
>>>
>>> So the offline *user* bit is new ... even though the *kernel* bit
>>> just renames the balloon bit.  I'm not sure how I feel about this.
>>> I'm going to think about it some more.  Could you share your decision
>>> process with us?
>>
>> BALLOON was/is documented as
>>
>> "23 - BALLOON
>> balloon compaction page
>> "
>>
>> and only includes all virtio-ballon pages after the non-lru migration
>> feature has been implemented for ballooned pages. Since then, this flag
>> does basically no longer stands for what it actually was supposed to do.
> 
> Perhaps I missing something, but how the user should interpret "23" when he
> reads /proc/kpageflags?

Looking at the history in more detail:

commit 09316c09dde33aae14f34489d9e3d243ec0d5938
Author: Konstantin Khlebnikov 
Date:   Thu Oct 9 15:29:32 2014 -0700

mm/balloon_compaction: add vmstat counters and kpageflags bit

Always mark pages with PageBalloon even if balloon compaction is
disabled
and expose this mark in /proc/kpageflags as KPF_BALLOON.


So KPF_BALLOON was exposed when virtio-balloon pages were always marked
with PG_balloon. So the documentation is actually wrong ("balloon page"
vs. "balloon compaction page").

I have no idea who actually used that information. I suspect this was
just some debugging aid.

> 
>> To not break uapi I decided to not rename it but instead to add a new flag.
> 
> I've got a feeling that uapi was anyway changed for the BALLON flag
> meaning.

Yes. If we *replace* KPF_BALLOON by KPF_OFFLINE

a) Some applications might no longer compile (I guess that's ok)
b) Some old applications will treat KPF_OFFLINE like KPF_BALLOON (which
should at least for virtio-balloon usage until now be fine - it is just
more generic)

So I guess it's up to Maintainers/Matthew to decide :)

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 3/6] kexec: export PG_offline to VMCOREINFO

2018-11-15 Thread David Hildenbrand
On 15.11.18 07:19, Dave Young wrote:
> Hi David,
> 
> On 11/14/18 at 10:17pm, David Hildenbrand wrote:
>> Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
>> makedumpfile can directly skip pages that are logically offline and the
>> content therefore stale.
> 
> It would be good to copy some background info from cover letter to the
> patch description so that we can get better understanding why this is
> needed now.

Yes, will add more detail!

> 
> BTW, Lianbo is working on a documentation of the vmcoreinfo exported
> fields. Ccing him so that he is aware of this.
> 
> Also cc Boris,  although I do not want the doc changes blocks this
> he might have different opinion :)

I'll be happy to help updating documentation (or updating it myself if
the doc updates go in first).

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 3/6] kexec: export PG_offline to VMCOREINFO

2018-11-15 Thread David Hildenbrand
On 15.11.18 12:10, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 02:19:23PM +0800, Dave Young wrote:
>> It would be good to copy some background info from cover letter to the
>> patch description so that we can get better understanding why this is
>> needed now.
>>
>> BTW, Lianbo is working on a documentation of the vmcoreinfo exported
>> fields. Ccing him so that he is aware of this.
>>
>> Also cc Boris,  although I do not want the doc changes blocks this
>> he might have different opinion :)
> 
> Yeah, my initial reaction is that exporting an mm-internal flag to
> userspace is a no-no.

Sorry to say, but that is the current practice without which
makedumpfile would not be able to work at all. (exclude user pages,
exclude page cache, exclude buddy pages). Let's not reinvent the wheel
here. This is how dumping works forever.

Also see how hwpoisoned pages are handled.

(please note that most distributions only support dumping via
makedumpfile, for said reasons)

> 
> What would be better, IMHO, is having a general method of telling the
> kdump kernel - maybe ranges of physical addresses - which to skip.

And that has to be updated whenever we change a page flag. But then the
dump kernel would have to be aware about "struct page" location and
format of some old kernel to be dump. Let's just not even discuss going
down that path.

> 
> Because the moment there's a set of pages which do not have PG_offline
> set but kdump would still like to skip, this breaks.

I don't understand your concern. PG_offline is only an optimization for
sections that are online. Offline sections can already be completely
ignored when dumping.

I don't see how there should be "set of pages which do not have
PG_offline". I mean if they don't have PG_offline they will simply be
dumped like before. Which worked forever. Sorry, I don't get your point.

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 3/6] kexec: export PG_offline to VMCOREINFO

2018-11-15 Thread David Hildenbrand
On 15.11.18 12:52, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
>> Sorry to say, but that is the current practice without which
>> makedumpfile would not be able to work at all. (exclude user pages,
>> exclude page cache, exclude buddy pages). Let's not reinvent the wheel
>> here. This is how dumping works forever.
> 
> Sorry, but "we've always done this in the past" doesn't make it better.

Just saying that "I'm not the first to do it, don't hit me with a stick" :)

> 
>> I don't see how there should be "set of pages which do not have
>> PG_offline".
> 
> It doesn't have to be a set of pages. Think a (mmconfig perhaps) region
> which the kdump kernel should completely skip because poking in it in
> the kdump kernel, causes all kinds of havoc like machine checks. etc.
> We've had and still have one issue like that.

Indeed. And we still have without makedumpfile. I think you are aware of
this, but I'll explain it just for consistency: PG_hwpoison

At some point we detect a HW error and mask a page as PG_hwpoison.

makedumpfile knows how to treat that flag and can exclude it from the
dump (== not access it). No crash.

kdump itself has no clue about old "struct pages". Especially:
a) Where they are located in memory (e.g. SPARSE)
b) What their format is ("where are the flags")
c) What the meaning of flags is ("what does bit X mean")

In order to know such information, we would have to do parsing of quite
some information inside the kernel in kdump. Basically what makedumpfile
does just now. Is this feasible? I don't think so.

So we would need another approach to communicate such information as you
said. I can't think of any, but if anybody reading this has an idea,
please speak up. I am interested.


The *only* way right now we would have to handle such scenarios:

1. While dumping memory and we get a machine check, fake reading a zero
page instead of crashing.
2. While dumping memory and we get a fault, fake reading a zero page
instead of crashing.

> 
> But let me clarify my note: I don't want to be discussing with you the
> design of makedumpfile and how it should or should not work - that ship
> has already sailed. Apparently there are valid reasons to do it this
> way.

Indeed, and the basic design is to export these flags. (let's say
"unfortunately", being able to handle such stuff in kdump directly would
be the dream).

> I was *simply* stating that it feels wrong to export mm flags like that.
> 
> But as I said already, that is mm guys' call and looking at how we're
> already exporting a bunch of stuff in the vmcoreinfo - including other
> mm flags - I guess one more flag doesn't matter anymore.

Fair enough, noted. If you have an idea how to handle this in kdump,
please let me know.

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 6/6] PM / Hibernate: exclude all PageOffline() pages

2018-11-15 Thread David Hildenbrand
On 15.11.18 13:23, Michal Hocko wrote:
> On Wed 14-11-18 22:17:04, David Hildenbrand wrote:
> [...]
>> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
>> index b0308a2c6000..01db1d13481a 100644
>> --- a/kernel/power/snapshot.c
>> +++ b/kernel/power/snapshot.c
>> @@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone 
>> *zone, unsigned long pfn)
>>  BUG_ON(!PageHighMem(page));
>>  
>>  if (swsusp_page_is_forbidden(page) ||  swsusp_page_is_free(page) ||
>> -PageReserved(page))
>> +PageReserved(page) || PageOffline(page))
>>  return NULL;
>>  
>>  if (page_is_guard(page))
>> @@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, 
>> unsigned long pfn)
>>  if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
>>  return NULL;
>>  
>> +if (PageOffline(page))
>> +return NULL;
>> +
>>  if (PageReserved(page)
>>  && (!kernel_page_present(page) || pfn_is_nosave(pfn)))
>>  return NULL;
> 
> Btw. now that you are touching this file could you also make
> s@pfn_to_page@pfn_to_online_page@ please? We really do not want to touch
> offline pfn ranges in general. A separate patch for that of course.
> 
> Thanks!
> 

Sure thing, will look into that!

Thanks!

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 0/6] mm/kdump: allow to exclude pages that are logically offline

2018-11-16 Thread David Hildenbrand
On 14.11.18 22:16, David Hildenbrand wrote:
> Right now, pages inflated as part of a balloon driver will be dumped
> by dump tools like makedumpfile. While XEN is able to check in the
> crash kernel whether a certain pfn is actuall backed by memory in the
> hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
> virtio-balloon and hv-balloon inflated memory will essentially result in
> zero pages getting allocated by the hypervisor and the dump getting
> filled with this data.
> 
> The allocation and reading of zero pages can directly be avoided if a
> dumping tool could know which pages only contain stale information not to
> be dumped.
> 
> Also for XEN, calling into the kernel and asking the hypervisor if a
> pfn is backed can be avoided if the duming tool would skip such pages
> right from the beginning.
> 
> Dumping tools have no idea whether a given page is part of a balloon driver
> and shall not be dumped. Esp. PG_reserved cannot be used for that purpose
> as all memory allocated during early boot is also PG_reserved, see
> discussion at [1]. So some other way of indication is required and a new
> page flag is frowned upon.
> 
> We have PG_balloon (MAPCOUNT value), which is essentially unused now. I
> suggest renaming it to something more generic (PG_offline) to mark pages as
> logically offline. This flag can than e.g. also be used by virtio-mem in
> the future to mark subsections as offline. Or by other code that wants to
> put pages logically offline (e.g. later maybe poisoned pages that shall
> no longer be used).
> 
> This series converts PG_balloon to PG_offline, allows dumping tools to
> query the value to detect such pages and marks pages in the hv-balloon
> and XEN balloon properly as PG_offline. Note that virtio-balloon already
> set pages to PG_balloon (and now PG_offline).
> 
> Please note that this is also helpful for a problem we were seeing under
> Hyper-V: Dumping logically offline memory (pages kept fake offline while
> onlining a section via online_page_callback) would under some condicions
> result in a kernel panic when dumping them.
> 
> As I don't have access to neither XEN nor Hyper-V installation, this was
> not tested yet (and a makedumpfile change will be required to skip
> dumping these pages).
> 
> [1] https://lkml.org/lkml/2018/7/20/566
> 
> David Hildenbrand (6):
>   mm: balloon: update comment about isolation/migration/compaction
>   mm: convert PG_balloon to PG_offline
>   kexec: export PG_offline to VMCOREINFO
>   xen/balloon: mark inflated pages PG_offline
>   hv_balloon: mark inflated pages PG_offline
>   PM / Hibernate: exclude all PageOffline() pages
> 
>  Documentation/admin-guide/mm/pagemap.rst |  6 +
>  drivers/hv/hv_balloon.c  | 14 --
>  drivers/xen/balloon.c|  3 +++
>  fs/proc/page.c   |  4 +--
>  include/linux/balloon_compaction.h   | 34 +---
>  include/linux/page-flags.h   | 11 +---
>  include/uapi/linux/kernel-page-flags.h   |  1 +
>  kernel/crash_core.c  |  2 ++
>  kernel/power/snapshot.c  |  5 +++-
>  tools/vm/page-types.c|  1 +
>  10 files changed, 51 insertions(+), 30 deletions(-)
> 

I just did a test with virtio-balloon (and a very simple makedumpfile
patch which I can supply on demand).

1. Guest with 8GB. Inflate balloon to 4GB via
 sudo virsh setmem f29 --size 4096M --live

2. Trigger a kernel panic in the guest
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

Original pages  : 0x001e1da8
  Excluded pages   : 0x001c9221
Pages filled with zero  : 0x50b0
Non-private cache pages : 0x00046547
Private cache pages : 0x2165
User process data pages : 0x48cf
Free pages  : 0x000771f6
Hwpoison pages  : 0x
Offline pages   : 0x0010
  Remaining pages  : 0x00018b87
  (The number of pages is reduced to 5%.)
Memory Hole : 0x0009e258
--
Total pages : 0x0028

(Offline patches matches the 4GB)

-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v1 0/8] mm/kdump: allow to exclude pages that are logically offline

2018-11-19 Thread David Hildenbrand
Right now, pages inflated as part of a balloon driver will be dumped
by dump tools like makedumpfile. While XEN is able to check in the
crash kernel whether a certain pfn is actuall backed by memory in the
hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
virtio-balloon, hv-balloon and VMWare balloon inflated memory will
essentially result in zero pages getting allocated by the hypervisor and
the dump getting filled with this data.

The allocation and reading of zero pages can directly be avoided if a
dumping tool could know which pages only contain stale information not to
be dumped.

Also for XEN, calling into the kernel and asking the hypervisor if a
pfn is backed can be avoided if the duming tool would skip such pages
right from the beginning.

Dumping tools have no idea whether a given page is part of a balloon driver
and shall not be dumped. Esp. PG_reserved cannot be used for that purpose
as all memory allocated during early boot is also PG_reserved, see
discussion at [1]. So some other way of indication is required and a new
page flag is frowned upon.

We have PG_balloon (MAPCOUNT value), which is essentially unused now. I
suggest renaming it to something more generic (PG_offline) to mark pages as
logically offline. This flag can than e.g. also be used by virtio-mem in
the future to mark subsections as offline. Or by other code that wants to
put pages logically offline (e.g. later maybe poisoned pages that shall
no longer be used).

This series converts PG_balloon to PG_offline, allows dumping tools to
query the value to detect such pages and marks pages in the hv-balloon
and XEN balloon properly as PG_offline. Note that virtio-balloon already
set pages to PG_balloon (and now PG_offline).

Please note that this is also helpful for a problem we were seeing under
Hyper-V: Dumping logically offline memory (pages kept fake offline while
onlining a section via online_page_callback) would under some condicions
result in a kernel panic when dumping them.

As I don't have access to neither XEN nor Hyper-V nor VMWare installations,
this was only tested with the virtio-balloon and pages were properly
skipped when dumping. I'll also attach the makedumpfile patch to this
series.

[1] https://lkml.org/lkml/2018/7/20/566

RFC -> v1:
- Add "PM / Hibernate: use pfn_to_online_page()"
- Add "vmw_balloon: mark inflated pages PG_offline"
- "mm: convert PG_balloon to PG_offline"
-- After discussions, also rename the UAPI bit name (KPF_BALLOON -> KPF_OFFLINE)

David Hildenbrand (8):
  mm: balloon: update comment about isolation/migration/compaction
  mm: convert PG_balloon to PG_offline
  kexec: export PG_offline to VMCOREINFO
  xen/balloon: mark inflated pages PG_offline
  hv_balloon: mark inflated pages PG_offline
  vmw_balloon: mark inflated pages PG_offline
  PM / Hibernate: use pfn_to_online_page()
  PM / Hibernate: exclude all PageOffline() pages

 Documentation/admin-guide/mm/pagemap.rst |  9 ---
 drivers/hv/hv_balloon.c  | 14 --
 drivers/misc/vmw_balloon.c   | 32 ++
 drivers/xen/balloon.c|  3 +++
 fs/proc/page.c   |  4 +--
 include/linux/balloon_compaction.h   | 34 +---
 include/linux/page-flags.h   | 11 +---
 include/uapi/linux/kernel-page-flags.h   |  2 +-
 kernel/crash_core.c  |  2 ++
 kernel/power/snapshot.c  | 13 +
 tools/vm/page-types.c|  2 +-
 11 files changed, 87 insertions(+), 39 deletions(-)

-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v1 3/8] kexec: export PG_offline to VMCOREINFO

2018-11-19 Thread David Hildenbrand
Right now, pages inflated as part of a balloon driver will be dumped
by dump tools like makedumpfile. While XEN is able to check in the
crash kernel whether a certain pfn is actuall backed by memory in the
hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
other balloon inflated memory will essentially result in zero pages getting
allocated by the hypervisor and the dump getting filled with this data.

The allocation and reading of zero pages can directly be avoided if a
dumping tool could know which pages only contain stale information not to
be dumped.

We now have PG_offline which can be (and already is by virtio-balloon)
used for marking pages as logically offline. Follow up patches will
make use of this flag also in other balloon implementations.

Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
makedumpfile can directly skip pages that are logically offline and the
content therefore stale.

Please note that this is also helpful for a problem we were seeing under
Hyper-V: Dumping logically offline memory (pages kept fake offline while
onlining a section via online_page_callback) would under some condicions
result in a kernel panic when dumping them.

Cc: Andrew Morton 
Cc: Dave Young 
Cc: "Kirill A. Shutemov" 
Cc: Baoquan He 
Cc: Omar Sandoval 
Cc: Arnd Bergmann 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Cc: Lianbo Jiang 
Cc: Borislav Petkov 
Cc: Kazuhito Hagio 
Signed-off-by: David Hildenbrand 
---
 kernel/crash_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 933cb3e45b98..093c9f917ed0 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
 #ifdef CONFIG_HUGETLB_PAGE
VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
+#define PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
+   VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
 #endif
 
arch_crash_save_vmcoreinfo();
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v1 1/8] mm: balloon: update comment about isolation/migration/compaction

2018-11-19 Thread David Hildenbrand
Commit b1123ea6d3b3 ("mm: balloon: use general non-lru movable page
feature") reworked balloon handling to make use of the general
non-lru movable page feature. The big comment block in
balloon_compaction.h contains quite some outdated information. Let's fix
this.

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 include/linux/balloon_compaction.h | 26 +-
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index 53051f3d8f25..cbe50da5a59d 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -4,15 +4,18 @@
  *
  * Common interface definitions for making balloon pages movable by compaction.
  *
- * Despite being perfectly possible to perform ballooned pages migration, they
- * make a special corner case to compaction scans because balloon pages are not
- * enlisted at any LRU list like the other pages we do compact / migrate.
+ * Balloon page migration makes use of the general non-lru movable page
+ * feature.
+ *
+ * page->private is used to reference the responsible balloon device.
+ * page->mapping is used in context of non-lru page migration to reference
+ * the address space operations for page isolation/migration/compaction.
  *
  * As the page isolation scanning step a compaction thread does is a lockless
  * procedure (from a page standpoint), it might bring some racy situations 
while
  * performing balloon page compaction. In order to sort out these racy 
scenarios
  * and safely perform balloon's page compaction and migration we must, always,
- * ensure following these three simple rules:
+ * ensure following these simple rules:
  *
  *   i. when updating a balloon's page ->mapping element, strictly do it under
  *  the following lock order, independently of the far superior
@@ -21,19 +24,8 @@
  *   +--spin_lock_irq(&b_dev_info->pages_lock);
  * ... page->mapping updates here ...
  *
- *  ii. before isolating or dequeueing a balloon page from the balloon device
- *  pages list, the page reference counter must be raised by one and the
- *  extra refcount must be dropped when the page is enqueued back into
- *  the balloon device page list, thus a balloon page keeps its reference
- *  counter raised only while it is under our special handling;
- *
- * iii. after the lockless scan step have selected a potential balloon page for
- *  isolation, re-test the PageBalloon mark and the PagePrivate flag
- *  under the proper page lock, to ensure isolating a valid balloon page
- *  (not yet isolated, nor under release procedure)
- *
- *  iv. isolation or dequeueing procedure must clear PagePrivate flag under
- *  page lock together with removing page from balloon device page list.
+ *  ii. isolation or dequeueing procedure must remove the page from balloon
+ *  device page list under b_dev_info->pages_lock.
  *
  * The functions provided by this interface are placed to help on coping with
  * the aforementioned balloon page corner case, as well as to ensure the simple
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v1 2/8] mm: convert PG_balloon to PG_offline

2018-11-19 Thread David Hildenbrand
PG_balloon was introduced to implement page migration/compaction for pages
inflated in virtio-balloon. Nowadays, it is only a marker that a page is
part of virtio-balloon and therefore logically offline.

We also want to make use of this flag in other balloon drivers - for
inflated pages or when onlining a section but keeping some pages offline
(e.g. used right now by XEN and Hyper-V via set_online_page_callback()).

We are going to expose this flag to dump tools like makedumpfile. But
instead of exposing PG_balloon, let's generalize the concept of marking
pages as logically offline, so it can be reused for other purposes
later on.

Rename PG_balloon to PG_offline. This is an indicator that the page is
logically offline, the content stale and that it should not be touched
(e.g. a hypervisor would have to allocate backing storage in order for the
guest to dump an unused page).  We can then e.g. exclude such pages from
dumps.

We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
(and for now the semantics stay the same).  In following patches, we will
make use of this bit also in other balloon drivers. While at it, document
PGTABLE.

Cc: Jonathan Corbet 
Cc: Alexey Dobriyan 
Cc: Mike Rapoport 
Cc: Andrew Morton 
Cc: Christian Hansen 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Stephen Rothwell 
Cc: Matthew Wilcox 
Cc: "Michael S. Tsirkin" 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Alexander Duyck 
Cc: Naoya Horiguchi 
Cc: Miles Chen 
Cc: David Rientjes 
Cc: Konstantin Khlebnikov 
Cc: Kazuhito Hagio 
Signed-off-by: David Hildenbrand 
---
 Documentation/admin-guide/mm/pagemap.rst |  9 ++---
 fs/proc/page.c   |  4 ++--
 include/linux/balloon_compaction.h   |  8 
 include/linux/page-flags.h   | 11 +++
 include/uapi/linux/kernel-page-flags.h   |  2 +-
 tools/vm/page-types.c|  2 +-
 6 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/Documentation/admin-guide/mm/pagemap.rst 
b/Documentation/admin-guide/mm/pagemap.rst
index 3f7bade2c231..340a5aee9b80 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -75,9 +75,10 @@ number of times a page is mapped.
 20. NOPAGE
 21. KSM
 22. THP
-23. BALLOON
+23. OFFLINE
 24. ZERO_PAGE
 25. IDLE
+26. PGTABLE
 
  * ``/proc/kpagecgroup``.  This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -118,8 +119,8 @@ Short descriptions to the page flags
 identical memory pages dynamically shared between one or more processes
 22 - THP
 contiguous pages which construct transparent hugepages
-23 - BALLOON
-balloon compaction page
+23 - OFFLINE
+page is logically offline
 24 - ZERO_PAGE
 zero page for pfn_zero or huge_zero page
 25 - IDLE
@@ -128,6 +129,8 @@ Short descriptions to the page flags
 Note that this flag may be stale in case the page was accessed via
 a PTE. To make sure the flag is up-to-date one has to read
 ``/sys/kernel/mm/page_idle/bitmap`` first.
+26 - PGTABLE
+page is in use as a page table
 
 IO related page flags
 -
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 6c517b11acf8..378401af4d9d 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -152,8 +152,8 @@ u64 stable_page_flags(struct page *page)
else if (page_count(page) == 0 && is_free_buddy_page(page))
u |= 1 << KPF_BUDDY;
 
-   if (PageBalloon(page))
-   u |= 1 << KPF_BALLOON;
+   if (PageOffline(page))
+   u |= 1 << KPF_OFFLINE;
if (PageTable(page))
u |= 1 << KPF_PGTABLE;
 
diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index cbe50da5a59d..f111c780ef1d 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -95,7 +95,7 @@ extern int balloon_page_migrate(struct address_space *mapping,
 static inline void balloon_page_insert(struct balloon_dev_info *balloon,
   struct page *page)
 {
-   __SetPageBalloon(page);
+   __SetPageOffline(page);
__SetPageMovable(page, balloon->inode->i_mapping);
set_page_private(page, (unsigned long)balloon);
list_add(&page->lru, &balloon->pages);
@@ -111,7 +111,7 @@ static inline void balloon_page_insert(struct 
balloon_dev_info *balloon,
  */
 static inline void balloon_page_delete(struct page *page)
 {
-   __ClearPageBalloon(page);
+   __ClearPageOffline(page);
__ClearPageMovable(page);
set_page_private(page, 0);
/*
@@ -141,13 +141,13 @@ static inline gfp_t balloon_mapping_gfp_mask(void)
 static inline void balloon_page_insert(struct balloon_dev_info *balloon,
   struct page *page)
 {
-   _

[PATCH v1 4/8] xen/balloon: mark inflated pages PG_offline

2018-11-19 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 drivers/xen/balloon.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 12148289debd..14dd6b814db3 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -425,6 +425,7 @@ static int xen_bring_pgs_online(struct page *pg, unsigned 
int order)
for (i = 0; i < size; i++) {
p = pfn_to_page(start_pfn + i);
__online_page_set_limits(p);
+   __SetPageOffline(p);
__balloon_append(p);
}
mutex_unlock(&balloon_mutex);
@@ -493,6 +494,7 @@ static enum bp_state increase_reservation(unsigned long 
nr_pages)
xenmem_reservation_va_mapping_update(1, &page, &frame_list[i]);
 
/* Relinquish the page back to the allocator. */
+   __ClearPageOffline(page);
free_reserved_page(page);
}
 
@@ -519,6 +521,7 @@ static enum bp_state decrease_reservation(unsigned long 
nr_pages, gfp_t gfp)
state = BP_EAGAIN;
break;
}
+   __SetPageOffline(page);
adjust_managed_page_count(page, -1);
xenmem_reservation_scrub_page(page);
list_add(&page->lru, &pages);
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v1 5/8] hv_balloon: mark inflated pages PG_offline

2018-11-19 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Kairui Song 
Cc: Vitaly Kuznetsov 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 drivers/hv/hv_balloon.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 211f3fe3a038..47719862e57f 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -681,8 +681,13 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 {
-   if (!has_pfn_is_backed(has, page_to_pfn(pg)))
+   if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
+   if (!PageOffline(pg))
+   __SetPageOffline(pg);
return;
+   }
+   if (PageOffline(pg))
+   __ClearPageOffline(pg);
 
/* This frame is currently backed; online the page. */
__online_page_set_limits(pg);
@@ -1201,6 +1206,7 @@ static void free_balloon_pages(struct hv_dynmem_device 
*dm,
 
for (i = 0; i < num_pages; i++) {
pg = pfn_to_page(i + start_frame);
+   __ClearPageOffline(pg);
__free_page(pg);
dm->num_pages_ballooned--;
}
@@ -1213,7 +1219,7 @@ static unsigned int alloc_balloon_pages(struct 
hv_dynmem_device *dm,
struct dm_balloon_response *bl_resp,
int alloc_unit)
 {
-   unsigned int i = 0;
+   unsigned int i, j;
struct page *pg;
 
if (num_pages < alloc_unit)
@@ -1245,6 +1251,10 @@ static unsigned int alloc_balloon_pages(struct 
hv_dynmem_device *dm,
if (alloc_unit != 1)
split_page(pg, get_order(alloc_unit << PAGE_SHIFT));
 
+   /* mark all pages offline */
+   for (j = 0; j < (1 << get_order(alloc_unit << PAGE_SHIFT)); j++)
+   __SetPageOffline(pg + j);
+
bl_resp->range_count++;
bl_resp->range_array[i].finfo.start_page =
page_to_pfn(pg);
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v1 8/8] PM / Hibernate: exclude all PageOffline() pages

2018-11-19 Thread David Hildenbrand
The content of pages that are marked PG_offline is not of interest
(e.g. inflated by a balloon driver), let's skip these pages.

Cc: "Rafael J. Wysocki" 
Cc: Pavel Machek 
Cc: Len Brown 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Acked-by: Pavel Machek 
Signed-off-by: David Hildenbrand 
---
 kernel/power/snapshot.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 87e6dd57819f..8d7b4d458842 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone 
*zone, unsigned long pfn)
BUG_ON(!PageHighMem(page));
 
if (swsusp_page_is_forbidden(page) ||  swsusp_page_is_free(page) ||
-   PageReserved(page))
+   PageReserved(page) || PageOffline(page))
return NULL;
 
if (page_is_guard(page))
@@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, 
unsigned long pfn)
if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
return NULL;
 
+   if (PageOffline(page))
+   return NULL;
+
if (PageReserved(page)
&& (!kernel_page_present(page) || pfn_is_nosave(pfn)))
return NULL;
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v1 7/8] PM / Hibernate: use pfn_to_online_page()

2018-11-19 Thread David Hildenbrand
Let's use pfn_to_online_page() instead of pfn_to_page() when checking
for saveable pages to not save/restore offline memory sections.

Cc: "Rafael J. Wysocki" 
Cc: Pavel Machek 
Cc: Len Brown 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Suggested-by: Michal Hocko 
Signed-off-by: David Hildenbrand 
---
 kernel/power/snapshot.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 640b2034edd6..87e6dd57819f 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1215,8 +1215,8 @@ static struct page *saveable_highmem_page(struct zone 
*zone, unsigned long pfn)
if (!pfn_valid(pfn))
return NULL;
 
-   page = pfn_to_page(pfn);
-   if (page_zone(page) != zone)
+   page = pfn_to_online_page(pfn);
+   if (!page || page_zone(page) != zone)
return NULL;
 
BUG_ON(!PageHighMem(page));
@@ -1277,8 +1277,8 @@ static struct page *saveable_page(struct zone *zone, 
unsigned long pfn)
if (!pfn_valid(pfn))
return NULL;
 
-   page = pfn_to_page(pfn);
-   if (page_zone(page) != zone)
+   page = pfn_to_online_page(pfn);
+   if (!page || page_zone(page) != zone)
return NULL;
 
BUG_ON(PageHighMem(page));
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v1 6/8] vmw_balloon: mark inflated pages PG_offline

2018-11-19 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: Xavier Deguillard 
Cc: Nadav Amit 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: Julien Freche 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 drivers/misc/vmw_balloon.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index e6126a4b95d3..8cc8bd9a4e32 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -544,6 +544,36 @@ unsigned int vmballoon_page_order(enum 
vmballoon_page_size_type page_size)
return page_size == VMW_BALLOON_2M_PAGE ? VMW_BALLOON_2M_ORDER : 0;
 }
 
+/**
+ * vmballoon_mark_page_offline() - mark a page as offline
+ * @page: pointer for the page
+ * @page_size: the size of the page.
+ */
+static void
+vmballoon_mark_page_offline(struct page *page,
+   enum vmballoon_page_size_type page_size)
+{
+   int i;
+
+   for (i = 0; i < 1ULL << vmballoon_page_order(page_size); i++)
+   __SetPageOffline(page + i);
+}
+
+/**
+ * vmballoon_mark_page_online() - mark a page as online
+ * @page: pointer for the page
+ * @page_size: the size of the page.
+ */
+static void
+vmballoon_mark_page_online(struct page *page,
+  enum vmballoon_page_size_type page_size)
+{
+   int i;
+
+   for (i = 0; i < 1ULL << vmballoon_page_order(page_size); i++)
+   __ClearPageOffline(page + i);
+}
+
 /**
  * vmballoon_page_in_frames() - returns the number of frames in a page.
  * @page_size: the size of the page.
@@ -612,6 +642,7 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
 ctl->page_size);
 
if (page) {
+   vmballoon_mark_page_offline(page, ctl->page_size);
/* Success. Add the page to the list and continue. */
list_add(&page->lru, &ctl->pages);
continue;
@@ -850,6 +881,7 @@ static void vmballoon_release_page_list(struct list_head 
*page_list,
 
list_for_each_entry_safe(page, tmp, page_list, lru) {
list_del(&page->lru);
+   vmballoon_mark_page_online(page, page_size);
__free_pages(page, vmballoon_page_order(page_size));
}
 
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v1] makedumpfile: exclude pages that are logically offline

2018-11-19 Thread David Hildenbrand
Linux marks pages that are logically offline via a page flag (map count).
Such pages e.g. include pages infated as part of a balloon driver or
pages that were not actually onlined when onlining the whole section.

While the hypervisor usually allows to read such inflated memory, we
basically read and dump data that is completely irrelevant. Also, this
might result in quite some overhead in the hypervisor. In addition,
we saw some problems under Hyper-V, whereby we can crash the kernel by
dumping, when reading memory of a partially onlined memory segment
(for memory added by the Hyper-V balloon driver).

Therefore, don't read and dump pages that are marked as being logically
offline.

Signed-off-by: David Hildenbrand 
---
 makedumpfile.c | 34 ++
 makedumpfile.h |  1 +
 2 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index 8923538..b8bfd4c 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -88,6 +88,7 @@ mdf_pfn_t pfn_cache_private;
 mdf_pfn_t pfn_user;
 mdf_pfn_t pfn_free;
 mdf_pfn_t pfn_hwpoison;
+mdf_pfn_t pfn_offline;
 
 mdf_pfn_t num_dumped;
 
@@ -249,6 +250,21 @@ isHugetlb(unsigned long dtor)
 && (SYMBOL(free_huge_page) == dtor));
 }
 
+static int
+isOffline(unsigned long flags, unsigned int _mapcount)
+{
+   if (NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE) == NOT_FOUND_NUMBER)
+   return FALSE;
+
+   if (flags & (1UL << NUMBER(PG_slab)))
+   return FALSE;
+
+   if (_mapcount == (int)NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE))
+   return TRUE;
+
+   return FALSE;
+}
+
 static int
 is_cache_page(unsigned long flags)
 {
@@ -2287,6 +2303,8 @@ write_vmcoreinfo_data(void)
WRITE_NUMBER("PG_hwpoison", PG_hwpoison);
 
WRITE_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE);
+   WRITE_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE",
+PAGE_OFFLINE_MAPCOUNT_VALUE);
WRITE_NUMBER("phys_base", phys_base);
 
WRITE_NUMBER("HUGETLB_PAGE_DTOR", HUGETLB_PAGE_DTOR);
@@ -2687,6 +2705,7 @@ read_vmcoreinfo(void)
READ_SRCFILE("pud_t", pud_t);
 
READ_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE);
+   READ_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE", PAGE_OFFLINE_MAPCOUNT_VALUE);
READ_NUMBER("phys_base", phys_base);
 #ifdef __aarch64__
READ_NUMBER("VA_BITS", VA_BITS);
@@ -6041,6 +6060,12 @@ __exclude_unnecessary_pages(unsigned long mem_map,
else if (isHWPOISON(flags)) {
pfn_counter = &pfn_hwpoison;
}
+   /*
+* Exclude pages that are logically offline.
+*/
+   else if (isOffline(flags, _mapcount)) {
+   pfn_counter = &pfn_offline;
+   }
/*
 * Unexcludable page
 */
@@ -7522,7 +7547,7 @@ write_elf_pages_cyclic(struct cache_data *cd_header, 
struct cache_data *cd_page)
 */
if (info->flag_cyclic) {
pfn_zero = pfn_cache = pfn_cache_private = 0;
-   pfn_user = pfn_free = pfn_hwpoison = 0;
+   pfn_user = pfn_free = pfn_hwpoison = pfn_offline = 0;
pfn_memhole = info->max_mapnr;
}
 
@@ -8804,7 +8829,7 @@ write_kdump_pages_and_bitmap_cyclic(struct cache_data 
*cd_header, struct cache_d
 * Reset counter for debug message.
 */
pfn_zero = pfn_cache = pfn_cache_private = 0;
-   pfn_user = pfn_free = pfn_hwpoison = 0;
+   pfn_user = pfn_free = pfn_hwpoison = pfn_offline = 0;
pfn_memhole = info->max_mapnr;
 
/*
@@ -9749,7 +9774,7 @@ print_report(void)
pfn_original = info->max_mapnr - pfn_memhole;
 
pfn_excluded = pfn_zero + pfn_cache + pfn_cache_private
-   + pfn_user + pfn_free + pfn_hwpoison;
+   + pfn_user + pfn_free + pfn_hwpoison + pfn_offline;
shrinking = (pfn_original - pfn_excluded) * 100;
shrinking = shrinking / pfn_original;
 
@@ -9763,6 +9788,7 @@ print_report(void)
REPORT_MSG("User process data pages : 0x%016llx\n", pfn_user);
REPORT_MSG("Free pages  : 0x%016llx\n", pfn_free);
REPORT_MSG("Hwpoison pages  : 0x%016llx\n", pfn_hwpoison);
+   REPORT_MSG("Offline pages   : 0x%016llx\n", pfn_offline);
REPORT_MSG("  Remaining pages  : 0x%016llx\n",
pfn_original - pfn_excluded);
REPORT_MSG("  (The number of pages is reduced to %lld%%.)\n",
@@ -9790,7 +9816,7 @@ print_mem_usage(void)
pfn_original = info->max_mapnr - pfn_memhole;
 
pfn_excluded = pfn_zero + pfn_cache + 

Re: [PATCH v1 4/8] xen/balloon: mark inflated pages PG_offline

2018-11-19 Thread David Hildenbrand
On 19.11.18 13:22, Juergen Gross wrote:
> On 19/11/2018 11:16, David Hildenbrand wrote:
>> Mark inflated and never onlined pages PG_offline, to tell the world that
>> the content is stale and should not be dumped.
>>
>> Cc: Boris Ostrovsky 
>> Cc: Juergen Gross 
>> Cc: Stefano Stabellini 
>> Cc: Andrew Morton 
>> Cc: Matthew Wilcox 
>> Cc: Michal Hocko 
>> Cc: "Michael S. Tsirkin" 
>> Signed-off-by: David Hildenbrand 
>> ---
>>  drivers/xen/balloon.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
>> index 12148289debd..14dd6b814db3 100644
>> --- a/drivers/xen/balloon.c
>> +++ b/drivers/xen/balloon.c
>> @@ -425,6 +425,7 @@ static int xen_bring_pgs_online(struct page *pg, 
>> unsigned int order)
>>  for (i = 0; i < size; i++) {
>>  p = pfn_to_page(start_pfn + i);
>>  __online_page_set_limits(p);
>> +__SetPageOffline(p);
>>  __balloon_append(p);
>>  }
> 
> This seems not to be based on current master. Could you please tell
> against which tree this should be reviewed?
> 
Hi Juergen,

this is based on linux-next/master.

Thanks!

> 
> Juergen
> 


-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v1 2/8] mm: convert PG_balloon to PG_offline

2018-11-19 Thread David Hildenbrand
>  
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 50ce1bddaf56..f91da3d0a67e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -670,7 +670,7 @@ PAGEFLAG_FALSE(DoubleMap)
>  #define PAGE_TYPE_BASE   0xf000
>  /* Reserve   0x007f to catch underflows of page_mapcount */
>  #define PG_buddy 0x0080
> -#define PG_balloon   0x0100
> +#define PG_offline   0x0100
>  #define PG_kmemcg0x0200
>  #define PG_table 0x0400
>  
> @@ -700,10 +700,13 @@ static __always_inline void __ClearPage##uname(struct 
> page *page)   \
>  PAGE_TYPE_OPS(Buddy, buddy)
>  
>  /*
> - * PageBalloon() is true for pages that are on the balloon page list
> - * (see mm/balloon_compaction.c).
> + * PageOffline() indicates that the pages is logically offline although the

s/pages/page/

:)


-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v1 5/8] hv_balloon: mark inflated pages PG_offline

2018-11-20 Thread David Hildenbrand
On 20.11.18 09:45, Pankaj Gupta wrote:
> 
> Hi David,
> 
>>
>> Mark inflated and never onlined pages PG_offline, to tell the world that
>> the content is stale and should not be dumped.
>>
>> Cc: "K. Y. Srinivasan" 
>> Cc: Haiyang Zhang 
>> Cc: Stephen Hemminger 
>> Cc: Kairui Song 
>> Cc: Vitaly Kuznetsov 
>> Cc: Andrew Morton 
>> Cc: Matthew Wilcox 
>> Cc: Michal Hocko 
>> Cc: "Michael S. Tsirkin" 
>> Signed-off-by: David Hildenbrand 
>> ---
>>  drivers/hv/hv_balloon.c | 14 --
>>  1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
>> index 211f3fe3a038..47719862e57f 100644
>> --- a/drivers/hv/hv_balloon.c
>> +++ b/drivers/hv/hv_balloon.c
>> @@ -681,8 +681,13 @@ static struct notifier_block hv_memory_nb = {
>>  /* Check if the particular page is backed and can be onlined and online it.
>>  */
>>  static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
>>  {
>> -if (!has_pfn_is_backed(has, page_to_pfn(pg)))
>> +if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
>> +if (!PageOffline(pg))
>> +__SetPageOffline(pg);
>>  return;
>> +}
>> +if (PageOffline(pg))
>> +__ClearPageOffline(pg);
>>  
>>  /* This frame is currently backed; online the page. */
>>  __online_page_set_limits(pg);
>> @@ -1201,6 +1206,7 @@ static void free_balloon_pages(struct hv_dynmem_device
>> *dm,
>>  
>>  for (i = 0; i < num_pages; i++) {
>>  pg = pfn_to_page(i + start_frame);
>> +__ClearPageOffline(pg);
> 
> Just thinking, do we need to care for clearing PageOffline flag before freeing
> a balloon'd page?

Yes we have to otherwise the code will crash when trying to set PageBuddy.

(only one page type at a time may be set right now, and it makes sense.
A page that is offline cannot e.g. be a buddy page)

So PageOffline is completely managed by the page owner.

> 
> Thanks,
> Pankaj


-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v1 3/8] kexec: export PG_offline to VMCOREINFO

2018-11-21 Thread David Hildenbrand
On 21.11.18 07:04, Baoquan He wrote:
> On 11/19/18 at 11:16am, David Hildenbrand wrote:
>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
>> index 933cb3e45b98..093c9f917ed0 100644
>> --- a/kernel/crash_core.c
>> +++ b/kernel/crash_core.c
>> @@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void)
>>  VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
>>  #ifdef CONFIG_HUGETLB_PAGE
>>  VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
>> +#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline)
>> +VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
>>  #endif
> 
> This solution looks good to me. One small concern is why we don't
> export PG_offline to vmcoreinfo directly, then define
> PAGE_OFFLINE_MAPCOUNT_VALUE in makedumpfile. We have been exporting
> kernel data/MACRO directly, why this one is exceptional.
> 

1. We are much more similar to PG_buddy (in contrast to actual page
flags), and for PG_buddy it is historically handled like this (and I
think it makes sense to expose these as actual MAPCOUNT_VALUEs).

2. Right now only one page type per page is supported. Therefore only
exactly one value in mapcount indicates e.g. PageBuddy()/PageOffline().

Now, if we ever decide to change this (e.g. treat them like real flags),
it is much easier to switch to PG_offline/PG_buddy then. We can directly
see in makedumpfile that .*_MAPCOUNT_VALUE is no longer available but
instead e.g. PG_offline and PG_buddy. Instead we would no see a change
in makedumpfile and would have to rely on other properties.

If there are no strong opinions I will leave it like this.

Thanks!

> Thanks
> Baoquan
> 


-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v1 6/8] vmw_balloon: mark inflated pages PG_offline

2018-11-21 Thread David Hildenbrand
On 21.11.18 04:22, Nadav Amit wrote:
> Thanks for this patch!
> 
>> On Nov 19, 2018, at 2:16 AM, David Hildenbrand  wrote:
>>
>> Mark inflated and never onlined pages PG_offline, to tell the world that
>> the content is stale and should not be dumped.
>>
>> Cc: Xavier Deguillard 
>> Cc: Nadav Amit 
>> Cc: Arnd Bergmann 
>> Cc: Greg Kroah-Hartman 
>> Cc: Julien Freche 
>> Cc: Andrew Morton 
>> Cc: Matthew Wilcox 
>> Cc: Michal Hocko 
>> Cc: "Michael S. Tsirkin" 
>> Signed-off-by: David Hildenbrand 
>> ---
>> drivers/misc/vmw_balloon.c | 32 
>> 1 file changed, 32 insertions(+)
>>
>> diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
>> index e6126a4b95d3..8cc8bd9a4e32 100644
>> --- a/drivers/misc/vmw_balloon.c
>> +++ b/drivers/misc/vmw_balloon.c
>> @@ -544,6 +544,36 @@ unsigned int vmballoon_page_order(enum 
>> vmballoon_page_size_type page_size)
>>  return page_size == VMW_BALLOON_2M_PAGE ? VMW_BALLOON_2M_ORDER : 0;
>> }
>>
>> +/**
>> + * vmballoon_mark_page_offline() - mark a page as offline
>> + * @page: pointer for the page
> 
> If possible, please add a period at the end of the sentence (yes, I know I
> got it wrong in some places too).

Sure :)

> 
>> + * @page_size: the size of the page.
>> + */
>> +static void
>> +vmballoon_mark_page_offline(struct page *page,
>> +enum vmballoon_page_size_type page_size)
>> +{
>> +int i;
>> +
>> +for (i = 0; i < 1ULL << vmballoon_page_order(page_size); i++)
> 
> Can you please do instead:
> 
>   unsigned int;
> 
>   for (i = 0; i < vmballoon_page_in_frames(page_size); i++)
> 

Will do, will have to move both functions a little bit down in the file
(exactly one function).


> We would like to test it in the next few days, but in the meanwhile, after
> you address these minor issues:
> 
> Acked-by: Nadav Amit 

Thanks!

> 
> Thanks again,
> Nadav 
> 


-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v1 8/8] PM / Hibernate: exclude all PageOffline() pages

2018-11-21 Thread David Hildenbrand
On 21.11.18 12:35, William Kucharski wrote:
> If you are adding PageOffline(page) to the condition list of the already 
> existing if in
> saveable_highmem_page(), why explicitly add it as a separate statement in 
> saveable_page()?
> 
> It would seem more consistent to make the second check:
> 
> - if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
> + if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page) ||
> + PageOffline(page))
> 
> instead.
> 
> It's admittedly a nit but it just seems cleaner to either do that or, if your 
> intention
> was to separate the Page checks from the swsusp checks, to break the calls to
> PageReserved() and PageOffline() into their own check in 
> saveable_highmem_page().

I'll split PageReserved() and PageOffline() off from the swsusp checks,
thanks for your comment!

> 
> Thanks!
> -- Bill


-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v1] makedumpfile: exclude pages that are logically offline

2018-11-21 Thread David Hildenbrand
On 21.11.18 15:58, Kazuhito Hagio wrote:
> Hi David,
> 
>> Linux marks pages that are logically offline via a page flag (map count).
>> Such pages e.g. include pages infated as part of a balloon driver or
>> pages that were not actually onlined when onlining the whole section.
>>
>> While the hypervisor usually allows to read such inflated memory, we
>> basically read and dump data that is completely irrelevant. Also, this
>> might result in quite some overhead in the hypervisor. In addition,
>> we saw some problems under Hyper-V, whereby we can crash the kernel by
>> dumping, when reading memory of a partially onlined memory segment
>> (for memory added by the Hyper-V balloon driver).
>>
>> Therefore, don't read and dump pages that are marked as being logically
>> offline.
>>
>> Signed-off-by: David Hildenbrand 
>> ---
>>  makedumpfile.c | 34 ++
>>  makedumpfile.h |  1 +
>>  2 files changed, 31 insertions(+), 4 deletions(-)
>>
>> diff --git a/makedumpfile.c b/makedumpfile.c
>> index 8923538..b8bfd4c 100644
>> --- a/makedumpfile.c
>> +++ b/makedumpfile.c
>> @@ -88,6 +88,7 @@ mdf_pfn_t pfn_cache_private;
>>  mdf_pfn_t pfn_user;
>>  mdf_pfn_t pfn_free;
>>  mdf_pfn_t pfn_hwpoison;
>> +mdf_pfn_t pfn_offline;
>>
>>  mdf_pfn_t num_dumped;
>>
>> @@ -249,6 +250,21 @@ isHugetlb(unsigned long dtor)
>>  && (SYMBOL(free_huge_page) == dtor));
>>  }
>>
>> +static int
>> +isOffline(unsigned long flags, unsigned int _mapcount)
>> +{
>> +if (NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE) == NOT_FOUND_NUMBER)
>> +return FALSE;
> 
> This is NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE), isn't it?
> If so, I will correct it when merging.
> 
> Otherwise, looks good to me.
> 
> Thanks!
> Kazu

Indeed,

I will most probably resend either way along with a new mm series!

Thanks a lot!


-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH net-next 0/4] VLAN tag handling cleanup

2018-11-21 Thread David Miller
From: Michał Mirosław 
Date: Tue, 20 Nov 2018 13:20:30 +0100

> This is a cleanup set after VLAN_TAG_PRESENT removal. The CFI bit
> handling is made similar to how other tag fields are used.

Series applied, thanks.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v2 2/8] mm: convert PG_balloon to PG_offline

2018-11-22 Thread David Hildenbrand
PG_balloon was introduced to implement page migration/compaction for pages
inflated in virtio-balloon. Nowadays, it is only a marker that a page is
part of virtio-balloon and therefore logically offline.

We also want to make use of this flag in other balloon drivers - for
inflated pages or when onlining a section but keeping some pages offline
(e.g. used right now by XEN and Hyper-V via set_online_page_callback()).

We are going to expose this flag to dump tools like makedumpfile. But
instead of exposing PG_balloon, let's generalize the concept of marking
pages as logically offline, so it can be reused for other purposes
later on.

Rename PG_balloon to PG_offline. This is an indicator that the page is
logically offline, the content stale and that it should not be touched
(e.g. a hypervisor would have to allocate backing storage in order for the
guest to dump an unused page).  We can then e.g. exclude such pages from
dumps.

We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
(and for now the semantics stay the same).  In following patches, we will
make use of this bit also in other balloon drivers. While at it, document
PGTABLE.

Cc: Jonathan Corbet 
Cc: Alexey Dobriyan 
Cc: Mike Rapoport 
Cc: Andrew Morton 
Cc: Christian Hansen 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Stephen Rothwell 
Cc: Matthew Wilcox 
Cc: "Michael S. Tsirkin" 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Alexander Duyck 
Cc: Naoya Horiguchi 
Cc: Miles Chen 
Cc: David Rientjes 
Cc: Konstantin Khlebnikov 
Cc: Kazuhito Hagio 
Acked-by: Konstantin Khlebnikov 
Acked-by: Michael S. Tsirkin 
Acked-by: Pankaj gupta 
Signed-off-by: David Hildenbrand 
---
 Documentation/admin-guide/mm/pagemap.rst |  9 ++---
 fs/proc/page.c   |  4 ++--
 include/linux/balloon_compaction.h   |  8 
 include/linux/page-flags.h   | 11 +++
 include/uapi/linux/kernel-page-flags.h   |  2 +-
 tools/vm/page-types.c|  2 +-
 6 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/Documentation/admin-guide/mm/pagemap.rst 
b/Documentation/admin-guide/mm/pagemap.rst
index 3f7bade2c231..340a5aee9b80 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -75,9 +75,10 @@ number of times a page is mapped.
 20. NOPAGE
 21. KSM
 22. THP
-23. BALLOON
+23. OFFLINE
 24. ZERO_PAGE
 25. IDLE
+26. PGTABLE
 
  * ``/proc/kpagecgroup``.  This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -118,8 +119,8 @@ Short descriptions to the page flags
 identical memory pages dynamically shared between one or more processes
 22 - THP
 contiguous pages which construct transparent hugepages
-23 - BALLOON
-balloon compaction page
+23 - OFFLINE
+page is logically offline
 24 - ZERO_PAGE
 zero page for pfn_zero or huge_zero page
 25 - IDLE
@@ -128,6 +129,8 @@ Short descriptions to the page flags
 Note that this flag may be stale in case the page was accessed via
 a PTE. To make sure the flag is up-to-date one has to read
 ``/sys/kernel/mm/page_idle/bitmap`` first.
+26 - PGTABLE
+page is in use as a page table
 
 IO related page flags
 -
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 6c517b11acf8..378401af4d9d 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -152,8 +152,8 @@ u64 stable_page_flags(struct page *page)
else if (page_count(page) == 0 && is_free_buddy_page(page))
u |= 1 << KPF_BUDDY;
 
-   if (PageBalloon(page))
-   u |= 1 << KPF_BALLOON;
+   if (PageOffline(page))
+   u |= 1 << KPF_OFFLINE;
if (PageTable(page))
u |= 1 << KPF_PGTABLE;
 
diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index cbe50da5a59d..f111c780ef1d 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -95,7 +95,7 @@ extern int balloon_page_migrate(struct address_space *mapping,
 static inline void balloon_page_insert(struct balloon_dev_info *balloon,
   struct page *page)
 {
-   __SetPageBalloon(page);
+   __SetPageOffline(page);
__SetPageMovable(page, balloon->inode->i_mapping);
set_page_private(page, (unsigned long)balloon);
list_add(&page->lru, &balloon->pages);
@@ -111,7 +111,7 @@ static inline void balloon_page_insert(struct 
balloon_dev_info *balloon,
  */
 static inline void balloon_page_delete(struct page *page)
 {
-   __ClearPageBalloon(page);
+   __ClearPageOffline(page);
__ClearPageMovable(page);
set_page_private(page, 0);
/*
@@ -141,13 +141,13 @@ static inline gfp_t balloon_mapping_gfp_mask(void)
 static inline void balloon_page_insert(struct b

[PATCH v2 0/8] mm/kdump: allow to exclude pages that are logically offline

2018-11-22 Thread David Hildenbrand
Right now, pages inflated as part of a balloon driver will be dumped
by dump tools like makedumpfile. While XEN is able to check in the
crash kernel whether a certain pfn is actually backed by memory in the
hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
virtio-balloon, hv-balloon and VMWare balloon inflated memory will
essentially result in zero pages getting allocated by the hypervisor and
the dump getting filled with this data.

The allocation and reading of zero pages can directly be avoided if a
dumping tool could know which pages only contain stale information not to
be dumped.

Also for XEN, calling into the kernel and asking the hypervisor if a
pfn is backed can be avoided if the duming tool would skip such pages
right from the beginning.

Dumping tools have no idea whether a given page is part of a balloon driver
and shall not be dumped. Esp. PG_reserved cannot be used for that purpose
as all memory allocated during early boot is also PG_reserved, see
discussion at [1]. So some other way of indication is required and a new
page flag is frowned upon.

We have PG_balloon (MAPCOUNT value), which is essentially unused now. I
suggest renaming it to something more generic (PG_offline) to mark pages as
logically offline. This flag can than e.g. also be used by virtio-mem in
the future to mark subsections as offline. Or by other code that wants to
put pages logically offline (e.g. later maybe poisoned pages that shall
no longer be used).

This series converts PG_balloon to PG_offline, allows dumping tools to
query the value to detect such pages and marks pages in the hv-balloon
and XEN balloon properly as PG_offline. Note that virtio-balloon already
set pages to PG_balloon (and now PG_offline).

Please note that this is also helpful for a problem we were seeing under
Hyper-V: Dumping logically offline memory (pages kept fake offline while
onlining a section via online_page_callback) would under some condicions
result in a kernel panic when dumping them.

As I don't have access to neither XEN nor Hyper-V nor VMWare installations,
this was only tested with the virtio-balloon and pages were properly
skipped when dumping. I'll also attach the makedumpfile patch to this
series.

[1] https://lkml.org/lkml/2018/7/20/566

v1 -> v2:
- "kexec: export PG_offline to VMCOREINFO"
-- Add description why it is exported as a macro
- "vmw_balloon: mark inflated pages PG_offline"
-- Use helper function + adapt comments
- "PM / Hibernate: exclude all PageOffline() pages"
-- Perform the check separate from swsusp checks.
- Added RBs/ACKs


David Hildenbrand (8):
  mm: balloon: update comment about isolation/migration/compaction
  mm: convert PG_balloon to PG_offline
  kexec: export PG_offline to VMCOREINFO
  xen/balloon: mark inflated pages PG_offline
  hv_balloon: mark inflated pages PG_offline
  vmw_balloon: mark inflated pages PG_offline
  PM / Hibernate: use pfn_to_online_page()
  PM / Hibernate: exclude all PageOffline() pages

 Documentation/admin-guide/mm/pagemap.rst |  9 ---
 drivers/hv/hv_balloon.c  | 14 --
 drivers/misc/vmw_balloon.c   | 32 ++
 drivers/xen/balloon.c|  3 +++
 fs/proc/page.c   |  4 +--
 include/linux/balloon_compaction.h   | 34 +---
 include/linux/page-flags.h   | 11 +---
 include/uapi/linux/kernel-page-flags.h   |  2 +-
 kernel/crash_core.c  |  2 ++
 kernel/power/snapshot.c  | 17 +++-
 tools/vm/page-types.c|  2 +-
 11 files changed, 90 insertions(+), 40 deletions(-)

-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v2 1/8] mm: balloon: update comment about isolation/migration/compaction

2018-11-22 Thread David Hildenbrand
Commit b1123ea6d3b3 ("mm: balloon: use general non-lru movable page
feature") reworked balloon handling to make use of the general
non-lru movable page feature. The big comment block in
balloon_compaction.h contains quite some outdated information. Let's fix
this.

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Acked-by: Michael S. Tsirkin 
Signed-off-by: David Hildenbrand 
---
 include/linux/balloon_compaction.h | 26 +-
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index 53051f3d8f25..cbe50da5a59d 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -4,15 +4,18 @@
  *
  * Common interface definitions for making balloon pages movable by compaction.
  *
- * Despite being perfectly possible to perform ballooned pages migration, they
- * make a special corner case to compaction scans because balloon pages are not
- * enlisted at any LRU list like the other pages we do compact / migrate.
+ * Balloon page migration makes use of the general non-lru movable page
+ * feature.
+ *
+ * page->private is used to reference the responsible balloon device.
+ * page->mapping is used in context of non-lru page migration to reference
+ * the address space operations for page isolation/migration/compaction.
  *
  * As the page isolation scanning step a compaction thread does is a lockless
  * procedure (from a page standpoint), it might bring some racy situations 
while
  * performing balloon page compaction. In order to sort out these racy 
scenarios
  * and safely perform balloon's page compaction and migration we must, always,
- * ensure following these three simple rules:
+ * ensure following these simple rules:
  *
  *   i. when updating a balloon's page ->mapping element, strictly do it under
  *  the following lock order, independently of the far superior
@@ -21,19 +24,8 @@
  *   +--spin_lock_irq(&b_dev_info->pages_lock);
  * ... page->mapping updates here ...
  *
- *  ii. before isolating or dequeueing a balloon page from the balloon device
- *  pages list, the page reference counter must be raised by one and the
- *  extra refcount must be dropped when the page is enqueued back into
- *  the balloon device page list, thus a balloon page keeps its reference
- *  counter raised only while it is under our special handling;
- *
- * iii. after the lockless scan step have selected a potential balloon page for
- *  isolation, re-test the PageBalloon mark and the PagePrivate flag
- *  under the proper page lock, to ensure isolating a valid balloon page
- *  (not yet isolated, nor under release procedure)
- *
- *  iv. isolation or dequeueing procedure must clear PagePrivate flag under
- *  page lock together with removing page from balloon device page list.
+ *  ii. isolation or dequeueing procedure must remove the page from balloon
+ *  device page list under b_dev_info->pages_lock.
  *
  * The functions provided by this interface are placed to help on coping with
  * the aforementioned balloon page corner case, as well as to ensure the simple
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v2 3/8] kexec: export PG_offline to VMCOREINFO

2018-11-22 Thread David Hildenbrand
Right now, pages inflated as part of a balloon driver will be dumped
by dump tools like makedumpfile. While XEN is able to check in the
crash kernel whether a certain pfn is actuall backed by memory in the
hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
other balloon inflated memory will essentially result in zero pages getting
allocated by the hypervisor and the dump getting filled with this data.

The allocation and reading of zero pages can directly be avoided if a
dumping tool could know which pages only contain stale information not to
be dumped.

We now have PG_offline which can be (and already is by virtio-balloon)
used for marking pages as logically offline. Follow up patches will
make use of this flag also in other balloon implementations.

Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
makedumpfile can directly skip pages that are logically offline and the
content therefore stale. (we export is as a macro to match how it is
done for PG_buddy. This way it is clearer that this is not actually a flag
but only a very specific mapcount value to represent page types).

Please note that this is also helpful for a problem we were seeing under
Hyper-V: Dumping logically offline memory (pages kept fake offline while
onlining a section via online_page_callback) would under some condicions
result in a kernel panic when dumping them.

Cc: Andrew Morton 
Cc: Dave Young 
Cc: "Kirill A. Shutemov" 
Cc: Baoquan He 
Cc: Omar Sandoval 
Cc: Arnd Bergmann 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Cc: Lianbo Jiang 
Cc: Borislav Petkov 
Cc: Kazuhito Hagio 
Acked-by: Michael S. Tsirkin 
Acked-by: Dave Young 
Signed-off-by: David Hildenbrand 
---
 kernel/crash_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 933cb3e45b98..093c9f917ed0 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
 #ifdef CONFIG_HUGETLB_PAGE
VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
+#define PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
+   VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
 #endif
 
arch_crash_save_vmcoreinfo();
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v2 8/8] PM / Hibernate: exclude all PageOffline() pages

2018-11-22 Thread David Hildenbrand
The content of pages that are marked PG_offline is not of interest
(e.g. inflated by a balloon driver), let's skip these pages.

In saveable_highmem_page(), move the PageReserved() check to a new
check along with the PageOffline() check to separate it from the
swsusp checks.

Cc: "Rafael J. Wysocki" 
Cc: Pavel Machek 
Cc: Len Brown 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Acked-by: Pavel Machek 
Acked-by: Rafael J. Wysocki 
Signed-off-by: David Hildenbrand 
---
 kernel/power/snapshot.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 87e6dd57819f..4802b039b89f 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1221,8 +1221,10 @@ static struct page *saveable_highmem_page(struct zone 
*zone, unsigned long pfn)
 
BUG_ON(!PageHighMem(page));
 
-   if (swsusp_page_is_forbidden(page) ||  swsusp_page_is_free(page) ||
-   PageReserved(page))
+   if (swsusp_page_is_forbidden(page) ||  swsusp_page_is_free(page))
+   return NULL;
+
+   if (PageReserved(page) || PageOffline(page))
return NULL;
 
if (page_is_guard(page))
@@ -1286,6 +1288,9 @@ static struct page *saveable_page(struct zone *zone, 
unsigned long pfn)
if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
return NULL;
 
+   if (PageOffline(page))
+   return NULL;
+
if (PageReserved(page)
&& (!kernel_page_present(page) || pfn_is_nosave(pfn)))
return NULL;
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v2 5/8] hv_balloon: mark inflated pages PG_offline

2018-11-22 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Kairui Song 
Cc: Vitaly Kuznetsov 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Acked-by: Pankaj gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/hv/hv_balloon.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 211f3fe3a038..47719862e57f 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -681,8 +681,13 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 {
-   if (!has_pfn_is_backed(has, page_to_pfn(pg)))
+   if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
+   if (!PageOffline(pg))
+   __SetPageOffline(pg);
return;
+   }
+   if (PageOffline(pg))
+   __ClearPageOffline(pg);
 
/* This frame is currently backed; online the page. */
__online_page_set_limits(pg);
@@ -1201,6 +1206,7 @@ static void free_balloon_pages(struct hv_dynmem_device 
*dm,
 
for (i = 0; i < num_pages; i++) {
pg = pfn_to_page(i + start_frame);
+   __ClearPageOffline(pg);
__free_page(pg);
dm->num_pages_ballooned--;
}
@@ -1213,7 +1219,7 @@ static unsigned int alloc_balloon_pages(struct 
hv_dynmem_device *dm,
struct dm_balloon_response *bl_resp,
int alloc_unit)
 {
-   unsigned int i = 0;
+   unsigned int i, j;
struct page *pg;
 
if (num_pages < alloc_unit)
@@ -1245,6 +1251,10 @@ static unsigned int alloc_balloon_pages(struct 
hv_dynmem_device *dm,
if (alloc_unit != 1)
split_page(pg, get_order(alloc_unit << PAGE_SHIFT));
 
+   /* mark all pages offline */
+   for (j = 0; j < (1 << get_order(alloc_unit << PAGE_SHIFT)); j++)
+   __SetPageOffline(pg + j);
+
bl_resp->range_count++;
bl_resp->range_array[i].finfo.start_page =
page_to_pfn(pg);
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v2 6/8] vmw_balloon: mark inflated pages PG_offline

2018-11-22 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: Xavier Deguillard 
Cc: Nadav Amit 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: Julien Freche 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Acked-by: Nadav Amit 
Signed-off-by: David Hildenbrand 
---
 drivers/misc/vmw_balloon.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index e6126a4b95d3..877611b5659b 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -556,6 +556,36 @@ vmballoon_page_in_frames(enum vmballoon_page_size_type 
page_size)
return 1 << vmballoon_page_order(page_size);
 }
 
+/**
+ * vmballoon_mark_page_offline() - mark a page as offline
+ * @page: pointer for the page.
+ * @page_size: the size of the page.
+ */
+static void
+vmballoon_mark_page_offline(struct page *page,
+   enum vmballoon_page_size_type page_size)
+{
+   int i;
+
+   for (i = 0; i < vmballoon_page_in_frames(page_size); i++)
+   __SetPageOffline(page + i);
+}
+
+/**
+ * vmballoon_mark_page_online() - mark a page as online
+ * @page: pointer for the page.
+ * @page_size: the size of the page.
+ */
+static void
+vmballoon_mark_page_online(struct page *page,
+  enum vmballoon_page_size_type page_size)
+{
+   int i;
+
+   for (i = 0; i < vmballoon_page_in_frames(page_size); i++)
+   __ClearPageOffline(page + i);
+}
+
 /**
  * vmballoon_send_get_target() - Retrieve desired balloon size from the host.
  *
@@ -612,6 +642,7 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
 ctl->page_size);
 
if (page) {
+   vmballoon_mark_page_offline(page, ctl->page_size);
/* Success. Add the page to the list and continue. */
list_add(&page->lru, &ctl->pages);
continue;
@@ -850,6 +881,7 @@ static void vmballoon_release_page_list(struct list_head 
*page_list,
 
list_for_each_entry_safe(page, tmp, page_list, lru) {
list_del(&page->lru);
+   vmballoon_mark_page_online(page, page_size);
__free_pages(page, vmballoon_page_order(page_size));
}
 
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v2 7/8] PM / Hibernate: use pfn_to_online_page()

2018-11-22 Thread David Hildenbrand
Let's use pfn_to_online_page() instead of pfn_to_page() when checking
for saveable pages to not save/restore offline memory sections.

Cc: "Rafael J. Wysocki" 
Cc: Pavel Machek 
Cc: Len Brown 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Suggested-by: Michal Hocko 
Acked-by: Michal Hocko 
Acked-by: Pavel Machek 
Acked-by: Rafael J. Wysocki 
Signed-off-by: David Hildenbrand 
---
 kernel/power/snapshot.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 640b2034edd6..87e6dd57819f 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1215,8 +1215,8 @@ static struct page *saveable_highmem_page(struct zone 
*zone, unsigned long pfn)
if (!pfn_valid(pfn))
return NULL;
 
-   page = pfn_to_page(pfn);
-   if (page_zone(page) != zone)
+   page = pfn_to_online_page(pfn);
+   if (!page || page_zone(page) != zone)
return NULL;
 
BUG_ON(!PageHighMem(page));
@@ -1277,8 +1277,8 @@ static struct page *saveable_page(struct zone *zone, 
unsigned long pfn)
if (!pfn_valid(pfn))
return NULL;
 
-   page = pfn_to_page(pfn);
-   if (page_zone(page) != zone)
+   page = pfn_to_online_page(pfn);
+   if (!page || page_zone(page) != zone)
return NULL;
 
BUG_ON(PageHighMem(page));
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v2 4/8] xen/balloon: mark inflated pages PG_offline

2018-11-22 Thread David Hildenbrand
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.

Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 drivers/xen/balloon.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 12148289debd..14dd6b814db3 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -425,6 +425,7 @@ static int xen_bring_pgs_online(struct page *pg, unsigned 
int order)
for (i = 0; i < size; i++) {
p = pfn_to_page(start_pfn + i);
__online_page_set_limits(p);
+   __SetPageOffline(p);
__balloon_append(p);
}
mutex_unlock(&balloon_mutex);
@@ -493,6 +494,7 @@ static enum bp_state increase_reservation(unsigned long 
nr_pages)
xenmem_reservation_va_mapping_update(1, &page, &frame_list[i]);
 
/* Relinquish the page back to the allocator. */
+   __ClearPageOffline(page);
free_reserved_page(page);
}
 
@@ -519,6 +521,7 @@ static enum bp_state decrease_reservation(unsigned long 
nr_pages, gfp_t gfp)
state = BP_EAGAIN;
break;
}
+   __SetPageOffline(page);
adjust_managed_page_count(page, -1);
xenmem_reservation_scrub_page(page);
list_add(&page->lru, &pages);
-- 
2.17.2

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH v2] makedumpfile: exclude pages that are logically offline

2018-11-22 Thread David Hildenbrand
Linux marks pages that are logically offline via a page flag (map count).
Such pages e.g. include pages infated as part of a balloon driver or
pages that were not actually onlined when onlining the whole section.

While the hypervisor usually allows to read such inflated memory, we
basically read and dump data that is completely irrelevant. Also, this
might result in quite some overhead in the hypervisor. In addition,
we saw some problems under Hyper-V, whereby we can crash the kernel by
dumping, when reading memory of a partially onlined memory segment
(for memory added by the Hyper-V balloon driver).

Therefore, don't read and dump pages that are marked as being logically
offline.

Signed-off-by: David Hildenbrand 
---

v1 -> v2:
- Fix PAGE_BUDDY_MAPCOUNT_VALUE vs. PAGE_OFFLINE_MAPCOUNT_VALUE

 makedumpfile.c | 34 ++
 makedumpfile.h |  1 +
 2 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index 8923538..a5f2ea9 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -88,6 +88,7 @@ mdf_pfn_t pfn_cache_private;
 mdf_pfn_t pfn_user;
 mdf_pfn_t pfn_free;
 mdf_pfn_t pfn_hwpoison;
+mdf_pfn_t pfn_offline;
 
 mdf_pfn_t num_dumped;
 
@@ -249,6 +250,21 @@ isHugetlb(unsigned long dtor)
 && (SYMBOL(free_huge_page) == dtor));
 }
 
+static int
+isOffline(unsigned long flags, unsigned int _mapcount)
+{
+   if (NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE) == NOT_FOUND_NUMBER)
+   return FALSE;
+
+   if (flags & (1UL << NUMBER(PG_slab)))
+   return FALSE;
+
+   if (_mapcount == (int)NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE))
+   return TRUE;
+
+   return FALSE;
+}
+
 static int
 is_cache_page(unsigned long flags)
 {
@@ -2287,6 +2303,8 @@ write_vmcoreinfo_data(void)
WRITE_NUMBER("PG_hwpoison", PG_hwpoison);
 
WRITE_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE);
+   WRITE_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE",
+PAGE_OFFLINE_MAPCOUNT_VALUE);
WRITE_NUMBER("phys_base", phys_base);
 
WRITE_NUMBER("HUGETLB_PAGE_DTOR", HUGETLB_PAGE_DTOR);
@@ -2687,6 +2705,7 @@ read_vmcoreinfo(void)
READ_SRCFILE("pud_t", pud_t);
 
READ_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE);
+   READ_NUMBER("PAGE_OFFLINE_MAPCOUNT_VALUE", PAGE_OFFLINE_MAPCOUNT_VALUE);
READ_NUMBER("phys_base", phys_base);
 #ifdef __aarch64__
READ_NUMBER("VA_BITS", VA_BITS);
@@ -6041,6 +6060,12 @@ __exclude_unnecessary_pages(unsigned long mem_map,
else if (isHWPOISON(flags)) {
pfn_counter = &pfn_hwpoison;
}
+   /*
+* Exclude pages that are logically offline.
+*/
+   else if (isOffline(flags, _mapcount)) {
+   pfn_counter = &pfn_offline;
+   }
/*
 * Unexcludable page
 */
@@ -7522,7 +7547,7 @@ write_elf_pages_cyclic(struct cache_data *cd_header, 
struct cache_data *cd_page)
 */
if (info->flag_cyclic) {
pfn_zero = pfn_cache = pfn_cache_private = 0;
-   pfn_user = pfn_free = pfn_hwpoison = 0;
+   pfn_user = pfn_free = pfn_hwpoison = pfn_offline = 0;
pfn_memhole = info->max_mapnr;
}
 
@@ -8804,7 +8829,7 @@ write_kdump_pages_and_bitmap_cyclic(struct cache_data 
*cd_header, struct cache_d
 * Reset counter for debug message.
 */
pfn_zero = pfn_cache = pfn_cache_private = 0;
-   pfn_user = pfn_free = pfn_hwpoison = 0;
+   pfn_user = pfn_free = pfn_hwpoison = pfn_offline = 0;
pfn_memhole = info->max_mapnr;
 
/*
@@ -9749,7 +9774,7 @@ print_report(void)
pfn_original = info->max_mapnr - pfn_memhole;
 
pfn_excluded = pfn_zero + pfn_cache + pfn_cache_private
-   + pfn_user + pfn_free + pfn_hwpoison;
+   + pfn_user + pfn_free + pfn_hwpoison + pfn_offline;
shrinking = (pfn_original - pfn_excluded) * 100;
shrinking = shrinking / pfn_original;
 
@@ -9763,6 +9788,7 @@ print_report(void)
REPORT_MSG("User process data pages : 0x%016llx\n", pfn_user);
REPORT_MSG("Free pages  : 0x%016llx\n", pfn_free);
REPORT_MSG("Hwpoison pages  : 0x%016llx\n", pfn_hwpoison);
+   REPORT_MSG("Offline pages   : 0x%016llx\n", pfn_offline);
REPORT_MSG("  Remaining pages  : 0x%016llx\n",
pfn_original - pfn_excluded);
REPORT_MSG("  (The number of pages is reduced to %lld%%.)\n",
@@ -9790,7 +9816,7 @@ print_mem_usage(void)
pfn_original = inf

Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-11-23 Thread David Hildenbrand
On 28.09.18 17:03, David Hildenbrand wrote:
> How to/when to online hotplugged memory is hard to manage for
> distributions because different memory types are to be treated differently.
> Right now, we need complicated udev rules that e.g. check if we are
> running on s390x, on a physical system or on a virtualized system. But
> there is also sometimes the demand to really online memory immediately
> while adding in the kernel and not to wait for user space to make a
> decision. And on virtualized systems there might be different
> requirements, depending on "how" the memory was added (and if it will
> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
> 
> On the one hand, we have physical systems where we sometimes
> want to be able to unplug memory again - e.g. a DIMM - so we have to online
> it to the MOVABLE zone optionally. That decision is usually made in user
> space.
> 
> On the other hand, we have memory that should never be onlined
> automatically, only when asked for by an administrator. Such memory only
> applies to virtualized environments like s390x, where the concept of
> "standby" memory exists. Memory is detected and added during boot, so it
> can be onlined when requested by the admininistrator or some tooling.
> Only when onlining, memory will be allocated in the hypervisor.
> 
> But then, we also have paravirtualized devices (namely xen and hyper-v
> balloons), that hotplug memory that will never ever be removed from a
> system right now using offline_pages/remove_memory. If at all, this memory
> is logically unplugged and handed back to the hypervisor via ballooning.
> 
> For paravirtualized devices it is relevant that memory is onlined as
> quickly as possible after adding - and that it is added to the NORMAL
> zone. Otherwise, it could happen that too much memory in a row is added
> (but not onlined), resulting in out-of-memory conditions due to the
> additional memory for "struct pages" and friends. MOVABLE zone as well
> as delays might be very problematic and lead to crashes (e.g. zone
> imbalance).
> 
> Therefore, introduce memory block types and online memory depending on
> it when adding the memory. Expose the memory type to user space, so user
> space handlers can start to process only "normal" memory. Other memory
> block types can be ignored. One thing less to worry about in user space.
> 

So I was looking into alternatives.

1. Provide only "normal" and "standby" memory types to user space. This
way user space can make smarter decisions about how to online memory.
Not really sure if this is the right way to go.


2. Use device driver information (as mentioned by Michal S.).

The problem right now is that there are no drivers for memory block
devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
will not contain a "DRIVER" information and we ave no idea what kind of
memory block device we hold in our hands.

$ udevadm info -q all -a /sys/devices/system/memory/memory0

  looking at device '/devices/system/memory/memory0':
KERNEL=="memory0"
SUBSYSTEM=="memory"
DRIVER==""
ATTR{online}=="1"
ATTR{phys_device}=="0"
ATTR{phys_index}==""
ATTR{removable}=="0"
ATTR{state}=="online"
ATTR{valid_zones}=="none"


If we would provide "fake" drivers for the memory block devices we want
to treat in a special way in user space (e.g. standby memory on s390x),
user space could use that information to make smarter decisions.

Adding such drivers might work. My suggestion would be to let ordinary
DIMMs be without a driver for now and only special case standby memory
and eventually paravirtualized memory devices (XEN and Hyper-V).

Any thoughts?


-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH net-next 00/12] switchdev: Convert switchdev_port_obj_{add,del}() to notifiers

2018-11-23 Thread David Miller
From: Petr Machata 
Date: Thu, 22 Nov 2018 23:27:52 +

> An offloading driver may need to have access to switchdev events on
> ports that aren't directly under its control. An example is a VXLAN port
> attached to a bridge offloaded by a driver. The driver needs to know
> about VLANs configured on the VXLAN device. However the VXLAN device
> isn't stashed between the bridge and a front-panel-port device (such as
> is the case e.g. for LAG devices), so the usual switchdev ops don't
> reach the driver.
> 
> VXLAN is likely not the only device type like this: in theory any L2
> tunnel device that needs offloading will prompt requirement of this
> sort.
> 
> A way to fix this is to give up the notion of port object addition /
> deletion as a switchdev operation, which assumes somewhat tight coupling
> between the message producer and consumer. And instead send the message
> over a notifier chain.
 ...

Series applied, thank you.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-11-26 Thread David Hildenbrand
On 23.11.18 19:06, Michal Suchánek wrote:
> On Fri, 23 Nov 2018 12:13:58 +0100
> David Hildenbrand  wrote:
> 
>> On 28.09.18 17:03, David Hildenbrand wrote:
>>> How to/when to online hotplugged memory is hard to manage for
>>> distributions because different memory types are to be treated differently.
>>> Right now, we need complicated udev rules that e.g. check if we are
>>> running on s390x, on a physical system or on a virtualized system. But
>>> there is also sometimes the demand to really online memory immediately
>>> while adding in the kernel and not to wait for user space to make a
>>> decision. And on virtualized systems there might be different
>>> requirements, depending on "how" the memory was added (and if it will
>>> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
>>>
>>> On the one hand, we have physical systems where we sometimes
>>> want to be able to unplug memory again - e.g. a DIMM - so we have to online
>>> it to the MOVABLE zone optionally. That decision is usually made in user
>>> space.
>>>
>>> On the other hand, we have memory that should never be onlined
>>> automatically, only when asked for by an administrator. Such memory only
>>> applies to virtualized environments like s390x, where the concept of
>>> "standby" memory exists. Memory is detected and added during boot, so it
>>> can be onlined when requested by the admininistrator or some tooling.
>>> Only when onlining, memory will be allocated in the hypervisor.
>>>
>>> But then, we also have paravirtualized devices (namely xen and hyper-v
>>> balloons), that hotplug memory that will never ever be removed from a
>>> system right now using offline_pages/remove_memory. If at all, this memory
>>> is logically unplugged and handed back to the hypervisor via ballooning.
>>>
>>> For paravirtualized devices it is relevant that memory is onlined as
>>> quickly as possible after adding - and that it is added to the NORMAL
>>> zone. Otherwise, it could happen that too much memory in a row is added
>>> (but not onlined), resulting in out-of-memory conditions due to the
>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>> as delays might be very problematic and lead to crashes (e.g. zone
>>> imbalance).
>>>
>>> Therefore, introduce memory block types and online memory depending on
>>> it when adding the memory. Expose the memory type to user space, so user
>>> space handlers can start to process only "normal" memory. Other memory
>>> block types can be ignored. One thing less to worry about in user space.
>>>   
>>
>> So I was looking into alternatives.
>>
>> 1. Provide only "normal" and "standby" memory types to user space. This
>> way user space can make smarter decisions about how to online memory.
>> Not really sure if this is the right way to go.
>>
>>
>> 2. Use device driver information (as mentioned by Michal S.).
>>
>> The problem right now is that there are no drivers for memory block
>> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
>> will not contain a "DRIVER" information and we ave no idea what kind of
>> memory block device we hold in our hands.
>>
>> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>>
>>   looking at device '/devices/system/memory/memory0':
>> KERNEL=="memory0"
>> SUBSYSTEM=="memory"
>> DRIVER==""
>> ATTR{online}=="1"
>> ATTR{phys_device}=="0"
>> ATTR{phys_index}==""
>> ATTR{removable}=="0"
>> ATTR{state}=="online"
>> ATTR{valid_zones}=="none"
>>
>>
>> If we would provide "fake" drivers for the memory block devices we want
>> to treat in a special way in user space (e.g. standby memory on s390x),
>> user space could use that information to make smarter decisions.
>>
>> Adding such drivers might work. My suggestion would be to let ordinary
>> DIMMs be without a driver for now and only special case standby memory
>> and eventually paravirtualized memory devices (XEN and Hyper-V).
>>
>> Any thoughts?
> 
> If we are going to fake the driver information we may as well add the
> type attribute and be done with it.
> 
> I think the problem with the patch was more with the semantic than th

Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-11-26 Thread David Hildenbrand
On 26.11.18 13:30, David Hildenbrand wrote:
> On 23.11.18 19:06, Michal Suchánek wrote:
>> On Fri, 23 Nov 2018 12:13:58 +0100
>> David Hildenbrand  wrote:
>>
>>> On 28.09.18 17:03, David Hildenbrand wrote:
>>>> How to/when to online hotplugged memory is hard to manage for
>>>> distributions because different memory types are to be treated differently.
>>>> Right now, we need complicated udev rules that e.g. check if we are
>>>> running on s390x, on a physical system or on a virtualized system. But
>>>> there is also sometimes the demand to really online memory immediately
>>>> while adding in the kernel and not to wait for user space to make a
>>>> decision. And on virtualized systems there might be different
>>>> requirements, depending on "how" the memory was added (and if it will
>>>> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
>>>>
>>>> On the one hand, we have physical systems where we sometimes
>>>> want to be able to unplug memory again - e.g. a DIMM - so we have to online
>>>> it to the MOVABLE zone optionally. That decision is usually made in user
>>>> space.
>>>>
>>>> On the other hand, we have memory that should never be onlined
>>>> automatically, only when asked for by an administrator. Such memory only
>>>> applies to virtualized environments like s390x, where the concept of
>>>> "standby" memory exists. Memory is detected and added during boot, so it
>>>> can be onlined when requested by the admininistrator or some tooling.
>>>> Only when onlining, memory will be allocated in the hypervisor.
>>>>
>>>> But then, we also have paravirtualized devices (namely xen and hyper-v
>>>> balloons), that hotplug memory that will never ever be removed from a
>>>> system right now using offline_pages/remove_memory. If at all, this memory
>>>> is logically unplugged and handed back to the hypervisor via ballooning.
>>>>
>>>> For paravirtualized devices it is relevant that memory is onlined as
>>>> quickly as possible after adding - and that it is added to the NORMAL
>>>> zone. Otherwise, it could happen that too much memory in a row is added
>>>> (but not onlined), resulting in out-of-memory conditions due to the
>>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>>> as delays might be very problematic and lead to crashes (e.g. zone
>>>> imbalance).
>>>>
>>>> Therefore, introduce memory block types and online memory depending on
>>>> it when adding the memory. Expose the memory type to user space, so user
>>>> space handlers can start to process only "normal" memory. Other memory
>>>> block types can be ignored. One thing less to worry about in user space.
>>>>   
>>>
>>> So I was looking into alternatives.
>>>
>>> 1. Provide only "normal" and "standby" memory types to user space. This
>>> way user space can make smarter decisions about how to online memory.
>>> Not really sure if this is the right way to go.
>>>
>>>
>>> 2. Use device driver information (as mentioned by Michal S.).
>>>
>>> The problem right now is that there are no drivers for memory block
>>> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
>>> will not contain a "DRIVER" information and we ave no idea what kind of
>>> memory block device we hold in our hands.
>>>
>>> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>>>
>>>   looking at device '/devices/system/memory/memory0':
>>> KERNEL=="memory0"
>>> SUBSYSTEM=="memory"
>>> DRIVER==""
>>> ATTR{online}=="1"
>>> ATTR{phys_device}=="0"
>>> ATTR{phys_index}==""
>>> ATTR{removable}=="0"
>>> ATTR{state}=="online"
>>> ATTR{valid_zones}=="none"
>>>
>>>
>>> If we would provide "fake" drivers for the memory block devices we want
>>> to treat in a special way in user space (e.g. standby memory on s390x),
>>> user space could use that information to make smarter decisions.
>>>
>>> Adding such drivers might work. My suggestion would be to let ordinary
>>> DIMMs be without a dr

Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-11-26 Thread David Hildenbrand
On 26.11.18 15:20, Michal Suchánek wrote:
> On Mon, 26 Nov 2018 14:33:29 +0100
> David Hildenbrand  wrote:
> 
>> On 26.11.18 13:30, David Hildenbrand wrote:
>>> On 23.11.18 19:06, Michal Suchánek wrote:  
> 
>>>>
>>>> If we are going to fake the driver information we may as well add the
>>>> type attribute and be done with it.
>>>>
>>>> I think the problem with the patch was more with the semantic than the
>>>> attribute itself.
>>>>
>>>> What is normal, paravirtualized, and standby memory?
>>>>
>>>> I can understand DIMM device, baloon device, or whatever mechanism for
>>>> adding memory you might have.
>>>>
>>>> I can understand "memory designated as standby by the cluster
>>>> administrator".
>>>>
>>>> However, DIMM vs baloon is orthogonal to standby and should not be
>>>> conflated into one property.
>>>>
>>>> paravirtualized means nothing at all in relationship to memory type and
>>>> the desired online policy to me.  
>>>
>>> Right, so with whatever we come up, it should allow to make a decision
>>> in user space about
>>> - if memory is to be onlined automatically  
>>
>> And I will think about if we really should model standby memory. Maybe
>> it is really better to have in user space something like (as Dan noted)
> 
> If it is possible to designate the memory as standby or online in the
> s390 admin interface and the kernel does have access to this
> information it makes sense to forward it to userspace (as separate
> s390-specific property). If not then you need to make some kind of
> assumption like below and the user can tune the script according to
> their usecase.

Also true, standby memory really represents a distinct type of memory
block (memory seems to be there but really isn't). Right now I am
thinking about something like this (tried to formulate it on a very
generic level because we can't predict which mechanism might want to
make use of these types in the future).


/*
 * Memory block types allow user space to formulate rules if and how to
 * online memory blocks. The types are exposed to user space as text
 * strings in sysfs. While the typical online strategies are described
 * along with the types, there are use cases where that can differ (e.g.
 * use MOVABLE zone for more reliable huge page usage, use NORMAL zone
 * due to zone imbalance or because memory unplug is not intended).
 *
 * MEMORY_BLOCK_NONE:
 *  No memory block is to be created (e.g. device memory). Used internally
 *  only.
 *
 * MEMORY_BLOCK_REMOVABLE:
 *  This memory block type should be treated as if it can be
 *  removed/unplugged from the system again. E.g. there is a hardware
 *  interface to unplug such memory. This memory block type is usually
 *  onlined to the MOVABLE zone, to e.g. make offlining of it more
 *  reliable. Examples include ACPI and PPC DIMMs.
 *
 * MEMORY_BLOCK_UNREMOVABLE:
 *  This memory block type should be treated as if it can not be
 *  removed/unplugged again. E.g. there is no hardware interface to
 *  unplug such memory. This memory block type is usually onlined to
 *  the NORMAL zone, as offlining is not beneficial. Examples include boot
 *  memory on most architectures and memory added via balloon devices.
 *
 * MEMORY_BLOCK_STANDBY:
 *  The memory block type should be treated as if it can be
 *  removed/unplugged again, however the actual memory hot(un)plug is
 *  performed by onlining/offlining. In virtual environments, such memory
 *  is usually added during boot and never removed. Onlining memory will
 *  result in memory getting allocated to a VM. This memory type is usually
 *  not onlined automatically but explicitly by the administrator. One
 *  example is standby memory on s390x.
 */

> 
>>
>> if (isS390x() && type == "dimm") {
>>  /* don't online, on s390x system DIMMs are standby memory */
>> }
> 
> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-11-27 Thread David Hildenbrand
On 27.11.18 17:32, Michal Suchánek wrote:
> On Mon, 26 Nov 2018 16:59:14 +0100
> David Hildenbrand  wrote:
> 
>> On 26.11.18 15:20, Michal Suchánek wrote:
>>> On Mon, 26 Nov 2018 14:33:29 +0100
>>> David Hildenbrand  wrote:
>>>   
>>>> On 26.11.18 13:30, David Hildenbrand wrote:  
>>>>> On 23.11.18 19:06, Michal Suchánek wrote:
>>>   
>>>>>>
>>>>>> If we are going to fake the driver information we may as well add the
>>>>>> type attribute and be done with it.
>>>>>>
>>>>>> I think the problem with the patch was more with the semantic than the
>>>>>> attribute itself.
>>>>>>
>>>>>> What is normal, paravirtualized, and standby memory?
>>>>>>
>>>>>> I can understand DIMM device, baloon device, or whatever mechanism for
>>>>>> adding memory you might have.
>>>>>>
>>>>>> I can understand "memory designated as standby by the cluster
>>>>>> administrator".
>>>>>>
>>>>>> However, DIMM vs baloon is orthogonal to standby and should not be
>>>>>> conflated into one property.
>>>>>>
>>>>>> paravirtualized means nothing at all in relationship to memory type and
>>>>>> the desired online policy to me.
>>>>>
>>>>> Right, so with whatever we come up, it should allow to make a decision
>>>>> in user space about
>>>>> - if memory is to be onlined automatically
>>>>
>>>> And I will think about if we really should model standby memory. Maybe
>>>> it is really better to have in user space something like (as Dan noted)  
>>>
>>> If it is possible to designate the memory as standby or online in the
>>> s390 admin interface and the kernel does have access to this
>>> information it makes sense to forward it to userspace (as separate
>>> s390-specific property). If not then you need to make some kind of
>>> assumption like below and the user can tune the script according to
>>> their usecase.  
>>
>> Also true, standby memory really represents a distinct type of memory
>> block (memory seems to be there but really isn't). Right now I am
>> thinking about something like this (tried to formulate it on a very
>> generic level because we can't predict which mechanism might want to
>> make use of these types in the future).
>>
>>
>> /*
>>  * Memory block types allow user space to formulate rules if and how to
>>  * online memory blocks. The types are exposed to user space as text
>>  * strings in sysfs. While the typical online strategies are described
>>  * along with the types, there are use cases where that can differ (e.g.
>>  * use MOVABLE zone for more reliable huge page usage, use NORMAL zone
>>  * due to zone imbalance or because memory unplug is not intended).
>>  *
>>  * MEMORY_BLOCK_NONE:
>>  *  No memory block is to be created (e.g. device memory). Used internally
>>  *  only.
>>  *
>>  * MEMORY_BLOCK_REMOVABLE:
>>  *  This memory block type should be treated as if it can be
>>  *  removed/unplugged from the system again. E.g. there is a hardware
>>  *  interface to unplug such memory. This memory block type is usually
>>  *  onlined to the MOVABLE zone, to e.g. make offlining of it more
>>  *  reliable. Examples include ACPI and PPC DIMMs.
>>  *
>>  * MEMORY_BLOCK_UNREMOVABLE:
>>  *  This memory block type should be treated as if it can not be
>>  *  removed/unplugged again. E.g. there is no hardware interface to
>>  *  unplug such memory. This memory block type is usually onlined to
>>  *  the NORMAL zone, as offlining is not beneficial. Examples include boot
>>  *  memory on most architectures and memory added via balloon devices.
> 
> AFAIK baloon device can be inflated as well so this does not really
> describe how this memory type works in any meaningful way. Also it
> should not be possible to see this kind of memory from userspace. The
> baloon driver just takes existing memory that is properly backed,
> allocates it for itself, and allows the hypervisor to use it. Thus it
> creates the equivalent to s390 standby memory which is not backed in
> the VM. When memory is reclaimed from hypervisor the baloon driver
> frees it making it available to the VM kernel again. However, the whole
> time the memory appears present in the machine and no hotplug events
> should be visib

  1   2   3   4   5   6   7   8   9   10   >