On Tue, Aug 23, 2016 at 7:53 PM, Dan Williams <dan.j.willi...@intel.com> wrote: > On Tue, Aug 23, 2016 at 6:29 PM, Kani, Toshimitsu <toshi.k...@hpe.com> wrote: >>> On Tue, Aug 23, 2016 at 4:47 PM, Kani, Toshimitsu <toshi.k...@hpe.com> >>> wrote: >>> > On Tue, 2016-08-23 at 15:32 -0700, Dan Williams wrote: >>> >> On Tue, Aug 23, 2016 at 11:43 AM, Toshi Kani <toshi.k...@hpe.com> >>> >> wrote: >>> > : >>> >> I'm not sure about this fix. The point of honoring >>> >> vmem_altmap_offset() is because a portion of the resource that is >>> >> passed to devm_memremap_pages() also contains the metadata info >>> block >>> >> for the device. The offset says "use everything past this point for >>> >> pages". This may work for avoiding a crash, but it may corrupt info >>> >> block metadata in the process. Can you provide more information >>> >> about the failing scenario to be sure that we are not triggering a >>> >> fault on an address that is not meant to have a page mapping? I.e. >>> >> what is the host physical address of the page that caused this fault, >>> >> and is it valid? >>> > >>> > The fault address in question was the 2nd page of an NVDIMM range. I >>> > assumed this fault address was valid and needed to be handled. Here is >>> > some info about the base and patched cases. Let me know if you need >>> > more info. >>> > >>> > Base >>> > ==== >>> > >>> > The following NVDIMM range was set to /dev/dax. >>> >>> With ndctl create-namespace or manually via sysfs? Specifically I'm >>> looking for what the 'align' attribute was set to when the >>> configuration was established. Can you provide a dump of the sysfs >>> attributes for the /dev/dax parent device? >> >> I used the ndctl command below. >> ndctl create-namespace -f -e namespace0.0 -m dax >> >> Here is additional info from my note for the base case. >> >> p {struct dev_pagemap} 0xffff88046d0453f0 >> $3 = { >> altmap = 0xffff88046d045410, >> res = 0xffff88046d0453a8, >> ref = 0xffff88046d0452f0, >> dev = 0xffff880464790410 >> } >> >> crash> p {struct vmem_altmap} 0xffff88046d045410 >> $6 = { >> base_pfn = 0x480000, >> reserve = 0x2, // PHYS_PFN(SZ_8K) >> free = 0x101fe, >> align = 0x1fe, >> alloc = 0x10000 >> } > > Ah, so, on second look the 0x490200000 data offset looks correct. The > total size of the address range is 16GB which equates to 256MB needed > for struct page, plus 2MB more to re-align the data on the next 2MB > boundary. > > The question now is why is the guest faulting on an access to an > address less than 0x490200000?
Does the attached patch fix this for you?
From 506cdd2c4ec0f9283ac4eb92259f2e8a71c5b637 Mon Sep 17 00:00:00 2001 From: Dan Williams <dan.j.willi...@intel.com> Date: Tue, 23 Aug 2016 19:59:31 -0700 Subject: [PATCH] dax: fix device-dax region base The data offset for a dax region needs to account for an altmap reservation in the resource range. Otherwise, device-dax is allowing mappings directly into the memmap or device info-block area, with crash signatures like the following: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff811ac851>] get_zone_device_page+0x11/0x30 Call Trace: follow_devmap_pmd+0x298/0x2c0 follow_page_mask+0x275/0x530 __get_user_pages+0xe3/0x750 __gfn_to_pfn_memslot+0x1b2/0x450 [kvm] ? hrtimer_try_to_cancel+0x2c/0x120 ? kvm_read_l1_tsc+0x55/0x60 [kvm] try_async_pf+0x66/0x230 [kvm] ? kvm_host_page_size+0x90/0xa0 [kvm] tdp_page_fault+0x130/0x280 [kvm] kvm_mmu_page_fault+0x5f/0xf0 [kvm] handle_ept_violation+0x94/0x180 [kvm_intel] vmx_handle_exit+0x1d3/0x1440 [kvm_intel] ? atomic_switch_perf_msrs+0x6f/0xa0 [kvm_intel] ? vmx_vcpu_run+0x2d1/0x490 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x81d/0x16a0 [kvm] ? wake_up_q+0x44/0x80 kvm_vcpu_ioctl+0x33c/0x620 [kvm] ? __vfs_write+0x37/0x160 do_vfs_ioctl+0xa2/0x5d0 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x1a/0xa4 Cc: <sta...@vger.kernel.org> Cc: Andrew Morton <a...@linux-foundation.org> Fixes: ab68f2622136 ("/dev/dax, pmem: direct access to persistent memory") Reported-by: Abhilash Kumar Mulumudi <m.abhilash-ku...@hpe.com> Reported-by: Toshi Kani <toshi.k...@hpe.com> Signed-off-by: Dan Williams <dan.j.willi...@intel.com> --- drivers/dax/pmem.c | 2 ++ include/linux/memremap.h | 1 + kernel/memremap.c | 9 +++++++++ 3 files changed, 12 insertions(+) diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c index dfb168568af1..0603637df162 100644 --- a/drivers/dax/pmem.c +++ b/drivers/dax/pmem.c @@ -116,6 +116,8 @@ static int dax_pmem_probe(struct device *dev) if (rc) return rc; + res.start += vmem_altmap_resource_offset(altmap); + nd_region = to_nd_region(dev->parent); dax_region = alloc_dax_region(dev, nd_region->id, &res, le32_to_cpu(pfn_sb->align), addr, PFN_DEV|PFN_MAP); diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 93416196ba64..7265b83cdbea 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -24,6 +24,7 @@ struct vmem_altmap { }; unsigned long vmem_altmap_offset(struct vmem_altmap *altmap); +resource_size_t vmem_altmap_resource_offset(struct vmem_altmap *altmap); void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns); #ifdef CONFIG_ZONE_DEVICE diff --git a/kernel/memremap.c b/kernel/memremap.c index 251d16b4cb41..941e897c52a8 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -384,6 +384,15 @@ unsigned long vmem_altmap_offset(struct vmem_altmap *altmap) return altmap->reserve + altmap->free; } +resource_size_t vmem_altmap_resource_offset(struct vmem_altmap *altmap) +{ + /* number of bytes from base where data starts after the memmap */ + if (!altmap) + return 0; + return vmem_altmap_offset(altmap) * PAGE_SIZE; +} +EXPORT_SYMBOL_GPL(vmem_altmap_resource_offset); + void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns) { altmap->alloc -= nr_pfns; -- 2.5.5