From: Kirill Tkhai <ktk...@virtuozzo.com>

Adds latency calculation for:
  kstat_glob.swap_in
  kstat_glob.page_in
  kstat_glob.alloc_lat
And fail count in:
  kstat_glob.alloc_fails
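All four counters follow the same pattern that the patch wires into the
allocator and fault paths: take a timestamp before the operation, account
the elapsed time into a latency statistic afterwards, and bump a separate
failure counter when the operation yields nothing. Below is a minimal,
self-contained userspace sketch of that pattern; struct latency_stat,
account_latency() and now_ns() are invented for illustration and are not
the kernel's kstat structures.

/*
 * Illustrative userspace sketch only -- not the kernel implementation.
 * Mirrors the pattern the patch adds around allocations and faults:
 * timestamp before, account the delta after, count failures separately.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct latency_stat {
	unsigned long long total_ns;	/* sum of observed latencies */
	unsigned long long max_ns;	/* worst observed latency */
	unsigned long long count;	/* number of samples */
	unsigned long long fails;	/* operations that returned no result */
};

static unsigned long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (unsigned long long)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void account_latency(struct latency_stat *st,
			    unsigned long long delta_ns, int failed)
{
	st->total_ns += delta_ns;
	if (delta_ns > st->max_ns)
		st->max_ns = delta_ns;
	st->count++;
	if (failed)
		st->fails++;
}

int main(void)
{
	struct latency_stat alloc_lat = { 0 };

	for (int i = 0; i < 1000; i++) {
		unsigned long long start = now_ns();
		void *p = malloc(4096);	/* stand-in for a page allocation */

		account_latency(&alloc_lat, now_ns() - start, p == NULL);
		free(p);
	}
	printf("samples=%llu fails=%llu avg=%llu ns max=%llu ns\n",
	       alloc_lat.count, alloc_lat.fails,
	       alloc_lat.total_ns / alloc_lat.count, alloc_lat.max_ns);
	return 0;
}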
Also incorporates fixup patches:
  kstat: Make kstat_glob::swap_in percpu - core part
  ve/mm/kstat: Port diff-ve-kstat-disable-interrupts-around-seqcount-write-lock

Related buglinks:
https://jira.sw.ru/browse/PCLIN-31259
https://jira.sw.ru/browse/PSBM-33650

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
Signed-off-by: Konstantin Khlebnikov <khlebni...@openvz.org>
Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>

Rebase to vz8:

The commit [1] tries to reimplement the swap_in part of this patch but
loses the "goto out" hunk added in vz7.150.1 on rebase; bring the hunk
back. Note: on rebase I would prefer merging [1] into this patch rather
than merging this patch into [1].

Add vzstat.h where needed and replace __GFP_WAIT with its successor
__GFP_RECLAIM; skip kstat_init as it is already there.

https://jira.sw.ru/browse/PSBM-127780

(cherry-picked from vz7 commit 9caa91f6a857 ("core: Add glob_kstat,
percpu kstat and account mm stat"))
Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>

+++ vzstat: account "page_in" and "swap_in" in nanoseconds

Up to now "page_in" and "swap_in" in /proc/vz/latency have been provided
in CPU cycles, while the other latencies there are in nanoseconds. Let's
use a single measurement unit for all latencies, so provide swap_in and
page_in in nanoseconds as well.

Note: we keep time accounting via direct rdtsc() with conversion to ns
afterwards. We understand there may be correctness issues and that using
ktime_to_ns(ktime_get()) would be better (as is done for the other
latencies), but switching to ktime_get() results in a 2% performance loss
on the first memory access (pagefault + memory read), so we decided not
to slow down the fast path and to accept possible inaccuracy in the stats.

https://pmc.acronis.com/browse/VSTOR-16659

Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>

(cherry-picked from vz7 commit aedfe36c7fc5 ("vzstat: account "page_in"
and "swap_in" in nanoseconds"))
Signed-off-by: Andrey Zhadchenko <andrey.zhadche...@virtuozzo.com>

+++ kstat: Make kstat_glob::swap_in percpu

Patchset description: Make kstat_glob::swap_in percpu and cleanup

This patchset continues the escape from kstat_glb_lock and makes swap_in
percpu. Also, newly unused primitives are dropped, and memory usage is
reduced by using a percpu seqcount (instead of a separate percpu seqcount
for every kstat percpu variable).

Kirill Tkhai (4):
  kstat: Make kstat_glob::swap_in percpu
  kstat: Drop global kstat_lat_struct
  kstat: Drop cpu argument in KSTAT_LAT_PCPU_ADD()
  kstat: Make global percpu kstat_pcpu_seq instead of percpu seq for every variable

==========================================
This patch description:

Using a global lock is not good for scalability. Better to make swap_in
percpu, so it is updated locklessly like the other statistics (e.g.,
page_in).

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

Ported to vz8:
 - Dropped all of the patchset but this patch, since it is already
   partially included
 - Introduced start in do_swap_page to use it for kstat_glob.swap_in

(cherry picked from vz7 commit ed033a381e01 ("kstat: Make
kstat_glob::swap_in percpu"))
Signed-off-by: Andrey Zhadchenko <andrey.zhadche...@virtuozzo.com>
Reviewed-by: Kirill Tkhai <ktk...@virtuozzo.com>

+++ vzstat: account "page_in" and "swap_in" in nanoseconds

Up to now "page_in" and "swap_in" in /proc/vz/latency have been provided
in CPU cycles, while the other latencies there are in nanoseconds. Let's
use a single measurement unit for all latencies, so provide swap_in and
page_in in nanoseconds as well.
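The cycles-to-nanoseconds conversion is done by the CLKS2NSEC() macro
added to mm/memory.c in the diff below. A standalone sketch of the same
arithmetic follows; here tsc_khz is passed explicitly and its value is
made up for the example, whereas the kernel macro reads the global
tsc_khz.

/*
 * ns = cycles / (tsc_khz * 1000 Hz) * 1e9 = cycles * 1000000 / tsc_khz,
 * which is exactly what CLKS2NSEC() in the patch computes.
 */
#include <stdio.h>

#define CLKS2NSEC(c, tsc_khz)	((c) * 1000000ull / (tsc_khz))

int main(void)
{
	unsigned long long tsc_khz = 2600000;	/* e.g. a 2.6 GHz TSC (example value) */
	unsigned long long cycles = 52000;	/* a hypothetical measured delta */

	/* 52000 cycles at 2.6 GHz -> 20000 ns (20 us) */
	printf("%llu cycles -> %llu ns\n", cycles, CLKS2NSEC(cycles, tsc_khz));
	return 0;
}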
Note: we keep time accounting via direct rdtsc() with conversion to ns
afterwards. We understand there may be correctness issues and that using
ktime_to_ns(ktime_get()) would be better (as is done for the other
latencies), but switching to ktime_get() results in a 2% performance loss
on the first memory access (pagefault + memory read), so we decided not
to slow down the fast path and to accept possible inaccuracy in the stats.

https://pmc.acronis.com/browse/VSTOR-16659

Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>

(cherry-picked from vz7 commit aedfe36c7fc5 ("vzstat: account "page_in"
and "swap_in" in nanoseconds"))
Signed-off-by: Andrey Zhadchenko <andrey.zhadche...@virtuozzo.com>

+++ mm/page_alloc: use sched_clock() instead of jiffies to measure latency

sched_clock() (which is rdtsc() on x86) gives us a more precise result
than jiffies.

Q: Why do we need greater accuracy?
A: Because if we target, say, 10000 IOPS (per cpu), then a 1 ms memory
   allocation latency is too much; we need to achieve lower alloc latency
   and therefore must be able to measure it (see the arithmetic sketch
   after the patch).

https://pmc.acronis.com/browse/VSTOR-19040

Signed-off-by: Andrey Ryabinin <aryabi...@virtuozzo.com>

(cherry-picked from vz7 commit 99407f6d6f50 ("mm/page_alloc: use
sched_clock() instead of jiffies to measure latency"))
Signed-off-by: Andrey Zhadchenko <andrey.zhadche...@virtuozzo.com>

+++ ve/kstat/alloc_lat: Don't separate GFP_HIGHMEM and !GFP_HIGHMEM allocation latencies

We mostly use 64-bit systems these days. Since they do not have highmem,
it is better not to segregate GFP_HIGHMEM and !GFP_HIGHMEM latencies.
For backward compatibility we still output the alochigh/alochighmp fields
in /proc/vz/latency, but show only zeroes there.

https://jira.sw.ru/browse/PSBM-81395

Signed-off-by: Andrey Ryabinin <aryabi...@virtuozzo.com>

https://jira.sw.ru/browse/PSBM-127780

(cherry-picked from vz7 commit 1fcbaf6d1fb2 ("ve/kstat/alloc_lat: Don't
separate GFP_HIGHMEM and !GFP_HIGHMEM allocation latencies"))
Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>

(cherry-picked from vz8 commit ad75d76f5a08 ("core: Add glob_kstat,
percpu kstat and account mm stat"))
Signed-off-by: Nikita Yushchenko <nikita.yushche...@virtuozzo.com>
---
 mm/memory.c     | 22 +++++++++++++++++++---
 mm/page_alloc.c | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2d0bc5ab5884..2511db99634e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -83,6 +83,7 @@
 #include <linux/uaccess.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
+#include <asm/tsc.h>
 
 #include "pgalloc-track.h"
 #include "internal.h"
@@ -3470,6 +3471,8 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
+#define CLKS2NSEC(c)	((c) * 1000000 / tsc_khz)
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3489,7 +3492,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	int exclusive = 0;
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
+	cycles_t start;
 
+	start = get_cycles();
 	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
 		goto out;
 
@@ -3693,6 +3698,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 out:
 	if (si)
 		put_swap_device(si);
+
+	local_irq_disable();
+	KSTAT_LAT_PCPU_ADD(&kstat_glob.swap_in,
+			CLKS2NSEC(get_cycles() - start));
+	local_irq_enable();
+
 	return ret;
 out_nomap:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -3704,9 +3715,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		unlock_page(swapcache);
 		put_page(swapcache);
 	}
-	if (si)
-		put_swap_device(si);
-	return ret;
+	goto out;
 }
 
 /*
@@ -3834,6 +3843,7 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
+	cycles_t start;
 
 	/*
 	 * Preallocate pte before we take page_lock because this might lead to
@@ -3857,6 +3867,7 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 		smp_wmb(); /* See comment in __pte_alloc() */
 	}
 
+	start = get_cycles();
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
@@ -3875,6 +3886,11 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	else
 		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
 
+	local_irq_disable();
+	KSTAT_LAT_PCPU_ADD(&kstat_glob.page_in,
+			CLKS2NSEC(get_cycles() - start));
+	local_irq_enable();
+
 	return ret;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e511e1e45c5..ec4f4301a134 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -72,6 +72,7 @@
 #include <linux/padata.h>
 #include <linux/khugepaged.h>
 #include <linux/buffer_head.h>
+#include <linux/vzstat.h>
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -5404,6 +5405,34 @@ static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
 	}
 }
 
+static void __alloc_collect_stats(gfp_t gfp_mask, unsigned int order,
+		struct page *page, u64 time)
+{
+#ifdef CONFIG_VE
+	unsigned long flags;
+	u64 current_clock, delta;
+	int ind, cpu;
+
+	current_clock = sched_clock();
+	delta = current_clock - time;
+	if (!(gfp_mask & __GFP_RECLAIM))
+		ind = KSTAT_ALLOCSTAT_ATOMIC;
+	else
+		if (order > 0)
+			ind = KSTAT_ALLOCSTAT_LOW_MP;
+		else
+			ind = KSTAT_ALLOCSTAT_LOW;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	KSTAT_LAT_PCPU_ADD(&kstat_glob.alloc_lat[ind], delta);
+
+	if (!page)
+		kstat_glob.alloc_fails[cpu][ind]++;
+	local_irq_restore(flags);
+#endif
+}
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -5414,6 +5443,7 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
 	gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = { };
+	u64 start;
 
 	/*
 	 * There are several places where we assume that the order value is sane
@@ -5447,6 +5477,7 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
 	 */
 	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);
 
+	start = sched_clock();
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
 	if (likely(page))
@@ -5470,6 +5501,7 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
 		page = NULL;
 	}
 
+	__alloc_collect_stats(alloc_gfp, order, page, start);
 	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
 
 	return page;
-- 
2.30.2
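As referenced in the sched_clock() changelog above, here is a
back-of-the-envelope sketch of why jiffies are too coarse for the target
latencies. The IOPS figure is the one quoted in the changelog; the HZ
values are typical kernel configuration choices, not read from any
running system.

/*
 * At 10000 IOPS per cpu the whole per-operation budget is 100 us,
 * while a single jiffy is 1-10 ms depending on HZ, i.e. 10x-100x the
 * budget -- so jiffies cannot resolve the latencies of interest.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long iops_per_cpu = 10000;
	const unsigned long budget_ns = 1000000000ul / iops_per_cpu; /* 100 us per op */
	const unsigned int hz_values[] = { 100, 250, 1000 };

	printf("per-op budget at %lu IOPS/cpu: %lu us\n",
	       iops_per_cpu, budget_ns / 1000);
	for (unsigned i = 0; i < sizeof(hz_values) / sizeof(hz_values[0]); i++) {
		unsigned long jiffy_us = 1000000ul / hz_values[i];

		/* one jiffy is 10x-100x the whole latency budget */
		printf("HZ=%-4u -> jiffy = %lu us (%.0fx the budget)\n",
		       hz_values[i], jiffy_us,
		       (double)jiffy_us * 1000 / budget_ns);
	}
	return 0;
}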