[PATCH V2 1/3] x86/Hyper-V: Fix definition of struct hv_vp_assist_page
From: Tianyu Lan The struct hv_vp_assist_page was defined incorrectly. The "vtl_control" should be u64[3], "nested_enlightenments_control" should be a u64, and there are 7 reserved bytes following "enlighten_vmentry". This patch fixes it. Signed-off-by: Tianyu Lan -- Change since v1: Move definition of struct hv_nested_enlightenments_control into this patch to fix offset issue. --- arch/x86/include/asm/hyperv-tlfs.h | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index af78cd72b8f3..cf0b2a04271d 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -514,14 +514,24 @@ struct hv_timer_message_payload { __u64 delivery_time;/* When the message was delivered */ } __packed; +struct hv_nested_enlightenments_control { + struct { + __u32 directhypercall:1; + __u32 reserved:31; + } features; + struct { + __u32 reserved; + } hypercallControls; +} __packed; + /* Define virtual processor assist page structure. */ struct hv_vp_assist_page { __u32 apic_assist; - __u32 reserved; - __u64 vtl_control[2]; - __u64 nested_enlightenments_control[2]; - __u32 enlighten_vmentry; - __u32 padding; + __u32 reserved1; + __u64 vtl_control[3]; + struct hv_nested_enlightenments_control nested_control; + __u8 enlighten_vmentry; + __u8 reserved2[7]; __u64 current_nested_vmcs; } __packed; -- 2.14.5
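As a cross-check of the fixed layout: the field sizes above put nested_control at offset 0x20, enlighten_vmentry at 0x28 and current_nested_vmcs at 0x30. A minimal, self-contained sketch that asserts these offsets at compile time (the offsets are derived from the patch itself, not quoted from the TLFS):

#include <stdint.h>
#include <stddef.h>

struct hv_nested_enlightenments_control {
        struct {
                uint32_t directhypercall:1;
                uint32_t reserved:31;
        } features;
        struct {
                uint32_t reserved;
        } hypercallControls;
} __attribute__((packed));

struct hv_vp_assist_page {
        uint32_t apic_assist;
        uint32_t reserved1;
        uint64_t vtl_control[3];
        struct hv_nested_enlightenments_control nested_control;
        uint8_t enlighten_vmentry;
        uint8_t reserved2[7];
        uint64_t current_nested_vmcs;
} __attribute__((packed));

/* 4 + 4 + 3*8 bytes place nested_control at 0x20; 8 + 1 + 7 more bytes
 * place current_nested_vmcs at 0x30, matching the commit message. */
_Static_assert(offsetof(struct hv_vp_assist_page, nested_control) == 0x20, "");
_Static_assert(offsetof(struct hv_vp_assist_page, enlighten_vmentry) == 0x28, "");
_Static_assert(offsetof(struct hv_vp_assist_page, current_nested_vmcs) == 0x30, "");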
[PATCH V2 2/3] KVM/Hyper-V: Add new KVM cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH
From: Tianyu Lan This patch adds a new KVM cap, KVM_CAP_HYPERV_DIRECT_TLBFLUSH, and lets user space enable the direct TLB flush function only when the Hyper-V hypervisor capability is exposed to the VM. This patch also adds an enable_direct_tlbflush callback to struct kvm_x86_ops so that platforms may use it to implement direct TLB flush support. Signed-off-by: Tianyu Lan --- Change since v1: Update description of KVM_CAP_HYPERV_DIRECT_TLBFLUSH in the KVM API doc. --- Documentation/virtual/kvm/api.txt | 13 + arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/x86.c| 8 include/uapi/linux/kvm.h | 1 + 4 files changed, 24 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 2cd6250b2896..0c6e1b25d0c8 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -5289,3 +5289,16 @@ Architectures: x86 This capability indicates that KVM supports paravirtualized Hyper-V IPI send hypercalls: HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx. +8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH + +Architecture: x86 + +This capability indicates that KVM running on top of Hyper-V hypervisor +enables Direct TLB flush for its guests meaning that TLB flush +hypercalls are handled by Level 0 hypervisor (Hyper-V) bypassing KVM. +Due to the different ABI for hypercall parameters between Hyper-V and +KVM, enabling this capability effectively disables all hypercall +handling by KVM (as some KVM hypercall may be mistakenly treated as TLB +flush hypercalls by Hyper-V) so userspace should disable KVM identification +in CPUID and only exposes Hyper-V identification. In this case, guest +thinks it's running on Hyper-V and only uses Hyper-V hypercalls. diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 0cc5b611a113..667d154e89d4 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1205,6 +1205,8 @@ struct kvm_x86_ops { uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu); bool (*need_emulation_on_page_fault)(struct kvm_vcpu *vcpu); + + int (*enable_direct_tlbflush)(struct kvm_vcpu *vcpu); }; struct kvm_arch_async_pf { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9d7b9e6a0939..a9d8ee7f7bf0 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3183,6 +3183,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) r = kvm_x86_ops->get_nested_state ? kvm_x86_ops->get_nested_state(NULL, NULL, 0) : 0; break; + case KVM_CAP_HYPERV_DIRECT_TLBFLUSH: + r = kvm_x86_ops->enable_direct_tlbflush ? 1 : 0; + break; default: break; } @@ -3953,6 +3956,11 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu, r = -EFAULT; } return r; + case KVM_CAP_HYPERV_DIRECT_TLBFLUSH: + if (!kvm_x86_ops->enable_direct_tlbflush) + return -ENOTTY; + + return kvm_x86_ops->enable_direct_tlbflush(vcpu); default: return -EINVAL; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index a7c19540ce21..cb959bc925b1 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -996,6 +996,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_ARM_PTRAUTH_ADDRESS 171 #define KVM_CAP_ARM_PTRAUTH_GENERIC 172 #define KVM_CAP_PMU_EVENT_FILTER 173 +#define KVM_CAP_HYPERV_DIRECT_TLBFLUSH 174 #ifdef KVM_CAP_IRQ_ROUTING -- 2.14.5
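For illustration, a VMM would enable the capability through the existing KVM_ENABLE_CAP vCPU ioctl; a hedged userspace sketch, assuming vcpu_fd is an open vCPU descriptor and that linux/kvm.h defines the new capability number (174 in this series):

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <string.h>

static int enable_hv_direct_tlbflush(int vcpu_fd)
{
        struct kvm_enable_cap cap;

        memset(&cap, 0, sizeof(cap));
        cap.cap = KVM_CAP_HYPERV_DIRECT_TLBFLUSH;

        /* Fails with ENOTTY when the backend lacks the
         * enable_direct_tlbflush callback, per the x86.c hunk above. */
        return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}

A VMM would typically probe KVM_CHECK_EXTENSION for this capability first, and per the documentation hunk it must also hide the KVM CPUID identification before turning the feature on.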
[PATCH V2 0/3] KVM/Hyper-V: Add Hyper-V direct tlb flush support
From: Tianyu Lan This patchset adds Hyper-V direct TLB flush support to KVM. When direct TLB flush is enabled in L1, Hyper-V in L0 handles TLB flush requests from L2 guests on behalf of the L1 hypervisor. Patch 2 introduces the new cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH to enable the feature from user space. User space should enable this feature only when the Hyper-V hypervisor capability is exposed to the guest and the KVM profile is hidden. There is a parameter conflict between KVM and Hyper-V hypercalls, so the L2 guest is expected not to use KVM hypercalls when the feature is enabled. For details, please see the description of the new API "KVM_CAP_HYPERV_DIRECT_TLBFLUSH" Change since v1: - Fix offset issue in the patch 1. - Update description of KVM_CAP_HYPERV_DIRECT_TLBFLUSH. Tianyu Lan (2): x86/Hyper-V: Fix definition of struct hv_vp_assist_page KVM/Hyper-V: Add new KVM cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH Vitaly Kuznetsov (1): KVM/Hyper-V/VMX: Add direct tlb flush support Documentation/virtual/kvm/api.txt | 12 arch/x86/include/asm/hyperv-tlfs.h | 24 +++- arch/x86/include/asm/kvm_host.h| 2 ++ arch/x86/kvm/vmx/evmcs.h | 2 ++ arch/x86/kvm/vmx/vmx.c | 38 ++ arch/x86/kvm/x86.c | 8 include/linux/kvm_host.h | 1 + include/uapi/linux/kvm.h | 1 + 8 files changed, 83 insertions(+), 5 deletions(-) -- 2.14.5
Re: [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
On Tue 13-08-19 17:29:09, Jann Horn wrote: > On Tue, Aug 13, 2019 at 12:09 PM Michal Hocko wrote: > > On Mon 12-08-19 20:14:38, Jann Horn wrote: > > > On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google) > > > wrote: > > > > The page_idle tracking feature currently requires looking up the pagemap > > > > for a process followed by interacting with /sys/kernel/mm/page_idle. > > > > Looking up PFN from pagemap in Android devices is not supported by > > > > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN. > > > > > > > > This patch adds support to directly interact with page_idle tracking at > > > > the PID level by introducing a /proc//page_idle file. It follows > > > > the exact same semantics as the global /sys/kernel/mm/page_idle, but now > > > > looking up PFN through pagemap is not needed since the interface uses > > > > virtual frame numbers, and at the same time also does not require > > > > SYS_ADMIN. > > > > > > > > In Android, we are using this for the heap profiler (heapprofd) which > > > > profiles and pin points code paths which allocates and leaves memory > > > > idle for long periods of time. This method solves the security issue > > > > with userspace learning the PFN, and while at it is also shown to yield > > > > better results than the pagemap lookup, the theory being that the window > > > > where the address space can change is reduced by eliminating the > > > > intermediate pagemap look up stage. In virtual address indexing, the > > > > process's mmap_sem is held for the duration of the access. > > > > > > What happens when you use this interface on shared pages, like memory > > > inherited from the zygote, library file mappings and so on? If two > > > profilers ran concurrently for two different processes that both map > > > the same libraries, would they end up messing up each other's data? > > > > Yup PageIdle state is shared. That is the page_idle semantic even now > > IIRC. > > > > > Can this be used to observe which library pages other processes are > > > accessing, even if you don't have access to those processes, as long > > > as you can map the same libraries? I realize that there are already a > > > bunch of ways to do that with side channels and such; but if you're > > > adding an interface that allows this by design, it seems to me like > > > something that should be gated behind some sort of privilege check. > > > > Hmm, you need to be privileged to get the pfn now and without that you > > cannot get to any page so the new interface is weakening the rules. > > Maybe we should limit setting the idle state to processes with the write > > status. Or do you think that even observing idle status is useful for > > practical side channel attacks? If yes, is that a problem of the > > profiler which does potentially dangerous things? > > I suppose read-only access isn't a real problem as long as the > profiler isn't writing the idle state in a very tight loop... but I > don't see a usecase where you'd actually want that? As far as I can > tell, if you can't write the idle state, being able to read it is > pretty much useless. > > If the profiler only wants to profile process-private memory, then > that should be implementable in a safe way in principle, I think, but > since Joel said that they want to profile CoW memory as well, I think > that's inherently somewhat dangerous. I cannot really say how useful that would be but I can see that implementing ownership checks would be really non-trivial for shared pages. 
Reducing the interface to exclusive pages would make it easier as you noted but less helpful. Besides that the attack vector shouldn't be really much different from the page cache access, right? So essentially can_do_mincore model. I guess we want to document that page idle tracking should be used with care because it potentially opens a side channel opportunity if used on sensitive data. -- Michal Hocko SUSE Labs
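For reference, the access pattern under discussion is a bitmap read: with the proposed /proc/<pid>/page_idle, the index is a virtual frame number rather than a PFN, mirroring the global /sys/kernel/mm/page_idle/bitmap layout (one bit per frame, read in 64-bit words). A hedged userspace sketch, using the path and semantics as proposed in this series (not a merged ABI) and assuming 4K pages:

#include <stdint.h>
#include <unistd.h>

/* fd: an open /proc/<pid>/page_idle descriptor. Returns 1 if the page
 * covering vaddr is still marked idle, 0 if not, -1 on error. */
static int page_is_idle(int fd, uintptr_t vaddr)
{
        uint64_t vfn = vaddr / 4096;
        uint64_t word;

        if (pread(fd, &word, sizeof(word), (vfn / 64) * 8) != sizeof(word))
                return -1;
        return (word >> (vfn % 64)) & 1;
}

Marking pages idle is the mirror image: write a 64-bit word with the relevant bits set at the same offset.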
Re: [PATCH v5 2/6] mm/page_idle: Add support for handling swapped PG_Idle pages
On Tue 13-08-19 11:36:59, Joel Fernandes wrote: > On Tue, Aug 13, 2019 at 05:04:50PM +0200, Michal Hocko wrote: > > On Wed 07-08-19 13:15:55, Joel Fernandes (Google) wrote: > > > Idle page tracking currently does not work well in the following > > > scenario: > > > 1. mark page-A idle which was present at that time. > > > 2. run workload > > > 3. page-A is not touched by workload > > > 4. *sudden* memory pressure happen so finally page A is finally swapped > > > out > > > 5. now see the page A - it appears as if it was accessed (pte unmapped > > > so idle bit not set in output) - but it's incorrect. > > > > > > To fix this, we store the idle information into a new idle bit of the > > > swap PTE during swapping of anonymous pages. > > > > > > Also in the future, madvise extensions will allow a system process > > > manager (like Android's ActivityManager) to swap pages out of a process > > > that it knows will be cold. To an external process like a heap profiler > > > that is doing idle tracking on another process, this procedure will > > > interfere with the idle page tracking similar to the above steps. > > > > This could be solved by checking the !present/swapped out pages > > right? Whoever decided to put the page out to the swap just made it > > idle effectively. So the monitor can make some educated guess for > > tracking. If that is fundamentally not possible then please describe > > why. > > But the monitoring process (profiler) does not have control over the 'whoever > made it effectively idle' process. Why does that matter? Whether it is a global/memcg reclaim or somebody calling MADV_PAGEOUT or whatever it is a decision to make the page not hot. Sure you could argue that a missing idle bit on swap entries might mean that the swap out decision was pre-mature/sub-optimal/wrong but is this the aim of the interface? > As you said it will be a guess, it will not be accurate. Yes and the point I am trying to make is that having some space and not giving a guarantee sounds like a safer option for this interface because > > I am curious what is your concern with using a bit in the swap PTE? ... It is a promise of the semantic I find limiting for the future. The bit in the pte might turn out insufficient (e.g. pte reclaim) so teaching the userspace to consider this a hard guarantee is a ticket to problems later on. Maybe I am overly paranoid because I have seen so many "nice to have" features turning into a maintenance burden in the past. If this is really considered a mostly debugging-purpose interface then a certain level of imprecision should be tolerable. If there is a really strong real world usecase that simply has no other way to go then this might be added later. Adding an information is always safer than take it away. 
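For context, the bit being debated would presumably mirror the existing soft-dirty swap-PTE helpers (pte_swp_mksoft_dirty()/_PAGE_SWP_SOFT_DIRTY on x86). A rough sketch in that style, where _PAGE_SWP_IDLE is a hypothetical flag name and the real patch may differ:

/* Sketch only, modeled on the x86 soft-dirty swap helpers in
 * arch/x86/include/asm/pgtable.h; _PAGE_SWP_IDLE is hypothetical. */
static inline pte_t pte_swp_mkidle(pte_t pte)
{
        return pte_set_flags(pte, _PAGE_SWP_IDLE);
}

static inline int pte_swp_idle(pte_t pte)
{
        return pte_flags(pte) & _PAGE_SWP_IDLE;
}

The swap-out path would transfer PG_idle into the swap PTE via pte_swp_mkidle(), and the idle-tracking read side would test pte_swp_idle() for non-present entries instead of reporting them as accessed.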
Re: [PATCH v8 01/27] Documentation/x86: Add CET description
* Yu-cheng Yu: > +ENDBR > +The compiler inserts an ENDBR at all valid branch targets. Any > +CALL/JMP to a target without an ENDBR triggers a control > +protection fault. Is this really correct? I think ENDBR is needed only for indirect branch targets where the jump/call does not have a NOTRACK prefix. In general, for security hardening, it seems best to minimize the number of ENDBR instructions, and use NOTRACK for indirect jumps which derive the branch target address from information that cannot be modified. Thanks, Florian
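To make the distinction concrete: with GCC's -fcf-protection=branch, a call through a function pointer whose type carries the nocf_check attribute is emitted with the NOTRACK prefix, so its target needs no ENDBR, while an ordinary indirect call is tracked and its target must begin with ENDBR. A hedged sketch (compiles with gcc -fcf-protection=branch):

typedef void (*tracked_fn)(void);
typedef void (*untracked_fn)(void) __attribute__((nocf_check));

void call_both(tracked_fn t, untracked_fn u)
{
        t();    /* tracked: target must start with ENDBR64 */
        u();    /* emitted with NOTRACK: no ENDBR needed at the target */
}

Direct calls and jumps are never tracked, so purely direct targets need no ENDBR either.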
[PATCH 2/2] docs: kbuild: remove cc-ldoption from document again
Commit 055efab3120b ("kbuild: drop support for cc-ldoption") correctly removed the cc-ldoption from Documentation/kbuild/makefiles.txt, but commit cd238effefa2 ("docs: kbuild: convert docs to ReST and rename to *.rst") revived it. I guess it was a rebase mistake. Remove it again. Fixes: cd238effefa2 ("docs: kbuild: convert docs to ReST and rename to *.rst") Signed-off-by: Masahiro Yamada --- Documentation/kbuild/makefiles.rst | 15 --- 1 file changed, 15 deletions(-) diff --git a/Documentation/kbuild/makefiles.rst b/Documentation/kbuild/makefiles.rst index b4c28c543d72..7971729d1fd4 100644 --- a/Documentation/kbuild/makefiles.rst +++ b/Documentation/kbuild/makefiles.rst @@ -471,21 +471,6 @@ more details, with real examples. The second argument is optional, and if supplied will be used if first argument is not supported. -cc-ldoption - cc-ldoption is used to check if $(CC) when used to link object files - supports the given option. An optional second option may be - specified if first option are not supported. - - Example:: - - #arch/x86/kernel/Makefile - vsyscall-flags += $(call cc-ldoption, -Wl$(comma)--hash-style=sysv) - - In the above example, vsyscall-flags will be assigned the option - -Wl$(comma)--hash-style=sysv if it is supported by $(CC). - The second argument is optional, and if supplied will be used - if first argument is not supported. - as-instr as-instr checks if the assembler reports a specific instruction and then outputs either option1 or option2 -- 2.17.1
[PATCH 1/2] docs: kbuild: fix invalid ReST syntax
I see the following warnings when I open this document with a ReST viewer, retext: /home/masahiro/ref/linux/Documentation/kbuild/makefiles.rst:1142: (WARNING/2) Inline emphasis start-string without end-string. /home/masahiro/ref/linux/Documentation/kbuild/makefiles.rst:1152: (WARNING/2) Inline emphasis start-string without end-string. /home/masahiro/ref/linux/Documentation/kbuild/makefiles.rst:1154: (WARNING/2) Inline emphasis start-string without end-string. These hunks were added by commit e846f0dc57f4 ("kbuild: add support for ensuring headers are self-contained") and commit 1e21cbfada87 ("kbuild: support header-test-pattern-y"), respectively. They were written not for ReST but for the plain text, and merged via the kbuild tree. In the same development cycle, this document was converted to ReST by commit cd238effefa2 ("docs: kbuild: convert docs to ReST and rename to *.rst"), and merged via the doc sub-system. Merging them together into Linus' tree resulted in the current situation. To fix the syntax, surround the asterisks with back-quotes, and use :: for the code sample. Fixes: 39ceda5ce1b0 ("Merge tag 'kbuild-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild") Signed-off-by: Masahiro Yamada --- Documentation/kbuild/makefiles.rst | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Documentation/kbuild/makefiles.rst b/Documentation/kbuild/makefiles.rst index f4f0f7ffde2b..b4c28c543d72 100644 --- a/Documentation/kbuild/makefiles.rst +++ b/Documentation/kbuild/makefiles.rst @@ -1139,7 +1139,7 @@ When kbuild executes, the following steps are followed (roughly): header-test-y - header-test-y specifies headers (*.h) in the current directory that + header-test-y specifies headers (`*.h`) in the current directory that should be compile tested to ensure they are self-contained, i.e. compilable as standalone units. If CONFIG_HEADER_TEST is enabled, this builds them as part of extra-y. @@ -1147,11 +1147,11 @@ When kbuild executes, the following steps are followed (roughly): header-test-pattern-y This works as a weaker version of header-test-y, and accepts wildcard - patterns. The typical usage is: + patterns. The typical usage is:: - header-test-pattern-y += *.h + header-test-pattern-y += *.h - This specifies all the files that matches to '*.h' in the current + This specifies all the files that matches to `*.h` in the current directory, but the files in 'header-test-' are excluded. 6.7 Commands useful for building a boot image -- 2.17.1
[PATCH 0/2] docs: two fixes for Kbuild document after the ReST conversion
The ReST conversion was merged in the previous merge window. Iron out some issues. Masahiro Yamada (2): docs: kbuild: fix invalid ReST syntax docs: kbuild: remove cc-ldoption from document again Documentation/kbuild/makefiles.rst | 23 --- 1 file changed, 4 insertions(+), 19 deletions(-) -- 2.17.1
Re: [RFC 01/19] kbuild: Fixes to rules for host-cshlib and host-cxxshlib
On Wed, 2019-08-14 at 07:52 +0200, Knut Omang wrote: > On Wed, 2019-08-14 at 11:02 +0900, Masahiro Yamada wrote: > > Hi Knut, > > > > On Wed, Aug 14, 2019 at 1:19 AM Knut Omang wrote: > > > On Tue, 2019-08-13 at 23:01 +0900, Masahiro Yamada wrote: > > > > On Tue, Aug 13, 2019 at 3:13 PM Knut Omang > > > > wrote: > > > > > C++ libraries interfacing to C APIs might sometimes need some glue > > > > > logic more easily written in C. > > > > > Allow a C++ library to also contain 0 or more C objects. > > > > > > > > > > Also fix rules for both C and C++ shared libraries: > > > > > - C++ shared libraries depended on .c instead of .cc files > > > > > - Rules were referenced as -objs instead of the intended > > > > > -cobjs and -cxxobjs following the pattern from hostprogs*. > > > > > > > > > > Signed-off-by: Knut Omang > > > > > > > > How is this patch related to the rest of this series? > > > > > > This is just my (likely naive) way to get what I had working > > > using autotools in the Github version of KTF translated into something > > > comparable using kbuild only. We need to build a shared library consisting > > > of a few C++ files and a very simple C file, and a couple of simple > > > binaries, > > > and the rule in there does seem to take .c files and subject them to the > > > C++ compiler, which makes this difficult to achieve? > > > > Looking at the diff stat of the cover-letter, > > the rest of this patch series is touching only > > Documentation/ and tools/testing/kselftests/. > > > > So, this one is unused by the rest of the changes, isn't it? > > Am I missing something? > > > > > > > > > > This patch breaks GCC-plugins. > > > > Did you really compile-test this patch before the submission? > > > > > > Sorry for my ignorance here: > > > I ran through the kernel build and installed the resulting kernel > > > on a VM that I used to test this, if that's what you are asking > > > about? > > > > > > Do I need some unusual .config options or run a special make target > > > to trigger the problem you see? > > > > > > I used a recent Fedora config with default values for new options, > > > and ran the normal default make target (also with O=) and make selftests > > > to test the patch itself. > > > > I just built allmodconfig for arm. > > > > (The 0-day bot tests allmodconfig for most of architectures, > > so you may receive error reports anyway.) 
> > > > > > With your patch, I got the following: > > > > > > masahiro@grover:~/ref/linux$ make ARCH=arm > > CROSS_COMPILE=- allmodconfig all > > HOSTCC scripts/basic/fixdep > > HOSTCC scripts/kconfig/conf.o > > HOSTCC scripts/kconfig/confdata.o > > HOSTCC scripts/kconfig/expr.o > > LEX scripts/kconfig/lexer.lex.c > > YACCscripts/kconfig/parser.tab.h > > HOSTCC scripts/kconfig/lexer.lex.o > > YACCscripts/kconfig/parser.tab.c > > HOSTCC scripts/kconfig/parser.tab.o > > HOSTCC scripts/kconfig/preprocess.o > > HOSTCC scripts/kconfig/symbol.o > > HOSTLD scripts/kconfig/conf > > scripts/kconfig/conf --allmodconfig Kconfig > > # > > # configuration written to .config > > # > > SYSHDR arch/arm/include/generated/uapi/asm/unistd-common.h > > SYSHDR arch/arm/include/generated/uapi/asm/unistd-oabi.h > > SYSHDR arch/arm/include/generated/uapi/asm/unistd-eabi.h > > HOSTCC scripts/dtc/dtc.o > > HOSTCC scripts/dtc/flattree.o > > HOSTCC scripts/dtc/fstree.o > > HOSTCC scripts/dtc/data.o > > HOSTCC scripts/dtc/livetree.o > > HOSTCC scripts/dtc/treesource.o > > HOSTCC scripts/dtc/srcpos.o > > HOSTCC scripts/dtc/checks.o > > HOSTCC scripts/dtc/util.o > > LEX scripts/dtc/dtc-lexer.lex.c > > YACCscripts/dtc/dtc-parser.tab.h > > HOSTCC scripts/dtc/dtc-lexer.lex.o > > YACCscripts/dtc/dtc-parser.tab.c > > HOSTCC scripts/dtc/dtc-parser.tab.o > > HOSTCC scripts/dtc/yamltree.o > > HOSTLD scripts/dtc/dtc > > CC scripts/gcc-plugins/latent_entropy_plugin.o > > cc1: error: cannot load plugin > > ./scripts/gcc-plugins/arm_ssp_per_task_plugin.so > >./scripts/gcc-plugins/arm_ssp_per_task_plugin.so: cannot open > > shared object file: No such file or directory > > cc1: error: cannot load plugin ./scripts/gcc-plugins/structleak_plugin.so > >./scripts/gcc-plugins/structleak_plugin.so: cannot open shared > > object file: No such file or directory > > cc1: error: cannot load plugin > > ./scripts/gcc-plugins/latent_entropy_plugin.so > >./scripts/gcc-plugins/latent_entropy_plugin.so: cannot open shared > > object file: No such file or directory > > cc1: error: cannot load plugin > > ./scripts/gcc-plugins/randomize_layout_plugin.so > >./scripts/gcc-plugins/randomize_layout_plugin.so: cannot open > > shared object file: No such file or directory > > make[3]: *** [scripts/Makefile.build;281: > > scripts/gcc-plugins/latent_entropy_plugin.o] Error 1 > > make[2]: *** [scripts/Makefile.build;497: scripts/gcc-plugins] Error 2 > > make[1]: *** [Make
Re: [PATCH V2 1/3] x86/Hyper-V: Fix definition of struct hv_vp_assist_page
On 14/08/19 09:34, lantianyu1...@gmail.com wrote: > From: Tianyu Lan > > The struct hv_vp_assist_page was defined incorrectly. > The "vtl_control" should be u64[3], "nested_enlightenments > _control" should be a u64 and there is 7 reserved bytes > following "enlighten_vmentry". This patch is to fix it. How did the assignment to vp_ap->current_nested_vmcs work then? Does the guest simply not care? Paolo > Signed-off-by: Tianyu Lan > -- > Change since v1: >Move definition of struct hv_nested_enlightenments_control >into this patch to fix offset issue. > --- > arch/x86/include/asm/hyperv-tlfs.h | 20 +++- > 1 file changed, 15 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/include/asm/hyperv-tlfs.h > b/arch/x86/include/asm/hyperv-tlfs.h > index af78cd72b8f3..cf0b2a04271d 100644 > --- a/arch/x86/include/asm/hyperv-tlfs.h > +++ b/arch/x86/include/asm/hyperv-tlfs.h > @@ -514,14 +514,24 @@ struct hv_timer_message_payload { > __u64 delivery_time;/* When the message was delivered */ > } __packed; > > +struct hv_nested_enlightenments_control { > + struct { > + __u32 directhypercall:1; > + __u32 reserved:31; > + } features; > + struct { > + __u32 reserved; > + } hypercallControls; > +} __packed; > + > /* Define virtual processor assist page structure. */ > struct hv_vp_assist_page { > __u32 apic_assist; > - __u32 reserved; > - __u64 vtl_control[2]; > - __u64 nested_enlightenments_control[2]; > - __u32 enlighten_vmentry; > - __u32 padding; > + __u32 reserved1; > + __u64 vtl_control[3]; > + struct hv_nested_enlightenments_control nested_control; > + __u8 enlighten_vmentry; > + __u8 reserved2[7]; > __u64 current_nested_vmcs; > } __packed; > >
Re: [PATCH V2 1/3] x86/Hyper-V: Fix definition of struct hv_vp_assist_page
On 14/08/19 15:26, Paolo Bonzini wrote: > On 14/08/19 09:34, lantianyu1...@gmail.com wrote: >> From: Tianyu Lan >> >> The struct hv_vp_assist_page was defined incorrectly. >> The "vtl_control" should be u64[3], "nested_enlightenments >> _control" should be a u64 and there is 7 reserved bytes >> following "enlighten_vmentry". This patch is to fix it. > > How did the assignment to vp_ap->current_nested_vmcs work then? Does > the guest simply not care? ... nevermind, I miscounted the length of vtl_control. Paolo > Paolo > >> Signed-off-by: Tianyu Lan >> -- >> Change since v1: >>Move definition of struct hv_nested_enlightenments_control >>into this patch to fix offset issue. >> --- >> arch/x86/include/asm/hyperv-tlfs.h | 20 +++- >> 1 file changed, 15 insertions(+), 5 deletions(-) >> >> diff --git a/arch/x86/include/asm/hyperv-tlfs.h >> b/arch/x86/include/asm/hyperv-tlfs.h >> index af78cd72b8f3..cf0b2a04271d 100644 >> --- a/arch/x86/include/asm/hyperv-tlfs.h >> +++ b/arch/x86/include/asm/hyperv-tlfs.h >> @@ -514,14 +514,24 @@ struct hv_timer_message_payload { >> __u64 delivery_time;/* When the message was delivered */ >> } __packed; >> >> +struct hv_nested_enlightenments_control { >> +struct { >> +__u32 directhypercall:1; >> +__u32 reserved:31; >> +} features; >> +struct { >> +__u32 reserved; >> +} hypercallControls; >> +} __packed; >> + >> /* Define virtual processor assist page structure. */ >> struct hv_vp_assist_page { >> __u32 apic_assist; >> -__u32 reserved; >> -__u64 vtl_control[2]; >> -__u64 nested_enlightenments_control[2]; >> -__u32 enlighten_vmentry; >> -__u32 padding; >> +__u32 reserved1; >> +__u64 vtl_control[3]; >> +struct hv_nested_enlightenments_control nested_control; >> +__u8 enlighten_vmentry; >> +__u8 reserved2[7]; >> __u64 current_nested_vmcs; >> } __packed; >> >> >
Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching
On Tue, Aug 13, 2019 at 12:07:38PM -0700, Paul E. McKenney wrote: [snip] > > This patch adds basic batching support for kfree_rcu(). It is "basic" > > because we do none of the slab management, dynamic allocation, code > > moving or any of the other things, some of which previous attempts did > > [2]. These fancier improvements can be follow-up patches and there are > > different ideas being discussed in those regards. > > > > Torture tests follow in the next patch and show improvements of around > > 400% in reduction of number of grace periods on a 16 CPU system. More > > s/400% in reduction/a 5x reduction/ Ok. That's more clear. > > details and test data are in that patch. > > > > This is an effort to start simple, and build up from there. In the > > future, an extension to use kfree_bulk and possibly per-slab batching > > could be done to further improve performance due to cache-locality and > > slab-specific bulk free optimizations. By using an array of pointers, > > the worker thread processing the work would need to read lesser data > > since it does not need to deal with large rcu_head(s) any longer. > > This should be combined with the second paragraph -- they both discuss > possible follow-on work. Ack. > > There is an implication with rcu_barrier() with this patch. Since the > > kfree_rcu() calls can be batched, and may not be handed yet to the RCU > > machinery in fact, the monitor may not have even run yet to do the > > queue_rcu_work(), there seems no easy way of implementing rcu_barrier() > > to wait for those kfree_rcu()s that are already made. So this means a > > kfree_rcu() followed by an rcu_barrier() does not imply that memory will > > be freed once rcu_barrier() returns. > > The usual approach (should kfree_rcu_barrier() in fact be needed) is to > record the address of the most recent block being kfree_rcu()'d on > each CPU, count the number of recorded addresses, and have the > kfree() side check the address, and do the usual atomic_dec_and_test() > and wakeup dance. This gets a bit more crazy should the kfree_rcu() > requests be grouped per slab, of course! So yes, good to avoid in > the meantime. Good idea! > > Another implication is higher active memory usage (although not > > run-away..) until the kfree_rcu() flooding ends, in comparison to > > without batching. More details about this are in the second patch which > > adds an rcuperf test. > > But this is adjustable, at least via patching at build time, via > KFREE_DRAIN_JIFFIES, right? Yes. > > Finally, in the near future we will get rid of kfree_rcu special casing > > within RCU such as in rcu_do_batch and switch everything to just > > batching. Currently we don't do that since timer subsystem is not yet up > > and we cannot schedule the kfree_rcu() monitor as the timer subsystem's > > lock are not initialized. That would also mean getting rid of > > kfree_call_rcu_nobatch() entirely. > > Please consistently put "()" after function names. Done. > > [1] > > http://lore.kernel.org/lkml/20190723035725-mutt-send-email-...@kernel.org > > [2] https://lkml.org/lkml/2017/12/19/824 > > > > Cc: Rao Shoaib > > Cc: max.byungchul.p...@gmail.com > > Cc: byungchul.p...@lge.com > > Cc: kernel-t...@android.com > > Cc: kernel-t...@lge.com > > Co-developed-by: Byungchul Park > > Signed-off-by: Byungchul Park > > Signed-off-by: Joel Fernandes (Google) > > > > --- > > v2->v3: Just some code comment changes thanks to Byungchul. > > > > RFCv1->PATCH v2: Removed limits on the ->head list, just let it grow. 
> >Dropped KFREE_MAX_JIFFIES to HZ/50 from HZ/20 to reduce > > OOM occurrence. > >Removed sleeps in rcuperf test, just using > > cond_resched()in loop. > >Better code comments ;) > > > > .../admin-guide/kernel-parameters.txt | 17 ++ > > include/linux/rcutiny.h | 5 + > > include/linux/rcutree.h | 1 + > > kernel/rcu/tree.c | 204 +- > > 4 files changed, 221 insertions(+), 6 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > > b/Documentation/admin-guide/kernel-parameters.txt > > index 7ccd158b3894..a9156ca5de24 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -3895,6 +3895,23 @@ > > test until boot completes in order to avoid > > interference. > > > > + rcuperf.kfree_rcu_test= [KNL] > > + Set to measure performance of kfree_rcu() flooding. > > + > > + rcuperf.kfree_nthreads= [KNL] > > + The number of threads running loops of kfree_rcu(). > > + > > + rcuperf.kfree_alloc_num= [KNL] > > + Number of allocations and frees done in an iteration. > > + > > + rcuperf.kfree_loops= [KNL] > > + Number of loops doing rcuperf.kfree_alloc_nu
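Aside on Paul's kfree_rcu_barrier() outline above (record the most recent block per CPU, count the recorded addresses, and have the kfree() side do the atomic_dec_and_test()-and-wakeup dance): a very rough sketch of that shape might look like the following. Every name here is hypothetical, the obvious races with concurrent kfree_rcu() callers are left out, and this is an illustration of the idea rather than a proposed patch:

static DEFINE_PER_CPU(void *, kfree_rcu_last_block);
static atomic_t kfree_rcu_barrier_count;
static DECLARE_COMPLETION(kfree_rcu_barrier_done);

void kfree_rcu_barrier(void)
{
        int cpu, n = 0;

        /* Count CPUs that currently have a recorded in-flight block. */
        for_each_possible_cpu(cpu)
                if (per_cpu(kfree_rcu_last_block, cpu))
                        n++;
        atomic_set(&kfree_rcu_barrier_count, n);
        if (n)
                wait_for_completion(&kfree_rcu_barrier_done);
}

/* Called from the free path when it frees a block that some CPU had
 * recorded as its most recent kfree_rcu() request. */
static void kfree_rcu_barrier_note_free(void)
{
        if (atomic_dec_and_test(&kfree_rcu_barrier_count))
                complete(&kfree_rcu_barrier_done);
}

As Paul notes, per-slab grouping would complicate this considerably, which is a reason to avoid promising barrier semantics for now.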
[PATCH 1/3] kbuild: move KBUILD_LDS, KBUILD_VMLINUX_{OBJS,LIBS} to makefiles.rst
These three variables are not intended to be tweaked by users. Move them from kbuild.rst to makefiles.rst. Signed-off-by: Masahiro Yamada --- I will apply this to linux-kbuild to avoid conflicts. Documentation/kbuild/kbuild.rst| 14 -- Documentation/kbuild/makefiles.rst | 14 ++ 2 files changed, 14 insertions(+), 14 deletions(-) diff --git a/Documentation/kbuild/kbuild.rst b/Documentation/kbuild/kbuild.rst index 61b2181ed3ea..62f9d86c082c 100644 --- a/Documentation/kbuild/kbuild.rst +++ b/Documentation/kbuild/kbuild.rst @@ -258,17 +258,3 @@ KBUILD_BUILD_USER, KBUILD_BUILD_HOST These two variables allow to override the user@host string displayed during boot and in /proc/version. The default value is the output of the commands whoami and host, respectively. - -KBUILD_LDS --- -The linker script with full path. Assigned by the top-level Makefile. - -KBUILD_VMLINUX_OBJS -All object files for vmlinux. They are linked to vmlinux in the same -order as listed in KBUILD_VMLINUX_OBJS. - -KBUILD_VMLINUX_LIBS -All .a "lib" files for vmlinux. KBUILD_VMLINUX_OBJS and KBUILD_VMLINUX_LIBS -together specify all the object files used to link vmlinux. diff --git a/Documentation/kbuild/makefiles.rst b/Documentation/kbuild/makefiles.rst index f4f0f7ffde2b..d3448d2c8017 100644 --- a/Documentation/kbuild/makefiles.rst +++ b/Documentation/kbuild/makefiles.rst @@ -995,6 +995,20 @@ When kbuild executes, the following steps are followed (roughly): top-level Makefile has set any other flags. This provides a means for an architecture to override the defaults. +KBUILD_LDS + + The linker script with full path. Assigned by the top-level Makefile. + +KBUILD_VMLINUX_OBJS + + All object files for vmlinux. They are linked to vmlinux in the same + order as listed in KBUILD_VMLINUX_OBJS. + +KBUILD_VMLINUX_LIBS + + All .a "lib" files for vmlinux. KBUILD_VMLINUX_OBJS and + KBUILD_VMLINUX_LIBS together specify all the object files used to + link vmlinux. 6.2 Add prerequisites to archheaders -- 2.17.1
Re: [PATCH v8 01/27] Documentation/x86: Add CET description
On Wed, 2019-08-14 at 10:07 +0200, Florian Weimer wrote: > * Yu-cheng Yu: > > > +ENDBR > > +The compiler inserts an ENDBR at all valid branch targets. Any > > +CALL/JMP to a target without an ENDBR triggers a control > > +protection fault. > > Is this really correct? I think ENDBR is needed only for indirect > branch targets where the jump/call does not have a NOTRACK prefix. You are right. I will fix the wording. Yu-cheng
[PATCH 2/3] kbuild: rebuild modules when module linker scripts are updated
Currently, the timestamps of module linker scripts are not checked. Add them to the dependencies of modules so they are correctly rebuilt. Signed-off-by: Masahiro Yamada --- Documentation/kbuild/makefiles.rst | 5 + Makefile | 3 ++- arch/arm/Makefile | 2 +- arch/arm64/Makefile| 2 +- arch/ia64/Makefile | 2 +- arch/m68k/Makefile | 2 +- arch/parisc/Makefile | 2 +- arch/powerpc/Makefile | 2 +- arch/riscv/Makefile| 2 +- scripts/Makefile.modpost | 5 +++-- 10 files changed, 17 insertions(+), 10 deletions(-) diff --git a/Documentation/kbuild/makefiles.rst b/Documentation/kbuild/makefiles.rst index d3448d2c8017..36ba92e199d2 100644 --- a/Documentation/kbuild/makefiles.rst +++ b/Documentation/kbuild/makefiles.rst @@ -999,6 +999,11 @@ When kbuild executes, the following steps are followed (roughly): The linker script with full path. Assigned by the top-level Makefile. +KBUILD_LDS_MODULE + + The module linker script with full path. Assigned by the top-level + Makefile and additionally by the arch Makefile. + KBUILD_VMLINUX_OBJS All object files for vmlinux. They are linked to vmlinux in the same diff --git a/Makefile b/Makefile index 164ca615e2f6..af808837a1f2 100644 --- a/Makefile +++ b/Makefile @@ -485,7 +485,8 @@ KBUILD_AFLAGS_KERNEL := KBUILD_CFLAGS_KERNEL := KBUILD_AFLAGS_MODULE := -DMODULE KBUILD_CFLAGS_MODULE := -DMODULE -KBUILD_LDFLAGS_MODULE := -T $(srctree)/scripts/module-common.lds +KBUILD_LDFLAGS_MODULE := +export KBUILD_LDS_MODULE := $(srctree)/scripts/module-common.lds KBUILD_LDFLAGS := GCC_PLUGINS_CFLAGS := CLANG_FLAGS := diff --git a/arch/arm/Makefile b/arch/arm/Makefile index c3624ca6c0bc..fbe50eec8f34 100644 --- a/arch/arm/Makefile +++ b/arch/arm/Makefile @@ -17,7 +17,7 @@ KBUILD_LDFLAGS_MODULE += --be8 endif ifeq ($(CONFIG_ARM_MODULE_PLTS),y) -KBUILD_LDFLAGS_MODULE += -T $(srctree)/arch/arm/kernel/module.lds +KBUILD_LDS_MODULE += $(srctree)/arch/arm/kernel/module.lds endif GZFLAGS:=-9 diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile index 61de992bbea3..d4ed1869e536 100644 --- a/arch/arm64/Makefile +++ b/arch/arm64/Makefile @@ -101,7 +101,7 @@ endif CHECKFLAGS += -D__aarch64__ ifeq ($(CONFIG_ARM64_MODULE_PLTS),y) -KBUILD_LDFLAGS_MODULE += -T $(srctree)/arch/arm64/kernel/module.lds +KBUILD_LDS_MODULE += $(srctree)/arch/arm64/kernel/module.lds endif # Default value diff --git a/arch/ia64/Makefile b/arch/ia64/Makefile index 171290f9f1de..5c3bcaee5980 100644 --- a/arch/ia64/Makefile +++ b/arch/ia64/Makefile @@ -20,7 +20,7 @@ CHECKFLAGS+= -D__ia64=1 -D__ia64__=1 -D_LP64 -D__LP64__ OBJCOPYFLAGS := --strip-all LDFLAGS_vmlinux:= -static -KBUILD_LDFLAGS_MODULE += -T $(srctree)/arch/ia64/module.lds +KBUILD_LDS_MODULE += $(srctree)/arch/ia64/module.lds KBUILD_AFLAGS_KERNEL := -mconstant-gp EXTRA := diff --git a/arch/m68k/Makefile b/arch/m68k/Makefile index 482513b9af2c..5d9288384096 100644 --- a/arch/m68k/Makefile +++ b/arch/m68k/Makefile @@ -73,7 +73,7 @@ KBUILD_AFLAGS += -D__uClinux__ endif KBUILD_LDFLAGS := -m m68kelf -KBUILD_LDFLAGS_MODULE += -T $(srctree)/arch/m68k/kernel/module.lds +KBUILD_LDS_MODULE += $(srctree)/arch/m68k/kernel/module.lds ifdef CONFIG_SUN3 LDFLAGS_vmlinux = -N diff --git a/arch/parisc/Makefile b/arch/parisc/Makefile index 3b77d729057f..36b834f1c933 100644 --- a/arch/parisc/Makefile +++ b/arch/parisc/Makefile @@ -60,7 +60,7 @@ KBUILD_CFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY=1 \ -DFTRACE_PATCHABLE_FUNCTION_SIZE=$(NOP_COUNT) CC_FLAGS_FTRACE := -fpatchable-function-entry=$(NOP_COUNT),$(shell echo $$(($(NOP_COUNT)-1))) -KBUILD_LDFLAGS_MODULE += -T 
$(srctree)/arch/parisc/kernel/module.lds +KBUILD_LDS_MODULE += $(srctree)/arch/parisc/kernel/module.lds endif OBJCOPY_FLAGS =-O binary -R .note -R .comment -S diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index c345b79414a9..b2227855de20 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -67,7 +67,7 @@ UTS_MACHINE := $(subst $(space),,$(machine-y)) ifdef CONFIG_PPC32 KBUILD_LDFLAGS_MODULE += arch/powerpc/lib/crtsavres.o else -KBUILD_LDFLAGS_MODULE += -T $(srctree)/arch/powerpc/kernel/module.lds +KBUILD_LDS_MODULE += $(srctree)/arch/powerpc/kernel/module.lds ifeq ($(call ld-ifversion, -ge, 22500, y),y) # Have the linker provide sfpr if possible. # There is a corresponding test in arch/powerpc/lib/Makefile diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile index 7a117be8297c..426d989125a8 100644 --- a/arch/riscv/Makefile +++ b/arch/riscv/Makefile @@ -52,7 +52,7 @@ ifeq ($(CONFIG_CMODEL_MEDANY),y) KBUILD_CFLAGS += -mcmodel=medany endif ifeq ($(CONFIG_MODULE_SECTIONS),y) - KBUILD_LDFLAGS_MODULE += -T $(srctree)/arch/riscv/kernel/module.lds + KBUILD_LDS_MODULE
Re: [PATCH v8 09/27] mm/mmap: Prevent Shadow Stack VMA merges
On Tue, 2019-08-13 at 15:34 -0700, Dave Hansen wrote: > On 8/13/19 1:52 PM, Yu-cheng Yu wrote: > > To prevent function call/return spills into the next shadow stack > > area, do not merge shadow stack areas. > > How does this prevent call/return spills? It does not. I will fix the description. Yu-cheng
Re: [PATCH v5 2/6] mm/page_idle: Add support for handling swapped PG_Idle pages
On Wed, Aug 14, 2019 at 10:05:31AM +0200, Michal Hocko wrote: > On Tue 13-08-19 11:36:59, Joel Fernandes wrote: > > On Tue, Aug 13, 2019 at 05:04:50PM +0200, Michal Hocko wrote: > > > On Wed 07-08-19 13:15:55, Joel Fernandes (Google) wrote: > > > > Idle page tracking currently does not work well in the following > > > > scenario: > > > > 1. mark page-A idle which was present at that time. > > > > 2. run workload > > > > 3. page-A is not touched by workload > > > > 4. *sudden* memory pressure happen so finally page A is finally > > > > swapped out > > > > 5. now see the page A - it appears as if it was accessed (pte > > > > unmapped > > > > so idle bit not set in output) - but it's incorrect. > > > > > > > > To fix this, we store the idle information into a new idle bit of the > > > > swap PTE during swapping of anonymous pages. > > > > > > > > Also in the future, madvise extensions will allow a system process > > > > manager (like Android's ActivityManager) to swap pages out of a > > > > process > > > > that it knows will be cold. To an external process like a heap > > > > profiler > > > > that is doing idle tracking on another process, this procedure will > > > > interfere with the idle page tracking similar to the above steps. > > > > > > This could be solved by checking the !present/swapped out pages > > > right? Whoever decided to put the page out to the swap just made it > > > idle effectively. So the monitor can make some educated guess for > > > tracking. If that is fundamentally not possible then please describe > > > why. > > > > But the monitoring process (profiler) does not have control over the > > 'whoever > > made it effectively idle' process. > > Why does that matter? Whether it is a global/memcg reclaim or somebody > calling MADV_PAGEOUT or whatever it is a decision to make the page not > hot. Sure you could argue that a missing idle bit on swap entries might > mean that the swap out decision was pre-mature/sub-optimal/wrong but is > this the aim of the interface? > > > As you said it will be a guess, it will not be accurate. > > Yes and the point I am trying to make is that having some space and not > giving a guarantee sounds like a safer option for this interface because I do see your point of view, but just because a future (and possibly not going to happen) usecase which you mentioned as pte reclaim, makes you feel that userspace may be subject to inaccuracies anyway, doesn't mean we should make everything inaccurate.. We already know idle page tracking is not completely accurate. But that doesn't mean we should miss the opportunity to make the "non pte-reclaim" usecase accurate as well. IMO, we should do our best for today, and not hypothesize. How likely is pte reclaim and is there a thread describing that direction? > > I am curious what is your concern with using a bit in the swap PTE? > > ... It is a promise of the semantic I find limiting for future. The bit > in the pte might turn out insufficient (e.g. pte reclaim) so teaching > the userspace to consider this a hard guarantee is a ticket to problems > later on. Maybe I am overly paranoid because I have seen so many "nice > to have" features turning into a maintenance burden in the past. > > If this is really considered mostly debugging purpouse interface then a > certain level of imprecision should be tolerateable. If there is a > really strong real world usecase that simply has no other way to go > then this might be added later. Adding an information is always safer > than take it away. 
> > That being said, if I am a minority voice here then I will not really > stand in the way and won't nack the patch. I will not ack it neither > though. Ok. thanks, - Joel
Re: [PATCH v3 01/12] fpga: dfl: fme: support 512bit data width PR
On Wed, 2019-07-24 at 22:22 +0800, Wu Hao wrote: > On Wed, Jul 24, 2019 at 11:35:32AM +0200, Greg KH wrote: > > On Tue, Jul 23, 2019 at 12:51:24PM +0800, Wu Hao wrote: > > > > > > @@ -67,8 +69,43 @@ > > > #define PR_WAIT_TIMEOUT 800 > > > #define PR_HOST_STATUS_IDLE 0 > > > > > > +#if defined(CONFIG_X86) && defined(CONFIG_AS_AVX512) > > > + > > > +#include > > > +#include > > > + > > > +static inline int is_cpu_avx512_enabled(void) > > > +{ > > > + return cpu_feature_enabled(X86_FEATURE_AVX512F); > > > +} > > > > That's a very arch specific function, why would a driver ever care about > > this? > > Yes, this is only applied to a specific FPGA solution, which FPGA > has been integrated with XEON. Hardware indicates this using register > to software. As it's cpu integrated solution, so CPU always has this > AVX512 capability. The only check we do, is make sure this is not > manually disabled by kernel. > > With this hardware, software could use AVX512 to accelerate the FPGA > partial reconfiguration as mentioned in the patch commit message. > It brings performance benefits to people who use it. This is only one > optimization (512 vs 32bit data write to hw) for a specific hardware. I thought earlier you said that 512 bit accesses were required for this particular integrated-only version of the device, and not just an optimization? > > > +#else > > > +static inline int is_cpu_avx512_enabled(void) > > > +{ > > > + return 0; > > > +} > > > + > > > +static inline void copy512(const void *src, void __iomem *dst) > > > +{ > > > + WARN_ON_ONCE(1); > > > > Are you trying to get reports from syzbot? :) > > Oh.. no.. I will remove it. :) > > Thank you very much! What's wrong with this? The driver should never call copy512() if is_cpu_avx512_enabled() returns 0, and if syzbot can somehow make the driver do so, then yes we do want a report. -Scott
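For readers following along, the 512-bit write being discussed is presumably of roughly this shape; this is a sketch, not necessarily the driver's exact code, and in-kernel ZMM use must also be bracketed by kernel_fpu_begin()/kernel_fpu_end():

/* Copy one 64-byte block from memory to the hardware PR data register. */
static inline void copy512(const void *src, void __iomem *dst)
{
        asm volatile("vmovdqu64 (%0), %%zmm0;"
                     "vmovdqu64 %%zmm0, (%1);"
                     :
                     : "r" (src), "r" (dst)
                     : "memory");
}

This is also why the WARN_ON_ONCE() stub matters: copy512() must never run unless is_cpu_avx512_enabled() returned true, so tripping the warning would mean the driver's gating logic is broken.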
Re: [PATCH v8 15/27] mm: Handle shadow stack page fault
On Tue, 2019-08-13 at 15:55 -0700, Andy Lutomirski wrote: > On Tue, Aug 13, 2019 at 2:02 PM Yu-cheng Yu wrote: > > > > When a task does fork(), its shadow stack (SHSTK) must be duplicated > > for the child. This patch implements a flow similar to copy-on-write > > of an anonymous page, but for SHSTK. > > > > A SHSTK PTE must be RO and dirty. This dirty bit requirement is used > > to effect the copying. In copy_one_pte(), clear the dirty bit from a > > SHSTK PTE to cause a page fault upon the next SHSTK access. At that > > time, fix the PTE and copy/re-use the page. > > Is using VM_SHSTK and special-casing all of this really better than > using a special mapping or other pseudo-file-backed VMA and putting > all the magic in the vm_operations? A special mapping is cleaner. However, we also need to exclude normal [RO + dirty] pages from shadow stack. Yu-cheng
Re: [PATCH v8 15/27] mm: Handle shadow stack page fault
On 8/14/19 9:27 AM, Yu-cheng Yu wrote: > On Tue, 2019-08-13 at 15:55 -0700, Andy Lutomirski wrote: >> On Tue, Aug 13, 2019 at 2:02 PM Yu-cheng Yu wrote: >>> When a task does fork(), its shadow stack (SHSTK) must be duplicated >>> for the child. This patch implements a flow similar to copy-on-write >>> of an anonymous page, but for SHSTK. >>> >>> A SHSTK PTE must be RO and dirty. This dirty bit requirement is used >>> to effect the copying. In copy_one_pte(), clear the dirty bit from a >>> SHSTK PTE to cause a page fault upon the next SHSTK access. At that >>> time, fix the PTE and copy/re-use the page. >> Is using VM_SHSTK and special-casing all of this really better than >> using a special mapping or other pseudo-file-backed VMA and putting >> all the magic in the vm_operations? > A special mapping is cleaner. However, we also need to exclude normal [RO + > dirty] pages from shadow stack. I don't understand what you are saying. Are you saying that we need this VM_SHSTK flag in order to exclude RO+HW-Dirty pages from being created in non-shadow-stack VMAs?
Re: [PATCH v8 11/27] x86/mm: Introduce _PAGE_DIRTY_SW
On Tue, 2019-08-13 at 16:02 -0700, Dave Hansen wrote: [...] > Please also reconcile the supervisor XSAVE portion of your patches with > the ones that Fenghua has been sending around. I've given quite a bit > of feedback to improve those. Please consolidate and agree on a common > set of patches with him. XSAVES supervisor is now a six-patch set. Maybe we can make it a separate series? I will consolidate and send it out. Yu-cheng
Re: [PATCH v8 11/27] x86/mm: Introduce _PAGE_DIRTY_SW
On 8/14/19 9:42 AM, Yu-cheng Yu wrote: > On Tue, 2019-08-13 at 16:02 -0700, Dave Hansen wrote: > [...] >> Please also reconcile the supervisor XSAVE portion of your patches with >> the ones that Fenghua has been sending around. I've given quite a bit >> of feedback to improve those. Please consolidate and agree on a common >> set of patches with him. > XSAVES supervisor is now a six-patch set. Maybe we can make it a separate > series? I will consolidate and send it out. A separate series would be great. Please also make sure it's in a (temporary) git tree somewhere so that it's easy to base other sets on top of it.
Re: [RFC 06/19] ktf: A simple debugfs interface to test results
On Tue, 2019-08-13 at 10:21 +0200, Greg Kroah-Hartman wrote: > On Tue, Aug 13, 2019 at 08:09:21AM +0200, Knut Omang wrote: > > From: Alan Maguire > > > > While test results are available via netlink from user space, sometimes > > it may be useful to be able to access the results from the kernel as well, > > for instance due to a crash. Make that possible via debugfs. > > > > ktf_debugfs.h: Support for creating a debugfs representation of test > > > > Signed-off-by: Alan Maguire > > Signed-off-by: Knut Omang > > --- > > tools/testing/selftests/ktf/kernel/ktf_debugfs.c | 356 - > > tools/testing/selftests/ktf/kernel/ktf_debugfs.h | 34 ++- > > 2 files changed, 390 insertions(+) > > create mode 100644 tools/testing/selftests/ktf/kernel/ktf_debugfs.c > > create mode 100644 tools/testing/selftests/ktf/kernel/ktf_debugfs.h > > > > diff --git a/tools/testing/selftests/ktf/kernel/ktf_debugfs.c > > b/tools/testing/selftests/ktf/kernel/ktf_debugfs.c > > new file mode 100644 > > index 000..a20fbd2 > > --- /dev/null > > +++ b/tools/testing/selftests/ktf/kernel/ktf_debugfs.c > > @@ -0,0 +1,356 @@ > > +/* > > + * Copyright (c) 2009, 2017, Oracle and/or its affiliates. All rights > > reserved. > > + *Author: Alan Maguire > > + * > > + * SPDX-License-Identifier: GPL-2.0 > > Has to be the first line of the file, did you run this through > checkpatch? Yes, the code has been subject to continuous integration which uses a version of my runchecks tool (https://lkml.org/lkml/2018/1/19/157) to ensure that it is not possible to "forget" to run checkpatch (or sparse, smatch doc.check for that sake) Ironically though I fell victim to my own tooling here, as I postponed fixing the SPDX_LICENSE_TAG class of issues once that test appeared, while working on something else, and just forgot to re-enable it again.. > > +static int ktf_run_test_open(struct inode *inode, struct file *file) > > +{ > > + struct ktf_test *t; > > + > > + if (!try_module_get(THIS_MODULE)) > > + return -EIO; > > This is an anti-pattern, and one guaranteed to not work properly. NEVER > do this. Sorry, I didn't know this, and the origin is probably my responsibility. I know the feeling of never being able to get rid of bad examples because they keep getting copied.. The pattern seemed to be widely used the first time I saw it, and although somewhat awkward, it seemed to be the standard way then, but as you know, my Infiniband driver ( https://github.com/oracle/linux-uek/blob/uek4/qu7/drivers/infiniband/hw/sif/sif_debug.c) unfortunately never made it to the scrutiny of LKML, since the hardware project got cancelled. The -EIO return value was also copied from merged kernel code back then. I notice the discussion and your response here: http://linux-kernel.2935.n7.nabble.com/debugfs-and-module-unloading-td865175.html I assume that means that protection against module unload while a debugfs file is open is now safe. On older kernels, having this code in place is far better than an unprotected debugfs entry/exit - I have tested it extensively in the past :-) Back when I first used it, I had this cool set of polymorphic debugfs file code to list the set of active MRs, CQs, QPs, AHs etc that the whole infiniband driver, database and hardware teams loved so much that multiple users ended up using it in multiple windows from within watch for live observations of state changes, and often also running driver load/unloads for testing purposes. 
I perfectly agree with you that reducing the hole for a race condition is generally a bad idea, but from the above mail thread it seems that's the only available choice for older kernels? (I am asking because I still want to be able to support rather old kernels with the github version of KTF) Anyway, great to know that a better solution now exists! We'll fix the rest of the issues below as well for the next version.. Thanks! Knut > > + > > + t = (struct ktf_test *)inode->i_private; > > + > > + return single_open(file, ktf_debugfs_run, t); > > +} > > + > > +static int ktf_debugfs_release(struct inode *inode, struct file *file) > > +{ > > + module_put(THIS_MODULE); > > Same here, not ok. > > > > + return single_release(inode, file); > > +} > > + > > +static const struct file_operations ktf_run_test_fops = { > > + .open = ktf_run_test_open, > > + .read = seq_read, > > + .llseek = seq_lseek, > > + .release = ktf_debugfs_release, > > +}; > > + > > +static int ktf_results_test_open(struct inode *inode, struct file *file) > > +{ > > + struct ktf_test *t; > > + > > + if (!try_module_get(THIS_MODULE)) > > + return -EIO; > > Nope! > > And why -EIO? That is not an io issue. Agreed > > > +void ktf_debugfs_create_test(struct ktf_test *t) > > +{ > > + struct ktf_case *testset = ktf_case_find(t->tclass); > > + > > + if (!testset) > > + return; > > + > > + memset(&t->debugfs, 0, sizeof(t->debugfs)); > > + > > +
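On kernels where debugfs protects open files against removal via its file_operations proxy, the pattern Greg is steering towards is much simpler than the module-refcount dance. A hedged sketch (struct ktf_test is from the patch; the t->name field and the ktf_debugfs_root variable are assumptions):

#include <linux/debugfs.h>
#include <linux/seq_file.h>

static struct dentry *ktf_debugfs_root;        /* assumed KTF root dentry */

static int ktf_results_show(struct seq_file *m, void *v)
{
        struct ktf_test *t = m->private;        /* handed in via i_private */

        seq_printf(m, "%s\n", t->name);         /* field name assumed */
        return 0;
}
DEFINE_SHOW_ATTRIBUTE(ktf_results);

static void ktf_debugfs_add_test(struct ktf_test *t)
{
        /* No try_module_get() in open at all; debugfs handles lifetime. */
        debugfs_create_file(t->name, 0444, ktf_debugfs_root, t,
                            &ktf_results_fops);
}

static void ktf_debugfs_cleanup(void)
{
        /* Module exit: remove the tree before freeing the tests. */
        debugfs_remove_recursive(ktf_debugfs_root);
}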
Re: [PATCH v8 15/27] mm: Handle shadow stack page fault
On Wed, 2019-08-14 at 09:48 -0700, Dave Hansen wrote: > On 8/14/19 9:27 AM, Yu-cheng Yu wrote: > > On Tue, 2019-08-13 at 15:55 -0700, Andy Lutomirski wrote: > > > On Tue, Aug 13, 2019 at 2:02 PM Yu-cheng Yu wrote: > > > > When a task does fork(), its shadow stack (SHSTK) must be duplicated > > > > for the child. This patch implements a flow similar to copy-on-write > > > > of an anonymous page, but for SHSTK. > > > > > > > > A SHSTK PTE must be RO and dirty. This dirty bit requirement is used > > > > to effect the copying. In copy_one_pte(), clear the dirty bit from a > > > > SHSTK PTE to cause a page fault upon the next SHSTK access. At that > > > > time, fix the PTE and copy/re-use the page. > > > > > > Is using VM_SHSTK and special-casing all of this really better than > > > using a special mapping or other pseudo-file-backed VMA and putting > > > all the magic in the vm_operations? > > > > A special mapping is cleaner. However, we also need to exclude normal [RO + > > dirty] pages from shadow stack. > > I don't understand what you are saying. > > Are you saying that we need this VM_SHSTK flag in order to exclude > RO+HW-Dirty pages from being created in non-shadow-stack VMAs? We use VM_SHSTK for page fault handling (the special-casing). If we have a special mapping, all these become cleaner (but more code). However, we still need most of the PTE macros (e.g. ptep_set_wrprotect, PAGE_DIRTY_SW, etc.). Yu-cheng
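To spell out the PTE-macro interplay Yu-cheng refers to: when a writable PTE is write-protected, the hardware dirty bit has to move to the software dirty bit, otherwise the transient RO+dirty PTE would look like a shadow stack PTE to the CPU. A sketch using the series' bit names (the helper name here is hypothetical):

static inline pte_t pte_wrprotect_shstk_safe(pte_t pte)
{
        /* RO + _PAGE_DIRTY_HW is the shadow stack encoding, so a normal
         * page being write-protected must carry its dirty state in
         * _PAGE_DIRTY_SW instead. */
        if (pte_flags(pte) & _PAGE_DIRTY_HW) {
                pte = pte_clear_flags(pte, _PAGE_DIRTY_HW);
                pte = pte_set_flags(pte, _PAGE_DIRTY_SW);
        }
        return pte_clear_flags(pte, _PAGE_RW);
}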
Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching
On Wed, Aug 14, 2019 at 10:38:17AM -0400, Joel Fernandes wrote: > On Tue, Aug 13, 2019 at 12:07:38PM -0700, Paul E. McKenney wrote: [snip] > > > - * Queue an RCU callback for lazy invocation after a grace period. > > > - * This will likely be later named something like "call_rcu_lazy()", > > > - * but this change will require some way of tagging the lazy RCU > > > - * callbacks in the list of pending callbacks. Until then, this > > > - * function may only be called from __kfree_rcu(). > > > + * Maximum number of kfree(s) to batch, if this limit is hit then the > > > batch of > > > + * kfree(s) is queued for freeing after a grace period, right away. > > > */ > > > -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) > > > +struct kfree_rcu_cpu { > > > + /* The rcu_work node for queuing work with queue_rcu_work(). The work > > > + * is done after a grace period. > > > + */ > > > + struct rcu_work rcu_work; > > > + > > > + /* The list of objects being queued in a batch but are not yet > > > + * scheduled to be freed. > > > + */ > > > + struct rcu_head *head; > > > + > > > + /* The list of objects that have now left ->head and are queued for > > > + * freeing after a grace period. > > > + */ > > > + struct rcu_head *head_free; > > > > So this is not yet the one that does multiple batches concurrently > > awaiting grace periods, correct? Or am I missing something subtle? > > Yes, it is not. I honestly, still did not understand that idea. Or how it > would improve things. May be we can discuss at LPC on pen and paper? But I > think that can also be a follow-up optimization. I got it now. Basically we can benefit a bit more by having another list (that is have multiple kfree_rcu batches in flight). I will think more about it - but hopefully we don't need to gate this patch by that. It'll be interesting to see what rcuperf says about such an improvement :) thanks, - Joel
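Sketching the idea for later reference: "multiple batches in flight" would mean each CPU keeps an array of batch slots rather than a single head_free, so a fresh batch can be handed to RCU while earlier ones are still waiting out their grace periods. An entirely hypothetical shape, not the posted patch:

#define KFREE_N_BATCHES 2       /* hypothetical */

struct kfree_rcu_cpu_work {
        struct rcu_work rcu_work;       /* one grace period per slot */
        struct rcu_head *head_free;     /* objects queued in this slot */
        struct kfree_rcu_cpu *krcp;
};

struct kfree_rcu_cpu {
        struct rcu_head *head;          /* not yet handed to RCU */
        struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
        spinlock_t lock;
        struct delayed_work monitor_work;
        bool monitor_todo;
};

The monitor would then scan krw_arr for a free slot, move ->head into that slot's ->head_free, and queue_rcu_work() it, instead of stalling until the single outstanding batch finishes.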
Re: [PATCH v13 00/18] kunit: introduce KUnit, the Linux kernel unit testing framework
Quoting Brendan Higgins (2019-08-14 03:03:47) > On Tue, Aug 13, 2019 at 10:52 PM Brendan Higgins > wrote: > > > > ## TL;DR > > > > This revision addresses comments from Stephen and Bjorn Helgaas. Most > > changes are pretty minor stuff that doesn't affect the API in anyway. > > One significant change, however, is that I added support for freeing > > kunit_resource managed resources before the test case is finished via > > kunit_resource_destroy(). Additionally, Bjorn pointed out that I broke > > KUnit on certain configurations (like the default one for x86, whoops). > > > > Based on Stephen's feedback on the previous change, I think we are > > pretty close. I am not expecting any significant changes from here on > > out. > > Stephen, it looks like you have just replied with "Reviewed-bys" on > all the remaining emails that you looked at. Is there anything else > that we are missing? Or is this ready for Shuah to apply? > I think it's good to go! Thanks for the persistence.
Re: [PATCH v5 2/6] mm/page_idle: Add support for handling swapped PG_Idle pages
On Wed 14-08-19 12:32:03, Joel Fernandes wrote: > On Wed, Aug 14, 2019 at 10:05:31AM +0200, Michal Hocko wrote: > > On Tue 13-08-19 11:36:59, Joel Fernandes wrote: > > > On Tue, Aug 13, 2019 at 05:04:50PM +0200, Michal Hocko wrote: > > > > On Wed 07-08-19 13:15:55, Joel Fernandes (Google) wrote: > > > > > Idle page tracking currently does not work well in the following > > > > > scenario: > > > > > 1. mark page-A idle which was present at that time. > > > > > 2. run workload > > > > > 3. page-A is not touched by workload > > > > > 4. *sudden* memory pressure happens, so page A is finally > > > > > swapped out > > > > > 5. now see the page A - it appears as if it was accessed (pte > > > > > unmapped > > > > > so idle bit not set in output) - but it's incorrect. > > > > > > > > > > To fix this, we store the idle information into a new idle bit of the > > > > > swap PTE during swapping of anonymous pages. > > > > > > > > > > Also in the future, madvise extensions will allow a system process > > > > > manager (like Android's ActivityManager) to swap pages out of a > > > > > process > > > > > that it knows will be cold. To an external process like a heap > > > > > profiler > > > > > that is doing idle tracking on another process, this procedure will > > > > > interfere with the idle page tracking similar to the above steps. > > > > > > > > This could be solved by checking the !present/swapped out pages, > > > > right? Whoever decided to put the page out to the swap just made it > > > > idle effectively. So the monitor can make some educated guess for > > > > tracking. If that is fundamentally not possible then please describe > > > > why. > > > > > > But the monitoring process (profiler) does not have control over the > > > 'whoever > > > made it effectively idle' process. > > > > Why does that matter? Whether it is a global/memcg reclaim or somebody > > calling MADV_PAGEOUT or whatever, it is a decision to make the page not > > hot. Sure you could argue that a missing idle bit on swap entries might > > mean that the swap out decision was premature/sub-optimal/wrong, but is > > this the aim of the interface? > > > > > As you said it will be a guess, it will not be accurate. > > > > Yes, and the point I am trying to make is that having some space and not > > giving a guarantee sounds like a safer option for this interface because > I do see your point of view, but just because a future (and possibly not > going to happen) usecase which you mentioned as pte reclaim makes you feel > that userspace may be subject to inaccuracies anyway doesn't mean we should > make everything inaccurate. We already know idle page tracking is not > completely accurate, but that doesn't mean we should pass up the opportunity to > keep the "non pte-reclaim" usecase accurate. Just keep in mind that you will add more burden to future features, because they would have to somehow overcome this user-visible behavior, and we will get to the usual question: is this going to break something that relies on the idle bit being stable? > IMO, we should do our best for today, and not hypothesize. How likely is pte > reclaim, and is there a thread describing that direction? Not that I am aware of now, but with large NVDIMM mapped files I can see that this will get more and more interesting. -- Michal Hocko SUSE Labs
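For concreteness, the mechanism under discussion amounts to carrying the idle state across swap-out, along these lines (a sketch; pte_swp_mkidle() is a hypothetical name standing in for the new software bit the series adds to swap PTEs):

static pte_t swp_pte_carry_idle(struct page *page, pte_t swp_pte)
{
	/* Preserve the idle state so a later read of
	 * /sys/kernel/mm/page_idle/bitmap still reports the page idle.
	 */
	if (page_is_idle(page))
		swp_pte = pte_swp_mkidle(swp_pte);	/* hypothetical helper */
	return swp_pte;
}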
Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching
On Wed, Aug 14, 2019 at 01:22:33PM -0400, Joel Fernandes wrote: > On Wed, Aug 14, 2019 at 10:38:17AM -0400, Joel Fernandes wrote: > > On Tue, Aug 13, 2019 at 12:07:38PM -0700, Paul E. McKenney wrote: > [snip] > > > > - * Queue an RCU callback for lazy invocation after a grace period. > > > > - * This will likely be later named something like "call_rcu_lazy()", > > > > - * but this change will require some way of tagging the lazy RCU > > > > - * callbacks in the list of pending callbacks. Until then, this > > > > - * function may only be called from __kfree_rcu(). > > > > + * Maximum number of kfree(s) to batch, if this limit is hit then the > > > > batch of > > > > + * kfree(s) is queued for freeing after a grace period, right away. > > > > */ > > > > -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) > > > > +struct kfree_rcu_cpu { > > > > + /* The rcu_work node for queuing work with queue_rcu_work(). > > > > The work > > > > +* is done after a grace period. > > > > +*/ > > > > + struct rcu_work rcu_work; > > > > + > > > > + /* The list of objects being queued in a batch but are not yet > > > > +* scheduled to be freed. > > > > +*/ > > > > + struct rcu_head *head; > > > > + > > > > + /* The list of objects that have now left ->head and are queued > > > > for > > > > +* freeing after a grace period. > > > > +*/ > > > > + struct rcu_head *head_free; > > > > > > So this is not yet the one that does multiple batches concurrently > > > awaiting grace periods, correct? Or am I missing something subtle? > > > > Yes, it is not. I honestly, still did not understand that idea. Or how it > > would improve things. May be we can discuss at LPC on pen and paper? But I > > think that can also be a follow-up optimization. > > I got it now. Basically we can benefit a bit more by having another list > (that is have multiple kfree_rcu batches in flight). I will think more about > it - but hopefully we don't need to gate this patch by that. I am willing to take this as a later optimization. > It'll be interesting to see what rcuperf says about such an improvement :) Indeed, no guarantees either way. The reason for hope assumes a busy system where each grace period is immediately followed by another grace period. On such a system, the current setup allows each CPU to make use only of every second grace period for its kfree_rcu() work. The hope would therefore be that this would reduce the memory footprint substantially with no increase in overhead. But no way to know without trying it! ;-) Thanx, Paul
Re: [PATCH 1/3] workqueue: Convert for_each_wq to use built-in list check (v2)
Hello, Joel. On Sun, Aug 11, 2019 at 06:11:09PM -0400, Joel Fernandes (Google) wrote: > list_for_each_entry_rcu now has support to check for RCU reader sections > as well as lock. Just use the support in it, instead of explicitly > checking in the caller. > > Signed-off-by: Joel Fernandes (Google) Acked-by: Tejun Heo > #define for_each_pwq(pwq, wq) > \ > - list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node) \ > - if (({ assert_rcu_or_wq_mutex(wq); false; })) { } \ > - else > + list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node, \ > + lock_is_held(&(wq->mutex).dep_map)) Why not lockdep_is_held() tho? Thanks. -- tejun
[PATCH v5 14/18] compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
The ppp_idle structure is defined in terms of __kernel_time_t, which is defined as 'long' on all architectures, and this usage is not affected by the y2038 problem since it transports a time interval rather than an absolute time. However, the ppp user space defines the same structure as time_t, which may be 64-bit wide on new libc versions even on 32-bit architectures. It's easy enough to just handle both possible structure layouts on all architectures, to deal with the possibility that a user space ppp implementation comes with its own ppp_idle structure definition, as well as to document the fact that the driver is y2038-safe. Doing this also avoids the need for a special compat mode translation, since 32-bit and 64-bit kernels now support the same interfaces. The old 32-bit structure is also available on native 64-bit architectures now, but this is harmless. Signed-off-by: Arnd Bergmann --- Documentation/networking/ppp_generic.txt | 2 ++ drivers/net/ppp/ppp_generic.c| 19 ++ fs/compat_ioctl.c| 32 ++-- include/uapi/linux/ppp-ioctl.h | 2 ++ include/uapi/linux/ppp_defs.h| 14 +++ 5 files changed, 34 insertions(+), 35 deletions(-) diff --git a/Documentation/networking/ppp_generic.txt b/Documentation/networking/ppp_generic.txt index 61daf4b39600..fd563aff5fc9 100644 --- a/Documentation/networking/ppp_generic.txt +++ b/Documentation/networking/ppp_generic.txt @@ -378,6 +378,8 @@ an interface unit are: CONFIG_PPP_FILTER option is enabled, the set of packets which reset the transmit and receive idle timers is restricted to those which pass the `active' packet filter. + Two versions of this command exist, to deal with user space + expecting times as either 32-bit or 64-bit time_t seconds. * PPPIOCSMAXCID sets the maximum connection-ID parameter (and thus the number of connection slots) for the TCP header compressor and diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c index 2ab67bad6224..6b4e227cb002 100644 --- a/drivers/net/ppp/ppp_generic.c +++ b/drivers/net/ppp/ppp_generic.c @@ -612,7 +612,8 @@ static long ppp_ioctl(struct file *file, unsigned int cmd, unsigned long arg) struct ppp_file *pf; struct ppp *ppp; int err = -EFAULT, val, val2, i; - struct ppp_idle idle; + struct ppp_idle32 idle32; + struct ppp_idle64 idle64; struct npioctl npi; int unit, cflags; struct slcompress *vj; @@ -735,10 +736,18 @@ static long ppp_ioctl(struct file *file, unsigned int cmd, unsigned long arg) err = 0; break; - case PPPIOCGIDLE: - idle.xmit_idle = (jiffies - ppp->last_xmit) / HZ; - idle.recv_idle = (jiffies - ppp->last_recv) / HZ; - if (copy_to_user(argp, &idle, sizeof(idle))) + case PPPIOCGIDLE32: +idle32.xmit_idle = (jiffies - ppp->last_xmit) / HZ; +idle32.recv_idle = (jiffies - ppp->last_recv) / HZ; +if (copy_to_user(argp, &idle32, sizeof(idle32))) + break; + err = 0; + break; + + case PPPIOCGIDLE64: + idle64.xmit_idle = (jiffies - ppp->last_xmit) / HZ; + idle64.recv_idle = (jiffies - ppp->last_recv) / HZ; + if (copy_to_user(argp, &idle64, sizeof(idle64))) break; err = 0; break; diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c index 0b5a732d7afd..f97cf698cfdd 100644 --- a/fs/compat_ioctl.c +++ b/fs/compat_ioctl.c @@ -99,33 +99,6 @@ static int sg_grt_trans(struct file *file, } #endif /* CONFIG_BLOCK */ -struct ppp_idle32 { - compat_time_t xmit_idle; - compat_time_t recv_idle; -}; -#define PPPIOCGIDLE32 _IOR('t', 63, struct ppp_idle32) - -static int ppp_gidle(struct file *file, unsigned int cmd, - struct ppp_idle32 __user *idle32) -{ - struct ppp_idle __user *idle; - __kernel_time_t 
xmit, recv; - int err; - - idle = compat_alloc_user_space(sizeof(*idle)); - - err = do_ioctl(file, PPPIOCGIDLE, (unsigned long) idle); - - if (!err) { - if (get_user(xmit, &idle->xmit_idle) || - get_user(recv, &idle->recv_idle) || - put_user(xmit, &idle32->xmit_idle) || - put_user(recv, &idle32->recv_idle)) - err = -EFAULT; - } - return err; -} - /* * simple reversible transform to make our table more evenly * distributed after sorting. @@ -192,7 +165,8 @@ COMPATIBLE_IOCTL(PPPIOCGDEBUG) COMPATIBLE_IOCTL(PPPIOCSDEBUG) /* PPPIOCSPASS is translated */ /* PPPIOCSACTIVE is translated */ -/* PPPIOCGIDLE is translated */ +COMPATIBLE_IOCTL(PPPIOCGIDLE32) +COMPATIBLE_IOCTL(PPPIOCGIDLE64) COMPATIBLE_IOCTL(PPPIOCNEWUNIT) COMPATIBLE_IOCTL(PPPIOCATTACH) COMPATIBLE_IOCTL(
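The include/uapi/linux/ppp_defs.h and ppp-ioctl.h hunks are cut off above; based on the commit message, the two user-visible layouts plausibly look like this sketch (fixed-width types assumed, since it is a uapi header; both ioctls can share the 't'/63 number because the argument size differs):

struct ppp_idle32 {
	__s32 xmit_idle;	/* seconds since last NP packet sent */
	__s32 recv_idle;	/* seconds since last NP packet received */
};
#define PPPIOCGIDLE32	_IOR('t', 63, struct ppp_idle32)

struct ppp_idle64 {
	__s64 xmit_idle;
	__s64 recv_idle;
};
#define PPPIOCGIDLE64	_IOR('t', 63, struct ppp_idle64)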
[PATCH] x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h
From: Tom Lendacky There have been reports of RDRAND issues after resuming from suspend on some AMD family 15h and family 16h systems. This issue stems from BIOS not performing the proper steps during resume to ensure RDRAND continues to function properly. RDRAND support is indicated by CPUID Fn0001_ECX[30]. This bit can be reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND support using CPUID, including the kernel, will believe that RDRAND is not supported. Update the CPU initialization to clear the RDRAND CPUID bit for any family 15h and 16h processor that supports RDRAND. If it is known that the family 15h or family 16h system does not have an RDRAND resume issue or that the system will not be placed in suspend, the "rdrand_force" kernel parameter can be used to stop the clearing of the RDRAND CPUID bit. Additionally, update the suspend and resume path to save and restore the MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in place after resuming from suspend. Note, that clearing the RDRAND CPUID bit does not prevent a processor that normally supports the RDRAND instruction from executing the RDRAND instruction. So any code that determined the support based on family and model won't #UD. Signed-off-by: Tom Lendacky --- .../admin-guide/kernel-parameters.txt | 8 ++ arch/x86/include/asm/msr-index.h | 1 + arch/x86/kernel/cpu/amd.c | 42 ++ arch/x86/power/cpu.c | 83 --- 4 files changed, 121 insertions(+), 13 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 47d981a86e2f..f47eb33958c1 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4090,6 +4090,14 @@ Run specified binary instead of /init from the ramdisk, used for early userspace startup. See initrd. + rdrand_force[X86] + On certain AMD processors, the advertisement of the + RDRAND instruction has been disabled by the kernel + because of buggy BIOS support, specifically around the + suspend/resume path. This option allows for overriding + that decision if it is known that the BIOS support for + RDRAND is not buggy on the system. + rdt=[HW,X86,RDT] Turn on/off individual RDT features. List is: cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp, diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 6b4fc2788078..29ae2b66b9e9 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -381,6 +381,7 @@ #define MSR_AMD64_PATCH_LEVEL 0x008b #define MSR_AMD64_TSC_RATIO0xc104 #define MSR_AMD64_NB_CFG 0xc001001f +#define MSR_AMD64_CPUID_FN_00010xc0011004 #define MSR_AMD64_PATCH_LOADER 0xc0010020 #define MSR_AMD64_OSVW_ID_LENGTH 0xc0010140 #define MSR_AMD64_OSVW_STATUS 0xc0010141 diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index 3afe07d602dd..86ff1464302b 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -804,6 +804,40 @@ static void init_amd_ln(struct cpuinfo_x86 *c) msr_set_bit(MSR_AMD64_DE_CFG, 31); } +static bool rdrand_force; + +static int __init rdrand_force_cmdline(char *str) +{ + rdrand_force = true; + + return 0; +} +early_param("rdrand_force", rdrand_force_cmdline); + +static void init_hide_rdrand(struct cpuinfo_x86 *c) +{ + /* +* The nordrand option can clear X86_FEATURE_RDRAND, so check for +* RDRAND support using the CPUID function directly. 
+*/ + if (!(cpuid_ecx(1) & BIT(30)) || rdrand_force) + return; + + msr_clear_bit(MSR_AMD64_CPUID_FN_0001, 62); + clear_cpu_cap(c, X86_FEATURE_RDRAND); + pr_info_once("hiding RDRAND via CPUID\n"); +} + +static void init_amd_jg(struct cpuinfo_x86 *c) +{ + /* +* Some BIOS implementations do not restore proper RDRAND support +* across suspend and resume. Check on whether to hide the RDRAND +* instruction support via CPUID. +*/ + init_hide_rdrand(c); +} + static void init_amd_bd(struct cpuinfo_x86 *c) { u64 value; @@ -818,6 +852,13 @@ static void init_amd_bd(struct cpuinfo_x86 *c) wrmsrl_safe(MSR_F15H_IC_CFG, value); } } + + /* +* Some BIOS implementations do not restore proper RDRAND support +* across suspend and resume. Check on whether to hide the RDRAND +* instruction support via CPUID. +*/ + init_hide_rdrand(c); } static void init_amd_zn(st
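The arch/x86/power/cpu.c hunk is truncated above; the save/restore idea from the commit message reduces to roughly this sketch (function names are illustrative, and the real patch presumably only does this when the RDRAND CPUID bit was actually cleared):

static u64 saved_cpuid_fn_0001;

static void rdrand_save_msr(void)
{
	rdmsrl(MSR_AMD64_CPUID_FN_0001, saved_cpuid_fn_0001);
}

static void rdrand_restore_msr(void)
{
	/* Re-apply the CPUID hiding after a BIOS-mangled resume. */
	wrmsrl(MSR_AMD64_CPUID_FN_0001, saved_cpuid_fn_0001);
}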
Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching
On Wed, Aug 14, 2019 at 11:44:29AM -0700, Paul E. McKenney wrote: > On Wed, Aug 14, 2019 at 01:22:33PM -0400, Joel Fernandes wrote: > > On Wed, Aug 14, 2019 at 10:38:17AM -0400, Joel Fernandes wrote: > > > On Tue, Aug 13, 2019 at 12:07:38PM -0700, Paul E. McKenney wrote: > > [snip] > > > > > - * Queue an RCU callback for lazy invocation after a grace period. > > > > > - * This will likely be later named something like "call_rcu_lazy()", > > > > > - * but this change will require some way of tagging the lazy RCU > > > > > - * callbacks in the list of pending callbacks. Until then, this > > > > > - * function may only be called from __kfree_rcu(). > > > > > + * Maximum number of kfree(s) to batch, if this limit is hit then > > > > > the batch of > > > > > + * kfree(s) is queued for freeing after a grace period, right away. > > > > > */ > > > > > -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) > > > > > +struct kfree_rcu_cpu { > > > > > + /* The rcu_work node for queuing work with queue_rcu_work(). > > > > > The work > > > > > + * is done after a grace period. > > > > > + */ > > > > > + struct rcu_work rcu_work; > > > > > + > > > > > + /* The list of objects being queued in a batch but are not yet > > > > > + * scheduled to be freed. > > > > > + */ > > > > > + struct rcu_head *head; > > > > > + > > > > > + /* The list of objects that have now left ->head and are queued > > > > > for > > > > > + * freeing after a grace period. > > > > > + */ > > > > > + struct rcu_head *head_free; > > > > > > > > So this is not yet the one that does multiple batches concurrently > > > > awaiting grace periods, correct? Or am I missing something subtle? > > > > > > Yes, it is not. I honestly, still did not understand that idea. Or how it > > > would improve things. May be we can discuss at LPC on pen and paper? But I > > > think that can also be a follow-up optimization. > > > > I got it now. Basically we can benefit a bit more by having another list > > (that is have multiple kfree_rcu batches in flight). I will think more about > > it - but hopefully we don't need to gate this patch by that. > > I am willing to take this as a later optimization. > > > It'll be interesting to see what rcuperf says about such an improvement :) > > Indeed, no guarantees either way. The reason for hope assumes a busy > system where each grace period is immediately followed by another > grace period. On such a system, the current setup allows each CPU to > make use only of every second grace period for its kfree_rcu() work. > The hope would therefore be that this would reduce the memory footprint > substantially with no increase in overhead. Good news! I was able to bring down memory foot print by almost 30% by adding another batch. Below is the patch. Thanks for the suggestion! I can add this as a patch on top of the initial one, for easier review. The memory consumed drops from 300-350MB to 200-250MB. Increasing KFREE_N_BATCHES did not cause a reduction in memory, though. ---8<--- From: "Joel Fernandes (Google)" Subject: [PATCH] WIP: Multiple batches Signed-off-by: Joel Fernandes (Google) --- kernel/rcu/tree.c | 58 +-- 1 file changed, 41 insertions(+), 17 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 1d1847cadea2..a272c893dbdc 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2596,26 +2596,35 @@ EXPORT_SYMBOL_GPL(call_rcu); /* Maximum number of jiffies to wait before draining a batch. 
*/ #define KFREE_DRAIN_JIFFIES (HZ / 50) +#define KFREE_N_BATCHES 2 + +struct kfree_rcu_work { + /* The rcu_work node for queuing work with queue_rcu_work(). The work +* is done after a grace period. +*/ + struct rcu_work rcu_work; + + /* The list of objects that have now left ->head and are queued for +* freeing after a grace period. +*/ + struct rcu_head *head_free; + + struct kfree_rcu_cpu *krc; +}; +static DEFINE_PER_CPU(__typeof__(struct kfree_rcu_work)[KFREE_N_BATCHES], krw); /* * Maximum number of kfree(s) to batch, if this limit is hit then the batch of * kfree(s) is queued for freeing after a grace period, right away. */ struct kfree_rcu_cpu { - /* The rcu_work node for queuing work with queue_rcu_work(). The work -* is done after a grace period. -*/ - struct rcu_work rcu_work; /* The list of objects being queued in a batch but are not yet * scheduled to be freed. */ struct rcu_head *head; - /* The list of objects that have now left ->head and are queued for -* freeing after a grace period. -*/ - struct rcu_head *head_free; + struct kfree_rcu_work *krw; /* Protect concurrent access to this structure. */ spinlock_t lock; @@ -2638,12 +2647,15 @@ static void kfree_rcu_work(struc
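The hunk above is cut off in this digest; an assumed reading of how the drain would use the per-CPU krw[] slots (a sketch, not the actual WIP code): any slot whose previous grace period has completed can take the next batch, so up to KFREE_N_BATCHES grace periods are in flight per CPU instead of one.

/* Called with krcp->lock held. */
static bool kfree_rcu_drain_batch(struct kfree_rcu_cpu *krcp)
{
	int i;

	for (i = 0; i < KFREE_N_BATCHES; i++) {
		struct kfree_rcu_work *krwp = &krcp->krw[i];

		if (!krwp->head_free) {			/* slot is idle */
			krwp->head_free = krcp->head;	/* hand off the batch */
			krcp->head = NULL;
			queue_rcu_work(system_wq, &krwp->rcu_work);
			return true;
		}
	}
	return false;	/* every slot still awaiting its grace period */
}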
Re: [PATCH 1/3] workqueue: Convert for_each_wq to use built-in list check (v2)
On Wed, Aug 14, 2019 at 12:48:41PM -0700, Tejun Heo wrote: > Hello, Joel. > > On Sun, Aug 11, 2019 at 06:11:09PM -0400, Joel Fernandes (Google) wrote: > > list_for_each_entry_rcu now has support to check for RCU reader sections > > as well as lock. Just use the support in it, instead of explicitly > > checking in the caller. > > > > Signed-off-by: Joel Fernandes (Google) > > Acked-by: Tejun Heo Thanks. > > #define for_each_pwq(pwq, wq) > > \ > > - list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node) \ > > - if (({ assert_rcu_or_wq_mutex(wq); false; })) { } \ > > - else > > + list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node, \ > > +lock_is_held(&(wq->mutex).dep_map)) > > Why not lockdep_is_held() tho? Yes, that's better. thanks, - Joel
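For reference, the form agreed on here would look like this (a sketch; lockdep_is_held() performs the same lock_is_held() check on the lock's dep_map without open-coding the dereference):

#define for_each_pwq(pwq, wq)						\
	list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node,		\
				lockdep_is_held(&(wq)->mutex))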
Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching
On Wed, Aug 14, 2019 at 06:34:13PM -0400, Joel Fernandes wrote: > On Wed, Aug 14, 2019 at 11:44:29AM -0700, Paul E. McKenney wrote: > > On Wed, Aug 14, 2019 at 01:22:33PM -0400, Joel Fernandes wrote: > > > On Wed, Aug 14, 2019 at 10:38:17AM -0400, Joel Fernandes wrote: > > > > On Tue, Aug 13, 2019 at 12:07:38PM -0700, Paul E. McKenney wrote: > > > [snip] > > > > > > - * Queue an RCU callback for lazy invocation after a grace period. > > > > > > - * This will likely be later named something like > > > > > > "call_rcu_lazy()", > > > > > > - * but this change will require some way of tagging the lazy RCU > > > > > > - * callbacks in the list of pending callbacks. Until then, this > > > > > > - * function may only be called from __kfree_rcu(). > > > > > > + * Maximum number of kfree(s) to batch, if this limit is hit then > > > > > > the batch of > > > > > > + * kfree(s) is queued for freeing after a grace period, right away. > > > > > > */ > > > > > > -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) > > > > > > +struct kfree_rcu_cpu { > > > > > > + /* The rcu_work node for queuing work with queue_rcu_work(). > > > > > > The work > > > > > > +* is done after a grace period. > > > > > > +*/ > > > > > > + struct rcu_work rcu_work; > > > > > > + > > > > > > + /* The list of objects being queued in a batch but are not yet > > > > > > +* scheduled to be freed. > > > > > > +*/ > > > > > > + struct rcu_head *head; > > > > > > + > > > > > > + /* The list of objects that have now left ->head and are queued > > > > > > for > > > > > > +* freeing after a grace period. > > > > > > +*/ > > > > > > + struct rcu_head *head_free; > > > > > > > > > > So this is not yet the one that does multiple batches concurrently > > > > > awaiting grace periods, correct? Or am I missing something subtle? > > > > > > > > Yes, it is not. I honestly, still did not understand that idea. Or how > > > > it > > > > would improve things. May be we can discuss at LPC on pen and paper? > > > > But I > > > > think that can also be a follow-up optimization. > > > > > > I got it now. Basically we can benefit a bit more by having another list > > > (that is have multiple kfree_rcu batches in flight). I will think more > > > about > > > it - but hopefully we don't need to gate this patch by that. > > > > I am willing to take this as a later optimization. > > > > > It'll be interesting to see what rcuperf says about such an improvement :) > > > > Indeed, no guarantees either way. The reason for hope assumes a busy > > system where each grace period is immediately followed by another > > grace period. On such a system, the current setup allows each CPU to > > make use only of every second grace period for its kfree_rcu() work. > > The hope would therefore be that this would reduce the memory footprint > > substantially with no increase in overhead. > > Good news! I was able to bring down memory foot print by almost 30% by adding > another batch. Below is the patch. Thanks for the suggestion! Nice! > I can add this as a patch on top of the initial one, for easier review. Yes, please! > The memory consumed drops from 300-350MB to 200-250MB. Increasing > KFREE_N_BATCHES did not cause a reduction in memory, though. OK, good to know. 
Thanx, Paul > ---8<--- > > From: "Joel Fernandes (Google)" > Subject: [PATCH] WIP: Multiple batches > > Signed-off-by: Joel Fernandes (Google) > --- > kernel/rcu/tree.c | 58 +-- > 1 file changed, 41 insertions(+), 17 deletions(-) > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 1d1847cadea2..a272c893dbdc 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -2596,26 +2596,35 @@ EXPORT_SYMBOL_GPL(call_rcu); > > /* Maximum number of jiffies to wait before draining a batch. */ > #define KFREE_DRAIN_JIFFIES (HZ / 50) > +#define KFREE_N_BATCHES 2 > + > +struct kfree_rcu_work { > + /* The rcu_work node for queuing work with queue_rcu_work(). The work > + * is done after a grace period. > + */ > + struct rcu_work rcu_work; > + > + /* The list of objects that have now left ->head and are queued for > + * freeing after a grace period. > + */ > + struct rcu_head *head_free; > + > + struct kfree_rcu_cpu *krc; > +}; > +static DEFINE_PER_CPU(__typeof__(struct kfree_rcu_work)[KFREE_N_BATCHES], > krw); > > /* > * Maximum number of kfree(s) to batch, if this limit is hit then the batch > of > * kfree(s) is queued for freeing after a grace period, right away. > */ > struct kfree_rcu_cpu { > - /* The rcu_work node for queuing work with queue_rcu_work(). The work > - * is done after a grace period. > - */ > - struct rcu_work rcu_work; > > /* The list of objects being queued in a batch but are not yet >* scheduled to be freed. >*/ >
Non-random RDRAND Re: [PATCH] x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h
On Wed 2019-08-14 21:17:41, Lendacky, Thomas wrote: > From: Tom Lendacky > > There have been reports of RDRAND issues after resuming from suspend on > some AMD family 15h and family 16h systems. This issue stems from BIOS > not performing the proper steps during resume to ensure RDRAND continues > to function properly. Burn it with fire! I mean... people were afraid RDRAND would be backdoored, and you now confirm ... it indeed _is_ backdoored? /., here's news for you! So what is the impact? Does it give random-looking but predictable numbers after resume? Does it give all zeros? Something else? > > + rdrand_force[X86] > + On certain AMD processors, the advertisement of the > + RDRAND instruction has been disabled by the kernel > + because of buggy BIOS support, specifically around the > + suspend/resume path. This option allows for overriding > + that decision if it is known that the BIOS support for > + RDRAND is not buggy on the system. But this is not how we normally deal with buggy BIOSes. We don't want the user to have to decide this... Should we introduce a black-list or white-list of BIOS versions? Hmm. Actually. You are the CPU vendor. Surely you can tell us how to init RDRAND in the kernel if BIOS failed to do that... can you? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: Non-random RDRAND Re: [PATCH] x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h
On Thu 2019-08-15 01:24:35, Pavel Machek wrote: > On Wed 2019-08-14 21:17:41, Lendacky, Thomas wrote: > > From: Tom Lendacky > > > > There have been reports of RDRAND issues after resuming from suspend on > > some AMD family 15h and family 16h systems. This issue stems from BIOS > > not performing the proper steps during resume to ensure RDRAND continues > > to function properly. > > Burn it with fire! > > I mean... people were afraid RDRAND would be backdoored, and you now > confirm ... it indeed _is_ backdoored? /., here's news for you! > > So what is the impact? Does it give random-looking but predictable > numbers after resume? Does it give all zeros? Something else? Plus... We trust RDRAND in some configurations: random.trust_cpu={on,off} [KNL] Enable or disable trusting the use of the CPU's random number generator (if available) to fully seed the kernel's CRNG. Default is controlled by CONFIG_RANDOM_TRUST_CPU. so.. does this mean /dev/random was giving non-random values for some users? Certainly it means userland users were getting non-random values. That sounds like something worth a CVE and informing affected users? Best regards, Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH v3 01/12] fpga: dfl: fme: support 512bit data width PR
On Wed, Aug 14, 2019 at 11:34:15AM -0500, Scott Wood wrote: > On Wed, 2019-07-24 at 22:22 +0800, Wu Hao wrote: > > On Wed, Jul 24, 2019 at 11:35:32AM +0200, Greg KH wrote: > > > On Tue, Jul 23, 2019 at 12:51:24PM +0800, Wu Hao wrote: > > > > > > > > @@ -67,8 +69,43 @@ > > > > #define PR_WAIT_TIMEOUT 800 > > > > #define PR_HOST_STATUS_IDLE0 > > > > > > > > +#if defined(CONFIG_X86) && defined(CONFIG_AS_AVX512) > > > > + > > > > +#include > > > > +#include > > > > + > > > > +static inline int is_cpu_avx512_enabled(void) > > > > +{ > > > > + return cpu_feature_enabled(X86_FEATURE_AVX512F); > > > > +} > > > > > > That's a very arch specific function, why would a driver ever care about > > > this? > > > > Yes, this is only applied to a specific FPGA solution, which FPGA > > has been integrated with XEON. Hardware indicates this using a register > > to software. As it's a CPU-integrated solution, the CPU always has this > > AVX512 capability. The only check we do is make sure this is not > > manually disabled by the kernel. > > > > With this hardware, software could use AVX512 to accelerate the FPGA > > partial reconfiguration as mentioned in the patch commit message. > > It brings performance benefits to people who use it. This is only one > > optimization (512 vs 32bit data write to hw) for a specific hardware. > > I thought earlier you said that 512 bit accesses were required for this > particular integrated-only version of the device, and not just an > optimization? Yes, this optimization is implemented in a specific integrated-only version of the hardware, and this patch is used to support that particular hardware. This is also the reason you see code in this patch that checks the hardware revision. > > > > > +#else > > > > +static inline int is_cpu_avx512_enabled(void) > > > > +{ > > > > + return 0; > > > > +} > > > > + > > > > +static inline void copy512(const void *src, void __iomem *dst) > > > > +{ > > > > + WARN_ON_ONCE(1); > > > > > > Are you trying to get reports from syzbot? :) > > > > Oh.. no.. I will remove it. :) > > > > Thank you very much! > > What's wrong with this? The driver should never call copy512() if > is_cpu_avx512_enabled() returns 0, and if syzbot can somehow make the driver > do so, then yes we do want a report. Yes, you are right. The previous version did not have the AVX512 enable check there, so false reporting was possible. It should be fine now that the driver does this check early, during probe. As this patch has been dropped from the main patchset, I may rework it later and resubmit. Thanks for the comments. Hao > > -Scott >
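The AVX-512 fast path itself is not quoted in this thread; under the CONFIG_X86 && CONFIG_AS_AVX512 guard shown above, the 512-bit write helper could plausibly look like this sketch (callers would need to bracket it with kernel_fpu_begin()/kernel_fpu_end() and check is_cpu_avx512_enabled() first, since the kernel cannot freely clobber vector registers):

static inline void copy512(const void *src, void __iomem *dst)
{
	/* One 64-byte load, then a non-temporal 64-byte store to hw. */
	asm volatile("vmovdqu64 (%0), %%zmm0\n\t"
		     "vmovntdq %%zmm0, (%1)"
		     :
		     : "r" (src), "r" (dst)
		     : "memory");
}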
Re: [PATCH] Ext4 documentation fixes.
On Aug 14, 2019, at 6:47 PM, Ayush Ranjan wrote: > diff --git a/Documentation/filesystems/ext4/inodes.rst > b/Documentation/filesystems/ext4/inodes.rst > index 6bd35e506..c468a3171 100644 > --- a/Documentation/filesystems/ext4/inodes.rst > +++ b/Documentation/filesystems/ext4/inodes.rst > @@ -470,8 +470,8 @@ inode, which allows struct ext4\_inode to grow for a new kernel without > having to upgrade all of the on-disk inodes. Access to fields beyond > EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified to be within > ``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as > -of October 2013) the inode structure is 156 bytes > -(``i_extra_isize = 28``). The extra space between the end of the inode > +of October 2013) the inode structure is 160 bytes This should be changed to "as of August 2019", or possibly the date on which the last field (i_projid) was added, namely "October, 2015". Cheers, Andreas
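For readers following the quoted passage, the i_extra_isize bounds check it describes amounts to something like this sketch (ext4 has a similar EXT4_FITS_IN_INODE() macro; the name below is illustrative, and EXT4_GOOD_OLD_INODE_SIZE is the 128-byte size the docs call EXT2_GOOD_OLD_INODE_SIZE):

#define EXT4_INODE_HAS_FIELD(raw_inode, field)				\
	(offsetof(struct ext4_inode, field) +				\
	 sizeof((raw_inode)->field) <=					\
	 EXT4_GOOD_OLD_INODE_SIZE +					\
	 le16_to_cpu((raw_inode)->i_extra_isize))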