[PATCH] cgroup, docs: document the root cgroup behavior of cpu and io controllers
Currently, cgroups v2 documentation contains only a generic remark that
"How resource consumption in the root cgroup is governed is up to each
controller", which tells users very little: they need to dig into the
code and/or commit messages to learn the exact behavior.

In cgroups v1 at least the blkio controller had its operation with
respect to competition between child threads and child cgroups
documented in blkio-controller.txt, with references to cfq-iosched.txt.
Also, cgroups v2 documentation describes v1 behavior of both cpu and
blkio controllers in an "Issues with v1" section.

Let's document this behavior for cgroups v2 as well, to make life easier
for users.

Signed-off-by: Maciej S. Szmigiero
---
 Documentation/cgroup-v2.txt | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index d6efabb487e3..8ecefb4fa3fb 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -906,6 +906,13 @@
 have placed RT processes into nonroot cgroups during the system boot
 process, and these processes may need to be moved to the root cgroup
 before the cpu controller can be enabled.
+When distributing CPU cycles in the root cgroup each thread in this
+cgroup is treated as if it was hosted in a separate child cgroup of the
+root cgroup. This child cgroup weight is dependent on its thread nice
+level.
+For details of this mapping see sched_prio_to_weight array in
+kernel/sched/core.c file (values from this array should be scaled
+appropriately so the neutral - nice 0 - value is 100 instead of 1024).
 
 CPU Interface Files
 ~~~~~~~~~~~~~~~~~~~
@@ -1247,6 +1254,10 @@
 limit distribution; however, weight based distribution is available
 only if cfq-iosched is in use and neither scheme is available for
 blk-mq devices.
+Root cgroup processes are hosted in an implicit leaf child node.
+When distributing IO resources this implicit child node is taken into
+account as if it was a normal child cgroup of the root cgroup with a
+weight value of 200.
 
 IO Interface Files
 ~~~~~~~~~~~~~~~~~~
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
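For readers who want to see what the CPU part of the patch above means in numbers, here is a minimal user-space sketch of the nice-to-weight mapping. The table is a copy of sched_prio_to_weight from kernel/sched/core.c. The helper name nice_to_root_weight() is made up for this illustration, and the truncating integer division is an assumption, since the patch only says the values "should be scaled appropriately".

```c
#include <assert.h>

/*
 * Copy of the sched_prio_to_weight table from kernel/sched/core.c,
 * indexed by nice + 20 (nice -20 first, nice 19 last).
 */
static const int sched_prio_to_weight[40] = {
	/* -20 */ 88761, 71755, 56483, 46273, 36291,
	/* -15 */ 29154, 23254, 18705, 14949, 11916,
	/* -10 */  9548,  7620,  6100,  4904,  3906,
	/*  -5 */  3121,  2501,  1991,  1586,  1277,
	/*   0 */  1024,   820,   655,   526,   423,
	/*   5 */   335,   272,   215,   172,   137,
	/*  10 */   110,    87,    70,    56,    45,
	/*  15 */    36,    29,    23,    18,    15,
};

/*
 * Hypothetical helper, not a kernel function: rescale the table so that
 * nice 0 maps to 100 instead of 1024. Truncating division is an
 * assumption made for this sketch.
 */
static int nice_to_root_weight(int nice)
{
	assert(nice >= -20 && nice <= 19);
	return sched_prio_to_weight[nice + 20] * 100 / 1024;
}
```

Under these assumptions a nice 0 thread in the root cgroup competes like a child cgroup with the default weight of 100, while a nice -20 thread competes like a child cgroup with a weight in the thousands.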
[PATCH v4] MIPS: Add noexec=on|off kernel parameter
From: Miodrag Dinic

Add a new kernel parameter to override the default behavior related to
the decision whether to set up stack as non-executable in function
mips_elf_read_implies_exec().

The new parameter is used to control non executable stack and heap,
regardless of PT_GNU_STACK entry or CPU RIXI support.

Allowed values:

noexec=on	Force non-exec stack & heap
noexec=off	Force executable stack & heap

If this parameter is omitted, kernel behavior remains the same as it
was before this patch is applied.

This functionality is convenient during debugging and is especially
useful for Android development where non-exec stack is required.

Signed-off-by: Miodrag Dinic
Signed-off-by: Aleksandar Markovic
---
Parameter name was changed from "nonxstack" to "noexec" in v4, in
order to achieve consistency with similar parameter naming for Intel
platform.
---
 Documentation/admin-guide/kernel-parameters.txt | 10 +++
 arch/mips/kernel/elf.c                          | 38 +
 2 files changed, 48 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6571fbf..6dff711 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2596,6 +2596,16 @@
 			noexec=on: enable non-executable mappings (default)
 			noexec=off: disable non-executable mappings
 
+	noexec		[MIPS]
+			Force setting up stack and heap as non-executable or
+			executable regardless of PT_GNU_STACK entry or CPU XI
+			support. Valid arguments: on, off.
+			noexec=on: Force non-executable stack and heap
+			noexec=off: Force executable stack and heap
+			If omitted, stack and heap will or will not be set
+			up as non-executable depending on PT_GNU_STACK
+			entry and possibly other factors (like CPU XI support).
+
 	nosmap		[X86]
 			Disable SMAP (Supervisor Mode Access Prevention)
 			even if it is supported by processor.
diff --git a/arch/mips/kernel/elf.c b/arch/mips/kernel/elf.c
index 731325a..a0235d1 100644
--- a/arch/mips/kernel/elf.c
+++ b/arch/mips/kernel/elf.c
@@ -326,8 +326,46 @@ void mips_set_personality_nan(struct arch_elf_state *state)
 	}
 }
 
+static int noexec = EXSTACK_DEFAULT;
+
+/*
+ * kernel parameter: noexec=on|off
+ *
+ * Force setting up stack and heap as non-executable or
+ * executable regardless of PT_GNU_STACK entry or CPU RIXI
+ * support. Valid arguments: on, off.
+ *
+ * noexec=on: Force non-executable stack and heap
+ * noexec=off: Force executable stack and heap
+ *
+ * If omitted, stack and heap will or will not be set
+ * up as non-executable depending on PT_GNU_STACK
+ * entry and possibly other factors (CPU RIXI support).
+ */
+static int __init noexec_setup(char *str)
+{
+	if (!strcmp(str, "on"))
+		noexec = EXSTACK_DISABLE_X;
+	else if (!strcmp(str, "off"))
+		noexec = EXSTACK_ENABLE_X;
+	else
+		pr_err("Malformed noexec format! noexec=on|off\n");
+
+	return 1;
+}
+__setup("noexec=", noexec_setup);
+
 int mips_elf_read_implies_exec(void *elf_ex, int exstack)
 {
+	switch (noexec) {
+	case EXSTACK_DISABLE_X:
+		return 0;
+	case EXSTACK_ENABLE_X:
+		return 1;
+	default:
+		break;
+	}
+
 	if (exstack != EXSTACK_DISABLE_X) {
 		/* The binary doesn't request a non-executable stack */
 		return 1;
--
2.7.4
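To make the precedence described in the patch above easier to follow, here is a user-space sketch of the decision. It is only an illustration: parse_noexec() and read_implies_exec() are made-up stand-ins for noexec_setup() and mips_elf_read_implies_exec(), the EXSTACK_* values are redefined locally to mirror include/linux/binfmts.h, and the cpu_has_rixi argument models the CPU check that sits outside the quoted hunk (an assumption about the rest of the function).

```c
#include <assert.h>
#include <string.h>

/* Local copies of the EXSTACK_* values from include/linux/binfmts.h. */
enum { EXSTACK_DEFAULT = 0, EXSTACK_DISABLE_X = 1, EXSTACK_ENABLE_X = 2 };

/* Models the patch's static override; EXSTACK_DEFAULT means "not given". */
static int noexec = EXSTACK_DEFAULT;

/* Hypothetical stand-in for noexec_setup(): 0 on success, -1 if malformed. */
static int parse_noexec(const char *str)
{
	assert(str != NULL);
	if (!strcmp(str, "on"))
		noexec = EXSTACK_DISABLE_X;
	else if (!strcmp(str, "off"))
		noexec = EXSTACK_ENABLE_X;
	else
		return -1;
	return 0;
}

/*
 * Hypothetical stand-in for mips_elf_read_implies_exec(). Returns 1 when
 * readable mappings should also be executable, 0 otherwise.
 */
static int read_implies_exec(int exstack, int cpu_has_rixi)
{
	/* The command-line override wins over everything else. */
	switch (noexec) {
	case EXSTACK_DISABLE_X:
		return 0;
	case EXSTACK_ENABLE_X:
		return 1;
	default:
		break;
	}

	/* The binary doesn't request a non-executable stack. */
	if (exstack != EXSTACK_DISABLE_X)
		return 1;

	/* Assumed: without RIXI the CPU cannot enforce non-exec memory. */
	if (!cpu_has_rixi)
		return 1;

	return 0;
}
```

The point of the sketch is the ordering: once noexec=on or noexec=off has been parsed, neither the PT_GNU_STACK request nor the CPU capability is consulted.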
Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization
Hi gengdongjiu,

On 07/12/17 06:37, gengdongjiu wrote:
> I understand you most idea.
>
> But In the Qemu one signal type can only correspond to one behavior, can not
> correspond to two behaviors, otherwise Qemu will do not know how to do.
>
> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate
> the CPER records and inject a SEA to guest through KVM IOCTL
> "KVM_SET_ONE_REG"; if receives the SIGBUS_MCEERR_AO signal, it will record
> the CPER and trigger a IRQ to notify guest, as shown below:
>
> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>
> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify
> trigger method, which all
>
> not involve _trigger_ an SError.

It's a policy choice. How does your virtual CPU notify RAS errors to its
virtual software? You could use SError for SIGBUS_MCEERR_AR; it depends on
what type of CPU you are trying to emulate.

I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where
the guest doesn't take the SError immediately and instead tries to re-execute
the code KVM has unmapped from stage2 because it's corrupt. (You could detect
this happening in Qemu and try something else.)

Synchronous/asynchronous external abort matters to the CPU, but once the error
has been notified to software the reasons for this distinction disappear. Once
the error has been handled, all trace of this distinction is gone.

CPER records only describe component failures. You are trying to re-create
some state that disappeared with one of the firmware-first abstractions.
Trying to re-create this information isn't worth the effort as the distinction
doesn't matter to linux, only to the CPU.

> so there is no chance for Qemu to trigger the SError when gets the
> SIGBUS_MCEERR_A{O,R}.

You mean there is no reason for Qemu to trigger an SError when it gets a
signal from the kernel.

The reasons the CPU might have to generate an SError don't apply to linux and
KVM user space. User-space will never get a signal for an uncontained error,
we will always panic(). We can't give user-space a signal for imprecise
exceptions, as it can't return from the signal. The classes of error that are
left are covered by polled/irq and NOTIFY_SEA.

Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really
wants to (but I don't think you should, the kernel may have unmapped the page
at PC from stage2 due to corruption).

I think the problem here is you're applying the CPU->software behaviour and
choices to software->software. By the time user-space gets the error, the
behaviour is different.

Thanks,

James
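The policy discussed in this message (one notification per signal type, with NOTIFY_SEA for SIGBUS_MCEERR_AR and a GPIO IRQ for SIGBUS_MCEERR_AO) can be sketched as a simple mapping. This is not Qemu code: the enum names and pick_notification() are hypothetical, and the two signal kinds are local stand-ins for the si_code values delivered with SIGBUS.

```c
#include <assert.h>

/* Stand-ins for the two SIGBUS kinds named in the thread (hypothetical names). */
enum mce_signal {
	MCE_SIGNAL_AR,	/* SIGBUS_MCEERR_AR: action required */
	MCE_SIGNAL_AO	/* SIGBUS_MCEERR_AO: action optional */
};

/* Ways the VMM could notify the guest, as listed in the thread. */
enum guest_notify {
	GUEST_NOTIFY_SEA,	/* Synchronous External Abort */
	GUEST_NOTIFY_GPIO_IRQ	/* GPIO interrupt */
};

/*
 * One signal kind maps to exactly one notification. SError is deliberately
 * absent: the thread argues user space has no reason to pick it.
 */
static enum guest_notify pick_notification(enum mce_signal sig)
{
	switch (sig) {
	case MCE_SIGNAL_AR:
		return GUEST_NOTIFY_SEA;
	case MCE_SIGNAL_AO:
		return GUEST_NOTIFY_GPIO_IRQ;
	}
	assert(!"unknown signal kind");
	return GUEST_NOTIFY_GPIO_IRQ;
}
```

A VMM emulating a different kind of CPU could choose another mapping; the sketch only records the choice suggested here.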
Re: [PATCH 0/3] kbuild,kconfig: generate lexer/parser C files instead of copying _shipped files
Hi Masahiro.

> > In Linux build system convention, pre-generated files are version-
> > controlled with a "_shipped" suffix. During the kernel building,
> > they are simply shipped (copied) removing the suffix.
> >
> > From users' point of view, this approach can reduce external tool
> > dependency for the kernel build,
> >
> > From developers point of view, it is tedious to manually regenerate
> > such artifacts. In fact, we see several patches to regenerate
> > _shipped files. They are noise commits.
...

Nice cleanup, we should have done this years ago.

When we introduced this, one of the reasons was to minimize the time it
took to configure a clean kernel. Since then the average computer has
become significantly faster, so the time to run flex/bison is not an
issue anymore.

	Sam
[PATCH v8 0/8] Intel SGX Driver
Intel(R) SGX is a set of CPU instructions that can be used by applications to
set aside private regions of code and data. The code outside the enclave is
not allowed to access the memory inside the enclave by the CPU access control.
In a way you can think that SGX provides an inverted sandbox. It protects the
application from a malicious host.

There is a new hardware unit in the processor called Memory Encryption Engine
(MEE) starting from the Skylake microarchitecture. BIOS can define one or many
MEE regions that can hold enclave data by configuring them with PRMRR
registers.

The MEE automatically encrypts the data leaving the processor package to the
MEE regions. The data is encrypted using a random key whose life-time is
exactly one power cycle.

You can tell if your CPU supports SGX by looking into /proc/cpuinfo:

	cat /proc/cpuinfo | grep sgx

The GIT repository for SGX driver resides in

	https://github.com/jsakkine-intel/linux-sgx.git

'le' branch contains the upstream candidate patches.

'master' branch contains the same patches with the following differences:

* top-level patch modifies the ioctl API to be SDK compatible
* does not use flexible launch control but instead relies on SDK provided
  Intel launch enclave.

v8:
* Check that public key MSRs match the LE public key hash in the driver
  initialization when the MSRs are read-only.
* Fix the race in VA slot allocation by checking the fullness immediately
  after successful allocation.
* Fix the race in hash mrsigner calculation between the launch enclave and
  user enclaves by having a separate lock for hash calculation.

v7:
* Fixed offset calculation in sgx_edbgr/wr(). Address was masked with
  PAGE_MASK when it should have been masked with ~PAGE_MASK.
* Fixed a memory leak in sgx_ioc_enclave_create().
* Simplified swapping code by using a pointer array for a cluster instead of
  a linked list.
* Squeezed struct sgx_encl_page to 32 bytes.
* Fixed dereferencing of an RSA key on OpenSSL 1.1.0.
* Modified TC's CMAC to use kernel AES-NI. Restructured the code a bit in
  order to better align with kernel conventions.

v6:
* Fixed semaphore underrun when accessing /dev/sgx from the launch enclave.
* In sgx_encl_create() s/IS_ERR(secs)/IS_ERR(encl)/.
* Removed virtualization chapter from the documentation.
* Changed the default filename for the signing key as signing_key.pem.
* Reworked EPC management in a way that instead of a linked list of struct
  sgx_epc_page instances there is an array of integers that encodes address
  and bank of an EPC page (the same data as 'pa' field earlier). The locking
  has been moved to the EPC bank level instead of a global lock.
* Relaxed locking requirements for EPC management. EPC pages can be released
  back to the EPC bank concurrently.
* Cleaned up ptrace() code.
* Refined commit messages for new architectural constants.
* Sorted includes in every source file.
* Sorted local variable declarations according to the line length in every
  function.
* Style fixes based on Darren's comments to sgx_le.c.

v5:
* Described IPC between the Launch Enclave and kernel in the commit messages.
* Fixed all relevant checkpatch.pl issues that I had forgotten to fix in
  earlier versions, except those that exist in the imported TinyCrypt code.
* Fixed spelling mistakes in the documentation.
* Added the missing check of the return value of sgx_drv_subsys_init().
* Encapsulated properly page cache init and teardown.
* Collect epc pages to a temp list in sgx_add_epc_bank
* Removed SGX_ENCLAVE_INIT_ARCH constant.

v4:
* Tied life-cycle of the sgx_le_proxy process to /dev/sgx.
* Removed __exit annotation from sgx_drv_subsys_exit().
* Fixed a leak of a backing page in sgx_process_add_page_req() in the case
  when vm_insert_pfn() fails.
* Removed unused symbol exports for sgx_page_cache.c.
* Updated sgx_alloc_page() to require encl parameter and documented the
  behavior (Sean Christopherson).
* Refactored a more lean API for sgx_encl_find() and documented the behavior.
* Moved #PF handler to sgx_fault.c.
* Replaced subsys_system_register() with plain bus_register().
* Retry EINIT 2nd time only if MSRs are not locked.

v3:
* Check that FEATURE_CONTROL_LOCKED and FEATURE_CONTROL_SGX_ENABLE are set.
* Return -ERESTARTSYS in __sgx_encl_add_page() when sgx_alloc_page() fails.
* Use unused bits in epc_page->pa to store the bank number.
* Removed #ifdef for WQ_NONREENTRANT.
* If mmu_notifier_register() fails with -EINTR, return -ERESTARTSYS.
* Added --remove-section=.got.plt to objcopy flags in order to prevent a
  dummy .got.plt, which will cause an inconsistent size for the LE.
* Documented sgx_encl_* functions.
* Added remark about AES implementation used inside the LE.
* Removed redundant sgx_sys_exit() from le/main.c.
* Fixed struct sgx_secinfo alignment from 128 to 64 bytes.
* Validate miscselect in sgx_encl_create().
* Fixed SSA frame size calculation to take the misc region into account.
* Implemented consistent exception handling t
[PATCH v8 6/8] intel_sgx: driver documentation
Signed-off-by: Jarkko Sakkinen
Tested-by: Serge Ayoun
---
 Documentation/index.rst         |   1 +
 Documentation/x86/intel_sgx.rst | 101
 2 files changed, 102 insertions(+)
 create mode 100644 Documentation/x86/intel_sgx.rst

diff --git a/Documentation/index.rst b/Documentation/index.rst
index cb7f1ba5b3b1..ccfebc260e04 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -86,6 +86,7 @@ implementation.
    :maxdepth: 2
 
    sh/index
+   x86/index
 
 Korean translations
 -------------------
diff --git a/Documentation/x86/intel_sgx.rst b/Documentation/x86/intel_sgx.rst
new file mode 100644
index 000000000000..59049a35512f
--- /dev/null
+++ b/Documentation/x86/intel_sgx.rst
@@ -0,0 +1,101 @@
+===================
+Intel(R) SGX driver
+===================
+
+Introduction
+============
+
+Intel(R) SGX is a set of CPU instructions that can be used by applications to
+set aside private regions of code and data. The code outside the enclave is
+disallowed to access the memory inside the enclave by the CPU access control.
+In a way you can think that SGX provides inverted sandbox. It protects the
+application from a malicious host.
+
+There is a new hardware unit in the processor called Memory Encryption Engine
+(MEE) starting from the Skylake microarchitecture. BIOS can define one or many
+MEE regions that can hold enclave data by configuring them with PRMRR registers.
+
+The MEE automatically encrypts the data leaving the processor package to the MEE
+regions. The data is encrypted using a random key whose life-time is exactly one
+power cycle.
+
+You can tell if your CPU supports SGX by looking into ``/proc/cpuinfo``:
+
+	``cat /proc/cpuinfo | grep sgx``
+
+Enclave data types
+==================
+
+SGX defines new data types to maintain information about the enclaves and their
+security properties.
+
+The following data structures exist in MEE regions:
+
+* **Enclave Page Cache (EPC):** memory pages for protected code and data
+* **Enclave Page Cache Map (EPCM):** meta-data for each EPC page
+
+The Enclave Page Cache holds following types of pages:
+
+* **SGX Enclave Control Structure (SECS)**: meta-data defining the global
+  properties of an enclave such as range of addresses it can access.
+* **Regular (REG):** containing code and data for the enclave.
+* **Thread Control Structure (TCS):** defines an entry point for a hardware
+  thread to enter into the enclave. The enclave can only be entered through
+  these entry points.
+* **Version Array (VA)**: an EPC page receives a unique 8 byte version number
+  when it is swapped, which is then stored into a VA page. A VA page can hold up
+  to 512 version numbers.
+
+Launch control
+==============
+
+For launching an enclave, two structures must be provided for ENCLS(EINIT):
+
+1. **SIGSTRUCT:** a signed measurement of the enclave binary.
+2. **EINITTOKEN:** the measurement, the public key of the signer and various
+   enclave attributes. This structure contains a MAC of its contents using
+   hardware derived symmetric key called *launch key*.
+
+The hardware platform contains a root key pair for signing the SIGSTRUCT
+for a *launch enclave* that is able to acquire the *launch key* for
+creating EINITTOKEN's for other enclaves. For the launch enclave
+EINITTOKEN is not needed because it is signed with the private root key.
+
+There are two feature control bits associated with launch control:
+
+* **IA32_FEATURE_CONTROL[0]**: locks down the feature control register
+* **IA32_FEATURE_CONTROL[17]**: allow runtime reconfiguration of
+  IA32_SGXLEPUBKEYHASHn MSRs that define MRSIGNER hash for the launch
+  enclave. Essentially they define a signing key that does not require
+  EINITTOKEN to be let run.
+
+The BIOS can configure IA32_SGXLEPUBKEYHASHn MSRs before feature control
+register is locked.
+
+It could be tempting to implement launch control by writing the MSRs
+every time when an enclave is launched. This does not scale for the
+generic case because the BIOS might lock down the MSRs before handover
+to the OS.
+
+Debug enclaves
+--------------
+
+Enclave can be set as a *debug enclave* of which memory can be read or written
+by using the ENCLS(EDBGRD) and ENCLS(EDBGWR) opcodes. The Intel provided launch
+enclave provides them always a valid EINITTOKEN and therefore they are a low
+hanging fruit way to try out SGX.
+
+SGX uapi
+========
+
+.. kernel-doc:: drivers/platform/x86/intel_sgx_ioctl.c
+   :functions: sgx_ioc_enclave_create
+               sgx_ioc_enclave_add_page
+               sgx_ioc_enclave_init
+
+.. kernel-doc:: arch/x86/include/uapi/asm/sgx.h
+
+References
+==========
+
+* System Programming Manual: 39.1.4 Intel(R) SGX Launch Control Configuration
--
2.14.1
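As a small worked example for the Version Array description in the document above: the figure of 512 version numbers per VA page follows from dividing the page size by the 8 byte entry size. The sketch below assumes 4096-byte EPC pages, which the patch does not state explicitly, and the macro and function names are made up for illustration; they are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

#define EPC_PAGE_SIZE	4096u	/* assumption: EPC pages are 4 KiB */
#define VA_SLOT_SIZE	8u	/* one version number is 8 bytes */
#define VA_SLOT_COUNT	(EPC_PAGE_SIZE / VA_SLOT_SIZE)

/* Hypothetical helper: byte offset of a version slot inside a VA page. */
static uint32_t va_slot_offset(uint32_t slot)
{
	assert(slot < VA_SLOT_COUNT);
	return slot * VA_SLOT_SIZE;
}
```

With these assumptions VA_SLOT_COUNT evaluates to 512, which matches the number given in the documentation, and the last slot starts 8 bytes before the end of the page.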
Re: [PATCH] acpi: Fix ACPI GPE mask kernel parameter
On Thursday, November 30, 2017 9:05:59 PM CET Prarit Bhargava wrote:
> The acpi_mask_gpe= kernel parameter documentation states that the range
> of mask is 128 GPEs (0x00 to 0x7F). The acpi_masked_gpes mask is a u64 so
> only 64 GPEs (0x00 to 0x3F) can really be masked.
>
> Use a bitmap of size 0xFF instead of a u64 for the GPE mask so 256
> GPEs can be masked.
>
> Fixes: 9c4aa1eecb48 ("ACPI / sysfs: Provide quirk mechanism to prevent GPE flooding")
> Signed-off-by: Prarit Bhargava
> Cc: Lv Zheng
> Cc: Jonathan Corbet
> Cc: "Rafael J. Wysocki"
> Cc: linux-doc@vger.kernel.org
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  1 -
>  drivers/acpi/sysfs.c                            | 26 -
>  2 files changed, 8 insertions(+), 19 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 6571fbfdb2a1..89ba74761180 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -114,7 +114,6 @@
>  			This facility can be used to prevent such uncontrolled
>  			GPE floodings.
>  			Format:
> -			Support masking of GPEs numbered from 0x00 to 0x7f.
>
>  	acpi_no_auto_serialize	[HW,ACPI]
>  			Disable auto-serialization of AML methods
> diff --git a/drivers/acpi/sysfs.c b/drivers/acpi/sysfs.c
> index 06a150bb35bf..4fc59c3bc673 100644
> --- a/drivers/acpi/sysfs.c
> +++ b/drivers/acpi/sysfs.c
> @@ -816,14 +816,8 @@ static ssize_t counter_set(struct kobject *kobj,
>   * interface:
>   *   echo unmask > /sys/firmware/acpi/interrupts/gpe00
>   */
> -
> -/*
> - * Currently, the GPE flooding prevention only supports to mask the GPEs
> - * numbered from 00 to 7f.
> - */
> -#define ACPI_MASKABLE_GPE_MAX	0x80
> -
> -static u64 __initdata acpi_masked_gpes;
> +#define ACPI_MASKABLE_GPE_MAX	0xFF
> +static DECLARE_BITMAP(acpi_masked_gpes_map, ACPI_MASKABLE_GPE_MAX) __initdata;
>
>  static int __init acpi_gpe_set_masked_gpes(char *val)
>  {
> @@ -831,7 +825,7 @@ static int __init acpi_gpe_set_masked_gpes(char *val)
>
>  	if (kstrtou8(val, 0, &gpe) || gpe > ACPI_MASKABLE_GPE_MAX)
>  		return -EINVAL;
> -	acpi_masked_gpes |= ((u64)1<<gpe);
> +	set_bit(gpe, acpi_masked_gpes_map);
>
>  	return 1;
>  }
> @@ -843,15 +837,11 @@ void __init acpi_gpe_apply_masked_gpes(void)
>  	acpi_status status;
>  	u8 gpe;
>
> -	for (gpe = 0;
> -	     gpe < min_t(u8, ACPI_MASKABLE_GPE_MAX, acpi_current_gpe_count);
> -	     gpe++) {
> -		if (acpi_masked_gpes & ((u64)1<<gpe)) {
> -			status = acpi_get_gpe_device(gpe, &handle);
> -			if (ACPI_SUCCESS(status)) {
> -				pr_info("Masking GPE 0x%x.\n", gpe);
> -				(void)acpi_mask_gpe(handle, gpe, TRUE);
> -			}
> +	for_each_set_bit(gpe, acpi_masked_gpes_map, ACPI_MASKABLE_GPE_MAX) {
> +		status = acpi_get_gpe_device(gpe, &handle);
> +		if (ACPI_SUCCESS(status)) {
> +			pr_info("Masking GPE 0x%x.\n", gpe);
> +			(void)acpi_mask_gpe(handle, gpe, TRUE);
>  		}
>  	}
>  }
>

Applied, thanks!
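The reasoning in the changelog (a u64 can only hold 64 mask bits, so GPE numbers above 0x3F need a multi-word bitmap) can be illustrated with a plain user-space sketch. This is not the kernel's DECLARE_BITMAP/set_bit implementation; the names below are made up, and the bitmap is sized at 256 bits here as an assumption so that every value a u8 GPE number can take has a slot.

```c
#include <assert.h>
#include <limits.h>

#define GPE_MAX		256	/* assumption: one bit per possible u8 GPE number */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define GPE_WORDS	((GPE_MAX + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Multi-word bitmap standing in for acpi_masked_gpes_map. */
static unsigned long masked_gpes[GPE_WORDS];

/* Mark a GPE as masked; analogous in spirit to set_bit(). */
static void gpe_mask_set(unsigned int gpe)
{
	assert(gpe < GPE_MAX);
	masked_gpes[gpe / BITS_PER_WORD] |= 1UL << (gpe % BITS_PER_WORD);
}

/* Return 1 if the GPE is marked as masked, 0 otherwise. */
static int gpe_mask_test(unsigned int gpe)
{
	assert(gpe < GPE_MAX);
	return (int)((masked_gpes[gpe / BITS_PER_WORD] >>
		      (gpe % BITS_PER_WORD)) & 1UL);
}

/* A single 64-bit word can only represent GPEs 0x00..0x3F. */
static int gpe_fits_in_u64(unsigned int gpe)
{
	return gpe < 64;
}
```

For example, GPE 0x6F was inside the range the old documentation advertised but cannot be represented in a 64-bit mask; in the sketch it simply lands in a later word of the array.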
Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization
Hi James,

On 2017/12/16 2:52, James Morse wrote:
>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown
>> below:
>>
>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>
>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify
>> trigger method, which all
>>
>> not involve _trigger_ an SError.
> It's a policy choice. How does your virtual CPU notify RAS errors to its
> virtual software? You could use SError for SIGBUS_MCEERR_AR, it depends on
> what type of CPU you are trying to emulate.
>
> I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where
> the guest doesn't take the SError immediately, instead tries to re-execute the

I agree it is better to use NOTIFY_SEA for SIGBUS_MCEERR_AR in this case.

> code KVM has unmapped from stage2 because it's corrupt. (You could detect this
> happening in Qemu and try something else)

By "something else", do you mean using NOTIFY_SEI for SIGBUS_MCEERR_AR?
In the current implementation the only case seems to be "KVM has unmapped
from stage2". Do you think there is anything else?

>
>
> Synchronous/asynchronous external abort matters to the CPU, but once the error
> has been notified to software the reasons for this distinction disappear. Once
> the error has been handled, all trace of this distinction is gone.
>
> CPER records only describe component failures. You are trying to re-create
> some state that disappeared with one of the firmware-first abstractions.
> Trying to re-create this information isn't worth the effort as the distinction
> doesn't matter to linux, only to the CPU.
>
>
>> so there is no chance for Qemu to trigger the SError when gets the
>> SIGBUS_MCEERR_A{O,R}.
> You mean there is no reason for Qemu to trigger an SError when it gets a
> signal from the kernel.
>
> The reasons the CPU might have to generate an SError don't apply to linux and
> KVM user space. User-space will never get a signal for an uncontained error,
> we will always panic(). We can't give user-space a signal for imprecise
> exceptions, as it can't return from the signal. The classes of error that are
> left are covered by polled/irq and NOTIFY_SEA.
>
> Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really
> wants to, (but I don't think you should, the kernel may have unmapped the
> page at PC from stage2 due to corruption).

Yes, and since you also said you do not want to generate RAS SErrors for
SIGBUS_MCEERR_AR, Qemu does not know under which conditions to generate RAS
SErrors.

>
> I think the problem here is you're applying the CPU->software behaviour and
> choices to software->software. By the time user-space gets the error, the
> behaviour is different.

In KVM, keeping this API to specify the guest ESR and trigger an SError is OK
as a policy choice. It is just that Qemu, at least, does not know under which
conditions to trigger it.
Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization
[...]

>
>> +	case ESR_ELx_AET_UER: /* The error has not been propagated */
>> +		/*
>> +		 * Userspace only handle the guest SError Interrupt(SEI) if the
>> +		 * error has not been propagated
>> +		 */
>> +		run->exit_reason = KVM_EXIT_EXCEPTION;
>> +		run->ex.exception = ESR_ELx_EC_SERROR;
>> +		run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>> +		return 0;
>
> We should not pass RAS notifications to user space. The kernel either handles
> them, or it panics(). User space shouldn't even know if the kernel supports
> RAS

For ESR_ELx_AET_UER (Recoverable error), let us look at its definition below,
which is taken from [0]:

  The state of the PE is Recoverable if all of the following are true:
  - The error has not been silently propagated.
  - The error has not been architecturally consumed by the PE. (The PE
    architectural state is not infected.)
  - The exception is precise and PE can recover execution from the preferred
    return address of the exception, if software locates and repairs the
    error.
  The PE cannot make correct progress without either consuming the error or
  otherwise making the error unrecoverable. The error remains latent in the
  system. If software cannot locate and repair the error, either the
  application or the VM, or both, must be isolated by software.

So the exception is precise and the PE can recover execution from the
preferred return address of the exception, which is why letting the guest
handle it is better. For example, if it is a RAS error in a guest application,
we can kill that application instead of panicking the whole OS; if it is a RAS
error in the guest kernel, the guest will panic.

The host does not know which guest application has the error, so the host
cannot handle it, and panicking the OS is not a good choice for a Recoverable
error.

[0] https://static.docs.arm.com/ddi0587/a/RAS%20Extension-release%20candidate_march_29.pdf

> until it gets an MCEERR signal.

User space will detect whether the kernel supports RAS before handling it.

>
> You're making your firmware-first notification an EL3->EL0 signal, bypassing
> the OS.
>
> If we get a RAS SError and there are no CPER records or values in the ERR
> nodes, we should panic as it looks like the CPU/firmware is broken. (spurious
> RAS errors)
>
>
>> +	default:
>> +		/*
>> +		 * Until now, the CPU supports RAS and SEI is fatal, or host
>> +		 * does not support to handle the SError.
>> +		 */
>> +		panic("This Asynchronous SError interrupt is dangerous, panic");
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  /*
>>   * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
>>   * proper exit to userspace.
>
>
>
> James
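For reference, the categorization being argued about can be sketched as follows. The AET field position (bits [12:10]) and the UC/UEU/UEO/UER/CE encodings follow the ARM RAS Extension definition of the SError syndrome; everything else is an assumption made for the sketch. In particular, only the two branches visible in the quoted hunk are modelled (UER exits to user space, the default panics), the cases elided by "[...]" are lumped into the default purely for illustration, and the names are hypothetical. It describes the patch under review, not an agreed design.

```c
#include <assert.h>
#include <stdint.h>

/* AET is a 3-bit field at bits [12:10] of the SError syndrome. */
#define ESR_AET_SHIFT	10
#define ESR_AET_MASK	(UINT64_C(0x7) << ESR_AET_SHIFT)

/* Asynchronous Error Type encodings from the RAS Extension. */
enum aet_state {
	AET_UC  = 0,	/* Uncontainable */
	AET_UEU = 1,	/* Unrecoverable */
	AET_UEO = 2,	/* Restartable */
	AET_UER = 3,	/* Recoverable */
	AET_CE  = 6	/* Corrected */
};

/* Hypothetical outcome names for the two branches shown in the hunk. */
enum sei_action { SEI_EXIT_TO_USERSPACE, SEI_PANIC };

/* Extract the AET field from a syndrome value. */
static unsigned int esr_to_aet(uint64_t esr)
{
	unsigned int aet = (unsigned int)((esr & ESR_AET_MASK) >> ESR_AET_SHIFT);

	/* 3-bit field, so the result is always in 0..7. */
	assert(aet <= 7u);
	return aet;
}

/* Models only the UER and default branches of the quoted switch. */
static enum sei_action classify_sei(uint64_t esr)
{
	switch (esr_to_aet(esr)) {
	case AET_UER:
		return SEI_EXIT_TO_USERSPACE;
	default:
		return SEI_PANIC;
	}
}
```

Whether the Recoverable case should reach user space at all is exactly what the reply above disputes, so the sketch should be read as a description of the proposal only.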