Re: [RFC 19/19] s390/facilities: enable AP facilities needed by guest
On 11/02/2017 01:08 PM, Christian Borntraeger wrote: > > > On 10/16/2017 11:25 AM, Martin Schwidefsky wrote: >> On Fri, 13 Oct 2017 13:39:04 -0400 >> Tony Krowiak wrote: >> >>> Sets up the following facilities bits to enable the specified AP >>> facilities for the guest VM: >>> * STFLE.12: Enables the AP Query Configuration Information >>> facility. The AP bus running in the guest uses >>> the information returned from this instruction >>> to configure AP adapters and domains for the >>> guest machine. >>> * STFLE.15: Indicates the AP facilities test is available. >>> The AP bus running in the guest uses the >>> information. >>> >>> Signed-off-by: Tony Krowiak >>> --- >>> arch/s390/tools/gen_facilities.c |2 ++ >>> 1 files changed, 2 insertions(+), 0 deletions(-) >>> >>> diff --git a/arch/s390/tools/gen_facilities.c >>> b/arch/s390/tools/gen_facilities.c >>> index 70dd8f1..eeaa7db 100644 >>> --- a/arch/s390/tools/gen_facilities.c >>> +++ b/arch/s390/tools/gen_facilities.c >>> @@ -74,8 +74,10 @@ struct facility_def { >>> 8, /* enhanced-DAT 1 */ >>> 9, /* sense-running-status */ >>> 10, /* conditional sske */ >>> + 12, /* AP query configuration */ >>> 13, /* ipte-range */ >>> 14, /* nonquiescing key-setting */ >>> + 15, /* AP special-command facility */ >>> 73, /* transactional execution */ >>> 75, /* access-exception-fetch/store indication */ >>> 76, /* msa extension 3 */ >> >> With this all KVM guests will always have the AP instructions available, no? >> In principles I like this approach, but it differs from the way z/VM does >> things, >> there the guest will get an exception if it tries to execute an AP >> instruction >> if there are no AP devices assigned to the guest. I wonder if there is a >> reason >> why z/VM does it the way it does. > > A good question. For LPAR it seems that you have AP instructions even if you > have > no crypto cards. > I've tried to figure these things out last week but I've failed. Right at the beginning of AR-10334-03-POK we have this text: "An adjunct processor (AP) facility consists of the three AP instructions, and one to sixty-four APs.". This reads like if we have AP facility we have to have at least one AP. But when I've tried to get a better understanding how the presence/absence of the AP facility is indicated and what facility bits do we have in this context I got confused. Tony, could you please give us a detailed summary on this (best with references, and focusing on the (guest) program perspective)? Regards, Halil
Re: [PATCH v9 12/22] s390: vfio-ap: sysfs interfaces to configure control domains
On 08/27/2018 03:51 PM, Cornelia Huck wrote: On Mon, 27 Aug 2018 09:47:58 -0400 Tony Krowiak wrote: On 08/27/2018 04:33 AM, Cornelia Huck wrote: On Thu, 23 Aug 2018 10:16:59 -0400 Tony Krowiak wrote: On 08/23/2018 06:25 AM, Cornelia Huck wrote: On Wed, 22 Aug 2018 15:16:19 -0400 Tony Krowiak wrote: One of the things I suggested in a private conversation with Christian earlier today was to provide an additional rw sysfs attribute - a boolean - that indicates whether all usage domains should also be control domains. The default could be true. This would allow one to configure guests with usage-only domains as well as satisfy the convention. Would this additional attribute then control "add usage domains to the list of control domains automatically", or "don't allow to add a usage domain if it has not already been added as a control domain"? It was just a proposal that wasn't really discussed at all, but this attribute would add usage domains to the list of control domains automatically if set to one. That would be the default behavior which would be turned off by manually setting it to zero. If we want to do something like that, having it add the usage domains automatically sounds like the more workable alternative. What I like about this is that we make it explicit that we change the masks beyond what the admin explicitly configured, and provide a knob to turn off that behaviour. So, are you saying I should go ahead and implement this? I'm just saying that it does not sound like a bad idea :) If you agree that it's a good idea and if others also like it... I'd certainly not mind you going ahead :) I can live with it. What I don't like about it is that it adds more context dependent semantics. The same sequence of actions results in a different result (depending on the mode of operation). Regards, Halil
Re: [PATCH v9 22/22] s390: doc: detailed specifications for AP virtualization
On 08/20/2018 10:16 PM, Tony Krowiak wrote: Does the SIE complain if you specify a control domain that the host does not have access to (I'd guess so)? The SIE does not complain if you specify a domain to which the host - or a lower level guest - does not have access. The firmware performs a logical AND of the guest's and hosts's - or lower level guest's - APMs, AQMs and ADMs Rather a bit-wise AND, I guess (of the same type masks corresponding to Guest 1 and Guest 2). The result of a logical AND is a logical value (true or false) as far as I remember. to create effective masks EAPM, EAQM and EADM. Only devices corresponding to the bits set in the EAPM, EAQM and EADM will be accessible by the guest. I'm not sure what is the intended meaning of 'the SIE complains'. If it means getting out of (SIE when interpreting lets say an NQAP under the discussed circumstances) with some sort of error code, I think Tony's answer, ' SIE does not complain' makes a lot of sense. It's the guest that's is trying to stretch further than the blanket reaches, and it's the guest that needs to be educated on this fact. AFAIR SIE does the right thing (whatever the right thing is) and we don't have to worry about it. As a matter of fact I can't recall exactly what is supposed to happen when a guest tries to modify a domain such that the guest does not have privileges to modify (in terms of EADM, either because the guest or because the host does not have the corresponding bit set). I'm sure I did not try it out. Tony did you test this scenario? (BTW my best guess at the moment is, that the situation is handled via the command-reply.) Regards, Halil
Re: [PATCH v9 12/22] s390: vfio-ap: sysfs interfaces to configure control domains
On 08/21/2018 07:07 PM, Tony Krowiak wrote: On 08/21/2018 11:25 AM, Cornelia Huck wrote: On Mon, 20 Aug 2018 13:41:32 -0400 Tony Krowiak wrote: On 08/20/2018 10:23 AM, Cornelia Huck wrote: On Mon, 13 Aug 2018 17:48:09 -0400 Tony Krowiak wrote: From: Tony Krowiak Provides the sysfs interfaces for: 1. Assigning AP control domains to the mediated matrix device 2. Unassigning AP control domains from a mediated matrix device 3. Displaying the control domains assigned to a mediated matrix device The IDs of the AP control domains assigned to the mediated matrix device are stored in an AP domain mask (ADM). The bits in the ADM, from most significant to least significant bit, correspond to AP domain numbers 0 to 255. On some systems, the maximum allowable domain number may be less than 255 - depending upon the host's AP configuration - and assignment may be rejected if the input domain ID exceeds the limit. Please remind me of the relationship between control domains and usage domains... IIRC, usage domains allow both requests and configuration, while control domains allow only configuration, and are by convention a superset of usage domains. A usage domain is a domain to which an AP command-request message can be submitted for processing. A control domain is a domain that can be changed by an AP command request message submitted to a usage domain. AP command request messages to configure a domain will contain the domain number of the domain to be modified. The AP firmware will check the control domain mask (ADM) and will allow the request to proceed only if the corresponding bit in the ADM is set. Thanks to you and Halil for the explanation. Is there a hard requirement somewhere in there, or can the admin cheerfully use different masks for usage domains and control domains without the SIE choking on it? There is no hard requirement that control domains must be a superset of the usage domains, it is merely an architectural convention. AFAIK, SIE doesn't enforce this and will not break if the convention is not enforced externally. Having said that, you should note that the AQM and ADM masks configured for the mediated matrix device will be logically OR'd together to create the ADM stored in the CRYCB referenced from the guest's SIE state description. In other words, we are enforcing the convention in our software. Hm, that's interesting, as Halil argued that we should not enforce it in the kernel. Might be somewhat surprising as well. If that is really the way to do it, this needs to be documented clearly. This convention has been enforced by the kernel since v1. This is also enforced by both the LPAR as well as in z/VM. The following is from the PR/SM Planning Guide: Control Domain A logical partition's control domains are those cryptographic domains for which remote secure administration functions can be established and administered from this logical partition. This logical partition’s control domains must include its usage domains. For each index selected in the usage domain index list, you must select the same index in the control domain index list IMHO this quote is quite a half-full half-empty cup one: * it mandates the set of usage domains is a subset of the set of the control domains, but * it speaks of independent controls, namely about the 'usage domain index' and the 'control domain index list' and makes the enforcement of the rule a job of the administrator (instead of codifying it in the controls). Consequently, I'm going to opt for ensuring this is clearly documented. Based on the fact you've requested clarification of many points described in this section of the doc, I think I'll try putting my meager skills as a wordsmith to work to hopefully clarify things. I'll run it by you when I complete that task to see if I've succeeded:) I don't think just a doc update will do. Let me explain why. What describe as "... note that the AQM and ADM masks configured for the mediated matrix device will be logically OR'd together to create the ADM stored in the CRYCB referenced from the guest's SIE state description." is a gotcha at best. The member of struct ap_matrix and the member of the respective apcb in the crycb are both called 'adm', but ap_matrix.adm is not an ADM as we know it from the architecture, but rather ~ AQM & ADM. I feel pretty strongly about this one. If we want to keep the enforcement in the kernel, I guess, the assign_domain should set the bit corresponding bit not only in ap_matrix.aqm but also in ap_matrix.adm. When the ap_matrix is committed into the crycb no further manipulating the masks should take place. I don't feel strongly about whether to enforce this convention about AQM and ADM in the kernel or not. Frankly, I don't know what is behind the rule. Since I can't tell if any problems are to be expected if this convention is violated, I would feel more comfortable if the rule was accommodated higher in the management stack. Regards, Hal
Re: [PATCH v9 12/22] s390: vfio-ap: sysfs interfaces to configure control domains
On 08/22/2018 11:42 AM, Cornelia Huck wrote: On Wed, 22 Aug 2018 01:18:20 +0200 Halil Pasic wrote: On 08/21/2018 07:07 PM, Tony Krowiak wrote: This convention has been enforced by the kernel since v1. This is also enforced by both the LPAR as well as in z/VM. The following is from the PR/SM Planning Guide: Control Domain A logical partition's control domains are those cryptographic domains for which remote secure administration functions can be established and administered from this logical partition. This logical partition’s control domains must include its usage domains. For each index selected in the usage domain index list, you must select the same index in the control domain index list That's interesting. IMHO this quote is quite a half-full half-empty cup one: * it mandates the set of usage domains is a subset of the set of the control domains, but * it speaks of independent controls, namely about the 'usage domain index' and the 'control domain index list' and makes the enforcement of the rule a job of the administrator (instead of codifying it in the controls). I'm wondering if a configuration with a usage domain that is not also a control domain is rejected outright? Anybody tried that? :) I did not. This is my first exposure to the PR/SM Planning Guide. As I stated previously the HMC interface enforces the convention by UI design: in the activation profile you can either configure a domain as 'control' or 'control and usage' domain -- think radio button. I have no idea how is this information feed into PR/SM and same goes for alternatives to specify it. So I'm also very curious about this. Another interesting question: On what level does z/VM and PR/SM enforce the convention (i.e. on privilege level does the code doing the enforcement run)? Consequently, I'm going to opt for ensuring this is clearly documented. Based on the fact you've requested clarification of many points described in this section of the doc, I think I'll try putting my meager skills as a wordsmith to work to hopefully clarify things. I'll run it by you when I complete that task to see if I've succeeded:) I don't think just a doc update will do. Let me explain why. What describe as "... note that the AQM and ADM masks configured for the mediated matrix device will be logically OR'd together to create the ADM stored in the CRYCB referenced from the guest's SIE state description." is a gotcha at best. The member of struct ap_matrix and the member of the respective apcb in the crycb are both called 'adm', but ap_matrix.adm is not an ADM as we know it from the architecture, but rather ~ AQM & ADM. I feel pretty strongly about this one. If we want to keep the enforcement in the kernel, I guess, the assign_domain should set the bit corresponding bit not only in ap_matrix.aqm but also in ap_matrix.adm. When the ap_matrix is committed into the crycb no further manipulating the masks should take place. Would you be fine if the control domain interface stated that it is used to configure _additional_ control domains and the usage domain interface stated that it is used to define usage and implicitly also control domains? (And make the usage domain interface also set the equivalent bit in the control domain mask.) I'm fine with the interface, otherwise I would not have r-b-ed the patch. What I strongly dislike is the implementation is IMHO very confusing (along the lines "surprise, surprise it is called adm but it ain't adm"). This implementation detail however can be changed any time, so I did not want to start a big discussion as we wanted to get this out ASAP. But since it was brought up, I decided to put in my two cents. I don't feel strongly about whether to enforce this convention about AQM and ADM in the kernel or not. Frankly, I don't know what is behind the rule. Since I can't tell if any problems are to be expected if this convention is violated, I would feel more comfortable if the rule was accommodated higher in the management stack. I guess it depends: - If this is a case of: "Don't configure control domains that are not also usage domains. You are likely to go through {code,firmware,hardware} paths that are generally not used.", configure it in the kernel. - If this rather is "Everybody is doing that, it's a general convention.", configure it higher up in the stack (libvirt?) My guess is that it's the second, but I really don't know. Maybe somebody else will help us answer this question, and also tell what is the rationale behind the rule. Regards, Halil
Re: [PATCH v9 22/22] s390: doc: detailed specifications for AP virtualization
On 08/22/2018 12:13 PM, Harald Freudenberger wrote: ... about control domains Talked with the s390 firmware guys. The convention that the control domain mask is a superset of the usage domain mask is only true for 1st level guests. It is absolutely valid to run a kvm guest with restricted control domain mask bitmap in the CRYCB. It is valid to have an empty control domain mask and the guest should be able to run crypto CPRBs on the usage domain(s) without any problems. However, nobody has tried this. I did try this ;). regards Harald Freudenberger
Re: [PATCH v9 12/22] s390: vfio-ap: sysfs interfaces to configure control domains
On 08/22/2018 05:48 PM, Christian Borntraeger wrote: On 08/22/2018 05:34 PM, Pierre Morel wrote: On 22/08/2018 17:11, Christian Borntraeger wrote: On 08/22/2018 01:03 PM, Pierre Morel wrote: That's interesting. IMHO this quote is quite a half-full half-empty cup one: * it mandates the set of usage domains is a subset of the set of the control domains, but * it speaks of independent controls, namely about the 'usage domain index' and the 'control domain index list' and makes the enforcement of the rule a job of the administrator (instead of codifying it in the controls). I'm wondering if a configuration with a usage domain that is not also a control domain is rejected outright? Anybody tried that? :) Yes, and no it is not. We can use a queue (usage domain) to a AP card for SHA-512 or RSA without having to define the queue as a control domain. Huh? My HMC allows to add a domain as - control only domain - control and usage domain. But I am not able to configure a usage-only domain for my LPAR. That seems to match the current code, no? Yes, it may not be configurable by the HMC but if we start a guest with no control domain it is not a problem to access the hardware through the usage domain. I tested this a long time ago, but tested again today to be sure on my LPAR. AFAIU adding a control only domain and a control and usage domain allows say: control and usage domain 1 control only domain 2 Allow to send a message to domain 2 using queue 1 Allow also to send a domain modifying message to domain 1 using queue 1 control domain are domain which are controlled So you have changed the code to not automatically make a usage domain a control domain in the bitfield (and you could still use it as a usage domain). Correct? I tested basically the same yesterday, with the same results. I think this is probably expected. the "usage implies control" seems to be a convention implemented by HMC (lpar) and z/VM but millicode offers the bits to have usage-only domains. As LPAR and z/VM will always enable any usage-domain to also be a control domain we should do the same. I'm fine either way, but slightly prefer higher level management software and not the kernel accommodating this convention. Please consider a quote from Harald's mail in another sub-thread """ ... about control domains Talked with the s390 firmware guys. The convention that the control domain mask is a superset of the usage domain mask is only true for 1st level guests. It is absolutely valid to run a kvm guest with restricted control domain mask bitmap in the CRYCB. It is valid to have an empty control domain mask and the guest should be able to run crypto CPRBs on the usage domain(s) without any problems. However, nobody has tried this. """ I'm yet to get an explanation why was this convention established in the first place. And I can not figure it out myself. For me a setup where I know that the domains used by some guest can not be modified by the same guest makes perfect sense. If I try to think in analogies, I kind of compare modification (that is control domain) with write access, and usage (that is usage domain) with read access to, let's say a regular file. For me, all options (rw, r, and w) do make sense, and if I had to pick the one that makes the least sense I would pick write only. The convention is in these terms making read-only illegal. But should 'usage only domains' ever get identified as something somebody wants to do we can just add an attribute for that. So I'm fine either way. Still I find the commit message for this patch, the implementation of assign_control_domain() and also the documentation slightly misleading regarding what does one get from assign_domain. Regards, Halil It seems that the HMC enforce the LPARs to have access to their usage domain (AFAIU from Harald)
Re: [PATCH v9 12/22] s390: vfio-ap: sysfs interfaces to configure control domains
On 08/22/2018 09:16 PM, Tony Krowiak wrote: On 08/22/2018 01:11 PM, Halil Pasic wrote: On 08/22/2018 05:48 PM, Christian Borntraeger wrote: On 08/22/2018 05:34 PM, Pierre Morel wrote: On 22/08/2018 17:11, Christian Borntraeger wrote: On 08/22/2018 01:03 PM, Pierre Morel wrote: That's interesting. IMHO this quote is quite a half-full half-empty cup one: * it mandates the set of usage domains is a subset of the set of the control domains, but * it speaks of independent controls, namely about the 'usage domain index' and the 'control domain index list' and makes the enforcement of the rule a job of the administrator (instead of codifying it in the controls). I'm wondering if a configuration with a usage domain that is not also a control domain is rejected outright? Anybody tried that? :) Yes, and no it is not. We can use a queue (usage domain) to a AP card for SHA-512 or RSA without having to define the queue as a control domain. Huh? My HMC allows to add a domain as - control only domain - control and usage domain. But I am not able to configure a usage-only domain for my LPAR. That seems to match the current code, no? Yes, it may not be configurable by the HMC but if we start a guest with no control domain it is not a problem to access the hardware through the usage domain. I tested this a long time ago, but tested again today to be sure on my LPAR. AFAIU adding a control only domain and a control and usage domain allows say: control and usage domain 1 control only domain 2 Allow to send a message to domain 2 using queue 1 Allow also to send a domain modifying message to domain 1 using queue 1 control domain are domain which are controlled So you have changed the code to not automatically make a usage domain a control domain in the bitfield (and you could still use it as a usage domain). Correct? I tested basically the same yesterday, with the same results. I think this is probably expected. the "usage implies control" seems to be a convention implemented by HMC (lpar) and z/VM but millicode offers the bits to have usage-only domains. As LPAR and z/VM will always enable any usage-domain to also be a control domain we should do the same. I'm fine either way, but slightly prefer higher level management software and not the kernel accommodating this convention. Please consider a quote from Harald's mail in another sub-thread """ ... about control domains Talked with the s390 firmware guys. The convention that the control domain mask is a superset of the usage domain mask is only true for 1st level guests. It is absolutely valid to run a kvm guest with restricted control domain mask bitmap in the CRYCB. It is valid to have an empty control domain mask and the guest should be able to run crypto CPRBs on the usage domain(s) without any problems. However, nobody has tried this. """ I'm yet to get an explanation why was this convention established in the first place. And I can not figure it out myself. For me a setup where I know that the domains used by some guest can not be modified by the same guest makes perfect sense. If I try to think in analogies, I kind of compare modification (that is control domain) with write access, and usage (that is usage domain) with read access to, let's say a regular file. For me, all options (rw, r, and w) do make sense, and if I had to pick the one that makes the least sense I would pick write only. The convention is in these terms making read-only illegal. But should 'usage only domains' ever get identified as something somebody wants to do we can just add an attribute for that. So I'm fine either way. One of the things I suggested in a private conversation with Christian earlier today was to provide an additional rw sysfs attribute - a boolean - that indicates whether all usage domains should also be control domains. The default could be true. This would allow one to configure guests with usage-only domains as well as satisfy the convention. I prefer keeping the attributes as they are and adding a new let's say (un)assign_usage_domain if the need arises over this boolean attribute that changes how (un)assign_domain works. Halil
Re: [PATCH v9 21/22] KVM: s390: CPU model support for AP virtualization
On 08/23/2018 09:44 AM, David Hildenbrand wrote: On 22.08.2018 22:16, Tony Krowiak wrote: On 08/22/2018 07:24 AM, David Hildenbrand wrote: On 22.08.2018 13:19, David Hildenbrand wrote: On 13.08.2018 23:48, Tony Krowiak wrote: From: Tony Krowiak Introduces a new CPU model feature and two CPU model facilities to support AP virtualization for KVM guests. CPU model feature: The KVM_S390_VM_CPU_FEAT_AP feature indicates that AP instructions are available on the guest. This feature will be enabled by the kernel only if the AP instructions are installed on the linux host. This feature must be specifically turned on for the KVM guest from userspace to use the VFIO AP device driver for guest access to AP devices. CPU model facilities: 1. AP Query Configuration Information (QCI) facility is installed. This is indicated by setting facilities bit 12 for the guest. The kernel will not enable this facility for the guest if it is not set on the host. If this facility is not set for the KVM guest, then only APQNs with an APQI less than 16 will be used by a Linux guest regardless of the matrix configuration for the virtual machine. This is a limitation of the Linux AP bus. 2. AP Facilities Test facility (APFT) is installed. This is indicated by setting facilities bit 15 for the guest. The kernel will not enable this facility for the guest if it is not set on the host. If this facility is not set for the KVM guest, then no AP devices will be available to the guest regardless of the guest's matrix configuration for the virtual machine. This is a limitation of the Linux AP bus. Signed-off-by: Tony Krowiak Reviewed-by: Christian Borntraeger Reviewed-by: Halil Pasic Tested-by: Michael Mueller Tested-by: Farhan Ali Signed-off-by: Christian Borntraeger --- arch/s390/kvm/kvm-s390.c |5 + arch/s390/tools/gen_facilities.c |2 ++ 2 files changed, 7 insertions(+), 0 deletions(-) diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 1e8cb67..d5e04d2 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -367,6 +367,11 @@ static void kvm_s390_cpu_feat_init(void) if (MACHINE_HAS_ESOP) allow_cpu_feat(KVM_S390_VM_CPU_FEAT_ESOP); + + /* Check if AP instructions installed on host */ + if (ap_instructions_available()) + allow_cpu_feat(KVM_S390_VM_CPU_FEAT_AP); + /* * We need SIE support, ESOP (PROT_READ protection for gmap_shadow), * 64bit SCAO (SCA passthrough) and IDTE (for gmap_shadow unshadowing). diff --git a/arch/s390/tools/gen_facilities.c b/arch/s390/tools/gen_facilities.c index 90a8c9e..a52290b 100644 --- a/arch/s390/tools/gen_facilities.c +++ b/arch/s390/tools/gen_facilities.c @@ -106,6 +106,8 @@ struct facility_def { .name = "FACILITIES_KVM_CPUMODEL", .bits = (int[]){ + 12, /* AP Query Configuration Information */ + 15, /* AP Facilities Test */ -1 /* END */ } }, I really wonder if we should also export the APXA facility. We can probe and allow that CPU feature. However, we cannot disable it (as of now). We have other CPU features where it is the same case (basically all subfunctions). See kvm_s390_get_processor_subfunc(). We probe them and export them, but support to disable them has never been implemented. On a high level, we could then e.g. deny to start a QEMU guest if APXA is available but has been disabled. (until we know that disabling it actually works - if ever). This helps to catch nasty migration bugs (e.g. APXA suddenly disappearing). Although unlikely, definitely possible. Are there any other AP related facilities that the guest can from now on probe that should also become part of the CPU model? To be more precise, shouldn't PQAP(QCI) be handled just like other subfunctions? (I remember it should) When you suggest PQAP(QCI) be handled like other subfunctions, are you suggesting that there should be a field in struct kvm_s390_vm_cpu_subfunc with a bit indicating the QCI subfunction is available? The availability of the QCI subfunction of the PQAP instruction is determined by facilities bit 12. Is it not enough to export facilities bit 12? The feature block (128 bit) from PQAP(QCI) should be passed through a subfunction block to QEMU. I'm confused, which 128 bit? So it is about passing e.g. APXA availability, not QCI itself. (as you correctly said, that is stfl 12)
Re: [PATCH v9 21/22] KVM: s390: CPU model support for AP virtualization
On 08/23/2018 02:47 PM, Pierre Morel wrote: On 23/08/2018 13:12, David Hildenbrand wrote: [..] I'm confused, which 128 bit? Me too :) , I was assuming this block to be 128bit, but the qci block has 128 bytes And looking at arch/s390/include/asm/ap.h, there is a lot of information contained that is definitely not of interest for CPU models... I wonder if there is somewhere defined which bits are reserved for future features/facilities, compared to ap masks and such. This is really hard to understand/plan without access to documentation. You (Halil, Tony, Pier, ...) should have a look if what I described related to PQAP(QCI) containing features that should get part of the CPU model makes sense or not. For now I was thinking that there is some part inside of QCI that is strictly reserved for facilities/features that we can use. No there is no such part. The architecture documentation is quite confusing with some aspects (e.g. persistence) of how exactly some of these features work and are indicated. I'm having a hard time finding my opinion. I may end up asking some questions later, but for now i have to think first. Just one hint. There is a programming note stating that if bit 2 of the QCI block is one there is at least one AP card in the machine that actually has APXA installed. I read the architecture so that the APXA has a 'cpu part' (if we are doing APXA the cpu can't spec exception on certain bits not being zor9) and a 'card(s) part'. Since the stuff seems quite difficult to sort out properly, I ask myself are there real problems we must solve? This ultimately seems to be about the migration, right? You say 'This helps to catch nasty migration bugs (e.g. APXA suddenly disappearing).' at the very beginning of the discussion. Yes, we don't have to have an vfio_ap device, he guest can and will start looking for AP resources if only the cpu model features installed. So the guest could observe a disappearing APXA, but I don't think that would lead to problems (with Linux at least). And there ain't much AP a guest can sanely do without if no AP resources are there. I would really prefer not rushing a solution if we don't have to. What is apsc, qact, rc8a in the qci blocks? are the facility bits? Yes, facility bits concerning the AP instructions According to the current AR document rc8a ain't a facility but bits 0-2 and 4-7 kind of are. Regards, Halil
Re: [PATCH v9 21/22] KVM: s390: CPU model support for AP virtualization
On 08/23/2018 07:40 PM, David Hildenbrand wrote: On 23.08.2018 19:35, Tony Krowiak wrote: On 08/23/2018 10:59 AM, Pierre Morel wrote: On 23/08/2018 15:38, David Hildenbrand wrote: On 23.08.2018 15:22, Halil Pasic wrote: On 08/23/2018 02:47 PM, Pierre Morel wrote: On 23/08/2018 13:12, David Hildenbrand wrote: [..] I'm confused, which 128 bit? Me too :) , I was assuming this block to be 128bit, but the qci block has 128 bytes And looking at arch/s390/include/asm/ap.h, there is a lot of information contained that is definitely not of interest for CPU models... I wonder if there is somewhere defined which bits are reserved for future features/facilities, compared to ap masks and such. This is really hard to understand/plan without access to documentation. You (Halil, Tony, Pier, ...) should have a look if what I described related to PQAP(QCI) containing features that should get part of the CPU model makes sense or not. For now I was thinking that there is some part inside of QCI that is strictly reserved for facilities/features that we can use. No there is no such part. The architecture documentation is quite confusing with some aspects (e.g. persistence) of how exactly some of these features work and are indicated. I'm having a hard time finding my opinion. I may end up asking some questions later, but for now i have to think first. Just one hint. There is a programming note stating that if bit 2 of the QCI block is one there is at least one AP card in the machine that actually has APXA installed. I read the architecture so that the APXA has a 'cpu part' (if we are doing APXA the cpu can't spec exception on certain bits not being zor9) and a 'card(s) part'. Since the stuff seems quite difficult to sort out properly, I ask myself are there real problems we must solve? This ultimately seems to be about the migration, right? You say 'This helps to catch nasty migration bugs (e.g. APXA suddenly disappearing).' at the very beginning of the discussion. Yes, we don't have to have an vfio_ap device, he guest can and will start looking for AP resources if only the cpu model features installed. So the guest could observe a disappearing APXA, but I don't think that would lead to problems (with Linux at least). And there ain't much AP a guest can sanely do without if no AP resources are there. I would really prefer not rushing a solution if we don't have to. What is apsc, qact, rc8a in the qci blocks? are the facility bits? Yes, facility bits concerning the AP instructions According to the current AR document rc8a ain't a facility but bits 0-2 and 4-7 kind of are. Easy ( :) ) answer. Everything that is the CPU part should get into the CPU model. Everything that is AP specific not. If APXA is not a CPU facility, fine with me to leave it out. Ack to not rushing, but also ack to not leaving out important things. Ack that this stuff is hard to ficure out. APXA is not a CPU part, it is a machine part (SIE) and a AP part (QCI,TAPQ), it has no influence on CPU instructions but on the AP instructions. Consequently, if I understood the definition correctly, it should not go in the CPU model. The APXA bit returned via the PQAP(QCI) instruction indicates the APXA facility is installed in the CPUs of the configuration. This means that the facility is installed in one or more adjunct processors but not necessarily all. Given that it indicates a CPU property, maybe it does belong in the CPU model? Hmmm, I tend to agree - especially as it affects SIE behavior. But as this is not a feature block (compared to what I thought), this clould be model as a CPU feature like AP. There is certainly a CPU aspect to APXA: before APXA the APQN had to have zeros in certain bits (otherwise specification exception). When running with APXA we have a guarantee that there won't be any specification exception flying because such an bit is set. The interesting question is, is APXA constant let's say as long as an LPAR partition is activated? Regards, Halil
Re: [PATCH v6 15/21] s390: vfio-ap: configure the guest's AP matrix
On 06/29/2018 11:11 PM, Tony Krowiak wrote: From: Tony Krowiak Configures the AP adapters, usage domains and control domains for the KVM guest from the matrix configured via the mediated matrix device's sysfs attribute files. [..] + +static void kvm_ap_set_crycb_masks(struct ap_matrix_mdev *matrix_mdev) +{ + int nbytes; + unsigned long *apm, *aqm, *adm; + + kvm_ap_clear_crycb_masks(matrix_mdev); + + apm = kvm_ap_get_crycb_apm(matrix_mdev); + aqm = kvm_ap_get_crycb_aqm(matrix_mdev); + adm = kvm_ap_get_crycb_adm(matrix_mdev); + + nbytes = KVM_AP_MASK_BYTES(matrix_mdev->matrix.apm_max + 1); + memcpy(apm, matrix_mdev->matrix.apm, nbytes); + + nbytes = KVM_AP_MASK_BYTES(matrix_mdev->matrix.aqm_max + 1); + memcpy(aqm, matrix_mdev->matrix.aqm, nbytes); + + /* +* Merge the AQM and ADM since the ADM is a superset of the +* AQM by agreed-upon convention. +*/ + bitmap_or(adm, matrix_mdev->matrix.adm, matrix_mdev->matrix.aqm, + matrix_mdev->matrix.adm_max + 1); Are you sure this or works as expected? E.g. if adm_max == 15 the bitmaps include the least significant 2 bytes but you want the other two. +}
Re: [PATCH v6 14/21] s390: vfio-ap: implement mediated device open callback
On 06/29/2018 11:11 PM, Tony Krowiak wrote: Implements the open callback on the mediated matrix device. The function registers a group notifier to receive notification of the VFIO_GROUP_NOTIFY_SET_KVM event. When notified, the vfio_ap device driver will get access to the guest's kvm structure. The open callback must ensure that only one mediated device shall be opened per guest. Signed-off-by: Tony Krowiak --- drivers/s390/crypto/vfio_ap_ops.c | 128 + drivers/s390/crypto/vfio_ap_private.h |2 + 2 files changed, 130 insertions(+), 0 deletions(-) diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c index bc7398d..58be495 100644 --- a/drivers/s390/crypto/vfio_ap_ops.c +++ b/drivers/s390/crypto/vfio_ap_ops.c @@ -11,6 +11,10 @@ #include #include #include +#include +#include +#include +#include #include "vfio_ap_private.h" @@ -748,12 +752,136 @@ static ssize_t matrix_show(struct device *dev, struct device_attribute *attr, NULL }; +/** + * Verify that the AP instructions are available on the guest and are to be + * interpreted by the firmware. The former is indicated via the + * KVM_S390_VM_CPU_FEAT_AP CPU model feature and the latter by apie crypto + * flag. + */ +static int kvm_ap_validate_crypto_setup(struct kvm *kvm) +{ + if (test_bit_inv(KVM_S390_VM_CPU_FEAT_AP, kvm->arch.cpu_feat) && + kvm->arch.crypto.apie) + return 0; + + pr_err("%s: interpretation of AP instructions not available", + VFIO_AP_MODULE_NAME); + + return -EOPNOTSUPP; +} + +static int vfio_ap_mdev_group_notifier(struct notifier_block *nb, + unsigned long action, void *data) +{ + struct ap_matrix_mdev *matrix_mdev; + + if (action == VFIO_GROUP_NOTIFY_SET_KVM) { + matrix_mdev = container_of(nb, struct ap_matrix_mdev, + group_notifier); + matrix_mdev->kvm = data; + } + + return NOTIFY_OK; +} + [..] + +static int vfio_ap_mdev_open(struct mdev_device *mdev) +{ + struct ap_matrix_mdev *matrix_mdev = mdev_get_drvdata(mdev); + struct ap_matrix_dev *matrix_dev = + to_ap_matrix_dev(mdev_parent_dev(mdev)); + unsigned long events; + int ret; + + if (!try_module_get(THIS_MODULE)) + return -ENODEV; + + ret = vfio_ap_verify_queues_reserved(matrix_dev, matrix_mdev->name, +&matrix_mdev->matrix); + if (ret) + goto out_err; + + matrix_mdev->group_notifier.notifier_call = vfio_ap_mdev_group_notifier; + events = VFIO_GROUP_NOTIFY_SET_KVM; + + ret = vfio_register_notifier(mdev_dev(mdev), VFIO_GROUP_NOTIFY, +&events, &matrix_mdev->group_notifier); + if (ret) + goto out_err; + + ret = kvm_ap_validate_crypto_setup(matrix_mdev->kvm); At this point you assume that your vfio_ap_mdev_group_notifier callback was called with VFIO_GROUP_NOTIFY_SET_KVM and that you do have matrix_mdev->kvm set up properly. Based on how callbacks usually work this seems rather strange. It's probably cleaner to set up he cyrcb (including all the validation that needs to be done immediately before) in the callback (vfio_ap_mdev_group_notifier). If that is not viable I think we need a comment here explaining why is this OK (at least). Regards, Halil + if (ret) + goto out_kvm_err; + + ret = vfio_ap_mdev_open_once(matrix_mdev); + if (ret) + goto out_kvm_err; + + return 0; + +out_kvm_err: + vfio_unregister_notifier(mdev_dev(mdev), VFIO_GROUP_NOTIFY, +&matrix_mdev->group_notifier); + matrix_mdev->kvm = NULL; +out_err: + module_put(THIS_MODULE); + + return ret; +} + [..]
Re: [PATCH v6 15/21] s390: vfio-ap: configure the guest's AP matrix
On 06/29/2018 11:11 PM, Tony Krowiak wrote: From: Tony Krowiak Configures the AP adapters, usage domains and control domains for the [..] +static inline void kvm_ap_clear_crycb_masks(struct ap_matrix_mdev *matrix_mdev) +{ + memset(&matrix_mdev->kvm->arch.crypto.crycb->apcb0, 0, + sizeof(matrix_mdev->kvm->arch.crypto.crycb->apcb0)); + memset(&matrix_mdev->kvm->arch.crypto.crycb->apcb1, 0, + sizeof(matrix_mdev->kvm->arch.crypto.crycb->apcb1)); +} + +static void kvm_ap_set_crycb_masks(struct ap_matrix_mdev *matrix_mdev) +{ + int nbytes; + unsigned long *apm, *aqm, *adm; + + kvm_ap_clear_crycb_masks(matrix_mdev); + + apm = kvm_ap_get_crycb_apm(matrix_mdev); + aqm = kvm_ap_get_crycb_aqm(matrix_mdev); + adm = kvm_ap_get_crycb_adm(matrix_mdev); + + nbytes = KVM_AP_MASK_BYTES(matrix_mdev->matrix.apm_max + 1); + memcpy(apm, matrix_mdev->matrix.apm, nbytes); + + nbytes = KVM_AP_MASK_BYTES(matrix_mdev->matrix.aqm_max + 1); + memcpy(aqm, matrix_mdev->matrix.aqm, nbytes); + + /* +* Merge the AQM and ADM since the ADM is a superset of the +* AQM by agreed-upon convention. +*/ + bitmap_or(adm, matrix_mdev->matrix.adm, matrix_mdev->matrix.aqm, + matrix_mdev->matrix.adm_max + 1); +} + [..] + +static int kvm_ap_configure_matrix(struct ap_matrix_mdev *matrix_mdev) +{ + int ret = 0; + + mutex_lock(&matrix_mdev->kvm->lock); + + ret = kvm_ap_validate_queue_sharing(matrix_mdev); + if (ret) + goto done; + + kvm_ap_set_crycb_masks(matrix_mdev); + +done: + mutex_unlock(&matrix_mdev->kvm->lock); + + return ret; +} + +void kvm_ap_deconfigure_matrix(struct ap_matrix_mdev *matrix_mdev) +{ + mutex_lock(&matrix_mdev->kvm->lock); + kvm_ap_clear_crycb_masks(matrix_mdev); The guest may be running at this point of time, or? I think you need our safe update operation that we used to use for the initial set too, but then somebody was like it ain't necessary because we don't support hotplug (yet). Regards, Halil + mutex_unlock(&matrix_mdev->kvm->lock); +} +
Re: [PATCH v6 14/21] s390: vfio-ap: implement mediated device open callback
On 07/12/2018 06:03 PM, Tony Krowiak wrote: +static int vfio_ap_mdev_open(struct mdev_device *mdev) +{ +struct ap_matrix_mdev *matrix_mdev = mdev_get_drvdata(mdev); +struct ap_matrix_dev *matrix_dev = +to_ap_matrix_dev(mdev_parent_dev(mdev)); +unsigned long events; +int ret; + +if (!try_module_get(THIS_MODULE)) +return -ENODEV; + +ret = vfio_ap_verify_queues_reserved(matrix_dev, matrix_mdev->name, + &matrix_mdev->matrix); +if (ret) +goto out_err; + + matrix_mdev->group_notifier.notifier_call = vfio_ap_mdev_group_notifier; +events = VFIO_GROUP_NOTIFY_SET_KVM; + +ret = vfio_register_notifier(mdev_dev(mdev), VFIO_GROUP_NOTIFY, + &events, &matrix_mdev->group_notifier); +if (ret) +goto out_err; + +ret = kvm_ap_validate_crypto_setup(matrix_mdev->kvm); At this point you assume that your vfio_ap_mdev_group_notifier callback was called with VFIO_GROUP_NOTIFY_SET_KVM and that you do have matrix_mdev->kvm set up properly. Based on how callbacks usually work this seems rather strange. It's probably cleaner to set up he cyrcb (including all the validation that needs to be done immediately before) in the callback (vfio_ap_mdev_group_notifier). If that is not viable I think we need a comment here explaining why is this OK (at least). This was originally in the callback and moved out, to the best of my recollection, because the validation at that time was done on the CRYCB and if that validation failed, there was no way to notify the client (QEMU) that configuration of the guest's CRYCB failed from the notification callback. This works - at least as far as I can tell from testing - because the registration of the notifier invokes the notification callback if KVM has already been set and that appears to be the case. You are correct, however; we probably shouldn't bank on that given that I don't think we can guarantee that to be the case 100% of the time. Consequently, I will move this back into the notification callback and since consistency checking is now being done on the mdev devices instead of the CRYCB, we don't need KVM at open time. Sounds good to me. Making the open fail was not a good way to communicate this error condition to userspace anyway. Regards, Halil
Re: [PATCH v6 13/21] s390: vfio-ap: sysfs interface to view matrix mdev matrix
On 07/13/2018 02:24 PM, Tony Krowiak wrote: On 07/09/2018 04:38 PM, Pierre Morel wrote: On 09/07/2018 14:20, Pierre Morel wrote: On 29/06/2018 23:11, Tony Krowiak wrote: Provides a sysfs interface to view the AP matrix configured for the mediated matrix device. The relevant sysfs structures are: /sys/devices/vfio_ap ... [matrix] .. [mdev_supported_types] . [vfio_ap-passthrough] [devices] ...[$uuid] .. matrix To view the matrix configured for the mediated matrix device, print the matrix file: cat matrix Signed-off-by: Tony Krowiak --- drivers/s390/crypto/vfio_ap_ops.c | 31 +++ 1 files changed, 31 insertions(+), 0 deletions(-) diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c index c8f31f3..bc7398d 100644 --- a/drivers/s390/crypto/vfio_ap_ops.c +++ b/drivers/s390/crypto/vfio_ap_ops.c @@ -697,6 +697,36 @@ static ssize_t control_domains_show(struct device *dev, } DEVICE_ATTR_RO(control_domains); +static ssize_t matrix_show(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct mdev_device *mdev = mdev_from_dev(dev); + struct ap_matrix_mdev *matrix_mdev = mdev_get_drvdata(mdev); + char *bufpos = buf; + unsigned long apid; + unsigned long apqi; + unsigned long napm = matrix_mdev->matrix.apm_max + 1; + unsigned long naqm = matrix_mdev->matrix.aqm_max + 1; + int nchars = 0; + int n; + + for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, napm) { + n = sprintf(bufpos, "%02lx\n", apid); + bufpos += n; + nchars += n; + + for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, naqm) { + n = sprintf(bufpos, "%02lx.%04lx\n", apid, apqi); + bufpos += n; + nchars += n; + } + } + + return nchars; +} +DEVICE_ATTR_RO(matrix); + + static struct attribute *vfio_ap_mdev_attrs[] = { &dev_attr_assign_adapter.attr, &dev_attr_unassign_adapter.attr, @@ -705,6 +735,7 @@ static ssize_t control_domains_show(struct device *dev, &dev_attr_assign_control_domain.attr, &dev_attr_unassign_control_domain.attr, &dev_attr_control_domains.attr, + &dev_attr_matrix.attr, NULL, }; I have still the same remark: what you show here is not what is currently used by the SIE. It is not irrelevant but what the guest really use may be more interesting for the admin. OK, you implement the right view it in patch 16/21. Still, what is the purpose of showing this view? I find it to have great value when configuring the mdev. It provides a view of what has been configured thus far. IMHO we need to keep this view for the reason stated by Tony. Halil
Re: [Qemu-devel] [RFC PATCH 0/3] vfio: ccw: basic channel path event handling
On 01/23/2018 07:23 AM, Dong Jia Shi wrote: > * Halil Pasic [2018-01-16 16:57:13 +0100]: > >> >> >> On 01/15/2018 09:59 AM, Dong Jia Shi wrote: >>> * Halil Pasic [2018-01-12 19:10:20 +0100]: >>> >>>> >>>> >>>> On 01/11/2018 04:04 AM, Dong Jia Shi wrote: >>>>> What are still missing, thus need to be offered in the next version are: >>>>> - I/O termination and FSM state handling if currently we have I/O on the >>>>> status >>>>> switched path. >>>>> - Vary on/off event is not sensible to a guest. >>>> >>>> I don't see a doc update. We do have documentation (in >>>> Documentation/s390/vfio-ccw.txt) in which the uapi interface with the >>>> regions and their purpose/usage is at least kind of explained. You are >>>> changing this interface without updating the doc. >>>> >>>> I would like to see documentation on this because I'm under the >>>> impression either the design is pretty convoluted or I did not >>>> get it at all. >>> Ah, I missed the documentation part. Thanks for pointing out. I will add >>> it in the next cycle. >>> >> >> I would have preferred having a doc update in this cycle. E.g. as >> an answer to the cover letter. > Ok. If you prefer this, let's try to clarify questions right here and > update documentations in the next review cycle (if there is any). > >> >> As previously pointed out I don't really understand your design. >> I wanted to avoid summarizing the design myself (that is my understanding >> of it), to then question the design. > Fair enough. > >> >> To give you a feeling of what I mean here some bullet points: >> * Channel paths are css level resources (simplified). > Yes, and it's the means for the machine to talk to the device. > >> * When a channel path malfunctions a CRW is generated and a machine >> check channel report pending is generated. Same goes for >> channel paths popping up (on HW level). Should not these get >> propagated? > AFAIR, channel path related CRW is not that reliable. I mean it seems > that it's not mandatory to generate CRWs for channel path malfunctions. > AFAIU, it depneds on machine models. For example, we won't get > CRW_ERC_INIT for a "path has come" event on many models I believe. And > that's why nobody noticed the misuse of CRW_ERC_IPARM and CRW_ERC_INIT > in the CRW handling logic for channel path CRWs. > Ref: > 2daace78a8c9 s390: chp: handle CRW_ERC_INIT for channel-path status change > Honestly, I forgot about this discussion. I've refreshed my memories by now, but I could not find why is it supposed to be architecturally OK to loose CRWs when for instance a chp goes away. > So my understanding for this is: it really a design decision for us to > propagate all of the channel path CRWs, or we just use other ways (like > using PNO and PNOM as this series shows). >From what I read, CRWs did not seem optional. I wonder what should I read in order to change my mind. I'm not talking about the hardware in the wild -- but that could be buggy hardware. > > Of course, CRW propagation is good to have. But as a discussion result > of a former thread, that is something we can consider only after we have > a good channel path re-modelling. That is the problem. We can try to > trigger CRW event in some work of machine checks handling enhancement > later. > >> * Kind of instead of the CRW you introduce a per device interrupt >> for channel path events on the qemu/kvm boundary. AFAIU for a chp >> event you do an STSCH for each (affected?) subchannel > Until here, yes and right. > >> and generate an irq. Right? Why is this a good idea. > This is not right. This series does not generate an irq for the guest. You misunderstood this. The word 'irq' this sentence is a short version of 'per device interrupt for channel path events on the qemu/kvm boundary' form the previous sentence. It's not an irq injected into the guest but a notification (which you call irq in '[RFC PATCH 2/3] vfio: ccw: introduce channel path irq') for QEMU. > In QEMU, when it gets the notification of a channel path status change > event, it read the newest SCHIB from the schib region, and update the > SCHIB (patch related parts) for the virtual subchannel. The guest will > get the new SCHIB whenever it calls STSCH then. We are in agreement. I just wanted to say, if let's say 42 vfio-ccw devices are using the same chp and it goes away, you will generate 42 notifications. > >
Re: [Qemu-devel] [RFC PATCH 0/3] vfio: ccw: basic channel path event handling
On 01/11/2018 04:04 AM, Dong Jia Shi wrote: > What are still missing, thus need to be offered in the next version are: > - I/O termination and FSM state handling if currently we have I/O on the > status > switched path. > - Vary on/off event is not sensible to a guest. I don't see a doc update. We do have documentation (in Documentation/s390/vfio-ccw.txt) in which the uapi interface with the regions and their purpose/usage is at least kind of explained. You are changing this interface without updating the doc. I would like to see documentation on this because I'm under the impression either the design is pretty convoluted or I did not get it at all. Regards, Halil
Re: [PATCH v2] vfio-ccw: update documentation
On 01/16/2018 05:12 PM, Cornelia Huck wrote: > The vfio-ccw documentation comes from the cover letter of the > original patch submission, which shows in some parts. Give it some > love; in particular: > > - Remove/rework statements that make sense in a cover letter, but not > in regular documentation. > - Fix some typos. > - Describe the current limitations in more detail. > > Signed-off-by: Cornelia Huck > --- Acked-by: Halil Pasic
Re: [Qemu-devel] [RFC PATCH 0/3] vfio: ccw: basic channel path event handling
On 01/15/2018 09:59 AM, Dong Jia Shi wrote: > * Halil Pasic [2018-01-12 19:10:20 +0100]: > >> >> >> On 01/11/2018 04:04 AM, Dong Jia Shi wrote: >>> What are still missing, thus need to be offered in the next version are: >>> - I/O termination and FSM state handling if currently we have I/O on the >>> status >>> switched path. >>> - Vary on/off event is not sensible to a guest. >> >> I don't see a doc update. We do have documentation (in >> Documentation/s390/vfio-ccw.txt) in which the uapi interface with the >> regions and their purpose/usage is at least kind of explained. You are >> changing this interface without updating the doc. >> >> I would like to see documentation on this because I'm under the >> impression either the design is pretty convoluted or I did not >> get it at all. > Ah, I missed the documentation part. Thanks for pointing out. I will add > it in the next cycle. > I would have preferred having a doc update in this cycle. E.g. as an answer to the cover letter. As previously pointed out I don't really understand your design. I wanted to avoid summarizing the design myself (that is my understanding of it), to then question the design. To give you a feeling of what I mean here some bullet points: * Channel paths are css level resources (simplified). * When a channel path malfunctions a CRW is generated and a machine check channel report pending is generated. Same goes for channel paths popping up (on HW level). Should not these get propagated? * Kind of instead of the CRW you introduce a per device interrupt for channel path events on the qemu/kvm boundary. AFAIU for a chp event you do an STSCH for each (affected?) subchannel and generate an irq. Right? Why is this a good idea. * SCHIB.PMCW provides path information relevant for a given device. This information is retrieved using store subchannel. Is your series sufficient for making store subchannel work properly (correct and reasonably up to date)? Regarding PMCW it's supposed to get updated when io functions are preformed by the css, AFAIR. Is that right? If yes what are the implications? Are the manipulations you do on some PMCW fields architecturally correct? * The one thing I would very much appreciate as an user of vfio, and should in my understanding be the user story of this series (and the qemu counterpart of course) is the following. If my device (that is IO operation) failed because no path could be found on which the device is accessible, I would like to be able to identify that. Preferably the same way I would do for an LPAR guest. Is this series accomplishing that? * Why not just do proper STSCH for vfio-ccw? * Shouldn't we virtualize CHPIDs? What if we have a clash? Lets say my dasd is only accessible via chp 0.00 in the host. Currently we have a problem there, or (as the only path would end up being ignored)? Note: this is another unpleasant effect of not having MCSSE in the guest. The original design was to have a separate css for vfio, and then we would not have this kind of a problem. (Virtualization of chps would still be good for migration.) Regards, Halil
Re: [PATCH 1/1] vfio-ccw: update documentation
On 01/15/2018 11:43 AM, Cornelia Huck wrote: > The vfio-ccw documentation comes from the cover letter of the > original patch submission, which shows in some parts. Give it some > love; in particular: > > - Remove/rework statements that make sense in a cover letter, but not > in regular documentation. > - Fix some typos. > - Describe the current limitations in more detail. > > Signed-off-by: Cornelia Huck The changes look sane and definitively improve on what we have right now. Acked by me if you like. When we are at the limitations, the prefetch story seems to be still missing. IMHO it's the most important limitation. Regards, Halil
Re: [PATCH 0/4] vfio: ccw: error handling fixes and improvements
On 03/26/2018 11:02 AM, Cornelia Huck wrote: > On Wed, 21 Mar 2018 03:08:18 +0100 > Dong Jia Shi wrote: > >> Hi Conny, >> >> Halil reported a host crash when using vfio-ccw. The root cause of the >> problem >> is that vfio_pin_pages fails with EINVAL for reasons unknown. He has >> experienced such failures after online-ing a dasd in the guest (the dasd has >> 3 >> partitions, hat may or may not have any significance). The problem isn't >> experienced on every attempt to online the dasd, and breaking at css_do_ssch >> seems to make things work. >> >> One thing is sure: the host kernel should not crash under the described >> circumstances. >> >> To fix the problem, the first patch of this series fixes the cleanup when >> cp_prefetch fails in the higher level. The 2nd and the 3rd patches provide >> correctness and denfensive actions for the interfaces in the lower level. > > So, is the first patch stable material? I think it should be. > >> >> The 4th patch is trying to add tracepoints for vfio-ccw, so that we can debug >> such issue easier in future. > > Tracepoints are nice :) > >> >> For details see the commit message portions of the inividual patches. > > Still digging through the post-vacation mail pile, will do more looking > later. > > >> Thanks. >> >> Dong Jia Shi (2): >> vfio: ccw: refactor and improve pfn_array_alloc_pin() >> vfio: ccw: set ccw->cda to NULL defensively >> >> Halil Pasic (2): >> vfio: ccw: fix cleanup if cp_prefetch fails >> vfio: ccw: add traceponits for interesting error paths >> >> drivers/s390/cio/Makefile | 1 + >> drivers/s390/cio/vfio_ccw_cp.c| 121 >> -- >> drivers/s390/cio/vfio_ccw_fsm.c | 13 >> drivers/s390/cio/vfio_ccw_trace.h | 86 +++ >> 4 files changed, 163 insertions(+), 58 deletions(-) >> create mode 100644 drivers/s390/cio/vfio_ccw_trace.h >> >
Re: [PATCH 4/4] vfio: ccw: add traceponits for interesting error paths
On 03/27/2018 12:07 PM, Cornelia Huck wrote: > On Tue, 27 Mar 2018 15:51:14 +0800 > Dong Jia Shi wrote: > >> * Cornelia Huck [2018-03-26 15:59:02 +0200]: >> >> [...] >> @@ -131,6 +138,8 @@ static void fsm_io_request(struct vfio_ccw_private *private, io_region->ret_code = cp_prefetch(&private->cp); if (io_region->ret_code) { + trace_vfio_ccw_cp_prefetch_failed(get_schid(private), +io_region->ret_code); cp_free(&private->cp); goto err_out; } @@ -138,6 +147,8 @@ static void fsm_io_request(struct vfio_ccw_private *private, /* Start channel program and wait for I/O interrupt. */ io_region->ret_code = fsm_io_helper(private); if (io_region->ret_code) { + trace_vfio_ccw_ssch_failed(get_schid(private), + io_region->ret_code); cp_free(&private->cp); goto err_out; } @@ -145,10 +156,12 @@ static void fsm_io_request(struct vfio_ccw_private *private, } else if (scsw->cmd.fctl & SCSW_FCTL_HALT_FUNC) { /* XXX: Handle halt. */ io_region->ret_code = -EOPNOTSUPP; + trace_vfio_ccw_halt(get_schid(private)); goto err_out; } else if (scsw->cmd.fctl & SCSW_FCTL_CLEAR_FUNC) { /* XXX: Handle clear. */ io_region->ret_code = -EOPNOTSUPP; + trace_vfio_ccw_clear(get_schid(private)); goto err_out; >>> >>> Hmmm perhaps better to just trace the function (start/halt/clear) >>> in any case? >>> >> I agree trace the function in any case is good. @Halil, opinion? See below. I don't really understand the question. Trace the function means, trace when it was requested on a subch, or trace the outcome of the request? Seems the question got amended though. >> >> But the traces for cp_prefetch() and fsm_io_helper() should also be >> kept, since they are helpful to debug problem. So I tend to trace the >> following in any case: >> - cp_prefetch() >> - fsm_io_helper() >> - start >> - halt >> - clear > > OK, I was unclear :) I'd argue to keep the others, just replace the > halt/clear tracing with tracing the function. I'm a bit confused. My idea was the following: Prior to this patch we had a kind of OK possibility to trace what we consider the expected and good scenario using the function tracer and the normal cio stuff. But what I wanted is to verify that my fix works (problem occurs but is handled more appropriately) and I've found it difficult to trace this. So the idea was to introduce trace points which tell us what went wrong. The idea is to benefit diagnostic of unrecoverable failures and get an idea how often are we doing extra work recovering recoverable failures. In this sense halt and clear is something that does not work currently. When we get proper halt and clear, these trace points were supposed to become obsolete and get removed. I guess the implementation will eventually issue csch() and hsch() for the underlying subchannel and and we should be able to trace those (see drivers/s390/cio/ioasm.c) Now this is the tricky part. I've read some used to see trace points as part of the kernel ABI. See e.g. https://lwn.net/Articles/705270/. AFAIU this is not a concern any more -- but my confidence on this is pretty low. So IMHO we have two questions to answer: * Do we want static trace points (events) for undesirable or concerning stuff happened (e.g. translation failed, a function we hope we can live without was supposed to be executed)? * Do we want static trace points (events) coverage for the normal path (that is beyond what cio and the function tracer already give us)? What benefit do we expect if we do want these? (E.g. performance evaluation, better debugging especially when multiple virtualized subchannels used.) > >> } [..] +DECLARE_EVENT_CLASS(vfio_ccw_notsupp, + TP_PROTO(struct subchannel_id schid), + TP_ARGS(schid), + + TP_STRUCT__entry( + __field_struct(struct subchannel_id, schid) + ), + + TP_fast_assign( + __entry->schid = schid; + ), + + TP_printk("(schid 0.%x.%04X) request not supported", + __entry->schid.ssid, __entry->schid.sch_no) +); >>> >>> Especially as I don't plan to leave this unsupported for too long :) Sounds great! I don't insist. Especially not if our linux guest tells us what went wrong. @Dong Jia: What would happen should the guest for some reason try to do a clear or a halt (e.g. we make it fail here, guest retries a couple of times and then panicks or gives up on the device)? >>> >>> Just tracing the function is useful now and
Re: [PATCH v3 04/14] KVM: s390: device attribute to set AP interpretive execution
On 03/20/2018 06:58 PM, Tony Krowiak wrote: > I spoke with Christian this morning and he made a suggestion which I think > would provide the best solution here. > This is my proposal: > 1. Get rid of the KVM_S390_VM_CRYPTO_INTERPRET_AP device attribute and return > to setting ECA.28 from the > mdev device open callback. > 2. Since there may be vcpus online at the time the mdev device open is > called, we must first take all running vcpus out of > SIE and block them. Christian suggested the kvm_s390_vcpu_block_all(struct > kvm *kvm) function will do the trick. So I > propose introducing a function like the following to be called during mdev > open: There is one thing you missed, otherwise I'm *very* satisfied with this proposal. What you have missed IMHO is vcpu hottplug. So IMHO you should keep kvm->arch.crypto.apie, and update it accordingly ... > > int kvm_ap_set_interpretive_exec(struct kvm *kvm, bool enable) > { > int i; > struct kvm_vcpu *vcpu; > > if (!test_kvm_cpu_feat(kvm, KVM_S390_VM_CPU_FEAT_AP)) > return -EOPNOTSUPP; > > mutex_lock(&kvm->lock); > > kvm_s390_vcpu_block_all(kvm); ... let's say here. > > kvm_for_each_vcpu(i, vcpu, kvm) { And here you can call kvm_s390_vcpu_crypto_setup(vcpu) (the changes to this function will be required for hotplug) if you like > if (enable) > vcpu->arch.sie_block->eca |= ECA_APIE; > else > vcpu->arch.sie_block->eca &= ~ECA_APIE; or keep this stuff, it does not really matter to me. > } > > kvm_s390_vcpu_unblock_all(kvm); > > mutex_unlock(&kvm->lock); > > return 0; > } > > This interface allows us to set ECA.28 even if vcpus are running I tend to agree. I will give it a proper review when this gets more formal (e.g. v4 (preferably) or patches to be fixed up to this series). Please don't forget to revisit the discussion on kvm_s390_vm_set_crypto: if the mechanism there isn't right for ECA.28 I think you should tell us why it's OK for the other attributes if it's OK. If it is not then I guess you will want to do a stand alone patch for that.
Re: [PATCH 1/4] vfio: ccw: fix cleanup if cp_prefetch fails
On 03/21/2018 03:08 AM, Dong Jia Shi wrote: > From: Halil Pasic > > If the translation of a channel program fails, we may end up attempting > to clean up (free, unpin) stuff that never got translated (and allocated, > pinned) in the first place. > > By adjusting the lengths of the chains accordingly (so the element that > failed, and all subsequent elements are excluded) cleanup activities > based on false assumptions can be avoided. > > Let's make sure cp_free works properly after cp_prefetch returns with an > error by setting ch_len to 0 for the ccw chains those are not prefetched. This sentence used to be: Let's make sure cp_free works properly after cp_prefetch returns with an error. @Dong Jia I find the 'by setting ch_len to 0 for the ccw chains those are not prefetched' you added for clarification (I guess) somewhat problematic. The chain in which the translation failure occurred + chain->ch_len = idx; is shortened so that only the translated elements (ccws) are going to get cleaned up (on a per element basis) by cp_free. This may or may not be the first ccw. Subsequent chains are shortened to 0 as there no translation took place. So as a result of this change only properly translated ccws are going to get (re)visited by cp_free as only those may have resources bound to them which need to be released. I'm not against improving the commit message. But this ain't an improvement to me. > > Acked-by: Pierre Morel > Reviewed-by: Dong Jia Shi > Signed-off-by: Halil Pasic > Signed-off-by: Dong Jia Shi > --- > drivers/s390/cio/vfio_ccw_cp.c | 9 - > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c > index d9a2fffd034b..2be114db02f9 100644 > --- a/drivers/s390/cio/vfio_ccw_cp.c > +++ b/drivers/s390/cio/vfio_ccw_cp.c > @@ -749,11 +749,18 @@ int cp_prefetch(struct channel_program *cp) > for (idx = 0; idx < len; idx++) { > ret = ccwchain_fetch_one(chain, idx, cp); > if (ret) > - return ret; > + goto out_err; > } > } > > return 0; > +out_err: > + /* Only cleanup the chain elements that where actually translated. */ > + chain->ch_len = idx; > + list_for_each_entry_continue(chain, &cp->ccwchain_list, next) { > + chain->ch_len = 0; > + } > + return ret; > } > > /** >
Re: [PATCH 1/4] vfio: ccw: fix cleanup if cp_prefetch fails
On 03/22/2018 10:37 AM, Pierre Morel wrote: > On 22/03/2018 03:22, Dong Jia Shi wrote: >> * Halil Pasic [2018-03-21 13:49:54 +0100]: >> >>> >>> On 03/21/2018 03:08 AM, Dong Jia Shi wrote: >>>> From: Halil Pasic >>>> >>>> If the translation of a channel program fails, we may end up attempting >>>> to clean up (free, unpin) stuff that never got translated (and allocated, >>>> pinned) in the first place. >>>> >>>> By adjusting the lengths of the chains accordingly (so the element that >>>> failed, and all subsequent elements are excluded) cleanup activities >>>> based on false assumptions can be avoided. >>>> >>>> Let's make sure cp_free works properly after cp_prefetch returns with an >>>> error by setting ch_len to 0 for the ccw chains those are not prefetched. >>> This sentence used to be: >>> >>> Let's make sure cp_free works properly after cp_prefetch returns with an >>> error. >>> >>> @Dong Jia >>> I find the 'by setting ch_len to 0 for the ccw chains those are not >>> prefetched' >>> you added for clarification (I guess) somewhat problematic. >>> The chain in which the translation failure occurred >>> + chain->ch_len = idx; >> I made a mistake. When rewording the message, I missed this part... >> Sorry for the problem! np >> >>> is shortened so that only the translated elements (ccws) are going to >>> get cleaned up (on a per element basis) by cp_free. This may or may >>> not be the first ccw. Subsequent chains are shortened to 0 as there >>> no translation took place. >>> >>> So as a result of this change only properly translated ccws are going >>> to get (re)visited by cp_free as only those may have resources bound >>> to them which need to be released. >>> >>> I'm not against improving the commit message. But this ain't >>> an improvement to me. >> You are right. How about: >> Let's make sure cp_free works properly after cp_prefetch returns with an >> error by setting ch_len of a ccw chain to the number of the translated >> ccws on that chain. Works with me. > > By the way, since you will propose a new version, > you have a long description of the cp_prefetch function in the code. > I think you should modify it according to the changes and describe how and > why the ch_len field of each chain is used and changed by this function. > > Something like: > > " > For each chain composing the channel program: > On entry ch_len hold the count of CCW to be translated. > On exit ch_len is adjusted to the count of successfully translated CCW. > > This allows cp_free to find in ch_len the count of CCW to free in a chain. > " Sounds good to me. Halil
Re: [PATCH 4/4] vfio: ccw: add traceponits for interesting error paths
On 04/10/2018 10:55 AM, Cornelia Huck wrote: > On Tue, 10 Apr 2018 10:16:39 +0800 > Dong Jia Shi wrote: > >> Does the following effect make sense? >> >> # tracer: nop >> # >> # _-=> irqs-off >> # / _=> need-resched >> #| / _---=> hardirq/softirq >> #|| / _--=> preempt-depth >> #||| / delay >> # TASK-PID CPU# TIMESTAMP FUNCTION >> # | | | | | >> qemu-system-s39-4252 [006] 231.457214: vfio_ccw_cp_prefetch: >> schid=0.0.013f errno=0 >> qemu-system-s39-4252 [006] 231.457222: vfio_ccw_fsm_io_helper: >> schid=0.0.013f errno=0 >> qemu-system-s39-4252 [006] 231.457223: vfio_ccw_io_fctl: >> schid=0.0.013f fctl=4 errno=0 >> ... ... > > I would likely find this useful for following a code path and making > sure the right things are called. > > We certainly want error conditions traced as well (although the code > has been working too well for me to trigger that easily :) > Looks interesting. The approach is to trace (all) exits from selected functions, or? It is an interesting approach. Function entry could probably be traced with the function tracer (if we should ever need that, although relating the two unambiguously may not be possible -- I don't know). I'm still not completely in clear how do we want to do error reporting. Using traces as means of error reporting smells like abuse to me. @Dong Jia: could you help me get an overview? I'm thinking of something like a matrix of type: error | handler | action (propagate as / report / try recover / discard silently) I'm mostly interested in what gets reported and if there is stuff that should be reported. Other than that I'm in favor. And having traces for tracking error condition is clearly better than having nothing. Regards, Halil
Re: [PATCH v11 26/26] s390: doc: detailed specifications for AP virtualization
On 09/27/2018 12:42 AM, Alex Williamson wrote: > On Tue, 25 Sep 2018 19:16:41 -0400 > Tony Krowiak wrote: > >> From: Tony Krowiak [..] >> + >> +2. Secure the AP queues to be used by the three guests so that the host can >> not >> + access them. To secure them, there are two sysfs files that specify >> + bitmasks marking a subset of the APQN range as 'usable by the default AP >> + queue device drivers' or 'not usable by the default device drivers' and >> thus >> + available for use by the vfio_ap device driver'. The sysfs files >> containing >> + the sysfs locations of the masks are: >> + >> + /sys/bus/ap/apmask >> + /sys/bus/ap/aqmask >> + >> + The 'apmask' is a 256-bit mask that identifies a set of AP adapter IDs >> + (APID). Each bit in the mask, from most significant to least significant >> bit, >> + corresponds to an APID from 0-255. If a bit is set, the APID is marked as >> + usable only by the default AP queue device drivers; otherwise, the APID >> is >> + usable by the vfio_ap device driver. >> + >> + The 'aqmask' is a 256-bit mask that identifies a set of AP queue indexes >> + (APQI). Each bit in the mask, from most significant to least significant >> bit, >> + corresponds to an APQI from 0-255. If a bit is set, the APQI is marked as >> + usable only by the default AP queue device drivers; otherwise, the APQI >> is >> + usable by the vfio_ap device driver. >> + >> + The APQN of each AP queue device assigned to the linux host is checked >> by the >> + AP bus against the set of APQNs derived from the cross product of APIDs >> + and APQIs marked as usable only by the default AP queue device drivers. >> If a >> + match is detected, only the default AP queue device drivers will be >> probed; >> + otherwise, the vfio_ap device driver will be probed. >> + >> + By default, the two masks are set to reserve all APQNs for use by the >> default >> + AP queue device drivers. There are two ways the default masks can be >> changed: >> + >> + 1. The masks can be changed at boot time with the kernel command line >> + like this: >> + >> + ap.apmask=0x ap.aqmask=0x40 >> + >> + This would give these two pools: >> + >> +default drivers pool:adapter 0-15, domain 1 >> +alternate drivers pool: adapter 16-255, domains 2-255 > > What happened to domain 0? Right, domain 0 is also 'alternate'. So it should have been alternate drivers pool: adapter 16-255, domains 0,2-255 > I'm also a little confused by the bit > ordering. If 0x40 is bit 1 and 0x is bits 0-15, then the least > significant bit is furthest left? Did I miss documentation of that? > Harald already tried to explain this, let me give it a try too. Yes it is a bit confusing. I would try to describe it like this: the big endian mask, which is of fixed length of 256 bytes is specified byte-wise using hexadecimal notation. If only a prefix of the whole mask is specified, the not explicitly specified bytes are specified are as if they were specified as zero. I didn't quite get this thing with 'the least significant bit is furthest left'. I think it is to the right if we assume we are reading left-to-right. It is big endian, so we consider the most significant bit of a byte to be the first bit, and the byte with the lowest address to be the first byte of the mask (that holds the first 8 bits of the mask). >> + >> + 2. The sysfs mask files can also be edited by echoing a string into the >> + respective file in one of two formats: >> + >> + * An absolute hex string starting with 0x - like "0x12345678" - sets >> +the mask. If the given string is shorter than the mask, it is padded >> +with 0s on the right. If the string is longer than the mask, the >> +operation is terminated with an error (EINVAL). > > And this does say zero padding on the right, but then in the next > bullet our hex digits use normal least significant bit right notation, > ie. 0x41 is 65, not 82, correct? The zero padding on the right is about the non specified bytes of the mask. While this bullet is about specifying a whole mask, the next butlet is about changing a mask by setting the value of bits at a certain position. So in the context of the next bullet point, the hex string here specifies an integer value -- plainly a number written in hexadecimal notation (pure math with no significant bits whatsoever) - in the range 0-256: the index of the bit to be set ('+') or cleared ('-'). I hope that makes some sense. As I said it's indeed a bit confusing. >> + >> + * A plus ('+') or minus ('-') followed by a numerical value. Valid >> +examples are "+1", "-13", "+0x41", "-0xff" and even "+0" and "-0". >> Only >> +the corresponding bit in the mask is switched on ('+') or off >> ('-'). The >> +values may also be specified in a comma-separated list to switch >> more >> +than one
Re: [FIXUP v11] fixup! s390: vfio-ap: implement mediated device open callback
On 09/28/2018 03:35 PM, Cornelia Huck wrote: > On Fri, 28 Sep 2018 09:33:21 -0400 > Tony Krowiak wrote: > >> From: Tony Krowiak >> >> Fixes case statement in vfio_ap_mdev_copy_masks() function in >> vfio-ap-ops.c. >> --- >> drivers/s390/crypto/vfio_ap_ops.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/s390/crypto/vfio_ap_ops.c >> b/drivers/s390/crypto/vfio_ap_ops.c >> index e1e1beaaeba5..86b42487a51c 100644 >> --- a/drivers/s390/crypto/vfio_ap_ops.c >> +++ b/drivers/s390/crypto/vfio_ap_ops.c >> @@ -741,11 +741,12 @@ static void vfio_ap_mdev_copy_masks(struct >> ap_matrix_mdev *matrix_mdev) >> break; >> case CRYCB_FORMAT1: >> case CRYCB_FORMAT0: >> -default: >> apm = (unsigned long *)crycb->apcb0.apm; >> aqm = (unsigned long *)crycb->apcb0.aqm; >> adm = (unsigned long *)crycb->apcb0.adm; >> break; >> +default: >> +return; > > /* cannot happen */ > > ? Or use something like BUG(). Without conveying that default is illegal we don't gain anything over the previous version IMHO. > >> } >> >> nbytes = DIV_ROUND_UP(matrix_mdev->matrix.apm_max + 1, BITS_PER_BYTE); >
Re: [PATCH 1/1] s390: vfio-ap: include for test_facility()
On Fri, 16 Nov 2018 11:47:48 +0100 Petr Tesarik wrote: > The driver uses test_facility(), but does not include the > corresponding include file explicitly. The driver currently builds > only thanks to the following include chain: > > vfio_ap_drv.c > > > > > > > > Files should not rely on such fragile implicit includes. > > Signed-off-by: Petr Tesarik > --- > drivers/s390/crypto/vfio_ap_drv.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/s390/crypto/vfio_ap_drv.c > b/drivers/s390/crypto/vfio_ap_drv.c index 7667b38728f0..31c6c847eaca > 100644 --- a/drivers/s390/crypto/vfio_ap_drv.c > +++ b/drivers/s390/crypto/vfio_ap_drv.c > @@ -11,6 +11,7 @@ > #include > #include > #include > +#include > #include "vfio_ap_private.h" > > #define VFIO_AP_ROOT_NAME "vfio_ap" Applied. Is going to go via Martins S390 tree. Thanks, Halil
Re: [PATCH v6 07/21] s390: vfio-ap: base implementation of VFIO AP device driver
On 06/29/2018 11:11 PM, Tony Krowiak wrote: Introduces a new AP device driver. This device driver is built on the VFIO mediated device framework. The framework provides sysfs interfaces that facilitate passthrough access by guests to devices installed on the linux host. The VFIO AP device driver will serve two purposes: 1. Provide the interfaces to reserve AP devices for exclusive use by KVM guests. This is accomplished by unbinding the devices to be reserved for guest usage from the default AP device driver and binding them to the VFIO AP device driver. 2. Implements the functions, callbacks and sysfs attribute interfaces required to create one or more VFIO mediated devices each of which will be used to configure the AP matrix for a guest and serve as a file descriptor for facilitating communication between QEMU and the VFIO AP device driver. When the VFIO AP device driver is initialized: * It registers with the AP bus for control of type 10 (CEX4 and newer) AP queue devices. This limitation was imposed due to: 1. A lack of access to older systems needed to test the older AP device models; 2. A desire to keep the code as simple as possible; 3. Some older models are no longer supported by the kernel and others are getting close to end of service. The probe and remove callbacks will be provided to support the binding/unbinding of AP queue devices to/from the VFIO AP device driver. * Creates a /sys/devices/vfio-ap/matrix device to hold the APQNs of the AP devices bound to the VFIO AP device driver and serves as the parent of the mediated devices created for each guest. Signed-off-by: Tony Krowiak --- MAINTAINERS | 10 +++ arch/s390/Kconfig | 11 +++ drivers/s390/crypto/Makefile |4 + drivers/s390/crypto/vfio_ap_drv.c | 140 + drivers/s390/crypto/vfio_ap_private.h | 29 +++ include/uapi/linux/vfio.h |2 + samples/bpf/bpf_load.c| 62 +++ You have probably touched the last one by accident. Regards, Halil
Re: [PATCH v6 05/21] KVM: s390: CPU model support for AP virtualization
On 07/02/2018 05:37 PM, Tony Krowiak wrote: On 07/02/2018 10:38 AM, Christian Borntraeger wrote: On 06/29/2018 11:11 PM, Tony Krowiak wrote: Introduces a new CPU model feature and two CPU model facilities to support AP virtualization for KVM guests. CPU model feature: The KVM_S390_VM_CPU_FEAT_AP feature indicates that AP instructions are available on the guest. This feature will be enabled by the kernel only if the AP instructions are installed on the linux host. This feature must be specifically turned on for the KVM guest from userspace to use the VFIO AP device driver for guest access to AP devices. CPU model facilities: 1. AP Query Configuration Information (QCI) facility is installed. This is indicated by setting facilities bit 12 for the guest. The kernel will not enable this facility for the guest if it is not set on the host. This facility must not be set by userspace if the KVM_S390_VM_CPU_FEAT_AP feature is not installed. If this facility is not set for the KVM guest, then only APQNs with an APQI less than 16 will be available to the guest regardless of the guest's matrix configuration. This is a limitation of the AP bus running on the guest. 2. AP Facilities Test facility (APFT) is installed. This is indicated by setting facilities bit 15 for the guest. The kernel will not enable this facility for the guest if it is not set on the host. This facility must not be set by userspace if the KVM_S390_VM_CPU_FEAT_AP feature is not installed. If this facility is not set for the KVM guest, then no AP devices will be available to the guest regardless of the guest's matrix configuration. This is a limitation of the AP bus running under the guest. Reviewed-by: Christian Borntraeger Reviewed-by: Halil Pasic Signed-off-by: Tony Krowiak I think it probably should be at the end of the series, other than that its good. If I move this to the end of the series, the very next patch checks the KVM_S390_VM_CPU_FEAT_AP feature? The point is the following: never expose a feature *before* it is actually provided. And this is exactly what you are doing here. AFAIU the userspace can happily negotiate KVM_S390_VM_CPU_FEAT_AP, do it's part of the job and still not have AP instructions in the guest if patches [0..5] are applied but [6..21] not. This is wrong. AFAIR I requested this one being squashed with the next one for exact this reason. That would be OK as starting with patch 6 applied we do satisfy what the features require. It's only that the interfaces required to set up the resources are not there yet and thus the features can't really be used meaningfully. Usually we expose the features at the end of a series, as such a series often just implements support for the given feature(s). In this special IMHO case we can get away with not doing so, but not exposing the feature until the end of the series could still have some merit. Anyway we should avoid exposing half-assed stuff. In that sense the resource cleanup (zapq) logic must not be introduced after resources can be acquired and utilized. Regards, Halil
Re: [PATCH v6 05/21] KVM: s390: CPU model support for AP virtualization
On 07/02/2018 06:11 PM, Cornelia Huck wrote: On Mon, 2 Jul 2018 11:54:28 -0400 Tony Krowiak wrote: On 07/02/2018 11:41 AM, Cornelia Huck wrote: On Mon, 2 Jul 2018 11:37:11 -0400 Tony Krowiak wrote: On 07/02/2018 10:38 AM, Christian Borntraeger wrote: On 06/29/2018 11:11 PM, Tony Krowiak wrote: Introduces a new CPU model feature and two CPU model facilities to support AP virtualization for KVM guests. CPU model feature: The KVM_S390_VM_CPU_FEAT_AP feature indicates that AP instructions are available on the guest. This feature will be enabled by the kernel only if the AP instructions are installed on the linux host. This feature must be specifically turned on for the KVM guest from userspace to use the VFIO AP device driver for guest access to AP devices. CPU model facilities: 1. AP Query Configuration Information (QCI) facility is installed. This is indicated by setting facilities bit 12 for the guest. The kernel will not enable this facility for the guest if it is not set on the host. This facility must not be set by userspace if the KVM_S390_VM_CPU_FEAT_AP feature is not installed. If this facility is not set for the KVM guest, then only APQNs with an APQI less than 16 will be available to the guest regardless of the guest's matrix configuration. This is a limitation of the AP bus running on the guest. 2. AP Facilities Test facility (APFT) is installed. This is indicated by setting facilities bit 15 for the guest. The kernel will not enable this facility for the guest if it is not set on the host. This facility must not be set by userspace if the KVM_S390_VM_CPU_FEAT_AP feature is not installed. If this facility is not set for the KVM guest, then no AP devices will be available to the guest regardless of the guest's matrix configuration. This is a limitation of the AP bus running under the guest. Reviewed-by: Christian Borntraeger Reviewed-by: Halil Pasic Signed-off-by: Tony Krowiak I think it probably should be at the end of the series, other than that its good. If I move this to the end of the series, the very next patch checks the KVM_S390_VM_CPU_FEAT_AP feature? Introduce it here, offer it only with the last patch? I apologize, but I don't know what you mean by this. Are you suggesting this patch should only include the #define for KVM_S390_VM_CPU_FEAT_AP? Yes, just introduce the definition here (so code later in the series can refer to it) and flip the switch (offer the bit) as the final patch. The other features introduced and exposed here are no different. For KVM_S390_VM_CPU_FEAT_AP defer exposing means defer allow_cpu_feat(); for the STFLE features, defer adding to FACILITIES_KVM_CPUMODEL. Anyway, I think the definition should be squashed into #6. Expose the features after patch #6 is in place or expose them at the end of the series is IMHO a matter of taste -- and I lean towards expose at the end of the series. Regards, Halil
Re: [PATCH v6 21/21] s390: doc: detailed specifications for AP virtualization
On 06/29/2018 11:11 PM, Tony Krowiak wrote: This patch provides documentation describing the AP architecture and design concepts behind the virtualization of AP devices. It also includes an example of how to configure AP devices for exclusive use of KVM guests. Signed-off-by: Tony Krowiak --- [..] + +Reserve APQNs for exclusive use of KVM guests +- +The following block diagram illustrates the mechanism by which APQNs are +reserved: + + +--+ + remove | | unbind + +--->+ cex4queue driver +<---+ + || || + |+--+| + || + || + || +++-+ register +--+ +-+--+ +| +<-+ | bind || +| ap_bus | | vfio_ap driver +<-+admin | +| +->+ | || ++--+ probe +---++-+ ++ + || + create || store APQN + || + vv + +---++-+ + | | + | matrix device | + | | + +--+ + +The process for reserving an AP queue for use by a KVM guest is: + +* The vfio-ap driver during its initialization will perform the following: + * Create the 'vfio_ap' root device - /sys/devices/vfio_ap + * Create the 'matrix' device in the 'vfio_ap' root + * Register the matrix device with the device core +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and + newer) and to provide the vfio_ap driver's probe and remove callback + interfaces. The reason why older devices are not supported is because there + are no systems available on which to test. +* The admin unbinds queue cc. from the cex4queue device driver. This results + in the ap_bus calling the the device driver's remove interface which + unbinds the cc. queue device from the driver. What if the queue cc. is already in use? AFAIU unbind is almost as radical as pulling a cable. What is the proper procedure an admin should follow before doing the unbind? +* The admin binds the cc. queue to the vfio_ap device driver. This results + in the ap_bus calling the device vfio_ap driver's probe interface to bind + queue cc. to the driver. The vfio_ap device driver will store the APQN for + the queue in the matrix device + [..]
Re: [PATCH v6 21/21] s390: doc: detailed specifications for AP virtualization
On 06/29/2018 11:11 PM, Tony Krowiak wrote: This patch provides documentation describing the AP architecture and design concepts behind the virtualization of AP devices. It also includes an example of how to configure AP devices for exclusive use of KVM guests. Signed-off-by: Tony Krowiak I don't like the design of external interfaces except for: * cpu model features, and * reset handling. In particular: 1) The architecture is such that authorizing access (via APM, AQM and ADM) to an AP queue that is currently not configured (e.g. the card not physically plugged, or just configured off). That seems to be a perfectly normal use case. Your assign operations however enforce that the resource is bound to your driver, and thus the existence of the resource in the host. It is clear: we need to avoid passing trough resources to guests that are not dedicated for this purpose (e.g. a queue utilized by zcrypt). But IMHO we need a different mechanism. 2) I see no benefit in deferring the exclusivity check to vfio_ap_mdev_open(). The downside is however pretty obvious: management software is notified about a 'bad configuration' only at an attempted guest start-up. And your current QEMU patches are not very helpful in conveying this piece of information. I've talked with Boris, and AFAIR he said this is not acceptable to him (@Boris can you confirm). 3) We indicate the reason for failure due to a configuration problem (exclusivity or resource allocation) via pr_err() that is via kernel messages. I don't think this is very tooling/management software friendly, and I hope we don't expect admins to work with the sysfs interface long term. I mean the effects of the admin actions are not very persistent. Thus if the interface is a painful one, we are talking about potentially frequent pain. 4) If I were to act out the role of the administrator, I would prefer to think of specifying or changing the access controls of a guest in respect to AP (that is setting the AP matrix) as a single atomic operation -- which either succeeds or fails. The operation should succeed for any valid configuration, and fail for any invalid on. The current piecemeal approach seems even less fitting if we consider changing the access controls of a running guest. AFAIK changing access controls for a running guest is possible, and I don't see a reason why should we artificially prohibit this. I think the current sysfs interface for manipulating the matrix is good for manual playing around, but I would prefer having an interface that is better suited for programs (e.g. ioctl). Regards, Halil
Re: [PATCH v6 21/21] s390: doc: detailed specifications for AP virtualization
On 07/03/2018 09:46 AM, Harald Freudenberger wrote: On 02.07.2018 18:28, Halil Pasic wrote: On 06/29/2018 11:11 PM, Tony Krowiak wrote: This patch provides documentation describing the AP architecture and design concepts behind the virtualization of AP devices. It also includes an example of how to configure AP devices for exclusive use of KVM guests. Signed-off-by: Tony Krowiak --- [..] + +Reserve APQNs for exclusive use of KVM guests +- +The following block diagram illustrates the mechanism by which APQNs are +reserved: + + +--+ + remove | | unbind + +--->+ cex4queue driver +<---+ + | | | | + | +--+ | + | | + | | + | | +++-+ register +--+ +-+--+ +| +<-+ | bind | | +| ap_bus | | vfio_ap driver +<-+ admin | +| +->+ | | | ++--+ probe +---++-+ ++ + | | + create | | store APQN + | | + v v + +---++-+ + | | + | matrix device | + | | + +--+ + +The process for reserving an AP queue for use by a KVM guest is: + +* The vfio-ap driver during its initialization will perform the following: + * Create the 'vfio_ap' root device - /sys/devices/vfio_ap + * Create the 'matrix' device in the 'vfio_ap' root + * Register the matrix device with the device core +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and + newer) and to provide the vfio_ap driver's probe and remove callback + interfaces. The reason why older devices are not supported is because there + are no systems available on which to test. +* The admin unbinds queue cc. from the cex4queue device driver. This results + in the ap_bus calling the the device driver's remove interface which + unbinds the cc. queue device from the driver. What if the queue cc. is already in use? AFAIU unbind is almost as radical as pulling a cable. What is the proper procedure an admin should follow before doing the unbind? What do you mean on this level with 'in use'? A unbind destroys the association between device and driver. There is no awareness of 'in use' or 'not in use' on this level. This is a hard unbind. Let me try to invoke the DASD analogy. If one for some reason wants to detach a DASD the procedure to follow seems to be (see https://www.ibm.com/support/knowledgecenter/en/linuxonibm/com.ibm.linux.z.lgdd/lgdd_t_dasd_online.html) the following: 1) Unmount. 2) Offline possibly using safe_offline. 3) Detach. Detaching a disk that is currently doing I/O asks for trouble, so the admin is encouraged to make sure there is no pending I/O. In case of AP you can interpret my 'in use' as the queue is not empty. In my understanding unbind is supposed to be hard (I used the word radical). That's why I compared it to pulling a cable. So that's why I ask is there stuff the admin is supposed to do before doing the unbind. Does that answer your question? Regards, Halil
Re: [PATCH v6 21/21] s390: doc: detailed specifications for AP virtualization
On 07/03/2018 01:52 PM, Cornelia Huck wrote: On Tue, 3 Jul 2018 11:22:10 +0200 Halil Pasic wrote: [..] Let me try to invoke the DASD analogy. If one for some reason wants to detach a DASD the procedure to follow seems to be (see https://www.ibm.com/support/knowledgecenter/en/linuxonibm/com.ibm.linux.z.lgdd/lgdd_t_dasd_online.html) the following: 1) Unmount. 2) Offline possibly using safe_offline. 3) Detach. Detaching a disk that is currently doing I/O asks for trouble, so the admin is encouraged to make sure there is no pending I/O. I don't think we can use dasd (block devices) as a good analogy for every kind of device (for starters, consider network devices). I did not use it for every kind of device. I used it for AP. I'm under the impression you find the analogy inappropriate. If, could you please explain why? In case of AP you can interpret my 'in use' as the queue is not empty. In my understanding unbind is supposed to be hard (I used the word radical). That's why I compared it to pulling a cable. So that's why I ask is there stuff the admin is supposed to do before doing the unbind. Are you asking for a kind of 'quiescing' operation? I would hope that the crypto drivers already can deal with that via flushing the queue, not allowing new requests, or whatever. This is not the block device case. The current implementation of vfio-ap which is a crypto driver too certainly can not deal 'with that'. Whether the rest of the drivers can, I don't know. Maybe Tony can tell. I'm aware of the fact that AP adapters are not block devices. But as stated above I don't understand what is the big difference regarding the unbind operation. Anyway, this is an administrative issue. If you don't have a clear concept which devices are for host usage and which for guest usage, you already have problems. I'm trying to understand the whole solution. I agree, this is an administrative issue. But the document is trying to address such administrative issues. Speaking of administrative issues, is there libvirt support for vfio-ap under development? It would be helpful to validate the approach. I full-heartedly agree. I guess Tony will have to answer this one too. Regards, Halil
Re: [PATCH v6 21/21] s390: doc: detailed specifications for AP virtualization
On 07/03/2018 03:25 PM, Cornelia Huck wrote: On Tue, 3 Jul 2018 14:20:11 +0200 Halil Pasic wrote: On 07/03/2018 01:52 PM, Cornelia Huck wrote: On Tue, 3 Jul 2018 11:22:10 +0200 Halil Pasic wrote: [..] Let me try to invoke the DASD analogy. If one for some reason wants to detach a DASD the procedure to follow seems to be (see https://www.ibm.com/support/knowledgecenter/en/linuxonibm/com.ibm.linux.z.lgdd/lgdd_t_dasd_online.html) the following: 1) Unmount. 2) Offline possibly using safe_offline. 3) Detach. Detaching a disk that is currently doing I/O asks for trouble, so the admin is encouraged to make sure there is no pending I/O. I don't think we can use dasd (block devices) as a good analogy for every kind of device (for starters, consider network devices). I did not use it for every kind of device. I used it for AP. I'm under the impression you find the analogy inappropriate. If, could you please explain why? I don't think block devices (which are designed to be more or less permanently accessed, e.g. by mounting a file system) have the same semantics as ap devices (which exist as a backend for crypto requests). Not everything that makes sense for a block device makes sense for other devices as well, and I don't think it makes sense here. I'm still confused. If it's about frequency of access (as hinted by block devices accessed more or less permanently) I'm not sure there is a substantial difference. I guess there are scenarios where the AP domain is used very seldom (e.g. protected keys --> most of the crypto ops done by CPACF but AP unwraps at the beginning), but there are such scenarios for block too. If it's about (persistent) state, I guess it again depends on the scenario and on the type of the card. But I may be wrong. In case of AP you can interpret my 'in use' as the queue is not empty. In my understanding unbind is supposed to be hard (I used the word radical). That's why I compared it to pulling a cable. So that's why I ask is there stuff the admin is supposed to do before doing the unbind. Are you asking for a kind of 'quiescing' operation? I would hope that the crypto drivers already can deal with that via flushing the queue, not allowing new requests, or whatever. This is not the block device case. The current implementation of vfio-ap which is a crypto driver too certainly can not deal 'with that'. Whether the rest of the drivers can, I don't know. Maybe Tony can tell. If the current implementation of vfio-ap cannot deal with it (by cleaning up, blocking, etc.), it needs at the very least be documented so that it can be implemented later. I do not know what the SIE will or won't do to assist here (e.g., if you're removing it from some masks, the device will already be inaccessible to the guest). But the part you were referring to was talking about the existing host driver anyway, wasn't it? I was thinking about both directions. Re-classifying a device form pass-through to normal should also be possible. But the document only talks about one direction. I'm not familiar with the existing host drivers. If we can say 'Hey, unbind is perfectly safe at any time: no per-cautions need to be considered!' I'm very happy with that. Although I would find it a bit surprising. I just wanted to make sure this is not something we forget. I'm aware of the fact that AP adapters are not block devices. But as stated above I don't understand what is the big difference regarding the unbind operation. Anyway, this is an administrative issue. If you don't have a clear concept which devices are for host usage and which for guest usage, you already have problems. I'm trying to understand the whole solution. I agree, this is an administrative issue. But the document is trying to address such administrative issues. I'd assume "know which devices are for the host and which devices are for the guests" to be a given, no? My other email scratches this topic. AFAIK we don't have a solution for that yet. Nor we have a good understanding of how and to what extent is statically given what is given. E.g. if one wants to re-partition my AP resources (and at some point one will have to at least do the initial re-partitioning) do I need a reboot for the changes to take effect? Or is this 'known' variable during the uptime of an OS. @Tony: Please feel free to fill the gaps in my understanding. Regards, Halil
Re: [PATCH v6 21/21] s390: doc: detailed specifications for AP virtualization
On 07/09/2018 05:21 AM, Pierre Morel wrote: On 03/07/2018 01:10, Halil Pasic wrote: On 06/29/2018 11:11 PM, Tony Krowiak wrote: This patch provides documentation describing the AP architecture and design concepts behind the virtualization of AP devices. It also includes an example of how to configure AP devices for exclusive use of KVM guests. Signed-off-by: Tony Krowiak I don't like the design of external interfaces except for: * cpu model features, and * reset handling. In particular: ...snip... 4) If I were to act out the role of the administrator, I would prefer to think of specifying or changing the access controls of a guest in respect to AP (that is setting the AP matrix) as a single atomic operation -- which either succeeds or fails. The operation should succeed for any valid configuration, and fail for any invalid on. The current piecemeal approach seems even less fitting if we consider changing the access controls of a running guest. AFAIK changing access controls for a running guest is possible, and I don't see a reason why should we artificially prohibit this. I think the current sysfs interface for manipulating the matrix is good for manual playing around, but I would prefer having an interface that is better suited for programs (e.g. ioctl). I disagree with using ioctl. Why? What speaks against ioctl? I agree that the current implementation is not right. The configuration of APM and AQM should always be guarantied as coherent within the host but it can be done doing the right checks when using the sysfs. I'm glad we agree on this one at least. Regards, Halil
Re: [PATCH v5 06/13] KVM: s390: interfaces to manage guest's AP matrix
On 05/07/2018 05:11 PM, Tony Krowiak wrote: Provides interfaces to manage the AP adapters, usage domains and control domains assigned to a KVM guest. The guest's SIE state description has a satellite structure called the Crypto Control Block (CRYCB) containing three bitmask fields identifying the adapters, queues (domains) and control domains assigned to the KVM guest: [..] index 00bcfb0..98b53c7 100644 --- a/arch/s390/kvm/kvm-ap.c +++ b/arch/s390/kvm/kvm-ap.c @@ -7,6 +7,7 @@ [..] + +/** + * kvm_ap_validate_queue_sharing + * + * Verifies that the APQNs derived from the cross product of the AP adapter IDs + * and AP queue indexes comprising the AP matrix are not configured for + * another guest. AP queue sharing is not allowed. + * + * @kvm: the KVM guest + * @matrix: the AP matrix + * + * Returns 0 if the APQNs are valid, otherwise; returns -EBUSY. + */ +static int kvm_ap_validate_queue_sharing(struct kvm *kvm, +struct kvm_ap_matrix *matrix) +{ + struct kvm *vm; + unsigned long *apm, *aqm; + unsigned long apid, apqi; + + + /* No other VM may share an AP Queue with the input VM */ + list_for_each_entry(vm, &vm_list, vm_list) { + if (kvm == vm) + continue; + + apm = kvm_ap_get_crycb_apm(vm); + if (!bitmap_and(apm, apm, matrix->apm, matrix->apm_max + 1)) + continue; + + aqm = kvm_ap_get_crycb_aqm(vm); + if (!bitmap_and(aqm, aqm, matrix->aqm, matrix->aqm_max + 1)) + continue; + + for_each_set_bit_inv(apid, apm, matrix->apm_max + 1) + for_each_set_bit_inv(apqi, aqm, matrix->aqm_max + 1) + kvm_ap_log_sharing_err(vm, apid, apqi); + + return -EBUSY; + } + + return 0; +} + +int kvm_ap_configure_matrix(struct kvm *kvm, struct kvm_ap_matrix *matrix) +{ + int ret = 0; + + mutex_lock(&kvm->lock); You seem to take only kvm->lock, vm_list however (used in kvm_ap_validate_queue_sharing()) seems to be protected by kvm_lock. Can you tell me why is this supposed to be safe? What is supposed to prevent an execution like vm1: call kvm_ap_configure_matrix(m1) vm2: call kvm_ap_configure_matrix(m2) vm1: call kvm_ap_validate_queue_sharing(m1) vm2: call kvm_ap_validate_queue_sharing(m2) vm1: call kvm_ap_set_crycb_masks(m1) vm2: call kvm_ap_set_crycb_masks(m2) where, let's say, m1 and m2 are equal in the sense that the mask values are the same? Regards, Halil + + ret = kvm_ap_validate_queue_sharing(kvm, matrix); + if (ret) + goto done; + + kvm_ap_set_crycb_masks(kvm, matrix); + +done: + mutex_unlock(&kvm->lock); + + return ret; +} +EXPORT_SYMBOL(kvm_ap_configure_matrix); +
Re: [PATCH v5 05/13] s390: vfio-ap: register matrix device with VFIO mdev framework
On 05/07/2018 05:11 PM, Tony Krowiak wrote: Registers the matrix device created by the VFIO AP device driver with the VFIO mediated device framework. Registering the matrix device will create the sysfs structures needed to create mediated matrix devices each of which will be used to configure the AP matrix for a guest and connect it to the VFIO AP device driver. [..] diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c new file mode 100644 index 000..d7d36fb --- /dev/null +++ b/drivers/s390/crypto/vfio_ap_ops.c @@ -0,0 +1,106 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Adjunct processor matrix VFIO device driver callbacks. + * + * Copyright IBM Corp. 2017 + * Author(s): Tony Krowiak + * + */ +#include +#include +#include +#include +#include + +#include "vfio_ap_private.h" + +#define VFOP_AP_MDEV_TYPE_HWVIRT "passthrough" +#define VFIO_AP_MDEV_NAME_HWVIRT "VFIO AP Passthrough Device" + +static int vfio_ap_mdev_create(struct kobject *kobj, struct mdev_device *mdev) +{ + struct ap_matrix *ap_matrix = to_ap_matrix(mdev_parent_dev(mdev)); + + ap_matrix->available_instances--; + + return 0; +} + +static int vfio_ap_mdev_remove(struct mdev_device *mdev) +{ + struct ap_matrix *ap_matrix = to_ap_matrix(mdev_parent_dev(mdev)); + + ap_matrix->available_instances++; + + return 0; +} + The above functions seem to be called with the lock of this auto-generated mdev parent device held. That's why we don't have to care about synchronization ourselves, right? A small comment in the code could be helpful for mdev non-experts. Hell, I would even consider documenting it for all mdev -- took me some time to figure out. [..] +int vfio_ap_mdev_register(struct ap_matrix *ap_matrix) +{ + int ret; + + ret = mdev_register_device(&ap_matrix->device, &vfio_ap_matrix_ops); + if (ret) + return ret; + + ap_matrix->available_instances = AP_MATRIX_MAX_AVAILABLE_INSTANCES; + + return 0; +} + +void vfio_ap_mdev_unregister(struct ap_matrix *ap_matrix) +{ + ap_matrix->available_instances--; What is this for? I don't understand. Regards, Halil + mdev_unregister_device(&ap_matrix->device); +}
Re: [PATCH 1/1] s390: vfio-ccw: push down unsupported IDA check
On 05/14/2018 01:55 PM, Cornelia Huck wrote: On Wed, 9 May 2018 19:36:47 +0200 Halil Pasic wrote: There is at least one relevant control program (CP) that don't set the I'd prefer not to talk about 'control program' here, as it is not a term commonly used in Linux. Call it 'guest'? Also, s/don't/doesn't/ I will use guest instead. IDA flags in the ORB as we would like them, but never uses any IDA. So instead of saying -EOPNOTSUPP when observing an ORB such that a channel program specified by it could be a not supported one, let us say -EOPNOTSUPP only if the channel program is a not supported one. Of course, the real solution would be doing proper translation for all IDA. This is possible, but given the current code not straight forward. I agree, this seems useful for now, but we really need to support the different ida flags to be fully architecture compliant. I think this support is deeply buried in Dong Jia's backlog. FWIW I'm unaware of any (relevant) exploiter (guest) for the old IDA. Thus testing could also prove challenging, that is require extra test code. So given the estimated pain/gain ratio I don't see this coming soon. With my QEMU changes related to this patch we will also get the full IDA support as soon as the kernel is there. Signed-off-by: Halil Pasic Tested-by: Jason J. Herne --- QEMU counterpart comming soon. --- drivers/s390/cio/vfio_ccw_cp.c | 19 --- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c index 2c7550797ec2..adfff492dc83 100644 --- a/drivers/s390/cio/vfio_ccw_cp.c +++ b/drivers/s390/cio/vfio_ccw_cp.c @@ -365,6 +365,9 @@ static void cp_unpin_free(struct channel_program *cp) * This is the chain length not considering any TICs. * You need to do a new round for each TIC target. * + * The program is also validated for absence of not yet supported + * indirect data addressing scenarios. + * * Returns: the length of the ccw chain or -errno. */ static int ccwchain_calc_length(u64 iova, struct channel_program *cp) @@ -391,6 +394,14 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp) do { cnt++; + /* +* 2k byte block IDAWs (fmt1 or fmt2) are not yet supported. +* There are however CPs that don't use IDA at all, and can +* benefit from not failing until failure is eminent. The second sentence is confusing (What is 'CP' referring to here? 'Control program' or struct channel_program?) Control program. I was under impression that in mainframe context CP mostly stands for control program. What about: "As we don't want to fail direct addressing even if the orb specified one of the unsupported formats, we defer checking for IDAWs in unsupported formats to here." Was the second sentence only confusing because of CP? I'm not perfectly satisfied with your version either: * 'fail direct addressing even if the orb specified one of the unsupported formats' I wanted to say: 'hey it does not matter what format for IDA the orb implies if the channel program does not use any IDA at all'. That could be paraphrased as channel programs using direct addressing exclusively. But failing the direct addressing does not fit for me. * 'defer' is IMHO trivial from the perspective that we used to fence the unsupported scenarios earlier (by just looking at the orb). But if one just reads the new code defer does not make much sense to me. But no strong opinions here. If you think your version is the way to go I will just take it. +*/ + if ((!cp->orb.cmd.c64 || cp->orb.cmd.i2k) && ccw_is_idal(ccw)) + return -EOPNOTSUPP; + if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw))) break; @@ -656,10 +667,8 @@ int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb) /* * XXX: * Only support prefetch enable mode now. -* Only support 64bit addressing idal. -* Only support 4k IDAW. */ - if (!orb->cmd.pfch || !orb->cmd.c64 || orb->cmd.i2k) + if (!orb->cmd.pfch) return -EOPNOTSUPP; INIT_LIST_HEAD(&cp->ccwchain_list); @@ -688,6 +697,10 @@ int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb) ret = ccwchain_loop_tic(chain, cp); if (ret) cp_unpin_free(cp); + /* It is safe to force: if not set but idals used +* ccwchain_calc_length returns an error. s/returns/already returned/ ? Yes we can do that. I think returns is also grammatical. Present simple can be used for expressing something that is always true. +*
Re: [PATCH 1/1] s390: vfio-ccw: push down unsupported IDA check
On 05/14/2018 04:00 PM, Cornelia Huck wrote: On Mon, 14 May 2018 15:37:17 +0200 Halil Pasic wrote: On 05/14/2018 01:55 PM, Cornelia Huck wrote: On Wed, 9 May 2018 19:36:47 +0200 Halil Pasic wrote: [..] + /* +* 2k byte block IDAWs (fmt1 or fmt2) are not yet supported. +* There are however CPs that don't use IDA at all, and can +* benefit from not failing until failure is eminent. The second sentence is confusing (What is 'CP' referring to here? 'Control program' or struct channel_program?) Control program. I was under impression that in mainframe context CP mostly stands for control program. Yes, but it is very confusing as there is also a variable named 'cp' in this function. Right. I was coming from the PoP side of things. But I do agree. What about: "As we don't want to fail direct addressing even if the orb specified one of the unsupported formats, we defer checking for IDAWs in unsupported formats to here." Was the second sentence only confusing because of CP? I'm not perfectly satisfied with your version either: * 'fail direct addressing even if the orb specified one of the unsupported formats' I wanted to say: 'hey it does not matter what format for IDA the orb implies if the channel program does not use any IDA at all'. That could be paraphrased as channel programs using direct addressing exclusively. But failing the direct addressing does not fit for me. But that's effectively what happens now, no? We reject the orb out of hand due to unsupported flags that do not have any relevance for the channel program in that case. Yes, that's what happens now, except that we make the whole channel program fail, and not the direct addressing. But the comment should describe what happens with the patch applied. Or maybe 'channel programs using direct addressing only'? * 'defer' is IMHO trivial from the perspective that we used to fence the unsupported scenarios earlier (by just looking at the orb). But if one just reads the new code defer does not make much sense to me. I think it still makes sense if you look at how the functions are called. But no strong opinions here. If you think your version is the way to go I will just take it. I certainly don't want to dictate things :) No problem. I'm aware of my limitations when it comes to producing readable text. In particular, my judgment about well readable or not is not trustworthy. So your input is highly appreciated. I will take your version, unless there is development. +*/ + if ((!cp->orb.cmd.c64 || cp->orb.cmd.i2k) && ccw_is_idal(ccw)) + return -EOPNOTSUPP; + if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw))) break; @@ -656,10 +667,8 @@ int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb) /* * XXX: * Only support prefetch enable mode now. -* Only support 64bit addressing idal. -* Only support 4k IDAW. */ - if (!orb->cmd.pfch || !orb->cmd.c64 || orb->cmd.i2k) + if (!orb->cmd.pfch) return -EOPNOTSUPP; INIT_LIST_HEAD(&cp->ccwchain_list); @@ -688,6 +697,10 @@ int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb) ret = ccwchain_loop_tic(chain, cp); if (ret) cp_unpin_free(cp); + /* It is safe to force: if not set but idals used +* ccwchain_calc_length returns an error. s/returns/already returned/ ? Yes we can do that. I think returns is also grammatical. Present simple can be used for expressing something that is always true. I think it makes it clearer that we already checked earlier in the call sequence. Will do. +*/ + cp->orb.cmd.c64 = 1; return ret; } The patch looks sane, I have only issues with the description/comments. Thanks for having a look. Please give me short feedback about the one open point and I will respin with the requested changes. Does anybody else have feedback? Will wait a day or so. Dong Jia and Jason have already seen the patch, and they only complained about the text. Since that spin was mainly for the tested-by tags, and I stated that any substantial discussion should happen upstream, I ignored those complaints. So yes I will wait a bit so everybody can chime in. Regards, Halil
Re: [PATCH v6 21/21] s390: doc: detailed specifications for AP virtualization
On 07/03/2018 04:30 PM, Cornelia Huck wrote: On Tue, 3 Jul 2018 15:58:37 +0200 Halil Pasic wrote: On 07/03/2018 03:25 PM, Cornelia Huck wrote: On Tue, 3 Jul 2018 14:20:11 +0200 Halil Pasic wrote: On 07/03/2018 01:52 PM, Cornelia Huck wrote: On Tue, 3 Jul 2018 11:22:10 +0200 Halil Pasic wrote: [..] Let me try to invoke the DASD analogy. If one for some reason wants to detach a DASD the procedure to follow seems to be (see https://www.ibm.com/support/knowledgecenter/en/linuxonibm/com.ibm.linux.z.lgdd/lgdd_t_dasd_online.html) the following: 1) Unmount. 2) Offline possibly using safe_offline. 3) Detach. Detaching a disk that is currently doing I/O asks for trouble, so the admin is encouraged to make sure there is no pending I/O. I don't think we can use dasd (block devices) as a good analogy for every kind of device (for starters, consider network devices). I did not use it for every kind of device. I used it for AP. I'm under the impression you find the analogy inappropriate. If, could you please explain why? I don't think block devices (which are designed to be more or less permanently accessed, e.g. by mounting a file system) have the same semantics as ap devices (which exist as a backend for crypto requests). Not everything that makes sense for a block device makes sense for other devices as well, and I don't think it makes sense here. I'm still confused. If it's about frequency of access (as hinted by block devices accessed more or less permanently) I'm not sure there is a substantial difference. I guess there are scenarios where the AP domain is used very seldom (e.g. protected keys --> most of the crypto ops done by CPACF but AP unwraps at the beginning), but there are such scenarios for block too. If it's about (persistent) state, I guess it again depends on the scenario and on the type of the card. But I may be wrong. So, let's turn this around: Why do you think that dasd (and not qeth or whatever) is a good model for ap device unbinding? Because I really fail to get it... maybe the ap driver maintainers can chime in. Let's do it! But let me clarify one thing first I never stated that dasd is the only good model. What speaks for dasd as a model for unbinding: * DASD is currently the only device we have vfio-mdev passthrough for on s390x. * DASD is comparatively simple and familiar. I'm not less confident to talk about qeth or whatever else than to talk about DASD. * DASD has persistent state. A NIC is much more stateless. * DASD has offline and safe_offline. This kind of demonstrates that the stock operation may trade 'safety' for stuff (e.g. guarantee to terminate). Since the queue reset implemented by Tony has a limited wait built in this seemed relevant. * DASD can be seen as request-response with some local-ish stuff as opposed to sending and receiving packets in a probably largish network. The idea of outstanding operations is easy to gasp. * From expectations of the upper layer entities a block device seems to be a better fit than a network interface. Fault recovery is less of a concern for an application that writes to a file, than for an application that tires to talk to an other application over the net. In my experience connections break more often that disks or I suppose AP domains. What is so wrong about asking the question: Is really unbind all the admin has to do? In case of AP you can interpret my 'in use' as the queue is not empty. In my understanding unbind is supposed to be hard (I used the word radical). That's why I compared it to pulling a cable. So that's why I ask is there stuff the admin is supposed to do before doing the unbind. Are you asking for a kind of 'quiescing' operation? I would hope that the crypto drivers already can deal with that via flushing the queue, not allowing new requests, or whatever. This is not the block device case. The current implementation of vfio-ap which is a crypto driver too certainly can not deal 'with that'. Whether the rest of the drivers can, I don't know. Maybe Tony can tell. If the current implementation of vfio-ap cannot deal with it (by cleaning up, blocking, etc.), it needs at the very least be documented so that it can be implemented later. I do not know what the SIE will or won't do to assist here (e.g., if you're removing it from some masks, the device will already be inaccessible to the guest). But the part you were referring to was talking about the existing host driver anyway, wasn't it? I was thinking about both directions. Re-classifying a device form pass-through to normal should also be possible. But the document only talks about one direction. Presumably because it (rightfully) focuses on setting up vfio-ap? I'm afraid we have a misunderstanding here. I did not propose to include the ot
Re: [PATCH v6 09/21] s390: vfio-ap: structure for storing mdev matrix
On 06/29/2018 11:11 PM, Tony Krowiak wrote: From: Tony Krowiak Introduces a new structure for storing the AP matrix configured for the mediated matrix device via its sysfs attributes files. Signed-off-by: Tony Krowiak --- drivers/s390/crypto/vfio_ap_ops.c | 12 drivers/s390/crypto/vfio_ap_private.h | 24 2 files changed, 36 insertions(+), 0 deletions(-) diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c index 4e61e33..bf7ed9f 100644 --- a/drivers/s390/crypto/vfio_ap_ops.c +++ b/drivers/s390/crypto/vfio_ap_ops.c @@ -20,6 +20,17 @@ DEFINE_SPINLOCK(mdev_list_lock); LIST_HEAD(mdev_list); +static void vfio_ap_matrix_init(struct ap_matrix *matrix) +{ + /* Test if PQAP(QCI) instruction is available */ + if (test_facility(12)) + ap_qci(&matrix->info); + + matrix->apm_max = matrix->info.apxa ? matrix->info.Na : 63; + matrix->aqm_max = matrix->info.apxa ? matrix->info.Nd : 15; + matrix->adm_max = matrix->info.apxa ? matrix->info.Nd : 15; +} + static int vfio_ap_mdev_create(struct kobject *kobj, struct mdev_device *mdev) { struct ap_matrix_dev *matrix_dev = @@ -31,6 +42,7 @@ static int vfio_ap_mdev_create(struct kobject *kobj, struct mdev_device *mdev) return -ENOMEM; matrix_mdev->name = dev_name(mdev_dev(mdev)); + vfio_ap_matrix_init(&matrix_mdev->matrix); mdev_set_drvdata(mdev, matrix_mdev); if (atomic_dec_if_positive(&matrix_dev->available_instances) < 0) { diff --git a/drivers/s390/crypto/vfio_ap_private.h b/drivers/s390/crypto/vfio_ap_private.h index 3de1275..ae771f5 100644 --- a/drivers/s390/crypto/vfio_ap_private.h +++ b/drivers/s390/crypto/vfio_ap_private.h @@ -29,9 +29,33 @@ struct ap_matrix_dev { atomic_t available_instances; }; +/** + * The AP matrix is comprised of three bit masks identifying the adapters, + * queues (domains) and control domains that belong to an AP matrix. The bits i + * each mask, from least significant to most significant bit, correspond to IDs + * 0 to 255. When a bit is set, the corresponding ID belongs to the matrix. + * + * @apm identifies the AP adapters in the matrix + * @apm_max: max adapter number in @apm + * @aqm identifies the AP queues (domains) in the matrix + * @aqm_max: max domain number in @aqm + * @adm identifies the AP control domains in the matrix + * @adm_max: max domain number in @adm + */ +struct ap_matrix { + unsigned long apm_max; + DECLARE_BITMAP(apm, 256); + unsigned long aqm_max; + DECLARE_BITMAP(aqm, 256); + unsigned long adm_max; + DECLARE_BITMAP(adm, 256); + struct ap_config_info info; Why do we maintain (and populate by doing a QCI) the info member on a per mdev device basis? +}; + struct ap_matrix_mdev { const char *name; struct list_head list; + struct ap_matrix matrix; }; static struct ap_matrix_dev *to_ap_matrix_dev(struct device *dev)
Re: [PATCH v3] vfio/mdev: Check globally for duplicate devices
On 05/17/2018 10:09 AM, Cornelia Huck wrote: [Dong Jia, Halil: Can you please take a look whether vfio-ccw is really ok? I don't think we open up any new races, but I'd appreciate a second or third opinion.] I will wait for things to settle a bit before I start reviewing the synchronization for mdev and vfio-ccw. I suspect there will be a v4. I would appreciate a cc, so I don't miss it. Halil
Re: [PATCH v9 12/22] s390: vfio-ap: sysfs interfaces to configure control domains
On 08/20/2018 04:23 PM, Cornelia Huck wrote: On Mon, 13 Aug 2018 17:48:09 -0400 Tony Krowiak wrote: From: Tony Krowiak Provides the sysfs interfaces for: 1. Assigning AP control domains to the mediated matrix device 2. Unassigning AP control domains from a mediated matrix device 3. Displaying the control domains assigned to a mediated matrix device The IDs of the AP control domains assigned to the mediated matrix device are stored in an AP domain mask (ADM). The bits in the ADM, from most significant to least significant bit, correspond to AP domain numbers 0 to 255. On some systems, the maximum allowable domain number may be less than 255 - depending upon the host's AP configuration - and assignment may be rejected if the input domain ID exceeds the limit. Please remind me of the relationship between control domains and usage domains... IIRC, usage domains allow both requests and configuration, while control domains allow only configuration, and are by convention a superset of usage domains. The whole terminology with control and usage domains is IMHO a bit confusing. With the HMC one can assign a domain either as a 'Control' or as a 'Control and Usage' domain. Regarding the masks in the CRYCB, the AQM controls 'using' the domain (e.g. if AQM bit is zero NQAP will refuse to enqueue on that queue) while ADM tells if the guest is allowed to 'change' the given domain. Whether a command-request is of type 'using' or 'changing' is a property of the command request. You can think of 'using' a domain like signing stuff with a key residing within the domain, and of 'changing' a domain like issuing a command to generate a new key for the given domain. Is there a hard requirement somewhere in there, or can the admin cheerfully use different masks for usage domains and control domains without the SIE choking on it? It is a convention. AFAIR it ain't architecture. SIE won't choke on it I've tried it out. I was arguing along the lines that the kernel should not enforce this convention -- tooling can still do that if we want this enforced. Regards, Halil
Re: [PATCH] MAINTAINERS/vfio-ccw: add Farhan and Eric, make Halil Reviewer
On Wed, 12 Dec 2018 15:59:29 +0100 Christian Borntraeger wrote: > Eric and Farhan will help with maintaining vfio-ccw. > > Cc: Cornelia Huck > Cc: Halil Pasic > Cc: Farhan Ali > Cc: Eric Farman > Signed-off-by: Christian Borntraeger Acked-by: Halil Pasic > --- > MAINTAINERS | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 8119141a926f..879594b93385 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -13026,7 +13026,9 @@ F:drivers/pci/hotplug/s390_pci_hpc.c > > S390 VFIO-CCW DRIVER > M: Cornelia Huck > -M: Halil Pasic > +M: Farhan Ali > +M: Eric Farman > +R: Halil Pasic > L: linux-s...@vger.kernel.org > L: k...@vger.kernel.org > S: Supported
Re: [PATCH v3 6/6] vfio: ccw: serialize the write system calls
On Thu, 13 Dec 2018 16:39:53 +0100 Cornelia Huck wrote: > On Wed, 28 Nov 2018 13:41:07 +0100 > Pierre Morel wrote: > > > When the user program is QEMU we rely on the QEMU lock to serialize > > the calls to the driver. > > > > In the general case we need to make sure that two data transfer are > > not started at the same time. > > It would in the current implementation resul in a overwriting of the > > IO region. > > > > We also need to make sure a clear or a halt started after a data > > transfer do not win the race agains the data transfer. > > Which would result in the data transfer being started after the > > halt/clear. > > > > Signed-off-by: Pierre Morel > > --- > > drivers/s390/cio/vfio_ccw_ops.c | 17 + > > 1 file changed, 13 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/s390/cio/vfio_ccw_ops.c > > b/drivers/s390/cio/vfio_ccw_ops.c > > index eb5b49d..b316966 100644 > > --- a/drivers/s390/cio/vfio_ccw_ops.c > > +++ b/drivers/s390/cio/vfio_ccw_ops.c > > @@ -267,22 +267,31 @@ static ssize_t vfio_ccw_mdev_write(struct mdev_device > > *mdev, > > { > > unsigned int index = VFIO_CCW_OFFSET_TO_INDEX(*ppos); > > struct vfio_ccw_private *private; > > + static atomic_t serialize = ATOMIC_INIT(0); > > + int ret = -EINVAL; > > + > > + if (!atomic_add_unless(&serialize, 1, 1)) > > + return -EBUSY; > > I think that hammer is far too big: This serializes _all_ write > operations across _all_ devices. > > There are various cases of simultaneous writes that may happen > (assuming any userspace; QEMU locking will prevent some of them from > happening): > > - One thread does a write for one mdev, another thread does a write for > another mdev. For example, if two vcpus issue an I/O instruction on > two different devices. This should be fine. > - One thread does a write for one mdev, another thread does a write for > the same mdev. Maybe a guest has confused/no locking and is trying to > do ssch on the same device from different vcpus. There, we want to > exclude simultaneous writes; the desired outcome may be that one ssch > gets forwarded to the hardware, and the second one either gets > forwarded after processing for the first one has finished, or > userspace gets an error immediately (hopefully resulting in a > appropriate condition code for that second ssch in any case). Both > handing the second ssch to the hardware or signaling device busy > immediately are probably sane in that case. > - If those writes for the same device involve hsch/csch, things get > more complicated. First, we have two different regions, and allowing > simultaneous writes to the I/O region and to the async region should > not really be a problem, so I don't think fencing should be done in > the generic write handler. Second, the semantics for device busy are > different: a hsch immediately after a ssch should not give device > busy, and csch cannot return device busy at all. > > I don't think we'll be able to get around some kind of "let's retry > sending this" logic for hsch/csch; maybe we should already do that for > ssch. (Like the -EINVAL logic I described in the other thread.) > > Nice write-up! I agree with the conclusion, and also with the most of the analysis. IMHO, to sort this out properly, we really have to think end-to-end (i.e. guest, userspace, vfio-ccw, backing-device). Striving towards an comprehensively documented the user-space facing vfio-ccw interface could help as well. I hope we can figure out a good solution in the context of hsch/csch. Regards, Halil
Re: [PATCH v5 05/15] KVM: s390: unify pending_irqs() and pending_irqs_no_gisa()
On Thu, 20 Dec 2018 12:49:56 +0100 Michael Mueller wrote: > > > On 20.12.18 12:06, Cornelia Huck wrote: > > On Wed, 19 Dec 2018 20:17:46 +0100 > > Michael Mueller wrote: > > > >> Use a single function with parameter irq_flags to differentiate > >> between cases. > >> > >> New irq flag: > >> IRQ_FLAG_LOCAL: include vcpu local interruptions pending > >> IRQ_FLAG_FLOATING: include vcpu floating interruptions pending > >> IRQ_FLAG_GISA: include GISA interruptions pending in IPM > > > > I presume that means that irqs may be in more than one set? Or are gisa > > irqs not considered floating irqs, because they use a different > > mechanism? > > Currently, the interruptions managed in GISA are floating only. But > that might change in future. I don't think GISA can be used for non-floating interrupts. Regards, Halil > The idea is not to subsume IRQ_FLAG_GISA > in IRQ_FLAG_FLOATING but to be able to address the right set of > procedures to determine the irq pending set for a given subset of irq > types that have different implementations. > > There might be a better name for IRQ_FLAG_FLOATING then? > [..]
Re: [PATCH v5 11/15] KVM: s390: restore IAM in get_ipm() when IPM is clean
On Wed, 19 Dec 2018 20:17:52 +0100 Michael Mueller wrote: > The patch adds the parameter irq_flags and allows to > restore the Interruption Alert Mask (IAM) in the GISA > atomically while guaranteeing the IPM is clean. > > The function returns the IPM of the GISA. If the returned > value is 0x00 and the IRQ_FLAG_IAM was set, the IAM has > been restored. > > New irq flag: > IRQ_FLAG_IAM: When set, the IAM is restored if no ISC bit > is set in the IPM, i.e. no new airqs are > pending. The test and restore operations > are done atomically. > > Signed-off-by: Michael Mueller > --- > arch/s390/kvm/interrupt.c | 34 +- > 1 file changed, 29 insertions(+), 5 deletions(-) > > diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c > index 1cc3ad2e6c7e..8307717e3caf 100644 > --- a/arch/s390/kvm/interrupt.c > +++ b/arch/s390/kvm/interrupt.c > @@ -37,6 +37,7 @@ > #define IRQ_FLAG_LOCAL0x8000 /* include local interruption pending mask > */ > #define IRQ_FLAG_FLOATING 0x4000 /* include float interruption pending mask > */ > #define IRQ_FLAG_GISA 0x2000 /* include GISA interruption pending mask */ > +#define IRQ_FLAG_IAM 0x0080 /* when set try to restore IAM */ > > #define IRQ_MASK_ALL (IRQ_FLAG_LOCAL | IRQ_FLAG_FLOATING | > IRQ_FLAG_GISA) > #define IRQ_MASK_NO_GISA (IRQ_MASK_ALL & ~IRQ_FLAG_GISA) > @@ -253,9 +254,32 @@ static inline void set_ipm_gisc(struct kvm_s390_gisa > *gisa, u32 gisc) > set_bit_inv(IPM_BIT_OFFSET + gisc, (unsigned long *) gisa); > } > > -static inline u8 get_ipm(struct kvm_s390_gisa *gisa) > +static inline u8 get_ipm(struct kvm_s390_gisa *gisa, u16 irq_flags) > { > - return READ_ONCE(gisa->ipm); > + u64 word, _word; > + u8 ipm; > + > + if (!(irq_flags & IRQ_FLAG_IAM)) > + return READ_ONCE(gisa->ipm); > + > + do { > + word = READ_ONCE(gisa->u64.word[0]); > + ipm = word >> 24; > + /* If the GISA is in the alert list, return the IPM. */ > + if ((u64)gisa != word >> 32) > + return ipm; > + /* If the IPM is dirty, return the IPM. */ > + if (ipm) > + return ipm; > + /* > + * Try to restore the IAM or loop, if the IPM is dirty > + * again or the GISA has been inserted into the alert list. > + */ > + _word = (word & ~0xffUL) | > + container_of(gisa, struct sie_page2, > gisa)->kvm->arch.iam; > + } while (cmpxchg(&gisa->u64.word[0], word, _word) != _word); ^^ Shouldn't this rather be: } while (cmpxchg(&gisa->u64.word[0], word, _word) != word); Regards, Halil > + > + return 0; > }
Re: [PATCH v5 13/15] KVM: s390: add function process_gib_alert_list()
On Wed, 19 Dec 2018 20:17:54 +0100 Michael Mueller wrote: > This function processes the Gib Alert List (GAL). It is required > to run when either a gib alert interruption has been received or > a gisa that is in the alert list is cleared or dropped. > > The GAL is build up by millicode, when the respective ISC bit is > set in the Interruption Alert Mask (IAM) and an interruption of > that class is observed. > > Signed-off-by: Michael Mueller > --- > arch/s390/kvm/interrupt.c | 140 > ++ > 1 file changed, 140 insertions(+) > > diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c > index 48a93f5e5333..03e7ba4f215a 100644 > --- a/arch/s390/kvm/interrupt.c > +++ b/arch/s390/kvm/interrupt.c > @@ -2941,6 +2941,146 @@ int kvm_s390_get_irq_state(struct kvm_vcpu *vcpu, > __u8 __user *buf, int len) > return n; > } > > +static int __try_airqs_kick(struct kvm *kvm, u8 ipm) > +{ > + struct kvm_s390_float_interrupt *fi = &kvm->arch.float_int; > + struct kvm_vcpu *vcpu = NULL, *kick_vcpu[MAX_ISC + 1]; > + int online_vcpus = atomic_read(&kvm->online_vcpus); > + u8 ioint_mask, isc_mask, kick_mask = 0x00; > + int vcpu_id, kicked = 0; > + > + /* Loop over vcpus in WAIT state. */ > + for (vcpu_id = find_first_bit(fi->idle_mask, online_vcpus); > + /* Until all pending ISCs have a vcpu open for airqs. */ > + (~kick_mask & ipm) && vcpu_id < online_vcpus; > + vcpu_id = find_next_bit(fi->idle_mask, online_vcpus, vcpu_id)) { > + vcpu = kvm_get_vcpu(kvm, vcpu_id); > + if (psw_ioint_disabled(vcpu)) > + continue; > + ioint_mask = (u8)(vcpu->arch.sie_block->gcr[6] >> 24); > + for (isc_mask = 0x80; isc_mask; isc_mask >>= 1) { > + /* ISC pending in IPM ? */ > + if (!(ipm & isc_mask)) > + continue; > + /* vcpu for this ISC already found ? */ > + if (kick_mask & isc_mask) > + continue; > + /* vcpu open for airq of this ISC ? */ > + if (!(ioint_mask & isc_mask)) > + continue; > + /* use this vcpu (for all ISCs in ioint_mask) */ > + kick_mask |= ioint_mask; > + kick_vcpu[kicked++] = vcpu; Assuming that the vcpu can/will take all ISCs it's currently open for does not seem right. We kind of rely on this assumption here, or? > + } > + } > + > + if (vcpu && ~kick_mask & ipm) > + VM_EVENT(kvm, 4, "gib alert undeliverable isc mask > 0x%02x", > + ~kick_mask & ipm); > + > + for (vcpu_id = 0; vcpu_id < kicked; vcpu_id++) > + kvm_s390_vcpu_wakeup(kick_vcpu[vcpu_id]); > + > + return (online_vcpus != 0) ? kicked : -ENODEV; > +} > + > +static void __floating_airqs_kick(struct kvm *kvm) > +{ > + struct kvm_s390_float_interrupt *fi = &kvm->arch.float_int; > + int online_vcpus, kicked; > + u8 ipm_t0, ipm; > + > + /* Get IPM and return if clean, IAM has been restored. */ > + ipm = get_ipm(kvm->arch.gisa, IRQ_FLAG_IAM); > + if (!ipm) > + return; > +retry: > + ipm_t0 = ipm; > + > + /* Try to kick some vcpus in WAIT state. */ > + kicked = __try_airqs_kick(kvm, ipm); > + if (kicked < 0) > + return; > + > + /* Get IPM and return if clean, IAM has been restored. */ > + ipm = get_ipm(kvm->arch.gisa, IRQ_FLAG_IAM); > + if (!ipm) > + return; > + > + /* Start over, if new ISC bits are pending in IPM. */ > + if ((ipm_t0 ^ ipm) & ~ipm_t0) > + goto retry; > + > + /* > + * Return as we just kicked at least one vcpu in WAIT state > + * open for airqs. The IAM will be restored latest when one > + * of them goes into WAIT or STOP state. > + */ > + if (kicked > 0) > + return; > + > + /* > + * No vcpu was kicked either because no vcpu was in WAIT state > + * or none of the vcpus in WAIT state are open for airqs. > + * Return immediately if no vcpus are in WAIT state. > + * There are vcpus in RUN state. They will process the airqs > + * if not closed for airqs as well. In that case the system will > + * delay airqs until a vcpu decides to take airqs again. > + */ > + online_vcpus = atomic_read(&kvm->online_vcpus); > + if (!bitmap_weight(fi->idle_mask, online_vcpus)) > + return; > + > + /* > + * None of the vcpus in WAIT state take airqs and we might > + * have no running vcpus as at least one vcpu is in WAIT state > + * and IPM is dirty. > + */ > + set_iam(kvm->arch.gisa, kvm->arch.iam); > +} > + > +#define NULL_GISA_ADDR 0xUL > +#define NONE_GISA_ADDR 0x0001UL > +#define GISA_ADDR_MASK 0xf000UL >
Re: [PATCH v5 10/15] KVM: s390: add functions to (un)register GISC with GISA
On Tue, 8 Jan 2019 11:34:44 +0100 Cornelia Huck wrote: > > >> > > >>> + spin_unlock(&kvm->arch.iam_ref_lock); > > >>> + > > >>> + return gib->nisc; > > >>> +} > > >>> +EXPORT_SYMBOL_GPL(kvm_s390_gisc_register); > > >>> + > > >>> +int kvm_s390_gisc_unregister(struct kvm *kvm, u32 gisc) > > >>> +{ > > >>> + int rc = 0; > > >>> + > > >>> + if (!kvm->arch.gib_in_use) > > >>> + return -ENODEV; > > >>> + if (gisc > MAX_ISC) > > >>> + return -ERANGE; > > >>> + > > >>> + spin_lock(&kvm->arch.iam_ref_lock); > > >>> + if (kvm->arch.iam_ref_count[gisc] == 0) { > > >>> + rc = -EINVAL; > > >>> + goto out; > > >>> + } > > >>> + kvm->arch.iam_ref_count[gisc]--; > > >>> + if (kvm->arch.iam_ref_count[gisc] == 0) { > > >>> + kvm->arch.iam &= ~(0x80 >> gisc); > > >>> + set_iam(kvm->arch.gisa, kvm->arch.iam); > > > > > > Any chance of this function failing here? If yes, would there be any > > > implications? > > > > It is the same here. > > I'm not sure that I follow: This is the reverse operation > (unregistering the gisc). Can we rely on get_ipm() to do any fixup > later? Is that a problem for the caller? IMHO gib alerts are all about not loosing initiative. I.e. avoiding substantially delayed delivery of interrupts due to vCPUs in wait state. Thus we must avoid letting go before setting up stuff (gisa.iam under consideration of gisa ipm, gisa.next_alert, and gib.alert_list_origin) so that we can react on the next interrupt (if necessary). On the other hand, getting more gisa alerts than necessary is not fatal -- better avoided if not too much hassle of course. Bottom line is, while it may be critical that the IAM changes implied by register take place immediately, unregister is a different story. Regards, Halil > > Apologies if I sound confused (well, that's because I probably am); > this is hard to review without access to the hardware specification.
Re: [virtio-dev] RE: [PATCH v1 0/2] Virtio: fix some vq allocation issues
On Mon, 31 Dec 2018 06:03:51 + "Wang, Wei W" wrote: > On Sunday, December 30, 2018 2:06 PM, Halil Pasic wrote: > > > > I guess you are the first one trying to read virtio config from within > > interrupt > > context. AFAICT this never worked. > > I'm not sure about "never worked". It seems to work well with virtio-pci. > But looking forward to hearing a solid reason why reading config inside > the handler is forbidden (if that's true). By "never worked" I meant "never worked with virtio-ccw". Sorry about the misunderstanding. Seems I've also failed to convey that I don't know if reading config inside the handler is forbidden or not. So please don't expect me providing the solid reasons you are looking forward to. > > > About what happens. The apidoc of ccw_device_start() says it needs to be > > called with the ccw device lock held, so ccw_io_helper() tries to take it > > (since > > forever I guess). OTOH do_cio_interrupt() takes the subchannel lock and > > io_subchannel_initialize_dev() makes the ccw device lock be the subchannel > > lock. That means when one tries to get virtio config form within a cio > > interrupt context we deadlock, because we try to take a lock we already > > have. > > > > That said, I don't think this limitation is by design (i.e. intended). > > Maybe Connie can help us with that question. AFAIK we have nothing > > documented regarding this (neither that can nor can't). > > > > Obviously, there are multiple ways around this problem, and at the moment > > I can't tell which would be my preferred one. > > Yes, it's also not difficult to tweak the virtio-balloon code to avoid that > issue. > But if that's just an issue with ccw itself, I think it's better to tweak ccw > and > remain virtio-balloon unchanged. > As I said, at the moment I don't have a preference regarding the fix, partly because I'm not sure if "reading config inside the handler" is OK or not. Maybe Connie or Michael can help us here. I'm however sure that commit 86a5597 "virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT" breaks virtio-balloon with the ccw transport (i.e. effectively breaks virtio-balloon on s390): it used to work before and does not work after. AFAICT tweaking the balloon code may be simpler than tweaking the virtio-ccw (transport code). ccw_io_helper() relies on getting an interrupt when the issued IO is done. If virtio-ccw is buggy, it needs to be fixed, but I'm not sure it is. Regards, Halil
Re: [virtio-dev] RE: [PATCH v1 0/2] Virtio: fix some vq allocation issues
On Wed, 2 Jan 2019 10:53:14 +0100 Cornelia Huck wrote: > On Tue, 1 Jan 2019 00:40:19 +0100 > Halil Pasic wrote: > > > On Mon, 31 Dec 2018 06:03:51 + > > "Wang, Wei W" wrote: > > > > > On Sunday, December 30, 2018 2:06 PM, Halil Pasic wrote: > > > > > > > > I guess you are the first one trying to read virtio config from within > > > > interrupt > > > > context. AFAICT this never worked. > > > > > > I'm not sure about "never worked". It seems to work well with virtio-pci. > > > But looking forward to hearing a solid reason why reading config inside > > > the handler is forbidden (if that's true). > > > > By "never worked" I meant "never worked with virtio-ccw". Sorry > > about the misunderstanding. Seems I've also failed to convey that I don't > > know if reading config inside the handler is forbidden or not. So please > > don't expect me providing the solid reasons you are looking forward to. > > It won't work with the current code, and this is all a bit ugly :( More > verbose explanation below. > > > > > > > > > > About what happens. The apidoc of ccw_device_start() says it needs to be > > > > called with the ccw device lock held, so ccw_io_helper() tries to take > > > > it (since > > > > forever I guess). OTOH do_cio_interrupt() takes the subchannel lock and > > > > io_subchannel_initialize_dev() makes the ccw device lock be the > > > > subchannel > > > > lock. That means when one tries to get virtio config form within a cio > > > > interrupt context we deadlock, because we try to take a lock we already > > > > have. > > > > > > > > That said, I don't think this limitation is by design (i.e. intended). > > > > Maybe Connie can help us with that question. AFAIK we have nothing > > > > documented regarding this (neither that can nor can't). > > The main problem is that channel I/O is a fundamentally asynchronous > mechanism. As channel devices don't have the concept of config spaces > (or some other things that virtio needs), I decided to map > reading/writing the config space to channel commands. Starting I/O on a > subchannel always needs the lock (to avoid races on the subchannel), > and the asynchronous interrupt for that I/O needs the lock as well (for > the same reason; things like the scsw contain state that you want to > access without races). A config change also means that the subchannel > becomes state pending (and an interrupt is made pending), so the > subchannel lock is taken for that path as well. (Virtqueue > notifications are handled differently on modern QEMU, but that does not > come into play here.) > Besides locking (thinking along the lines that we work around the lock problem somehow) there is also the new PSW which masks IO interrupts. As I said, doing something about this seems non-trivial at least. > > > > > > > > Obviously, there are multiple ways around this problem, and at the > > > > moment > > > > I can't tell which would be my preferred one. > > > > > > Yes, it's also not difficult to tweak the virtio-balloon code to avoid > > > that issue. > > > But if that's just an issue with ccw itself, I think it's better to tweak > > > ccw and > > > remain virtio-balloon unchanged. > > > > > > > As I said, at the moment I don't have a preference regarding the fix, > > partly because I'm not sure if "reading config inside the handler" is OK > > or not. Maybe Connie or Michael can help us here. I'm however sure that > > commit 86a5597 "virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT" > > breaks virtio-balloon with the ccw transport (i.e. effectively breaks > > virtio-balloon on s390): it used to work before and does not work > > after. > > Yes, that's unfortunate. > > > > > AFAICT tweaking the balloon code may be simpler than tweaking the > > virtio-ccw (transport code). ccw_io_helper() relies on getting > > an interrupt when the issued IO is done. If virtio-ccw is buggy, it > > needs to be fixed, but I'm not sure it is. > > I would not call virtio-ccw buggy, but it has some constraints that > virtio-pci apparently doesn't have (and which did not show up so far; > e.g. virtio-blk schedules a work item on config change, so there's no > deadlock there.) IMHO it is an internal API design thing. From the spirit of the virtio standard perspective a virtio-ccw device is a ccw device, and acts like one. We don't support new IO form ccw device interrupt handler. So that's quite OK. OTOH we probably do want a coherent in kernel virtio interface. And if that one needs to account for all the quirks of any transport, that is quite ugly. > > One way to get out of that constraint (don't interact with the config > space directly in the config changed handler) would be to schedule a > work item in virtio-ccw that calls virtio_config_changed() for the > device. My understanding is that delaying the notification to a work > queue would be fine. > That would get us out of irq context, but I read you found other problems. [..] Regards, Halil
Re: [virtio-dev] RE: [PATCH v1 0/2] Virtio: fix some vq allocation issues
On Wed, 2 Jan 2019 14:23:38 +0100 Cornelia Huck wrote: > On Wed, 2 Jan 2019 10:53:14 +0100 > Cornelia Huck wrote: > > > On Tue, 1 Jan 2019 00:40:19 +0100 > > Halil Pasic wrote: > > > > As I said, at the moment I don't have a preference regarding the fix, > > > partly because I'm not sure if "reading config inside the handler" is OK > > > or not. Maybe Connie or Michael can help us here. I'm however sure that > > > commit 86a5597 "virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT" > > > breaks virtio-balloon with the ccw transport (i.e. effectively breaks > > > virtio-balloon on s390): it used to work before and does not work > > > after. > > > > Yes, that's unfortunate. > > > > > > > > AFAICT tweaking the balloon code may be simpler than tweaking the > > > virtio-ccw (transport code). ccw_io_helper() relies on getting > > > an interrupt when the issued IO is done. If virtio-ccw is buggy, it > > > needs to be fixed, but I'm not sure it is. > > > > I would not call virtio-ccw buggy, but it has some constraints that > > virtio-pci apparently doesn't have (and which did not show up so far; > > e.g. virtio-blk schedules a work item on config change, so there's no > > deadlock there.) > > > > One way to get out of that constraint (don't interact with the config > > space directly in the config changed handler) would be to schedule a > > work item in virtio-ccw that calls virtio_config_changed() for the > > device. My understanding is that delaying the notification to a work > > queue would be fine. > > Unfortunately, calling virtio_config_changed() from a work item is not > enough: That function takes the config_lock, and the virtio-ccw code to > get the config both needs to allocate some memory and call schedule :/ > > The best option really seems to be > - have virtio-balloon move the handling of the config change onto a > workqueue or something like that, and > - document that you cannot read/write the virtio config space from an > atomic context > > Unless someone has a better idea? > I wonder, would making config_lock a mutex suffice? I've already hinted that a virtio-balloon side fix is probably the simpler variant. I agree, let's fix the regression first, and think about wether to teach virtio-ccw to allow config manipulation from virtio_config_changed() or not later. Regards, Halil
Re: [virtio-dev] RE: [PATCH v1 0/2] Virtio: fix some vq allocation issues
On Wed, 2 Jan 2019 19:02:33 +0100 Cornelia Huck wrote: > On Wed, 2 Jan 2019 16:59:19 +0100 > Halil Pasic wrote: > > > On Wed, 2 Jan 2019 14:23:38 +0100 > > Cornelia Huck wrote: > > > > > On Wed, 2 Jan 2019 10:53:14 +0100 > > > Cornelia Huck wrote: > > > > > > > On Tue, 1 Jan 2019 00:40:19 +0100 > > > > Halil Pasic wrote: > > > > > > AFAICT tweaking the balloon code may be simpler than tweaking the > > > > > virtio-ccw (transport code). ccw_io_helper() relies on getting > > > > > an interrupt when the issued IO is done. If virtio-ccw is buggy, it > > > > > needs to be fixed, but I'm not sure it is. > > > > > > > > I would not call virtio-ccw buggy, but it has some constraints that > > > > virtio-pci apparently doesn't have (and which did not show up so far; > > > > e.g. virtio-blk schedules a work item on config change, so there's no > > > > deadlock there.) > > > > > > > > One way to get out of that constraint (don't interact with the config > > > > space directly in the config changed handler) would be to schedule a > > > > work item in virtio-ccw that calls virtio_config_changed() for the > > > > device. My understanding is that delaying the notification to a work > > > > queue would be fine. > > > > > > Unfortunately, calling virtio_config_changed() from a work item is not > > > enough: That function takes the config_lock, and the virtio-ccw code to > > > get the config both needs to allocate some memory and call schedule :/ > > > > > > The best option really seems to be > > > - have virtio-balloon move the handling of the config change onto a > > > workqueue or something like that, and > > > - document that you cannot read/write the virtio config space from an > > > atomic context > > > > > > Unless someone has a better idea? > > > > > > > I wonder, would making config_lock a mutex suffice? > > Unless I'm mistaken, you can't take a mutex in an interrupt path. > I was too focused on virtio-ccw. We have the workqueue now, so it would not be a problem for us, but for the other transports. Grrr
Re: [PATCH v1 1/2] virtio-balloon: tweak config_changed implementation
On Thu, 3 Jan 2019 13:31:01 +0800 Wei Wang wrote: > virtio-ccw has deadlock issues with reading config registers inside the > interrupt context I would say something like 'virtio-ccw does not support using virtio_config_ops from an atomic context' as the limitation is not limited to read config (aka. get()). But I'm fine with your formulation as well. > , so we tweak the virtballoon_changed implementation > by moving the config read operations into the related workqueue contexts. > > Signed-off-by: Wei Wang Reviewed-by: Halil Pasic [..]
Re: [PATCH v1 2/2] virtio-balloon: improve update_balloon_size_func
On Thu, 3 Jan 2019 13:31:02 +0800 Wei Wang wrote: > There is no need to update the balloon actual register when there is no > ballooning request. This patch avoids update_balloon_size when diff is 0. > > Signed-off-by: Wei Wang Reviewed-by: Halil Pasic I would be also fine with both patches squashed into one. BTW the end result looks to me even a bit nicer.
Re: [PATCH 2/2] virtio: document virtio_config_ops restrictions
On Thu, 3 Jan 2019 17:08:04 +0100 Cornelia Huck wrote: > Some transports (e.g. virtio-ccw) implement virtio operations that > seem to be a simple read/write as something more involved that > cannot be done from an atomic context. > > Give at least a hint about that. > > Signed-off-by: Cornelia Huck > --- > include/linux/virtio_config.h | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h > index 7087ef946ba7..987b6491b946 100644 > --- a/include/linux/virtio_config.h > +++ b/include/linux/virtio_config.h > @@ -12,6 +12,11 @@ struct irq_affinity; > > /** > * virtio_config_ops - operations for configuring a virtio device > + * Note: Do not assume that a transport implements all of the operations > + * getting/setting a value as a simple read/write! Generally speaking, > + * any of @get/@set, @get_status/@set_status, or @get_features/ > + * @finalize_features are NOT safe to be called from an atomic > + * context. I think the only exception is @bus_name (and maybe @generation, I don't know) because it does not have to 'speak' with the hypervisor. If a transport operation has to 'speak' with the hypervisor, we do it by making it interpret a channel program. That means not safe to be called form atomic context. Or am I missing something? Regards, Halil > * @get: read the value of a configuration field > * vdev: the virtio_device > * offset: the offset of the configuration field
Re: [virtio-dev] [PATCH v2 1/2] virtio-balloon: tweak config_changed implementation
On Fri, 4 Jan 2019 15:11:52 +0800 Wei Wang wrote: > virtio-ccw has deadlock issues with reading the config space inside the > interrupt context, so we tweak the virtballoon_changed implementation > by moving the config read operations into the related workqueue contexts. > The config_read_bitmap is used as a flag to the workqueue callbacks > about the related config fields that need to be read. > > Reported-by: Christian Borntraeger > Signed-off-by: Wei Wang Reviewed-by: Halil Pasic
Re: [PATCH v1 0/2] Virtio: fix some vq allocation issues
On Sat, 29 Dec 2018 02:45:49 + "Wang, Wei W" wrote: > On Friday, December 28, 2018 3:57 PM, Christian Borntraeger wrote: > > On 28.12.2018 03:26, Wei Wang wrote: > > > Some vqs don't need to be allocated when the related feature bits are > > > disabled. Callers notice the vq allocation layer by setting the > > > related names[i] to be NULL. > > > > > > This patch series fixes the find_vqs implementations to handle this case. > > > > So the random crashes during boot are gone. > > What still does not work is actually using the balloon. > > > > So in the qemu monitor using lets say "balloon 1000" will hang the guest. > > Seems to be a deadlock in the virtio-ccw code. We seem to call the config > > code in the interrupt handler. > > Yes. It reads a config register from the interrupt handler. Do you know why > ccw doesn't support it and has some internal lock that caused the deadlock > issue? > > Best, > Wei I guess you are the first one trying to read virtio config from within interrupt context. AFAICT this never worked. About what happens. The apidoc of ccw_device_start() says it needs to be called with the ccw device lock held, so ccw_io_helper() tries to take it (since forever I guess). OTOH do_cio_interrupt() takes the subchannel lock and io_subchannel_initialize_dev() makes the ccw device lock be the subchannel lock. That means when one tries to get virtio config form within a cio interrupt context we deadlock, because we try to take a lock we already have. That said, I don't think this limitation is by design (i.e. intended). Maybe Connie can help us with that question. AFAIK we have nothing documented regarding this (neither that can nor can't). Obviously, there are multiple ways around this problem, and at the moment I can't tell which would be my preferred one. Regards, Halil
Re: [PATCH v1 0/2] KVM: s390: Tracing APCB changes
On 10/05/2018 10:31 AM, Pierre Morel wrote: > In the first patch we define kvm_arch_crypto_set_masks, > a new function to centralize the setup the APCB masks > inside the CRYCB SIE satelite and add KVM_EVENT() to > kvm_arch_crypto_set_masks and kvm_arch_crypto_clear_masks. > > In the second patch we replace the vfio_ap_mdev_copy_masks() > by the new kvm_arch_crypto_set_masks() function. > FWIW I also like this version better. And accommodating for on-the fly changes seems to be showing in the right direction anyway. I did not look into the details but it looks good to me. Halil
Re: [PATCH v10 11/26] s390: vfio-ap: implement mediated device open callback
On 09/12/2018 09:43 PM, Tony Krowiak wrote: > +/** > + * vfio_ap_mdev_open_once > + * > + * @matrix_mdev: a mediated matrix device > + * > + * Return 0 if no other mediated matrix device has been opened for the > + * KVM guest assigned to @matrix_mdev; otherwise, returns an error. > + */ > +static int vfio_ap_mdev_open_once(struct ap_matrix_mdev *matrix_mdev, > + struct kvm *kvm) > +{ > + struct ap_matrix_mdev *m; > + > + mutex_lock(&matrix_dev->lock); > + > + list_for_each_entry(m, &matrix_dev->mdev_list, node) { > + if ((m != matrix_mdev) && (m->kvm == kvm)) { > + mutex_unlock(&matrix_dev->lock); > + return -EPERM; > + } > + } > + > + mutex_unlock(&matrix_dev->lock); > + > + return 0; > +} > + > +static int vfio_ap_mdev_group_notifier(struct notifier_block *nb, > +unsigned long action, void *data) > +{ > + int ret; > + struct ap_matrix_mdev *matrix_mdev; > + > + if (action != VFIO_GROUP_NOTIFY_SET_KVM) > + return NOTIFY_OK; > + > + matrix_mdev = container_of(nb, struct ap_matrix_mdev, group_notifier); > + > + if (!data) { > + matrix_mdev->kvm = NULL; > + return NOTIFY_OK; > + } > + > + ret = vfio_ap_mdev_open_once(matrix_mdev, data); This could be racy. Two threads doing vfio_ap_mdev_group_notifier() can first get 0 here in a sense that there is no such kvm in the list, and then both set the very same kvm three lines below. Which would result in what we are trying to prevent. Also vfio_ap_mdev_open_once() does not seem like an appropriate name any more. If we were to do the matrix_mdev->kvm = kvm in there we could call it something like vfio_ap_mdev_set_kvm(). > + if (ret) > + return NOTIFY_DONE; > + > + matrix_mdev->kvm = data; > + > + ret = kvm_ap_validate_crypto_setup(matrix_mdev->kvm); > + if (ret) > + return ret; > + > + vfio_ap_mdev_copy_masks(matrix_mdev); > + > + return NOTIFY_OK; > +}
[PATCH 1/1] s390/virtio_ccw: fix config change notifications
Commit e3e9bda38e6d ("s390/virtio_ccw: use DMA handle from DMA API") broke configuration change notifications for virtio-ccw by putting the DMA address of *indicatorp directly into ccw->cda disregarding the fact that if !!(vcdev->is_thinint) then the function virtio_ccw_register_adapter_ind() will overwrite that ccw->cda value with the address of the virtio_thinint_area so it can actually set up the adapter interrupts via CCW_CMD_SET_IND_ADAPTER. Thus we end up pointing to the wrong object for both CCW_CMD_SET_IND if setting up the adapter interrupts fails, and for CCW_CMD_SET_CONF_IND regardless whether it succeeds or fails. To fix this, let us save away the dma address of *indicatorp in a local variable, and copy it to ccw->cda after the "vcdev->is_thinint" branch. Reported-by: Boqiao Fu Reported-by: Sebastian Mitterle Fixes: e3e9bda38e6d ("s390/virtio_ccw: use DMA handle from DMA API") Signed-off-by: Halil Pasic --- I know that checkpatch.pl complains about a missing 'Closes' tag. Unfortunately I don't have an appropriate URL at hand. @Sebastian, @Boqiao: do you have any suggetions? --- drivers/s390/virtio/virtio_ccw.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c index d7569f395559..d6491fc84e8c 100644 --- a/drivers/s390/virtio/virtio_ccw.c +++ b/drivers/s390/virtio/virtio_ccw.c @@ -698,6 +698,7 @@ static int virtio_ccw_find_vqs(struct virtio_device *vdev, unsigned nvqs, dma64_t *indicatorp = NULL; int ret, i, queue_idx = 0; struct ccw1 *ccw; + dma32_t indicatorp_dma = 0; ccw = ccw_device_dma_zalloc(vcdev->cdev, sizeof(*ccw), NULL); if (!ccw) @@ -725,7 +726,7 @@ static int virtio_ccw_find_vqs(struct virtio_device *vdev, unsigned nvqs, */ indicatorp = ccw_device_dma_zalloc(vcdev->cdev, sizeof(*indicatorp), - &ccw->cda); + &indicatorp_dma); if (!indicatorp) goto out; *indicatorp = indicators_dma(vcdev); @@ -735,6 +736,7 @@ static int virtio_ccw_find_vqs(struct virtio_device *vdev, unsigned nvqs, /* no error, just fall back to legacy interrupts */ vcdev->is_thinint = false; } + ccw->cda = indicatorp_dma; if (!vcdev->is_thinint) { /* Register queue indicators with host. */ *indicators(vcdev) = 0; base-commit: 83a7eefedc9b56fe7bfeff13b6c7356688ffa670 -- 2.40.1
Re: [PATCH 1/1] s390/virtio_ccw: fix config change notifications
On Wed, 12 Jun 2024 16:04:15 +0200 Thomas Huth wrote: > On 11/06/2024 23.47, Halil Pasic wrote: > > Commit e3e9bda38e6d ("s390/virtio_ccw: use DMA handle from DMA API") > > broke configuration change notifications for virtio-ccw by putting the > > DMA address of *indicatorp directly into ccw->cda disregarding the fact > > that if !!(vcdev->is_thinint) then the function > > virtio_ccw_register_adapter_ind() will overwrite that ccw->cda value > > with the address of the virtio_thinint_area so it can actually set up > > the adapter interrupts via CCW_CMD_SET_IND_ADAPTER. Thus we end up > > pointing to the wrong object for both CCW_CMD_SET_IND if setting up the > > adapter interrupts fails, and for CCW_CMD_SET_CONF_IND regardless > > whether it succeeds or fails. > > > > To fix this, let us save away the dma address of *indicatorp in a local > > variable, and copy it to ccw->cda after the "vcdev->is_thinint" branch. > > > > Reported-by: Boqiao Fu > > Reported-by: Sebastian Mitterle > > Fixes: e3e9bda38e6d ("s390/virtio_ccw: use DMA handle from DMA API") > > Signed-off-by: Halil Pasic > > --- > > I know that checkpatch.pl complains about a missing 'Closes' tag. > > Unfortunately I don't have an appropriate URL at hand. @Sebastian, > > @Boqiao: do you have any suggetions? > > Closes: https://issues.redhat.com/browse/RHEL-39983 > ? Yep! That is a public bug tracker bug. Qualifies! @Vasily: Can you guys pick hat one up when picking the patch? > > Anyway, I've tested the patch and it indeed fixes the problem with > virtio-balloon and the link state for me: > > Tested-by: Thomas Huth > Thanks!
Re: [PATCH v2 2/2] virtio: fix vq # for balloon
On Wed, 10 Jul 2024 07:42:46 -0400 "Michael S. Tsirkin" wrote: > --- a/drivers/s390/virtio/virtio_ccw.c > +++ b/drivers/s390/virtio/virtio_ccw.c > @@ -694,7 +694,7 @@ static int virtio_ccw_find_vqs(struct virtio_device > *vdev, unsigned nvqs, > { > struct virtio_ccw_device *vcdev = to_vc_device(vdev); > dma64_t *indicatorp = NULL; > - int ret, i, queue_idx = 0; > + int ret, i; > struct ccw1 *ccw; > dma32_t indicatorp_dma = 0; > > @@ -710,7 +710,7 @@ static int virtio_ccw_find_vqs(struct virtio_device > *vdev, unsigned nvqs, > continue; > } > > - vqs[i] = virtio_ccw_setup_vq(vdev, queue_idx++, vqi->callback, > + vqs[i] = virtio_ccw_setup_vq(vdev, i, vqi->callback, >vqi->name, vqi->ctx, ccw); > if (IS_ERR(vqs[i])) { > ret = PTR_ERR(vqs[i]); Acked-by: Halil Pasic #s390
Re: [PATCH v15 00/13] s390/vfio-ap: dynamic configuration support
On Tue, 6 Apr 2021 11:31:09 -0400 Tony Krowiak wrote: > Tony Krowiak (13): > s390/vfio-ap: fix circular lockdep when setting/clearing crypto masks The subsequent patches, re introduce this circular locking dependency problem. See my kernel messages for the details. The link we severe in the above patch is re-introduced at several places. One of them is assign_adapter_store(). Regards, Halil [ +0.000236] vfio_ap matrix: MDEV: Registered [ +0.037919] vfio_mdev 4f77ad87-1e62-4959-8b7a-c677c98d2194: Adding to iommu group 1 [ +0.92] vfio_mdev 4f77ad87-1e62-4959-8b7a-c677c98d2194: MDEV: group_id = 1 [Apr 8 22:31] == [ +0.02] WARNING: possible circular locking dependency detected [ +0.02] 5.12.0-rc6-00016-g5bea90816c56 #57 Not tainted [ +0.02] -- [ +0.02] CPU 1/KVM/6651 is trying to acquire lock: [ +0.02] cef9d508 (&matrix_dev->lock){+.+.}-{3:3}, at: handle_pqap+0x56/0x1c8 [vfio_ap] [ +0.11] but task is already holding lock: [ +0.01] d41f4308 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x898 [kvm] [ +0.38] which lock already depends on the new lock. [ +0.02] the existing dependency chain (in reverse order) is: [ +0.01] -> #2 (&vcpu->mutex){+.+.}-{3:3}: [ +0.04]validate_chain+0x796/0xa20 [ +0.06]__lock_acquire+0x420/0x7c8 [ +0.03]lock_acquire.part.0+0xec/0x1e8 [ +0.02]lock_acquire+0xb8/0x208 [ +0.02]__mutex_lock+0xa2/0x928 [ +0.05]mutex_lock_nested+0x32/0x40 [ +0.02]kvm_s390_cpus_to_pv+0x4e/0xf8 [kvm] [ +0.19]kvm_s390_handle_pv+0x1ce/0x6b0 [kvm] [ +0.18]kvm_arch_vm_ioctl+0x3ec/0x550 [kvm] [ +0.19]kvm_vm_ioctl+0x40e/0x4a8 [kvm] [ +0.18]__s390x_sys_ioctl+0xc0/0x100 [ +0.04]do_syscall+0x7e/0xd0 [ +0.43]__do_syscall+0xc0/0xd8 [ +0.04]system_call+0x72/0x98 [ +0.04] -> #1 (&kvm->lock){+.+.}-{3:3}: [ +0.04]validate_chain+0x796/0xa20 [ +0.02]__lock_acquire+0x420/0x7c8 [ +0.02]lock_acquire.part.0+0xec/0x1e8 [ +0.02]lock_acquire+0xb8/0x208 [ +0.03]__mutex_lock+0xa2/0x928 [ +0.02]mutex_lock_nested+0x32/0x40 [ +0.02]kvm_arch_crypto_set_masks+0x4a/0x2b8 [kvm] [ +0.18]vfio_ap_mdev_refresh_apcb+0xd0/0xe0 [vfio_ap] [ +0.03]assign_adapter_store+0x1f2/0x240 [vfio_ap] [ +0.03]kernfs_fop_write_iter+0x13e/0x1e0 [ +0.03]new_sync_write+0x10a/0x198 [ +0.03]vfs_write.part.0+0x196/0x290 [ +0.02]ksys_write+0x6c/0xf8 [ +0.03]do_syscall+0x7e/0xd0 [ +0.02]__do_syscall+0xc0/0xd8 [ +0.03]system_call+0x72/0x98 [ +0.02] -> #0 (&matrix_dev->lock){+.+.}-{3:3}: [ +0.04]check_noncircular+0x16e/0x190 [ +0.02]check_prev_add+0xec/0xf38 [ +0.02]validate_chain+0x796/0xa20 [ +0.02]__lock_acquire+0x420/0x7c8 [ +0.02]lock_acquire.part.0+0xec/0x1e8 [ +0.02]lock_acquire+0xb8/0x208 [ +0.02]__mutex_lock+0xa2/0x928 [ +0.02]mutex_lock_nested+0x32/0x40 [ +0.03]handle_pqap+0x56/0x1c8 [vfio_ap] [ +0.02]handle_pqap+0xe2/0x1d8 [kvm] [ +0.19]kvm_handle_sie_intercept+0x134/0x248 [kvm] [ +0.19]vcpu_post_run+0x2b6/0x580 [kvm] [ +0.18]__vcpu_run+0x27e/0x388 [kvm] [ +0.19]kvm_arch_vcpu_ioctl_run+0x10a/0x278 [kvm] [ +0.18]kvm_vcpu_ioctl+0x2cc/0x898 [kvm] [ +0.18]__s390x_sys_ioctl+0xc0/0x100 [ +0.03]do_syscall+0x7e/0xd0 [ +0.02]__do_syscall+0xc0/0xd8 [ +0.02]system_call+0x72/0x98 [ +0.03] other info that might help us debug this: [ +0.01] Chain exists of: &matrix_dev->lock --> &kvm->lock --> &vcpu->mutex [ +0.05] Possible unsafe locking scenario: [ +0.01]CPU0CPU1 [ +0.01] [ +0.02] lock(&vcpu->mutex); [ +0.02]lock(&kvm->lock); [ +0.02]lock(&vcpu->mutex); [ +0.02] lock(&matrix_dev->lock); [ +0.02] *** DEADLOCK *** [ +0.02] 2 locks held by CPU 1/KVM/6651: [ +0.02] #0: d41f4308 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x898 [kvm] [ +0.23] #1: da2fc508 (&kvm->srcu){}-{0:0}, at: __vcpu_run+0x1ec/0x388 [kvm] [ +0.21] stack backtrace: [ +0.02] CPU: 6 PID: 6651 Comm: CPU 1/KVM Not tainted 5.12.0-rc6-00016-g5bea90816c56 #57 [ +0.04] Hardware name: IBM 8561 T01 7
Re: [PATCH v4 0/2] vfio/mdev: Device namespace protection
From vfio-ccw perspective I join Connie's assessment: vfio-ccw should be fine with these changes. I'm however not too deeply involved with the mdev framework, thus I don't feel comfortable r-b-ing. That results in Acked-by: Halil Pasic for both patches. While at it I have would like to ask about the semantics and intended use of the mdev interfaces. static int vfio_ccw_sch_probe(struct subchannel *sch) { /* HALIL: 8< Not so interesting stuff happens here. >8 */ ret = vfio_ccw_mdev_reg(sch); if (ret) goto out_disable; /* * HALIL: * This might be racy. Somewhere in vfio_ccw_mdev_reg() the create attribute * is made available (it calls mdev_register_device()). For instance create will * attempt to decrement private->avail which is initialized below. I fail to * understand how is this well synchronized. */ INIT_WORK(&private->io_work, vfio_ccw_sch_io_todo); atomic_set(&private->avail, 1); private->state = VFIO_CCW_STATE_STANDBY; return 0; out_disable: cio_disable_subchannel(sch); out_free: dev_set_drvdata(&sch->dev, NULL); kfree(private); return ret; } Should not initialization of go before mdev_register_device(), and then rolled back if necessary if mdev_register_device() fails? In practice it does not seem very likely that userspace can trigger mdev_device_create() before vfio_ccw_sch_probe() finishes so it should not be a practical problem. But I would like to understand how synchronization is supposed to work. [Added Dong Jia, maybe he is also able to answer my question.] Regards, Halil On 05/18/2018 09:10 PM, Alex Williamson wrote: v4: Fix the 'create' racing 'remove' gap noted by Kirti by moving removal from mdev_list to mdev_device_release(). Fix missing mdev_put_parent() cases in mdev_device_create(), also noted by Kirti. Added documention update regarding serialization as noted by Cornelia. Added additional commit log comment about -EAGAIN vs -ENODEV for 'remove' racing 'create'. Added second patch to re-order sysfs attributes, with this my targeted scripts can no longer hit the gap where -EAGAIN is regurned. BTW, the gap where the current code returns -ENODEV in this race condition is about 50% easier to hit than it exists in this series with patch 1 alone. Thanks, Alex --- Alex Williamson (2): vfio/mdev: Check globally for duplicate devices vfio/mdev: Re-order sysfs attribute creation Documentation/vfio-mediated-device.txt |5 ++ drivers/vfio/mdev/mdev_core.c | 102 +++- drivers/vfio/mdev/mdev_private.h |2 - drivers/vfio/mdev/mdev_sysfs.c | 14 ++-- 4 files changed, 49 insertions(+), 74 deletions(-)
Re: [PATCH v4 0/2] vfio/mdev: Device namespace protection
On 05/23/2018 10:56 AM, Cornelia Huck wrote: On Tue, 22 May 2018 12:38:29 -0600 Alex Williamson wrote: On Tue, 22 May 2018 19:17:07 +0200 Halil Pasic wrote: From vfio-ccw perspective I join Connie's assessment: vfio-ccw should be fine with these changes. I'm however not too deeply involved with the mdev framework, thus I don't feel comfortable r-b-ing. That results in Acked-by: Halil Pasic for both patches. While at it I have would like to ask about the semantics and intended use of the mdev interfaces. static int vfio_ccw_sch_probe(struct subchannel *sch) { /* HALIL: 8< Not so interesting stuff happens here. >8 */ This was interesting: private->state = VFIO_CCW_STATE_NOT_OPER; ret = vfio_ccw_mdev_reg(sch); if (ret) goto out_disable; /* * HALIL: * This might be racy. Somewhere in vfio_ccw_mdev_reg() the create attribute * is made available (it calls mdev_register_device()). For instance create will * attempt to decrement private->avail which is initialized below. I fail to * understand how is this well synchronized. */ INIT_WORK(&private->io_work, vfio_ccw_sch_io_todo); atomic_set(&private->avail, 1); private->state = VFIO_CCW_STATE_STANDBY; return 0; out_disable: cio_disable_subchannel(sch); out_free: dev_set_drvdata(&sch->dev, NULL); kfree(private); return ret; } Should not initialization of go before mdev_register_device(), and then rolled back if necessary if mdev_register_device() fails? In practice it does not seem very likely that userspace can trigger mdev_device_create() before vfio_ccw_sch_probe() finishes so it should not be a practical problem. But I would like to understand how synchronization is supposed to work. [Added Dong Jia, maybe he is also able to answer my question.] vfio_ccw_mdev_create() requires that private->state is not VFIO_CCW_STATE_NOT_OPER but vfio_ccw_sch_probe() explicitly sets state to this value before calling vfio_ccw_mdev_reg(), so a create should return -ENODEV if racing with parent registration. Is there something else that I'm missing? Thanks, Disclaimer: I did not do much kernel work up until now. I still have much to learn. I mostly agree with your analysis but I'm not sure if the conclusion should be 'and thus everything is good' or 'and thus indeed we do have a race, a poorly handled one'. One thing I'm not sure about is: can atomic_set(&private->avail, 1) and private->state = VFIO_CCW_STATE_STANDBY be perceived as reordered by e.g. some other cpu and thus vfio_ccw_mdev_create() or not. I tried to figure it out based on Documentation/atomic_t.txt but was not very successful. If these can be reordered we could observe -EPERM instead of -ENODEV, I think. Furthermore from your analysis I deduce that the client code (I think mdev calls it vendor code) may rely on mdev_register_device() containing a (RELEASE) barrier. We use a mutex in there so the barrier is there. And the client code may rely on a (ACQUIRE) barrier before the create callback is called. That should also be true and was true in the past too again because of mutex usage. Alex No, I think your understanding is correct. We move the state from NOT_OPER to STANDBY only after we're set up completely, so our create callback will simply fail early with -ENODEV. This looks fine to me. This -ENODEV looks strange to me. Which device does not exist? The userspace were supposed to retry on this? It's not even -EAGAIN. Is it documented somewhere? If it's unavoidable (which I don't see why) I would prefer -EAGAIN. I think throwing an -ENODEV at our userspace once in a blue moon (if ever) because that is the way we 'handle' races in our code instead of avoiding them is not very friendly. And I'm not sure -EPERM is not possible (see my statement about reordering of the writes above). Regards, Halil
Re: [PATCH 1/1] vfio-ccw: fence off transport mode
On 02/22/2018 04:39 PM, Cornelia Huck wrote: > vfio-ccw only supports command mode for channel programs, not transport > mode. User space is supposed to already take care of that and pass us > command-mode ORBs only, but better make sure and return an error to > the caller instead of trying to process tcws as ccws. > > Signed-off-by: Cornelia Huck Acked-by: Halil Pasic > --- > drivers/s390/cio/vfio_ccw_fsm.c | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/drivers/s390/cio/vfio_ccw_fsm.c b/drivers/s390/cio/vfio_ccw_fsm.c > index c30420c517b1..ff6963ad6e39 100644 > --- a/drivers/s390/cio/vfio_ccw_fsm.c > +++ b/drivers/s390/cio/vfio_ccw_fsm.c > @@ -124,6 +124,11 @@ static void fsm_io_request(struct vfio_ccw_private > *private, > if (scsw->cmd.fctl & SCSW_FCTL_START_FUNC) { > orb = (union orb *)io_region->orb_area; > > + /* Don't try to build a cp if transport mode is specified. */ > + if (orb->tm.b) { > + io_region->ret_code = -EOPNOTSUPP; I guess now we communicate this as appropriately as possible. > + goto err_out; > + } > io_region->ret_code = cp_init(&private->cp, mdev_dev(mdev), > orb); > if (io_region->ret_code) >
Re: [PATCH v2 5/5] vfio: ccw: add traceponits for interesting error paths
On 04/30/2018 01:51 PM, Cornelia Huck wrote: On Sat, 28 Apr 2018 13:50:23 +0800 Dong Jia Shi wrote: * Cornelia Huck [2018-04-27 12:13:53 +0200]: On Mon, 23 Apr 2018 13:01:13 +0200 Dong Jia Shi wrote: typo in subject: s/traceponits/tracepoints/ From: Halil Pasic Add some tracepoints so we can inspect what is not working as is should. Signed-off-by: Halil Pasic Signed-off-by: Dong Jia Shi --- drivers/s390/cio/Makefile | 1 + drivers/s390/cio/vfio_ccw_fsm.c | 16 +++- drivers/s390/cio/vfio_ccw_trace.h | 77 +++ 3 files changed, 93 insertions(+), 1 deletion(-) create mode 100644 drivers/s390/cio/vfio_ccw_trace.h @@ -135,6 +142,8 @@ static void fsm_io_request(struct vfio_ccw_private *private, goto err_out; io_region->ret_code = cp_prefetch(&private->cp); + trace_vfio_ccw_cp_prefetch(get_schid(private), + io_region->ret_code); if (io_region->ret_code) { cp_free(&private->cp); goto err_out; @@ -142,11 +151,13 @@ static void fsm_io_request(struct vfio_ccw_private *private, /* Start channel program and wait for I/O interrupt. */ io_region->ret_code = fsm_io_helper(private); + trace_vfio_ccw_fsm_io_helper(get_schid(private), +io_region->ret_code); if (io_region->ret_code) { cp_free(&private->cp); goto err_out; } - return; + goto out; } else if (scsw->cmd.fctl & SCSW_FCTL_HALT_FUNC) { /* XXX: Handle halt. */ io_region->ret_code = -EOPNOTSUPP; @@ -159,6 +170,9 @@ static void fsm_io_request(struct vfio_ccw_private *private, err_out: private->state = VFIO_CCW_STATE_IDLE; +out: + trace_vfio_ccw_io_fctl(scsw->cmd.fctl, get_schid(private), + io_region->ret_code); } /* I really don't want to bikeshed, especially as some tracepoints are better than no tracepoints, but... We now trace fctl/schid/ret_code unconditionally (good). We trace the outcome of cp_prefetch() and fsm_io_helper() unconditionally. We don't, however, trace all things that may go wrong. We have the tracepoint at the end, but it cannot tell us where the error came from. Should we have tracepoints in every place (in this function) that may generate an error? Only if there is an actual error? Are the two enough for common debug scenarios? Trace actual error sounds like a better idea than trace unconditionally of these two functions. These two are not enough for common debug scenarios. For example, we cann't tell if a -EOPNOTSUPP is a orb->tm.b problem, or error code returned by cp_init(). Idea to improve: 1. Trace actual error. 2. Define a trace event and add error trace for cp_init(). Hm. Going from what I have done in the past when doing printk debugging: - stick in a message that is always hit, with some information about parameters, if it makes sense - stick in a message "foo happened!" in the error branches - or, alternatively, trace the called functions So tracing on failure only might be more useful? Have all failure paths under a common knob to turn on/off? Opinions? We can just go ahead with this and improve things later on, I guess. I think it's also fine to do this - better something than nothing. We could at least have a code base to be improved to make everybody happier in future. Maybe keep the patch as it is now, except trace the errors only (keeping the fctl trace point)? What do you mean by this sentence. Get rid of vfio_ccw_io_fctl or get rid of vfio_ccw_cp_prefetch and vfio_ccw_fsm_io_helper, or get don't get rid of any, but make some conditional (!errno)? Halil, as you wrote the patch (and I presume you found it helpful): What is your opinion? I'm in favor of this patch (as previously stated here https://patchwork.kernel.org/patch/10298305/). And regarding the questions under discussion I'm mostly fine either way. I think the naming of this fctl thing is a bit cryptic, but if we don't see this as ABI I'm fine with it -- can be improved. What would be a better name? I was thinking along the lines accept_request. (Bad error code would mean that the request did not get accepted. Good code does not mean the requested function was performed successfully.) Also I think vfio_ccw_io_fctl with no zero error code would make sense as dev_warn. If I were an admin looking into a problem I would very much appreciate seeing something in the messages log (and not having to enable tracing first). This point seems to be a good one for high level 'request gone bad' kind of report. Opinions? Regards, Halil
Re: [PATCH v2 5/5] vfio: ccw: add traceponits for interesting error paths
On 04/30/2018 05:03 PM, Cornelia Huck wrote: I think the naming of this fctl thing is a bit cryptic, but if we don't see this as ABI I'm fine with it -- can be improved. What would be a better name? I was thinking along the lines accept_request. (Bad error code would mean that the request did not get accepted. Good code does not mean the requested function was performed successfully.) I think fctl is fine (if you don't understand what 'fctl' is, you're unlikely to understand it even if it were named differently.) AFAIU this fctl is a bit more complicated than the normal fctl. But better let sleeping dogs lie. Also I think vfio_ccw_io_fctl with no zero error code would make sense as dev_warn. If I were an admin looking into a problem I would very much appreciate seeing something in the messages log (and not having to enable tracing first). This point seems to be a good one for high level 'request gone bad' kind of report. Opinions? I'd also exclude -EOPNOTSUPP (as this also might happen with e.g. a halt/clear enabled user space, which probes availability of halt/clear support by giving it a try once (yes, I really want to post my patches this week.)) I'm looking forward to the clear/halt. It hope it will help me understand the big vfio-ccw picture better. There are still dark spots, but I don't feel like doing something against this, as there is quite some activity going on here -- and I don't want to hamper the efforts by binding resources. Regards, Halil
Re: [PATCH v5 10/13] s390: vfio-ap: sysfs interface to view matrix mdev matrix
On 06/07/2018 02:53 PM, Tony Krowiak wrote: 2) As I said above, what you show is not the effective mask used by the guest Why would a sysfs attribute for the mediated matrix device show the effective mask used by the guest? OK, bad word, "effective", replace with "really". We do not implement any kind of provisioning nor do we implement update of the CRYCB at any point after the first mediated device open. I think this is a way we might be able to hot plug/unplug devices. Binding a queue and updating the mask can be done at any time (may be we should change this ?) As I said above, I think we can utilize this as a means of hot plugging/unplugging AP adapters and domains. If the guest is running when an adapter or domain is assigned, we can update the guest's CRYCB at that time. What is the point of showing a matrix which will never be used by the guest? That is simply not true. The matrix WILL be used by a guest the next time a guest is configured with a vfio-ap device referencing the path to the mediated matrix device - i.e., -device vfio-ap,sysfsdev=$PATH. The point is to show the matrix assigned to the mediated matrix device. In my mind, the mediated matrix device is a separate object from the guest. Sure it is used to configure a guest's matrix when the guest is started, but it could be used to configure the matrix for any guest; it has no direct connection to a particular guest until a guest using the device is started. IMHO the sysfs attributes for the mediated matrix device reflect only the attributes of the device, not the attributes of a guest. So bottom line is what? Is the interface going to change so that modifications to the mdev's matrix will be reflected immediately -- to support hotplug of domains and ap cards? Or are you intending to keep the interface as is? If the matrix assigned to the mediated device can differ from the matrix of the guest (that is the masks in the CRYCB, and I'm talking about a running guest) do you provide a way for the host admin to examine the matrix of the guest? If not, why do you think that information is irrelevant to the host admin? Regards, Halil
Re: [PATCH v5 05/13] s390: vfio-ap: register matrix device with VFIO mdev framework
On 05/15/2018 05:16 PM, Tony Krowiak wrote: On 05/15/2018 10:17 AM, Pierre Morel wrote: On 14/05/2018 21:42, Tony Krowiak wrote: On 05/11/2018 01:18 PM, Halil Pasic wrote: On 05/07/2018 05:11 PM, Tony Krowiak wrote: Registers the matrix device created by the VFIO AP device driver with the VFIO mediated device framework. Registering the matrix device will create the sysfs structures needed to create mediated matrix devices each of which will be used to configure the AP matrix for a guest and connect it to the VFIO AP device driver. [..] diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c new file mode 100644 index 000..d7d36fb --- /dev/null +++ b/drivers/s390/crypto/vfio_ap_ops.c @@ -0,0 +1,106 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Adjunct processor matrix VFIO device driver callbacks. + * + * Copyright IBM Corp. 2017 + * Author(s): Tony Krowiak + * + */ +#include +#include +#include +#include +#include + +#include "vfio_ap_private.h" + +#define VFOP_AP_MDEV_TYPE_HWVIRT "passthrough" +#define VFIO_AP_MDEV_NAME_HWVIRT "VFIO AP Passthrough Device" + +static int vfio_ap_mdev_create(struct kobject *kobj, struct mdev_device *mdev) +{ + struct ap_matrix *ap_matrix = to_ap_matrix(mdev_parent_dev(mdev)); + + ap_matrix->available_instances--; + + return 0; +} + +static int vfio_ap_mdev_remove(struct mdev_device *mdev) +{ + struct ap_matrix *ap_matrix = to_ap_matrix(mdev_parent_dev(mdev)); + + ap_matrix->available_instances++; + + return 0; +} + The above functions seem to be called with the lock of this auto-generated mdev parent device held. That's why we don't have to care about synchronization ourselves, right? I would assume as much. The comments for the 'struct mdev_parent_ops' in include/linux/mdev.h do not mention anything about synchronization, nor did I see any locking or synchronization in the vfio_ccw implementation after which I modeled my code, so frankly it is something I did not consider. A small comment in the code could be helpful for mdev non-experts. Hell, I would even consider documenting it for all mdev -- took me some time to figure out. You may want to bring this up with the VFIO mdev maintainers, but I'd be happy to include a comment in the functions in question if you think it important. [..] +int vfio_ap_mdev_register(struct ap_matrix *ap_matrix) +{ + int ret; + + ret = mdev_register_device(&ap_matrix->device, &vfio_ap_matrix_ops); + if (ret) + return ret; + + ap_matrix->available_instances = AP_MATRIX_MAX_AVAILABLE_INSTANCES; + + return 0; +} + +void vfio_ap_mdev_unregister(struct ap_matrix *ap_matrix) +{ + ap_matrix->available_instances--; What is this for? I don't understand. To control the number of mediated devices one can create for the matrix device. Once the max is reached, the mdev framework will not allow creation of another mediated device until one is removed. This counter keeps track of the number of instances that can be created. This is documented with the mediated framework. You may want to take a look at: Documentation/vfio-mediated-device.txt Documentation/vfio.txt Documentation/virtual/kvm/devices/vfio.txt This is what you do in create/remove. But here in unregister I agree with Halil, it does not seem to be usefull. If that is in fact what Halil was asking, then I misinterpreted his question; I thought he was asking what the available_instances was used for. You are correct, this does not belong here although it makes little difference given this is called only when the driver, which creates the matrix device, is unloaded. It is necessary in the register function to initialize its value, but I'll remove it from here. I questioned the dubious usage of ap_matrix->available_instances rather than asking what is the variable for. If I've had this deemed damaging I would have asked if it's damaging in a way I think it is. For example take my comment on 'KVM: s390: interfaces to manage guest's AP matrix'. Regards, Halil Regards, Halil + mdev_unregister_device(&ap_matrix->device); +}
[PATCH v2 1/1] s390: vfio-ccw: push down unsupported IDA check
There is at least one relevant guest OS that doesn't set the IDA flags in the ORB as we would like them, but never uses any IDA. So instead of saying -EOPNOTSUPP when observing an ORB, such that a channel program specified by it could be a not supported one, let us say -EOPNOTSUPP only if the channel program is a not supported one. Of course, the real solution would be doing proper translation for all IDA. This is possible, but given the current code not straight forward. Signed-off-by: Halil Pasic Tested-by: Jason J. Herne --- --- drivers/s390/cio/vfio_ccw_cp.c | 19 --- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c index 2c7550797ec2..cb4bf0dfbacc 100644 --- a/drivers/s390/cio/vfio_ccw_cp.c +++ b/drivers/s390/cio/vfio_ccw_cp.c @@ -365,6 +365,9 @@ static void cp_unpin_free(struct channel_program *cp) * This is the chain length not considering any TICs. * You need to do a new round for each TIC target. * + * The program is also validated for absence of not yet supported + * indirect data addressing scenarios. + * * Returns: the length of the ccw chain or -errno. */ static int ccwchain_calc_length(u64 iova, struct channel_program *cp) @@ -391,6 +394,14 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp) do { cnt++; + /* +* As we don't want to fail direct addressing even if the +* orb specified one of the unsupported formats, we defer +* checking for IDAWs in unsupported formats to here. +*/ + if ((!cp->orb.cmd.c64 || cp->orb.cmd.i2k) && ccw_is_idal(ccw)) + return -EOPNOTSUPP; + if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw))) break; @@ -656,10 +667,8 @@ int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb) /* * XXX: * Only support prefetch enable mode now. -* Only support 64bit addressing idal. -* Only support 4k IDAW. */ - if (!orb->cmd.pfch || !orb->cmd.c64 || orb->cmd.i2k) + if (!orb->cmd.pfch) return -EOPNOTSUPP; INIT_LIST_HEAD(&cp->ccwchain_list); @@ -688,6 +697,10 @@ int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb) ret = ccwchain_loop_tic(chain, cp); if (ret) cp_unpin_free(cp); + /* It is safe to force: if not set but idals used +* ccwchain_calc_length returns an error. +*/ + cp->orb.cmd.c64 = 1; return ret; } -- 2.16.3
Re: [PATCH 01/10] vfio: ccw: Moving state change out of IRQ context
On 04/25/2018 08:57 AM, Cornelia Huck wrote: AFAIU this will be the problem of the person implementing the clear and the halt for vfio-ccw. I.e. it's a non-problem right now. Well, that person is me:) I will post some RFC Real Soon Now if I stop getting sidetracked... Makes sense. It should be fine either way AFAIU. CSCH, more precisely the clear function is supposed to clear the interruption request(s) too. But I guess there is no way of the CP to identify an I/O interrupt that should have been cleared -- that is catch us disrespecting the architecture. I can't think of a way to establish must happen before relationship... But discarding the first interrupt and delivering just one for the CSCH is fine too for the same reason. Regards, Halil
Re: [PATCH v2 1/5] vfio: ccw: fix cleanup if cp_prefetch fails
On 04/25/2018 04:43 AM, Dong Jia Shi wrote: [Some procedural notes: I've created a new vfio-ccw-fixes branch based on the s390 fixes branch for easier handling. Things targeted for the next release will go on the vfio-ccw branch on top of the s390 features branch, as before. Does that work for everybody? I don't see a reason that why it doesn't work for me. Works for me. (And I am the only vfio-ccw maintainer with a kernel.org account, right?)] I don't have such an account, and don't know if I need to apply for one. Neither do I have a kernel.org account -- at least for now. Regards, Halil
Re: [PATCH v4 1/7] s390: ap: kvm: add PQAP interception for AQIC
On Thu, 28 Feb 2019 10:42:23 +0100 Christian Borntraeger wrote: > > > On 27.02.2019 19:00, Tony Krowiak wrote: > > On 2/27/19 3:09 AM, Pierre Morel wrote: > >> On 26/02/2019 16:47, Tony Krowiak wrote: > >>> On 2/26/19 6:47 AM, Pierre Morel wrote: > On 25/02/2019 19:36, Tony Krowiak wrote: > > On 2/22/19 10:29 AM, Pierre Morel wrote: > >> We prepare the interception of the PQAP/AQIC instruction for > >> the case the AQIC facility is enabled in the guest. > >> > >> We add a callback inside the KVM arch structure for s390 for > >> a VFIO driver to handle a specific response to the PQAP > >> instruction with the AQIC command. > >> > >> We inject the correct exceptions from inside KVM for the case the > >> callback is not initialized, which happens when the vfio_ap driver > >> is not loaded. > >> > >> If the callback has been setup we call it. > >> If not we setup an answer considering that no queue is available > >> for the guest when no callback has been setup. > >> > >> We do consider the responsability of the driver to always initialize > >> the PQAP callback if it defines queues by initializing the CRYCB for > >> a guest. > >> > >> Signed-off-by: Pierre Morel > > ...snip... > > >> @@ -592,6 +593,55 @@ static int handle_io_inst(struct kvm_vcpu *vcpu) > >> } > >> } > >> +/* > >> + * handle_pqap: Handling pqap interception > >> + * @vcpu: the vcpu having issue the pqap instruction > >> + * > >> + * We now support PQAP/AQIC instructions and we need to correctly > >> + * answer the guest even if no dedicated driver's hook is available. > >> + * > >> + * The intercepting code calls a dedicated callback for this > >> instruction > >> + * if a driver did register one in the CRYPTO satellite of the > >> + * SIE block. > >> + * > >> + * For PQAP/AQIC instructions only, verify privilege and > >> specifications. > >> + * > >> + * If no callback available, the queues are not available, return > >> this to > >> + * the caller. > >> + * Else return the value returned by the callback. > >> + */ > >> +static int handle_pqap(struct kvm_vcpu *vcpu) > >> +{ > >> + uint8_t fc; > >> + struct ap_queue_status status = {}; > >> + > >> + /* Verify that the AP instruction are available */ > >> + if (!ap_instructions_available()) > >> + return -EOPNOTSUPP; > > > > How can the guest even execute an AP instruction if the AP instructions > > are not available? If the AP instructions are not available on the host, > > they will not be available on the guest (i.e., CPU model feature > > S390_FEAT_AP will not be set). I suppose it doesn't hurt to check this > > here given QEMU may not be the only client. > > > >> + /* Verify that the guest is allowed to use AP instructions */ > >> + if (!(vcpu->arch.sie_block->eca & ECA_APIE)) > >> + return -EOPNOTSUPP; > >> + /* Verify that the function code is AQIC */ > >> + fc = vcpu->run->s.regs.gprs[0] >> 24; > >> + if (fc != 0x03) > >> + return -EOPNOTSUPP; > > > > You must have missed my suggestion to move this to the > > vcpu->kvm->arch.crypto.pqap_hook(vcpu) in the following responses: > > Please consider what happen if the vfio_ap module is not loaded. > >>> > >>> I have considered it and even verified my expectations empirically. If > >>> the vfio_ap module is not loaded, you will not be able to create an mdev > >>> device. > >> > >> OK, now please consider that another userland tool, not QEMU uses KVM. > > > > What does that have to do with loading the vfio_ap module? Without the > > vfio_ap module, there will be no AP devices for the guest. What are you > > suggesting here? > > > >> > >>> If you don't have an mdev device, you will not be able to > >>> start a guest with a vfio-ap device. If you start a guest without a > >>> vfio-ap device, but enable AP instructions for the guest, there will be > >>> no AP devices attached to the guest. Without any AP devices attached, > >>> the PQAP(AQIC) instructions will not ever get executed. > >> > >> This is not right. The instruction will be executed, eventually, after > >> decoding. > > > > Please explain why the PQAP(AQIC) instruction will be executed on a > > guest without any devices? Point me to the code in the AP bus where > > PQAP(AQIC) is executed without a queue? > > The host must be prepared to handle malicous and broken guests. So if > a guest does PQAP, we must handle that gracefully (e.g. by injecting an > exception) > Nod. > > > >> > >>> Even if for some > >>> unknown reason the PQAP(AQIC) instruction is executed - for some unknown > >>> reason, it will fail with response code 0x01, AP-queue number not valid. > >> > >> No, before accessing the AP-queue the instructio
Re: [PATCH v4 1/7] s390: ap: kvm: add PQAP interception for AQIC
On Thu, 28 Feb 2019 14:47:35 +0100 Pierre Morel wrote: > On 28/02/2019 14:44, Christian Borntraeger wrote: > > > > > > On 28.02.2019 14:23, Pierre Morel wrote: > >> On 28/02/2019 10:42, Christian Borntraeger wrote: > >>> > >>> > >>> On 27.02.2019 19:00, Tony Krowiak wrote: > On 2/27/19 3:09 AM, Pierre Morel wrote: > > On 26/02/2019 16:47, Tony Krowiak wrote: > >> On 2/26/19 6:47 AM, Pierre Morel wrote: > >>> On 25/02/2019 19:36, Tony Krowiak wrote: > On 2/22/19 10:29 AM, Pierre Morel wrote: > > We prepare the interception of the PQAP/AQIC instruction for > > the case the AQIC facility is enabled in the guest. > > > > We add a callback inside the KVM arch structure for s390 for > > a VFIO driver to handle a specific response to the PQAP > > instruction with the AQIC command. > > > > We inject the correct exceptions from inside KVM for the case the > > callback is not initialized, which happens when the vfio_ap driver > > is not loaded. > > > > If the callback has been setup we call it. > > If not we setup an answer considering that no queue is available > > for the guest when no callback has been setup. > > > > We do consider the responsability of the driver to always initialize > > the PQAP callback if it defines queues by initializing the CRYCB for > > a guest. > > > > Signed-off-by: Pierre Morel > >>> > >>> ...snip... > >>> > > @@ -592,6 +593,55 @@ static int handle_io_inst(struct kvm_vcpu > > *vcpu) > > } > > } > > +/* > > + * handle_pqap: Handling pqap interception > > + * @vcpu: the vcpu having issue the pqap instruction > > + * > > + * We now support PQAP/AQIC instructions and we need to correctly > > + * answer the guest even if no dedicated driver's hook is > > available. > > + * > > + * The intercepting code calls a dedicated callback for this > > instruction > > + * if a driver did register one in the CRYPTO satellite of the > > + * SIE block. > > + * > > + * For PQAP/AQIC instructions only, verify privilege and > > specifications. > > + * > > + * If no callback available, the queues are not available, return > > this to > > + * the caller. > > + * Else return the value returned by the callback. > > + */ > > +static int handle_pqap(struct kvm_vcpu *vcpu) > > +{ > > + uint8_t fc; > > + struct ap_queue_status status = {}; > > + > > + /* Verify that the AP instruction are available */ > > + if (!ap_instructions_available()) > > + return -EOPNOTSUPP; > > How can the guest even execute an AP instruction if the AP > instructions > are not available? If the AP instructions are not available on the > host, > they will not be available on the guest (i.e., CPU model feature > S390_FEAT_AP will not be set). I suppose it doesn't hurt to check > this > here given QEMU may not be the only client. > > > + /* Verify that the guest is allowed to use AP instructions */ > > + if (!(vcpu->arch.sie_block->eca & ECA_APIE)) > > + return -EOPNOTSUPP; > > + /* Verify that the function code is AQIC */ > > + fc = vcpu->run->s.regs.gprs[0] >> 24; > > + if (fc != 0x03) > > + return -EOPNOTSUPP; > > You must have missed my suggestion to move this to the > vcpu->kvm->arch.crypto.pqap_hook(vcpu) in the following responses: > >>> > >>> Please consider what happen if the vfio_ap module is not loaded. > >> > >> I have considered it and even verified my expectations empirically. If > >> the vfio_ap module is not loaded, you will not be able to create an > >> mdev device. > > > > OK, now please consider that another userland tool, not QEMU uses KVM. > > What does that have to do with loading the vfio_ap module? Without the > vfio_ap module, there will be no AP devices for the guest. What are you > suggesting here? > > > > >> If you don't have an mdev device, you will not be able to > >> start a guest with a vfio-ap device. If you start a guest without a > >> vfio-ap device, but enable AP instructions for the guest, there will be > >> no AP devices attached to the guest. Without any AP devices attached, > >> the PQAP(AQIC) instructions will not ever get executed. > > > > This is not right. The instruction will be executed, eventually, after > > decoding. > > Please explain why the PQAP(AQIC) instruction will be executed on a > gues
Re: [PATCH v4 0/7] vfio: ap: AP Queue Interrupt Control
On Fri, 22 Feb 2019 16:29:53 +0100 Pierre Morel wrote: > This patch implement PQAP/AQIC interception in KVM. > > To implement this we need to add a new structure, vfio_ap_queue,to be > able to retrieve the mediated device associated with a queue and specific > values needed to register/unregister the interrupt structures: > - APQN: to be able to issue the commands and search for queue structures > - NIB : to unpin the NIB on clear IRQ > - ISC : to unregister with the GIB interface > - MATRIX: a pointer to the matrix mediated device > - LIST: the list_head to handle the vfio_queue life cycle > > Having this structure and the list management greatly ease the handling > of the AP queues and diminues the LOCs needed in the vfio_ap driver by > more than 150 lines in comparison with the previous version. > > > 0) Queues life cycle > > vfio_ap_queues are created on probe > > We define one bucket on the matrix device to store the free vfio_ap_queues, > the queues not assign to any matrix mediated device. > > We define one bucket on each matrix mediated device to hold the > vfio_ap_queues belonging to it. > > vfio_ap_queues are deleted on remove > > This makes the search for a queue easy and the detection of assignent > incoherency obvious (the queue is not avilable) and simplifies assignment. > > > 1) Phase 1, probe and remove from vfio_ap_queue > > The vfio_ap_queue structures are dynamically allocated and setup > when a queue is probed by the ap_vfio_driver. > The vfio_ap_queue is linked to the ap_queue device as the driver data. > > The new The vfio_ap_queue is put on a free_list belonging to the > matrix device. > > The vfio_ap_queue are free during remove. > > > 2) Phase 2, assignment of vfio_ap_queue to a mediated device > > When a APID is assigned we look for APQI already assigned to > the matrix mediated device and associate all the queue with the > APQN = (APID,APQI) to the mediated device by adding them to > the mediated device queue list. > We do the same when a APQI is assigned. > > If any queue with a matching APQN can not be found on the matrix > device free list it means it is already associated to another matrix > mediated device and no queue is added to the matrix mediated device. > > 3) Phase 3, starting the guest > > When the VFIO device is opened the PQAP callback and a pointer to > the matrix mediated device are set inside KVM during the open callback. > > When the device is closed or if a queue is removed, the vfio_ap_queue is > dissociated from the mediated device. > > > 4) Phase 3 intercepting the PQAP/AQIC instruction > > On interception of the PQAP/AQIC instruction, the interception code > makes sure the pqap_hook is initialized and allowed to be called > and call it. > Otherwise it reports the usual -EOPNOTSUPP return code to let > QEMU handle the fault. > > the pqap callback search for the queue asociated with the APQN > stored in the register 0, setting the code to "illegal APQN" > if the vfio_ap_queue can not be found. > > Depending on the "i" bit of the register 1, the pqap callback > setup or clear the interruption by calling the host format PQAP/AQIC > instruction. > When seting up the interruption it uses the NIB and the guest ISC > provided by the guest and the host ISC provided by the registration > to the GIB code, pin the NIB and also stores ISC and NIB inside > the vfio_ap_queue structure. > When clearing the interrupt it retrieves the host ISC to unregister > with the GIB code and unpin the NIB. > > We take care when enabling GISA that the guest may have issued a > reset and will not need to disable the interuptions before > re-enabling interruptions. Please let us know what guarantees, that we will disable the interruptions we previously enabled using AQIC (and generally facilitate proper cleanup) *before* kvm_s390_gisa_destroy() makes the gisa and with that the IPM go away! Please note that IMHO this needs to be guaranteed by the kernel regardless of what userspace (QEMU) or the guest does. (I've asked this question before during our internal review but I could not find the answer if there was one after going trough my mails.) Regards, Halil
Re: [PATCH v4 1/7] s390: ap: kvm: add PQAP interception for AQIC
On Thu, 28 Feb 2019 15:12:16 +0100 Pierre Morel wrote: > On 28/02/2019 13:39, Halil Pasic wrote: > > On Thu, 28 Feb 2019 10:42:23 +0100 > > Christian Borntraeger wrote: [..] > >> Correct? > > > > IMHO mostly. > > > > I also doing the facility checks in kvm is easier, and I think this is > > something we can change later if needed without any major trouble. > > > > There are a couple of things I would do differently than Pierre does: > > 1) Do the PGM_PRIVILEGED_OP before the fc == 3 check. > > Idea was not to modify existing behavior for fc != 3 > > Also Christian already proposed to handle all FC codes. So in this idea, > this must be done as you say. > > > > > 2) Do the test_kvm_facility(vcpu->kvm, 65) check in the context of fc == > > 3. I.e. decide if this hook is about pqap or just about pqap aqic and > > make the code convey that decision to its reader. > > > > 3) I would most probably test if the queue is available by looking at the > > masks in CRYCB here. If not AP_RESPONSE_Q_NOT_AVAIL is what we need. > > This I do not agree with, it is typically the responsibility of the part > in charge of the virtualization to do this, also the vfio_driver. > See at 4) regarding the details. My guess is you disagree with checking CRYCB explicitly but don't digress with AP_RESPONSE_Q_NOT_AVAIL if APCB does not authorize the queue. Your idea was to infer APCB all zero from the fact that pqap_hook is NULL. If my assumption is right, then yes we can have an implicit coarse check here and a fine grained check in the client code (vfio_ap). > > > > 4) If we have APIE and queues authorized by the CRYCB (i.e. we have a > > vfio_ap module loaded an an mdev associated with the kvm) the callback > > not set (!(vcpu->kvm->arch.crypto.pqap_hook)) is a BUG! > > I do not agree with this either, the maintainers ;) will not allow this. After an offline discussion we came to the conclusion that I did not understand your code. Your train of thought was: !(vcpu->kvm->arch.crypto.pqap_hook) _implies_ APCB all zero (i.e. the masks in the CRYCB This is *why* you respond with AP_RESPONSE_Q_NOT_AVAIL. However if that is the case I would like that spelled out in a code comment at least. Furthermore setting pqap_hook and APCB needs to happen in the right sequence. Means client code (vfio_ap) may only set APCB after the qpap_hook has been set. Currently we have a race there (as you first do kvm_arch_crypto_set_masks and only then kvm->arch.crypto.pqap_hook. Furthermore I guess kvm->arch.crypto.pqap_hook needs to be set with the kvm lock held, which does not seem to be the case. > > > In that case > > lying that the queue is not available does not seem right. BTW this > > is something Pierre changed since the last version quietly (I can't > > recall a mention in the change log or somebody asking for this). If > > we want to be very pedantic about this bug scenario our best bet is > > probably response code 6. > > > RC 06 means "Invalid address of AP-queue notification byte" > > So you must have think about another code or I do not understand at > all what you mean. > I did not assume you decided to ignore the possibility of a programming error (which you at least technically did commit yourself) for what I described as a BUG. My train of thought was, if we are very pedantic we can make things work with degraded functionality in that case. I.e. without AP interrupts. For that we need to tell the guest something like: yes your queue is fine and there and all that but AQCI setup interrupts did not work. And RC 06 is the only RC I see being suitable to convey that. Detect and handle if the client code does not hold up their end of the bargain or just ignore the possibility is a design decision. But at least you should spell out your expectations against the client code. Regards, Halil
Re: [PATCH v4 4/7] vfio: ap: register IOMMU VFIO notifier
On Thu, 28 Feb 2019 09:48:39 +0100 Pierre Morel wrote: > On 28/02/2019 09:23, Christian Borntraeger wrote: > > On 22.02.2019 16:29, Pierre Morel wrote: > >> To be able to use the VFIO interface to facilitate the > >> mediated device memory pining/unpining we need to register > >> a notifier for IOMMU. > > > > You might want to add that while we start to pin one guest page for the > > interrupt indicator byte in the next patch, this is still ok with ballooning > > as this page will never be used by the guest virtio-balloon driver. So the > > pinned page will never be freed. And even a broken guest does so, that would > > not impact the host as the original page is still in control by vfio. > > > > Thanks, I ll do. > I recall a comment in qemu that says vfio-ap does not pin any pages. That one needs to be fixed up as well. Regards, Halil [..]
Re: [PATCH 1/7] s390: zcrypt: driver callback to indicate resource in use
On Mon, 15 Apr 2019 18:43:24 -0400 Tony Krowiak wrote: > On 4/15/19 2:59 PM, Halil Pasic wrote: > > On Mon, 15 Apr 2019 12:51:23 -0400 > > Tony Krowiak wrote: > > > >> Having said that, I understand your concern about a driver hogging > >> resources. I think I can provide a solution that serves both the > >> purpose of preventing problems associated with accidental removal > >> of AP resources as well as allowing root to remove them > >> forcefully. I'll work on that for v2. > > > > Can you tell us some more about this solution? Should we stop reviewing > > v1 because v2 is going to be different anyway? > > Patch 1 and 2 will be removed. There will not be a major design change > between these patches and v2. In order to avoid a long explanation of > my proposed changes, I'd prefer to state that the patch set will > establish and enforce the following rules: > > 1. An APQN can be assigned to an mdev device iff it is NOT > reserved for use by a zcrypt driver and is not assigned to > another mdev device. > > 2. Once an APQN is assigned to an mdev device, it will remain > assigned until it is explicitly unassigned. > > 3. A queue's APQN can be set in the guest's CRYCB iff the APQN is > assigned to the mdev device used by the guest; however, if the > queue is also in the host configuration (i.e., online), it MUST > also be bound to the vfio_ap device driver. > > 4. When a queue is bound to the vfio_ap driver and its APQN > is assigned to an mdev device in use by a guest, the guest will > be given access to the queue. > > 5. When a queue is unbound from the vfio_ap driver and its APQN > is assigned to an mdev device in use by the guest, access to > the card containing the queue will be removed from the guest. > Keep in mind that we can not deny access to a specific queue > due to the architecture (i.e., clearing a bit in the AQM > removes access to the queue for all adapters) > > 6. When an adapter is assigned to an mdev device that is in use > by a guest, the guest will be given access to the adapter. > > 7. When an adapter is unassigned from an mdev device that is in use > by a guest, access to the adapter will removed from the guest. > > 8. When a domain is assigned to an mdev device that is in use > by a guest, the guest will be given access to the domain. > > 9. When a domain is unassigned from an mdev device that is in use > by a guest, access to the domain will removed from the guest. > Based on our off-the-list chat and this list I think I know where are you heading :). I think it's actually the design that I currently prefer the most. But in that case, it may be wise to touch base with Reinhard -- AFAIR he was the strongest proponent of the 'do not let a[pq]mask changes take away queues from guests' design. Regards, Halil
Re: [PATCH v5 1/7] s390: ap: kvm: add PQAP interception for AQIC
On Fri, 15 Mar 2019 14:26:34 +0100 Pierre Morel wrote: > On 15/03/2019 11:20, Cornelia Huck wrote: > > On Wed, 13 Mar 2019 17:04:58 +0100 > > Pierre Morel wrote: > > > >> +/* > >> + * handle_pqap: Handling pqap interception > >> + * @vcpu: the vcpu having issue the pqap instruction > >> + * > >> + * We now support PQAP/AQIC instructions and we need to correctly > >> + * answer the guest even if no dedicated driver's hook is available. > >> + * > >> + * The intercepting code calls a dedicated callback for this instruction > >> + * if a driver did register one in the CRYPTO satellite of the > >> + * SIE block. > >> + * > >> + * For PQAP/AQIC instructions only, verify privilege and specifications. > >> + * > >> + * If no callback available, the queues are not available, return this to > >> + * the caller. > >> + * Else return the value returned by the callback. > >> + */ > >> +static int handle_pqap(struct kvm_vcpu *vcpu) > >> +{ > >> + uint8_t fc; > >> + struct ap_queue_status status = {}; > >> + int ret; > >> + /* Verify that the AP instruction are available */ > >> + if (!ap_instructions_available()) > >> + return -EOPNOTSUPP; > >> + /* Verify that the guest is allowed to use AP instructions */ > >> + if (!(vcpu->arch.sie_block->eca & ECA_APIE)) > >> + return -EOPNOTSUPP; > >> + /* Verify that the function code is AQIC */ > >> + fc = vcpu->run->s.regs.gprs[0] >> 24; > >> + /* We do not want to change the behavior we had before this patch*/ > >> + if (fc != 0x03) > >> + return -EOPNOTSUPP; > >> + > >> + /* PQAP instructions are allowed for guest kernel only */ > >> + if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE) > >> + return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP); > >> + /* AQIC instruction is allowed only if facility 65 is available */ > >> + if (!test_kvm_facility(vcpu->kvm, 65)) > >> + return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION); > >> + /* Verify that the hook callback is registered and call it */ > >> + if (vcpu->kvm->arch.crypto.pqap_hook) { > >> + if (!try_module_get(vcpu->kvm->arch.crypto.pqap_hook->owner)) > >> + return -EOPNOTSUPP; > >> + ret = vcpu->kvm->arch.crypto.pqap_hook->hook(vcpu); > >> + module_put(vcpu->kvm->arch.crypto.pqap_hook->owner); > >> + return ret; > >> + } > >> + /* > >> + * It is the duty of the vfio_driver to register a hook > >> + * If it does not and we get an exception on AQIC we must > >> + * guess that there is no vfio_ap_driver at all and no one > >> + * to handle the guests's CRYCB and the CRYCB is empty. > >> + */ > >> + status.response_code = 0x01; > > > > I'm still confused here, sorry. From previous discussions I recall that > > this indicates "no crypto device" (please correct me if I'm wrong.) > > > > Before this patch, we had: > > - guest issues PQAP/AQIC -> drop to userspace > > > > With a correct implementation, we get: > > - guest issues PQAP/AQIC -> callback does what needs to be done > > > > With an incorrect implementation (no callback), we get: > > - guest issues PQAP/AQIC -> guest gets response code 0x01 > > > > Why not drop to userspace in that case? > > This is what I had in the previous patches. > Hum, I do not remember which discussion lead me to modify this. > > Anyway, now that you put the finger on this problem, I think the problem > is worse. > > The behavior with old / new Linux, vfio driver and qemu is: > > LINUX VFIO_AP QEMUPGM > OLD x x OPERATION Isn't OPERATION a bad answer if ap=on? It should not happen with a well behaved guest because facility 65 is not indicated, but if it does, I guess we give the wrong answer. > NEW - OLD SPECIFICATION > NEW - NEW/aqic=offSPECIFICATION > NEW x NEW/aqic=on - > AFAICT with LINUX == NEW we get the correct answer. OPERATION exception is only good if ap=off. > x = whatever > - = absent/none > > So yes there is a change in behavior for the userland for the case QEMU > do not set the AQIC facility 65, OLD QEMU or NEW QEMU wanting to behave > like an older one. > > I fear we have the same problem with the privileged operation... > IMHO this boils down to: * either OLD QEMU or * OLD LINUX should have taken care of handling the mandatory intercept for PQAP/AQIC if ap=on (i.e. guest has AP instructions), and does not have facility 65 which was the case for OLD. Things get complicated when one considers that ECA.28 is an effective control. > For the last case, when the kvm_facility(65) is set, the explication is > the following: > > This is related to the handling of PQAP AQIC which is now authorized by > this patch series. > If we authorize PQAP AQIC, by setting the bit for facility 65, the guest > can use this instruction. > If the instruction follows the specifications we must answer something > realistic and since there is nothing in the CRYCB (no driver) we an
Re: [PATCH v5 1/7] s390: ap: kvm: add PQAP interception for AQIC
On Fri, 15 Mar 2019 15:10:25 +0100 Pierre Morel wrote: > Sorry, wrong conclusion, handling this in userland will bring us much > too far if we want to answer correctly for the case the hook is not > there but QEMU accepted the facility for AQIC. > > The alternative is easier, we just continue to respond with the > OPERATION exception here and only handle the specification and > privileged exception cases in QEMU and in the hook. I don't quite understand what do you mean by this paragraph. Especially not what do you mean by 'just continue to respond with the OPERATION exception here'. In any case if the guest is supposed to have ap instructions, and does not have facility 65 the right answer is specification and not operation exception. And this has to work regardless of vfio-ap module loaded or not. Regards, Halil > > So, I think the discussion will go on until you come back :) > > Regards, > Pierre
Re: [PATCH v6 01/13] KVM: s390: drop obsolete else path
On Thu, 24 Jan 2019 13:59:27 +0100 Michael Mueller wrote: > The explicit else path specified in set_intercept_indicators_io > is not required as the function returns in case the first branch > is taken anyway. > > Signed-off-by: Michael Mueller Reviewed-by: Halil Pasic > --- > arch/s390/kvm/interrupt.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c > index fcb55b02990e..19d556512452 100644 > --- a/arch/s390/kvm/interrupt.c > +++ b/arch/s390/kvm/interrupt.c > @@ -345,7 +345,7 @@ static void set_intercept_indicators_io(struct kvm_vcpu > *vcpu) > { > if (!(pending_irqs_no_gisa(vcpu) & IRQ_PEND_IO_MASK)) > return; > - else if (psw_ioint_disabled(vcpu)) > + if (psw_ioint_disabled(vcpu)) > kvm_s390_set_cpuflags(vcpu, CPUSTAT_IO_INT); > else > vcpu->arch.sie_block->lctl |= LCTL_CR6;
Re: [PATCH v6 02/13] KVM: s390: make bitmap declaration consitent
On Thu, 24 Jan 2019 13:59:28 +0100 Michael Mueller wrote: > Use a consistent bitmap declaration throughout the code. > > Signed-off-by: Michael Mueller Reviewed-by: Halil Pasic > --- > arch/s390/include/asm/kvm_host.h | 2 +- > arch/s390/kvm/interrupt.c| 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/s390/include/asm/kvm_host.h > b/arch/s390/include/asm/kvm_host.h > index d5d24889c3bc..3cba08f73dc6 100644 > --- a/arch/s390/include/asm/kvm_host.h > +++ b/arch/s390/include/asm/kvm_host.h > @@ -591,7 +591,7 @@ struct kvm_s390_float_interrupt { > struct kvm_s390_mchk_info mchk; > struct kvm_s390_ext_info srv_signal; > int next_rr_cpu; > - unsigned long idle_mask[BITS_TO_LONGS(KVM_MAX_VCPUS)]; > + DECLARE_BITMAP(idle_mask, KVM_MAX_VCPUS); > struct mutex ais_lock; > u8 simm; > u8 nimm; > diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c > index 19d556512452..167a3068ef84 100644 > --- a/arch/s390/kvm/interrupt.c > +++ b/arch/s390/kvm/interrupt.c > @@ -2831,7 +2831,7 @@ static void store_local_irq(struct > kvm_s390_local_interrupt *li, > int kvm_s390_get_irq_state(struct kvm_vcpu *vcpu, __u8 __user *buf, int len) > { > int scn; > - unsigned long sigp_emerg_pending[BITS_TO_LONGS(KVM_MAX_VCPUS)]; > + DECLARE_BITMAP(sigp_emerg_pending, KVM_MAX_VCPUS); > struct kvm_s390_local_interrupt *li = &vcpu->arch.local_int; > unsigned long pending_irqs; > struct kvm_s390_irq irq;
Re: [PATCH v6 04/13] KVM: s390: coding style kvm_s390_gisa_init/clear()
On Thu, 24 Jan 2019 13:59:30 +0100 Michael Mueller wrote: > The change helps to reduce line length and > increases code readability. > > Signed-off-by: Michael Mueller > Reviewed-by: Cornelia Huck > Reviewed-by: Pierre Morel Reviewed-by: Halil Pasic > --- > arch/s390/kvm/interrupt.c | 20 ++-- > 1 file changed, 10 insertions(+), 10 deletions(-) > > diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c > index 2a3eb9f076c3..005dbe7252e7 100644 > --- a/arch/s390/kvm/interrupt.c > +++ b/arch/s390/kvm/interrupt.c > @@ -2885,20 +2885,20 @@ int kvm_s390_get_irq_state(struct kvm_vcpu *vcpu, > __u8 __user *buf, int len) > > void kvm_s390_gisa_clear(struct kvm *kvm) > { > - if (kvm->arch.gisa) { > - memset(kvm->arch.gisa, 0, sizeof(struct kvm_s390_gisa)); > - kvm->arch.gisa->next_alert = (u32)(u64)kvm->arch.gisa; > - VM_EVENT(kvm, 3, "gisa 0x%pK cleared", kvm->arch.gisa); > - } > + if (!kvm->arch.gisa) > + return; > + memset(kvm->arch.gisa, 0, sizeof(struct kvm_s390_gisa)); > + kvm->arch.gisa->next_alert = (u32)(u64)kvm->arch.gisa; > + VM_EVENT(kvm, 3, "gisa 0x%pK cleared", kvm->arch.gisa); > } > > void kvm_s390_gisa_init(struct kvm *kvm) > { > - if (css_general_characteristics.aiv) { > - kvm->arch.gisa = &kvm->arch.sie_page2->gisa; > - VM_EVENT(kvm, 3, "gisa 0x%pK initialized", kvm->arch.gisa); > - kvm_s390_gisa_clear(kvm); > - } > + if (!css_general_characteristics.aiv) > + return; > + kvm->arch.gisa = &kvm->arch.sie_page2->gisa; > + kvm_s390_gisa_clear(kvm); > + VM_EVENT(kvm, 3, "gisa 0x%pK initialized", kvm->arch.gisa); > } > > void kvm_s390_gisa_destroy(struct kvm *kvm)
Re: [PATCH v6 06/13] KVM: s390: remove kvm_s390_ from gisa static inline functions
On Thu, 24 Jan 2019 13:59:32 +0100 Michael Mueller wrote: > This will shorten the length of code lines. All GISA related > static inline functions are local to interrupt.c. > > Signed-off-by: Michael Mueller Reviewed-by: Halil Pasic [..]
Re: [PATCH v6 05/13] KVM: s390: use pending_irqs_no_gisa() where appropriate
On Thu, 24 Jan 2019 13:59:31 +0100 Michael Mueller wrote: > Interruption types that are not represented in GISA shall > use pending_irqs_no_gisa() to test pending interruptions. > > Signed-off-by: Michael Mueller > Reviewed-by: Cornelia Huck > Reviewed-by: Pierre Morel I guess this is just a tiny optimization. Reviewed-by: Halil Pasic > --- > arch/s390/kvm/interrupt.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c > index 005dbe7252e7..cb48736867ed 100644 > --- a/arch/s390/kvm/interrupt.c > +++ b/arch/s390/kvm/interrupt.c > @@ -353,7 +353,7 @@ static void set_intercept_indicators_io(struct kvm_vcpu > *vcpu) > > static void set_intercept_indicators_ext(struct kvm_vcpu *vcpu) > { > - if (!(pending_irqs(vcpu) & IRQ_PEND_EXT_MASK)) > + if (!(pending_irqs_no_gisa(vcpu) & IRQ_PEND_EXT_MASK)) > return; > if (psw_extint_disabled(vcpu)) > kvm_s390_set_cpuflags(vcpu, CPUSTAT_EXT_INT); > @@ -363,7 +363,7 @@ static void set_intercept_indicators_ext(struct kvm_vcpu > *vcpu) > > static void set_intercept_indicators_mchk(struct kvm_vcpu *vcpu) > { > - if (!(pending_irqs(vcpu) & IRQ_PEND_MCHK_MASK)) > + if (!(pending_irqs_no_gisa(vcpu) & IRQ_PEND_MCHK_MASK)) > return; > if (psw_mchk_disabled(vcpu)) > vcpu->arch.sie_block->ictl |= ICTL_LPSW;
Re: [PATCH v6 06/13] KVM: s390: remove kvm_s390_ from gisa static inline functions
On Thu, 24 Jan 2019 13:59:32 +0100 Michael Mueller wrote: > Tags: inv, patch, s390 > From: Michael Mueller > To: KVM Mailing List > Cc: Linux-S390 Mailing List , > linux-kernel@vger.kernel.org, Martin Schwidefsky , > Heiko Carstens , Christian Borntraeger > , Janosch Frank , David > Hildenbrand , Cornelia Huck , Halil > Pasic , Pierre Morel , Michael > Mueller > Subject: [PATCH v6 06/13] KVM: s390: remove kvm_s390_ from gisa static inline > functions > Date: Thu, 24 Jan 2019 13:59:32 +0100 > X-Mailer: git-send-email 2.13.4 > > This will shorten the length of code lines. All GISA related > static inline functions are local to interrupt.c. > > Signed-off-by: Michael Mueller Reviewed-by: Halil Pasic
Re: [PATCH v6 07/13] KVM: s390: introduce struct kvm_s390_gisa_interrupt
On Thu, 24 Jan 2019 13:59:33 +0100 Michael Mueller wrote: > Use this struct analog to the kvm interruption structs > for kvm emulated floating and local interruptions. > Further fields will be added with this series as > required. > > Signed-off-by: Michael Mueller While looking at this I was asking myself what guards against invalid gisa pointer dereference e.g. when pending_irqs() is called (see below). AFAIU we set up gisa_int.origin only if we have css_general_characteristics.aiv. Opinions? Anyway if it is a problem, it is a pre-existing one (should be fixed ASAP but does not affect the validity of this patch). Reviewed-by: Halil Pasic > --- > arch/s390/include/asm/kvm_host.h | 6 - > arch/s390/kvm/interrupt.c| 52 > +++- > arch/s390/kvm/kvm-s390.c | 2 +- > 3 files changed, 36 insertions(+), 24 deletions(-) > > diff --git a/arch/s390/include/asm/kvm_host.h > b/arch/s390/include/asm/kvm_host.h > index c259a67c4298..0941571d34eb 100644 > --- a/arch/s390/include/asm/kvm_host.h > +++ b/arch/s390/include/asm/kvm_host.h > @@ -803,6 +803,10 @@ struct kvm_s390_vsie { > struct page *pages[KVM_MAX_VCPUS]; > }; > [..] > diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c > index 942cc7d33766..ee91d1de0036 100644 > --- a/arch/s390/kvm/interrupt.c > +++ b/arch/s390/kvm/interrupt.c > @@ -246,7 +246,8 @@ static inline unsigned long pending_irqs_no_gisa(struct > kvm_vcpu *vcpu) > static inline unsigned long pending_irqs(struct kvm_vcpu *vcpu) > { > return pending_irqs_no_gisa(vcpu) | > - gisa_get_ipm(vcpu->kvm->arch.gisa) << IRQ_PEND_IO_ISC_7; > + gisa_get_ipm(vcpu->kvm->arch.gisa_int.origin) << Unconditional call to gisa_get_ipm(), and get ipm just accesses ->ipm. > + IRQ_PEND_IO_ISC_7; > } > [..] > void kvm_s390_gisa_init(struct kvm *kvm) > { > + struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int; > + > if (!css_general_characteristics.aiv) > return; > - kvm->arch.gisa = &kvm->arch.sie_page2->gisa; > + gi->origin = &kvm->arch.sie_page2->gisa; Should stay NULL if !css_general_characteristics.aiv. > kvm_s390_gisa_clear(kvm); > - VM_EVENT(kvm, 3, "gisa 0x%pK initialized", kvm->arch.gisa); > + VM_EVENT(kvm, 3, "gisa 0x%pK initialized", gi->origin); > } >
Re: [PATCH v6 09/13] KVM: s390: add the GIB and its related life-cyle functions
On Thu, 24 Jan 2019 13:59:35 +0100 Michael Mueller wrote: > The Guest Information Block (GIB) links the GISA of all guests > that have adapter interrupts pending. These interrupts cannot be > delivered because all vcpus of these guests are currently in WAIT > state or have masked the respective Innterruption Sub Class (ISC). > If enabled, a GIB alert is issued on the host to schedule these > guests to run suitable vcpus to consume the pending interruptions. > > This mechanism allows to process adapter interrupts for currently > not running guests. > > The GIB is created during host initialization and associated with > the Adapter Interruption Facility in case an Adapter Interruption > Virtualization Facility is available. > > The GIB initialization and thus the activation of the related code > will be done in an upcoming patch of this series. > > Signed-off-by: Michael Mueller > Reviewed-by: Janosch Frank > Reviewed-by: Christian Borntraeger > Reviewed-by: Cornelia Huck > Reviewed-by: Pierre Morel I don't agree with the commit message to 100%, but I'd rather not start a discussion. Acked-by: Halil Pasic [..]
Re: [PATCH v6 11/13] KVM: s390: add functions to (un)register GISC with GISA
On Thu, 24 Jan 2019 13:59:37 +0100 Michael Mueller wrote: > Add the Interruption Alert Mask (IAM) to the architecture specific > kvm struct. This mask in the GISA is used to define for which ISC > a GIB alert will be issued. > > The functions kvm_s390_gisc_register() and kvm_s390_gisc_unregister() > are used to (un)register a GISC (guest ISC) with a virtual machine and > its GISA. > > Upon successful completion, kvm_s390_gisc_register() returns the > ISC to be used for GIB alert interruptions. A negative return code > indicates an error during registration. > > Theses functions will be used by other adapter types like AP and PCI to > request pass-through interruption support. I'm not sure this interface is going to to fit PCI that well. But IMHO no reason to delay the whole series -- we can think about zPCI later. Same goes for some of the names. Another idea for later would be to sanity check in gisa destroy that alert.mask is back to all zero -- to catch any corresponding driver bugs. Acked-by: Halil Pasic > > Signed-off-by: Michael Mueller [..] > static inline void gisa_set_ipm_gisc(struct kvm_s390_gisa *gisa, u32 gisc) > { > set_bit_inv(IPM_BIT_OFFSET + gisc, (unsigned long *) gisa); > @@ -2911,6 +2938,8 @@ void kvm_s390_gisa_init(struct kvm *kvm) > if (!css_general_characteristics.aiv) > return; > gi->origin = &kvm->arch.sie_page2->gisa; > + gi->alert.mask = 0; I don't think this is necessary. Otherwise you would need to zero the alert.ref[] too, or? Regards, Halil > + spin_lock_init(&gi->alert.ref_lock); > kvm_s390_gisa_clear(kvm); > VM_EVENT(kvm, 3, "gisa 0x%pK initialized", gi->origin); > }
Re: [PATCH 1/1] s390/vfio-ap: fix circular lockdep when setting/clearing crypto masks
On Thu, 11 Feb 2021 09:21:26 -0500 Tony Krowiak wrote: > Yes, it makes sense. I guess I didn't look closely at your > suggestion when I said it was exactly what I implemented > after agreeing with Connie. I had a slight difference in > my implementation: > > static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev) > { > struct kvm *kvm; > > mutex_lock(&matrix_dev->lock); > > if (matrix_mdev->kvm) { > kvm = matrix_mdev->kvm; > mutex_unlock(&matrix_dev->lock); The problem with this one is that as soon as we drop the lock here, another thread can in theory execute the critical section below, which drops our reference to kvm via kvm_put_kvm(kvm). Thus when we enter kvm_arch_crypto_clear_mask(), even if we are guaranteed to have a non-null pointer, the pointee is not guaranteed to be around. So like Connie suggested, you better take another reference to kvm in the first critical section. Regards, Halil > kvm_arch_crypto_clear_masks(kvm); > mutex_lock(&matrix_dev->lock); > kvm->arch.crypto.pqap_hook = NULL; > vfio_ap_mdev_reset_queues(matrix_mdev->mdev); > matrix_mdev->kvm = NULL; > kvm_put_kvm(kvm); > } > > mutex_unlock(&matrix_dev->lock); > }
Re: [PATCH v2 1/1] s390/vfio-ap: fix circular lockdep when setting/clearing crypto masks
On Mon, 15 Feb 2021 20:15:47 -0500 Tony Krowiak wrote: > This patch fixes a circular locking dependency in the CI introduced by > commit f21916ec4826 ("s390/vfio-ap: clean up vfio_ap resources when KVM > pointer invalidated"). The lockdep only occurs when starting a Secure > Execution guest. Crypto virtualization (vfio_ap) is not yet supported for > SE guests; however, in order to avoid CI errors, this fix is being > provided. > > The circular lockdep was introduced when the masks in the guest's APCB > were taken under the matrix_dev->lock. While the lock is definitely > needed to protect the setting/unsetting of the KVM pointer, it is not > necessarily critical for setting the masks, so this will not be done under > protection of the matrix_dev->lock. With the one little thing I commented on below addressed: Acked-by: Halil Pasic This solution probably ain't a perfect one, but can't say I see a simple way to get around this problem. For instance I played with the thought of taking locks in a different order and keeping the critical sections intact, but that has problems of its own. Tony should have the best understanding of vfio_ap anyway. In theory the execution of vfio_ap_mdev_group_notifier() and vfio_ap_mdev_release() could interleave, and we could loose a clear because in theory some permutations of the critical sections need to be considered. In practice I hope that won't happen with QEMU. Tony, you gave this a decent amount of testing or? I think we should move forward with this. Any objections? > > Fixes: f21916ec4826 ("s390/vfio-ap: clean up vfio_ap resources when KVM > pointer invalidated") > Cc: sta...@vger.kernel.org > Signed-off-by: Tony Krowiak > --- > drivers/s390/crypto/vfio_ap_ops.c | 119 +- > 1 file changed, 84 insertions(+), 35 deletions(-) > > diff --git a/drivers/s390/crypto/vfio_ap_ops.c > b/drivers/s390/crypto/vfio_ap_ops.c > index 41fc2e4135fe..8574b6ecc9c5 100644 > --- a/drivers/s390/crypto/vfio_ap_ops.c > +++ b/drivers/s390/crypto/vfio_ap_ops.c > @@ -1027,8 +1027,21 @@ static const struct attribute_group > *vfio_ap_mdev_attr_groups[] = { > * @matrix_mdev: a mediated matrix device > * @kvm: reference to KVM instance > * > - * Verifies no other mediated matrix device has @kvm and sets a reference to > - * it in @matrix_mdev->kvm. > + * Sets all data for @matrix_mdev that are needed to manage AP resources > + * for the guest whose state is represented by @kvm: > + * 1. Verifies no other mediated device has a reference to @kvm. > + * 2. Increments the ref count for @kvm so it doesn't disappear until the > + *vfio_ap driver is notified the pointer is being nullified. > + * 3. Sets a reference to the PQAP hook (i.e., handle_pqap() function) into > + *@kvm to handle interception of the PQAP(AQIC) instruction. > + * 4. Sets the masks supplying the AP configuration to the KVM guest. > + * 5. Sets the KVM pointer into @kvm so the vfio_ap driver can access it. > + * Could for example a PQAP AQIC run across an unset matrix_mdev->kvm like this, in theory? I don't think it's likely to happen in the wild though. Why not set it up before setting the mask? > + * Note: The matrix_dev->lock must be taken prior to calling > + * this function; however, the lock will be temporarily released to avoid a > + * potential circular lock dependency with other asynchronous processes that > + * lock the kvm->lock mutex which is also needed to supply the guest's AP > + * configuration. > * > * Return 0 if no other mediated matrix device has a reference to @kvm; > * otherwise, returns an -EPERM. > @@ -1043,9 +1056,17 @@ static int vfio_ap_mdev_set_kvm(struct ap_matrix_mdev > *matrix_mdev, > return -EPERM; > } > > - matrix_mdev->kvm = kvm; > - kvm_get_kvm(kvm); > - kvm->arch.crypto.pqap_hook = &matrix_mdev->pqap_hook; > + if (kvm->arch.crypto.crycbd) { > + kvm_get_kvm(kvm); > + kvm->arch.crypto.pqap_hook = &matrix_mdev->pqap_hook; > + mutex_unlock(&matrix_dev->lock); > + kvm_arch_crypto_set_masks(kvm, > + matrix_mdev->matrix.apm, > + matrix_mdev->matrix.aqm, > + matrix_mdev->matrix.adm); > + mutex_lock(&matrix_dev->lock); > + matrix_mdev->kvm = kvm; > + } > > return 0; > } > @@ -1079,51 +1100,80 @@ static int vfio_ap_mdev_iommu_notifier(struct > notifier_block *nb, > return NOTIFY_DONE; > } > > +/** > + * vfio_ap_md
Re: [PATCH 1/1] s390/vfio-ap: fix circular lockdep when setting/clearing crypto masks
On Wed, 10 Feb 2021 11:53:34 +0100 Cornelia Huck wrote: > On Tue, 9 Feb 2021 14:48:30 -0500 > Tony Krowiak wrote: > > > This patch fixes a circular locking dependency in the CI introduced by > > commit f21916ec4826 ("s390/vfio-ap: clean up vfio_ap resources when KVM > > pointer invalidated"). The lockdep only occurs when starting a Secure > > Execution guest. Crypto virtualization (vfio_ap) is not yet supported for > > SE guests; however, in order to avoid CI errors, this fix is being > > provided. > > > > The circular lockdep was introduced when the masks in the guest's APCB > > were taken under the matrix_dev->lock. While the lock is definitely > > needed to protect the setting/unsetting of the KVM pointer, it is not > > necessarily critical for setting the masks, so this will not be done under > > protection of the matrix_dev->lock. > > > > Fixes: f21916ec4826 ("s390/vfio-ap: clean up vfio_ap resources when KVM > > pointer invalidated") > > Cc: sta...@vger.kernel.org > > Signed-off-by: Tony Krowiak > > --- > > drivers/s390/crypto/vfio_ap_ops.c | 75 ++- > > 1 file changed, 45 insertions(+), 30 deletions(-) > > > > > static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev) > > { > > - kvm_arch_crypto_clear_masks(matrix_mdev->kvm); > > - matrix_mdev->kvm->arch.crypto.pqap_hook = NULL; > > - vfio_ap_mdev_reset_queues(matrix_mdev->mdev); > > - kvm_put_kvm(matrix_mdev->kvm); > > - matrix_mdev->kvm = NULL; > > + if (matrix_mdev->kvm) { > > If you're doing setting/unsetting under matrix_dev->lock, is it > possible that matrix_mdev->kvm gets unset between here and the next > line, as you don't hold the lock? > > Maybe you could > - grab a reference to kvm while holding the lock > - call the mask handling functions with that kvm reference > - lock again, drop the reference, and do the rest of the processing? I agree, matrix_mdev->kvm can go NULL any time and we are risking a null pointer dereference here. Another idea would be to do static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev) { struct kvm *kvm; mutex_lock(&matrix_dev->lock); if (matrix_mdev->kvm) { kvm = matrix_mdev->kvm; matrix_mdev->kvm = NULL; mutex_unlock(&matrix_dev->lock); kvm_arch_crypto_clear_masks(kvm); mutex_lock(&matrix_dev->lock); matrix_mdev->kvm->arch.crypto.pqap_hook = NULL; vfio_ap_mdev_reset_queues(matrix_mdev->mdev); kvm_put_kvm(kvm); } mutex_unlock(&matrix_dev->lock); } That way only one unset would actually do the unset and cleanup and every other invocation would bail out with only checking matrix_mdev->kvm. > > + kvm_arch_crypto_clear_masks(matrix_mdev->kvm); > > + mutex_lock(&matrix_dev->lock); > > + matrix_mdev->kvm->arch.crypto.pqap_hook = NULL; > > + vfio_ap_mdev_reset_queues(matrix_mdev->mdev); > > + kvm_put_kvm(matrix_mdev->kvm); > > + matrix_mdev->kvm = NULL; > > + mutex_unlock(&matrix_dev->lock); > > + } > > } >
Re: [PATCH 1/1] s390/vfio-ap: fix circular lockdep when setting/clearing crypto masks
On Wed, 10 Feb 2021 16:24:29 +0100 Halil Pasic wrote: > > Maybe you could > > - grab a reference to kvm while holding the lock > > - call the mask handling functions with that kvm reference > > - lock again, drop the reference, and do the rest of the processing? > > I agree, matrix_mdev->kvm can go NULL any time and we are risking > a null pointer dereference here. > > Another idea would be to do > > > static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev) > > { > > struct kvm *kvm; > > mutex_lock(&matrix_dev->lock); > > if (matrix_mdev->kvm) { > > kvm = matrix_mdev->kvm; > > matrix_mdev->kvm = NULL; > > mutex_unlock(&matrix_dev->lock); > > kvm_arch_crypto_clear_masks(kvm); > > mutex_lock(&matrix_dev->lock); > > matrix_mdev->kvm->arch.crypto.pqap_hook = NULL; s/matrix_mdev->kvm/kvm > vfio_ap_mdev_reset_queues(matrix_mdev->mdev); > > kvm_put_kvm(kvm); > > } > > mutex_unlock(&matrix_dev->lock); > > } > > That way only one unset would actually do the unset and cleanup > and every other invocation would bail out with only checking > matrix_mdev->kvm. But the problem with that is that we enable the the assign/unassign prematurely, which could interfere wit reset_queues(). Forget about it.
Re: [PATCH 1/1] s390/vfio-ap: fix circular lockdep when setting/clearing crypto masks
On Wed, 10 Feb 2021 17:05:48 -0500 Tony Krowiak wrote: > On 2/10/21 10:32 AM, Halil Pasic wrote: > > On Wed, 10 Feb 2021 16:24:29 +0100 > > Halil Pasic wrote: > > > >>> Maybe you could > >>> - grab a reference to kvm while holding the lock > >>> - call the mask handling functions with that kvm reference > >>> - lock again, drop the reference, and do the rest of the processing? > >> I agree, matrix_mdev->kvm can go NULL any time and we are risking > >> a null pointer dereference here. > >> > >> Another idea would be to do > >> > >> > >> static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev) > >> { > >> struct kvm *kvm; > >> > >> mutex_lock(&matrix_dev->lock); > >> if (matrix_mdev->kvm) { > >> kvm = matrix_mdev->kvm; > >> matrix_mdev->kvm = NULL; > >> mutex_unlock(&matrix_dev->lock); > >> kvm_arch_crypto_clear_masks(kvm); > >> mutex_lock(&matrix_dev->lock); > >> matrix_mdev->kvm->arch.crypto.pqap_hook = NULL; > > s/matrix_mdev->kvm/kvm > >> vfio_ap_mdev_reset_queues(matrix_mdev->mdev); > >> kvm_put_kvm(kvm); > >> } > >> mutex_unlock(&matrix_dev->lock); > >> } > >> > >> That way only one unset would actually do the unset and cleanup > >> and every other invocation would bail out with only checking > >> matrix_mdev->kvm. > > But the problem with that is that we enable the the assign/unassign > > prematurely, which could interfere wit reset_queues(). Forget about > > it. > > Not sure what you mean by this. > > I mean because above I first do (1) matrix_mdev->kvm = NULL; and then do (2) vfio_ap_mdev_reset_queues(matrix_mdev->mdev); another thread could do static ssize_t unassign_adapter_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { int ret; unsigned long apid; struct mdev_device *mdev = mdev_from_dev(dev); struct ap_matrix_mdev *matrix_mdev = mdev_get_drvdata(mdev); /* If the guest is running, disallow un-assignment of adapter */ if (matrix_mdev->kvm) return -EBUSY; ... } between (1) and (2), and we would not bail out with -EBUSY because !!kvm because of (1). That means we would change matrix_mdev->matrix and we would not reset the queues that correspond to the apid that was just removed, because by the time we do the reset_queues, the queues are not in the matrix_mdev->matrix any more. Does that make sense?