On 11.07.2013, at 12:54, Alexey Kardashevskiy wrote: > On 07/11/2013 08:11 PM, Alexander Graf wrote: >> >> On 11.07.2013, at 07:12, Alexey Kardashevskiy wrote: >> >>> On 07/10/2013 08:05 PM, Alexander Graf wrote: >>>> >>>> On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote: >>>> >>>>> On 07/10/2013 03:02 AM, Alexander Graf wrote: >>>>>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote: >>>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and >>>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO >>>>>>> devices or emulated PCI. These calls allow adding multiple entries >>>>>>> (up to 512) into the TCE table in one call which saves time on >>>>>>> transition to/from real mode. >>>>>> >>>>>> We don't mention QEMU explicitly in KVM code usually. >>>>>> >>>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs >>>>>>> (copied from user and verified) before writing the whole list into >>>>>>> the TCE table. This cache will be utilized more in the upcoming >>>>>>> VFIO/IOMMU support to continue TCE list processing in the virtual >>>>>>> mode in the case if the real mode handler failed for some reason. >>>>>>> >>>>>>> This adds a guest physical to host real address converter >>>>>>> and calls the existing H_PUT_TCE handler. The converting function >>>>>>> is going to be fully utilized by upcoming VFIO supporting patches. >>>>>>> >>>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability, >>>>>>> so in order to support the functionality of this patch, QEMU >>>>>>> needs to query for this capability and set the "hcall-multi-tce" >>>>>>> hypertas property only if the capability is present, otherwise >>>>>>> there will be serious performance degradation. >>>>>> >>>>>> Same as above. But really you're only giving recommendations here. What's >>>>>> the point? Please describe what the benefit of this patch is, not what >>>>>> some >>>>>> other random subsystem might do with the benefits it brings. >>>>>> >>>>>>> >>>>>>> Signed-off-by: Paul Mackerras<pau...@samba.org> >>>>>>> Signed-off-by: Alexey Kardashevskiy<a...@ozlabs.ru> >>>>>>> >>>>>>> --- >>>>>>> Changelog: >>>>>>> 2013/07/06: >>>>>>> * fixed number of wrong get_page()/put_page() calls >>>>>>> >>>>>>> 2013/06/27: >>>>>>> * fixed clear of BUSY bit in kvmppc_lookup_pte() >>>>>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now >>>>>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64 >>>>>>> * updated doc >>>>>>> >>>>>>> 2013/06/05: >>>>>>> * fixed mistype about IBMVIO in the commit message >>>>>>> * updated doc and moved it to another section >>>>>>> * changed capability number >>>>>>> >>>>>>> 2013/05/21: >>>>>>> * added kvm_vcpu_arch::tce_tmp >>>>>>> * removed cleanup if put_indirect failed, instead we do not even start >>>>>>> writing to TCE table if we cannot get TCEs from the user and they are >>>>>>> invalid >>>>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce >>>>>>> and kvmppc_emulated_validate_tce (for the previous item) >>>>>>> * fixed bug with failthrough for H_IPI >>>>>>> * removed all get_user() from real mode handlers >>>>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) >>>>>>> >>>>>>> Signed-off-by: Alexey Kardashevskiy<a...@ozlabs.ru> >>>>>>> --- >>>>>>> Documentation/virtual/kvm/api.txt | 25 +++ >>>>>>> arch/powerpc/include/asm/kvm_host.h | 9 ++ >>>>>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +- >>>>>>> arch/powerpc/kvm/book3s_64_vio.c | 154 ++++++++++++++++++- >>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 260 >>>>>>> ++++++++++++++++++++++++++++---- >>>>>>> arch/powerpc/kvm/book3s_hv.c | 41 ++++- >>>>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + >>>>>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++- >>>>>>> arch/powerpc/kvm/powerpc.c | 3 + >>>>>>> 9 files changed, 517 insertions(+), 34 deletions(-) >>>>>>> >>>>>>> diff --git a/Documentation/virtual/kvm/api.txt >>>>>>> b/Documentation/virtual/kvm/api.txt >>>>>>> index 6365fef..762c703 100644 >>>>>>> --- a/Documentation/virtual/kvm/api.txt >>>>>>> +++ b/Documentation/virtual/kvm/api.txt >>>>>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be >>>>>>> passed >>>>>>> to userspace to be >>>>>>> handled. >>>>>>> >>>>>>> >>>>>>> +4.86 KVM_CAP_PPC_MULTITCE >>>>>>> + >>>>>>> +Capability: KVM_CAP_PPC_MULTITCE >>>>>>> +Architectures: ppc >>>>>>> +Type: vm >>>>>>> + >>>>>>> +This capability means the kernel is capable of handling hypercalls >>>>>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user >>>>>>> +space. This significanly accelerates DMA operations for PPC KVM guests. >>>>>> >>>>>> significanly? Please run this through a spell checker. >>>>>> >>>>>>> +The user space should expect that its handlers for these hypercalls >>>>>> >>>>>> s/The// >>>>>> >>>>>>> +are not going to be called. >>>>>> >>>>>> Is user space guaranteed they will not be called? Or can it still happen? >>>>> >>>>> ... if user space previously registered LIOBN in KVM (via >>>>> KVM_CREATE_SPAPR_TCE or similar calls). >>>>> >>>>> ok? >>>> >>>> How about this? >>>> >>>> The hypercalls mentioned above may or may not be processed successfully in >>>> the kernel based fast path. If they can not be handled by the kernel, they >>>> will get passed on to user space. So user space still has to have an >>>> implementation for these despite the in kernel acceleration. >>>> >>>> --- >>>> >>>> The target audience for this documentation is user space KVM API users. >>>> Someone developing kvm tool for example. They want to know implications >>>> specific CAPs have. >>>> >>>>> >>>>> There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet >>>>> and may never get there. >>>>> >>>>> >>>>>>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest, >>>>>>> +the user space might have to advertise it for the guest. For example, >>>>>>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in >>>>>>> +the "ibm,hypertas-functions" device-tree property. >>>>>> >>>>>> This paragraph describes sPAPR. That's fine, but please document it as >>>>>> such. Also please check your grammar. >>>>> >>>>>>> + >>>>>>> +Without this capability, only H_PUT_TCE is handled by the kernel and >>>>>>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not >>>>>>> recommended >>>>>>> +unless the capability is present as passing hypercalls to the userspace >>>>>>> +slows operations a lot. >>>>>>> + >>>>>>> +Unlike other capabilities of this section, this one is always enabled. >>>>>> >>>>>> Why? Wouldn't that confuse older user space? >>>>> >>>>> >>>>> How? Old user space won't check for this capability and won't tell the >>>>> guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there. >>>>> >>>>> If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what, >>>>> then it is its problem - it won't work now anyway as neither QEMU nor host >>>>> kernel supports these calls. >>> >>> >>>> Always assume that you are a kernel developer without knowledge >>>> of any user space code using your interfaces. So there is the theoretical >>>> possibility that there is a user space client out there that implements >>>> H_PUT_TCE_INDIRECT and advertises hcall-multi-tce to the guest. >>>> Would that client break? If so, we should definitely have >>>> the CAP disabled by default. >>> >>> >>> No, it won't break. Why would it break? I really do not get it. This user >>> space client has to do an extra step to get this acceleration by calling >>> ioctl(KVM_CREATE_SPAPR_TCE) anyway. Previously that ioctl only had effect >>> on H_PUT_TCE, now on all three hcalls. >> >> Hrm. It's a change of behavior, it probably wouldn't break, yes. > > > Aaand?
And that's bad. Jeez, seriously. Don't argue this case. We enable new features individually unless we're 100% sure we can keep everything working. In this case an ENABLE_CAP doesn't hurt at all, because user space still needs to handle the hypercalls if it wants them anyways. But you get debugging for free for example. > > >>>> But really, it's also as much about consistency as anything else. >>>> If we leave everything as is and always extend functionality >>>> by enabling new CAPs, we're pretty much guaranteed that we >>>> don't break anything by accident. It also makes debugging easier >>>> because you can for example disable this particular feature >>>> to see whether something has bad side effects. >>> >>> >>> So I must add one more ioctl to enable MULTITCE in kernel handling. Is it >>> what you are saying? >>> >>> I can see KVM_CHECK_EXTENSION but I do not see KVM_ENABLE_EXTENSION or >>> anything like that. >> >> KVM_ENABLE_CAP. It's how we enable sPAPR capabilities too. > > > Yeah, Paul already explained. It is platform specific but ok. > And does not have "EXTENSION" in the name for some reason but ok too. > > KVM_ENABLE_CAP is vcpu ioctl. So kvm_arch_vcpu_ioctl() enables VCPU's > capabilities while KVM_CAP_SPAPR_MULTITCE is KVM (or more precisely > SPAPR-TCE/LIOBN but I really do not want it to be that specific) capability. > > Sure I can add to kvm_arch_vcpu_ioctl(): > > case KVM_CAP_SPAPR_MULTITCE: > r = 0; > vcpu->kvm->arch.spapr_multitce_enabled = cap->args[0]; > break; > > But I suspect you and Ben will call it ugly. SO do I have to implement > KVM_ENABLE_CAP in kvm_arch_vm_ioctl and change the api.txt that it is not > just about vcpu ioctl anymore? Or my brand new ioctl for this? There are 2 ways of dealing with this: 1) Call the ENABLE_CAP on every vcpu. That way one CPU may handle this hypercall in the kernel while another one may not. The same as we handle PAPR today. 2) Create a new ENABLE_CAP for the vm. I think in this case option 1 is fine - it's how we handle everything else already. > > > > >>>>>>> + >>>>>>> + >>>>>>> 5. The kvm_run structure >>>>>>> ------------------------ >>>>>>> >>>>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h >>>>>>> b/arch/powerpc/include/asm/kvm_host.h >>>>>>> index af326cd..20d04bd 100644 >>>>>>> --- a/arch/powerpc/include/asm/kvm_host.h >>>>>>> +++ b/arch/powerpc/include/asm/kvm_host.h >>>>>>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table { >>>>>>> struct kvm *kvm; >>>>>>> u64 liobn; >>>>>>> u32 window_size; >>>>>>> + struct { struct { unsigned long put, indir, stuff; } rm, vm; } >>>>>>> stat; >>>>>> >>>>>> You don't need this. >>>>>> >>>>>>> struct page *pages[0]; >>>>>>> }; >>>>>>> >>>>>>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch { >>>>>>> spinlock_t tbacct_lock; >>>>>>> u64 busy_stolen; >>>>>>> u64 busy_preempt; >>>>>>> + >>>>>>> + unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT >>>>>>> hcall */ >>>>>>> + enum { >>>>>>> + TCERM_NONE, >>>>>>> + TCERM_GETPAGE, >>>>>>> + TCERM_PUTTCE, >>>>>>> + TCERM_PUTLIST, >>>>>>> + } tce_rm_fail; /* failed stage of request processing */ >>>>>>> #endif >>>>>>> }; >>>>>>> >>>>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h >>>>>>> b/arch/powerpc/include/asm/kvm_ppc.h >>>>>>> index a5287fe..fa722a0 100644 >>>>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h >>>>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h >>>>>>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu >>>>>>> *vcpu); >>>>>>> >>>>>>> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, >>>>>>> struct kvm_create_spapr_tce *args); >>>>>>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long >>>>>>> liobn, >>>>>>> - unsigned long ioba, unsigned long tce); >>>>>>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( >>>>>>> + struct kvm_vcpu *vcpu, unsigned long liobn); >>>>>>> +extern long kvmppc_emulated_validate_tce(unsigned long tce); >>>>>>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt, >>>>>>> + unsigned long ioba, unsigned long tce); >>>>>>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu, >>>>>>> + unsigned long liobn, unsigned long ioba, >>>>>>> + unsigned long tce); >>>>>>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu, >>>>>>> + unsigned long liobn, unsigned long ioba, >>>>>>> + unsigned long tce_list, unsigned long npages); >>>>>>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu, >>>>>>> + unsigned long liobn, unsigned long ioba, >>>>>>> + unsigned long tce_value, unsigned long npages); >>>>>>> extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm, >>>>>>> struct kvm_allocate_rma *rma); >>>>>>> extern struct kvmppc_linear_info *kvm_alloc_rma(void); >>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c >>>>>>> b/arch/powerpc/kvm/book3s_64_vio.c >>>>>>> index b2d3f3b..99bf4e5 100644 >>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c >>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c >>>>>>> @@ -14,6 +14,7 @@ >>>>>>> * >>>>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<pau...@au1.ibm.com> >>>>>>> * Copyright 2011 David Gibson, IBM Corporation<d...@au1.ibm.com> >>>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM >>>>>>> Corporation<a...@au1.ibm.com> >>>>>>> */ >>>>>>> >>>>>>> #include<linux/types.h> >>>>>>> @@ -36,8 +37,10 @@ >>>>>>> #include<asm/ppc-opcode.h> >>>>>>> #include<asm/kvm_host.h> >>>>>>> #include<asm/udbg.h> >>>>>>> +#include<asm/iommu.h> >>>>>>> +#include<asm/tce.h> >>>>>>> >>>>>>> -#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) >>>>>>> +#define ERROR_ADDR ((void *)~(unsigned long)0x0) >>>>>>> >>>>>>> static long kvmppc_stt_npages(unsigned long window_size) >>>>>>> { >>>>>>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct >>>>>>> kvmppc_spapr_tce_table *stt) >>>>>>> struct kvm *kvm = stt->kvm; >>>>>>> int i; >>>>>>> >>>>>>> +#define __SV(x) stt->stat.x >>>>>>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0) >>>>>>> + pr_debug("%s stat for liobn=%llx\n" >>>>>>> + "--------------- realmode ----- virtmode ---\n" >>>>>>> + "put_tce %10ld %10ld\n" >>>>>>> + "put_tce_indir %10ld %10ld\n" >>>>>>> + "stuff_tce %10ld %10ld\n", >>>>>>> + __func__, stt->liobn, >>>>>>> + __SVD(put), __SV(vm.put), >>>>>>> + __SVD(indir), __SV(vm.indir), >>>>>>> + __SVD(stuff), __SV(vm.stuff)); >>>>>>> +#undef __SVD >>>>>>> +#undef __SV >>>>>> >>>>>> All of these stat points should just be trace points. You can do the >>>>>> statistic gathering from user space then. >>>>>> >>>>>>> + >>>>>>> mutex_lock(&kvm->lock); >>>>>>> list_del(&stt->list); >>>>>>> for (i = 0; i< kvmppc_stt_npages(stt->window_size); i++) >>>>>>> @@ -148,3 +165,138 @@ fail: >>>>>>> } >>>>>>> return ret; >>>>>>> } >>>>>>> + >>>>>>> +/* Converts guest physical address to host virtual address */ >>>>>>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu, >>>>>> >>>>>> Please don't distinguish _vm versions. They're the normal case. _rm ones >>>>>> are the special ones. >>>>>> >>>>>>> + unsigned long gpa, struct page **pg) >>>>>>> +{ >>>>>>> + unsigned long hva, gfn = gpa>> PAGE_SHIFT; >>>>>>> + struct kvm_memory_slot *memslot; >>>>>>> + >>>>>>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn); >>>>>>> + if (!memslot) >>>>>>> + return ERROR_ADDR; >>>>>>> + >>>>>>> + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa& ~PAGE_MASK); >>>>>> >>>>>> s/+/|/ >>>>>> >>>>>>> + >>>>>>> + if (get_user_pages_fast(hva& PAGE_MASK, 1, 0, pg) != 1) >>>>>>> + return ERROR_ADDR; >>>>>>> + >>>>>>> + return (void *) hva; >>>>>>> +} >>>>>>> + >>>>>>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu, >>>>>>> + unsigned long liobn, unsigned long ioba, >>>>>>> + unsigned long tce) >>>>>>> +{ >>>>>>> + long ret; >>>>>>> + struct kvmppc_spapr_tce_table *tt; >>>>>>> + >>>>>>> + tt = kvmppc_find_tce_table(vcpu, liobn); >>>>>>> + /* Didn't find the liobn, put it to userspace */ >>>>>> >>>>>> Unclear comment. >>>>> >>>>> >>>>> What detail is missing? >>>> >>> >>>> Grammar wise "it" in the second half of the sentence refers to liobn. >>>> So you "put" the "liobn to userspace". That sentence doesn't >>>> make any sense. >>> >>> >>> Removed it. H_TOO_HARD itself says enough already. >>> >>> >>>> What you really want to say is: >>>> >>>> /* Couldn't find the liobn. Something went wrong. Let user space handle >>>> the hypercall. That has better ways of dealing with errors. */ >>>> >>>>> >>>>> >>>>>>> + if (!tt) >>>>>>> + return H_TOO_HARD; >>>>>>> + >>>>>>> + ++tt->stat.vm.put; >>>>>>> + >>>>>>> + if (ioba>= tt->window_size) >>>>>>> + return H_PARAMETER; >>>>>>> + >>>>>>> + ret = kvmppc_emulated_validate_tce(tce); >>>>>>> + if (ret) >>>>>>> + return ret; >>>>>>> + >>>>>>> + kvmppc_emulated_put_tce(tt, ioba, tce); >>>>>>> + >>>>>>> + return H_SUCCESS; >>>>>>> +} >>>>>>> + >>>>>>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu, >>>>>>> + unsigned long liobn, unsigned long ioba, >>>>>>> + unsigned long tce_list, unsigned long npages) >>>>>>> +{ >>>>>>> + struct kvmppc_spapr_tce_table *tt; >>>>>>> + long i, ret = H_SUCCESS; >>>>>>> + unsigned long __user *tces; >>>>>>> + struct page *pg = NULL; >>>>>>> + >>>>>>> + tt = kvmppc_find_tce_table(vcpu, liobn); >>>>>>> + /* Didn't find the liobn, put it to userspace */ >>>>>>> + if (!tt) >>>>>>> + return H_TOO_HARD; >>>>>>> + >>>>>>> + ++tt->stat.vm.indir; >>>>>>> + >>>>>>> + /* >>>>>>> + * The spec says that the maximum size of the list is 512 TCEs so >>>>>>> + * so the whole table addressed resides in 4K page >>>>>> >>>>>> so so? >>>>>> >>>>>>> + */ >>>>>>> + if (npages> 512) >>>>>>> + return H_PARAMETER; >>>>>>> + >>>>>>> + if (tce_list& ~IOMMU_PAGE_MASK) >>>>>>> + return H_PARAMETER; >>>>>>> + >>>>>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size) >>>>>>> + return H_PARAMETER; >>>>>>> + >>>>>>> + tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg); >>>>>>> + if (tces == ERROR_ADDR) >>>>>>> + return H_TOO_HARD; >>>>>>> + >>>>>>> + if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST) >>>>>>> + goto put_list_page_exit; >>>>>>> + >>>>>>> + for (i = 0; i< npages; ++i) { >>>>>>> + if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) { >>>>>>> + ret = H_PARAMETER; >>>>>>> + goto put_list_page_exit; >>>>>>> + } >>>>>>> + >>>>>>> + ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]); >>>>>>> + if (ret) >>>>>>> + goto put_list_page_exit; >>>>>>> + } >>>>>>> + >>>>>>> + for (i = 0; i< npages; ++i) >>>>>>> + kvmppc_emulated_put_tce(tt, ioba + (i<< IOMMU_PAGE_SHIFT), >>>>>>> + vcpu->arch.tce_tmp_hpas[i]); >>>>>>> +put_list_page_exit: >>>>>>> + if (pg) >>>>>>> + put_page(pg); >>>>>>> + >>>>>>> + if (vcpu->arch.tce_rm_fail != TCERM_NONE) { >>>>>>> + vcpu->arch.tce_rm_fail = TCERM_NONE; >>>>>>> + if (pg&& !PageCompound(pg)) >>>>>>> + put_page(pg); /* finish pending realmode_put_page() */ >>>>>>> + } >>>>>>> + >>>>>>> + return ret; >>>>>>> +} >>>>>>> + >>>>>>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu, >>>>>>> + unsigned long liobn, unsigned long ioba, >>>>>>> + unsigned long tce_value, unsigned long npages) >>>>>>> +{ >>>>>>> + struct kvmppc_spapr_tce_table *tt; >>>>>>> + long i, ret; >>>>>>> + >>>>>>> + tt = kvmppc_find_tce_table(vcpu, liobn); >>>>>>> + /* Didn't find the liobn, put it to userspace */ >>>>>>> + if (!tt) >>>>>>> + return H_TOO_HARD; >>>>>>> + >>>>>>> + ++tt->stat.vm.stuff; >>>>>>> + >>>>>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size) >>>>>>> + return H_PARAMETER; >>>>>>> + >>>>>>> + ret = kvmppc_emulated_validate_tce(tce_value); >>>>>>> + if (ret || (tce_value& (TCE_PCI_WRITE | TCE_PCI_READ))) >>>>>>> + return H_PARAMETER; >>>>>>> + >>>>>>> + for (i = 0; i< npages; ++i, ioba += IOMMU_PAGE_SIZE) >>>>>>> + kvmppc_emulated_put_tce(tt, ioba, tce_value); >>>>>>> + >>>>>>> + return H_SUCCESS; >>>>>>> +} >>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c >>>>>>> b/arch/powerpc/kvm/book3s_64_vio_hv.c >>>>>>> index 30c2f3b..cd3e6f9 100644 >>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c >>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c >>>>>>> @@ -14,6 +14,7 @@ >>>>>>> * >>>>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<pau...@au1.ibm.com> >>>>>>> * Copyright 2011 David Gibson, IBM Corporation<d...@au1.ibm.com> >>>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM >>>>>>> Corporation<a...@au1.ibm.com> >>>>>>> */ >>>>>>> >>>>>>> #include<linux/types.h> >>>>>>> @@ -35,42 +36,243 @@ >>>>>>> #include<asm/ppc-opcode.h> >>>>>>> #include<asm/kvm_host.h> >>>>>>> #include<asm/udbg.h> >>>>>>> +#include<asm/iommu.h> >>>>>>> +#include<asm/tce.h> >>>>>>> >>>>>>> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) >>>>>>> +#define ERROR_ADDR (~(unsigned long)0x0) >>>>>>> >>>>>>> -/* WARNING: This will be called in real-mode on HV KVM and virtual >>>>>>> - * mode on PR KVM >>>>>> >>>>>> What's wrong with the warning? >>>>> >>>>> >>>>> It belongs to kvmppc_h_put_tce() which is not called in virtual mode >>>>> anymore. >>>> >>>> I thought the comment applied to the whole file before? Hrm. Maybe I >>>> misread it then. >>>> >>>>> It is technically correct for kvmppc_find_tce_table() though. Should I put >>>>> this comment before every function which may be called from real and >>>>> virtual modes? >>>> >>>> Yes, please. Otherwise someone might stick an access to a non-linear >>>> address >>>> in there by accident. >>>> >>>>> >>>>> >>>>> >>>>>>> +/* >>>>>>> + * Finds a TCE table descriptor by LIOBN >>>>>>> */ >>>>>>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu >>>>>>> *vcpu, >>>>>>> + unsigned long liobn) >>>>>>> +{ >>>>>>> + struct kvmppc_spapr_tce_table *tt; >>>>>>> + >>>>>>> + list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) { >>>>>>> + if (tt->liobn == liobn) >>>>>>> + return tt; >>>>>>> + } >>>>>>> + >>>>>>> + return NULL; >>>>>>> +} >>>>>>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table); >>>>>>> + >>>>>>> +#ifdef DEBUG >>>>>>> +/* >>>>>>> + * Lets user mode disable realmode handlers by putting big number >>>>>>> + * in the bottom value of LIOBN >>>>>> >>>>>> What? Seriously? Just don't enable the CAP. >>>>> >>>>> >>>>> It is under DEBUG. It really, really helps to be able to disable real mode >>>>> handlers without reboot. Ok, no debug code, I'll remove. >>>> >>>> Debug code is good, but #ifdefs are bad. For you, an #ifdef reads like >>>> "code that doesn't do any hard when disabled". For me, #ifdefs read >>>> "code that definitely breaks because nobody turns the #define on". >>>> >>>> So please, avoid #ifdef'ed code whenever possible. Switching the CAP on and >>>> off is a much better debug approach in this case. >>>> >>>>> >>>>> >>>>>>> + */ >>>>>>> +#define kvmppc_find_tce_table(a, b) \ >>>>>>> + ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b))) >>>>>>> +#endif >>>>>>> + >>>>>>> +/* >>>>>>> + * Validates TCE address. >>>>>>> + * At the moment only flags are validated as other checks will >>>>>>> significantly slow >>>>>>> + * down or can make it even impossible to handle TCE requests in real >>>>>>> mode. >>>>>> >>>>>> What? >>>>> >>>>> >>>>> What is missing here (besides good english)? >>>> >>>> What badness could slip through by not validating everything? >>> >>> >>> I cannot think of any good check which could be done in real mode and not >>> be "more than 2 calls deep" (c) Ben. Check that the page is allocated at >>> all? How? Don't know. >> > >> If you say that our validation doesn't validate everything, that makes >> me really weary. > > > It checks that TCE does not have any bit set in bits 2..12. If they are > set, something went very wrong. Better than nothing. > > >> Could the guest use it to maliciously inject anything? >> Could a missing check make our code go berserk? > > > No. KVM does not do anything with those addresses, just puts them to the > table and lets QEMU or a guest deal with it. > > >> What checks exactly would you do in addition when this was virtual mode? > > > Check that TCE is within RAM boundaries. Or check that the page was > allocated. find_linux_pte_or_hugepte? It can fail in real mode but in > virtual mode I can call get_user_fast_page and confirm that the address is > ok. Not sure, did not think much about it. Compare page flags with TCE > flags if both or neither have "write" set, this kind of stuff. > > I am not really sure we need any of those checks for emulated TCE at all. > > Remove the comment then? No, extend it. Explain what we could check and that we rely on surroundings to ensure everything's fine. > > >>>>>>> + */ >>>>>>> +long kvmppc_emulated_validate_tce(unsigned long tce) >>>>>> >>>>>> I don't like the naming scheme. Please turn this around and make it >>>>>> kvmppc_tce_validate(). >>>>> >>>>> >>>>> Oh. "Like"... Ok. >>>> >>>> Yes. Like. >>>> >>>>> >>>>> >>>>>>> +{ >>>>>>> + if (tce& ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ)) >>>>>>> + return H_PARAMETER; >>>>>>> + >>>>>>> + return H_SUCCESS; >>>>>>> +} >>>>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce); >>>>>>> + >>>>>>> +/* >>>>>>> + * Handles TCE requests for QEMU emulated devices. >>>>>> >>>>>> We still don't mention QEMU in KVM code. And does it really matter >>>>>> whether >>>>>> they're emulated by QEMU? Devices could also be emulated by KVM. >>>>>> >>>>>>> + * Puts guest TCE values to the table and expects QEMU to convert them >>>>>>> + * later in a QEMU device implementation. >>>>>>> + * Called in both real and virtual modes. >>>>>>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before >>>>>>> it. >>>>>>> + */ >>>>>>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt, >>>>>> >>>>>> kvmppc_tce_put() >>>>>> >>>>>>> + unsigned long ioba, unsigned long tce) >>>>>>> +{ >>>>>>> + unsigned long idx = ioba>> SPAPR_TCE_SHIFT; >>>>>>> + struct page *page; >>>>>>> + u64 *tbl; >>>>>>> + >>>>>>> + /* >>>>>>> + * Note on the use of page_address() in real mode, >>>>>>> + * >>>>>>> + * It is safe to use page_address() in real mode on ppc64 because >>>>>>> + * page_address() is always defined as lowmem_page_address() >>>>>>> + * which returns __va(PFN_PHYS(page_to_pfn(page))) which is >>>>>>> arithmetial >>>>>>> + * operation and does not access page struct. >>>>>>> + * >>>>>>> + * Theoretically page_address() could be defined different >>>>>>> + * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL >>>>>>> + * should be enabled. >>>>>>> + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64, >>>>>>> + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only >>>>>>> + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP >>>>>>> + * is not expected to be enabled on ppc32, page_address() >>>>>>> + * is safe for ppc32 as well. >>>>>>> + */ >>>>>>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL) >>>>>>> +#error TODO: fix to avoid page_address() here >>>>>>> +#endif >>>>>> >>>>>> Can you extract the text above, the check and the page_address call into >>>>>> a >>>>>> simple wrapper function? >>>>> >>>>> >>>>> Is this function also too big? Sorry, I do not understand the comment. >>>> >>>> All of the comment and #if here only deal with the fact that you >>>> have a real mode hack to call page_address() that happens >>>> to work under specific circumstances. >>>> >>>> There's nothing kvmppc_tce_put() specific about this. >>>> The page_address() code happens to get called here, sure. >>>> But if I read the kvmppc_tce_put() function I don't care about >>>> these details - I want to understand the code flow that ends >>>> up writing the TCE. >>>> >>>>>>> + page = tt->pages[idx / TCES_PER_PAGE]; >>>>>>> + tbl = (u64 *)page_address(page); >>>>>>> + >>>>>>> + /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */ >>>>>> >>>>>> This is not an RFC, is it? >>>>> >>>>> >>>>> Any debug code is prohibited? Ok, I'll remove. >>>> >>>> Debug code that requires code changes is prohibited, yes. >>>> Debug code that is runtime switchable (pr_debug, trace points, etc) >>>> are allowed. >>> >>> >>> Is there any easy way to enable just this specific udbg_printf (not all of >>> them at once)? Trace points do not work in real mode as we figured out. >> >> You can enable pr_debug by file IIRC. > > > On already running kernel? :-/ Wow. How? https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/dynamic-debug-howto.txt?id=HEAD Alex _______________________________________________ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev