Re: Windows7 crashes inside the VM when starting a certain program
On 07/28/2011 07:44 PM, André Weidemann wrote: Hi, On 28.07.2011 15:49, Paolo Bonzini wrote: On 07/28/2011 03:21 PM, Avi Kivity wrote: I haven't used debuggers very much, so I hope I grabbed the correct lines from the disassembly: http://pastebin.com/t3sfvmTg That's the bug check routine. Can you go up a frame? Or just do what Gleb suggested. Open the dump, type "!analyze -v" and cut-paste the address from WinDbg's output into the Disassemble window. This is the output of "!analyze -v": http://pastebin.com/sCZSjr8m ...and this is the output from the disassembly window: http://pastebin.com/AVZuswkT Very useful, thanks! Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Windows7 crashes inside the VM when starting a certain program
On 27.07.2011 10:56, Gleb Natapov wrote: On Tue, Jul 26, 2011 at 12:57:44PM +0200, André Weidemann wrote: Hi, On 26.07.2011 12:08, Gleb Natapov wrote: On Tue, Jul 26, 2011 at 07:29:04AM +0200, André Weidemann wrote: On 07.07.2011 07:26, André Weidemann wrote: Hi, I am running Windows7 x64 in a VM which crashes after starting a certain game. Actually there are two games both from the same company, that make the VM crash after starting them. Windows crashes right after starting the game. With the 1st game the screen goes black as usual and the cursor keeps spinning for 3-5 seconds until Windows crashes. With the second game I get to 3D the login screen. The game then crashes after logging in. Windows displays this error message on the first crash: http://pastebin.com/kMzk9Jif Windows then finishes writing the crash dump and restarts. I can reproduce Windows crashing every time I start the game while the VM keeps running without any problems. When Windows reboots after the first crash and the game is started again, the message on the following blue screen changes slightly and stays the same(except for the addresses) for every following crash: http://pastebin.com/jVtBc4ZH I first thought that this might be related to a certain feature in 3D acceleration being used, but Futuremark 3DMark Vantage or 3DMark 11 run without any problems. They run a bit choppy on some occasions, but do that without crashing Windows7 or the VM. How can I proceed to investigate what is going wrong? I did some testing and found out that Windows7 does not crash anymore when changing "-cpu host" to "-cpu Nehalem". After doing so, What is your host cpu (cat /proc/cpuinfo)? The server is currently running on 2 out of 8 cores with kernel boot parameter "maxcpus=2". flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow vnmi flexpriority ept vpid Flags that are present on -cpu host but not -cpu Nehalem (excluding vmx related flags): vme dts acpi ss ht tm pbe rdtscp constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf dtes64 monitor ds_cpl est tm2 xtpr pdcm ida Some of them may be synthetic and some of them may be filtered by KVM. Can you try to run "-cpu host,-vme,-dts..." (specifying all of those flags with -). Drop those that qemu does not recognize. See if result will be the same as with -cpu Nehalem. If yes, then try to find out with flag make the difference. I started the VM with all flags that differ between the two CPUs. After removing the ones qemu-kvm did not recognize, I started the VM again with the following line: -cpu host,-vme,-acpi,-ss,-ht,-tm,-pbe,-rdtscp,-dtes64,-monitor,-ds_cpl,-est,-tm2,-xtpr,-pdcm \ Running the program under Windows7 inside the VM, caused Windows to crash again with a BSoD. The disassembly of the address f8000288320c shows the following: http://pastebin.com/7yzTYJSG André -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
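If none of the minus-flag combinations reproduces the -cpu Nehalem behaviour, it can help to script the bisection and test one removed flag per boot. A minimal sketch, assuming the disk image name and the stripped-down option list are placeholders to be filled in from the real command line:

#!/bin/sh
# Boot the guest once per candidate flag; note which run avoids the BSoD.
FLAGS="vme acpi ss ht tm pbe rdtscp dtes64 monitor ds_cpl est tm2 xtpr pdcm"
for f in $FLAGS; do
    echo "=== testing -cpu host,-$f ==="
    # replace the options below with the full qemu-kvm command line normally used
    qemu-kvm -cpu "host,-$f" -m 4096 -smp 2 -drive file=win7.img,if=virtio
done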
Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
Hi Stefan

On 07/28/2011 11:44 PM, Stefan Hajnoczi wrote: On Thu, Jul 28, 2011 at 3:29 PM, Liu Yuan wrote:

Did you investigate userspace virtio-blk performance? If so, what issues did you find?

Yes, in the performance table I presented, virtio-blk in user space lags behind vhost-blk (although this prototype is a very primitive implementation) in the kernel by about 15%.

Actually, the motivation to start vhost-blk is that, in our observation, KVM (virtio enabled) in RHEL 6 is worse than Xen (PV) in RHEL from a disk IO perspective, especially for sequential read/write (around a 20% gap). We'll deploy a large number of KVM-based systems as the infrastructure of some service, and this gap is really unpleasant.

By design, IMHO, virtio performance is supposed to be comparable to the paravirtualization solution if not better, because for KVM the guest and the backend driver can sit in the same address space via mmap. This should reduce the overhead involved in page table modification, and thus speed up buffer management and transfer a lot compared with Xen PV.

I am not in a qualified position to talk about QEMU, but I think the surprising performance improvement from this very primitive vhost-blk simply shows that the internal structure of the qemu IO path is rather bloated. I say *surprising* because vhost basically just reduces the number of system calls, a path that chip manufacturers have been tuning heavily for years. So I guess the performance gain of vhost-blk can mainly be attributed to a *shorter and simpler* code path.

Anyway, IMHO, compared with the user-space approach, the in-kernel one allows more flexibility and better integration with the kernel IO stack, since we don't need two IO stacks for the guest OS.

I have a hacked up world here that basically implements vhost-blk in userspace: http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c
* A dedicated virtqueue thread sleeps on ioeventfd
* Guest memory is pre-mapped and accessed directly (not using QEMU's usual memory access functions)
* Linux AIO is used, the QEMU block layer is bypassed
* Completion interrupts are injected from the virtqueue thread using ioctl
I will try to rebase onto qemu-kvm.git/master (this work is several months old). Then we can compare to see how much of the benefit can be gotten in userspace.

I don't really follow you about vhost-blk in user space, since the vhost infrastructure itself means an in-kernel accelerator implemented in the kernel. I guess what you mean is essentially a rewrite of virtio-blk in user space with a dedicated thread handling requests and a shorter code path, similar to vhost-blk.

[performance] Currently, the fio benchmarking numbers are rather promising. Sequential read throughput is improved by as much as 16% and latency drops by up to 14%. For sequential write, 13.5% and 13% respectively.

sequential read:
+------------+-----------+-----------+------------+
| iodepth    | 1         | 2         | 3          |
+------------+-----------+-----------+------------+
| virtio-blk | 4116(214) | 7814(222) | 8867(306)  |
+------------+-----------+-----------+------------+
| vhost-blk  | 4755(183) | 8645(202) | 10084(266) |
+------------+-----------+-----------+------------+
4116(214) means 4116 IOPS with an average completion latency of 214 us.
sequential write:
+------------+-----------+-----------+-----------+
| iodepth    | 1         | 2         | 3         |
+------------+-----------+-----------+-----------+
| virtio-blk | 3848(228) | 6505(275) | 9335(291) |
+------------+-----------+-----------+-----------+
| vhost-blk  | 4370(198) | 7009(249) | 9938(264) |
+------------+-----------+-----------+-----------+

The fio command for sequential read:
sudo fio -name iops -readonly -rw=read -runtime=120 -iodepth 1 -filename /dev/vda -ioengine libaio -direct=1 -bs=512

and the config file for sequential write is:
dev@taobao:~$ cat rw.fio
[test]
rw=rw
size=200M
directory=/home/dev/data
ioengine=libaio
iodepth=1
direct=1
bs=512

512 byte blocksize is very small, given that you can expect a file system to have 4 KB or so block sizes. It would be interesting to measure a wider range of block sizes: 4 KB, 64 KB, and 128 KB for example. Stefan

Actually, I have tested 4KB; it shows the same improvement. What I care about more is iodepth, since batched AIO would benefit from it. But my laptop's SATA disk doesn't behave as well as it advertises: it says its NCQ queue depth is 32, and the kernel tells me it supports 31 requests in one go. When I increase iodepth in the test up to 4, both the host and guest IOPS drop drastically.

Yuan
-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo
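To cover the wider block-size range suggested above in one run, the existing rw.fio file could be extended along these lines. This is only a sketch: the device path, runtime and job names are assumptions, and stonewall keeps the jobs from running concurrently:

; sequential reads at 4k/64k/128k, one job at a time
[global]
rw=read
ioengine=libaio
iodepth=1
direct=1
runtime=120
filename=/dev/vda

[bs-4k]
bs=4k

[bs-64k]
stonewall
bs=64k

[bs-128k]
stonewall
bs=128k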
Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
Hi On 07/29/2011 12:48 PM, Stefan Hajnoczi wrote: On Thu, Jul 28, 2011 at 4:44 PM, Stefan Hajnoczi wrote: On Thu, Jul 28, 2011 at 3:29 PM, Liu Yuan wrote: Did you investigate userspace virtio-blk performance? If so, what issues did you find? I have a hacked up world here that basically implements vhost-blk in userspace: http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c * A dedicated virtqueue thread sleeps on ioeventfd * Guest memory is pre-mapped and accessed directly (not using QEMU's usually memory access functions) * Linux AIO is used, the QEMU block layer is bypassed * Completion interrupts are injected from the virtqueue thread using ioctl I will try to rebase onto qemu-kvm.git/master (this work is several months old). Then we can compare to see how much of the benefit can be gotten in userspace. Here is the rebased virtio-blk-data-plane tree: http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/virtio-blk-data-plane When I run it on my laptop with an Intel X-25M G2 SSD I see a latency reduction compared to mainline userspace virtio-blk. I'm not posting results because I did quick fio runs without ensuring a quiet benchmarking environment. There are a couple of things that could be modified: * I/O request merging is done to mimic bdrv_aio_multiwrite() - but vhost-blk does not do this. Try turning it off? I noted bdrv_aio_multiwrite() do the murging job, but I am not sure if this trick is really needed since we have an io scheduler down the path that is in a much more better position to murge requests. I think the duplicate *pre-mature* merging of bdrv_aio_multiwrite is the result of laio_submit()'s lack of submitting the requests in a batch mode. io_submit() in the fs/aio.c says that every time we call laio_submit(), it will submit the very request into the driver's request queue, which would be run when we blk_finish_plug(). IMHO, you can simply batch io_submit() requests instead of this tricks if you already bypass the QEMU block layer. * epoll(2) is used but perhaps select(2)/poll(2) have lower latency for this use case. Try another event mechanism. Let's see how it compares to vhost-blk first. I can tweak it if we want to investigate further. Yuan: Do you want to try the virtio-blk-data-plane tree? You don't need to change the qemu-kvm command-line options. Stefan Yes, please, sounds interesting. BTW, I think the user space would achieve the same performance gain if you bypass qemu io layer all the way down to the system calls in a request handling cycle, compared to the current vhost-blk implementation that uses linux AIO. But hey, I would go further to optimise it with block layer and other resources in the mind. ;) and I don't add complexity to the current qemu io layer. Yuan -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
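For reference, the batching idea mentioned above boils down to preparing several iocbs and handing them to the kernel with a single io_submit() call. A minimal libaio sketch; the fd, request size and count are illustrative, and the io_context is assumed to have been set up with io_setup() elsewhere:

#include <libaio.h>
#include <stdlib.h>

/* Queue n 4 KB reads and submit them with one syscall instead of n. */
static int submit_read_batch(io_context_t ctx, int fd, int n)
{
    struct iocb *iocbs[n];
    int i;

    for (i = 0; i < n; i++) {
        struct iocb *cb = malloc(sizeof(*cb));
        void *buf;

        if (posix_memalign(&buf, 512, 4096))
            return -1;
        io_prep_pread(cb, fd, buf, 4096, (long long)i * 4096);
        iocbs[i] = cb;
    }
    return io_submit(ctx, n, iocbs);  /* one syscall for the whole batch */
}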
Re: Nested VMX - L1 hangs on running L2
2011/7/27 Nadav Har'El : > On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs > on running L2": >> > > No, both patches are wrong. >> > >> >> kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc) should always return the L1 TSC, >> regardless of the setting of any MSR bitmap. The reason why is that it >> is being called by the L0 hypervisor kernel, which handles only >> interactions with the L1 MSRs. > > guest_read_tsc() (called by the above get_msr) currently does this: > > static u64 guest_read_tsc(void) > { > u64 host_tsc, tsc_offset; > > rdtscll(host_tsc); > tsc_offset = vmcs_read64(TSC_OFFSET); > return host_tsc + tsc_offset; > } That's wrong. You should NEVER believe the offset written into the hardware VMCS to be the current, real L1 TSC offset, as that is not an invariant. Instead, you should ALWAYS return the host TSC + the L1 TSC offset. Sometimes, this may be the hardware value. > I guess you'd want this to change to something like: > > tsc_offset = is_guest_mode(vcpu) ? > vmx->nested.vmcs01_tsc_offset : > vmcs_read64(TSC_OFFSET); > > But I still am not convinced that that would be right I believe this is correct. But may it be cheaper to read from the in-memory structure than the actual hardware VMCS? > E.g., imagine the case where L1 uses TSC_OFFSETING and but doesn't > trap TSC MSR read. The SDM says (if I understand it correctly) that this TSC > MSR read will not exit (because L1 doesn't trap it) but *will* do the extra > offsetting. In this case, the original code (using vmcs02's TSC_OFFSET which > is the sum of that of vmcs01 and vmcs12), is correct, and the new code will > be incorrect. Or am I misunderstanding the SDM? In that case, you need to distinguish between reads of the TSC MSR by the guest and reads by the host (as done internally to track drift and compensation). The code that needs to change isn't guest_read_tsc(), that code must respect the invariant of only returning the L1 guest TSC (in fact, that may be a good name change for the function). What needs to change is the actual code involved in the MSR read. If it determines that something other than the L1 guest is running, it needs to ignore the hardware TSC offset and return the TSC as if read by the L1 guest. Unfortunately, the layering currently doesn't seem to allow for this, and it looks like both vendor specific variants currently get this wrong. The call stack: kvm_get_msr() kvm_x86_ops->get_msr() vendor_get_msr() vendor_guest_read_tsc() offers no insight as to the intention of the caller. Is it trying to get the guest TSC to return to the guest, or is it trying to get the guest TSC to calibrate / measure and compensate for TSC effects? So you are right, this is still wrong for the case in which L1 does not trap TSC MSR reads. Note however, the RDTSC instruction is still virtualized properly, it is only the relatively rare actual TSC MSR read via RDMSR which is mis-virtualized (this bug exists today in the SVM implementation if I am reading it correctly - cc'd Joerg to notify him of that). That, combined with the relative importance of supporting a guest which does not trap on these MSR reads suggest this is a low priority design issue however (RDTSC still works even if the MSR is trapped, correct?) If you want to go the extra mile to support such guests, the only fully correct approach then is to do one of the following: 1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and transform current uses of the code which does TSC compensation to use this new API. 
*Bonus* - no need to do double indirection through the generic MSR code. or, alternatively: 2) Do not trap MSR reads of the TSC if the current L1 guest is not trapping MSR reads of the TSC. This is not possible if you cannot enforce specific read vs. write permission in hardware - it may be possible however, if you can trap all MSR writes regardless of the permission bitmap. > Can you tell me in which case the original code would returen incorrect > results to a guest (L1 or L2) doing anything MSR-related? It never returns incorrect values to a guest. It does however return incorrect values to the L0 hypervisor, which is expecting to do arithmetic based on the L1 TSC value, and this fails catastrophically when it receives values for other nested guests. > I'm assuming that some code in KVM also uses kvm_read_msr and assumes it > gets the TSC value for L1, not for the guest currently running (L2 or L1). Exactly. > I don't understand why it needs to assume that... Why would it be wrong to > return L2's TSC, and just remember that *changing* the L2 TSC really means > changing the L1 TSC offset (vmcs01_tsc_offset), not vmcs12.tsc_offset which > we can't touch? L0 measures the L1 TSC at various points to be sure the L1 TSC never regresses by going backwards, and also to
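Putting the pieces of option 1 together, the vendor callback would essentially be guest_read_tsc() with the vmcs01 fallback discussed earlier in the thread. A rough sketch only, not the actual patch; the callback name and the nested field follow the names used above, everything else is illustrative:

/* Return the TSC as the L1 guest sees it, regardless of which
 * nesting level is currently running. */
static u64 vmx_get_l1_tsc(struct kvm_vcpu *vcpu)
{
    struct vcpu_vmx *vmx = to_vmx(vcpu);
    u64 host_tsc, tsc_offset;

    rdtscll(host_tsc);
    tsc_offset = is_guest_mode(vcpu) ? vmx->nested.vmcs01_tsc_offset :
                                       vmcs_read64(TSC_OFFSET);
    return host_tsc + tsc_offset;
}

/* The compensation code in common x86 would then call
 * kvm_x86_ops->get_l1_tsc(vcpu) instead of going through
 * kvm_get_msr(vcpu, MSR_IA32_TSC, ...). */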
Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
On Fri, Jul 29, 2011 at 8:22 AM, Liu Yuan wrote: > Hi Stefan > On 07/28/2011 11:44 PM, Stefan Hajnoczi wrote: >> >> On Thu, Jul 28, 2011 at 3:29 PM, Liu Yuan wrote: >> >> Did you investigate userspace virtio-blk performance? If so, what >> issues did you find? >> > > Yes, in the performance table I presented, virtio-blk in the user space lags > behind the vhost-blk(although this prototype is very primitive impl.) in the > kernel by about 15%. I mean did you investigate *why* userspace virtio-blk has higher latency? Did you profile it and drill down on its performance? It's important to understand what is going on before replacing it with another mechanism. What I'm saying is, if I have a buggy program I can sometimes rewrite it from scratch correctly but that doesn't tell me what the bug was. Perhaps the inefficiencies in userspace virtio-blk can be solved by adjusting the code (removing inefficient notification mechanisms, introducing a dedicated thread outside of the QEMU iothread model, etc). Then we'd get the performance benefit for non-raw images and perhaps non-virtio and non-Linux host platforms too. > Actually, the motivation to start vhost-blk is that, in our observation, > KVM(virtio enabled) in RHEL 6 is worse than Xen(PV) in RHEL in disk IO > perspective, especially for sequential read/write (around 20% gap). > > We'll deploy a large number of KVM-based systems as the infrastructure of > some service and this gap is really unpleasant. > > By the design, IMHO, virtio performance is supposed to be comparable to the > para-vulgarization solution if not better, because for KVM, guest and > backend driver could sit in the same address space via mmaping. This would > reduce the overhead involved in page table modification, thus speed up the > buffer management and transfer a lot compared with Xen PV. Yes, guest memory is just a region of QEMU userspace memory. So it's easy to reach inside and there are no page table tricks or copying involved. > I am not in a qualified position to talk about QEMU , but I think the > surprised performance improvement by this very primitive vhost-blk simply > manifest that, the internal structure for qemu io is the way bloated. I say > it *surprised* because basically vhost just reduces the number of system > calls, which is heavily tuned by chip manufacture for years. So, I guess the > performance number vhost-blk gains mainly could possibly be contributed to > *shorter and simpler* code path. First we need to understand exactly what the latency overhead is. If we discover that it's simply not possible to do this equally well in userspace, then it makes perfect sense to use vhost-blk. So let's gather evidence and learn what the overheads really are. Last year I spent time looking at virtio-blk latency: http://www.linux-kvm.org/page/Virtio/Block/Latency See especially this diagram: http://www.linux-kvm.org/page/Image:Threads.png The goal wasn't specifically to reduce synchronous sequential I/O, instead the aim was to reduce overheads for a variety of scenarios, especially multithreaded workloads. In most cases it was helpful to move I/O submission out of the vcpu thread by using the ioeventfd model just like vhost. Ioeventfd for userspace virtio-blk is now on by default in qemu-kvm. Try running the userspace virtio-blk benchmark with -drive if=none,id=drive0,file=... -device virtio-blk-pci,drive=drive0,ioeventfd=off. This causes QEMU to do I/O submission in the vcpu thread, which might reduce latency at the cost of stealing guest time. 
> Anyway, IMHO, compared with user space approach, the in-kernel one would > allow more flexibility and better integration with the kernel IO stack, > since we don't need two IO stacks for guest OS. I agree that there may be advantages to integrating with in-kernel I/O mechanisms. An interesting step would be to implement the submit_bio() approach that Christoph suggested and seeing if that improves things further. Push virtio-blk as far as you can and let's see what the performance is! >> I have a hacked up world here that basically implements vhost-blk in >> userspace: >> >> http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c >> >> * A dedicated virtqueue thread sleeps on ioeventfd >> * Guest memory is pre-mapped and accessed directly (not using QEMU's >> usually memory access functions) >> * Linux AIO is used, the QEMU block layer is bypassed >> * Completion interrupts are injected from the virtqueue thread using >> ioctl >> >> I will try to rebase onto qemu-kvm.git/master (this work is several >> months old). Then we can compare to see how much of the benefit can >> be gotten in userspace. >> > I don't really get you about vhost-blk in user space since vhost > infrastructure itself means an in-kernel accelerator that implemented in > kernel . I guess what you meant is somewhat a re-write of virtio-blk in user > space with a dedicated thread handling requests, a
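To act on the profiling suggestion above, a first pass could simply be host-side perf on the qemu-kvm process while the fio job runs inside the guest; a sketch, where the process selection and duration are assumptions:

# sample the vcpu and iothread for 30 seconds during the benchmark
perf record -g -p $(pidof qemu-kvm) sleep 30
perf report --sort comm,dso,symbol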
Re: Nested VMX - L1 hangs on running L2
On Fri, Jul 29, 2011 at 05:01:16AM -0400, Zachary Amsden wrote:
> So you are right, this is still wrong for the case in which L1 does
> not trap TSC MSR reads. Note however, the RDTSC instruction is still
> virtualized properly, it is only the relatively rare actual TSC MSR
> read via RDMSR which is mis-virtualized (this bug exists today in the
> SVM implementation if I am reading it correctly - cc'd Joerg to notify
> him of that). That, combined with the relative importance of
> supporting a guest which does not trap on these MSR reads, suggests this
> is a low priority design issue however (RDTSC still works even if the
> MSR is trapped, correct?)

Actually, the documentation is not entirely clear about this. But I tend to agree that direct _reads_ of MSR 0x10 in guest mode should return the TSC with tsc_offset applied. On the other hand, there is even SVM hardware which gets this wrong: for some K8s there is an erratum that the tsc_offset is not applied when the MSR is read directly in guest mode.

But yes, to be architecturally correct the MSR read should always return the TSC of the currently running guest level. In reality this shouldn't be an issue, though, and rdtsc[p] still works correctly.

Regards, Joerg

-- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
On Fri, Jul 29, 2011 at 03:59:53PM +0800, Liu Yuan wrote:
> I noted bdrv_aio_multiwrite() does the merging job, but I am not sure

Just like I/O schedulers, it's actually fairly harmful on high-IOPS, low-latency devices. I've just started doing a lot of qemu benchmarks, and disabling that multiwrite mess alone gives fairly nice speedups. The major issue seems to be the additional memory allocations and cache lines - a problem that is fairly inherent all over the qemu code.

-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 39412] Win Vista and Win2K8 guests' network breaks down
https://bugzilla.kernel.org/show_bug.cgi?id=39412 Jay Ren changed: What|Removed |Added Status|NEW |RESOLVED Resolution||CODE_FIX --- Comment #3 from Jay Ren 2011-07-29 11:16:55 --- This bug got fixed and I verified it. It doesn't exist in latest kvm.git tree e72ef590a3ef3047f6ed5bcb8808a9734f6c4b32. -- Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are on the CC list for the bug. You are watching the assignee of the bug. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 39412] Win Vista and Win2K8 guests' network breaks down
https://bugzilla.kernel.org/show_bug.cgi?id=39412 Jay Ren changed: What|Removed |Added Status|RESOLVED|VERIFIED -- Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are on the CC list for the bug. You are watching the assignee of the bug. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Strange MySQL behaviour
Hello! On Thu, Jul 28, 2011 at 11:34, Avi Kivity wrote: > Looks like you are blocked on disk. What does iostat say about disk > utilization (in both guest and host)? I also thought so, but host cpu states doesn't show any disk blocking. iostat 5 with cache=none: Guest: avg-cpu: %user %nice %system %iowait %steal %idle 0,100,000,25 36,480,00 63,17 Device:tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn vda 177,0076,80 1308,80384 6544 dm-0167,0076,80 1308,80384 6544 dm-1 0,00 0,00 0,00 0 0 Host: avg-cpu: %user %nice %system %iowait %steal %idle 5,620,001,470,500,00 92,40 Device:tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 185,40 0,00 1212,80 0 6064 sdb 188,6025,60 1212,80128 6064 md1 198,4025,60 1212,80128 6064 [skip] dm-12 195,0025,60 1177,60128 5888 time mysql ...: real7m13.876s user0m0.338s sys 0m0.182s > Try s/cache=none/cache=unsafe/ as an experiment. Does it help? With cache=unsafe: Guest: avg-cpu: %user %nice %system %iowait %steal %idle 3,150,008,60 11,000,00 77,24 Device:tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn vda3827,60 782,40 20587,20 3912 102936 dm-0 2638,60 779,20 20576,00 3896 102880 dm-1 0,00 0,00 0,00 0 0 Host: avg-cpu: %user %nice %system %iowait %steal %idle 10,410,007,730,000,00 81,86 Device:tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 5,40 307,2040,00 1536200 sdb 5,40 460,8040,00 2304200 md1 99,00 768,0027,20 3840136 dm-1296,00 768,00 0,00 3840 0 4 times followed by avg-cpu: %user %nice %system %iowait %steal %idle 10,480,008,180,030,00 81,31 Device:tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 29,00 256,00 16236,80 1280 81184 sdb 27,40 0,00 16236,80 0 81184 md12057,80 256,00 16224,00 1280 81120 dm-12 2050,80 256,00 16150,40 1280 80752 2 times time: real0m19.133s user0m0.429s sys 0m0.271s > Try s/cache=none/cache=none,aio=native/. Does it help? This one is safe, > you can keep it if it works. With aio=native: Guest: avg-cpu: %user %nice %system %iowait %steal %idle 0,200,000,50 37,080,00 62,22 Device:tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn vda 192,0078,40 1038,40392 5192 dm-0133,6078,40 1040,00392 5200 dm-1 0,00 0,00 0,00 0 0 Host: avg-cpu: %user %nice %system %iowait %steal %idle 2,580,005,440,200,00 91,77 Device:tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 176,8022,40 1096,00112 5480 sdb 174,40 0,00 1096,00 0 5480 md1 181,6022,40 1096,00112 5480 dm-12 175,2022,40 1038,40112 5192 Time: real7m7.770s user0m0.352s sys 0m0.217s If the same mysql command is executed on host directly: Host: avg-cpu: %user %nice %system %iowait %steal %idle 3,960,007,548,900,00 79,60 Device:tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda1273,40 164,80 10252,80824 51264 sdb1304,60 352,00 10252,80 1760 51264 md11345,00 516,80 10252,80 2584 51264 dm-0 1343,60 516,80 10232,00 2584 51160 Time: real0m10.161s user0m0.294s sys 0m0.284s This is completely testing server, so I can try some new versons or patches. -- Boris Dolgov. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
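For reference, the three runs above differ only in the -drive caching/AIO options; something along these lines, with the image path and interface as placeholders for the real command line:

-drive file=/var/lib/kvm/guest.img,if=virtio,cache=none
-drive file=/var/lib/kvm/guest.img,if=virtio,cache=unsafe
-drive file=/var/lib/kvm/guest.img,if=virtio,cache=none,aio=native

As noted above, cache=unsafe drops flush/ordering guarantees, so it is only useful as a diagnostic, not as a production setting.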
Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
On 07/28/2011 10:47 PM, Christoph Hellwig wrote: On Thu, Jul 28, 2011 at 10:29:05PM +0800, Liu Yuan wrote: From: Liu Yuan Vhost-blk driver is an in-kernel accelerator, intercepting the IO requests from KVM virtio-capable guests. It is based on the vhost infrastructure. This is supposed to be a module over latest kernel tree, but it needs some symbols from fs/aio.c and fs/eventfd.c to compile with. So currently, after applying the patch, you need to *recomplie* the kernel. Usage: $kernel-src: make M=drivers/vhost $kernel-src: sudo insmod drivers/vhost/vhost_blk.ko After insmod, you'll see /dev/vhost-blk created. done! You'll need to send the changes for existing code separately. Thanks for reminding. If you're going mostly for raw blockdevice access just calling submit_bio will shave even more overhead off, and simplify the code a lot. Yes, sounds cool, I'll give it a try. Yuan -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
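For the raw-blockdevice case, the submit_bio() suggestion would mean building bios directly from the guest buffers and bypassing the vfs/aio path entirely. A very rough sketch only, with a made-up function name; error handling, segment limits and the completion path are omitted, and the 3.0-era bio fields are assumed:

static void vhost_blk_submit_page(struct block_device *bdev,
                                  struct page *page, unsigned int len,
                                  sector_t sector, int rw,
                                  bio_end_io_t *done, void *priv)
{
    struct bio *bio = bio_alloc(GFP_KERNEL, 1);

    bio->bi_bdev    = bdev;
    bio->bi_sector  = sector;
    bio->bi_end_io  = done;       /* completion would raise the guest interrupt */
    bio->bi_private = priv;
    bio_add_page(bio, page, len, 0);
    submit_bio(rw, bio);
}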
Re: [PATCH] KVM: MMU: Do not unconditionally read PDPTE from guest memory
On Thu, Jul 28, 2011 at 04:36:17AM -0400, Avi Kivity wrote: > Architecturally, PDPTEs are cached in the PDPTRs when CR3 is reloaded. > On SVM, it is not possible to implement this, but on VMX this is possible > and was indeed implemented until nested SVM changed this to unconditionally > read PDPTEs dynamically. This has noticable impact when running PAE guests. > > Fix by changing the MMU to read PDPTRs from the cache, falling back to > reading from memory for the nested MMU. > > Signed-off-by: Avi Kivity Hmm, interesting. Sorry for breaking it. I tested the patch on nested svm, it works fine. Tested-by: Joerg Roedel -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Windows7 crashes inside the VM when starting a certain program
On Fri, Jul 29, 2011 at 09:20:35AM +0200, André Weidemann wrote: > On 27.07.2011 10:56, Gleb Natapov wrote: > >On Tue, Jul 26, 2011 at 12:57:44PM +0200, André Weidemann wrote: > >>Hi, > >> > >>On 26.07.2011 12:08, Gleb Natapov wrote: > >>>On Tue, Jul 26, 2011 at 07:29:04AM +0200, André Weidemann wrote: > On 07.07.2011 07:26, André Weidemann wrote: > >Hi, > >I am running Windows7 x64 in a VM which crashes after starting a certain > >game. Actually there are two games both from the same company, that make > >the VM crash after starting them. > >Windows crashes right after starting the game. With the 1st game the > >screen goes black as usual and the cursor keeps spinning for 3-5 seconds > >until Windows crashes. With the second game I get to 3D the login > >screen. The game then crashes after logging in. > >Windows displays this error message on the first crash: > >http://pastebin.com/kMzk9Jif > >Windows then finishes writing the crash dump and restarts. > >I can reproduce Windows crashing every time I start the game while the > >VM keeps running without any problems. > >When Windows reboots after the first crash and the game is started > >again, the message on the following blue screen changes slightly and > >stays the same(except for the addresses) for every following crash: > >http://pastebin.com/jVtBc4ZH > > > >I first thought that this might be related to a certain feature in 3D > >acceleration being used, but Futuremark 3DMark Vantage or 3DMark 11 run > >without any problems. They run a bit choppy on some occasions, but do > >that without crashing Windows7 or the VM. > > > >How can I proceed to investigate what is going wrong? > > I did some testing and found out that Windows7 does not crash > anymore when changing "-cpu host" to "-cpu Nehalem". After doing so, > >>>What is your host cpu (cat /proc/cpuinfo)? > >> > >>The server is currently running on 2 out of 8 cores with kernel boot > >>parameter "maxcpus=2". > >> > >>flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr > >>pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm > >>pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good > >>xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est > >>tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow > >>vnmi flexpriority ept vpid > >Flags that are present on -cpu host but not -cpu Nehalem (excluding vmx > >related flags): > > > >vme dts acpi ss ht tm pbe rdtscp constant_tsc arch_perfmon pebs bts rep_good > >xtopology nonstop_tsc aperfmperf dtes64 monitor ds_cpl est tm2 xtpr pdcm ida > > > >Some of them may be synthetic and some of them may be filtered by KVM. > > > >Can you try to run "-cpu host,-vme,-dts..." (specifying all of those > >flags with -). Drop those that qemu does not recognize. See if result > >will be the same as with -cpu Nehalem. If yes, then try to find out with > >flag make the difference. > > I started the VM with all flags that differ between the two CPUs. > After removing the ones qemu-kvm did not recognize, I started the VM > again with the following line: > -cpu > host,-vme,-acpi,-ss,-ht,-tm,-pbe,-rdtscp,-dtes64,-monitor,-ds_cpl,-est,-tm2,-xtpr,-pdcm > \ > > Running the program under Windows7 inside the VM, caused Windows to > crash again with a BSoD. > The disassembly of the address f8000288320c shows the following: > http://pastebin.com/7yzTYJSG > Looks like it tries to read MSR_LASTBRANCH_TOS MSR which kvm does not support. Do you see something interesting in dmesg? 
I wonder how availability of the MSR should be checked. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
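As a quick experiment (not a fix), kvm can be told to ignore accesses to MSRs it does not implement, so the guest's rdmsr returns 0 instead of faulting; if the BSoD disappears, that would confirm the unhandled-MSR theory. Assuming the usual modular kvm/kvm-intel setup:

# host: check for the unhandled access, then reload kvm with ignore_msrs
dmesg | grep -i "unhandled rdmsr"
rmmod kvm_intel kvm
modprobe kvm ignore_msrs=1
modprobe kvm_intel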
Biweekly KVM Test report, kernel e72ef590... qemu fda19064...
Hi All, This is KVM test result against kvm.git e72ef590a3ef3047f6ed5bcb8808a9734f6c4b32 based on kernel 3.0.0+, and qemu-kvm.git fda19064e889d4419dd3dc69ca8e6e7a1535fdf5. We found no new bugs during the past two weeks. We found 2 bugs got fixed. One fixed bug is about Win2k8 and Vista guest’s network issue, and the other is a qemu-kvm build issue. And an old bug about VT-D also exists. https://bugs.launchpad.net/qemu/+bug/799036 New issue: Fixed issues: 1. Win Vista and Win2K8 guests' network breaks down https://bugzilla.kernel.org/show_bug.cgi?id=39412 2. qemu-kvm.git make error when ‘CC ui/vnc-enc-tight.o’ https://bugs.launchpad.net/qemu/+bug/802588 Old Issues list: Old Issues: 1. ltp diotest running time is 2.54 times than before https://sourceforge.net/tracker/?func=detail&aid=2723366&group_id=180599&atid=893831 2. perfctr wrmsr warning when booting 64bit RHEl5.3 https://sourceforge.net/tracker/?func=detail&aid=2721640&group_id=180599&atid=893831 3. [vt-d] NIC assignment order in command line make some NIC can't work https://bugs.launchpad.net/qemu/+bug/799036 Test environment: == Platform Westmere-EP SanyBridge-EP CPU Cores 24 32 Memory size 10G 32G Report summary of IA32E on Westmere-EP: Summary Test Report of Last Session = Total PassFailNoResult Crash = control_panel_ept_vpid 12 12 0 00 control_panel_ept 4 4 0 00 control_panel_vpid 3 3 0 00 control_panel 3 3 0 00 gtest_vpid 1 1 0 00 gtest_ept 1 1 0 00 gtest 3 2 0 00 vtd_ept_vpid3 1 1 00 gtest_ept_vpid 12 11 1 00 sriov_ept_vpid 6 6 0 00 = control_panel_ept_vpid 12 12 0 00 :KVM_LM_Continuity_64_g3 1 1 0 00 :KVM_four_dguest_64_g32e 1 1 0 00 :KVM_1500M_guest_64_gPAE 1 1 0 00 :KVM_SR_SMP_64_g32e1 1 0 00 :KVM_LM_SMP_64_g32e1 1 0 00 :KVM_linux_win_64_g32e 1 1 0 00 :KVM_two_winxp_64_g32e 1 1 0 00 :KVM_1500M_guest_64_g32e 1 1 0 00 :KVM_256M_guest_64_gPAE1 1 0 00 :KVM_SR_Continuity_64_g3 1 1 0 00 :KVM_256M_guest_64_g32e1 1 0 00 :KVM_four_sguest_64_g32e 1 1 0 00 control_panel_ept 4 4 0 00 :KVM_linux_win_64_g32e 1 1 0 00 :KVM_1500M_guest_64_g32e 1 1 0 00 :KVM_1500M_guest_64_gPAE 1 1 0 00 :KVM_LM_SMP_64_g32e1 1 0 00 control_panel_vpid 3 3 0 00 :KVM_linux_win_64_g32e 1 1 0 00 :KVM_1500M_guest_64_g32e 1 1 0 00 :KVM_1500M_guest_64_gPAE 1 1 0 00 control_panel 3 3 0 00 :KVM_1500M_guest_64_g32e 1 1 0 00 :KVM_1500M_guest_64_gPAE 1 1 0 00 :KVM_LM_SMP_64_g32e1 1 0 00 gtest_vpid 1 1 0 00 :boot_smp_win7_ent_64_g3 1 1 0 00 gtest_ept 1 1 0 00 :boot_smp_win7_ent_64_g3 1 1 0 00 gtest 3 3 0 00 :boot_smp_win2008_64_g32 1 1 0 00 :boot_smp_win7_ent_64_gP 1 1 0 00 :boot_smp_vista_64_g32e1 1 0 00 vtd_ept_vpid3 2 1 00 :one_pcie_smp_xp_64_g32e 1 1 0 00 :one_pcie_smp_64_g32e 1 1 0 00 :two_dev_smp_64_g32e 1 0 1 00 gtest_ept_vpid 12 11 1 00 :boot_up_acpi_64_g32e 1 1 0 00 :boot_base_kernel_64_g32 1
Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
On 07/29/2011 05:06 PM, Stefan Hajnoczi wrote: I mean did you investigate *why* userspace virtio-blk has higher latency? Did you profile it and drill down on its performance? It's important to understand what is going on before replacing it with another mechanism. What I'm saying is, if I have a buggy program I can sometimes rewrite it from scratch correctly but that doesn't tell me what the bug was. Perhaps the inefficiencies in userspace virtio-blk can be solved by adjusting the code (removing inefficient notification mechanisms, introducing a dedicated thread outside of the QEMU iothread model, etc). Then we'd get the performance benefit for non-raw images and perhaps non-virtio and non-Linux host platforms too. As Christoph mentioned, the unnecessary memory allocation and too much cache line unfriendly function pointers might be culprit. For example, the read quests code path for linux aio would be qemu_iohandler_poll->virtio_pci_host_notifier_read->virtio_queue_notify_vq->virtio_blk_handle_output ->virtio_blk_handle_read->bdrv_aio_read->raw_aio_readv->bdrv_aio_readv(Yes again nested called!)->raw_aio_readv->laio_submit->io_submit... Looking at this long list,most are function pointers that can not be inlined, and the internal data structures used by these functions are dozons. Leave aside code complexity, this long code path would really need retrofit. As Christoph simply put, this kind of mess is inherent all over the qemu code. So I am afraid, the 'retrofit' would end up to be a re-write the entire (sub)system. I have to admit that, I am inclined to the MST's vhost approach, that write a new subsystem other than tedious profiling and fixing, that would possibly goes as far as actually re-writing it. Actually, the motivation to start vhost-blk is that, in our observation, KVM(virtio enabled) in RHEL 6 is worse than Xen(PV) in RHEL in disk IO perspective, especially for sequential read/write (around 20% gap). We'll deploy a large number of KVM-based systems as the infrastructure of some service and this gap is really unpleasant. By the design, IMHO, virtio performance is supposed to be comparable to the para-vulgarization solution if not better, because for KVM, guest and backend driver could sit in the same address space via mmaping. This would reduce the overhead involved in page table modification, thus speed up the buffer management and transfer a lot compared with Xen PV. Yes, guest memory is just a region of QEMU userspace memory. So it's easy to reach inside and there are no page table tricks or copying involved. I am not in a qualified position to talk about QEMU , but I think the surprised performance improvement by this very primitive vhost-blk simply manifest that, the internal structure for qemu io is the way bloated. I say it *surprised* because basically vhost just reduces the number of system calls, which is heavily tuned by chip manufacture for years. So, I guess the performance number vhost-blk gains mainly could possibly be contributed to *shorter and simpler* code path. First we need to understand exactly what the latency overhead is. If we discover that it's simply not possible to do this equally well in userspace, then it makes perfect sense to use vhost-blk. So let's gather evidence and learn what the overheads really are. Last year I spent time looking at virtio-blk latency: http://www.linux-kvm.org/page/Virtio/Block/Latency Nice stuff. 
See especially this diagram: http://www.linux-kvm.org/page/Image:Threads.png The goal wasn't specifically to reduce synchronous sequential I/O, instead the aim was to reduce overheads for a variety of scenarios, especially multithreaded workloads. In most cases it was helpful to move I/O submission out of the vcpu thread by using the ioeventfd model just like vhost. Ioeventfd for userspace virtio-blk is now on by default in qemu-kvm. Try running the userspace virtio-blk benchmark with -drive if=none,id=drive0,file=... -device virtio-blk-pci,drive=drive0,ioeventfd=off. This causes QEMU to do I/O submission in the vcpu thread, which might reduce latency at the cost of stealing guest time. Anyway, IMHO, compared with user space approach, the in-kernel one would allow more flexibility and better integration with the kernel IO stack, since we don't need two IO stacks for guest OS. I agree that there may be advantages to integrating with in-kernel I/O mechanisms. An interesting step would be to implement the submit_bio() approach that Christoph suggested and seeing if that improves things further. Push virtio-blk as far as you can and let's see what the performance is! I have a hacked up world here that basically implements vhost-blk in userspace: http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c * A dedicated virtqueue thread sleeps on ioeventfd * Guest memory is pre-mapped and accessed directly (not using QEMU's usually memory access function
Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
On Fri, Jul 29, 2011 at 1:01 PM, Liu Yuan wrote: > On 07/29/2011 05:06 PM, Stefan Hajnoczi wrote: >> >> I mean did you investigate *why* userspace virtio-blk has higher >> latency? Did you profile it and drill down on its performance? >> >> It's important to understand what is going on before replacing it with >> another mechanism. What I'm saying is, if I have a buggy program I >> can sometimes rewrite it from scratch correctly but that doesn't tell >> me what the bug was. >> >> Perhaps the inefficiencies in userspace virtio-blk can be solved by >> adjusting the code (removing inefficient notification mechanisms, >> introducing a dedicated thread outside of the QEMU iothread model, >> etc). Then we'd get the performance benefit for non-raw images and >> perhaps non-virtio and non-Linux host platforms too. >> > > As Christoph mentioned, the unnecessary memory allocation and too much cache > line unfriendly > function pointers might be culprit. For example, the read quests code path > for linux aio would be > > > qemu_iohandler_poll->virtio_pci_host_notifier_read->virtio_queue_notify_vq->virtio_blk_handle_output > ->virtio_blk_handle_read->bdrv_aio_read->raw_aio_readv->bdrv_aio_readv(Yes > again nested called!)->raw_aio_readv->laio_submit->io_submit... > > Looking at this long list,most are function pointers that can not be > inlined, and the internal data structures used by these functions are > dozons. Leave aside code complexity, this long code path would really need > retrofit. As Christoph simply put, this kind of mess is inherent all over > the qemu code. So I am afraid, the 'retrofit' would end up to be a re-write > the entire (sub)system. I have to admit that, I am inclined to the MST's > vhost approach, that write a new subsystem other than tedious profiling and > fixing, that would possibly goes as far as actually re-writing it. I'm totally for vhost-blk if there are unique benefits that make it worth maintaining. But better benchmark results are not a cause, they are an effect. So the thing to do is to drill down on both vhost-blk and userspace virtio-blk to understand what causes overheads. Evidence showing that userspace can never compete is needed to justify vhost-blk IMO. >>> Actually, the motivation to start vhost-blk is that, in our observation, >>> KVM(virtio enabled) in RHEL 6 is worse than Xen(PV) in RHEL in disk IO >>> perspective, especially for sequential read/write (around 20% gap). >>> >>> We'll deploy a large number of KVM-based systems as the infrastructure of >>> some service and this gap is really unpleasant. >>> >>> By the design, IMHO, virtio performance is supposed to be comparable to >>> the >>> para-vulgarization solution if not better, because for KVM, guest and >>> backend driver could sit in the same address space via mmaping. This >>> would >>> reduce the overhead involved in page table modification, thus speed up >>> the >>> buffer management and transfer a lot compared with Xen PV. >> >> Yes, guest memory is just a region of QEMU userspace memory. So it's >> easy to reach inside and there are no page table tricks or copying >> involved. >> >>> I am not in a qualified position to talk about QEMU , but I think the >>> surprised performance improvement by this very primitive vhost-blk simply >>> manifest that, the internal structure for qemu io is the way bloated. I >>> say >>> it *surprised* because basically vhost just reduces the number of system >>> calls, which is heavily tuned by chip manufacture for years. 
So, I guess >>> the >>> performance number vhost-blk gains mainly could possibly be contributed >>> to >>> *shorter and simpler* code path. >> >> First we need to understand exactly what the latency overhead is. If >> we discover that it's simply not possible to do this equally well in >> userspace, then it makes perfect sense to use vhost-blk. >> >> So let's gather evidence and learn what the overheads really are. >> Last year I spent time looking at virtio-blk latency: >> http://www.linux-kvm.org/page/Virtio/Block/Latency >> > > Nice stuff. > >> See especially this diagram: >> http://www.linux-kvm.org/page/Image:Threads.png >> >> The goal wasn't specifically to reduce synchronous sequential I/O, >> instead the aim was to reduce overheads for a variety of scenarios, >> especially multithreaded workloads. >> >> In most cases it was helpful to move I/O submission out of the vcpu >> thread by using the ioeventfd model just like vhost. Ioeventfd for >> userspace virtio-blk is now on by default in qemu-kvm. >> >> Try running the userspace virtio-blk benchmark with -drive >> if=none,id=drive0,file=... -device >> virtio-blk-pci,drive=drive0,ioeventfd=off. This causes QEMU to do I/O >> submission in the vcpu thread, which might reduce latency at the cost >> of stealing guest time. >> >>> Anyway, IMHO, compared with user space approach, the in-kernel one would >>> allow more flexibility and better integration with the kernel IO stack, >>> since we don't n
Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
I hit a weirdness yesterday, just want to mention it in case you notice it too. When running vanilla qemu-kvm I forgot to use aio=native. When I compared the results against virtio-blk-data-plane (which *always* uses Linux AIO) I was surprised to find average 4k read latency was lower and the standard deviation was also lower. So from now on I will run tests both with and without aio=native. aio=native should be faster and if I can reproduce the reverse I'll try to figure out why. Stefan -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v2 00/23] Memory API, batch 1
On 07/26/2011 06:25 AM, Avi Kivity wrote: This patchset contains the core of the memory API, with one device (usb-ohci) coverted for reference. The API is currently implemented on top of the old ram_addr_t/cpu_register_physical_memory() API, but the plan is to make it standalone later. The goals of the API are: - correctness: by modelling the memory hierarchy, things like the 440FX PAM registers and movable, overlapping PCI BARs can be modelled accurately. - efficiency: by maintaining an object tree describing guest memory, we can eventually get rid of the page descriptor array - security: by having more information available declaratively, we reduce coding errors that may be exploited by malicious guests Applied all. Thanks. Regards, Anthony Liguori Also available from git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git refs/tags/memory-region-batch-1-v2 Changes from v1: - switched to gtk-doc - more copyright blurbs - simplified flatview_simplify() - use assert() instead of abort() for invariant checks (but keep abort() for runtime errors) - commit log fixups Avi Kivity (23): Add memory API documentation Hierarchical memory region API memory: implement dirty tracking memory: merge adjacent segments of a single memory region Internal interfaces for memory API memory: abstract address space operations memory: rename MemoryRegion::has_ram_addr to ::terminates memory: late initialization of ram_addr memory: I/O address space support memory: add backward compatibility for old portio registration memory: add backward compatibility for old mmio registration memory: add ioeventfd support memory: separate building the final memory map into two steps memory: transaction API exec.c: initialize memory map ioport: register ranges by byte aligned addresses always pc: grab system_memory pc: convert pc_memory_init() to memory API pc: move global memory map out of pc_init1() and into its callers pci: pass address space to pci bus when created pci: add MemoryRegion based BAR management API sysbus: add MemoryRegion based memory management API usb-ohci: convert to MemoryRegion Makefile.target|1 + docs/memory.txt| 172 exec-memory.h | 39 ++ exec.c | 19 + hw/apb_pci.c |2 + hw/bonito.c|4 +- hw/grackle_pci.c |5 +- hw/gt64xxx.c |4 +- hw/pc.c| 62 ++- hw/pc.h|9 +- hw/pc_piix.c | 20 +- hw/pci.c | 63 +++- hw/pci.h | 15 +- hw/pci_host.h |1 + hw/pci_internals.h |1 + hw/piix_pci.c | 13 +- hw/ppc4xx_pci.c|5 +- hw/ppc_mac.h |9 +- hw/ppc_newworld.c |5 +- hw/ppc_oldworld.c |3 +- hw/ppc_prep.c |3 +- hw/ppce500_pci.c |6 +- hw/prep_pci.c |5 +- hw/prep_pci.h |3 +- hw/sh_pci.c|4 +- hw/sysbus.c| 27 ++- hw/sysbus.h|3 + hw/unin_pci.c | 10 +- hw/usb-ohci.c | 42 +-- hw/versatile_pci.c |2 + ioport.c |4 +- memory.c | 1141 memory.h | 469 + 33 files changed, 2072 insertions(+), 99 deletions(-) create mode 100644 docs/memory.txt create mode 100644 exec-memory.h create mode 100644 memory.c create mode 100644 memory.h -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
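For anyone converting a device along the lines of the usb-ohci patch in this series, the shape of the new API is roughly the following. This is a sketch based on docs/memory.txt; the exact prototypes live in memory.h and may differ in detail:

static uint64_t mydev_read(void *opaque, target_phys_addr_t addr, unsigned size)
{
    /* return register contents for this offset */
    return 0;
}

static void mydev_write(void *opaque, target_phys_addr_t addr,
                        uint64_t data, unsigned size)
{
    /* update device state for this offset */
}

static const MemoryRegionOps mydev_ops = {
    .read = mydev_read,
    .write = mydev_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

/* during device init: create the region and map it into the parent space */
memory_region_init_io(&s->mmio, &mydev_ops, s, "mydev-mmio", 0x1000);
memory_region_add_subregion(get_system_memory(), base_addr, &s->mmio);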
Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
On 07/29/2011 08:50 PM, Stefan Hajnoczi wrote: I hit a weirdness yesterday, just want to mention it in case you notice it too. When running vanilla qemu-kvm I forgot to use aio=native. When I compared the results against virtio-blk-data-plane (which *always* uses Linux AIO) I was surprised to find average 4k read latency was lower and the standard deviation was also lower. So from now on I will run tests both with and without aio=native. aio=native should be faster and if I can reproduce the reverse I'll try to figure out why. Stefan On my laptop, I don't meet this weirdo. the emulated POSIX AIO is much worse than the Linux AIO as expected. If iodepth goes deeper, the gap gets wider. If not set aio=none, qemu uses emulated posix aio interface to do the IO. I peek at the posix-aio-compat.c,it uses thread pool and sync preadv/pwritev to emulate the AIO behaviour. The sync IO interface would even cause much poorer performance for random rw, since io-scheduler would possibly never get a chance to merge the requests stream. (blk_finish_plug->queue_unplugged->__blk_run_queue) Yuan -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
On 07/29/2011 10:45 PM, Liu Yuan wrote: On 07/29/2011 08:50 PM, Stefan Hajnoczi wrote: I hit a weirdness yesterday, just want to mention it in case you notice it too. When running vanilla qemu-kvm I forgot to use aio=native. When I compared the results against virtio-blk-data-plane (which *always* uses Linux AIO) I was surprised to find average 4k read latency was lower and the standard deviation was also lower. So from now on I will run tests both with and without aio=native. aio=native should be faster and if I can reproduce the reverse I'll try to figure out why. Stefan On my laptop, I don't meet this weirdo. the emulated POSIX AIO is much worse than the Linux AIO as expected. If iodepth goes deeper, the gap gets wider. If not set aio=none, qemu uses emulated posix aio interface to do the IO. I peek at the posix-aio-compat.c,it uses thread pool and sync preadv/pwritev to emulate the AIO behaviour. The sync IO interface would even cause much poorer performance for random rw, since io-scheduler would possibly never get a chance to merge the requests stream. (blk_finish_plug->queue_unplugged->__blk_run_queue) Yuan Typo. not merge, I mean *sort* the reqs -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk
On 07/28/2011 11:22 PM, Michael S. Tsirkin wrote: On Thu, Jul 28, 2011 at 10:29:05PM +0800, Liu Yuan wrote: From: Liu Yuan Vhost-blk driver is an in-kernel accelerator, intercepting the IO requests from KVM virtio-capable guests. It is based on the vhost infrastructure. This is supposed to be a module over latest kernel tree, but it needs some symbols from fs/aio.c and fs/eventfd.c to compile with. So currently, after applying the patch, you need to *recomplie* the kernel. Usage: $kernel-src: make M=drivers/vhost $kernel-src: sudo insmod drivers/vhost/vhost_blk.ko After insmod, you'll see /dev/vhost-blk created. done! Signed-off-by: Liu Yuan Thanks, this is an interesting patch. There are some coding style issues in this patch, could you please change the code to match the kernel coding style? In particular pls prefix functions macros etc with vhost_blk to avoid confusion. scripts/checkpatch.pl can find some, but not all, issues. --- drivers/vhost/Makefile |3 + drivers/vhost/blk.c| 568 drivers/vhost/vhost.h | 11 + fs/aio.c | 44 ++--- fs/eventfd.c |1 + include/linux/aio.h| 31 +++ As others said, core changes need to be split out and get acks from relevant people. Use scripts/get_maintainer.pl to get a list. 6 files changed, 631 insertions(+), 27 deletions(-) create mode 100644 drivers/vhost/blk.c diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile index 72dd020..31f8b2e 100644 --- a/drivers/vhost/Makefile +++ b/drivers/vhost/Makefile @@ -1,2 +1,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o +obj-m += vhost_blk.o + vhost_net-y := vhost.o net.o +vhost_blk-y := vhost.o blk.o diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c new file mode 100644 index 000..f3462be --- /dev/null +++ b/drivers/vhost/blk.c @@ -0,0 +1,568 @@ +/* Copyright (C) 2011 Taobao, Inc. + * Author: Liu Yuan + * + * This work is licensed under the terms of the GNU GPL, version 2. + * + * Vhost-blk driver is an in-kernel accelerator, intercepting the + * IO requests from KVM virtio-capable guests. It is based on the + * vhost infrastructure. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "vhost.h" + +#define DEBUG 0 + +#if DEBUG> 0 +#define dprintk printk +#else +#define dprintk(x...) do { ; } while (0) +#endif There are standard macros for these. 
+ +enum { + virtqueue_max = 1, +}; + +#define MAX_EVENTS 128 + +struct vhost_blk { + struct vhost_virtqueue vq; + struct vhost_dev dev; + int should_stop; + struct kioctx *ioctx; + struct eventfd_ctx *ectx; + struct file *efile; + struct task_struct *worker; +}; + +struct used_info { + void *status; + int head; + int len; +}; + +static struct io_event events[MAX_EVENTS]; + +static void blk_flush(struct vhost_blk *blk) +{ + vhost_poll_flush(&blk->vq.poll); +} + +static long blk_set_features(struct vhost_blk *blk, u64 features) +{ + blk->dev.acked_features = features; + return 0; +} + +static void blk_stop(struct vhost_blk *blk) +{ + struct vhost_virtqueue *vq =&blk->vq; + struct file *f; + + mutex_lock(&vq->mutex); + f = rcu_dereference_protected(vq->private_data, + lockdep_is_held(&vq->mutex)); + rcu_assign_pointer(vq->private_data, NULL); + mutex_unlock(&vq->mutex); + + if (f) + fput(f); +} + +static long blk_set_backend(struct vhost_blk *blk, struct vhost_vring_file *backend) +{ + int idx = backend->index; + struct vhost_virtqueue *vq =&blk->vq; + struct file *file, *oldfile; + int ret; + + mutex_lock(&blk->dev.mutex); + ret = vhost_dev_check_owner(&blk->dev); + if (ret) + goto err_dev; + if (idx>= virtqueue_max) { + ret = -ENOBUFS; + goto err_dev; + } + + mutex_lock(&vq->mutex); + + if (!vhost_vq_access_ok(vq)) { + ret = -EFAULT; + goto err_vq; + } NET used -1 backend to remove a backend. I think it's a good idea, to make an operation reversible. + + file = fget(backend->fd); We need to verify that the file type passed makes sense. For example, it's possible to create reference loops by passng the vhost-blk fd. + if (IS_ERR(file)) { + ret = PTR_ERR(file); + goto err_vq; + } + + oldfile = rcu_dereference_protected(vq->private_data, + lockdep_is_held(&vq->mutex)); + if (file != oldfile) + rcu_assign_pointer(vq->private_data, file); + + mutex_unlock(&vq->mutex); + + if (oldfile) { + blk_flush(blk); + fput(oldfile); + } + + mutex_unlock(&bl
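On the "standard macros" review comment above: the usual replacement for a private DEBUG/dprintk pair is the kernel's pr_debug() family, which compiles to nothing unless DEBUG or dynamic debug is enabled. A sketch, with the message and its arguments invented for illustration:

/* goes before the #includes so every pr_* call gets the prefix */
#define pr_fmt(fmt) "vhost-blk: " fmt

/* compiled out unless built with -DDEBUG or enabled via dynamic debug */
pr_debug("queued %d iovecs for request head %d\n", nvecs, head);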
Re: [RFC PATCH] vhost-blk: In-kernel accelerator for virtio block device
On Fri, 2011-07-29 at 20:01 +0800, Liu Yuan wrote: > Looking at this long list, most are function pointers that cannot be > inlined, and the internal data structures used by these functions number in > the dozens. Code complexity aside, this long code path would really need a > retrofit. As Christoph simply put it, this kind of mess is inherent all over > the qemu code. So I am afraid the 'retrofit' would end up being a rewrite of > the entire (sub)system. I have to admit that I am inclined toward MST's > vhost approach: write a new subsystem rather than do tedious profiling and > fixing, which would possibly go as far as actually rewriting it. I don't think the fix for problematic userspace is to write more kernel code. vhost-net improved throughput and latency by several factors, allowing it to achieve much more than was possible in userspace alone. With vhost-blk we see an improvement of ~15% - which, judging by your and Christoph's comments, can be mostly attributed to QEMU. Merging a module which won't improve performance dramatically compared to what is achievable in userspace (even if that would require a code rewrite) sounds a bit wrong to me. -- Sasha. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] vhost-blk: In-kernel accelerator for virtio block device
Hi Liu Yuan, I am glad to see that you started looking at vhost-blk. I made an attempt a year ago to improve block performance using the vhost-blk approach. http://lwn.net/Articles/379864/ http://lwn.net/Articles/382543/ I will take a closer look at your patchset to find differences and similarities. - I focused on using vfs interfaces in the kernel, so that I could use it for file-backed devices. Our use-case scenario is mostly file-backed images. - In a few cases, virtio-blk did outperform vhost-blk -- which was counter-intuitive -- but I couldn't exactly nail down why. - I had to implement my own threads for parallelism. I see that you are using the aio infrastructure to get around that. - In our high-scale performance testing, what we found is that block-backed device performance is pretty close to bare metal (91% of bare metal). vhost-blk didn't add any major benefits on top of that. I am curious about your performance analysis & data on where you see the gains and why. Hence I prioritized my work low :( Now that you are interested in driving this, I am happy to work with you and see what vhost-blk brings to the table (even if it only helps us improve virtio-blk). Thanks, Badari -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH v2 0/3] separate thread for VM migration
The following patch series deals with VCPU and iothread starvation during the migration of a guest. Currently the iothread is responsible for performing the guest migration. It holds qemu_mutex during the migration, which keeps VCPUs from entering qemu mode and delays their return to the guest. The guest migration, executed as an iohandler, also delays the execution of other iohandlers. In this patch series, the migration has been moved to a separate thread to reduce qemu_mutex contention and iohandler starvation. The current dirty bitmap is also split into per-memslot bitmaps to reduce its size. Umesh Deshpande (3): separate thread for VM migration fine grained qemu_mutex locking for migration per memslot dirty bitmap arch_init.c | 14 ++-- buffered_file.c | 28 - buffered_file.h |4 +++ cpu-all.h | 40 ++-- exec.c | 38 +- migration.c | 60 -- migration.h |3 ++ savevm.c| 22 +--- savevm.h| 29 ++ xen-all.c |6 +--- 10 files changed, 173 insertions(+), 71 deletions(-) create mode 100644 savevm.h -- 1.7.4.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH v2 1/3] separate thread for VM migration
This patch creates a separate thread for the guest migration on the source side. Signed-off-by: Umesh Deshpande --- buffered_file.c | 28 - buffered_file.h |4 +++ migration.c | 59 +++--- migration.h |3 ++ savevm.c| 22 +--- savevm.h| 29 +++ 6 files changed, 102 insertions(+), 43 deletions(-) create mode 100644 savevm.h diff --git a/buffered_file.c b/buffered_file.c index 41b42c3..d4146bf 100644 --- a/buffered_file.c +++ b/buffered_file.c @@ -16,12 +16,16 @@ #include "qemu-timer.h" #include "qemu-char.h" #include "buffered_file.h" +#include "migration.h" +#include "savevm.h" +#include "qemu-thread.h" //#define DEBUG_BUFFERED_FILE typedef struct QEMUFileBuffered { BufferedPutFunc *put_buffer; +BufferedBeginFunc *begin; BufferedPutReadyFunc *put_ready; BufferedWaitForUnfreezeFunc *wait_for_unfreeze; BufferedCloseFunc *close; @@ -35,6 +39,7 @@ typedef struct QEMUFileBuffered size_t buffer_size; size_t buffer_capacity; QEMUTimer *timer; +QemuThread thread; } QEMUFileBuffered; #ifdef DEBUG_BUFFERED_FILE @@ -181,8 +186,6 @@ static int buffered_close(void *opaque) ret = s->close(s->opaque); -qemu_del_timer(s->timer); -qemu_free_timer(s->timer); qemu_free(s->buffer); qemu_free(s); @@ -228,17 +231,15 @@ static int64_t buffered_get_rate_limit(void *opaque) return s->xfer_limit; } -static void buffered_rate_tick(void *opaque) +void buffered_rate_tick(QEMUFile *file) { -QEMUFileBuffered *s = opaque; +QEMUFileBuffered *s = file->opaque; if (s->has_error) { buffered_close(s); return; } -qemu_mod_timer(s->timer, qemu_get_clock_ms(rt_clock) + 100); - if (s->freeze_output) return; @@ -250,9 +251,17 @@ static void buffered_rate_tick(void *opaque) s->put_ready(s->opaque); } +static void *migrate_vm(void *opaque) +{ +QEMUFileBuffered *s = opaque; +s->begin(s->opaque); +return NULL; +} + QEMUFile *qemu_fopen_ops_buffered(void *opaque, size_t bytes_per_sec, BufferedPutFunc *put_buffer, + BufferedBeginFunc *begin, BufferedPutReadyFunc *put_ready, BufferedWaitForUnfreezeFunc *wait_for_unfreeze, BufferedCloseFunc *close) @@ -264,6 +273,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque, s->opaque = opaque; s->xfer_limit = bytes_per_sec / 10; s->put_buffer = put_buffer; +s->begin = begin; s->put_ready = put_ready; s->wait_for_unfreeze = wait_for_unfreeze; s->close = close; @@ -271,11 +281,9 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque, s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL, buffered_close, buffered_rate_limit, buffered_set_rate_limit, -buffered_get_rate_limit); - -s->timer = qemu_new_timer_ms(rt_clock, buffered_rate_tick, s); + buffered_get_rate_limit); -qemu_mod_timer(s->timer, qemu_get_clock_ms(rt_clock) + 100); +qemu_thread_create(&s->thread, migrate_vm, s); return s->file; } diff --git a/buffered_file.h b/buffered_file.h index 98d358b..cfe2833 100644 --- a/buffered_file.h +++ b/buffered_file.h @@ -17,12 +17,16 @@ #include "hw/hw.h" typedef ssize_t (BufferedPutFunc)(void *opaque, const void *data, size_t size); +typedef void (BufferedBeginFunc)(void *opaque); typedef void (BufferedPutReadyFunc)(void *opaque); typedef void (BufferedWaitForUnfreezeFunc)(void *opaque); typedef int (BufferedCloseFunc)(void *opaque); +void buffered_rate_tick(QEMUFile *file); + QEMUFile *qemu_fopen_ops_buffered(void *opaque, size_t xfer_limit, BufferedPutFunc *put_buffer, + BufferedBeginFunc *begin, BufferedPutReadyFunc *put_ready, BufferedWaitForUnfreezeFunc *wait_for_unfreeze, BufferedCloseFunc *close); diff --git a/migration.c b/migration.c index af3a1f2..bf86067 100644 --- a/migration.c +++ b/migration.c @@ 
-31,6 +31,8 @@ do { } while (0) #endif +static int64_t expire_time; + /* Migration speed throttling */ static int64_t max_throttle = (32 << 20); @@ -284,8 +286,6 @@ int migrate_fd_cleanup(FdMigrationState *s) { int ret = 0; -qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL); - if (s->file) { DPRINTF("closing file\n"); if (qemu_fclose(s->file) != 0) { @@ -310,8 +310,7 @@ int migrate_fd_cleanup(FdMigrationState *s) void migrate_fd_put_notify(void *opaque) { FdMigrationState *s = op
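The control-flow change in this patch is easier to see outside of QEMU: the 100 ms buffered_rate_tick timer that used to run in the iothread becomes a loop in a dedicated thread. A standalone sketch of that pattern in plain C with pthreads (the names, the 100 ms cadence, and the stop flag are illustrative, not QEMU code):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool migration_active = true;	/* stands in for s->state == MIG_STATE_ACTIVE */

static void rate_tick(void)
{
	/* stands in for buffered_rate_tick(): push up to xfer_limit bytes */
	printf("tick: flush buffered migration data\n");
}

static void *migrate_thread(void *opaque)
{
	(void)opaque;
	while (migration_active) {
		rate_tick();
		usleep(100 * 1000);	/* the old qemu_mod_timer(..., +100) cadence */
	}
	return NULL;
}

int main(void)
{
	pthread_t th;

	pthread_create(&th, NULL, migrate_thread, NULL);	/* qemu_thread_create() in the patch */
	sleep(1);						/* guest keeps running meanwhile */
	migration_active = false;
	pthread_join(th, NULL);
	return 0;
}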
[RFC PATCH v2 3/3] Per memslot dirty bitmap
This patch creates a separate dirty bitmap for each slot. Currently dirty bitmap is created for addresses ranging from 0 to the end address of the last memory slot. Since the memslots are not necessarily contiguous, current bitmap might contain empty region or holes that doesn't represent any VM pages. This patch reduces the size of the dirty bitmap by allocating per memslot dirty bitmaps. Signed-off-by: Umesh Deshpande --- cpu-all.h | 40 +--- exec.c| 38 +++--- xen-all.c |6 ++ 3 files changed, 58 insertions(+), 26 deletions(-) diff --git a/cpu-all.h b/cpu-all.h index e839100..9517a9b 100644 --- a/cpu-all.h +++ b/cpu-all.h @@ -920,6 +920,7 @@ extern ram_addr_t ram_size; typedef struct RAMBlock { uint8_t *host; +uint8_t *phys_dirty; ram_addr_t offset; ram_addr_t length; uint32_t flags; @@ -931,7 +932,6 @@ typedef struct RAMBlock { } RAMBlock; typedef struct RAMList { -uint8_t *phys_dirty; QLIST_HEAD(ram, RAMBlock) blocks; } RAMList; extern RAMList ram_list; @@ -961,32 +961,55 @@ extern int mem_prealloc; #define CODE_DIRTY_FLAG 0x02 #define MIGRATION_DIRTY_FLAG 0x08 +RAMBlock *qemu_addr_to_ramblock(ram_addr_t); + +static inline int get_page_nr(ram_addr_t addr, RAMBlock **block) +{ +int page_nr; +*block = qemu_addr_to_ramblock(addr); + +page_nr = addr - (*block)->offset; +page_nr = page_nr >> TARGET_PAGE_BITS; + +return page_nr; +} + /* read dirty bit (return 0 or 1) */ static inline int cpu_physical_memory_is_dirty(ram_addr_t addr) { -return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] == 0xff; +RAMBlock *block; +int page_nr = get_page_nr(addr, &block); +return block->phys_dirty[page_nr] == 0xff; } static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr) { -return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS]; +RAMBlock *block; +int page_nr = get_page_nr(addr, &block); +return block->phys_dirty[page_nr]; } static inline int cpu_physical_memory_get_dirty(ram_addr_t addr, int dirty_flags) { -return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags; +RAMBlock *block; +int page_nr = get_page_nr(addr, &block); +return block->phys_dirty[page_nr] & dirty_flags; } static inline void cpu_physical_memory_set_dirty(ram_addr_t addr) { -ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] = 0xff; +RAMBlock *block; +int page_nr = get_page_nr(addr, &block); +block->phys_dirty[page_nr] = 0xff; } static inline int cpu_physical_memory_set_dirty_flags(ram_addr_t addr, int dirty_flags) { -return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] |= dirty_flags; +RAMBlock *block; +int page_nr = get_page_nr(addr, &block); +return block->phys_dirty[page_nr] |= dirty_flags; } static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start, @@ -995,10 +1018,13 @@ static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start, { int i, mask, len; uint8_t *p; +RAMBlock *block; +int page_nr = get_page_nr(start, &block); len = length >> TARGET_PAGE_BITS; mask = ~dirty_flags; -p = ram_list.phys_dirty + (start >> TARGET_PAGE_BITS); + +p = block->phys_dirty + page_nr; for (i = 0; i < len; i++) { p[i] &= mask; } diff --git a/exec.c b/exec.c index 0e2ce57..6312550 100644 --- a/exec.c +++ b/exec.c @@ -2106,6 +2106,10 @@ void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end, abort(); } +if (kvm_enabled()) { +return; +} + for(env = first_cpu; env != NULL; env = env->next_cpu) { int mmu_idx; for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { @@ -2894,17 +2898,6 @@ static ram_addr_t find_ram_offset(ram_addr_t size) return offset; } -static ram_addr_t last_ram_offset(void) -{ 
-RAMBlock *block; -ram_addr_t last = 0; - -QLIST_FOREACH(block, &ram_list.blocks, next) -last = MAX(last, block->offset + block->length); - -return last; -} - ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name, ram_addr_t size, void *host) { @@ -2974,10 +2967,8 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name, QLIST_INSERT_HEAD(&ram_list.blocks, new_block, next); -ram_list.phys_dirty = qemu_realloc(ram_list.phys_dirty, - last_ram_offset() >> TARGET_PAGE_BITS); -memset(ram_list.phys_dirty + (new_block->offset >> TARGET_PAGE_BITS), - 0xff, size >> TARGET_PAGE_BITS); +new_block->phys_dirty = qemu_mallocz(new_block->length >> TARGET_PAGE_BITS); +memset(new_block->phys_dirty, 0xff, new_block->length >> TARGET_PAGE_BITS); if (kvm_enabled()) kvm_setup_gu
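To put a number on the size argument above, here is a small standalone sketch; the two-slot layout is invented purely for illustration. With phys_dirty using one byte of flags per TARGET_PAGE_SIZE page (as in the patch), a flat allocation sized up to the end of the last slot pays for every hole, while per-memslot bitmaps only pay for pages that actually exist:

#include <stdint.h>
#include <stdio.h>

#define TARGET_PAGE_BITS 12	/* 4 KiB pages */

int main(void)
{
	/* hypothetical sparse layout: 1 GiB at offset 0, 1 GiB at offset 256 GiB */
	struct { uint64_t offset, length; } slots[] = {
		{ 0ULL,         1ULL << 30 },
		{ 256ULL << 30, 1ULL << 30 },
	};
	uint64_t last_end = slots[1].offset + slots[1].length;
	uint64_t flat_bytes = last_end >> TARGET_PAGE_BITS;	/* one flag byte per page, 0..last_end */
	uint64_t per_slot_bytes = 0;
	int i;

	for (i = 0; i < 2; i++)
		per_slot_bytes += slots[i].length >> TARGET_PAGE_BITS;

	printf("flat allocation:  %llu bytes\n", (unsigned long long)flat_bytes);
	printf("per-slot bitmaps: %llu bytes\n", (unsigned long long)per_slot_bytes);
	return 0;
}

For this layout the flat allocation comes to roughly 64 MiB against 512 KiB for the per-slot ones, which is the saving the cover letter refers to.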
[RFC PATCH v2 2/3] fine grained qemu_mutex locking for migration
In the migration thread, qemu_mutex is released during the most time consuming part. i.e. during is_dup_page which identifies the uniform data pages and during the put_buffer. qemu_mutex is also released while blocking on select to wait for the descriptor to become ready for writes. Signed-off-by: Umesh Deshpande --- arch_init.c | 14 +++--- migration.c | 11 +++ 2 files changed, 18 insertions(+), 7 deletions(-) diff --git a/arch_init.c b/arch_init.c index 484b39d..cd545bc 100644 --- a/arch_init.c +++ b/arch_init.c @@ -110,7 +110,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch) static RAMBlock *last_block; static ram_addr_t last_offset; -static int ram_save_block(QEMUFile *f) +static int ram_save_block(QEMUFile *f, int stage) { RAMBlock *block = last_block; ram_addr_t offset = last_offset; @@ -131,6 +131,10 @@ static int ram_save_block(QEMUFile *f) current_addr + TARGET_PAGE_SIZE, MIGRATION_DIRTY_FLAG); +if (stage != 3) { +qemu_mutex_unlock_iothread(); +} + p = block->host + offset; if (is_dup_page(p, *p)) { @@ -153,6 +157,10 @@ static int ram_save_block(QEMUFile *f) bytes_sent = TARGET_PAGE_SIZE; } +if (stage != 3) { +qemu_mutex_lock_iothread(); +} + break; } @@ -301,7 +309,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque) while (!qemu_file_rate_limit(f)) { int bytes_sent; -bytes_sent = ram_save_block(f); +bytes_sent = ram_save_block(f, stage); bytes_transferred += bytes_sent; if (bytes_sent == 0) { /* no more blocks */ break; @@ -322,7 +330,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque) int bytes_sent; /* flush all remaining blocks regardless of rate limiting */ -while ((bytes_sent = ram_save_block(f)) != 0) { +while ((bytes_sent = ram_save_block(f, stage)) != 0) { bytes_transferred += bytes_sent; } cpu_physical_memory_set_dirty_tracking(0); diff --git a/migration.c b/migration.c index bf86067..992fef5 100644 --- a/migration.c +++ b/migration.c @@ -375,15 +375,19 @@ void migrate_fd_begin(void *arg) if (ret < 0) { DPRINTF("failed, %d\n", ret); migrate_fd_error(s); -goto out; +qemu_mutex_unlock_iothread(); +return; } expire_time = qemu_get_clock_ms(rt_clock) + 100; migrate_fd_put_ready(s); +qemu_mutex_unlock_iothread(); while (s->state == MIG_STATE_ACTIVE) { if (migrate_fd_check_expire()) { +qemu_mutex_lock_iothread(); buffered_rate_tick(s->file); +qemu_mutex_unlock_iothread(); } if (s->state != MIG_STATE_ACTIVE) { @@ -392,12 +396,11 @@ void migrate_fd_begin(void *arg) if (s->callback) { migrate_fd_wait_for_unfreeze(s); +qemu_mutex_lock_iothread(); s->callback(s); +qemu_mutex_unlock_iothread(); } } - -out: -qemu_mutex_unlock_iothread(); } -- 1.7.4.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
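The locking pattern used above, shown standalone: pick the page and clear its dirty flag while holding the big lock, drop the lock for the expensive scan/send, and take it back before touching shared migration state again (stage 3, the final stop-and-copy, keeps the lock throughout). The names here are illustrative, not QEMU's:

#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;	/* stands in for qemu_mutex */

/* Expensive work that touches only the page contents, not migration
 * bookkeeping, so it is safe to run without the global lock. */
static int page_is_uniform(const unsigned char *page, size_t len)
{
	size_t i;

	for (i = 1; i < len; i++)
		if (page[i] != page[0])
			return 0;
	return 1;
}

/* Caller holds big_lock on entry; the page's dirty flag has already been
 * cleared under the lock, mirroring ram_save_block() in the patch. */
static void save_one_page(const unsigned char *page, size_t len, int final_stage)
{
	int uniform;

	if (!final_stage)
		pthread_mutex_unlock(&big_lock);	/* let VCPUs and iohandlers make progress */

	uniform = page_is_uniform(page, len);
	(void)uniform;	/* ...then send either a one-byte marker or the full page */

	if (!final_stage)
		pthread_mutex_lock(&big_lock);	/* reacquire before returning to shared state */
}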
[PATCH RFC net-next] virtio_net: refill buffer right after being used
To even the latency, refill buffer right after being used. Sign-off-by: Shirley Ma --- diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 0c7321c..c8201d4 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -429,6 +429,22 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp) return err; } +static bool fill_one(struct virtio_net *vi, gfp_t gfp) +{ + int err; + + if (vi->mergeable_rx_bufs) + err = add_recvbuf_mergeable(vi, gfp); + else if (vi->big_packets) + err = add_recvbuf_big(vi, gfp); + else + err = add_recvbuf_small(vi, gfp); + + if (err >= 0) + ++vi->num; + return err; +} + /* Returns false if we couldn't fill entirely (OOM). */ static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp) { @@ -436,17 +452,10 @@ static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp) bool oom; do { - if (vi->mergeable_rx_bufs) - err = add_recvbuf_mergeable(vi, gfp); - else if (vi->big_packets) - err = add_recvbuf_big(vi, gfp); - else - err = add_recvbuf_small(vi, gfp); - + err = fill_one(vi, gfp); oom = err == -ENOMEM; if (err < 0) break; - ++vi->num; } while (err > 0); if (unlikely(vi->num > vi->max)) vi->max = vi->num; @@ -506,13 +515,13 @@ again: receive_buf(vi->dev, buf, len); --vi->num; received++; - } - - if (vi->num < vi->max / 2) { - if (!try_fill_recv(vi, GFP_ATOMIC)) + if (fill_one(vi, GFP_ATOMIC) < 0) schedule_delayed_work(&vi->refill, 0); } + /* notify buffers are refilled */ + virtqueue_kick(vi->rvq); + /* Out of packets? */ if (received < budget) { napi_complete(napi); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM: x86: report valid microcode update ID
Windows Server 2008 SP2 checked build with smp > 1 BSOD's during boot due to lack of microcode update: *** Assertion failed: The system BIOS on this machine does not properly support the processor. The system BIOS did not load any microcode update. A BIOS containing the latest microcode update is needed for system reliability. (CurrentUpdateRevision != 0) *** Source File: d:\longhorn\base\hals\update\intelupd\update.c, line 440 Report a non-zero microcode update signature to make it happy. Signed-off-by: Marcelo Tosatti diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e80f0d7..f435591 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1841,7 +1841,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) switch (msr) { case MSR_IA32_PLATFORM_ID: - case MSR_IA32_UCODE_REV: case MSR_IA32_EBL_CR_POWERON: case MSR_IA32_DEBUGCTLMSR: case MSR_IA32_LASTBRANCHFROMIP: @@ -1862,6 +1861,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) case MSR_FAM10H_MMIO_CONF_BASE: data = 0; break; + case MSR_IA32_UCODE_REV: + data = 0x1ULL; + break; case MSR_MTRRcap: data = 0x500 | KVM_NR_VAR_MTRR; break; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
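For anyone wanting to double-check what the guest now sees, a small illustrative test (not part of the patch): read MSR_IA32_UCODE_REV (0x8b) through the msr character device inside the guest; with the change applied it should come back non-zero.

/* Run inside the guest as root with the msr module loaded (modprobe msr). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uint64_t val = 0;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/cpu/0/msr");
		return 1;
	}
	/* the msr device uses the file offset as the MSR index */
	if (pread(fd, &val, sizeof(val), 0x8b) != sizeof(val)) {
		perror("read MSR_IA32_UCODE_REV");
		close(fd);
		return 1;
	}
	printf("MSR_IA32_UCODE_REV = 0x%016llx\n", (unsigned long long)val);
	close(fd);
	return 0;
}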
Re: [PATCH RFC net-next] virtio_net: refill buffer right after being used
Resubmit it with a typo fix. Signed-off-by: Shirley Ma --- diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 0c7321c..c8201d4 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -429,6 +429,22 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp) return err; } +static int fill_one(struct virtnet_info *vi, gfp_t gfp) +{ + int err; + + if (vi->mergeable_rx_bufs) + err = add_recvbuf_mergeable(vi, gfp); + else if (vi->big_packets) + err = add_recvbuf_big(vi, gfp); + else + err = add_recvbuf_small(vi, gfp); + + if (err >= 0) + ++vi->num; + return err; +} + /* Returns false if we couldn't fill entirely (OOM). */ static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp) { @@ -436,17 +452,10 @@ static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp) bool oom; do { - if (vi->mergeable_rx_bufs) - err = add_recvbuf_mergeable(vi, gfp); - else if (vi->big_packets) - err = add_recvbuf_big(vi, gfp); - else - err = add_recvbuf_small(vi, gfp); - + err = fill_one(vi, gfp); oom = err == -ENOMEM; if (err < 0) break; - ++vi->num; } while (err > 0); if (unlikely(vi->num > vi->max)) vi->max = vi->num; @@ -506,13 +515,13 @@ again: receive_buf(vi->dev, buf, len); --vi->num; received++; - } - - if (vi->num < vi->max / 2) { - if (!try_fill_recv(vi, GFP_ATOMIC)) + if (fill_one(vi, GFP_ATOMIC) < 0) schedule_delayed_work(&vi->refill, 0); } + /* notify buffers are refilled */ + virtqueue_kick(vi->rvq); + /* Out of packets? */ if (received < budget) { napi_complete(napi); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
kvm PCI assignment & VFIO ramblings
Hi folks ! So I promised Anthony I would try to summarize some of the comments & issues we have vs. VFIO after we've tried to use it for PCI pass-through on POWER. It's pretty long, there are various items with more or less impact, some of it is easily fixable, some are API issues, and we'll probably want to discuss them separately, but for now here's a brain dump. David, Alexei, please make sure I haven't missed anything :-) * Granularity of pass-through So let's first start with what is probably the main issue and the most contentious one, which is the problem of dealing with the various constraints which define the granularity of pass-through, along with exploiting features like the VTd iommu domains. For the sake of clarity, let me first talk a bit about the "granularity" issue I've mentioned above. There are various constraints that can/will force several devices to be "owned" by the same guest and to sit on the same side of the host/guest boundary. This is generally because some kind of HW resource is shared, and thus not doing so would break the isolation barrier and enable a guest to disrupt the operations of the host and/or another guest. Some of those constraints are well known, such as shared interrupts. Some are more subtle: for example, if a PCIe->PCI bridge exists in the system, there is no way for the iommu to identify transactions from devices coming from the PCI segment of that bridge with a granularity other than "behind the bridge". So typically an EHCI/OHCI/OHCI combo (a classic) behind such a bridge must be treated as a single "entity" for pass-through purposes. In IBM POWER land, we call this a "partitionable endpoint" (the term "endpoint" here is historic; such a PE can be made of several PCIe "endpoints"). I think "partitionable" is a pretty good name though to represent the constraints, so I'll call this a "partitionable group" from now on. Another example of such HW-imposed constraints is a shared iommu with no filtering capability (some older POWER hardware which we might want to support falls into that category; each PCI host bridge is its own domain but doesn't have a finer granularity... however those machines tend to have a lot of host bridges :) If we are ever going to consider applying some of this to non-PCI devices (see the ongoing discussions here), then we will be faced with the craziness of embedded designers, which probably means all sorts of new constraints we can't even begin to think about. This leads me to these initial conclusions: - The -minimum- granularity of pass-through is not always a single device and not always under SW control - Having a magic heuristic in libvirt to figure out those constraints is WRONG. This reeks of the XFree 4 PCI layer trying to duplicate the kernel's knowledge of PCI resource management and getting it wrong in many many cases, something that took years to fix, essentially by ripping it all out. This is kernel knowledge and thus we need the kernel to expose, in one way or another, what those constraints are, what those "partitionable groups" are. - That does -not- mean that we cannot specify for each individual device within such a group where we want to put it in qemu (what devfn etc...). As long as there is a clear understanding that the "ownership" of the device goes with the group, this is somewhat orthogonal to how they are represented in qemu. (Not completely... if the iommu is exposed to the guest, via paravirt for example, some of these constraints must be exposed, but I'll talk about that more later).
The interface currently proposed for VFIO (and the associated uiommu) doesn't handle that problem at all. Instead, it is entirely centered around a specific "feature" of the VTd iommus for creating arbitrary domains with arbitrary devices (though those devices -do- have the same constraints exposed above; don't try to put 2 legacy PCI devices behind the same bridge into 2 different domains !), but the API totally ignores the problem, leaves it to libvirt "magic foo" and focuses on something that is both quite secondary in the grand scheme of things, and quite x86/VTd specific in the implementation and API definition. Now, I'm not saying these programmable iommu domains aren't a nice feature and that we shouldn't exploit them when available, but as it is, it is too central a part of the API. I'll talk a little bit more about recent POWER iommus here to illustrate where I'm coming from with my idea of groups: On p7ioc (the IO chip used on recent P7 machines), there -is- a concept of domains and per-RID filtering. However it differs from VTd in a few ways: The "domains" (aka PEs) encompass more than just an iommu filtering scheme. The MMIO space and PIO space are also segmented, and those segments are assigned to domains. Interrupts (well, MSI ports at least) are assigned to domains. Inbound PCIe error messages are targeted to domains, etc... Basically, the PEs provide a very strong isolation feature which includes errors, and has the ability
Re: [PATCH RFC net-next] virtio_net: refill buffer right after being used
On Fri, Jul 29, 2011 at 3:55 PM, Shirley Ma wrote: > Resubmit it with a typo fix. > > Signed-off-by: Shirley Ma > --- > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 0c7321c..c8201d4 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -429,6 +429,22 @@ static int add_recvbuf_mergeable(struct virtnet_info > *vi, gfp_t gfp) > return err; > } > > +static int fill_one(struct virtnet_info *vi, gfp_t gfp) > +{ > + int err; > + > + if (vi->mergeable_rx_bufs) > + err = add_recvbuf_mergeable(vi, gfp); > + else if (vi->big_packets) > + err = add_recvbuf_big(vi, gfp); > + else > + err = add_recvbuf_small(vi, gfp); > + > + if (err >= 0) > + ++vi->num; > + return err; > +} > + > /* Returns false if we couldn't fill entirely (OOM). */ > static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp) > { > @@ -436,17 +452,10 @@ static bool try_fill_recv(struct virtnet_info *vi, > gfp_t gfp) > bool oom; > > do { > - if (vi->mergeable_rx_bufs) > - err = add_recvbuf_mergeable(vi, gfp); > - else if (vi->big_packets) > - err = add_recvbuf_big(vi, gfp); > - else > - err = add_recvbuf_small(vi, gfp); > - > + err = fill_one(vi, gfp); > oom = err == -ENOMEM; > if (err < 0) > break; > - ++vi->num; > } while (err > 0); > if (unlikely(vi->num > vi->max)) > vi->max = vi->num; > @@ -506,13 +515,13 @@ again: > receive_buf(vi->dev, buf, len); > --vi->num; > received++; > - } > - > - if (vi->num < vi->max / 2) { > - if (!try_fill_recv(vi, GFP_ATOMIC)) > + if (fill_one(vi, GFP_ATOMIC) < 0) > schedule_delayed_work(&vi->refill, 0); > } > > + /* notify buffers are refilled */ > + virtqueue_kick(vi->rvq); > + How does this reduce latency? We are doing the same amount of work in both cases, and in both cases the newly available buffers are not visible to the device until the virtqueue_kick.. > /* Out of packets? */ > if (received < budget) { > napi_complete(napi); > > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html