Re: Windows7 crashes inside the VM when starting a certain program

2011-07-29 Thread Paolo Bonzini

On 07/28/2011 07:44 PM, André Weidemann wrote:

Hi,

On 28.07.2011 15:49, Paolo Bonzini wrote:

On 07/28/2011 03:21 PM, Avi Kivity wrote:

I haven't used debuggers very much, so I hope I grabbed the correct
lines from the disassembly:
http://pastebin.com/t3sfvmTg


That's the bug check routine. Can you go up a frame?


Or just do what Gleb suggested. Open the dump, type "!analyze -v" and
cut-paste the address from WinDbg's output into the Disassemble window.


This is the output of "!analyze -v":
http://pastebin.com/sCZSjr8m

...and this is the output from the disassembly window:
http://pastebin.com/AVZuswkT


Very useful, thanks!

Paolo


Re: Windows7 crashes inside the VM when starting a certain program

2011-07-29 Thread André Weidemann

On 27.07.2011 10:56, Gleb Natapov wrote:

On Tue, Jul 26, 2011 at 12:57:44PM +0200, André Weidemann wrote:

Hi,

On 26.07.2011 12:08, Gleb Natapov wrote:

On Tue, Jul 26, 2011 at 07:29:04AM +0200, André Weidemann wrote:

On 07.07.2011 07:26, André Weidemann wrote:

Hi,
I am running Windows7 x64 in a VM which crashes after starting a certain
game. Actually there are two games, both from the same company, that make
the VM crash after starting them.
Windows crashes right after starting the game. With the 1st game the
screen goes black as usual and the cursor keeps spinning for 3-5 seconds
until Windows crashes. With the second game I get to the 3D login
screen. The game then crashes after logging in.
Windows displays this error message on the first crash:
http://pastebin.com/kMzk9Jif
Windows then finishes writing the crash dump and restarts.
I can reproduce Windows crashing every time I start the game while the
VM keeps running without any problems.
When Windows reboots after the first crash and the game is started
again, the message on the following blue screen changes slightly and
stays the same (except for the addresses) for every following crash:
http://pastebin.com/jVtBc4ZH

I first thought that this might be related to a certain feature in 3D
acceleration being used, but Futuremark 3DMark Vantage or 3DMark 11 run
without any problems. They run a bit choppy on some occasions, but do so
without crashing Windows7 or the VM.

How can I proceed to investigate what is going wrong?


I did some testing and found out that Windows7 does not crash
anymore when changing "-cpu host" to "-cpu Nehalem". After doing so,

What is your host cpu (cat /proc/cpuinfo)?


The server is currently running on 2 out of 8 cores with kernel boot
parameter "maxcpus=2".

flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est
tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow
vnmi flexpriority ept vpid

Flags that are present on -cpu host but not -cpu Nehalem (excluding vmx
related flags):

vme dts acpi ss ht tm pbe rdtscp constant_tsc arch_perfmon pebs bts rep_good
xtopology nonstop_tsc aperfmperf dtes64 monitor ds_cpl est tm2 xtpr pdcm  ida

Some of them may be synthetic and some of them may be filtered by KVM.

Can you try to run "-cpu host,-vme,-dts..." (specifying all of those
flags with -). Drop those that qemu does not recognize. See if the result
is the same as with -cpu Nehalem. If yes, then try to find out which
flag makes the difference.


I started the VM with all flags that differ between the two CPUs. After 
removing the ones qemu-kvm did not recognize, I started the VM again 
with the following line:
-cpu host,-vme,-acpi,-ss,-ht,-tm,-pbe,-rdtscp,-dtes64,-monitor,-ds_cpl,-est,-tm2,-xtpr,-pdcm \


Running the program under Windows7 inside the VM caused Windows to
crash again with a BSoD.

The disassembly of the address f8000288320c shows the following:
http://pastebin.com/7yzTYJSG

André



Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-07-29 Thread Liu Yuan

Hi Stefan
On 07/28/2011 11:44 PM, Stefan Hajnoczi wrote:

On Thu, Jul 28, 2011 at 3:29 PM, Liu Yuan  wrote:

Did you investigate userspace virtio-blk performance?  If so, what
issues did you find?



Yes, in the performance table I presented, virtio-blk in user space
lags behind vhost-blk in the kernel (although this prototype is a very
primitive implementation) by about 15%.


Actually, the motivation to start vhost-blk is that, in our observation,
KVM (virtio enabled) in RHEL 6 is worse than Xen (PV) in RHEL from a disk
IO perspective, especially for sequential read/write (around a 20% gap).


We'll deploy a large number of KVM-based systems as the infrastructure 
of some service and this gap is really unpleasant.


By design, IMHO, virtio performance is supposed to be comparable to the
para-virtualization solution if not better, because for KVM the guest and
the backend driver can sit in the same address space via mmap(). This
should reduce the overhead involved in page table modification, and thus
speed up buffer management and transfer a lot compared with Xen PV.


I am not in a qualified position to talk about QEMU, but I think the
surprising performance improvement from this very primitive vhost-blk
simply shows that the internal structure of the qemu IO path is rather
bloated. I say *surprising* because vhost basically just reduces the
number of system calls, something chip manufacturers have been tuning
heavily for years. So I guess the performance gain of vhost-blk can
mainly be attributed to a *shorter and simpler* code path.


Anyway, IMHO, compared with the user-space approach, the in-kernel one
would allow more flexibility and better integration with the kernel IO
stack, since we don't need two IO stacks for the guest OS.



I have a hacked up world here that basically implements vhost-blk in userspace:
http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c

  * A dedicated virtqueue thread sleeps on ioeventfd
  * Guest memory is pre-mapped and accessed directly (not using QEMU's
usual memory access functions)
  * Linux AIO is used, the QEMU block layer is bypassed
  * Completion interrupts are injected from the virtqueue thread using ioctl

I will try to rebase onto qemu-kvm.git/master (this work is several
months old).  Then we can compare to see how much of the benefit can
be gotten in userspace.
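
For reference, the loop being described has roughly the following shape (a
hypothetical sketch, not the actual hw/virtio-blk.c code; a single
hard-coded 4K read stands in for walking the vring, and the used-ring
update and interrupt injection are only hinted at in a comment):

    #include <libaio.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Dedicated virtqueue thread: sleep on the ioeventfd, service requests
     * with Linux AIO, bypassing the QEMU block layer. */
    static void virtqueue_thread(int ioeventfd, int fd, io_context_t ctx)
    {
        static char buf[4096] __attribute__((aligned(512)));
        uint64_t kicks;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;

        for (;;) {
            /* blocks until the guest writes the virtqueue notify register */
            if (read(ioeventfd, &kicks, sizeof(kicks)) != sizeof(kicks))
                break;

            io_prep_pread(&cb, fd, buf, sizeof(buf), 0);  /* stand-in request */
            io_submit(ctx, 1, cbs);
            io_getevents(ctx, 1, 1, &ev, NULL);

            /* here the real code fills the used ring and injects the
             * completion interrupt from this same thread via a KVM ioctl */
        }
    }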

I don't really get you about vhost-blk in user space, since the vhost
infrastructure itself means an in-kernel accelerator implemented in the
kernel. I guess what you mean is essentially a rewrite of virtio-blk in
user space with a dedicated thread handling requests, and a shorter code
path similar to vhost-blk.



[performance]

Currently, the fio benchmarking numbers are rather promising. Sequential
read throughput is improved by as much as 16% and latency drops by up to
14%. For sequential write, the figures are 13.5% and 13% respectively.

sequential read:
+-------------+-------------+---------------+---------------+
| iodepth     | 1           |       2       |       3       |
+-------------+-------------+---------------+---------------+
| virtio-blk  | 4116(214)   |   7814(222)   |   8867(306)   |
+-------------+-------------+---------------+---------------+
| vhost-blk   | 4755(183)   |   8645(202)   |   10084(266)  |
+-------------+-------------+---------------+---------------+

4116(214) means 4116 IOPS, with a completion latency of 214 us.

sequential write:
+-------------+-------------+---------------+---------------+
| iodepth     | 1           |       2       |       3       |
+-------------+-------------+---------------+---------------+
| virtio-blk  | 3848(228)   |   6505(275)   |   9335(291)   |
+-------------+-------------+---------------+---------------+
| vhost-blk   | 4370(198)   |   7009(249)   |   9938(264)   |
+-------------+-------------+---------------+---------------+

the fio command for sequential read:

sudo fio -name iops -readonly -rw=read -runtime=120 -iodepth 1 -filename 
/dev/vda -ioengine libaio -direct=1 -bs=512

and config file for sequential write is:

dev@taobao:~$ cat rw.fio
-
[test]

rw=rw
size=200M
directory=/home/dev/data
ioengine=libaio
iodepth=1
direct=1
bs=512
-

512 byte blocksize is very small, given that you can expect a file
system to have 4 KB or so block sizes.  It would be interesting to
measure a wider range of block sizes: 4 KB, 64 KB, and 128 KB for
example.

Stefan
Actually, I have tested 4KB and it shows the same improvement. What I care
about more is iodepth, since batched AIO would benefit from it. But my
laptop SATA disk doesn't behave as well as it advertises: it says its NCQ
queue depth is 32, and the kernel tells me it supports 31 requests in one
go. When I increase the iodepth in the test up to 4, both the host's and
the guest's IOPS drop drastically.


Yuan

Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-07-29 Thread Liu Yuan

Hi
On 07/29/2011 12:48 PM, Stefan Hajnoczi wrote:

On Thu, Jul 28, 2011 at 4:44 PM, Stefan Hajnoczi  wrote:

On Thu, Jul 28, 2011 at 3:29 PM, Liu Yuan  wrote:

Did you investigate userspace virtio-blk performance?  If so, what
issues did you find?

I have a hacked up world here that basically implements vhost-blk in userspace:
http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c

  * A dedicated virtqueue thread sleeps on ioeventfd
  * Guest memory is pre-mapped and accessed directly (not using QEMU's
usually memory access functions)
  * Linux AIO is used, the QEMU block layer is bypassed
  * Completion interrupts are injected from the virtqueue thread using ioctl

I will try to rebase onto qemu-kvm.git/master (this work is several
months old).  Then we can compare to see how much of the benefit can
be gotten in userspace.

Here is the rebased virtio-blk-data-plane tree:
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/virtio-blk-data-plane

When I run it on my laptop with an Intel X-25M G2 SSD I see a latency
reduction compared to mainline userspace virtio-blk.  I'm not posting
results because I did quick fio runs without ensuring a quiet
benchmarking environment.

There are a couple of things that could be modified:
  * I/O request merging is done to mimic bdrv_aio_multiwrite() - but
vhost-blk does not do this.  Try turning it off?


I noted that bdrv_aio_multiwrite() does the merging job, but I am not sure
this trick is really needed, since we have an IO scheduler down the path
that is in a much better position to merge requests. I think the duplicate
*premature* merging in bdrv_aio_multiwrite() is the result of
laio_submit()'s lack of batched submission. io_submit() in fs/aio.c shows
that every time we call laio_submit(), it submits that single request to
the driver's request queue, which is run when we blk_finish_plug(). IMHO,
you could simply batch io_submit() requests instead of using this trick,
since you already bypass the QEMU block layer.
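
A minimal sketch of what such batched submission could look like with
libaio (an illustration of the suggestion above, not QEMU code; the caller
is assumed to have collected the pending iocbs into an array):

    #include <libaio.h>

    /* Hand a whole batch of requests to the kernel in one io_submit() call
     * instead of merging them in userspace beforehand. */
    static int submit_batch(io_context_t ctx, struct iocb **pending, int nr)
    {
        int done = 0;

        while (done < nr) {
            int ret = io_submit(ctx, nr - done, pending + done);
            if (ret < 0)
                return ret;          /* e.g. -EAGAIN when the ring is full */
            done += ret;             /* io_submit() may accept only part */
        }
        return done;
    }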



  * epoll(2) is used but perhaps select(2)/poll(2) have lower latency
for this use case.  Try another event mechanism.

Let's see how it compares to vhost-blk first.  I can tweak it if we
want to investigate further.

Yuan: Do you want to try the virtio-blk-data-plane tree?  You don't
need to change the qemu-kvm command-line options.

Stefan
Yes, please, that sounds interesting. BTW, I think user space could achieve
the same performance gain if you bypass the qemu IO layer all the way down
to the system calls in the request handling cycle, compared to the current
vhost-blk implementation that uses Linux AIO. But hey, I would go further
and optimise it with the block layer and other in-kernel resources in
mind. ;) And it doesn't add complexity to the current qemu IO layer.


Yuan


Re: Nested VMX - L1 hangs on running L2

2011-07-29 Thread Zachary Amsden
2011/7/27 Nadav Har'El :
> On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs 
> on running L2":
>> > > No, both patches are wrong.
>> >
>>
>> kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc) should always return the L1 TSC,
>> regardless of the setting of any MSR bitmap. The reason why is that it
>> is being called by the L0 hypervisor kernel, which handles only
>> interactions with the L1 MSRs.
>
> guest_read_tsc() (called by the above get_msr) currently does this:
>
>        static u64 guest_read_tsc(void)
>        {
>                u64 host_tsc, tsc_offset;
>
>                rdtscll(host_tsc);
>                tsc_offset = vmcs_read64(TSC_OFFSET);
>                return host_tsc + tsc_offset;
>        }

That's wrong.  You should NEVER believe the offset written into the
hardware VMCS to be the current, real L1 TSC offset, as that is not an
invariant.

Instead, you should ALWAYS return the host TSC + the L1 TSC offset.
Sometimes, this may be the hardware value.

> I guess you'd want this to change to something like:
>
>                tsc_offset = is_guest_mode(vcpu) ?
>                        vmx->nested.vmcs01_tsc_offset :
>                        vmcs_read64(TSC_OFFSET);
>
> But I still am not convinced that that would be right

I believe this is correct.  But might it be cheaper to read from the
in-memory structure than from the actual hardware VMCS?

> E.g., imagine the case where L1 uses TSC_OFFSETING and but doesn't
> trap TSC MSR read. The SDM says (if I understand it correctly) that this TSC
> MSR read will not exit (because L1 doesn't trap it) but *will* do the extra
> offsetting. In this case, the original code (using vmcs02's TSC_OFFSET which
> is the sum of that of vmcs01 and vmcs12), is correct, and the new code will
> be incorrect. Or am I misunderstanding the SDM?

In that case, you need to distinguish between reads of the TSC MSR by
the guest and reads by the host (as done internally to track drift and
compensation).  The code that needs to change isn't guest_read_tsc(),
that code must respect the invariant of only returning the L1 guest
TSC (in fact, that may be a good name change for the function).  What
needs to change is the actual code involved in the MSR read.  If it
determines that something other than the L1 guest is running, it needs
to ignore the hardware TSC offset and return the TSC as if read by the
L1 guest.

Unfortunately, the layering currently doesn't seem to allow for this,
and it looks like both vendor specific variants currently get this
wrong.

The call stack:

kvm_get_msr()
kvm_x86_ops->get_msr()
vendor_get_msr()
vendor_guest_read_tsc()

offers no insight as to the intention of the caller.  Is it trying to
get the guest TSC to return to the guest, or is it trying to get the
guest TSC to calibrate / measure and compensate for TSC effects?

So you are right, this is still wrong for the case in which L1 does
not trap TSC MSR reads.  Note however, the RDTSC instruction is still
virtualized properly, it is only the relatively rare actual TSC MSR
read via RDMSR which is mis-virtualized (this bug exists today in the
SVM implementation if I am reading it correctly - cc'd Joerg to notify
him of that).  That, combined with the relative unimportance of
supporting a guest which does not trap on these MSR reads, suggests this
is a low-priority design issue, however (RDTSC still works even if the
MSR is trapped, correct?)

If you want to go the extra mile to support such guests, the only
fully correct approach then is to do one of the following:

1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and
transform current uses of the code which does TSC compensation to use
this new API.  *Bonus* - no need to do double indirection through the
generic MSR code.
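
A rough sketch of what the VMX side of such a hook could look like,
reusing the fields quoted earlier in this thread (hypothetical; the
to_vmx() helper and exact field names are assumptions, not a tested
implementation):

    /* Always return the TSC as the L1 guest would see it, regardless of
     * which guest level is currently loaded in hardware. */
    static u64 vmx_get_l1_tsc(struct kvm_vcpu *vcpu)
    {
        struct vcpu_vmx *vmx = to_vmx(vcpu);
        u64 host_tsc, tsc_offset;

        rdtscll(host_tsc);
        tsc_offset = is_guest_mode(vcpu) ?
                vmx->nested.vmcs01_tsc_offset :  /* L2 active: use saved L1 offset */
                vmcs_read64(TSC_OFFSET);         /* L1 active: hardware offset is L1's */
        return host_tsc + tsc_offset;
    }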

or, alternatively:

2) Do not trap MSR reads of the TSC if the current L1 guest is not
trapping MSR reads of the TSC.  This is not possible if you cannot
enforce specific read vs. write permission in hardware - it may be
possible however, if you can trap all MSR writes regardless of the
permission bitmap.

> Can you tell me in which case the original code would returen incorrect
> results to a guest (L1 or L2) doing anything MSR-related?

It never returns incorrect values to a guest.  It does however return
incorrect values to the L0 hypervisor, which is expecting to do
arithmetic based on the L1 TSC value, and this fails catastrophically
when it receives values for other nested guests.

> I'm assuming that some code in KVM also uses kvm_read_msr and assumes it
> gets the TSC value for L1, not for the guest currently running (L2 or L1).

Exactly.

> I don't understand why it needs to assume that... Why would it be wrong to
> return L2's TSC, and just remember that *changing* the L2 TSC really means
> changing the L1 TSC offset (vmcs01_tsc_offset), not vmcs12.tsc_offset which
> we can't touch?

L0 measures the L1 TSC at various points to be sure the L1 TSC never
regresses by going backwards, and also to

Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-07-29 Thread Stefan Hajnoczi
On Fri, Jul 29, 2011 at 8:22 AM, Liu Yuan  wrote:
> Hi Stefan
> On 07/28/2011 11:44 PM, Stefan Hajnoczi wrote:
>>
>> On Thu, Jul 28, 2011 at 3:29 PM, Liu Yuan  wrote:
>>
>> Did you investigate userspace virtio-blk performance?  If so, what
>> issues did you find?
>>
>
> Yes, in the performance table I presented, virtio-blk in the user space lags
> behind the vhost-blk(although this prototype is very primitive impl.) in the
> kernel by about 15%.

I mean did you investigate *why* userspace virtio-blk has higher
latency?  Did you profile it and drill down on its performance?

It's important to understand what is going on before replacing it with
another mechanism.  What I'm saying is, if I have a buggy program I
can sometimes rewrite it from scratch correctly but that doesn't tell
me what the bug was.

Perhaps the inefficiencies in userspace virtio-blk can be solved by
adjusting the code (removing inefficient notification mechanisms,
introducing a dedicated thread outside of the QEMU iothread model,
etc).  Then we'd get the performance benefit for non-raw images and
perhaps non-virtio and non-Linux host platforms too.

> Actually, the motivation to start vhost-blk is that, in our observation,
> KVM(virtio enabled) in RHEL 6 is worse than Xen(PV) in RHEL in disk IO
> perspective, especially for sequential read/write (around 20% gap).
>
> We'll deploy a large number of KVM-based systems as the infrastructure of
> some service and this gap is really unpleasant.
>
> By the design, IMHO, virtio performance is supposed to be comparable to the
> para-virtualization solution if not better, because for KVM, guest and
> backend driver could sit in the same address space via mmaping. This would
> reduce the overhead involved in page table modification, thus speed up the
> buffer management and transfer a lot compared with Xen PV.

Yes, guest memory is just a region of QEMU userspace memory.  So it's
easy to reach inside and there are no page table tricks or copying
involved.

> I am not in a qualified  position to talk about QEMU , but I think the
> surprised performance improvement by this very primitive vhost-blk simply
> manifest that, the internal structure for qemu io is the way bloated. I say
> it *surprised* because basically vhost just reduces the number of system
> calls, which is heavily tuned by chip manufacture for years. So, I guess the
> performance number vhost-blk gains mainly could possibly be contributed to
> *shorter and simpler* code path.

First we need to understand exactly what the latency overhead is.  If
we discover that it's simply not possible to do this equally well in
userspace, then it makes perfect sense to use vhost-blk.

So let's gather evidence and learn what the overheads really are.
Last year I spent time looking at virtio-blk latency:
http://www.linux-kvm.org/page/Virtio/Block/Latency

See especially this diagram:
http://www.linux-kvm.org/page/Image:Threads.png

The goal wasn't specifically to reduce synchronous sequential I/O,
instead the aim was to reduce overheads for a variety of scenarios,
especially multithreaded workloads.

In most cases it was helpful to move I/O submission out of the vcpu
thread by using the ioeventfd model just like vhost.  Ioeventfd for
userspace virtio-blk is now on by default in qemu-kvm.

Try running the userspace virtio-blk benchmark with -drive
if=none,id=drive0,file=... -device
virtio-blk-pci,drive=drive0,ioeventfd=off.  This causes QEMU to do I/O
submission in the vcpu thread, which might reduce latency at the cost
of stealing guest time.

> Anyway, IMHO, compared with user space approach, the in-kernel one would
> allow more flexibility and better integration with the kernel IO stack,
> since we don't need two IO stacks for guest OS.

I agree that there may be advantages to integrating with in-kernel I/O
mechanisms.  An interesting step would be to implement the
submit_bio() approach that Christoph suggested and seeing if that
improves things further.

Push virtio-blk as far as you can and let's see what the performance is!

>> I have a hacked up world here that basically implements vhost-blk in
>> userspace:
>>
>> http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c
>>
>>  * A dedicated virtqueue thread sleeps on ioeventfd
>>  * Guest memory is pre-mapped and accessed directly (not using QEMU's
>> usually memory access functions)
>>  * Linux AIO is used, the QEMU block layer is bypassed
>>  * Completion interrupts are injected from the virtqueue thread using
>> ioctl
>>
>> I will try to rebase onto qemu-kvm.git/master (this work is several
>> months old).  Then we can compare to see how much of the benefit can
>> be gotten in userspace.
>>
> I don't really get you about vhost-blk in user space since vhost
> infrastructure itself means an in-kernel accelerator that implemented in
> kernel . I guess what you meant is somewhat a re-write of virtio-blk in user
> space with a dedicated thread handling requests, a

Re: Nested VMX - L1 hangs on running L2

2011-07-29 Thread Roedel, Joerg
On Fri, Jul 29, 2011 at 05:01:16AM -0400, Zachary Amsden wrote:
> So you are right, this is still wrong for the case in which L1 does
> not trap TSC MSR reads.  Note however, the RDTSC instruction is still
> virtualized properly, it is only the relatively rare actual TSC MSR
> read via RDMSR which is mis-virtualized (this bug exists today in the
> SVM implementation if I am reading it correctly - cc'd Joerg to notify
> him of that).  That, combined with the relative importance of
> supporting a guest which does not trap on these MSR reads suggest this
> is a low priority design issue however (RDTSC still works even if the
> MSR is trapped, correct?)

Actually, the documentation is not entirely clear about this. But I tend
to agree that direct _reads_ of MSR 0x10 in guest mode should return the
TSC with tsc_offset applied.
On the other hand, there is even SVM hardware which gets this wrong:
for some K8s there is an erratum that the tsc_offset is not applied when
the MSR is read directly in guest mode.
But yes, to be architecturally correct the MSR read should always return
the TSC of the currently running guest level. In reality this shouldn't
be an issue, though, and rdtsc[p] still works correctly.

Regards,

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-07-29 Thread Christoph Hellwig
On Fri, Jul 29, 2011 at 03:59:53PM +0800, Liu Yuan wrote:
> I noted that bdrv_aio_multiwrite() does the merging job, but I am not sure

Just like I/O schedulers, it's actually fairly harmful on high-IOPS,
low-latency devices.  I've just started doing a lot of qemu benchmarks,
and disabling that multiwrite mess alone gives fairly nice speedups.

The major issue seems to be additional memory allocations and cache
lines - a problem that is actually fairly inherent all over the qemu
code.



[Bug 39412] Win Vista and Win2K8 guests' network breaks down

2011-07-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=39412


Jay Ren  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution||CODE_FIX




--- Comment #3 from Jay Ren   2011-07-29 11:16:55 ---
This bug got fixed and I verified it. It doesn't exist in latest kvm.git tree
e72ef590a3ef3047f6ed5bcb8808a9734f6c4b32.



[Bug 39412] Win Vista and Win2K8 guests' network breaks down

2011-07-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=39412


Jay Ren  changed:

   What|Removed |Added

 Status|RESOLVED|VERIFIED






Re: Strange MySQL behaviour

2011-07-29 Thread Boris Dolgov
Hello!

On Thu, Jul 28, 2011 at 11:34, Avi Kivity  wrote:
> Looks like you are blocked on disk.  What does iostat say about disk
> utilization (in both guest and host)?
I also thought so, but the host CPU stats don't show any disk blocking.
iostat 5 with cache=none:
Guest:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0,100,000,25   36,480,00   63,17

Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
vda 177,0076,80  1308,80384   6544
dm-0167,0076,80  1308,80384   6544
dm-1  0,00 0,00 0,00  0  0

Host:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   5,620,001,470,500,00   92,40

Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda 185,40 0,00  1212,80  0   6064
sdb 188,6025,60  1212,80128   6064
md1 198,4025,60  1212,80128   6064
[skip]
dm-12   195,0025,60  1177,60128   5888

time mysql ...:
real7m13.876s
user0m0.338s
sys 0m0.182s


> Try s/cache=none/cache=unsafe/ as an experiment.  Does it help?
With cache=unsafe:
Guest:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   3,150,008,60   11,000,00   77,24

Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
vda3827,60   782,40 20587,20   3912 102936
dm-0   2638,60   779,20 20576,00   3896 102880
dm-1  0,00 0,00 0,00  0  0

Host:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
  10,410,007,730,000,00   81,86

Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda   5,40   307,2040,00   1536200
sdb   5,40   460,8040,00   2304200
md1  99,00   768,0027,20   3840136
dm-1296,00   768,00 0,00   3840  0
4 times followed by
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
  10,480,008,180,030,00   81,31

Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda  29,00   256,00 16236,80   1280  81184
sdb  27,40 0,00 16236,80  0  81184
md12057,80   256,00 16224,00   1280  81120
dm-12  2050,80   256,00 16150,40   1280  80752
2 times

time:
real0m19.133s
user0m0.429s
sys 0m0.271s

> Try s/cache=none/cache=none,aio=native/.  Does it help?  This one is safe,
> you can keep it if it works.
With aio=native:
Guest:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0,200,000,50   37,080,00   62,22

Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
vda 192,0078,40  1038,40392   5192
dm-0133,6078,40  1040,00392   5200
dm-1  0,00 0,00 0,00  0  0

Host:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   2,580,005,440,200,00   91,77

Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda 176,8022,40  1096,00112   5480
sdb 174,40 0,00  1096,00  0   5480
md1 181,6022,40  1096,00112   5480
dm-12   175,2022,40  1038,40112   5192

Time:
real7m7.770s
user0m0.352s
sys 0m0.217s

If the same mysql command is executed on host directly:
Host:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   3,960,007,548,900,00   79,60

Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda1273,40   164,80 10252,80824  51264
sdb1304,60   352,00 10252,80   1760  51264
md11345,00   516,80 10252,80   2584  51264
dm-0   1343,60   516,80 10232,00   2584  51160
Time:
real0m10.161s
user0m0.294s
sys 0m0.284s


This is purely a test server, so I can try some new versions or patches.

-- 
Boris Dolgov.


Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk

2011-07-29 Thread Liu Yuan

On 07/28/2011 10:47 PM, Christoph Hellwig wrote:

On Thu, Jul 28, 2011 at 10:29:05PM +0800, Liu Yuan wrote:

From: Liu Yuan

Vhost-blk driver is an in-kernel accelerator, intercepting the
IO requests from KVM virtio-capable guests. It is based on the
vhost infrastructure.

This is supposed to be a module for the latest kernel tree, but it
needs some symbols from fs/aio.c and fs/eventfd.c to compile with.
So currently, after applying the patch, you need to *recompile*
the kernel.

Usage:
$kernel-src: make M=drivers/vhost
$kernel-src: sudo insmod drivers/vhost/vhost_blk.ko

After insmod, you'll see /dev/vhost-blk created. done!

You'll need to send the changes for existing code separately.


Thanks for reminding.


If you're going mostly for raw block-device access, just calling
submit_bio() will shave even more overhead off, and simplify the
code a lot.

Yes, sounds cool, I'll give it a try.
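
For reference, a minimal sketch of the submit_bio() path for a raw block
device (assuming the 2011-era bio API; my_end_io() and the request
bookkeeping around it are placeholders, not code from this patch):

    #include <linux/bio.h>
    #include <linux/blkdev.h>

    /* completion callback: finish the virtio request, then drop the bio */
    static void my_end_io(struct bio *bio, int error)
    {
        /* ... mark the guest request complete and notify the guest ... */
        bio_put(bio);
    }

    /* hypothetical: push one page of a guest request straight to the block layer */
    static int submit_one_page(struct block_device *bdev, sector_t sector,
                               struct page *page, unsigned int len, int rw)
    {
        struct bio *bio = bio_alloc(GFP_NOIO, 1);

        if (!bio)
            return -ENOMEM;
        bio->bi_bdev = bdev;
        bio->bi_sector = sector;
        bio->bi_end_io = my_end_io;
        bio_add_page(bio, page, len, 0);
        submit_bio(rw, bio);            /* rw is READ or WRITE */
        return 0;
    }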

Yuan


Re: [PATCH] KVM: MMU: Do not unconditionally read PDPTE from guest memory

2011-07-29 Thread Roedel, Joerg
On Thu, Jul 28, 2011 at 04:36:17AM -0400, Avi Kivity wrote:
> Architecturally, PDPTEs are cached in the PDPTRs when CR3 is reloaded.
> On SVM, it is not possible to implement this, but on VMX this is possible
> and was indeed implemented until nested SVM changed this to unconditionally
> read PDPTEs dynamically.  This has a noticeable impact when running PAE guests.
> 
> Fix by changing the MMU to read PDPTRs from the cache, falling back to
> reading from memory for the nested MMU.
> 
> Signed-off-by: Avi Kivity 

Hmm, interesting. Sorry for breaking it. I tested the patch on nested
svm, it works fine.

Tested-by: Joerg Roedel 

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: Windows7 crashes inside the VM when starting a certain program

2011-07-29 Thread Gleb Natapov
On Fri, Jul 29, 2011 at 09:20:35AM +0200, André Weidemann wrote:
> On 27.07.2011 10:56, Gleb Natapov wrote:
> >On Tue, Jul 26, 2011 at 12:57:44PM +0200, André Weidemann wrote:
> >>Hi,
> >>
> >>On 26.07.2011 12:08, Gleb Natapov wrote:
> >>>On Tue, Jul 26, 2011 at 07:29:04AM +0200, André Weidemann wrote:
> On 07.07.2011 07:26, André Weidemann wrote:
> >Hi,
> >I am running Windows7 x64 in a VM which crashes after starting a certain
> >game. Actually there are two games both from the same company, that make
> >the VM crash after starting them.
> >Windows crashes right after starting the game. With the 1st game the
> >screen goes black as usual and the cursor keeps spinning for 3-5 seconds
> >until Windows crashes. With the second game I get to the 3D login
> >screen. The game then crashes after logging in.
> >Windows displays this error message on the first crash:
> >http://pastebin.com/kMzk9Jif
> >Windows then finishes writing the crash dump and restarts.
> >I can reproduce Windows crashing every time I start the game while the
> >VM keeps running without any problems.
> >When Windows reboots after the first crash and the game is started
> >again, the message on the following blue screen changes slightly and
> >stays the same(except for the addresses) for every following crash:
> >http://pastebin.com/jVtBc4ZH
> >
> >I first thought that this might be related to a certain feature in 3D
> >acceleration being used, but Futuremark 3DMark Vantage or 3DMark 11 run
> >without any problems. They run a bit choppy on some occasions, but do
> >that without crashing Windows7 or the VM.
> >
> >How can I proceed to investigate what is going wrong?
> 
> I did some testing and found out that Windows7 does not crash
> anymore when changing "-cpu host" to "-cpu Nehalem". After doing so,
> >>>What is your host cpu (cat /proc/cpuinfo)?
> >>
> >>The server is currently running on 2 out of 8 cores with kernel boot
> >>parameter "maxcpus=2".
> >>
> >>flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> >>pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
> >>pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
> >>xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est
> >>tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow
> >>vnmi flexpriority ept vpid
> >Flags that are present on -cpu host but not -cpu Nehalem (excluding vmx
> >related flags):
> >
> >vme dts acpi ss ht tm pbe rdtscp constant_tsc arch_perfmon pebs bts rep_good
> >xtopology nonstop_tsc aperfmperf dtes64 monitor ds_cpl est tm2 xtpr pdcm  ida
> >
> >Some of them may be synthetic and some of them may be filtered by KVM.
> >
> >Can you try to run "-cpu host,-vme,-dts..." (specifying all of those
> >flags with -). Drop those that qemu does not recognize. See if result
> >will be the same as with -cpu Nehalem. If yes, then try to find out which
> >flag makes the difference.
> 
> I started the VM with all flags that differ between the two CPUs.
> After removing the ones qemu-kvm did not recognize, I started the VM
> again with the following line:
> -cpu 
> host,-vme,-acpi,-ss,-ht,-tm,-pbe,-rdtscp,-dtes64,-monitor,-ds_cpl,-est,-tm2,-xtpr,-pdcm
> \
> 
> Running the program under Windows7 inside the VM caused Windows to
> crash again with a BSoD.
> The disassembly of the address f8000288320c shows the following:
> http://pastebin.com/7yzTYJSG
> 
Looks like it tries to read the MSR_LASTBRANCH_TOS MSR, which kvm does not
support. Do you see anything interesting in dmesg? I wonder how
availability of the MSR should be checked.

--
Gleb.


Biweekly KVM Test report, kernel e72ef590... qemu fda19064...

2011-07-29 Thread Ren, Yongjie
Hi All,
This is KVM test result against kvm.git 
e72ef590a3ef3047f6ed5bcb8808a9734f6c4b32 based on kernel 3.0.0+, and 
qemu-kvm.git fda19064e889d4419dd3dc69ca8e6e7a1535fdf5.

We found no new bugs during the past two weeks.
We found that 2 bugs got fixed. One is the Win2k8 and Vista guests' network
issue, and the other is a qemu-kvm build issue.
An old bug about VT-d also still exists:
https://bugs.launchpad.net/qemu/+bug/799036

New issue:

Fixed issues:
1. Win Vista and Win2K8 guests' network breaks down
   https://bugzilla.kernel.org/show_bug.cgi?id=39412
2. qemu-kvm.git make error when ‘CC ui/vnc-enc-tight.o’
   https://bugs.launchpad.net/qemu/+bug/802588

Old Issues:
1. ltp diotest running time is 2.54 times longer than before
 
https://sourceforge.net/tracker/?func=detail&aid=2723366&group_id=180599&atid=893831
2. perfctr wrmsr warning when booting 64bit RHEl5.3
 
https://sourceforge.net/tracker/?func=detail&aid=2721640&group_id=180599&atid=893831
 
3. [vt-d] NIC assignment order in command line make some NIC can't work
 https://bugs.launchpad.net/qemu/+bug/799036

Test environment:
==
  Platform      Westmere-EP     SandyBridge-EP
  CPU Cores     24              32
  Memory size   10G             32G

Report summary of IA32E on Westmere-EP:
Summary Test Report of Last Session
=
Total   Pass    Fail    NoResult    Crash
=
control_panel_ept_vpid  12  12  0 00
control_panel_ept   4   4   0 00
control_panel_vpid  3   3   0 00
control_panel   3   3   0 00
gtest_vpid  1   1   0 00
gtest_ept   1   1   0 00
gtest   3   2   0 00
vtd_ept_vpid3   1   1 00
gtest_ept_vpid  12  11  1 00
sriov_ept_vpid  6   6   0 00
=
control_panel_ept_vpid  12  12  0 00
 :KVM_LM_Continuity_64_g3   1   1   0 00
 :KVM_four_dguest_64_g32e   1   1   0 00
 :KVM_1500M_guest_64_gPAE   1   1   0 00
 :KVM_SR_SMP_64_g32e1   1   0 00
 :KVM_LM_SMP_64_g32e1   1   0 00
 :KVM_linux_win_64_g32e 1   1   0 00
 :KVM_two_winxp_64_g32e 1   1   0 00
 :KVM_1500M_guest_64_g32e   1   1   0 00
 :KVM_256M_guest_64_gPAE1   1   0 00
 :KVM_SR_Continuity_64_g3   1   1   0 00
 :KVM_256M_guest_64_g32e1   1   0 00
 :KVM_four_sguest_64_g32e   1   1   0 00
control_panel_ept   4   4   0 00
 :KVM_linux_win_64_g32e 1   1   0 00
 :KVM_1500M_guest_64_g32e   1   1   0 00
 :KVM_1500M_guest_64_gPAE   1   1   0 00
 :KVM_LM_SMP_64_g32e1   1   0 00
control_panel_vpid  3   3   0 00
 :KVM_linux_win_64_g32e 1   1   0 00
 :KVM_1500M_guest_64_g32e   1   1   0 00
 :KVM_1500M_guest_64_gPAE   1   1   0 00
control_panel   3   3   0 00
 :KVM_1500M_guest_64_g32e   1   1   0 00
 :KVM_1500M_guest_64_gPAE   1   1   0 00
 :KVM_LM_SMP_64_g32e1   1   0 00
gtest_vpid  1   1   0 00
 :boot_smp_win7_ent_64_g3   1   1   0 00
gtest_ept   1   1   0 00
 :boot_smp_win7_ent_64_g3   1   1   0 00
gtest   3   3   0 00
 :boot_smp_win2008_64_g32   1   1   0 00
 :boot_smp_win7_ent_64_gP   1   1   0 00
 :boot_smp_vista_64_g32e1   1   0 00
vtd_ept_vpid3   2   1 00
 :one_pcie_smp_xp_64_g32e   1   1   0 00
 :one_pcie_smp_64_g32e  1   1   0 00
 :two_dev_smp_64_g32e   1   0   1 00
gtest_ept_vpid  12  11  1 00
 :boot_up_acpi_64_g32e  1   1   0 00
 :boot_base_kernel_64_g32   1  

Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-07-29 Thread Liu Yuan

On 07/29/2011 05:06 PM, Stefan Hajnoczi wrote:

I mean did you investigate *why* userspace virtio-blk has higher
latency?  Did you profile it and drill down on its performance?

It's important to understand what is going on before replacing it with
another mechanism.  What I'm saying is, if I have a buggy program I
can sometimes rewrite it from scratch correctly but that doesn't tell
me what the bug was.

Perhaps the inefficiencies in userspace virtio-blk can be solved by
adjusting the code (removing inefficient notification mechanisms,
introducing a dedicated thread outside of the QEMU iothread model,
etc).  Then we'd get the performance benefit for non-raw images and
perhaps non-virtio and non-Linux host platforms too.



As Christoph mentioned, the unnecessary memory allocations and the many
cache-line-unfriendly function pointers might be the culprit. For example,
the read request code path for Linux AIO would be:

qemu_iohandler_poll -> virtio_pci_host_notifier_read -> virtio_queue_notify_vq
-> virtio_blk_handle_output -> virtio_blk_handle_read -> bdrv_aio_read
-> raw_aio_readv -> bdrv_aio_readv (yes, nested call!) -> raw_aio_readv
-> laio_submit -> io_submit ...


Looking at this long list, most of these are function pointers that cannot
be inlined, and the internal data structures used by these functions number
in the dozens. Code complexity aside, this long code path really needs a
retrofit. As Christoph simply put it, this kind of mess is inherent all over
the qemu code. So I am afraid the 'retrofit' would end up being a rewrite of
the entire (sub)system. I have to admit that I am inclined toward MST's
vhost approach, that is, writing a new subsystem rather than doing tedious
profiling and fixing that could well go as far as an actual rewrite anyway.



Actually, the motivation to start vhost-blk is that, in our observation,
KVM(virtio enabled) in RHEL 6 is worse than Xen(PV) in RHEL in disk IO
perspective, especially for sequential read/write (around 20% gap).

We'll deploy a large number of KVM-based systems as the infrastructure of
some service and this gap is really unpleasant.

By the design, IMHO, virtio performance is supposed to be comparable to the
para-virtualization solution if not better, because for KVM, guest and
backend driver could sit in the same address space via mmaping. This would
reduce the overhead involved in page table modification, thus speed up the
buffer management and transfer a lot compared with Xen PV.

Yes, guest memory is just a region of QEMU userspace memory.  So it's
easy to reach inside and there are no page table tricks or copying
involved.


I am not in a qualified  position to talk about QEMU , but I think the
surprised performance improvement by this very primitive vhost-blk simply
manifest that, the internal structure for qemu io is the way bloated. I say
it *surprised* because basically vhost just reduces the number of system
calls, which is heavily tuned by chip manufacture for years. So, I guess the
performance number vhost-blk gains mainly could possibly be contributed to
*shorter and simpler* code path.

First we need to understand exactly what the latency overhead is.  If
we discover that it's simply not possible to do this equally well in
userspace, then it makes perfect sense to use vhost-blk.

So let's gather evidence and learn what the overheads really are.
Last year I spent time looking at virtio-blk latency:
http://www.linux-kvm.org/page/Virtio/Block/Latency



Nice stuff.


See especially this diagram:
http://www.linux-kvm.org/page/Image:Threads.png

The goal wasn't specifically to reduce synchronous sequential I/O,
instead the aim was to reduce overheads for a variety of scenarios,
especially multithreaded workloads.

In most cases it was helpful to move I/O submission out of the vcpu
thread by using the ioeventfd model just like vhost.  Ioeventfd for
userspace virtio-blk is now on by default in qemu-kvm.

Try running the userspace virtio-blk benchmark with -drive
if=none,id=drive0,file=... -device
virtio-blk-pci,drive=drive0,ioeventfd=off.  This causes QEMU to do I/O
submission in the vcpu thread, which might reduce latency at the cost
of stealing guest time.


Anyway, IMHO, compared with user space approach, the in-kernel one would
allow more flexibility and better integration with the kernel IO stack,
since we don't need two IO stacks for guest OS.

I agree that there may be advantages to integrating with in-kernel I/O
mechanisms.  An interesting step would be to implement the
submit_bio() approach that Christoph suggested and seeing if that
improves things further.

Push virtio-blk as far as you can and let's see what the performance is!


I have a hacked up world here that basically implements vhost-blk in
userspace:

http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c

  * A dedicated virtqueue thread sleeps on ioeventfd
  * Guest memory is pre-mapped and accessed directly (not using QEMU's
usually memory access function

Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-07-29 Thread Stefan Hajnoczi
On Fri, Jul 29, 2011 at 1:01 PM, Liu Yuan  wrote:
> On 07/29/2011 05:06 PM, Stefan Hajnoczi wrote:
>>
>> I mean did you investigate *why* userspace virtio-blk has higher
>> latency?  Did you profile it and drill down on its performance?
>>
>> It's important to understand what is going on before replacing it with
>> another mechanism.  What I'm saying is, if I have a buggy program I
>> can sometimes rewrite it from scratch correctly but that doesn't tell
>> me what the bug was.
>>
>> Perhaps the inefficiencies in userspace virtio-blk can be solved by
>> adjusting the code (removing inefficient notification mechanisms,
>> introducing a dedicated thread outside of the QEMU iothread model,
>> etc).  Then we'd get the performance benefit for non-raw images and
>> perhaps non-virtio and non-Linux host platforms too.
>>
>
> As Christoph mentioned, the unnecessary memory allocation and too much cache
> line unfriendly
> function pointers might be culprit. For example, the read quests code path
> for linux aio would be
>
>
>  qemu_iohandler_poll->virtio_pci_host_notifier_read->virtio_queue_notify_vq->virtio_blk_handle_output
> ->virtio_blk_handle_read->bdrv_aio_read->raw_aio_readv->bdrv_aio_readv(Yes
> again nested called!)->raw_aio_readv->laio_submit->io_submit...
>
> Looking at this long list,most are function pointers that can not be
> inlined, and the internal data structures used by these functions are
> dozons. Leave aside code complexity, this long code path would really need
> retrofit. As Christoph simply put, this kind of mess is inherent all over
> the qemu code. So I am afraid, the 'retrofit'  would end up to be a re-write
> the entire (sub)system. I have to admit that, I am inclined to the MST's
> vhost approach, that write a new subsystem other than tedious profiling and
> fixing, that would possibly goes as far as actually re-writing it.

I'm totally for vhost-blk if there are unique benefits that make it
worth maintaining.  But better benchmark results are not a cause, they
are an effect.  So the thing to do is to drill down on both vhost-blk
and userspace virtio-blk to understand what causes overheads.
Evidence showing that userspace can never compete is needed to justify
vhost-blk IMO.

>>> Actually, the motivation to start vhost-blk is that, in our observation,
>>> KVM(virtio enabled) in RHEL 6 is worse than Xen(PV) in RHEL in disk IO
>>> perspective, especially for sequential read/write (around 20% gap).
>>>
>>> We'll deploy a large number of KVM-based systems as the infrastructure of
>>> some service and this gap is really unpleasant.
>>>
>>> By the design, IMHO, virtio performance is supposed to be comparable to
>>> the
>>> para-virtualization solution if not better, because for KVM, guest and
>>> backend driver could sit in the same address space via mmaping. This
>>> would
>>> reduce the overhead involved in page table modification, thus speed up
>>> the
>>> buffer management and transfer a lot compared with Xen PV.
>>
>> Yes, guest memory is just a region of QEMU userspace memory.  So it's
>> easy to reach inside and there are no page table tricks or copying
>> involved.
>>
>>> I am not in a qualified  position to talk about QEMU , but I think the
>>> surprised performance improvement by this very primitive vhost-blk simply
>>> manifest that, the internal structure for qemu io is the way bloated. I
>>> say
>>> it *surprised* because basically vhost just reduces the number of system
>>> calls, which is heavily tuned by chip manufacture for years. So, I guess
>>> the
>>> performance number vhost-blk gains mainly could possibly be contributed
>>> to
>>> *shorter and simpler* code path.
>>
>> First we need to understand exactly what the latency overhead is.  If
>> we discover that it's simply not possible to do this equally well in
>> userspace, then it makes perfect sense to use vhost-blk.
>>
>> So let's gather evidence and learn what the overheads really are.
>> Last year I spent time looking at virtio-blk latency:
>> http://www.linux-kvm.org/page/Virtio/Block/Latency
>>
>
> Nice stuff.
>
>> See especially this diagram:
>> http://www.linux-kvm.org/page/Image:Threads.png
>>
>> The goal wasn't specifically to reduce synchronous sequential I/O,
>> instead the aim was to reduce overheads for a variety of scenarios,
>> especially multithreaded workloads.
>>
>> In most cases it was helpful to move I/O submission out of the vcpu
>> thread by using the ioeventfd model just like vhost.  Ioeventfd for
>> userspace virtio-blk is now on by default in qemu-kvm.
>>
>> Try running the userspace virtio-blk benchmark with -drive
>> if=none,id=drive0,file=... -device
>> virtio-blk-pci,drive=drive0,ioeventfd=off.  This causes QEMU to do I/O
>> submission in the vcpu thread, which might reduce latency at the cost
>> of stealing guest time.
>>
>>> Anyway, IMHO, compared with user space approach, the in-kernel one would
>>> allow more flexibility and better integration with the kernel IO stack,
>>> since we don't n

Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-07-29 Thread Stefan Hajnoczi
I hit a weirdness yesterday, just want to mention it in case you notice it too.

When running vanilla qemu-kvm I forgot to use aio=native.  When I
compared the results against virtio-blk-data-plane (which *always*
uses Linux AIO) I was surprised to find average 4k read latency was
lower and the standard deviation was also lower.

So from now on I will run tests both with and without aio=native.
aio=native should be faster and if I can reproduce the reverse I'll
try to figure out why.

Stefan


Re: [Qemu-devel] [PATCH v2 00/23] Memory API, batch 1

2011-07-29 Thread Anthony Liguori

On 07/26/2011 06:25 AM, Avi Kivity wrote:

This patchset contains the core of the memory API, with one device
(usb-ohci) coverted for reference.  The API is currently implemented on
top of the old ram_addr_t/cpu_register_physical_memory() API, but the plan
is to make it standalone later.

The goals of the API are:
  - correctness: by modelling the memory hierarchy, things like the 440FX PAM
registers and movable, overlapping PCI BARs can be modelled accurately.
  - efficiency: by maintaining an object tree describing guest memory, we
can eventually get rid of the page descriptor array
  - security: by having more information available declaratively, we reduce
coding errors that may be exploited by malicious guests


Applied all.  Thanks.

Regards,

Anthony Liguori



Also available from

   git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
  refs/tags/memory-region-batch-1-v2

Changes from v1:
  - switched to gtk-doc
  - more copyright blurbs
  - simplified flatview_simplify()
  - use assert() instead of abort() for invariant checks
(but keep abort() for runtime errors)
  - commit log fixups

Avi Kivity (23):
   Add memory API documentation
   Hierarchical memory region API
   memory: implement dirty tracking
   memory: merge adjacent segments of a single memory region
   Internal interfaces for memory API
   memory: abstract address space operations
   memory: rename MemoryRegion::has_ram_addr to ::terminates
   memory: late initialization of ram_addr
   memory: I/O address space support
   memory: add backward compatibility for old portio registration
   memory: add backward compatibility for old mmio registration
   memory: add ioeventfd support
   memory: separate building the final memory map into two steps
   memory: transaction API
   exec.c: initialize memory map
   ioport: register ranges by byte aligned addresses always
   pc: grab system_memory
   pc: convert pc_memory_init() to memory API
   pc: move global memory map out of pc_init1() and into its callers
   pci: pass address space to pci bus when created
   pci: add MemoryRegion based BAR management API
   sysbus: add MemoryRegion based memory management API
   usb-ohci: convert to MemoryRegion

  Makefile.target|1 +
  docs/memory.txt|  172 
  exec-memory.h  |   39 ++
  exec.c |   19 +
  hw/apb_pci.c   |2 +
  hw/bonito.c|4 +-
  hw/grackle_pci.c   |5 +-
  hw/gt64xxx.c   |4 +-
  hw/pc.c|   62 ++-
  hw/pc.h|9 +-
  hw/pc_piix.c   |   20 +-
  hw/pci.c   |   63 +++-
  hw/pci.h   |   15 +-
  hw/pci_host.h  |1 +
  hw/pci_internals.h |1 +
  hw/piix_pci.c  |   13 +-
  hw/ppc4xx_pci.c|5 +-
  hw/ppc_mac.h   |9 +-
  hw/ppc_newworld.c  |5 +-
  hw/ppc_oldworld.c  |3 +-
  hw/ppc_prep.c  |3 +-
  hw/ppce500_pci.c   |6 +-
  hw/prep_pci.c  |5 +-
  hw/prep_pci.h  |3 +-
  hw/sh_pci.c|4 +-
  hw/sysbus.c|   27 ++-
  hw/sysbus.h|3 +
  hw/unin_pci.c  |   10 +-
  hw/usb-ohci.c  |   42 +--
  hw/versatile_pci.c |2 +
  ioport.c   |4 +-
  memory.c   | 1141 
  memory.h   |  469 +
  33 files changed, 2072 insertions(+), 99 deletions(-)
  create mode 100644 docs/memory.txt
  create mode 100644 exec-memory.h
  create mode 100644 memory.c
  create mode 100644 memory.h





Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-07-29 Thread Liu Yuan

On 07/29/2011 08:50 PM, Stefan Hajnoczi wrote:

I hit a weirdness yesterday, just want to mention it in case you notice it too.

When running vanilla qemu-kvm I forgot to use aio=native.  When I
compared the results against virtio-blk-data-plane (which *always*
uses Linux AIO) I was surprised to find average 4k read latency was
lower and the standard deviation was also lower.

So from now on I will run tests both with and without aio=native.
aio=native should be faster and if I can reproduce the reverse I'll
try to figure out why.

Stefan
On my laptop I don't see this weirdness: the emulated POSIX AIO is much
worse than Linux AIO, as expected. If the iodepth goes deeper, the gap
gets wider.


If aio=native is not set, qemu uses the emulated POSIX AIO interface to do
the IO. I peeked at posix-aio-compat.c; it uses a thread pool and
synchronous preadv/pwritev to emulate the AIO behaviour. The synchronous IO
interface would cause even poorer performance for random rw, since the
io-scheduler would possibly never get a chance to merge the request
stream. (blk_finish_plug->queue_unplugged->__blk_run_queue)
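
For reference, the emulation pattern being described boils down to roughly
the following (a hypothetical, stripped-down sketch, not the actual
posix-aio-compat.c code; dequeue_request() and signal_completion() stand in
for the real queue and completion-notification machinery):

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>

    struct aio_req {
        int fd;
        struct iovec *iov;
        int iovcnt;
        off_t offset;
        int is_write;
        ssize_t ret;
    };

    extern struct aio_req *dequeue_request(void);     /* blocks on the queue */
    extern void signal_completion(struct aio_req *r); /* e.g. write to a pipe */

    /* One worker of the thread pool: emulate AIO with plain synchronous
     * preadv()/pwritev() calls, then report completion back to the iothread. */
    static void *aio_worker(void *opaque)
    {
        struct aio_req *req;

        while ((req = dequeue_request()) != NULL) {
            req->ret = req->is_write ?
                pwritev(req->fd, req->iov, req->iovcnt, req->offset) :
                preadv(req->fd, req->iov, req->iovcnt, req->offset);
            signal_completion(req);
        }
        return NULL;
    }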


Yuan


Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-07-29 Thread Liu Yuan

On 07/29/2011 10:45 PM, Liu Yuan wrote:

On 07/29/2011 08:50 PM, Stefan Hajnoczi wrote:
I hit a weirdness yesterday, just want to mention it in case you 
notice it too.


When running vanilla qemu-kvm I forgot to use aio=native.  When I
compared the results against virtio-blk-data-plane (which *always*
uses Linux AIO) I was surprised to find average 4k read latency was
lower and the standard deviation was also lower.

So from now on I will run tests both with and without aio=native.
aio=native should be faster and if I can reproduce the reverse I'll
try to figure out why.

Stefan
On my laptop I don't see this weirdness: the emulated POSIX AIO is much
worse than Linux AIO, as expected. If the iodepth goes deeper, the gap
gets wider.


If aio=native is not set, qemu uses the emulated POSIX AIO interface to
do the IO. I peeked at posix-aio-compat.c; it uses a thread pool and
synchronous preadv/pwritev to emulate the AIO behaviour. The synchronous
IO interface would cause even poorer performance for random rw, since
the io-scheduler would possibly never get a chance to merge the request
stream. (blk_finish_plug->queue_unplugged->__blk_run_queue)


Yuan

Typo: not merge, I mean *sort* the reqs.


Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk

2011-07-29 Thread Liu Yuan

On 07/28/2011 11:22 PM, Michael S. Tsirkin wrote:

On Thu, Jul 28, 2011 at 10:29:05PM +0800, Liu Yuan wrote:

From: Liu Yuan

Vhost-blk driver is an in-kernel accelerator, intercepting the
IO requests from KVM virtio-capable guests. It is based on the
vhost infrastructure.

This is supposed to be a module for the latest kernel tree, but it
needs some symbols from fs/aio.c and fs/eventfd.c to compile with.
So currently, after applying the patch, you need to *recompile*
the kernel.

Usage:
$kernel-src: make M=drivers/vhost
$kernel-src: sudo insmod drivers/vhost/vhost_blk.ko

After insmod, you'll see /dev/vhost-blk created. done!

Signed-off-by: Liu Yuan

Thanks, this is an interesting patch.

There are some coding style issues in this patch, could you please
change the code to match the kernel coding style?

In particular, please prefix functions, macros, etc. with vhost_blk to avoid
confusion.

scripts/checkpatch.pl can find some, but not all, issues.


---
  drivers/vhost/Makefile |3 +
  drivers/vhost/blk.c|  568 
  drivers/vhost/vhost.h  |   11 +
  fs/aio.c   |   44 ++---
  fs/eventfd.c   |1 +
  include/linux/aio.h|   31 +++

As others said, core changes need to be split out
and get acks from relevant people.

Use scripts/get_maintainer.pl to get a list.
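
For example, from the top of the kernel tree (the exact output depends on
your tree):

$kernel-src: scripts/get_maintainer.pl -f fs/aio.c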



  6 files changed, 631 insertions(+), 27 deletions(-)
  create mode 100644 drivers/vhost/blk.c

diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..31f8b2e 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,5 @@
  obj-$(CONFIG_VHOST_NET) += vhost_net.o
+obj-m += vhost_blk.o
+
  vhost_net-y := vhost.o net.o
+vhost_blk-y := vhost.o blk.o
diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
new file mode 100644
index 000..f3462be
--- /dev/null
+++ b/drivers/vhost/blk.c
@@ -0,0 +1,568 @@
+/* Copyright (C) 2011 Taobao, Inc.
+ * Author: Liu Yuan
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * Vhost-blk driver is an in-kernel accelerator, intercepting the
+ * IO requests from KVM virtio-capable guests. It is based on the
+ * vhost infrastructure.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "vhost.h"
+
+#define DEBUG 0
+
+#if DEBUG>  0
+#define dprintk printk
+#else
+#define dprintk(x...)   do { ; } while (0)
+#endif

There are standard macros for these.
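
For reference, a minimal sketch of the standard approach: rely on
pr_debug() plus a pr_fmt() prefix (and dynamic debug) rather than a
private DEBUG/dprintk pair. The module below is a made-up example, not
part of the patch:

#define pr_fmt(fmt) "vhost_blk: " fmt

#include <linux/module.h>
#include <linux/kernel.h>

static int __init vhost_blk_dbg_demo_init(void)
{
	/* Compiled away unless DEBUG or CONFIG_DYNAMIC_DEBUG is enabled. */
	pr_debug("loaded, max events = %d\n", 128);
	return 0;
}

static void __exit vhost_blk_dbg_demo_exit(void)
{
	pr_debug("unloaded\n");
}

module_init(vhost_blk_dbg_demo_init);
module_exit(vhost_blk_dbg_demo_exit);
MODULE_LICENSE("GPL");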


+
+enum {
+   virtqueue_max = 1,
+};
+
+#define MAX_EVENTS 128
+
+struct vhost_blk {
+   struct vhost_virtqueue vq;
+   struct vhost_dev dev;
+   int should_stop;
+   struct kioctx *ioctx;
+   struct eventfd_ctx *ectx;
+   struct file *efile;
+   struct task_struct *worker;
+};
+
+struct used_info {
+   void *status;
+   int head;
+   int len;
+};
+
+static struct io_event events[MAX_EVENTS];
+
+static void blk_flush(struct vhost_blk *blk)
+{
+   vhost_poll_flush(&blk->vq.poll);
+}
+
+static long blk_set_features(struct vhost_blk *blk, u64 features)
+{
+   blk->dev.acked_features = features;
+   return 0;
+}
+
+static void blk_stop(struct vhost_blk *blk)
+{
+   struct vhost_virtqueue *vq =&blk->vq;
+   struct file *f;
+
+   mutex_lock(&vq->mutex);
+   f = rcu_dereference_protected(vq->private_data,
+   lockdep_is_held(&vq->mutex));
+   rcu_assign_pointer(vq->private_data, NULL);
+   mutex_unlock(&vq->mutex);
+
+   if (f)
+   fput(f);
+}
+
+static long blk_set_backend(struct vhost_blk *blk, struct vhost_vring_file 
*backend)
+{
+   int idx = backend->index;
+   struct vhost_virtqueue *vq =&blk->vq;
+   struct file *file, *oldfile;
+   int ret;
+
+   mutex_lock(&blk->dev.mutex);
+   ret = vhost_dev_check_owner(&blk->dev);
+   if (ret)
+   goto err_dev;
+   if (idx>= virtqueue_max) {
+   ret = -ENOBUFS;
+   goto err_dev;
+   }
+
+   mutex_lock(&vq->mutex);
+
+   if (!vhost_vq_access_ok(vq)) {
+   ret = -EFAULT;
+   goto err_vq;
+   }

vhost-net used an fd of -1 to remove a backend.
I think it's a good idea to make the operation reversible.
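
As a purely hypothetical sketch of that convention (reusing the structures
from the patch above; the owner check and dev mutex are omitted for
brevity), treating fd == -1 as "detach the current backend" could look
roughly like this:

static long vhost_blk_set_backend_sketch(struct vhost_blk *blk,
					 struct vhost_vring_file *backend)
{
	struct vhost_virtqueue *vq = &blk->vq;
	struct file *file = NULL, *oldfile;

	if (backend->fd != -1) {		/* -1 means "detach" */
		file = fget(backend->fd);
		if (!file)
			return -EBADF;
	}

	mutex_lock(&vq->mutex);
	oldfile = rcu_dereference_protected(vq->private_data,
					    lockdep_is_held(&vq->mutex));
	rcu_assign_pointer(vq->private_data, file);	/* NULL detaches */
	mutex_unlock(&vq->mutex);

	if (oldfile) {
		blk_flush(blk);
		fput(oldfile);
	}
	return 0;
}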


+
+   file = fget(backend->fd);

We need to verify that the file type passed makes sense.
For example, it's possible to create reference loops
by passing the vhost-blk fd.
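
One possible shape for such a check, as a sketch (the helper name is
hypothetical, it assumes the usual fs headers, and a real version might
want to be stricter):

static bool vhost_blk_backend_ok(struct file *file)
{
	struct inode *inode = file->f_path.dentry->d_inode;

	/* Only plain files and block devices make sense as a disk image;
	 * this also rejects handing the vhost-blk fd back to itself. */
	return S_ISREG(inode->i_mode) || S_ISBLK(inode->i_mode);
}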



+   if (IS_ERR(file)) {
+   ret = PTR_ERR(file);
+   goto err_vq;
+   }
+
+   oldfile = rcu_dereference_protected(vq->private_data,
+   lockdep_is_held(&vq->mutex));
+   if (file != oldfile)
+   rcu_assign_pointer(vq->private_data, file);
+
+   mutex_unlock(&vq->mutex);
+
+   if (oldfile) {
+   blk_flush(blk);
+   fput(oldfile);
+   }
+
+   mutex_unlock(&bl

Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-07-29 Thread Sasha Levin
On Fri, 2011-07-29 at 20:01 +0800, Liu Yuan wrote:
> Looking at this long list, most are function pointers that cannot be 
> inlined, and the internal data structures used by these functions number 
> in the dozens. Code complexity aside, this long code path would really 
> need a retrofit. As Christoph simply put it, this kind of mess is inherent 
> all over the qemu code. So I am afraid the 'retrofit' would end up being a 
> rewrite of the entire (sub)system. I have to admit that I am inclined 
> toward MST's vhost approach: write a new subsystem rather than do tedious 
> profiling and fixing, which could possibly go as far as actually 
> rewriting it.

I don't think the fix for problematic userspace is to write more kernel
code.

vhost-net improved throughput and latency by several factors, allowing
to achieve much more than was possible at userspace alone.

With vhost-blk we see an improvement of ~15% - which I assume by your
and Christoph's comments can be mostly attributed to QEMU. Merging a
module which won't improve performance dramatically compared to what is
possible to achieve in userspace (even if it would require a code
rewrite) sounds a bit wrong to me.

-- 

Sasha.



Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-07-29 Thread Badari Pulavarty

Hi Liu Yuan,

I am glad to see that you started looking at vhost-blk. I made an attempt 
a year ago to improve block performance using a vhost-blk approach.

http://lwn.net/Articles/379864/
http://lwn.net/Articles/382543/

I will take a closer look at your patchset to find differences and 
similarities.

- I focused on using vfs interfaces in the kernel, so that I could use it 
for file-backed devices. Our use-case scenario is mostly file-backed 
images.

- In a few cases, virtio-blk did outperform vhost-blk -- which was 
counter-intuitive -- but I couldn't exactly nail down why.

- I had to implement my own threads for parallelism. I see that you are 
using the aio infrastructure to get around that.

- In our high-scale performance testing, what we found is that 
block-backed device performance is pretty close to bare metal (91% of 
bare metal). vhost-blk didn't add any major benefits to it. I am curious 
about your performance analysis & data on where you see the gains and why.

Hence I prioritized my work low :(

Now that you are interested in driving this, I am happy to work with you 
and see what vhost-blk brings to the table (even if it only helps us 
improve virtio-blk).

Thanks,
Badari




[RFC PATCH v2 0/3] separate thread for VM migration

2011-07-29 Thread Umesh Deshpande
Following patch deals with VCPU and iothread starvation during the migration of
a guest. Currently the iothread is responsible for performing the guest
migration. It holds qemu_mutex during the migration, which doesn't allow the
VCPUs to enter qemu mode and delays their return to the guest. The guest
migration, executed as an iohandler, also delays the execution of other
iohandlers.
In the following patch series,

The migration has been moved to a separate thread to
reduce the qemu_mutex contention and iohandler starvation.

Also, the current dirty bitmap is split into per-memslot bitmaps to reduce its size.

Umesh Deshpande (3):
  separate thread for VM migration
  fine grained qemu_mutex locking for migration
  per memslot dirty bitmap

 arch_init.c |   14 ++--
 buffered_file.c |   28 -
 buffered_file.h |4 +++
 cpu-all.h   |   40 ++--
 exec.c  |   38 +-
 migration.c |   60 --
 migration.h |3 ++
 savevm.c|   22 +---
 savevm.h|   29 ++
 xen-all.c   |6 +---
 10 files changed, 173 insertions(+), 71 deletions(-)
 create mode 100644 savevm.h

-- 
1.7.4.1



[RFC PATCH v2 1/3] separate thread for VM migration

2011-07-29 Thread Umesh Deshpande
This patch creates a separate thread for the guest migration on the source side.

Signed-off-by: Umesh Deshpande 
---
 buffered_file.c |   28 -
 buffered_file.h |4 +++
 migration.c |   59 +++---
 migration.h |3 ++
 savevm.c|   22 +---
 savevm.h|   29 +++
 6 files changed, 102 insertions(+), 43 deletions(-)
 create mode 100644 savevm.h

diff --git a/buffered_file.c b/buffered_file.c
index 41b42c3..d4146bf 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -16,12 +16,16 @@
 #include "qemu-timer.h"
 #include "qemu-char.h"
 #include "buffered_file.h"
+#include "migration.h"
+#include "savevm.h"
+#include "qemu-thread.h"
 
 //#define DEBUG_BUFFERED_FILE
 
 typedef struct QEMUFileBuffered
 {
 BufferedPutFunc *put_buffer;
+BufferedBeginFunc *begin;
 BufferedPutReadyFunc *put_ready;
 BufferedWaitForUnfreezeFunc *wait_for_unfreeze;
 BufferedCloseFunc *close;
@@ -35,6 +39,7 @@ typedef struct QEMUFileBuffered
 size_t buffer_size;
 size_t buffer_capacity;
 QEMUTimer *timer;
+QemuThread thread;
 } QEMUFileBuffered;
 
 #ifdef DEBUG_BUFFERED_FILE
@@ -181,8 +186,6 @@ static int buffered_close(void *opaque)
 
 ret = s->close(s->opaque);
 
-qemu_del_timer(s->timer);
-qemu_free_timer(s->timer);
 qemu_free(s->buffer);
 qemu_free(s);
 
@@ -228,17 +231,15 @@ static int64_t buffered_get_rate_limit(void *opaque)
 return s->xfer_limit;
 }
 
-static void buffered_rate_tick(void *opaque)
+void buffered_rate_tick(QEMUFile *file)
 {
-QEMUFileBuffered *s = opaque;
+QEMUFileBuffered *s = file->opaque;
 
 if (s->has_error) {
 buffered_close(s);
 return;
 }
 
-qemu_mod_timer(s->timer, qemu_get_clock_ms(rt_clock) + 100);
-
 if (s->freeze_output)
 return;
 
@@ -250,9 +251,17 @@ static void buffered_rate_tick(void *opaque)
 s->put_ready(s->opaque);
 }
 
+static void *migrate_vm(void *opaque)
+{
+QEMUFileBuffered *s = opaque;
+s->begin(s->opaque);
+return NULL;
+}
+
 QEMUFile *qemu_fopen_ops_buffered(void *opaque,
   size_t bytes_per_sec,
   BufferedPutFunc *put_buffer,
+  BufferedBeginFunc *begin,
   BufferedPutReadyFunc *put_ready,
   BufferedWaitForUnfreezeFunc 
*wait_for_unfreeze,
   BufferedCloseFunc *close)
@@ -264,6 +273,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
 s->opaque = opaque;
 s->xfer_limit = bytes_per_sec / 10;
 s->put_buffer = put_buffer;
+s->begin = begin;
 s->put_ready = put_ready;
 s->wait_for_unfreeze = wait_for_unfreeze;
 s->close = close;
@@ -271,11 +281,9 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
 s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
  buffered_close, buffered_rate_limit,
  buffered_set_rate_limit,
-buffered_get_rate_limit);
-
-s->timer = qemu_new_timer_ms(rt_clock, buffered_rate_tick, s);
+ buffered_get_rate_limit);
 
-qemu_mod_timer(s->timer, qemu_get_clock_ms(rt_clock) + 100);
+qemu_thread_create(&s->thread, migrate_vm, s);
 
 return s->file;
 }
diff --git a/buffered_file.h b/buffered_file.h
index 98d358b..cfe2833 100644
--- a/buffered_file.h
+++ b/buffered_file.h
@@ -17,12 +17,16 @@
 #include "hw/hw.h"
 
 typedef ssize_t (BufferedPutFunc)(void *opaque, const void *data, size_t size);
+typedef void (BufferedBeginFunc)(void *opaque);
 typedef void (BufferedPutReadyFunc)(void *opaque);
 typedef void (BufferedWaitForUnfreezeFunc)(void *opaque);
 typedef int (BufferedCloseFunc)(void *opaque);
 
+void buffered_rate_tick(QEMUFile *file);
+
 QEMUFile *qemu_fopen_ops_buffered(void *opaque, size_t xfer_limit,
   BufferedPutFunc *put_buffer,
+  BufferedBeginFunc *begin,
   BufferedPutReadyFunc *put_ready,
   BufferedWaitForUnfreezeFunc 
*wait_for_unfreeze,
   BufferedCloseFunc *close);
diff --git a/migration.c b/migration.c
index af3a1f2..bf86067 100644
--- a/migration.c
+++ b/migration.c
@@ -31,6 +31,8 @@
 do { } while (0)
 #endif
 
+static int64_t expire_time;
+
 /* Migration speed throttling */
 static int64_t max_throttle = (32 << 20);
 
@@ -284,8 +286,6 @@ int migrate_fd_cleanup(FdMigrationState *s)
 {
 int ret = 0;
 
-qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
-
 if (s->file) {
 DPRINTF("closing file\n");
 if (qemu_fclose(s->file) != 0) {
@@ -310,8 +310,7 @@ int migrate_fd_cleanup(FdMigrationState *s)
 void migrate_fd_put_notify(void *opaque)
 {
 FdMigrationState *s = op

[RFC PATCH v2 3/3] Per memslot dirty bitmap

2011-07-29 Thread Umesh Deshpande
This patch creates a separate dirty bitmap for each memslot. Currently the dirty
bitmap is created for addresses ranging from 0 to the end address of the last
memory slot. Since the memslots are not necessarily contiguous, the current
bitmap might contain empty regions or holes that don't represent any VM pages.
This patch reduces the size of the dirty bitmap by allocating per-memslot dirty
bitmaps.
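
As a worked illustration (hypothetical layout, not taken from the patch):
with 2 GB of RAM below the 4 GB PCI hole and another 2 GB above it, the flat
scheme allocates one byte of dirty flags per page from 0 to 6 GB, i.e.
6 GB / 4 KB = ~1536 KB of flags, while per-memslot bitmaps only cover the
4 GB that is actually RAM, i.e. ~1024 KB; the remaining ~512 KB of the flat
bitmap describes nothing but the hole.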

Signed-off-by: Umesh Deshpande 
---
 cpu-all.h |   40 +---
 exec.c|   38 +++---
 xen-all.c |6 ++
 3 files changed, 58 insertions(+), 26 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index e839100..9517a9b 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -920,6 +920,7 @@ extern ram_addr_t ram_size;
 
 typedef struct RAMBlock {
 uint8_t *host;
+uint8_t *phys_dirty;
 ram_addr_t offset;
 ram_addr_t length;
 uint32_t flags;
@@ -931,7 +932,6 @@ typedef struct RAMBlock {
 } RAMBlock;
 
 typedef struct RAMList {
-uint8_t *phys_dirty;
 QLIST_HEAD(ram, RAMBlock) blocks;
 } RAMList;
 extern RAMList ram_list;
@@ -961,32 +961,55 @@ extern int mem_prealloc;
 #define CODE_DIRTY_FLAG  0x02
 #define MIGRATION_DIRTY_FLAG 0x08
 
+RAMBlock *qemu_addr_to_ramblock(ram_addr_t);
+
+static inline int get_page_nr(ram_addr_t addr, RAMBlock **block)
+{
+int page_nr;
+*block = qemu_addr_to_ramblock(addr);
+
+page_nr = addr - (*block)->offset;
+page_nr = page_nr >> TARGET_PAGE_BITS;
+
+return page_nr;
+}
+
 /* read dirty bit (return 0 or 1) */
 static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
 {
-return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] == 0xff;
+RAMBlock *block;
+int page_nr = get_page_nr(addr, &block);
+return block->phys_dirty[page_nr] == 0xff;
 }
 
 static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
 {
-return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS];
+RAMBlock *block;
+int page_nr = get_page_nr(addr, &block);
+return block->phys_dirty[page_nr];
 }
 
 static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
 int dirty_flags)
 {
-return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags;
+RAMBlock *block;
+int page_nr = get_page_nr(addr, &block);
+return block->phys_dirty[page_nr] & dirty_flags;
 }
 
 static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
 {
-ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] = 0xff;
+RAMBlock *block;
+int page_nr = get_page_nr(addr, &block);
+block->phys_dirty[page_nr] = 0xff;
 }
 
 static inline int cpu_physical_memory_set_dirty_flags(ram_addr_t addr,
   int dirty_flags)
 {
-return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] |= dirty_flags;
+RAMBlock *block;
+int page_nr = get_page_nr(addr, &block);
+return block->phys_dirty[page_nr] |= dirty_flags;
 }
 
 static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
@@ -995,10 +1018,13 @@ static inline void 
cpu_physical_memory_mask_dirty_range(ram_addr_t start,
 {
 int i, mask, len;
 uint8_t *p;
+RAMBlock *block;
+int page_nr = get_page_nr(start, &block);
 
 len = length >> TARGET_PAGE_BITS;
 mask = ~dirty_flags;
-p = ram_list.phys_dirty + (start >> TARGET_PAGE_BITS);
+
+p = block->phys_dirty + page_nr;
 for (i = 0; i < len; i++) {
 p[i] &= mask;
 }
diff --git a/exec.c b/exec.c
index 0e2ce57..6312550 100644
--- a/exec.c
+++ b/exec.c
@@ -2106,6 +2106,10 @@ void cpu_physical_memory_reset_dirty(ram_addr_t start, 
ram_addr_t end,
 abort();
 }
 
+if (kvm_enabled()) {
+return;
+}
+
 for(env = first_cpu; env != NULL; env = env->next_cpu) {
 int mmu_idx;
 for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
@@ -2894,17 +2898,6 @@ static ram_addr_t find_ram_offset(ram_addr_t size)
 return offset;
 }
 
-static ram_addr_t last_ram_offset(void)
-{
-RAMBlock *block;
-ram_addr_t last = 0;
-
-QLIST_FOREACH(block, &ram_list.blocks, next)
-last = MAX(last, block->offset + block->length);
-
-return last;
-}
-
 ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
ram_addr_t size, void *host)
 {
@@ -2974,10 +2967,8 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, 
const char *name,
 
 QLIST_INSERT_HEAD(&ram_list.blocks, new_block, next);
 
-ram_list.phys_dirty = qemu_realloc(ram_list.phys_dirty,
-   last_ram_offset() >> TARGET_PAGE_BITS);
-memset(ram_list.phys_dirty + (new_block->offset >> TARGET_PAGE_BITS),
-   0xff, size >> TARGET_PAGE_BITS);
+new_block->phys_dirty = qemu_mallocz(new_block->length >> 
TARGET_PAGE_BITS);
+memset(new_block->phys_dirty, 0xff, new_block->length >> TARGET_PAGE_BITS);
 
 if (kvm_enabled())
 kvm_setup_gu

[RFC PATCH v2 2/3] fine grained qemu_mutex locking for migration

2011-07-29 Thread Umesh Deshpande
In the migration thread, qemu_mutex is released during the most time-consuming
parts, i.e. during is_dup_page, which identifies uniform data pages, and during
put_buffer. qemu_mutex is also released while blocking on select to wait for
the descriptor to become ready for writes.

Signed-off-by: Umesh Deshpande 
---
 arch_init.c |   14 +++---
 migration.c |   11 +++
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 484b39d..cd545bc 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -110,7 +110,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
 static RAMBlock *last_block;
 static ram_addr_t last_offset;
 
-static int ram_save_block(QEMUFile *f)
+static int ram_save_block(QEMUFile *f, int stage)
 {
 RAMBlock *block = last_block;
 ram_addr_t offset = last_offset;
@@ -131,6 +131,10 @@ static int ram_save_block(QEMUFile *f)
 current_addr + TARGET_PAGE_SIZE,
 MIGRATION_DIRTY_FLAG);
 
+if (stage != 3) {
+qemu_mutex_unlock_iothread();
+}
+
 p = block->host + offset;
 
 if (is_dup_page(p, *p)) {
@@ -153,6 +157,10 @@ static int ram_save_block(QEMUFile *f)
 bytes_sent = TARGET_PAGE_SIZE;
 }
 
+if (stage != 3) {
+qemu_mutex_lock_iothread();
+}
+
 break;
 }
 
@@ -301,7 +309,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, 
void *opaque)
 while (!qemu_file_rate_limit(f)) {
 int bytes_sent;
 
-bytes_sent = ram_save_block(f);
+bytes_sent = ram_save_block(f, stage);
 bytes_transferred += bytes_sent;
 if (bytes_sent == 0) { /* no more blocks */
 break;
@@ -322,7 +330,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, 
void *opaque)
 int bytes_sent;
 
 /* flush all remaining blocks regardless of rate limiting */
-while ((bytes_sent = ram_save_block(f)) != 0) {
+while ((bytes_sent = ram_save_block(f, stage)) != 0) {
 bytes_transferred += bytes_sent;
 }
 cpu_physical_memory_set_dirty_tracking(0);
diff --git a/migration.c b/migration.c
index bf86067..992fef5 100644
--- a/migration.c
+++ b/migration.c
@@ -375,15 +375,19 @@ void migrate_fd_begin(void *arg)
 if (ret < 0) {
 DPRINTF("failed, %d\n", ret);
 migrate_fd_error(s);
-goto out;
+qemu_mutex_unlock_iothread();
+return;
 }
 
 expire_time = qemu_get_clock_ms(rt_clock) + 100;
 migrate_fd_put_ready(s);
+qemu_mutex_unlock_iothread();
 
 while (s->state == MIG_STATE_ACTIVE) {
 if (migrate_fd_check_expire()) {
+qemu_mutex_lock_iothread();
 buffered_rate_tick(s->file);
+qemu_mutex_unlock_iothread();
 }
 
 if (s->state != MIG_STATE_ACTIVE) {
@@ -392,12 +396,11 @@ void migrate_fd_begin(void *arg)
 
 if (s->callback) {
 migrate_fd_wait_for_unfreeze(s);
+qemu_mutex_lock_iothread();
 s->callback(s);
+qemu_mutex_unlock_iothread();
 }
 }
-
-out:
-qemu_mutex_unlock_iothread();
 }
 
 
-- 
1.7.4.1



[PATCH RFC net-next] virtio_net: refill buffer right after being used

2011-07-29 Thread Shirley Ma
To even out the latency, refill the buffer right after it is used.

Sign-off-by: Shirley Ma 
---

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0c7321c..c8201d4 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -429,6 +429,22 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, 
gfp_t gfp)
return err;
 }
 
+static bool fill_one(struct virtio_net *vi, gfp_t gfp)
+{
+   int err;
+
+   if (vi->mergeable_rx_bufs)
+   err = add_recvbuf_mergeable(vi, gfp);
+   else if (vi->big_packets)
+   err = add_recvbuf_big(vi, gfp);
+   else
+   err = add_recvbuf_small(vi, gfp);
+
+   if (err >= 0)
+   ++vi->num;
+   return err;
+}
+
 /* Returns false if we couldn't fill entirely (OOM). */
 static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
 {
@@ -436,17 +452,10 @@ static bool try_fill_recv(struct virtnet_info *vi, gfp_t 
gfp)
bool oom;
 
do {
-   if (vi->mergeable_rx_bufs)
-   err = add_recvbuf_mergeable(vi, gfp);
-   else if (vi->big_packets)
-   err = add_recvbuf_big(vi, gfp);
-   else
-   err = add_recvbuf_small(vi, gfp);
-
+   err = fill_one(vi, gfp);
oom = err == -ENOMEM;
if (err < 0)
break;
-   ++vi->num;
} while (err > 0);
if (unlikely(vi->num > vi->max))
vi->max = vi->num;
@@ -506,13 +515,13 @@ again:
receive_buf(vi->dev, buf, len);
--vi->num;
received++;
-   }
-
-   if (vi->num < vi->max / 2) {
-   if (!try_fill_recv(vi, GFP_ATOMIC))
+   if (fill_one(vi, GFP_ATOMIC) < 0)
schedule_delayed_work(&vi->refill, 0);
}
 
+   /* notify buffers are refilled */
+   virtqueue_kick(vi->rvq);
+
/* Out of packets? */
if (received < budget) {
napi_complete(napi);




KVM: x86: report valid microcode update ID

2011-07-29 Thread Marcelo Tosatti

Windows Server 2008 SP2 checked build with smp > 1 BSODs during
boot due to the lack of a microcode update:

*** Assertion failed: The system BIOS on this machine does not properly
support the processor.  The system BIOS did not load any microcode update.
A BIOS containing the latest microcode update is needed for system reliability.
(CurrentUpdateRevision != 0)
***   Source File: d:\longhorn\base\hals\update\intelupd\update.c, line 440

Report a non-zero microcode update signature to make it happy.

Signed-off-by: Marcelo Tosatti 

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e80f0d7..f435591 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1841,7 +1841,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 *pdata)
 
switch (msr) {
case MSR_IA32_PLATFORM_ID:
-   case MSR_IA32_UCODE_REV:
case MSR_IA32_EBL_CR_POWERON:
case MSR_IA32_DEBUGCTLMSR:
case MSR_IA32_LASTBRANCHFROMIP:
@@ -1862,6 +1861,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 *pdata)
case MSR_FAM10H_MMIO_CONF_BASE:
data = 0;
break;
+   case MSR_IA32_UCODE_REV:
+   data = 0x1ULL;
+   break;
case MSR_MTRRcap:
data = 0x500 | KVM_NR_VAR_MTRR;
break;
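
For what it's worth, a quick way to see the value the guest now observes
for MSR_IA32_UCODE_REV (0x8b) is to read it through the msr driver from
inside the guest (needs root and the msr module loaded; this little helper
is only an illustration, not part of the patch):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/cpu/0/msr (is the msr module loaded?)");
		return 1;
	}
	/* The msr driver reads the MSR whose number is the file offset. */
	if (pread(fd, &val, sizeof(val), 0x8b) != sizeof(val)) {
		perror("pread");
		return 1;
	}
	printf("IA32_UCODE_REV raw = 0x%016llx\n", (unsigned long long)val);
	close(fd);
	return 0;
}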


Re: [PATCH RFC net-next] virtio_net: refill buffer right after being used

2011-07-29 Thread Shirley Ma
Resubmit it with a typo fix.

Signed-off-by: Shirley Ma 
---

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0c7321c..c8201d4 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -429,6 +429,22 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, 
gfp_t gfp)
return err;
 }
 
+static int fill_one(struct virtnet_info *vi, gfp_t gfp)
+{
+   int err;
+
+   if (vi->mergeable_rx_bufs)
+   err = add_recvbuf_mergeable(vi, gfp);
+   else if (vi->big_packets)
+   err = add_recvbuf_big(vi, gfp);
+   else
+   err = add_recvbuf_small(vi, gfp);
+
+   if (err >= 0)
+   ++vi->num;
+   return err;
+}
+
 /* Returns false if we couldn't fill entirely (OOM). */
 static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
 {
@@ -436,17 +452,10 @@ static bool try_fill_recv(struct virtnet_info *vi, gfp_t 
gfp)
bool oom;
 
do {
-   if (vi->mergeable_rx_bufs)
-   err = add_recvbuf_mergeable(vi, gfp);
-   else if (vi->big_packets)
-   err = add_recvbuf_big(vi, gfp);
-   else
-   err = add_recvbuf_small(vi, gfp);
-
+   err = fill_one(vi, gfp);
oom = err == -ENOMEM;
if (err < 0)
break;
-   ++vi->num;
} while (err > 0);
if (unlikely(vi->num > vi->max))
vi->max = vi->num;
@@ -506,13 +515,13 @@ again:
receive_buf(vi->dev, buf, len);
--vi->num;
received++;
-   }
-
-   if (vi->num < vi->max / 2) {
-   if (!try_fill_recv(vi, GFP_ATOMIC))
+   if (fill_one(vi, GFP_ATOMIC) < 0)
schedule_delayed_work(&vi->refill, 0);
}
 
+   /* notify buffers are refilled */
+   virtqueue_kick(vi->rvq);
+
/* Out of packets? */
if (received < budget) {
napi_complete(napi);




kvm PCI assignment & VFIO ramblings

2011-07-29 Thread Benjamin Herrenschmidt
Hi folks !

So I promised Anthony I would try to summarize some of the comments &
issues we have vs. VFIO after we've tried to use it for PCI pass-through
on POWER. It's pretty long, there are various items with more or less
impact, some of it is easily fixable, some are API issues, and we'll
probably want to discuss them separately, but for now here's a brain
dump.

David, Alexei, please make sure I haven't missed anything :-)

* Granularity of pass-through

So let's first start with what is probably the main issue and the most
contentious, which is the problem of dealing with the various
constraints which define the granularity of pass-through, along with
exploiting features like the VTd iommu domains.

For the sake of clarity, let me first talk a bit about the "granularity"
issue I've mentioned above.

There are various constraints that can/will force several devices to be
"owned" by the same guest and on the same side of the host/guest
boundary. This is generally because some kind of HW resource is shared
and thus not doing so would break the isolation barrier and enable a
guest to disrupt the operations of the host and/or another guest.

Some of those constraints are well known, such as shared interrupts. Some
are more subtle; for example, if a PCIe->PCI bridge exists in the system,
there is no way for the iommu to identify transactions from devices
coming from the PCI segment of that bridge with a granularity other than
"behind the bridge". So typically an EHCI/OHCI/OHCI combo (a classic)
behind such a bridge must be treated as a single "entity" for
pass-through purposes.
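
To make that concrete, here is a rough sketch (not taken from any existing
kernel code) of the kind of topology walk the kernel would have to do to
find that boundary for a conventional PCI device: keep climbing until the
first PCI Express ancestor, since nothing below it can be told apart.

#include <linux/pci.h>

static struct pci_dev *pass_through_group_alias(struct pci_dev *pdev)
{
	struct pci_dev *dev = pdev;

	/* A native PCIe function is its own requester; a conventional PCI
	 * device can only be identified as "something below its nearest
	 * PCIe ancestor", typically a PCIe-to-PCI bridge. */
	while (!pci_is_pcie(dev)) {
		if (!dev->bus->self)	/* conventional device on the root bus */
			return dev;
		dev = dev->bus->self;
	}
	return dev;
}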

In IBM POWER land, we call this a "partitionable endpoint" (the term
"endpoint" here is historic, such a PE can be made of several PCIe
"endpoints"). I think "partitionable" is a pretty good name tho to
represent the constraints, so I'll call this a "partitionable group"
from now on. 

Other examples of such HW imposed constraints can be a shared iommu with
no filtering capability (some older POWER hardware which we might want
to support fall into that category, each PCI host bridge is its own
domain but doesn't have a finer granularity... however those machines
tend to have a lot of host bridges :)

If we are ever going to consider applying some of this to non-PCI
devices (see the ongoing discussions here), then we will be faced with
the craziness of embedded designers, which probably means all sorts of new
constraints we can't even begin to think about.

This leads me to those initial conclusions:

- The -minimum- granularity of pass-through is not always a single
device and not always under SW control

- Having a magic heuristic in libvirt to figure out those constraints is
WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
knowledge of PCI resource management and getting it wrong in many many
cases, something that took years to fix essentially by ripping it all
out. This is kernel knowledge, and thus we need the kernel to expose in one
way or another what those constraints are, what those "partitionable
groups" are.

- That does -not- mean that we cannot specify for each individual device
within such a group where we want to put it in qemu (what devfn etc...).
As long as there is a clear understanding that the "ownership" of the
device goes with the group, this is somewhat orthogonal to how they are
represented in qemu. (Not completely... if the iommu is exposed to the
guest, via paravirt for example, some of these constraints must be
exposed but I'll talk about that more later).

The interface currently proposed for VFIO (and associated uiommu)
doesn't handle that problem at all. Instead, it is entirely centered
around a specific "feature" of the VTd iommu's for creating arbitrary
domains with arbitrary devices (tho those devices -do- have the same
constraints exposed above, don't try to put 2 legacy PCI devices behind
the same bridge into 2 different domains !), but the API totally ignores
the problem, leaves it to libvirt "magic foo" and focuses on something
that is both quite secondary in the grand scheme of things, and quite
x86 VTd specific in the implementation and API definition.

Now, I'm not saying these programmable iommu domains aren't a nice
feature and that we shouldn't exploit them when available, but as it is,
it is too much a central part of the API.

I'll talk a little bit more about recent POWER iommu's here to
illustrate where I'm coming from with my idea of groups:

On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
of domain and a per-RID filtering. However it differs from VTd in a few
ways:

The "domains" (aka PEs) encompass more than just an iommu filtering
scheme. The MMIO space and PIO space are also segmented, and those
segments assigned to domains. Interrupts (well, MSI ports at least) are
assigned to domains. Inbound PCIe error messages are targeted to
domains, etc...

Basically, the PEs provide a very strong isolation feature which
includes errors, and has the ability 

Re: [PATCH RFC net-next] virtio_net: refill buffer right after being used

2011-07-29 Thread Mike Waychison
On Fri, Jul 29, 2011 at 3:55 PM, Shirley Ma  wrote:
> Resubmit it with a typo fix.
>
> Signed-off-by: Shirley Ma 
> ---
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 0c7321c..c8201d4 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -429,6 +429,22 @@ static int add_recvbuf_mergeable(struct virtnet_info 
> *vi, gfp_t gfp)
>        return err;
>  }
>
> +static int fill_one(struct virtnet_info *vi, gfp_t gfp)
> +{
> +       int err;
> +
> +       if (vi->mergeable_rx_bufs)
> +               err = add_recvbuf_mergeable(vi, gfp);
> +       else if (vi->big_packets)
> +               err = add_recvbuf_big(vi, gfp);
> +       else
> +               err = add_recvbuf_small(vi, gfp);
> +
> +       if (err >= 0)
> +               ++vi->num;
> +       return err;
> +}
> +
>  /* Returns false if we couldn't fill entirely (OOM). */
>  static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
>  {
> @@ -436,17 +452,10 @@ static bool try_fill_recv(struct virtnet_info *vi, 
> gfp_t gfp)
>        bool oom;
>
>        do {
> -               if (vi->mergeable_rx_bufs)
> -                       err = add_recvbuf_mergeable(vi, gfp);
> -               else if (vi->big_packets)
> -                       err = add_recvbuf_big(vi, gfp);
> -               else
> -                       err = add_recvbuf_small(vi, gfp);
> -
> +               err = fill_one(vi, gfp);
>                oom = err == -ENOMEM;
>                if (err < 0)
>                        break;
> -               ++vi->num;
>        } while (err > 0);
>        if (unlikely(vi->num > vi->max))
>                vi->max = vi->num;
> @@ -506,13 +515,13 @@ again:
>                receive_buf(vi->dev, buf, len);
>                --vi->num;
>                received++;
> -       }
> -
> -       if (vi->num < vi->max / 2) {
> -               if (!try_fill_recv(vi, GFP_ATOMIC))
> +               if (fill_one(vi, GFP_ATOMIC) < 0)
>                        schedule_delayed_work(&vi->refill, 0);
>        }
>
> +       /* notify buffers are refilled */
> +       virtqueue_kick(vi->rvq);
> +

How does this reduce latency? We are doing the same amount of work
in both cases, and in both cases the newly available buffers are not
visible to the device until the virtqueue_kick.


>        /* Out of packets? */
>        if (received < budget) {
>                napi_complete(napi);
>
>