Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming

2012-04-02 Thread Michael S. Tsirkin
On Fri, Mar 30, 2012 at 08:26:06AM -0700, Tejun Heo wrote:
> On Fri, Mar 30, 2012 at 05:53:52PM +0800, Ren Mingxin wrote:
> >  The current virtblk's naming algorithm only supports 26^3 disks.
> > If there are mass of virtblks (exceeding 26^3), there will be disks
> > with the same name.
> > 
> > By renaming "sd_format_disk_name()" to "disk_name_format()"
> > and moving it into block core, virtio_blk can use this function to
> > support mass of disks.
> > 
> > Signed-off-by: Ren Mingxin 
> 
> I guess it's already way too late but why couldn't they have been
> named vdD-P where both D and P are integers denoting disk number and
> partition number?

Yes they could and yes it's way too late to change device
names for virtio - if we did, this would break countless setups.

>  [sh]dX's were created when there weren't supposed
> to be too many disks, so we had to come up with the horrible alphabet
> based numbering scheme but vd is new enough.  I mean, naming is one
> thing but who wants to figure out which sequence is or guess what
> comes next vdzz9?  :(
> 
> If we're gonna move it to block layer, let's add big blinking red
> comment saying "don't ever use it for any new driver".
> 
> Thanks.

Right. virtio would have to use the legacy one though.
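
For context, the legacy naming being discussed is the sd-style base-26 scheme
(vda..vdz, vdaa..vdzz, ...). A minimal user-space sketch of that encoding (not
the kernel's actual sd_format_disk_name(), just the same idea) looks like this:

#include <stdio.h>
#include <string.h>

/*
 * Sketch of the sd-style name encoding: index 0 -> "vda", 25 -> "vdz",
 * 26 -> "vdaa", and so on.  Illustration only, not kernel code.
 */
static int format_disk_name(char *buf, size_t len, const char *prefix, int index)
{
	char tmp[8];
	int pos = sizeof(tmp);

	tmp[--pos] = '\0';
	do {
		tmp[--pos] = 'a' + (index % 26);
		index = index / 26 - 1;		/* the -1 makes "aa" follow "z" */
	} while (index >= 0 && pos > 0);

	if (strlen(prefix) + strlen(&tmp[pos]) >= len)
		return -1;
	sprintf(buf, "%s%s", prefix, &tmp[pos]);
	return 0;
}

int main(void)
{
	char name[32];
	int i;

	for (i = 0; i < 30; i++) {
		if (format_disk_name(name, sizeof(name), "vd", i) == 0)
			printf("%d -> %s\n", i, name);
	}
	return 0;
}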

> -- 
> tejun


Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming

2012-04-02 Thread Michael S. Tsirkin
On Mon, Apr 02, 2012 at 09:19:05AM +0800, Ren Mingxin wrote:
>  On 03/30/2012 11:28 PM, Tejun Heo wrote:
> >On Fri, Mar 30, 2012 at 08:26:06AM -0700, Tejun Heo wrote:
> >>On Fri, Mar 30, 2012 at 05:53:52PM +0800, Ren Mingxin wrote:
> >>>  The current virtblk's naming algorithm only supports 26^3 disks.
> >>>If there are mass of virtblks (exceeding 26^3), there will be disks
> >>>with the same name.
> >>>
> >>>By renaming "sd_format_disk_name()" to "disk_name_format()"
> >>>and moving it into block core, virtio_blk can use this function to
> >>>support mass of disks.
> >>>
> >>>Signed-off-by: Ren Mingxin
> >>I guess it's already way too late but why couldn't they have been
> >>named vdD-P where both D and P are integers denoting disk number and
> >>partition number?  [sh]dX's were created when there weren't supposed
> >>to be too many disks, so we had to come up with the horrible alphabet
> >>based numbering scheme but vd is new enough.  I mean, naming is one
> >>thing but who wants to figure out which sequence is or guess what
> >>comes next vdzz9?  :(
> >>
> >>If we're gonna move it to block layer, let's add big blinking red
> >>comment saying "don't ever use it for any new driver".
> >And also let's make that clear in the function name - say,
> >format_legacy_disk_name() or something.
> 
> So, for legacy disks [sh]d, we'd name them as [sh]d[a-z]{1,}. For new devices
> like vd, we'd name them as vdN (with vdNpM for partitions)?

Please don't rename virtio disks, it is way too late for that:
the virtio block driver was merged around 2007, it is not new by
any measure, and there are many systems out there using
the current naming scheme.

> And how about the rssd in the patch 3 then?

Probably same. Renaming existing devices will break setups.
I think the idea is to avoid using the
legacy naming in new drivers *that will be added from now on*.

> Besides, does anybody have comments?
> Looking forward to your replies ;-)
> 
> -- 
> Thanks,
> Ren


AMD iommu: Difference between setting iommu=pt and not setting this option

2012-04-02 Thread Andreas Hartmann
Hello,

please, could somebody explain the difference in behaviour between
setting the boot option iommu=pt and not setting it?

I can see the following differences:

w/ iommu=pt:

[0.832946] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40
[0.883983] AMD-Vi: Initialized for Passthrough Mode


w/o iommu=pt:

[0.834905] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40
[0.894172] AMD-Vi: Lazy IO/TLB flushing enabled



In both cases, I can pass through a PCIe device but not a PCI device (->
no problem with IRQ-sharing - the PCI device has its own exclusive IRQ).

If I use iommu=pt, the passed PCIe device is broken after the VM has
been shut down and started again.



Thank you,
kind regards,
Andreas


Re: [GIT PULL] KVM updates for the 3.4 merge window

2012-04-02 Thread Avi Kivity
On 04/02/2012 12:02 AM, Benjamin Herrenschmidt wrote:
> On Sun, 2012-04-01 at 15:38 +0300, Avi Kivity wrote:
> > On 03/30/2012 03:01 PM, Paul Mackerras wrote:
> > > I just noticed that the branch you asked Linus to pull includes none
> > > of the patches that Alex sent you in the last batch, in the email with
> > > subject "[PULL 00/56] ppc patch queue 2012-03-15" sent on March 15,
> > > where he asked you to pull git://github.com/agraf/linux-2.6.git
> > > for-upstream.
> > >
> > > What happened?  Did they get lost in the re-signing, or is there some
> > > reason you thought they shouldn't go in?
> > 
> > That pull request was sent three days before the merge window opened;
> > patches are supposed to cook for a while in -next before being merged,
> > especially large trees like that one.
>
> These are all powerpc-specific patches that have been cooking in Alex's
> tree for a while and elsewhere before that. They almost only affect
> arch/powerpc/kvm, and as such don't really need a lot of integration
> testing in -next. A bit for sure, but not necessarily months.
>
> The current process is such that it takes absolutely forever for our
> patches to get in, which is a major PITA for something in such state of
> active development.

If the patches were posted two weeks earlier, they would have gone in.

> Why don't we have Alex's tree go straight to -next like I do with Kumar's,
> for example? That way I don't need to have his branch sit in my tree
> for weeks before I push it out to Linus.

There isn't a lot of common kvm code, but what there is needs to be
synchronized.

-- 
error compiling committee.c: too many arguments to function



Re: [GIT PULL] KVM updates for the 3.4 merge window

2012-04-02 Thread Avi Kivity
On 04/02/2012 01:45 AM, Paul Mackerras wrote:
> On Sun, Apr 01, 2012 at 03:38:37PM +0300, Avi Kivity wrote:
> > On 03/30/2012 03:01 PM, Paul Mackerras wrote:
> > > I just noticed that the branch you asked Linus to pull includes none
> > > of the patches that Alex sent you in the last batch, in the email with
> > > subject "[PULL 00/56] ppc patch queue 2012-03-15" sent on March 15,
> > > where he asked you to pull git://github.com/agraf/linux-2.6.git
> > > for-upstream.
> > >
> > > What happened?  Did they get lost in the re-signing, or is there some
> > > reason you thought they shouldn't go in?
> > 
> > That pull request was sent three days before the merge window opened;
> > patches are supposed to cook for a while in -next before being merged,
> > especially large trees like that one.
>
> OK.  Then there are about six commits from that string that are small,
> simple fixes for bugs introduced in earlier patches, that only affect
> arch/powerpc.  They should go into Linus' tree as soon as possible.

Sure.

> How do you want to handle those?  Do you want them as patches or as a
> git tree, and if the latter, what do you prefer that I base them on?
> Or should I send them in via Ben?

Git tree against upstream please.

-- 
error compiling committee.c: too many arguments to function



[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel

2012-04-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779





--- Comment #26 from Avi Kivity   2012-04-02 09:12:38 ---
Strange, looks like the patches did not take effect at all.  The opcode is 0f
6f 02, which should have been decoded as movq.



Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-04-02 Thread Thomas Gleixner
On Sun, 1 Apr 2012, Avi Kivity wrote:
> On 03/31/2012 01:07 AM, Thomas Gleixner wrote:
> > On Fri, 30 Mar 2012, H. Peter Anvin wrote:
> >
> > > What is the current status of this patchset?  I haven't looked at it too
> > > closely because I have been focused on 3.4 up until now...
> >
> > The real question is whether these heuristics are the correct approach
> > or not.
> >
> > If I look at it from the non-virtualized kernel side then this is ass
> > backwards. We know already that we are holding a spinlock which might
> > cause other (v)cpus to go into an eternal spin. The non-virtualized
> > kernel solves this by disabling preemption and therefore getting out of
> > the critical section as fast as possible.
> >
> > The virtualization problem reminds me a lot of the problem which RT
> > kernels are observing where non-raw spinlocks are turned into
> > "sleeping spinlocks" and therefore can cause throughput issues for
> > non-RT workloads.
> >
> > Though the virtualized situation is even worse. Any preempted guest
> > section which holds a spinlock is prone to cause unbound delays.
> >
> > The paravirt ticketlock solution can only mitigate the problem, but
> > not solve it. With massive overcommit there is always a way to trigger
> > worst case scenarios unless you are educating the scheduler to cope
> > with that.
> >
> > So if we need to fiddle with the scheduler and frankly that's the only
> > way to get a real gain (the numbers, which are achieved by these
> > patches, are not that impressive) then the question arises whether we
> > should turn the whole thing around.
> >
> > I know that Peter is going to go berserk on me, but if we are running
> > a paravirt guest then it's simple to provide a mechanism which allows
> > the host (aka hypervisor) to check that in the guest just by looking
> > at some global state.
> >
> > So if a guest exits due to an external event it's easy to inspect the
> > state of that guest and avoid scheduling it away when it was interrupted
> > in a spinlock-held section. That guest/host shared state needs to be
> > modified to tell the guest to invoke an exit when the last nested
> > lock has been released.
> 
> Interesting idea (I think it has been raised before btw, don't recall by
> who).

Someone posted a pointer to that old thread.

> One thing about it is that it can give many false positives.  Consider a
> fine-grained spinlock that is being accessed by many threads.  That is,
> the lock is taken and released with high frequency, but there is no
> contention, because each vcpu is accessing a different instance.  So the
> host scheduler will needlessly delay preemption of vcpus that happen to
> be holding a lock, even though this gains nothing.

You're talking about per cpu locks, right? I can see the point there,
but per cpu stuff might be worth annotating to avoid this.
 
> A second issue may happen with a lock that is taken and released with
> high frequency, with a high hold percentage.  The host scheduler may
> always sample the guest in a held state, leading it to conclude that
> it's exceeding its timeout when in fact the lock is held for a short
> time only.

Well, no. You can avoid that.

host                              VCPU
                                  spin_lock()
                                   modify_global_state()
exit
check_global_state()
mark_global_state()
reschedule vcpu

                                  spin_unlock()
                                   check_global_state()
                                    trigger_exit()

So when an exit occurs in the locked section, then the host can mark
the global state to tell the guest to issue a trap on unlock.
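
To make that concrete, here is a minimal sketch of the guest side of such a
shared-state protocol; all names (kvm_lock_state, guest_lock_enter/exit,
hypercall_yield) are hypothetical and only illustrate the idea, they are not
an existing interface:

#include <linux/percpu.h>

/* Hypothetical per-vcpu state shared with the host (illustration only). */
struct kvm_lock_state {
	unsigned int lock_depth;	/* nesting count of held spinlocks */
	unsigned int trap_on_unlock;	/* set by the host while this vcpu is marked */
};

static DEFINE_PER_CPU(struct kvm_lock_state, lock_state);

extern void hypercall_yield(void);	/* hypothetical guest->host exit */

static inline void guest_lock_enter(void)
{
	/* Tell the host we are inside a lock-held section. */
	this_cpu_inc(lock_state.lock_depth);
}

static inline void guest_lock_exit(void)
{
	struct kvm_lock_state *s = this_cpu_ptr(&lock_state);

	if (--s->lock_depth == 0 && s->trap_on_unlock) {
		s->trap_on_unlock = 0;
		/* Exit so the host can reschedule us right away. */
		hypercall_yield();
	}
}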
 
> > Of course this needs to be time bound, so a rogue guest cannot
> > monopolize the cpu forever, but that's the least worrisome
> > problem simply because a guest which does not get out of a spinlocked
> > region within a certain amount of time is borked and eligible for
> > killing anyway.
> 
> Hopefully not killing!  Just because a guest doesn't scale well, or even
> if it's deadlocked, doesn't mean it should be killed.  Just preempt it.

:)
 
> > Thoughts ?
> 
> It's certainly interesting.  Maybe a combination is worthwhile - prevent
> lockholder preemption for a short period of time AND put waiters to
> sleep in case that period is insufficient to release the lock.

Right, but as Srivatsa pointed out this still needs the ticket lock
ordering support to avoid guests being completely starved.

Thanks,

tglx


Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-04-02 Thread Ian Campbell
On Fri, 2012-03-30 at 23:07 +0100, Thomas Gleixner wrote:
> So if we need to fiddle with the scheduler and frankly that's the only
> way to get a real gain (the numbers, which are achieved by this
> patches, are not that impressive) then the question arises whether we
> should turn the whole thing around.

It probably doesn't materially affect your core point (which seems valid
to me) but it's worth pointing out that the numbers presented in this
thread are AFAICT mostly focused on ensuring that the impact of
this infrastructure is acceptable on native rather than showing the
benefits for virtualized workloads.

Ian.



Re: [GIT PULL] KVM updates for the 3.4 merge window

2012-04-02 Thread Benjamin Herrenschmidt
On Mon, 2012-04-02 at 12:06 +0300, Avi Kivity wrote:
> > The current process is such that it takes absolutely forever for our
> > patches to get in, which is a major PITA for something in such state of
> > active development.
> 
> If the patches were posted two weeks earlier, they would have gone in.

I believe on our side they were, but Alex took a while to make up his
tree ... oh well..

> > Why don't we have Alex's tree go straight to -next like I do with Kumar's,
> > for example? That way I don't need to have his branch sit in my tree
> > for weeks before I push it out to Linus.
> 
> There isn't a lot of common kvm code, but what there is needs to be
> synchronized.

At least we should be able to get the important bug fixes in now; I see
that you agree in your reply to Paulus, so that's good :-)

Cheers,
Ben.




Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-04-02 Thread Raghavendra K T

On 04/01/2012 07:23 PM, Avi Kivity wrote:
> On 04/01/2012 04:48 PM, Raghavendra K T wrote:
I have patch something like below in mind to try:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d3b98b1..5127668 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * else and called schedule in __vcpu_run.  Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
+	 * Priority is given to vcpu that are unhalted.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 3 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			struct task_struct *task = NULL;
 			struct pid *pid;
-			if (!pass && i <= last_boosted_vcpu) {
+			if (!pass && !vcpu->pv_unhalted)
+				continue;
+			else if (pass == 1 && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass == 2 && i > last_boosted_vcpu)
 				break;
 			if (vcpu == me)
 				continue;

>>>
>>> Actually I think this is unneeded.  The loops tries to find vcpus that
>>> are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
>>> don't match this condition.
>>>

Oh! I think I misinterpreted your statement. Hmm, I got it: you meant to
remove the if (vcpu == me) condition.

I have one more interesting idea (not sure whether there is some flaw in it).
Basically I tried a similar idea (to the PLE exit handler) in vcpu_block.

Instead of blindly scheduling, we try to yield to a vcpu that has been kicked.
IMO it may solve some scalability problems and shrink the LHP problem
further.

I think Thomas would be happy to see the result.

results:
test setup.
===
Host: i5-2540M CPU @ 2.60GHz laptop with 4cpu w/ hyperthreading. 8GB RAM
guest: 16 vcpu 2GB RAM  single guest.

Did kernbench run under guest:
x rc6 with ticketlock (current patchset) + kvmpatches (CONFIG_PARAVIRT_SPINLOCK=y)
+ rc6 with ticketlock + kvmpatches + try_yield_patch (below one) (YIELD_THRESHOLD=256) (CONFIG_PARAVIRT_SPINLOCK=y)
* rc6 with ticketlock + kvmpatches + try_yield_patch (YIELD_THRESHOLD=2048) (CONFIG_PARAVIRT_SPINLOCK=y)


        N        Min        Max     Median        Avg     Stddev
x       3     162.45     165.94    165.433  164.60767  1.8857111
+       3     114.02    117.243    115.953  115.73867  1.6221548
Difference at 95.0% confidence
        -29.6882% +/- 2.42192%
*       3    115.823    120.423    117.103    117.783  2.3741946
Difference at 95.0% confidence
        -28.4462% +/- 2.9521%


Improvement is ~29% w.r.t. the current patches.

Note: vanilla rc6 (host and guest) with CONFIG_PARAVIRT_SPINLOCK=n
did not finish the kernbench run even after *1hr 45* minutes (the above
kernbench runs took 9 minutes and 6.5 minutes respectively). I did not try
to test it again.


Yes, I understand that I have to do some more tests. Immediate TODOs
for the patch are:

1) the code belongs in the arch/x86 directory; fill in static inlines for
other archs
2) tweak the YIELD_THRESHOLD value.

Ideas/suggestions welcome

Here is the try_yield_to patch.
---
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5127668..3fa912a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1557,12 +1557,17 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
mark_page_dirty_in_slot(kvm, memslot, gfn);
 }

+#define YIELD_THRESHOLD 2048
+static void kvm_vcpu_try_yield_to(struct kvm_vcpu *me);
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
 void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 {
DEFINE_WAIT(wait);
+   unsigned int loop_count;
+
+   loop_count = 0;

for (;;) {
prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
@@ -1579,7 +1584,10 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
if (signal_pending(current))
break;

-   schedule();
+   if (loop_count++ % YIELD_THRESHOLD)
+   schedule();
+   else
+   kvm_vcpu_try_yield_to(vcpu);
}

finish_wait(&vcpu->wq, &wait);
@@ -1593,6 +1601,39 @@ void kvm_resched(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_resched);

+static void kvm_vcpu_try_yield(struct kvm_vcpu *me)
+{
+
+   struct kvm *kvm = me->kvm;
+   struct kvm_vcpu *vcpu;
+   int i;
+
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   struct task_struct *task = NULL;
+   struct pid *pid;
+   if (!vcpu->pv_unhalted)
+

Re: [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-04-02 Thread Wen Congyang
At 03/19/2012 03:33 PM, Wen Congyang Wrote:
> At 03/08/2012 03:57 PM, Wen Congyang Wrote:
>> We can know the guest has panicked when the guest runs on xen.
>> But we do not have such a feature on kvm.
>>
>> Another purpose of this feature is: a management app (for example
>> libvirt) can do an auto dump when the guest crashes. If the management
>> app does not do an auto dump, the guest's user can dump by hand if
>> he sees the guest has panicked.
>>
>> I touch the hypervisor instead of using virtio-serial, because
>> 1. it is simple
>> 2. the virtio-serial is an optional device, and the guest may
>>    not have such a device.
>>
>> Changes from v2 to v3:
>> 1. correct spelling
>>
>> Changes from v1 to v2:
>> 1. split up host and guest-side changes
>> 2. introduce new request flag to avoid changing return values.
> 
> 
> Hi all:
> 
> we need this feature, but we haven't decided how to implement it.
> We have two solutions:
> 1. use vmcall
> 2. use virtio-serial.

Hi, all

There are three solutions now:
1. use vmcall
2. use I/O port
3. use virtio-serial.

I think 1 and 2 are simpler than 3.
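
As an illustration of how small option 2 could be on the guest side, a sketch
of a panic notifier that pokes an I/O port is below; the port number and value
are made up for the example, they are not a defined interface:

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/io.h>

/* Hypothetical port/value, for illustration only. */
#define KVM_PANIC_PORT	0x505
#define KVM_PANIC_EVENT	1

static int kvm_panic_notify(struct notifier_block *nb, unsigned long code,
			    void *unused)
{
	/* A single port write is enough; the hypervisor traps it. */
	outb(KVM_PANIC_EVENT, KVM_PANIC_PORT);
	return NOTIFY_DONE;
}

static struct notifier_block kvm_panic_nb = {
	.notifier_call = kvm_panic_notify,
};

static int __init kvm_panic_notify_init(void)
{
	atomic_notifier_chain_register(&panic_notifier_list, &kvm_panic_nb);
	return 0;
}
arch_initcall(kvm_panic_notify_init);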

I have been reading virtio-serial's driver in recent days. It seems that
if the virtio serial port is not opened at the host side, the
data written into this port will be discarded, and we will have
no chance to know the guest has panicked.

To Amit:

Can we write a message to a virtio serial port like this directly in
kernel space?
send_buf(panicked_port, message, message's length, true);

if port->outvq_full is true, is it OK to do this?

Thanks
Wen Congyang
> 
> I will not change this patch set before we decide how to do it.
> Can we make a decision in the coming days?
> 
> Thanks
> Wen Congyang



Re: [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-04-02 Thread Amit Shah
On (Mon) 02 Apr 2012 [18:05:45], Wen Congyang wrote:
> At 03/19/2012 03:33 PM, Wen Congyang Wrote:
> > At 03/08/2012 03:57 PM, Wen Congyang Wrote:
> >> We can know the guest has panicked when the guest runs on xen.
> >> But we do not have such a feature on kvm.
> >>
> >> Another purpose of this feature is: a management app (for example
> >> libvirt) can do an auto dump when the guest crashes. If the management
> >> app does not do an auto dump, the guest's user can dump by hand if
> >> he sees the guest has panicked.
> >>
> >> I touch the hypervisor instead of using virtio-serial, because
> >> 1. it is simple
> >> 2. the virtio-serial is an optional device, and the guest may
> >>    not have such a device.
> >>
> >> Changes from v2 to v3:
> >> 1. correct spelling
> >>
> >> Changes from v1 to v2:
> >> 1. split up host and guest-side changes
> >> 2. introduce new request flag to avoid changing return values.
> > 
> > Hi all:
> > 
> > we need this feature, but we haven't decided how to implement it.
> > We have two solutions:
> > 1. use vmcall
> > 2. use virtio-serial.
> 
> Hi, all
> 
> There are three solutions now:
> 1. use vmcall
> 2. use I/O port
> 3. use virtio-serial.
> 
> I think 1 and 2 are simpler than 3.
> 
> I have been reading virtio-serial's driver in recent days. It seems that
> if the virtio serial port is not opened at the host side, the
> data written into this port will be discarded, and we will have
> no chance to know the guest has panicked.

The qemu-side implementation should exist within qemu itself; i.e. the
consumer of the data from the kernel-side will always have a
listener.  In this case, you don't have to worry about the port not being
opened on the host side.

> To Amit:
> 
> Can we write a message to a virtio serial port like this directly in
> kernel space?

As I had mentioned earlier, an in-kernel API is currently missing, but
it will be very simple to add one.
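
A hypothetical shape for such an in-kernel API, purely illustrative since
nothing like it is exported today (the name and the lookup helper below are
made up, only the send_buf() call mirrors the existing internal signature):

/*
 * Hypothetical exported wrapper around virtio_console's internal send path.
 * Name, lookup helper and error handling are illustrative only.
 */
int virtio_console_emergency_write(const char *name, const void *buf, size_t len)
{
	struct port *port = find_panic_port(name);	/* hypothetical lookup */

	if (!port)
		return -ENODEV;

	/* Non-blocking: a panicking guest cannot wait for host buffers. */
	return send_buf(port, (void *)buf, len, true);
}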

> send_buf(panicked_port, message, message's length, true);
> 
> if port->outvq_full is true, is it OK to do this?

port->outvq_full means guest has sent out data to the host, but the
host has not consumed the data yet, and has not released the buffers
back to the guest.

If you indeed reach such a situation (essentially you should never,
there are enough buffers already to account for scheduling delays),
then newer data will be discarded, or you will have to wait to write
the newer data till the host-side frees up buffers in the virtqueues.

This isn't really different from other approaches -- if you use a
shared buffer between guest and host, and if the guest has new data
before the host has had a chance to read off the older buffer
contents, you either overwrite or wait for the host to read the older
stuff.


Amit


KVM call agenda for April, Tuesday 3

2012-04-02 Thread Juan Quintela

Hi

Please send in any agenda items you are interested in covering.

Cheers,

Juan.



Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-04-02 Thread Raghavendra K T

On 04/02/2012 03:21 PM, Raghavendra K T wrote:

On 04/01/2012 07:23 PM, Avi Kivity wrote:
 > On 04/01/2012 04:48 PM, Raghavendra K T wrote:
  I have patch something like below in mind to try:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5127668..3fa912a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1557,12 +1557,17 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
mark_page_dirty_in_slot(kvm, memslot, gfn);
}

+#define YIELD_THRESHOLD 2048
+static void kvm_vcpu_try_yield_to(struct kvm_vcpu *me);
/*
* The vCPU has executed a HLT instruction with in-kernel mode enabled.
*/
void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{

[...]

+ if (loop_count++ % YIELD_THRESHOLD)
+ schedule();
+ else
+ kvm_vcpu_try_yield_to(vcpu);
}

+static void kvm_vcpu_try_yield(struct kvm_vcpu *me)


Yes, it is kvm_vcpu_try_yield_to. I had changed the name just before
sending. Sorry.



Re: High Level Proof of The Correctness of Dirty Page Tracking

2012-04-02 Thread Takuya Yoshikawa
On Sun, 01 Apr 2012 11:38:17 +0800
Xiao Guangrong  wrote:

> > About MMU Page Fault Path:
> > 1. Set bit in dirty bitmap
> > 2. Make spte writable
> > 3. Guest re-execute the write
> > 
> > If GET_DIRTY_LOG is allowed to write protect the page between step 1 and 2,
> > that page will be made writable again at step 2 and the write at step 3 will
> > not be caught.  Since the userspace can process that page before step 3, the
> > written data may be lost.  To avoid this mmu lock must be held correctly in
> > both sides as the current implementation does.
> > 
> 
> 
> Hmm, according to (*2), if we set the dirty bit after making the spte writable,
> it should be safe without holding the mmu-lock?
> 

Yes, probably.

But I just checked the current code.
Not sure about your fast page fault path.

Takuya


Re: [Qemu-devel] live migration between qemu-kvm 1.0 and 0.15

2012-04-02 Thread Markus Armbruster
Anthony Liguori  writes:

> So, since we're approaching 1.1, we should really discuss release
> criteria for 1.1 with respect to live migration.  I'd prefer to avoid
> surprises in this release.
>
> My expectation is that migration works from:
>
> qemu-1.0 -M 1.0   =>   qemu-1.1 -M 1.1
> qemu-1.1 -M 1.0   <=   qemu-1.1 -M 1.0
>
> I would expect that migration works from:
>
> qemu-0.15 -M 0.15   =>   qemu-1.1 -M 0.15
>
> I'm okay if this fails gracefully:
>
> qemu-1.1 -M 0.15   <=   qemu-0.15 -M 0.15

Until we have tools to mechanically verify migratability, such
expectations will remain just that: expectations :)

Vmstate describes the migration format (sort of), vmstate is data,
computers can process data, so at least some mechanical verification
should be possible, shouldn't it?  And even something that can only
disprove migratability, but not prove it, could be incredibly useful.



Re: [PATCH] virtio_blk: Drop unused request tracking list

2012-04-02 Thread Stefan Hajnoczi
On Sat, Mar 31, 2012 at 11:22 AM, Asias He  wrote:
> Hi, Stefan
>
> On Fri, Mar 30, 2012 at 6:14 PM, Stefan Hajnoczi  wrote:
>> On Fri, Mar 30, 2012 at 4:24 AM, Asias He  wrote:
>>> Benchmark shows small performance improvement on fusion io device.
>>>
>>> Before:
>>>  seq-read : io=1,024MB, bw=19,982KB/s, iops=39,964, runt= 52475msec
>>>  seq-write: io=1,024MB, bw=20,321KB/s, iops=40,641, runt= 51601msec
>>>  rnd-read : io=1,024MB, bw=15,404KB/s, iops=30,808, runt= 68070msec
>>>  rnd-write: io=1,024MB, bw=14,776KB/s, iops=29,552, runt= 70963msec
>>>
>>> After:
>>>  seq-read : io=1,024MB, bw=20,343KB/s, iops=40,685, runt= 51546msec
>>>  seq-write: io=1,024MB, bw=20,803KB/s, iops=41,606, runt= 50404msec
>>>  rnd-read : io=1,024MB, bw=16,221KB/s, iops=32,442, runt= 64642msec
>>>  rnd-write: io=1,024MB, bw=15,199KB/s, iops=30,397, runt= 68991msec
>>>
>>> Signed-off-by: Asias He 
>>> ---
>>>  drivers/block/virtio_blk.c |   10 --
>>>  1 files changed, 0 insertions(+), 10 deletions(-)
>>
>> Thanks for providing performance results.  It's a bit scary that this
>> unused list has an impact...I'm sure we have worse things elsewhere in
>> the KVM storage code path.
>
> Do you find any worse things? I saw your trace work here:
>
>   http://www.linux-kvm.org/page/Virtio/Block/Latency

I haven't updated those results in a long time because I no longer use
that benchmarking environment.

Stefan


Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming

2012-04-02 Thread Tejun Heo
Hello,

On Mon, Apr 02, 2012 at 10:20:09AM +0300, Michael S. Tsirkin wrote:
> Please don't rename virtio disks, it is way too late for that:
> the virtio block driver was merged around 2007, it is not new by
> any measure, and there are many systems out there using
> the current naming scheme.

There's no point in renaming device nodes of already deployed drivers.
It'll cause a lot of pain.

> > And how about the rssd in the patch 3 then?
> 
> Probably same. Renaming existing devices will break setups.
> I think the idea is to avoid using the
> legacy naming in new drivers *that will be added from now on*.

Yeap.

Thanks.

-- 
tejun


Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming

2012-04-02 Thread James Bottomley
On Mon, 2012-04-02 at 11:52 -0700, Tejun Heo wrote:
> > Probably same. Renaming existing devices will break setups.
> > I think the idea is to avoid using the
> > legacy naming in new drivers *that will be added from now on*.
> 
> Yeap.

So if we're agreed no other devices going forwards should ever use this
interface, is there any point unifying the interface?  No matter how
many caveats you hedge it round with, putting the API in a central place
will be a bit like a honey trap for careless bears.   It might be safer
just to leave it buried in the three current drivers.

James




Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming

2012-04-02 Thread Tejun Heo
Hello, James.

On Mon, Apr 02, 2012 at 11:56:18AM -0700, James Bottomley wrote:
> So if we're agreed no other devices going forwards should ever use this
> interface, is there any point unifying the interface?  No matter how
> many caveats you hedge it round with, putting the API in a central place
> will be a bit like a honey trap for careless bears.   It might be safer
> just to leave it buried in the three current drivers.

Yeah, that was my hope, but I think it would be easier to enforce by
having a common function which is clearly marked legacy, so that new
driver writers can go look for the naming code in the existing ones and
find out they're all using the same function, which is marked legacy
and explains what to do for newer drivers.

Thanks.

-- 
tejun


Re: KVM call agenda for April, Tuesday 3

2012-04-02 Thread Jan Kiszka
On 2012-04-02 13:30, Juan Quintela wrote:
> 
> Hi
> 
> Please send in any agenda items you are interested in covering.

- MSI injection to KVM irqchips from userspace devices models:
  new direct MSI injection API for KVM and the right model for QEMU
  to deal with the old API

- state of and plans for qemu-kvm tree
  (I'd like to proceed with cleanups)

But maybe both topics are just stuck due to limited bandwidth on the
maintainer side. Some clarification would be good, though, as these
things are blocking me ATM.

Jan





[RFC PATCH 2/3] iommu: Create basic group infrastructure and update AMD-Vi & Intel VT-d

2012-04-02 Thread Alex Williamson
IOMMU groups define the minimum granularity of the IOMMU.  We therefore
create groups using a dma_dev which is the effective requestor ID for
the entire group.  Additional devices can be added to groups based on
system topology, IOMMU limitations, or device quirks.

This implementation also includes a simple idr database for creating
a flat address space of group IDs across multiple IOMMUs.  Updates
included for Intel VT-d, using example iommu callbacks for adding and
removing devices, as well as AMD-Vi, which tracks devices on its own.
We should be able to better integrate the iommu_group within existing
AMD-Vi structs or provide a private data location within the iommu_group
where we can eventually reduce redundancy.

Signed-off-by: Alex Williamson 
---

 drivers/iommu/amd_iommu.c   |   50 ++-
 drivers/iommu/intel-iommu.c |   76 +
 drivers/iommu/iommu.c   |  198 ++-
 include/linux/iommu.h   |   23 +
 4 files changed, 267 insertions(+), 80 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index f75e060..876db28 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -256,9 +256,10 @@ static bool check_device(struct device *dev)
 
 static int iommu_init_device(struct device *dev)
 {
-   struct pci_dev *pdev = to_pci_dev(dev);
+   struct pci_dev *dma_pdev, *pdev = to_pci_dev(dev);
struct iommu_dev_data *dev_data;
u16 alias;
+   int ret;
 
if (dev->archdata.iommu)
return 0;
@@ -279,6 +280,26 @@ static int iommu_init_device(struct device *dev)
return -ENOTSUPP;
}
dev_data->alias_data = alias_data;
+
+   dma_pdev = pci_get_bus_and_slot(alias >> 8, alias & 0xff);
+   } else
+   dma_pdev = pdev;
+
+   /* dma_pdev = iommu_pci_quirk(dma_pdev) */
+
+   if (!dma_pdev->dev.iommu_group) {
+   struct iommu_group *group;
+
+   group = iommu_group_alloc(&dma_pdev->dev);
+   if (IS_ERR(group))
+   return PTR_ERR(group);
+   }
+
+   ret = iommu_group_add_device(dma_pdev->dev.iommu_group, dev);
+   if (ret) {
+   if (iommu_group_empty(dma_pdev->dev.iommu_group))
+   iommu_group_free(dma_pdev->dev.iommu_group);
+   return ret;
}
 
if (pci_iommuv2_capable(pdev)) {
@@ -309,6 +330,12 @@ static void iommu_ignore_device(struct device *dev)
 
 static void iommu_uninit_device(struct device *dev)
 {
+   struct iommu_group *group = dev->iommu_group;
+
+   iommu_group_remove_device(dev);
+   if (iommu_group_empty(group))
+   iommu_group_free(group);
+
/*
 * Nothing to do here - we keep dev_data around for unplugged devices
 * and reuse it when the device is re-plugged - not doing so would
@@ -3191,26 +3218,6 @@ static int amd_iommu_domain_has_cap(struct iommu_domain 
*domain,
return 0;
 }
 
-static int amd_iommu_device_group(struct device *dev, unsigned int *groupid)
-{
-   struct iommu_dev_data *dev_data = dev->archdata.iommu;
-   struct pci_dev *pdev = to_pci_dev(dev);
-   u16 devid;
-
-   if (!dev_data)
-   return -ENODEV;
-
-   if (pdev->is_virtfn || !iommu_group_mf)
-   devid = dev_data->devid;
-   else
-   devid = calc_devid(pdev->bus->number,
-  PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
-
-   *groupid = amd_iommu_alias_table[devid];
-
-   return 0;
-}
-
 static struct iommu_ops amd_iommu_ops = {
.domain_init = amd_iommu_domain_init,
.domain_destroy = amd_iommu_domain_destroy,
@@ -3220,7 +3227,6 @@ static struct iommu_ops amd_iommu_ops = {
.unmap = amd_iommu_unmap,
.iova_to_phys = amd_iommu_iova_to_phys,
.domain_has_cap = amd_iommu_domain_has_cap,
-   .device_group = amd_iommu_device_group,
.pgsize_bitmap  = AMD_IOMMU_PGSIZES,
 };
 
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index c9c6053..41ab7d0 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4075,54 +4075,59 @@ static int intel_iommu_domain_has_cap(struct 
iommu_domain *domain,
return 0;
 }
 
-/*
- * Group numbers are arbitrary.  Device with the same group number
- * indicate the iommu cannot differentiate between them.  To avoid
- * tracking used groups we just use the seg|bus|devfn of the lowest
- * level we're able to differentiate devices
- */
-static int intel_iommu_device_group(struct device *dev, unsigned int *groupid)
+static int intel_iommu_add_device(struct device *dev)
 {
struct pci_dev *pdev = to_pci_dev(dev);
-   struct pci_dev *bridge;
-   union {
-   struct {
-   u8 devfn;
-   u8 bus;
-   u16 segment;
-   }

[RFC PATCH 3/3] iommu: Create attach/detach group interface

2012-04-02 Thread Alex Williamson
IOMMU ops should be working at a group level rather than a device
level as we cannot arbitrarily assign devices from the same group
to different domains.  For now this is just a simple wrapper that
makes use of the dma_dev within a group.

Signed-off-by: Alex Williamson 
---

 drivers/iommu/iommu.c |   12 
 include/linux/iommu.h |   15 +++
 2 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 6a51ed9..af77da1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -327,6 +327,18 @@ void iommu_detach_device(struct iommu_domain *domain, 
struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_detach_device);
 
+int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+   return iommu_attach_device(domain, group->dma_dev);
+}
+EXPORT_SYMBOL_GPL(iommu_attach_group);
+
+void iommu_detach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+   iommu_detach_device(domain, group->dma_dev);
+}
+EXPORT_SYMBOL_GPL(iommu_detach_group);
+
 phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
   unsigned long iova)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 24004d6..d4e9a7a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -107,6 +107,10 @@ extern int iommu_attach_device(struct iommu_domain *domain,
   struct device *dev);
 extern void iommu_detach_device(struct iommu_domain *domain,
struct device *dev);
+extern int iommu_attach_group(struct iommu_domain *domain,
+ struct iommu_group *group);
+extern void iommu_detach_group(struct iommu_domain *domain,
+  struct iommu_group *group);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t paddr, size_t size, int prot);
 extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
@@ -194,6 +198,17 @@ static inline void iommu_detach_device(struct iommu_domain 
*domain,
 {
 }
 
+static inline int iommu_attach_group(struct iommu_domain *domain,
+struct iommu_group *group)
+{
+   return -ENODEV;
+}
+
+static inline void iommu_detach_group(struct iommu_domain *domain,
+ struct iommu_group *group)
+{
+}
+
 static inline int iommu_map(struct iommu_domain *domain, unsigned long iova,
phys_addr_t paddr, int gfp_order, int prot)
 {

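A hypothetical caller of the new group interface (say, a future VFIO-style
driver) might use it roughly as follows; error handling is trimmed and the
group pointer simply comes from the dev->iommu_group field added earlier in
this series:

#include <linux/iommu.h>

static int assign_group_to_domain(struct bus_type *bus, struct device *dev)
{
	struct iommu_domain *domain;
	struct iommu_group *group = dev->iommu_group;	/* set up by the IOMMU driver */
	int ret;

	if (!group)
		return -ENODEV;

	domain = iommu_domain_alloc(bus);
	if (!domain)
		return -ENOMEM;

	/* Attach the whole group; devices in it cannot be split across domains. */
	ret = iommu_attach_group(domain, group);
	if (ret) {
		iommu_domain_free(domain);
		return ret;
	}

	/* ... iommu_map() guest memory into the domain here ... */

	iommu_detach_group(domain, group);
	iommu_domain_free(domain);
	return 0;
}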


[RFC PATCH 1/3] iommu: Introduce iommu_group

2012-04-02 Thread Alex Williamson
IOMMUs often do not have visibility of individual devices in the
system.  Due to IOMMU design, bus topology, or device quirks, we
can often only identify groups of devices.  Examples include
Intel VT-d & AMD-Vi which often have function level visibility
compared to POWER partitionable endpoints which have bridge level
granularity.  PCIe-to-PCI bridges also often cloud the IOMMU
visibility as it cannot distinguish devices behind the bridge.
Devices can also sometimes hurt themselves by initiating DMA using
the wrong source ID on a multifunction PCI device.

IOMMU groups are meant to help solve these problems and hopefully
become the working unit of the IOMMU API.

Signed-off-by: Alex Williamson 
---

 include/linux/device.h |2 ++
 include/linux/iommu.h  |5 +
 2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index b63fb39..6acab1c 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -35,6 +35,7 @@ struct subsys_private;
 struct bus_type;
 struct device_node;
 struct iommu_ops;
+struct iommu_group;
 
 struct bus_attribute {
struct attributeattr;
@@ -683,6 +684,7 @@ struct device {
const struct attribute_group **groups;  /* optional groups */
 
void(*release)(struct device *dev);
+   struct iommu_group  *iommu_group;
 };
 
 /* Get the wakeup routines, which depend on struct device */
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d937580..2ee375c 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -26,6 +26,7 @@
 #define IOMMU_CACHE(4) /* DMA cache coherency */
 
 struct iommu_ops;
+struct iommu_group;
 struct bus_type;
 struct device;
 struct iommu_domain;
@@ -78,6 +79,9 @@ struct iommu_ops {
unsigned long pgsize_bitmap;
 };
 
+struct iommu_group {
+};
+
 extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
 extern bool iommu_present(struct bus_type *bus);
 extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
@@ -140,6 +144,7 @@ static inline int report_iommu_fault(struct iommu_domain 
*domain,
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
+struct iommu_group {};
 
 static inline bool iommu_present(struct bus_type *bus)
 {



[RFC PATCH 0/3] IOMMU groups

2012-04-02 Thread Alex Williamson
This series attempts to make IOMMU device grouping a slightly more
integral part of the device model.  iommu_device_groups were originally
introduced to support the VFIO user space driver interface which needs
to understand the granularity of device isolation in order to ensure
security of devices when assigned for user access.  This information
was provided via a simple group identifier from the IOMMU driver allowing
VFIO to walk devices and assemble groups itself.

The feedback received from this was that groups should be the effective
unit of work for the IOMMU API.  The existing model of allowing domains
to be created and individual devices attached ignores many of the
restrictions of the IOMMU, whether by design, by topology or by defective
devices.  Additionally we should be able to use the grouping information
at the dma ops layer for managing domains and quirking devices.

This series is a sketch at implementing only those aspects and leaving
everything else about the multifaceted hairball of Isolation groups for
another API.  Please comment and let me know if this seems like the
direction we should be headed.  Thanks,

Alex


---

Alex Williamson (3):
  iommu: Create attach/detach group interface
  iommu: Create basic group infrastructure and update AMD-Vi & Intel VT-d
  iommu: Introduce iommu_group


 drivers/iommu/amd_iommu.c   |   50 ++
 drivers/iommu/intel-iommu.c |   76 
 drivers/iommu/iommu.c   |  210 ++-
 include/linux/device.h  |2 
 include/linux/iommu.h   |   43 +
 5 files changed, 301 insertions(+), 80 deletions(-)


kvm: RCU warning in async pf

2012-04-02 Thread Sasha Levin
Hi all,

I got the spew at the bottom of the mail in a KVM guest using the KVM tools and 
running trinity.

I'm not quite sure how default_idle managed to trigger a pagefault, so that 
part looks odd to me.

[12140.220507] ===
[12140.220507] [ INFO: suspicious RCU usage. ]
[12140.220507] 3.4.0-rc1-next-20120402-sasha #46 Tainted: GW   
[12140.220507] ---
[12140.220507] include/linux/rcupdate.h:729 rcu_read_lock() used illegally 
while idle!
[12140.220507] 
[12140.220507] other info that might help us debug this:
[12140.220507] 
[12140.220507] 
[12140.220507] RCU used illegally from idle CPU!
[12140.220507] rcu_scheduler_active = 1, debug_locks = 1
[12140.220507] RCU used illegally from extended quiescent state!
[12140.220507] 4 locks held by swapper/1/0:
[12140.253991]  #0:  (&(&async_pf_sleepers[i].lock)->rlock){..-...}, at: 
[] kvm_async_pf_task_wake+0x48/0x130
[12140.253991]  #1:  (&n.wq){..}, at: [] 
__wake_up+0x2d/0x70
[12140.253991]  #2:  (&p->pi_lock){-.-.-.}, at: [] 
try_to_wake_up+0x34/0x290
[12140.253991]  #3:  (rcu_read_lock){.+.+..}, at: [] 
select_task_rq_fair+0xb0/0x5c0
[12140.253991] 
[12140.253991] stack backtrace:
[12140.253991] Pid: 0, comm: swapper/1 Tainted: G    W    
3.4.0-rc1-next-20120402-sasha #46
[12140.253991] Call Trace:
[12140.253991]  [] lockdep_rcu_suspicious+0x113/0x130
[12140.253991]  [] select_task_rq_fair+0x115/0x5c0
[12140.253991]  [] ? select_task_rq_fair+0xb0/0x5c0
[12140.253991]  [] ? try_to_wake_up+0x34/0x290
[12140.253991]  [] ? try_to_wake_up+0x34/0x290
[12140.253991]  [] try_to_wake_up+0x191/0x290
[12140.253991]  [] ? __lock_acquired+0x1c0/0x210
[12140.253991]  [] default_wake_function+0xd/0x10
[12140.253991]  [] autoremove_wake_function+0x18/0x40
[12140.253991]  [] __wake_up_common+0x52/0x90
[12140.253991]  [] ? __wake_up+0x2d/0x70
[12140.253991]  [] __wake_up+0x43/0x70
[12140.253991]  [] apf_task_wake_one+0x87/0x90
[12140.253991]  [] kvm_async_pf_task_wake+0x75/0x130
[12140.253991]  [] do_async_page_fault+0x86/0xa0
[12140.253991]  [] async_page_fault+0x25/0x30
[12140.253991]  [] ? native_safe_halt+0x6/0x10
[12140.253991]  [] ? trace_hardirqs_on+0xd/0x10
[12140.253991]  [] default_idle+0x4d/0xa0
[12140.253991]  [] cpu_idle+0x11f/0x180
[12140.253991]  [] ? setup_APIC_timer+0x88/0x8c
[12140.253991]  [] start_secondary+0xe1/0xe8
[12140.253991] 
[12140.253991] ===
[12140.253991] [ INFO: suspicious RCU usage. ]
[12140.253991] 3.4.0-rc1-next-20120402-sasha #46 Tainted: GW   
[12140.253991] ---
[12140.253991] kernel/sched/fair.c:2716 suspicious rcu_dereference_check() 
usage!
[12140.253991] 
[12140.253991] other info that might help us debug this:
[12140.253991] 
[12140.253991] 
[12140.253991] RCU used illegally from idle CPU!
[12140.253991] rcu_scheduler_active = 1, debug_locks = 1
[12140.253991] RCU used illegally from extended quiescent state!
[12140.253991] 4 locks held by swapper/1/0:
[12140.253991]  #0:  (&(&async_pf_sleepers[i].lock)->rlock){..-...}, at: 
[] kvm_async_pf_task_wake+0x48/0x130
[12140.253991]  #1:  (&n.wq){..}, at: [] 
__wake_up+0x2d/0x70
[12140.253991]  #2:  (&p->pi_lock){-.-.-.}, at: [] 
try_to_wake_up+0x34/0x290
[12140.253991]  #3:  (rcu_read_lock){.+.+..}, at: [] 
select_task_rq_fair+0xb0/0x5c0
[12140.253991] 
[12140.253991] stack backtrace:
[12140.253991] Pid: 0, comm: swapper/1 Tainted: GW
3.4.0-rc1-next-20120402-sasha #46
[12140.253991] Call Trace:
[12140.253991]  [] lockdep_rcu_suspicious+0x113/0x130
[12140.253991]  [] select_task_rq_fair+0x17c/0x5c0
[12140.253991]  [] ? select_task_rq_fair+0xb0/0x5c0
[12140.253991]  [] ? try_to_wake_up+0x34/0x290
[12140.253991]  [] ? try_to_wake_up+0x34/0x290
[12140.253991]  [] try_to_wake_up+0x191/0x290
[12140.253991]  [] ? __lock_acquired+0x1c0/0x210
[12140.253991]  [] default_wake_function+0xd/0x10
[12140.253991]  [] autoremove_wake_function+0x18/0x40
[12140.253991]  [] __wake_up_common+0x52/0x90
[12140.253991]  [] ? __wake_up+0x2d/0x70
[12140.253991]  [] __wake_up+0x43/0x70
[12140.253991]  [] apf_task_wake_one+0x87/0x90
[12140.253991]  [] kvm_async_pf_task_wake+0x75/0x130
[12140.253991]  [] do_async_page_fault+0x86/0xa0
[12140.253991]  [] async_page_fault+0x25/0x30
[12140.253991]  [] ? native_safe_halt+0x6/0x10
[12140.253991]  [] ? trace_hardirqs_on+0xd/0x10
[12140.253991]  [] default_idle+0x4d/0xa0
[12140.253991]  [] cpu_idle+0x11f/0x180
[12140.253991]  [] ? setup_APIC_timer+0x88/0x8c
[12140.253991]  [] start_secondary+0xe1/0xe8
[12140.253991] 
[12140.253991] ===
[12140.253991] [ INFO: suspicious RCU usage. ]
[12140.253991] 3.4.0-rc1-next-20120402-sasha #46 Tainted: GW   
[12140.253991] ---
[12140.253991] kernel/sched/fair.c:2660 suspicious rcu_dereference_check() 
usage!
[12140.253991] 
[12140.253991] other info that might help us debug th

Re: [PATCH] virtio_blk: Drop unused request tracking list

2012-04-02 Thread Asias He

On 04/01/2012 06:07 PM, Michael S. Tsirkin wrote:

On Fri, Mar 30, 2012 at 11:24:10AM +0800, Asias He wrote:

Benchmark shows small performance improvement on fusion io device.

Before:
   seq-read : io=1,024MB, bw=19,982KB/s, iops=39,964, runt= 52475msec
   seq-write: io=1,024MB, bw=20,321KB/s, iops=40,641, runt= 51601msec
   rnd-read : io=1,024MB, bw=15,404KB/s, iops=30,808, runt= 68070msec
   rnd-write: io=1,024MB, bw=14,776KB/s, iops=29,552, runt= 70963msec

After:
   seq-read : io=1,024MB, bw=20,343KB/s, iops=40,685, runt= 51546msec
   seq-write: io=1,024MB, bw=20,803KB/s, iops=41,606, runt= 50404msec
   rnd-read : io=1,024MB, bw=16,221KB/s, iops=32,442, runt= 64642msec
   rnd-write: io=1,024MB, bw=15,199KB/s, iops=30,397, runt= 68991msec

Signed-off-by: Asias He


Thanks, the patch makes sense to me.
Acked-by: Michael S. Tsirkin

This is 3.5 material, correct?


Yes, I think so.




---
  drivers/block/virtio_blk.c |   10 --
  1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index c4a60ba..338da9a 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -29,9 +29,6 @@ struct virtio_blk
/* The disk structure for the kernel. */
struct gendisk *disk;

-   /* Request tracking. */
-   struct list_head reqs;
-
mempool_t *pool;

/* Process context for config space updates */
@@ -55,7 +52,6 @@ struct virtio_blk

  struct virtblk_req
  {
-   struct list_head list;
struct request *req;
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
@@ -99,7 +95,6 @@ static void blk_done(struct virtqueue *vq)
}

__blk_end_request_all(vbr->req, error);
-   list_del(&vbr->list);
mempool_free(vbr, vblk->pool);
}
/* In case queue is stopped waiting for more buffers. */
@@ -184,7 +179,6 @@ static bool do_req(struct request_queue *q, struct 
virtio_blk *vblk,
return false;
}

-   list_add_tail(&vbr->list,&vblk->reqs);
return true;
  }

@@ -408,7 +402,6 @@ static int __devinit virtblk_probe(struct virtio_device 
*vdev)
goto out_free_index;
}

-   INIT_LIST_HEAD(&vblk->reqs);
spin_lock_init(&vblk->lock);
vblk->vdev = vdev;
vblk->sg_elems = sg_elems;
@@ -571,9 +564,6 @@ static void __devexit virtblk_remove(struct virtio_device 
*vdev)
vblk->config_enable = false;
mutex_unlock(&vblk->config_lock);

-   /* Nothing should be pending. */
-   BUG_ON(!list_empty(&vblk->reqs));
-
/* Stop all the virtqueues. */
vdev->config->reset(vdev);

--
1.7.7.6




--
Asias