date:20120503

[PATCH v2] virtio-blk: Fix hot-unplug race in remove method

2012-05-03 Thread Asias He

If we reset the virtio-blk device before the requests already dispatched
to the virtio-blk driver from the block layer are finised, we will stuck
in blk_cleanup_queue() and the remove will fail.

blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
before DEAD marking. However it will never success if the device is
already stopped. We'll have q->in_flight[] > 0, so the drain will not
finish.

How to reproduce the race:
1. hot-plug a virtio-blk device
2. keep reading/writing the device in guest
3. hot-unplug while the device is busy serving I/O

Changes in v2:
- Drop req_in_flight
- Use virtqueue_detach_unused_buf to get request dispatched to driver

Signed-off-by: Asias He 
---
 drivers/block/virtio_blk.c |   16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 72fe55d..670c28f 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -443,7 +443,7 @@ static int __devinit virtblk_probe(struct virtio_device 
*vdev)
if (err)
goto out_free_vblk;
 
-   vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
+   vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
if (!vblk->pool) {
err = -ENOMEM;
goto out_free_vq;
@@ -576,20 +576,30 @@ static void __devexit virtblk_remove(struct virtio_device 
*vdev)
 {
struct virtio_blk *vblk = vdev->priv;
int index = vblk->index;
+   struct virtblk_req *vbr;
 
/* Prevent config work handler from accessing the device. */
mutex_lock(&vblk->config_lock);
vblk->config_enable = false;
mutex_unlock(&vblk->config_lock);
 
+   /* Abort all request on the queue. */
+   blk_abort_queue(vblk->disk->queue);
+   del_gendisk(vblk->disk);
+
/* Stop all the virtqueues. */
vdev->config->reset(vdev);
-
flush_work(&vblk->config_work);
 
-   del_gendisk(vblk->disk);
+   /* Abort request dispatched to driver. */
+   while ((vbr = virtqueue_detach_unused_buf(vblk->vq))) {
+   blk_abort_request(vbr->req);
+   mempool_free(vbr, vblk->pool);
+   }
+
blk_cleanup_queue(vblk->disk->queue);
put_disk(vblk->disk);
+
mempool_destroy(vblk->pool);
vdev->config->del_vqs(vdev);
kfree(vblk);
-- 
1.7.10

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: using cache for virtio allocations?

2012-05-03 Thread Michael S. Tsirkin

On Thu, May 03, 2012 at 07:51:18AM +0200, Sasha Levin wrote:
> On Thu, May 3, 2012 at 7:29 AM, Michael S. Tsirkin  wrote:
> > Sasha, didn't you have a patch to allocate
> > things using cache in virtio core?
> > What happened to it?
> >
> > Thanks,
> > MST
> 
> It got stuck due to several things, and I got sidetracked, sorry. Here
> are the outstanding issues:
> 
> 1. Since now we can allocate a descriptor either using kmalloc or from
> the cache, we need a new flag in vring_desc to know how to free it, it
> seems a bit too intrusive,
> and I couldn't thing of a better
> alternative.

Since that is guest visible it does not sound great, I agree.

Three ideas:
1. The logic looks at descriptor size so can we just read
   desc.len before free and rerun the same math?
2. For -net the requests are up to max_skb_frags + 2 in size, right?
   Does it make sense to just use cache for net, always?
   That would mean a per device flag.
3. Allocate a bit more and stick extra data before the 1st descriptor.

> 2. Rusty has pointed out that no one is going to modify the default
> value we set, and we don't really have a good default value to put
> there (at least, we haven't agreed on a specific value). Also, you
> have noted that it should be a per-device value, which complicates
> this question further since we probably want a different value for
> each device type.
> 
> While the first one can be solved easily with a blessing from the
> maintainers, the second one will require testing on various platforms,
> configurations and devices to select either the best "magic" value, or
> the best algorithm to play with threshold.

Not sure about platforms but for devices that's right.
But this really only means we only change what we tested.
eg see what is good for net and change net in a way
that others will keep using old code.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] virtio-blk: Fix hot-unplug race in remove method

2012-05-03 Thread Asias He


On 05/03/2012 01:02 PM, Sasha Levin wrote:

On Thu, May 3, 2012 at 4:19 AM, Asias He  wrote:

@@ -190,6 +194,7 @@ static void do_virtblk_request(struct request_queue *q)

while ((req = blk_peek_request(q)) != NULL) {
BUG_ON(req->nr_phys_segments + 2>  vblk->sg_elems);
+   vblk->req_in_flight++;

/* If this request fails, stop queue and wait for something to
   finish to restart it. */


This is being increased before we know if the request will actually be
sent, so if do_req() fails afterwards, req_in_flight would be
increased but the request will never be sent.

Which means we won't be able to unplug the device ever.


Yes, you are right. This introduces another race. I could do 
vblk->req_in_flight++ right after blk_start_request(req) to avoid this 
race.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
Asias
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] virtio-blk: Fix hot-unplug race in remove method

2012-05-03 Thread Asias He


On 05/03/2012 01:27 PM, Michael S. Tsirkin wrote:

On Thu, May 03, 2012 at 10:19:52AM +0800, Asias He wrote:

If we reset the virtio-blk device before the requests already dispatched
to the virtio-blk driver from the block layer are finised, we will stuck
in blk_cleanup_queue() and the remove will fail.

blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
before DEAD marking. However it will never success if the device is
already stopped. We'll have q->in_flight[]>  0, so the drain will not
finish.

How to reproduce the race:
1. hot-plug a virtio-blk device
2. keep reading/writing the device in guest
3. hot-unplug while the device is busy serving I/O

Signed-off-by: Asias He


We used to do similar tracking in -net but dropped it all by using the
tracking that virtio core does.  Can't blk do the same?  Isn't there
some way to use virtqueue_detach_unused_buf for this instead?


It is much simpler to use virtqueue_detach_unused_buf. Thanks, Michael.


---
  drivers/block/virtio_blk.c |   26 ++
  1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 72fe55d..72b818b 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -46,6 +46,9 @@ struct virtio_blk
/* Ida index - used to track minor number allocations. */
int index;

+   /* Number of pending requests dispatched to driver. */
+   int req_in_flight;
+
/* Scatterlist: can be too big for stack. */
struct scatterlist sg[/*sg_elems*/];
  };
@@ -95,6 +98,7 @@ static void blk_done(struct virtqueue *vq)
}

__blk_end_request_all(vbr->req, error);
+   vblk->req_in_flight--;
mempool_free(vbr, vblk->pool);
}
/* In case queue is stopped waiting for more buffers. */
@@ -190,6 +194,7 @@ static void do_virtblk_request(struct request_queue *q)

while ((req = blk_peek_request(q)) != NULL) {
BUG_ON(req->nr_phys_segments + 2>  vblk->sg_elems);
+   vblk->req_in_flight++;

/* If this request fails, stop queue and wait for something to
   finish to restart it. */
@@ -443,7 +448,7 @@ static int __devinit virtblk_probe(struct virtio_device 
*vdev)
if (err)
goto out_free_vblk;

-   vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
+   vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
if (!vblk->pool) {
err = -ENOMEM;
goto out_free_vq;
@@ -466,6 +471,7 @@ static int __devinit virtblk_probe(struct virtio_device 
*vdev)

virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);

+   vblk->req_in_flight = 0;
vblk->disk->major = major;
vblk->disk->first_minor = index_to_minor(index);
vblk->disk->private_data = vblk;
@@ -576,22 +582,34 @@ static void __devexit virtblk_remove(struct virtio_device 
*vdev)
  {
struct virtio_blk *vblk = vdev->priv;
int index = vblk->index;
+   unsigned long flags;
+   int req_in_flight;

/* Prevent config work handler from accessing the device. */
mutex_lock(&vblk->config_lock);
vblk->config_enable = false;
mutex_unlock(&vblk->config_lock);

+   /* Abort all request on the queue. */
+   blk_abort_queue(vblk->disk->queue);
+   del_gendisk(vblk->disk);
+
/* Stop all the virtqueues. */
vdev->config->reset(vdev);
-
+   vdev->config->del_vqs(vdev);
flush_work(&vblk->config_work);

-   del_gendisk(vblk->disk);
+   /* Wait requests dispatched to device driver to finish. */
+   do {
+   spin_lock_irqsave(&vblk->lock, flags);
+   req_in_flight = vblk->req_in_flight;
+   spin_unlock_irqrestore(&vblk->lock, flags);
+   } while (req_in_flight != 0);
+
blk_cleanup_queue(vblk->disk->queue);
put_disk(vblk->disk);
+
mempool_destroy(vblk->pool);
-   vdev->config->del_vqs(vdev);
kfree(vblk);
ida_simple_remove(&vd_index_ida, index);
  }
--
1.7.10

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
Asias
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2] virtio-blk: Fix hot-unplug race in remove method

2012-05-03 Thread Michael S. Tsirkin

On Thu, May 03, 2012 at 03:30:47PM +0800, Asias He wrote:
> If we reset the virtio-blk device before the requests already dispatched
> to the virtio-blk driver from the block layer are finised, we will stuck
> in blk_cleanup_queue() and the remove will fail.
> 
> blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
> before DEAD marking. However it will never success if the device is
> already stopped. We'll have q->in_flight[] > 0, so the drain will not
> finish.
> 
> How to reproduce the race:
> 1. hot-plug a virtio-blk device
> 2. keep reading/writing the device in guest
> 3. hot-unplug while the device is busy serving I/O
> 
> Changes in v2:
> - Drop req_in_flight
> - Use virtqueue_detach_unused_buf to get request dispatched to driver
> 
> Signed-off-by: Asias He 
> ---
>  drivers/block/virtio_blk.c |   16 +---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 72fe55d..670c28f 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -443,7 +443,7 @@ static int __devinit virtblk_probe(struct virtio_device 
> *vdev)
>   if (err)
>   goto out_free_vblk;
>  
> - vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
> + vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
>   if (!vblk->pool) {
>   err = -ENOMEM;
>   goto out_free_vq;

Would be a bit easier to review if whitespace changes
are avoided, and done in a separate patch targeting 3.5.

> @@ -576,20 +576,30 @@ static void __devexit virtblk_remove(struct 
> virtio_device *vdev)
>  {
>   struct virtio_blk *vblk = vdev->priv;
>   int index = vblk->index;
> + struct virtblk_req *vbr;
>  
>   /* Prevent config work handler from accessing the device. */
>   mutex_lock(&vblk->config_lock);
>   vblk->config_enable = false;
>   mutex_unlock(&vblk->config_lock);
>  
> + /* Abort all request on the queue. */

All requests

Also, the comment isn't
really helpful. Want to explain why we abort
them all here? Won't they be drained
by later detach code?


> + blk_abort_queue(vblk->disk->queue);
> + del_gendisk(vblk->disk);
> +
>   /* Stop all the virtqueues. */
>   vdev->config->reset(vdev);
> -
>   flush_work(&vblk->config_work);
>  
> - del_gendisk(vblk->disk);

Is there a reason you move del_gendisk to before reset?
Is it safe to del_gendisk while we might
still be getting callbacks from the device?

> + /* Abort request dispatched to driver. */
> + while ((vbr = virtqueue_detach_unused_buf(vblk->vq))) {
> + blk_abort_request(vbr->req);
> + mempool_free(vbr, vblk->pool);
> + }
> +
>   blk_cleanup_queue(vblk->disk->queue);
>   put_disk(vblk->disk);
> +
>   mempool_destroy(vblk->pool);
>   vdev->config->del_vqs(vdev);
>   kfree(vblk);
> -- 
> 1.7.10
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm/powerpc: Add new ioctl to retreive server MMU infos

2012-05-03 Thread Alexander Graf


On 27.04.2012, at 07:43, Benjamin Herrenschmidt wrote:

> This is necessary for qemu to be able to pass the right information
> to the guest, such as the supported page sizes and corresponding
> encodings in the SLB and hash table, which can vary depending
> on the processor type, the type of KVM used (PR vs HV) and the
> version of KVM
> 
> Signed-off-by: Benjamin Herrenschmidt 

Thanks, applied (with fixes) to kvm-ppc-next.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC] sched: make callers check lock contention for cond_resched_lock()

2012-05-03 Thread Takuya Yoshikawa

This patch is for showing what I am thinking and only compile tested
on linux-next, so an RFC.

Although I might misread something, I am not sure whether every user of
this API wanted to avoid contention check without CONFIG_PREEMPT.

Any comments will be appreciated.

Thanks,
Takuya

===
From: Takuya Yoshikawa 

While doing kvm development, we found a case in which we wanted to break
a critical section on lock contention even without CONFIG_PREEMPT.

Although we can do that using spin_is_contended() and cond_resched(),
changing cond_resched_lock() to satisfy such a need is another option.

Signed-off-by: Takuya Yoshikawa 
---
 arch/x86/kvm/mmu.c   |3 ++-
 fs/btrfs/extent_io.c |2 +-
 fs/btrfs/inode.c |3 ++-
 fs/btrfs/ordered-data.c  |3 ++-
 fs/btrfs/relocation.c|3 ++-
 fs/dcache.c  |3 ++-
 fs/fscache/object.c  |3 ++-
 fs/jbd/commit.c  |6 --
 fs/jbd2/commit.c |3 ++-
 fs/nfs/nfs4filelayout.c  |2 +-
 fs/nfs/write.c   |2 +-
 fs/ocfs2/dlm/dlmdomain.c |5 +++--
 fs/ocfs2/dlm/dlmthread.c |3 ++-
 fs/reiserfs/journal.c|4 ++--
 include/linux/sched.h|6 +++---
 kernel/sched/core.c  |4 ++--
 16 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 07424cf..3361ee3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1704,7 +1704,8 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
mmu_pages_clear_parents(&parents);
}
kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
-   cond_resched_lock(&vcpu->kvm->mmu_lock);
+   cond_resched_lock(&vcpu->kvm->mmu_lock,
+ spin_is_contended(&vcpu->kvm->mmu_lock));
kvm_mmu_pages_init(parent, &parents, &pages);
}
 }
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 198c2ba..cfcc233 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -675,7 +675,7 @@ again:
if (start > end)
break;
 
-   cond_resched_lock(&tree->lock);
+   cond_resched_lock(&tree->lock, spin_needbreak(&tree->lock));
}
 out:
spin_unlock(&tree->lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 61b16c6..16a6173 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3985,7 +3985,8 @@ again:
goto again;
}
 
-   if (cond_resched_lock(&root->inode_lock))
+   if (cond_resched_lock(&root->inode_lock,
+ spin_needbreak(&root->inode_lock)))
goto again;
 
node = rb_next(node);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index bbf6d0d..1dfcd6d 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -485,7 +485,8 @@ void btrfs_wait_ordered_extents(struct btrfs_root *root,
!test_bit(BTRFS_ORDERED_PREALLOC, &ordered->flags)) {
list_move(&ordered->root_extent_list,
  &root->fs_info->ordered_extents);
-   cond_resched_lock(&root->fs_info->ordered_extent_lock);
+   cond_resched_lock(&root->fs_info->ordered_extent_lock,
+   
spin_needbreak(&root->fs_info->ordered_extent_lock));
continue;
}
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 646ee21..6102a62 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1471,7 +1471,8 @@ again:
}
 
objectid = btrfs_ino(&entry->vfs_inode) + 1;
-   if (cond_resched_lock(&root->inode_lock))
+   if (cond_resched_lock(&root->inode_lock,
+ spin_needbreak(&root->inode_lock)))
goto again;
 
node = rb_next(node);
diff --git a/fs/dcache.c b/fs/dcache.c
index 58a6ecf..dccfa62 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -855,7 +855,8 @@ relock:
if (!--count)
break;
}
-   cond_resched_lock(&dcache_lru_lock);
+   cond_resched_lock(&dcache_lru_lock,
+ spin_needbreak(&dcache_lru_lock));
}
if (!list_empty(&referenced))
list_splice(&referenced, &sb->s_dentry_lru);
diff --git a/fs/fscache/object.c b/fs/fscache/object.c
index b6b897c..9db99c6 100644
--- a/fs/fscache/object.c
+++ b/fs/fscache/object.c
@@ -824,7 +824,8 @@ static void fscache_enqueue_dependents(struct 
fscache_object *object)
fscache_put_object(dep);
 
if (!list_empty(&object->dependents))
-   cond_resched_lock(&object->lock);
+   cond_resched_lock(&object->lock

vhost-net: is there a race for sock in handle_tx/rx?

2012-05-03 Thread Liu ping fan

Hi,

During reading the vhost-net code, I find the following,

static void handle_tx(struct vhost_net *net)
{
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
unsigned out, in, s;
int head;
struct msghdr msg = {
.msg_name = NULL,
.msg_namelen = 0,
.msg_control = NULL,
.msg_controllen = 0,
.msg_iov = vq->iov,
.msg_flags = MSG_DONTWAIT,
};
size_t len, total_len = 0;
int err, wmem;
size_t hdr_size;
struct socket *sock;
struct vhost_ubuf_ref *uninitialized_var(ubufs);
bool zcopy;

/* TODO: check that we are running from vhost_worker? */
sock = rcu_dereference_check(vq->private_data, 1);
if (!sock)
return;

   > Qemu calls
vhost_net_set_backend() to set a new backend fd, and close
@oldsock->file. And  sock->file refcnt==0.

  Can vhost_worker prevent
itself from such situation? And how?

wmem = atomic_read(&sock->sk->sk_wmem_alloc);
   .

Is it a race?

Thanks and regards,
pingfan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] sched: make callers check lock contention for cond_resched_lock()

2012-05-03 Thread Peter Zijlstra

On Thu, 2012-05-03 at 17:12 +0900, Takuya Yoshikawa wrote:
> 
> Although we can do that using spin_is_contended() and cond_resched(),
> changing cond_resched_lock() to satisfy such a need is another option.
> 
Yeah, not a pretty patch. Changing all cond_resched_lock() sites just to
change one and in such an ugly way too.

So what's the impact of making spin_needbreak() work for !PREEMPT?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] async PF event can fail to wakeup vcpu from halt

2012-05-03 Thread Gleb Natapov

If vcpu executes hlt instruction while async PF is waiting to be delivered
vcpu can block and deliver async PF only after another even wakes it
up. This happens because kvm_check_async_pf_completion() will remove
completion event from vcpu->async_pf.done before entering kvm_vcpu_block()
and this will make kvm_arch_vcpu_runnable() return false. The solution
is to make vcpu runnable when processing completion.

Signed-off-by: Gleb Natapov 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4de705c..6d44308 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6611,6 +6611,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
kvm_inject_page_fault(vcpu, &fault);
}
vcpu->arch.apf.halted = false;
+   vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 }
 
 bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: using cache for virtio allocations?

2012-05-03 Thread Sasha Levin

On Thu, May 3, 2012 at 9:32 AM, Michael S. Tsirkin  wrote:
> On Thu, May 03, 2012 at 07:51:18AM +0200, Sasha Levin wrote:
>> On Thu, May 3, 2012 at 7:29 AM, Michael S. Tsirkin  wrote:
>> > Sasha, didn't you have a patch to allocate
>> > things using cache in virtio core?
>> > What happened to it?
>> >
>> > Thanks,
>> > MST
>>
>> It got stuck due to several things, and I got sidetracked, sorry. Here
>> are the outstanding issues:
>>
>> 1. Since now we can allocate a descriptor either using kmalloc or from
>> the cache, we need a new flag in vring_desc to know how to free it, it
>> seems a bit too intrusive,
>> and I couldn't thing of a better
>> alternative.
>
> Since that is guest visible it does not sound great, I agree.
>
> Three ideas:
> 1. The logic looks at descriptor size so can we just read
>   desc.len before free and rerun the same math?

It'll break every time the value is changed (either by the user or by
some dynamic algorithm thingie).

> 2. For -net the requests are up to max_skb_frags + 2 in size, right?
>   Does it make sense to just use cache for net, always?
>   That would mean a per device flag.

Yup, it could work.

> 3. Allocate a bit more and stick extra data before the 1st descriptor.

I guess it'll work, but it just seems a bit ugly :)

>> 2. Rusty has pointed out that no one is going to modify the default
>> value we set, and we don't really have a good default value to put
>> there (at least, we haven't agreed on a specific value). Also, you
>> have noted that it should be a per-device value, which complicates
>> this question further since we probably want a different value for
>> each device type.
>>
>> While the first one can be solved easily with a blessing from the
>> maintainers, the second one will require testing on various platforms,
>> configurations and devices to select either the best "magic" value, or
>> the best algorithm to play with threshold.
>
> Not sure about platforms but for devices that's right.
> But this really only means we only change what we tested.
> eg see what is good for net and change net in a way
> that others will keep using old code.

It'll work only if there will be someone following up and actually
testing it, since regular users won't be testing it at all (with it
being defaulted to off and everything).
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: vhost-net: is there a race for sock in handle_tx/rx?

2012-05-03 Thread Michael S. Tsirkin

On Thu, May 03, 2012 at 04:33:55PM +0800, Liu ping fan wrote:
> Hi,
> 
> During reading the vhost-net code, I find the following,
> 
> static void handle_tx(struct vhost_net *net)
> {
>   struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
>   unsigned out, in, s;
>   int head;
>   struct msghdr msg = {
>   .msg_name = NULL,
>   .msg_namelen = 0,
>   .msg_control = NULL,
>   .msg_controllen = 0,
>   .msg_iov = vq->iov,
>   .msg_flags = MSG_DONTWAIT,
>   };
>   size_t len, total_len = 0;
>   int err, wmem;
>   size_t hdr_size;
>   struct socket *sock;
>   struct vhost_ubuf_ref *uninitialized_var(ubufs);
>   bool zcopy;
> 
>   /* TODO: check that we are running from vhost_worker? */
>   sock = rcu_dereference_check(vq->private_data, 1);
>   if (!sock)
>   return;
> 
>> Qemu calls
> vhost_net_set_backend() to set a new backend fd, and close
> @oldsock->file. And  sock->file refcnt==0.
> 
>   Can vhost_worker prevent
> itself from such situation? And how?
> 
>   wmem = atomic_read(&sock->sk->sk_wmem_alloc);
>
> .
> 
> Is it a race?
> 
> Thanks and regards,
> pingfan

See comment before void __rcu *private_data in vhost.h


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: using cache for virtio allocations?

2012-05-03 Thread Michael S. Tsirkin

On Thu, May 03, 2012 at 10:38:56AM +0200, Sasha Levin wrote:
> On Thu, May 3, 2012 at 9:32 AM, Michael S. Tsirkin  wrote:
> > On Thu, May 03, 2012 at 07:51:18AM +0200, Sasha Levin wrote:
> >> On Thu, May 3, 2012 at 7:29 AM, Michael S. Tsirkin  wrote:
> >> > Sasha, didn't you have a patch to allocate
> >> > things using cache in virtio core?
> >> > What happened to it?
> >> >
> >> > Thanks,
> >> > MST
> >>
> >> It got stuck due to several things, and I got sidetracked, sorry. Here
> >> are the outstanding issues:
> >>
> >> 1. Since now we can allocate a descriptor either using kmalloc or from
> >> the cache, we need a new flag in vring_desc to know how to free it, it
> >> seems a bit too intrusive,
> >> and I couldn't thing of a better
> >> alternative.
> >
> > Since that is guest visible it does not sound great, I agree.
> >
> > Three ideas:
> > 1. The logic looks at descriptor size so can we just read
> >   desc.len before free and rerun the same math?
> 
> It'll break every time the value is changed (either by the user or by
> some dynamic algorithm thingie).

Yes but did you intend to implement such complex logic?
If not let's not over-engineer.

> > 2. For -net the requests are up to max_skb_frags + 2 in size, right?
> >   Does it make sense to just use cache for net, always?
> >   That would mean a per device flag.
> 
> Yup, it could work.
> 
> > 3. Allocate a bit more and stick extra data before the 1st descriptor.
> 
> I guess it'll work, but it just seems a bit ugly :)

An understatement.

> >> 2. Rusty has pointed out that no one is going to modify the default
> >> value we set, and we don't really have a good default value to put
> >> there (at least, we haven't agreed on a specific value). Also, you
> >> have noted that it should be a per-device value, which complicates
> >> this question further since we probably want a different value for
> >> each device type.
> >>
> >> While the first one can be solved easily with a blessing from the
> >> maintainers, the second one will require testing on various platforms,
> >> configurations and devices to select either the best "magic" value, or
> >> the best algorithm to play with threshold.
> >
> > Not sure about platforms but for devices that's right.
> > But this really only means we only change what we tested.
> > eg see what is good for net and change net in a way
> > that others will keep using old code.
> 
> It'll work only if there will be someone following up and actually
> testing it, since regular users won't be testing it at all (with it
> being defaulted to off and everything).

Not sure I understand. Whatever patch gets applied will be
tested beforehand.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: using cache for virtio allocations?

2012-05-03 Thread Sasha Levin

On Thu, May 3, 2012 at 10:44 AM, Michael S. Tsirkin  wrote:
> On Thu, May 03, 2012 at 10:38:56AM +0200, Sasha Levin wrote:
>> On Thu, May 3, 2012 at 9:32 AM, Michael S. Tsirkin  wrote:
>> > On Thu, May 03, 2012 at 07:51:18AM +0200, Sasha Levin wrote:
>> >> On Thu, May 3, 2012 at 7:29 AM, Michael S. Tsirkin  
>> >> wrote:
>> >> > Sasha, didn't you have a patch to allocate
>> >> > things using cache in virtio core?
>> >> > What happened to it?
>> >> >
>> >> > Thanks,
>> >> > MST
>> >>
>> >> It got stuck due to several things, and I got sidetracked, sorry. Here
>> >> are the outstanding issues:
>> >>
>> >> 1. Since now we can allocate a descriptor either using kmalloc or from
>> >> the cache, we need a new flag in vring_desc to know how to free it, it
>> >> seems a bit too intrusive,
>> >> and I couldn't thing of a better
>> >> alternative.
>> >
>> > Since that is guest visible it does not sound great, I agree.
>> >
>> > Three ideas:
>> > 1. The logic looks at descriptor size so can we just read
>> >   desc.len before free and rerun the same math?
>>
>> It'll break every time the value is changed (either by the user or by
>> some dynamic algorithm thingie).
>
> Yes but did you intend to implement such complex logic?
> If not let's not over-engineer.

I did intend to allow him to change the value while the device is
running, if we don't want to allow that then it's easy.

>> > 2. For -net the requests are up to max_skb_frags + 2 in size, right?
>> >   Does it make sense to just use cache for net, always?
>> >   That would mean a per device flag.
>>
>> Yup, it could work.
>>
>> > 3. Allocate a bit more and stick extra data before the 1st descriptor.
>>
>> I guess it'll work, but it just seems a bit ugly :)
>
> An understatement.
>
>> >> 2. Rusty has pointed out that no one is going to modify the default
>> >> value we set, and we don't really have a good default value to put
>> >> there (at least, we haven't agreed on a specific value). Also, you
>> >> have noted that it should be a per-device value, which complicates
>> >> this question further since we probably want a different value for
>> >> each device type.
>> >>
>> >> While the first one can be solved easily with a blessing from the
>> >> maintainers, the second one will require testing on various platforms,
>> >> configurations and devices to select either the best "magic" value, or
>> >> the best algorithm to play with threshold.
>> >
>> > Not sure about platforms but for devices that's right.
>> > But this really only means we only change what we tested.
>> > eg see what is good for net and change net in a way
>> > that others will keep using old code.
>>
>> It'll work only if there will be someone following up and actually
>> testing it, since regular users won't be testing it at all (with it
>> being defaulted to off and everything).
>
> Not sure I understand. Whatever patch gets applied will be
> tested beforehand.

I thought you meant that we apply the patch with threshold set at
0/disabled, and based on future tests we will enable it for specific
devices and set best values for threshold, no?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: using cache for virtio allocations?

2012-05-03 Thread Michael S. Tsirkin

On Thu, May 03, 2012 at 10:48:53AM +0200, Sasha Levin wrote:
> On Thu, May 3, 2012 at 10:44 AM, Michael S. Tsirkin  wrote:
> > On Thu, May 03, 2012 at 10:38:56AM +0200, Sasha Levin wrote:
> >> On Thu, May 3, 2012 at 9:32 AM, Michael S. Tsirkin  wrote:
> >> > On Thu, May 03, 2012 at 07:51:18AM +0200, Sasha Levin wrote:
> >> >> On Thu, May 3, 2012 at 7:29 AM, Michael S. Tsirkin  
> >> >> wrote:
> >> >> > Sasha, didn't you have a patch to allocate
> >> >> > things using cache in virtio core?
> >> >> > What happened to it?
> >> >> >
> >> >> > Thanks,
> >> >> > MST
> >> >>
> >> >> It got stuck due to several things, and I got sidetracked, sorry. Here
> >> >> are the outstanding issues:
> >> >>
> >> >> 1. Since now we can allocate a descriptor either using kmalloc or from
> >> >> the cache, we need a new flag in vring_desc to know how to free it, it
> >> >> seems a bit too intrusive,
> >> >> and I couldn't thing of a better
> >> >> alternative.
> >> >
> >> > Since that is guest visible it does not sound great, I agree.
> >> >
> >> > Three ideas:
> >> > 1. The logic looks at descriptor size so can we just read
> >> >   desc.len before free and rerun the same math?
> >>
> >> It'll break every time the value is changed (either by the user or by
> >> some dynamic algorithm thingie).
> >
> > Yes but did you intend to implement such complex logic?
> > If not let's not over-engineer.
> 
> I did intend to allow him to change the value while the device is
> running, if we don't want to allow that then it's easy.
> 
> >> > 2. For -net the requests are up to max_skb_frags + 2 in size, right?
> >> >   Does it make sense to just use cache for net, always?
> >> >   That would mean a per device flag.
> >>
> >> Yup, it could work.
> >>
> >> > 3. Allocate a bit more and stick extra data before the 1st descriptor.
> >>
> >> I guess it'll work, but it just seems a bit ugly :)
> >
> > An understatement.
> >
> >> >> 2. Rusty has pointed out that no one is going to modify the default
> >> >> value we set, and we don't really have a good default value to put
> >> >> there (at least, we haven't agreed on a specific value). Also, you
> >> >> have noted that it should be a per-device value, which complicates
> >> >> this question further since we probably want a different value for
> >> >> each device type.
> >> >>
> >> >> While the first one can be solved easily with a blessing from the
> >> >> maintainers, the second one will require testing on various platforms,
> >> >> configurations and devices to select either the best "magic" value, or
> >> >> the best algorithm to play with threshold.
> >> >
> >> > Not sure about platforms but for devices that's right.
> >> > But this really only means we only change what we tested.
> >> > eg see what is good for net and change net in a way
> >> > that others will keep using old code.
> >>
> >> It'll work only if there will be someone following up and actually
> >> testing it, since regular users won't be testing it at all (with it
> >> being defaulted to off and everything).
> >
> > Not sure I understand. Whatever patch gets applied will be
> > tested beforehand.
> 
> I thought you meant that we apply the patch with threshold set at
> 0/disabled, and based on future tests we will enable it for specific
> devices and set best values for threshold, no?

Exactly the opposite. I meant each driver sets the value
and we test it to find a good value. Drivers that don't
do anything use existing kmalloc code.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: vhost-net: is there a race for sock in handle_tx/rx?

2012-05-03 Thread Liu ping fan

Oh, got it. It is a very interesting implement.

Thanks and regards,
pingfan

On Thu, May 3, 2012 at 4:41 PM, Michael S. Tsirkin  wrote:
> On Thu, May 03, 2012 at 04:33:55PM +0800, Liu ping fan wrote:
>> Hi,
>>
>> During reading the vhost-net code, I find the following,
>>
>> static void handle_tx(struct vhost_net *net)
>> {
>>       struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
>>       unsigned out, in, s;
>>       int head;
>>       struct msghdr msg = {
>>               .msg_name = NULL,
>>               .msg_namelen = 0,
>>               .msg_control = NULL,
>>               .msg_controllen = 0,
>>               .msg_iov = vq->iov,
>>               .msg_flags = MSG_DONTWAIT,
>>       };
>>       size_t len, total_len = 0;
>>       int err, wmem;
>>       size_t hdr_size;
>>       struct socket *sock;
>>       struct vhost_ubuf_ref *uninitialized_var(ubufs);
>>       bool zcopy;
>>
>>       /* TODO: check that we are running from vhost_worker? */
>>       sock = rcu_dereference_check(vq->private_data, 1);
>>       if (!sock)
>>               return;
>>
>>            > Qemu calls
>> vhost_net_set_backend() to set a new backend fd, and close
>> @oldsock->file. And  sock->file refcnt==0.
>>
>>                                               Can vhost_worker prevent
>> itself from such situation? And how?
>>
>>       wmem = atomic_read(&sock->sk->sk_wmem_alloc);
>>        
>> .
>>
>> Is it a race?
>>
>> Thanks and regards,
>> pingfan
>
> See comment before void __rcu *private_data in vhost.h
>
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC] virtio-net: remove useless disable on freeze

2012-05-03 Thread Amit Shah

On (Wed) 04 Apr 2012 [12:19:55], Michael S. Tsirkin wrote:
> disable_cb is just an optimization: it
> can not guarantee that there are no callbacks.

Even then, what's the harm in keeping it?  If indeed there's an
attempt to raise an interrupt after the host has been notified, it
will be suppressed.

Also, disable_cb seems to be used elsewhere in the virtio_net.c file,
to suit similar purposes.

Amit
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC] virtio-net: remove useless disable on freeze

2012-05-03 Thread Michael S. Tsirkin

On Thu, May 03, 2012 at 04:29:59PM +0530, Amit Shah wrote:
> On (Wed) 04 Apr 2012 [12:19:55], Michael S. Tsirkin wrote:
> > disable_cb is just an optimization: it
> > can not guarantee that there are no callbacks.
> 
> Even then, what's the harm in keeping it?  If indeed there's an
> attempt to raise an interrupt after the host has been notified, it
> will be suppressed.

It won't. It's not a guarantee, e.g. with event index on
it does nothing at all.

> Also, disable_cb seems to be used elsewhere in the virtio_net.c file,
> to suit similar purposes.
> 
>   Amit

Where?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/4] KVM: Add APIs for unlocked TLB flush

2012-05-03 Thread Avi Kivity

Currently we flush the TLB while holding mmu_lock.  This
increases the lock hold time by the IPI round-trip time, increasing
contention, and makes dropping the lock (for latency reasons) harder.

This patch changes TLB management to be usable locklessly, introducing
the following APIs:

  kvm_mark_tlb_dirty() - mark the TLB as containing stale entries
  kvm_cond_flush_remote_tlbs() - flush the TLB if it was marked as
  dirty

These APIs can be used without holding mmu_lock (though if the TLB
became stale due to shadow page table modifications, typically it
will need to be called with the lock held to prevent other threads
from seeing the modified page tables with the TLB unmarked and unflushed)/

Signed-off-by: Avi Kivity 
---
 Documentation/virtual/kvm/locking.txt |   14 ++
 arch/x86/kvm/paging_tmpl.h|4 ++--
 include/linux/kvm_host.h  |   22 +-
 virt/kvm/kvm_main.c   |   29 -
 4 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/Documentation/virtual/kvm/locking.txt 
b/Documentation/virtual/kvm/locking.txt
index 3b4cd3b..f6c90479 100644
--- a/Documentation/virtual/kvm/locking.txt
+++ b/Documentation/virtual/kvm/locking.txt
@@ -23,3 +23,17 @@ Arch:x86
 Protects:  - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
- tsc offset in vmcb
 Comment:   'raw' because updating the tsc offsets must not be preempted.
+
+3. TLB control
+--
+
+The following APIs should be used for TLB control:
+
+ - kvm_mark_tlb_dirty() - indicates that the TLB is out of sync wrt
+  either guest or host page tables.
+ - kvm_flush_remote_tlbs() - unconditionally flush the tlbs
+ - kvm_cond_flush_remote_tlbs() - flush the TLBs if previously marked
+
+These may be used without mmu_lock, though kvm_mark_tlb_dirty() needs to be
+used while holding mmu_lock if it is called due to host page table changes
+(contrast to guest page table changes).
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 34f9709..97e2a81 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -793,7 +793,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
return -EINVAL;
 
if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
-   vcpu->kvm->tlbs_dirty++;
+   kvm_mark_tlb_dirty(vcpu->kvm);
continue;
}
 
@@ -806,7 +806,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
 
if (gfn != sp->gfns[i]) {
drop_spte(vcpu->kvm, &sp->spt[i]);
-   vcpu->kvm->tlbs_dirty++;
+   kvm_mark_tlb_dirty(vcpu->kvm);
continue;
}
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6f34330..4bff05d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -310,7 +310,14 @@ struct kvm {
unsigned long mmu_notifier_seq;
long mmu_notifier_count;
 #endif
-   long tlbs_dirty;
+   struct {
+   /*
+* When these two are different, a TLB somewhere holds a
+* stale TLB entry.  Clean with kvm_[cond_]flush_remote_tlbs().
+*/
+   atomic_long_t dirtied_count;
+   atomic_long_t flushed_count;
+   } tlb_state;
 };
 
 /* The guest did something we don't support. */
@@ -467,6 +474,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
 void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
 
 void kvm_flush_remote_tlbs(struct kvm *kvm);
+void kvm_cond_flush_remote_tlbs(struct kvm *kvm);
 void kvm_reload_remote_mmus(struct kvm *kvm);
 
 long kvm_arch_dev_ioctl(struct file *filp,
@@ -888,5 +896,17 @@ static inline bool kvm_check_request(int req, struct 
kvm_vcpu *vcpu)
}
 }
 
+/*
+ * Mark the TLB as dirty, for kvm_cond_flush_remote_tlbs().
+ */
+static inline void kvm_mark_tlb_dirty(struct kvm *kvm)
+{
+   /*
+* Make any changes to the page tables visible to remote flushers.
+*/
+   smp_mb__before_atomic_inc();
+   atomic_long_inc(&kvm->tlb_state.dirtied_count);
+}
+
 #endif
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1847c76..643ce01 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -203,12 +203,21 @@ static bool make_all_cpus_request(struct kvm *kvm, 
unsigned int req)
 
 void kvm_flush_remote_tlbs(struct kvm *kvm)
 {
-   long dirty_count = kvm->tlbs_dirty;
+   long dirty_count = atomic_long_read(&kvm->tlb_state.dirtied_count);
+   long flushed_count = atomic_long_read(&kvm->tlb_state.flushed_count);
 
smp_mb();
if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
++kvm->stat.remote_tlb_flush;
-   cmpxchg(&kvm->t

[PATCH 2/4] KVM: Flush TLB in mmu notifier without holding mmu_lock

2012-05-03 Thread Avi Kivity

Signed-off-by: Avi Kivity 
---
 virt/kvm/kvm_main.c |   16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 643ce01..9e2db44 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -302,11 +302,11 @@ static void kvm_mmu_notifier_invalidate_page(struct 
mmu_notifier *mn,
kvm->mmu_notifier_seq++;
if (kvm_unmap_hva(kvm, address))
kvm_mark_tlb_dirty(kvm);
-   /* we've to flush the tlb before the pages can be freed */
-   kvm_cond_flush_remote_tlbs(kvm);
-
spin_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, idx);
+
+   /* we've to flush the tlb before the pages can be freed */
+   kvm_cond_flush_remote_tlbs(kvm);
 }
 
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
@@ -347,11 +347,11 @@ static void 
kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
if (need_tlb_flush)
kvm_mark_tlb_dirty(kvm);
 
-   /* we've to flush the tlb before the pages can be freed */
-   kvm_cond_flush_remote_tlbs(kvm);
-
spin_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, idx);
+
+   /* we've to flush the tlb before the pages can be freed */
+   kvm_cond_flush_remote_tlbs(kvm);
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -392,11 +392,13 @@ static int kvm_mmu_notifier_clear_flush_young(struct 
mmu_notifier *mn,
 
young = kvm_age_hva(kvm, address);
if (young)
-   kvm_flush_remote_tlbs(kvm);
+   kvm_mark_tlb_dirty(kvm);
 
spin_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, idx);
 
+   kvm_cond_flush_remote_tlbs(kvm);
+
return young;
 }
 
-- 
1.7.10

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/4] KVM: Flush TLB in change_pte mmu notifier without holding mmu_lock

2012-05-03 Thread Avi Kivity

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/mmu.c  |8 ++--
 virt/kvm/kvm_main.c |2 ++
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0e7bcff..bc1d83c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1138,7 +1138,6 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
 {
u64 *sptep;
struct rmap_iterator iter;
-   int need_flush = 0;
u64 new_spte;
pte_t *ptep = (pte_t *)data;
pfn_t new_pfn;
@@ -1150,8 +1149,6 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
BUG_ON(!is_shadow_present_pte(*sptep));
rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", sptep, *sptep);
 
-   need_flush = 1;
-
if (pte_write(*ptep)) {
drop_spte(kvm, sptep);
sptep = rmap_get_first(*rmapp, &iter);
@@ -1167,10 +1164,9 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
mmu_spte_set(sptep, new_spte);
sptep = rmap_get_next(&iter);
}
-   }
 
-   if (need_flush)
-   kvm_flush_remote_tlbs(kvm);
+   kvm_mark_tlb_dirty(kvm);
+   }
 
return 0;
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9e2db44..92d8ddc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -323,6 +323,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier 
*mn,
kvm_set_spte_hva(kvm, address, pte);
spin_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, idx);
+
+   kvm_cond_flush_remote_tlbs(kvm);
 }
 
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
-- 
1.7.10

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/4] Unlocked TLB flush

2012-05-03 Thread Avi Kivity

This patchset implements unlocked TLB flushing for KVM.  An operation that
generates stale TLB entries can mark the TLB as dirty instead of flushing
immediately, and then flush after releasing mmu_lock but before returning
to the guest or the caller.  A few call sites are converted too.

Note not all call sites are easily convertible; as an example, sync_page()
must flush before reading the guest page table.

Avi Kivity (4):
  KVM: Add APIs for unlocked TLB flush
  KVM: Flush TLB in mmu notifier without holding mmu_lock
  KVM: Flush TLB in FNAME(invlpg) without holding mmu_lock
  KVM: Flush TLB in change_pte mmu notifier without holding mmu_lock

 Documentation/virtual/kvm/locking.txt |   14 
 arch/x86/kvm/mmu.c|   15 +
 arch/x86/kvm/paging_tmpl.h|9 
 include/linux/kvm_host.h  |   22 ++-
 virt/kvm/kvm_main.c   |   39 +++--
 5 files changed, 72 insertions(+), 27 deletions(-)

-- 
1.7.10

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/4] KVM: Flush TLB in FNAME(invlpg) without holding mmu_lock

2012-05-03 Thread Avi Kivity

mmu_page_zap_pte() is modified to mark the TLB as dirty; but currently
only FNAME(invlpg) takes advantage of this.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/mmu.c |7 +++
 arch/x86/kvm/paging_tmpl.h |5 +++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 07424cf..0e7bcff 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1887,7 +1887,7 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, 
u64 *sptep,
}
 }
 
-static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
+static void mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 u64 *spte)
 {
u64 pte;
@@ -1903,13 +1903,12 @@ static bool mmu_page_zap_pte(struct kvm *kvm, struct 
kvm_mmu_page *sp,
child = page_header(pte & PT64_BASE_ADDR_MASK);
drop_parent_pte(child, spte);
}
-   return true;
+   kvm_mark_tlb_dirty(kvm);
+   return;
}
 
if (is_mmio_spte(pte))
mmu_spte_clear_no_track(spte);
-
-   return false;
 }
 
 static void kvm_mmu_page_unlink_children(struct kvm *kvm,
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 97e2a81..72c9cf4 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -697,8 +697,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
pte_gpa = FNAME(get_level1_sp_gpa)(sp);
pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
 
-   if (mmu_page_zap_pte(vcpu->kvm, sp, sptep))
-   kvm_flush_remote_tlbs(vcpu->kvm);
+   mmu_page_zap_pte(vcpu->kvm, sp, sptep);
 
if (!rmap_can_add(vcpu))
break;
@@ -714,6 +713,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
break;
}
spin_unlock(&vcpu->kvm->mmu_lock);
+
+   kvm_cond_flush_remote_tlbs(vcpu->kvm);
 }
 
 static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
-- 
1.7.10

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 06/10] KVM: MMU: fast path of handling guest page fault

2012-05-03 Thread Xiao Guangrong

On 05/03/2012 05:07 AM, Marcelo Tosatti wrote:

>> 'entry' is not a problem since it is from atomically read-write as
>> mentioned above, i need change this code to:
>>
>>  /*
>>   * Optimization: for pte sync, if spte was writable the hash
>>   * lookup is unnecessary (and expensive). Write protection
>>   * is responsibility of mmu_get_page / kvm_sync_page.
>>   * Same reasoning can be applied to dirty page accounting.
>>   */
>>  if (!can_unsync && is_writable_pte(entry) /* Use 'entry' 
>> instead of '*sptep'. */
>>  goto set_pte
>>..
>>
>>
>>  if (is_writable_pte(entry) && !is_writable_pte(spte)) /* Use 'spte' 
>> instead of '*sptep'. */
>>  kvm_flush_remote_tlbs(vcpu->kvm);
> 
> What is of more importance than the ability to verify that this or that
> particular case are ok at the moment is to write code in such a way that
> its easy to verify that it is correct.
> 
> Thus the suggestion above:
> 
> "scattered all over (as mentioned before, i think a pattern of read spte
> once, work on top of that, atomically write and then deal with results
> _everywhere_ (where mmu lock is held) is more consistent."
> 

Marcelo, thanks for your time to patiently review/reply my mail.

I am confused with ' _everywhere_ ', it means all of the path read/update
spte? why not only verify the path which depends on is_writable_pte()?

For the reason of "its easy to verify that it is correct"? But these
paths are safe since it is not care PT_WRITABLE_MASK at all. What these
paths care is the Dirty-bit and Accessed-bit are not lost, that is why
we always treat the spte as "volatile" if it is can be updated out of
mmu-lock.

For the further development? We can add the delta comment for
is_writable_pte() to warn the developer use it more carefully.

It is also very hard to verify spte everywhere. :(

Actually, the current code to care PT_WRITABLE_MASK is just for
tlb flush, may be we can fold it into mmu_spte_update.
[
  There are tree ways to modify spte, present -> nonpresent, nonpresent -> 
present,
  present -> present.

  But we only need care present -> present for lockless.
]

/*
 * return true means we need flush tlbs caused by changing spte from writeable
 * to read-only.
 */
bool mmu_update_spte(u64 *sptep, u64 spte)
{
u64 last_spte, old_spte = *sptep;
bool flush = false;

last_spte = xchg(sptep, spte);

if ((is_writable_pte(last_spte) ||
  spte_has_updated_lockless(old_spte, last_spte)) &&
 !is_writable_pte(spte) )
flush = true;

 track Drity/Accessed bit ...

return flush
}

Furthermore, the style of "if (spte-has-changed) goto beginning" is feasible
in set_spte since this path is a fast path. (i can speed up 
mmu_need_write_protect)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 06/10] KVM: MMU: fast path of handling guest page fault

2012-05-03 Thread Xiao Guangrong

On 05/03/2012 05:10 AM, Marcelo Tosatti wrote:

> On Wed, May 02, 2012 at 01:39:51PM +0800, Xiao Guangrong wrote:
>> On 04/29/2012 04:50 PM, Takuya Yoshikawa wrote:
>>
>>> On Fri, 27 Apr 2012 11:52:13 -0300
>>> Marcelo Tosatti  wrote:
>>>
 Yes but the objective you are aiming for is to read and write sptes
 without mmu_lock. That is, i am not talking about this patch. 
 Please read carefully the two examples i gave (separated by "example)").
>>>
>>> The real objective is not still clear.
>>>
>>> The ~10% improvement reported before was on macro benchmarks during live
>>> migration.  At least, that optimization was the initial objective.
>>>
>>> But at some point, the objective suddenly changed to "lock-less" without
>>> understanding what introduced the original improvement.
>>>
>>> Was the problem really mmu_lock contention?
>>>
>>
>>
>> Takuya, i am so tired to argue the advantage of lockless write-protect
>> and lockless O(1) dirty-log again and again.
> 
> His point is valid: there is a lack of understanding on the details of
> the improvement.
> 

Actually, the improvement of lockless is that it can let vcpu to be parallel
as possible.

>From the test result, lockless gains little improvement for unix-migration,
in this case, the vcpus are almost idle (at least not busy).

The large improvement is from dbench-migration, in this case, all vcpus are
busy accessing memory which is write-protected by dirty-log. If you enable
page-fault/fast-page-fault tracepoints, you can see huge number of page fault
from different vcpu during the migration.

> Did you see the pahole output on struct kvm? Apparently mmu_lock is
> sharing a cacheline with read-intensive memslots pointer. It would be 
> interesting to see what are the effects of cacheline aligning mmu_lock.
> 

Yes, i see that. In my test .config, i have enabled
CONFIG_DEBUG_SPINLOCK/CONFIG_DEBUG_LOCK_ALLOC, mmu-lock is not sharing cacheline
with memslots. That means it is not a problem during my test.
(BTW, pahole can not work on my box, it shows:
..
DW_AT_<0x3c>=0x19
DW_AT_<0x3c>=0x19
DW_AT_<0x3c>=0x19
die__process_function: DW_TAG_INVALID (0x4109) @ <0x12886> not handled!
)

If we reorganize 'struct kvm', i guess it is good for kvm but it can not improve
too much for migration. :)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 06/10] KVM: MMU: fast path of handling guest page fault

2012-05-03 Thread Avi Kivity

On 05/03/2012 03:09 PM, Xiao Guangrong wrote:
> Actually, the improvement of lockless is that it can let vcpu to be parallel
> as possible.
>
> From the test result, lockless gains little improvement for unix-migration,
> in this case, the vcpus are almost idle (at least not busy).
>
> The large improvement is from dbench-migration, in this case, all vcpus are
> busy accessing memory which is write-protected by dirty-log. If you enable
> page-fault/fast-page-fault tracepoints, you can see huge number of page fault
> from different vcpu during the migration.
>

We can kill the page faults completely by using A/D bits in shadow page
tables.  The latest version of the Intel SDM defines A/D bits for EPT,
and NPT has had them from day one.

Of course this comes at a cost, instead of traversing a bitmap we have
to read all sptes, which are 64x as large.  But it's probably worthwhile
for large guests and perhaps also for smaller ones.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] sched: make callers check lock contention for cond_resched_lock()

2012-05-03 Thread Takuya Yoshikawa

On Thu, 03 May 2012 10:35:27 +0200
Peter Zijlstra  wrote:

> On Thu, 2012-05-03 at 17:12 +0900, Takuya Yoshikawa wrote:
> > 
> > Although we can do that using spin_is_contended() and cond_resched(),
> > changing cond_resched_lock() to satisfy such a need is another option.
> > 
> Yeah, not a pretty patch. Changing all cond_resched_lock() sites just to
> change one and in such an ugly way too.
> 
> So what's the impact of making spin_needbreak() work for !PREEMPT?

Although the real use case is out of this RFC patch, we are now discussing
a case in which we may hold a spin_lock for long time, ms order, depending
on workload;  and in that case, other threads -- VCPU threads -- should be
given higher priority for that problematic lock.

I wanted to hear whether other users also have similar needs.  If so, it
may be worth making the API a bit more generic.

But I could not find a clean solution for that.  Do you think that using
spin_is_contended() directly is the way to go?

Thanks,
Takuya
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 06/10] KVM: MMU: fast path of handling guest page fault

2012-05-03 Thread Xiao Guangrong

On 05/03/2012 08:15 AM, Takuya Yoshikawa wrote:

> On Wed, 02 May 2012 13:39:51 +0800
> Xiao Guangrong  wrote:
> 
>>> Was the problem really mmu_lock contention?
> 
>> Takuya, i am so tired to argue the advantage of lockless write-protect
>> and lockless O(1) dirty-log again and again.
> 
> You are missing my point.  Please do not take my comments as an objection
> to your whole work: whey do you feel so?
> 


Takuya, i am sorry, please forgive my rudeness! Since my English is
so poor that it is easy for me to misunderstand the mail.　:(

> I thought that your new fast-page-fault path was fast and optimized
> the guest during dirty logging.
> 
> So in this v4, you might get a similar result even before dropping
> mmu_lock, without 07/10?, if the problem Marcelo explained was not there.
> 


Actually, the improvement is larger than v2/v3 if ept is enabled, but
it is lower for ept disabled. This is because the fask-fask (rmap.WRITABLE bit)
is dropped for better review.

> 
> Of course there is a problem of mmu_lock contention.  What I am suggesting
> is to split that problem and do measurement separately so that part of
> your work can be merged soon.
> 
> Your guest size and workload was small to make get_dirty hold mmu_lock
> long time.  If you want to appeal the real value of lock-less, you need to
> do another measurment.
> 
> 
> But this is your work and it's up to you.  Although I was thinking to help
> your measurement, I cannot do that knowing the fact that you would not
> welcome my help.
> 


Of course, any measurement is appreciative!

> 
>>> Although I am not certain about what will be really needed in the
>>> final form, if this kind of maybe-needed-overhead is going to be
>>> added little by little, I worry about possible regression.
> 
>> Well, will you suggest Linus to reject all patches and stop
>> all discussion for the "possible regression" reason?
> 
> My concern was for Marcelo's examples, not your current implementation.
> If you can show explicitely what will be needed in the final form,
> I do not have any concern.
> 
> 
> Sorry for disturbing.


Sorry again.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] sched: make callers check lock contention for cond_resched_lock()

2012-05-03 Thread Peter Zijlstra

On Thu, 2012-05-03 at 21:22 +0900, Takuya Yoshikawa wrote:
> Although the real use case is out of this RFC patch, we are now discussing
> a case in which we may hold a spin_lock for long time, ms order, depending
> on workload;  and in that case, other threads -- VCPU threads -- should be
> given higher priority for that problematic lock. 

Firstly, if you can hold a lock that long, it shouldn't be a spinlock,
secondly why isn't TIF_RESCHED being set if its running that long? That
should still make cond_resched_lock() break.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 06/10] KVM: MMU: fast path of handling guest page fault

2012-05-03 Thread Takuya Yoshikawa

On Thu, 03 May 2012 20:23:18 +0800
Xiao Guangrong  wrote:

> Takuya, i am sorry, please forgive my rudeness! Since my English is
> so poor that it is easy for me to misunderstand the mail.　:(

Me too, I am not good at reading/speaking English!
No problem.

> > But this is your work and it's up to you.  Although I was thinking to help
> > your measurement, I cannot do that knowing the fact that you would not
> > welcome my help.

> Of course, any measurement is appreciative!

I am interested in checking the improvement in my own eyes.

> > Sorry for disturbing.

> Sorry again.

No problem, really.
Let's discuss more from now on!

What I am really worrying about is the lack of review in this area.
I am not the person who is good at MMU things.

We should collect more comments if possible from other developers.

Thanks,
Takuya
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] sched: make callers check lock contention for cond_resched_lock()

2012-05-03 Thread Avi Kivity

On 05/03/2012 03:29 PM, Peter Zijlstra wrote:
> On Thu, 2012-05-03 at 21:22 +0900, Takuya Yoshikawa wrote:
> > Although the real use case is out of this RFC patch, we are now discussing
> > a case in which we may hold a spin_lock for long time, ms order, depending
> > on workload;  and in that case, other threads -- VCPU threads -- should be
> > given higher priority for that problematic lock. 
>
> Firstly, if you can hold a lock that long, it shouldn't be a spinlock,

In fact with your mm preemptibility work it can be made into a mutex, if
the entire mmu notifier path can be done in task context.  However it
ends up a strange mutex - you can sleep while holding it but you may not
allocate, because you might recurse into an mmu notifier again.

Most uses of the lock only involve tweaking some bits though.

> secondly why isn't TIF_RESCHED being set if its running that long? That
> should still make cond_resched_lock() break.


-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] sched: make callers check lock contention for cond_resched_lock()

2012-05-03 Thread Takuya Yoshikawa

On Thu, 03 May 2012 14:29:10 +0200
Peter Zijlstra  wrote:

> On Thu, 2012-05-03 at 21:22 +0900, Takuya Yoshikawa wrote:
> > Although the real use case is out of this RFC patch, we are now discussing
> > a case in which we may hold a spin_lock for long time, ms order, depending
> > on workload;  and in that case, other threads -- VCPU threads -- should be
> > given higher priority for that problematic lock. 
> 
> Firstly, if you can hold a lock that long, it shouldn't be a spinlock,

I agree with you in principle, but isn't cond_resched_lock() there for that?

> secondly why isn't TIF_RESCHED being set if its running that long? That
> should still make cond_resched_lock() break.

I see.

I did some tests using spin_is_contended() and need_resched() and saw
that need_resched() was called as often as spin_is_contended(), so
experimentally I understand your point.

But as I could not see why spin_needbreak() was differently implemented
depending on CONFIG_PREEMPT, I wanted to understand the meaning.

Thanks,
Takuya
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/4] KVM: Add APIs for unlocked TLB flush

2012-05-03 Thread Xiao Guangrong

On 05/03/2012 07:22 PM, Avi Kivity wrote:

> Currently we flush the TLB while holding mmu_lock.  This
> increases the lock hold time by the IPI round-trip time, increasing
> contention, and makes dropping the lock (for latency reasons) harder.
> 
> This patch changes TLB management to be usable locklessly, introducing
> the following APIs:
> 
>   kvm_mark_tlb_dirty() - mark the TLB as containing stale entries
>   kvm_cond_flush_remote_tlbs() - flush the TLB if it was marked as
>   dirty
> 
> These APIs can be used without holding mmu_lock (though if the TLB
> became stale due to shadow page table modifications, typically it
> will need to be called with the lock held to prevent other threads
> from seeing the modified page tables with the TLB unmarked and unflushed)/
> 
> Signed-off-by: Avi Kivity 
> ---
>  Documentation/virtual/kvm/locking.txt |   14 ++
>  arch/x86/kvm/paging_tmpl.h|4 ++--
>  include/linux/kvm_host.h  |   22 +-
>  virt/kvm/kvm_main.c   |   29 -
>  4 files changed, 57 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/locking.txt 
> b/Documentation/virtual/kvm/locking.txt
> index 3b4cd3b..f6c90479 100644
> --- a/Documentation/virtual/kvm/locking.txt
> +++ b/Documentation/virtual/kvm/locking.txt
> @@ -23,3 +23,17 @@ Arch:  x86
>  Protects:- kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
>   - tsc offset in vmcb
>  Comment: 'raw' because updating the tsc offsets must not be preempted.
> +
> +3. TLB control
> +--
> +
> +The following APIs should be used for TLB control:
> +
> + - kvm_mark_tlb_dirty() - indicates that the TLB is out of sync wrt
> +  either guest or host page tables.
> + - kvm_flush_remote_tlbs() - unconditionally flush the tlbs
> + - kvm_cond_flush_remote_tlbs() - flush the TLBs if previously marked
> +
> +These may be used without mmu_lock, though kvm_mark_tlb_dirty() needs to be
> +used while holding mmu_lock if it is called due to host page table changes
> +(contrast to guest page table changes).


In these patches, it seems that kvm_mark_tlb_dirty is always used
under the protection of mmu-lock, yes?

If both kvm_mark_tlb_dirty and kvm_cond_flush_remote_tlbs are use
out of mmu-lock, i think we can use kvm_flush_remote_tlbs instead.

If it is so, dirtied_count/flushed_count need not be atomic.

And, it seems there is a bug:

 VCPU 0  VCPU 1

 hold mmu-lock
 zap spte which points to 'gfn'
 mark_tlb_dirty
 release mmu-lock
hold mmu-lock
rmap_write-protect:
   see no spte pointing to gfn
   tlb did not be flushed
release mmu-lock

write gfn by guest
  OOPS!!!

 kvm_cond_flush_remote_tlbs

We need call kvm_cond_flush_remote_tlbs in rmap_write-protect
unconditionally?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

2012-05-03 Thread Stefano Stabellini

On Tue, 1 May 2012, Jeremy Fitzhardinge wrote:
> On 05/01/2012 03:59 AM, Peter Zijlstra wrote:
> > On Tue, 2012-05-01 at 12:57 +0200, Peter Zijlstra wrote:
> >> Anyway, I don't have any idea about the costs involved with
> >> HAVE_RCU_TABLE_FREE, but I don't think its much.. otherwise these other
> >> platforms (PPC,SPARC) wouldn't have used it, gup_fast() is a very
> >> specific case, whereas mmu-gather is something affecting pretty much all
> >> tasks. 
> > Which reminds me, I thought Xen needed this too, but a git grep on
> > HAVE_RCU_TABLE_FREE shows its still only ppc and sparc.
> >
> > Jeremy?
> 
> Yeah, I was thinking that too, but I can't remember what we did to
> resolve it.  For pure PV guests, gupf simply isn't used, so the problem
> is moot.  But for dom0 or PCI-passthrough it could be.

Yes, dom0 can use gupf, for example when a userspace block backend is
involved.

Reading the code it seems to me that xen_flush_tlb_others returns
immediately and succesfully, no matter whether one or more vcpus are or
are not running at the moment and no matter if one or more vcpus
previously disabled interrupts.

Therefore I think that we should be using HAVE_RCU_TABLE_FREE.
I am going to submit a patch for that.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] sched: make callers check lock contention for cond_resched_lock()

2012-05-03 Thread Takuya Yoshikawa

On Thu, 03 May 2012 15:47:26 +0300
Avi Kivity  wrote:

> On 05/03/2012 03:29 PM, Peter Zijlstra wrote:
> > On Thu, 2012-05-03 at 21:22 +0900, Takuya Yoshikawa wrote:
> > > Although the real use case is out of this RFC patch, we are now discussing
> > > a case in which we may hold a spin_lock for long time, ms order, depending
> > > on workload;  and in that case, other threads -- VCPU threads -- should be
> > > given higher priority for that problematic lock. 
> >
> > Firstly, if you can hold a lock that long, it shouldn't be a spinlock,
> 
> In fact with your mm preemptibility work it can be made into a mutex, if
> the entire mmu notifier path can be done in task context.  However it
> ends up a strange mutex - you can sleep while holding it but you may not
> allocate, because you might recurse into an mmu notifier again.
> 
> Most uses of the lock only involve tweaking some bits though.

I might find a real way to go.

After your "mmu_lock -- TLB-flush" decoupling, we can change the current
get_dirty work flow like this:

for ... {
take mmu_lock
for 4K*8 gfns { // with 4KB dirty_bitmap_buffer
xchg dirty bits // 64/32 gfns at once
write protect them
}
release mmu_lock
copy_to_user
}
TLB flush

This reduces the size of dirty_bitmap_buffer and does not hold mmu_lock
so long.

I should have think of a way not to hold the spin_lock so long as Peter
said.  My lack of thinking might be the real problem.

Thanks,
Takuya
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/4] KVM: Add APIs for unlocked TLB flush

2012-05-03 Thread Avi Kivity

On 05/03/2012 04:23 PM, Xiao Guangrong wrote:
> On 05/03/2012 07:22 PM, Avi Kivity wrote:
>
> > Currently we flush the TLB while holding mmu_lock.  This
> > increases the lock hold time by the IPI round-trip time, increasing
> > contention, and makes dropping the lock (for latency reasons) harder.
> > 
> > This patch changes TLB management to be usable locklessly, introducing
> > the following APIs:
> > 
> >   kvm_mark_tlb_dirty() - mark the TLB as containing stale entries
> >   kvm_cond_flush_remote_tlbs() - flush the TLB if it was marked as
> >   dirty
> > 
> > These APIs can be used without holding mmu_lock (though if the TLB
> > became stale due to shadow page table modifications, typically it
> > will need to be called with the lock held to prevent other threads
> > from seeing the modified page tables with the TLB unmarked and unflushed)/
> > 
> > Signed-off-by: Avi Kivity 
> > ---
> >  Documentation/virtual/kvm/locking.txt |   14 ++
> >  arch/x86/kvm/paging_tmpl.h|4 ++--
> >  include/linux/kvm_host.h  |   22 +-
> >  virt/kvm/kvm_main.c   |   29 -
> >  4 files changed, 57 insertions(+), 12 deletions(-)
> > 
> > diff --git a/Documentation/virtual/kvm/locking.txt 
> > b/Documentation/virtual/kvm/locking.txt
> > index 3b4cd3b..f6c90479 100644
> > --- a/Documentation/virtual/kvm/locking.txt
> > +++ b/Documentation/virtual/kvm/locking.txt
> > @@ -23,3 +23,17 @@ Arch:x86
> >  Protects:  - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
> > - tsc offset in vmcb
> >  Comment:   'raw' because updating the tsc offsets must not be preempted.
> > +
> > +3. TLB control
> > +--
> > +
> > +The following APIs should be used for TLB control:
> > +
> > + - kvm_mark_tlb_dirty() - indicates that the TLB is out of sync wrt
> > +  either guest or host page tables.
> > + - kvm_flush_remote_tlbs() - unconditionally flush the tlbs
> > + - kvm_cond_flush_remote_tlbs() - flush the TLBs if previously marked
> > +
> > +These may be used without mmu_lock, though kvm_mark_tlb_dirty() needs to be
> > +used while holding mmu_lock if it is called due to host page table changes
> > +(contrast to guest page table changes).
>
>
> In these patches, it seems that kvm_mark_tlb_dirty is always used
> under the protection of mmu-lock, yes?

Correct.  It's possible we'll find a use outside mmu_lock, but this
isn't likely.

> If both kvm_mark_tlb_dirty and kvm_cond_flush_remote_tlbs are use
> out of mmu-lock, i think we can use kvm_flush_remote_tlbs instead.
>
> If it is so, dirtied_count/flushed_count need not be atomic.

But we always mark with mmu_lock held.

>
> And, it seems there is a bug:
>
>  VCPU 0  VCPU 1
>
>  hold mmu-lock
>  zap spte which points to 'gfn'
>  mark_tlb_dirty
>  release mmu-lock
> hold mmu-lock
> rmap_write-protect:
>see no spte pointing to gfn
>tlb did not be flushed
> release mmu-lock
>
> write gfn by guest
>   OOPS!!!
>
>  kvm_cond_flush_remote_tlbs
>
> We need call kvm_cond_flush_remote_tlbs in rmap_write-protect
> unconditionally?

Yes, that's the point.  We change

   spin_lock(mmu_lock)
   conditionally flush the tlb
   spin_unlock(mmu_lock)

to

  spin_lock(mmu_lock)
  conditionally mark the tlb as dirty
  spin_unlock(mmu_lock)
  kvm_cond_flush_remote_tlbs()

but that means the entire codebase has to be converted.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] sched: make callers check lock contention for cond_resched_lock()

2012-05-03 Thread Avi Kivity

On 05/03/2012 05:11 PM, Takuya Yoshikawa wrote:
> On Thu, 03 May 2012 15:47:26 +0300
> Avi Kivity  wrote:
>
> > On 05/03/2012 03:29 PM, Peter Zijlstra wrote:
> > > On Thu, 2012-05-03 at 21:22 +0900, Takuya Yoshikawa wrote:
> > > > Although the real use case is out of this RFC patch, we are now 
> > > > discussing
> > > > a case in which we may hold a spin_lock for long time, ms order, 
> > > > depending
> > > > on workload;  and in that case, other threads -- VCPU threads -- should 
> > > > be
> > > > given higher priority for that problematic lock. 
> > >
> > > Firstly, if you can hold a lock that long, it shouldn't be a spinlock,
> > 
> > In fact with your mm preemptibility work it can be made into a mutex, if
> > the entire mmu notifier path can be done in task context.  However it
> > ends up a strange mutex - you can sleep while holding it but you may not
> > allocate, because you might recurse into an mmu notifier again.
> > 
> > Most uses of the lock only involve tweaking some bits though.
>
> I might find a real way to go.
>
> After your "mmu_lock -- TLB-flush" decoupling, we can change the current
> get_dirty work flow like this:
>
>   for ... {
>   take mmu_lock
>   for 4K*8 gfns { // with 4KB dirty_bitmap_buffer
>   xchg dirty bits // 64/32 gfns at once
>   write protect them
>   }
>   release mmu_lock
>   copy_to_user
>   }
>   TLB flush
>
> This reduces the size of dirty_bitmap_buffer and does not hold mmu_lock
> so long.

Good idea.  Hopefully the lock acquisition costs are low enough - we're
adding two atomic operations per iteration here.


-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] sched: make callers check lock contention for cond_resched_lock()

2012-05-03 Thread Avi Kivity

On 05/03/2012 05:27 PM, Avi Kivity wrote:
> On 05/03/2012 05:11 PM, Takuya Yoshikawa wrote:
> > On Thu, 03 May 2012 15:47:26 +0300
> > Avi Kivity  wrote:
> >
> > > On 05/03/2012 03:29 PM, Peter Zijlstra wrote:
> > > > On Thu, 2012-05-03 at 21:22 +0900, Takuya Yoshikawa wrote:
> > > > > Although the real use case is out of this RFC patch, we are now 
> > > > > discussing
> > > > > a case in which we may hold a spin_lock for long time, ms order, 
> > > > > depending
> > > > > on workload;  and in that case, other threads -- VCPU threads -- 
> > > > > should be
> > > > > given higher priority for that problematic lock. 
> > > >
> > > > Firstly, if you can hold a lock that long, it shouldn't be a spinlock,
> > > 
> > > In fact with your mm preemptibility work it can be made into a mutex, if
> > > the entire mmu notifier path can be done in task context.  However it
> > > ends up a strange mutex - you can sleep while holding it but you may not
> > > allocate, because you might recurse into an mmu notifier again.
> > > 
> > > Most uses of the lock only involve tweaking some bits though.
> >
> > I might find a real way to go.
> >
> > After your "mmu_lock -- TLB-flush" decoupling, we can change the current
> > get_dirty work flow like this:
> >
> > for ... {
> > take mmu_lock
> > for 4K*8 gfns { // with 4KB dirty_bitmap_buffer
> > xchg dirty bits // 64/32 gfns at once
> > write protect them
> > }
> > release mmu_lock
> > copy_to_user
> > }
> > TLB flush
> >
> > This reduces the size of dirty_bitmap_buffer and does not hold mmu_lock
> > so long.
>
> Good idea.  Hopefully the lock acquisition costs are low enough - we're
> adding two atomic operations per iteration here.
>

btw, this requires my kvm_cond_flush_remote_tlbs().  Otherwise another
thread can acquire the lock, see a pagetable marked read-only by this
code, and proceed to shadow it, while the guest still has a writeable
tlb entry pointing at it.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] sched: make callers check lock contention for cond_resched_lock()

2012-05-03 Thread Peter Zijlstra

On Thu, 2012-05-03 at 22:00 +0900, Takuya Yoshikawa wrote:
> But as I could not see why spin_needbreak() was differently
> implemented
> depending on CONFIG_PREEMPT, I wanted to understand the meaning. 

Its been that way since before voluntary preemption was introduced, so
its possible Ingo simply missed that spot and nobody noticed until now.

Ingo, do you have any recollections from back when?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

possible circular locking dependency

2012-05-03 Thread Sergey Senozhatsky

Hello,
3.4-rc5

[32881.212463] kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. 
Using workaround
[32882.360505] 
[32882.360509] ==
[32882.360511] [ INFO: possible circular locking dependency detected ]
[32882.360515] 3.4.0-rc5-dbg-00932-gfabccd4-dirty #1107 Not tainted
[32882.360517] ---
[32882.360519] qemu-system-x86/15168 is trying to acquire lock:
[32882.360521]  (cpu_hotplug.lock){+.+.+.}, at: [] 
get_online_cpus+0x41/0x55
[32882.360532] 
[32882.360532] but task is already holding lock:
[32882.360534]  (&sp->mutex){+.+...}, at: [] 
__synchronize_srcu+0x9a/0x104
[32882.360542] 
[32882.360542] which lock already depends on the new lock.
[32882.360543] 
[32882.360545] 
[32882.360545] the existing dependency chain (in reverse order) is:
[32882.360547] 
[32882.360547] -> #3 (&sp->mutex){+.+...}:
[32882.360552][] lock_acquire+0x148/0x1c3
[32882.360557][] mutex_lock_nested+0x6e/0x2d1
[32882.360562][] __synchronize_srcu+0x9a/0x104
[32882.360566][] synchronize_srcu+0x15/0x17
[32882.360569][] 
srcu_notifier_chain_unregister+0x5b/0x69
[32882.360573][] cpufreq_unregister_notifier+0x22/0x3c
[32882.360580][] cpufreq_governor_dbs+0x322/0x3ac
[32882.360584][] __cpufreq_governor+0x6b/0xa8
[32882.360587][] __cpufreq_set_policy+0xf3/0x145
[32882.360591][] store_scaling_governor+0x173/0x1a9
[32882.360594][] store+0x5a/0x86
[32882.360597][] sysfs_write_file+0xee/0x126
[32882.360603][] vfs_write+0xa3/0x14c
[32882.360607][] sys_write+0x43/0x73
[32882.360610][] system_call_fastpath+0x1a/0x1f
[32882.360614] 
[32882.360615] -> #2 (dbs_mutex){+.+.+.}:
[32882.360619][] lock_acquire+0x148/0x1c3
[32882.360622][] mutex_lock_nested+0x6e/0x2d1
[32882.360625][] cpufreq_governor_dbs+0x7c/0x3ac
[32882.360629][] __cpufreq_governor+0x6b/0xa8
[32882.360632][] __cpufreq_set_policy+0x109/0x145
[32882.360636][] cpufreq_add_dev_interface+0x257/0x288
[32882.360639][] cpufreq_add_dev+0x40a/0x42a
[32882.360643][] subsys_interface_register+0x9b/0xdc
[32882.360648][] cpufreq_register_driver+0xa0/0x14b
[32882.360652][] store_up_threshold+0x3a/0x50 
[cpufreq_ondemand]
[32882.360657][] do_one_initcall+0x7f/0x140
[32882.360663][] sys_init_module+0x1818/0x1aec
[32882.360667][] system_call_fastpath+0x1a/0x1f
[32882.360671] 
[32882.360671] -> #1 (&per_cpu(cpu_policy_rwsem, cpu)){+.}:
[32882.360675][] lock_acquire+0x148/0x1c3
[32882.360679][] down_write+0x49/0x6c
[32882.360682][] lock_policy_rwsem_write+0x47/0x78
[32882.360685][] cpufreq_cpu_callback+0x57/0x81
[32882.360692][] notifier_call_chain+0xac/0xd9
[32882.360697][] __raw_notifier_call_chain+0xe/0x10
[32882.360701][] __cpu_notify+0x20/0x37
[32882.360705][] _cpu_down+0x7b/0x25d
[32882.360709][] disable_nonboot_cpus+0x5f/0x10b
[32882.360712][] suspend_devices_and_enter+0x197/0x401
[32882.360719][] pm_suspend+0x104/0x1bd
[32882.360722][] state_store+0xa0/0xc9
[32882.360726][] kobj_attr_store+0xf/0x1b
[32882.360730][] sysfs_write_file+0xee/0x126
[32882.360733][] vfs_write+0xa3/0x14c
[32882.360736][] sys_write+0x43/0x73
[32882.360739][] system_call_fastpath+0x1a/0x1f
[32882.360743] 
[32882.360743] -> #0 (cpu_hotplug.lock){+.+.+.}:
[32882.360747][] __lock_acquire+0xf6b/0x1612
[32882.360751][] lock_acquire+0x148/0x1c3
[32882.360754][] mutex_lock_nested+0x6e/0x2d1
[32882.360757][] get_online_cpus+0x41/0x55
[32882.360760][] synchronize_sched_expedited+0x26/0xfa
[32882.360766][] __synchronize_srcu+0xa8/0x104
[32882.360769][] synchronize_srcu_expedited+0x15/0x17
[32882.360773][] __kvm_set_memory_region+0x3d8/0x46a 
[kvm]
[32882.360789][] kvm_set_memory_region+0x37/0x50 [kvm]
[32882.360798][] vmx_set_tss_addr+0x4c/0x200 
[kvm_intel]
[32882.360803][] kvm_arch_vm_ioctl+0x160/0x9df [kvm]
[32882.360816][] kvm_vm_ioctl+0x36a/0x39c [kvm]
[32882.360825][] vfs_ioctl+0x24/0x2f
[32882.360829][] do_vfs_ioctl+0x412/0x455
[32882.360832][] sys_ioctl+0x56/0x7b
[32882.360835][] system_call_fastpath+0x1a/0x1f
[32882.360839] 
[32882.360839] other info that might help us debug this:
[32882.360840] 
[32882.360842] Chain exists of:
[32882.360842]   cpu_hotplug.lock --> dbs_mutex --> &sp->mutex
[32882.360847] 
[32882.360848]  Possible unsafe locking scenario:
[32882.360849] 
[32882.360851]CPU0CPU1
[32882.360852]
[32882.360854]   lock(&sp->mutex);
[32882.360856]lock(dbs_mutex);
[32882.360859]lock(&sp->mutex);

Re: [RFC] sched: make callers check lock contention for cond_resched_lock()

2012-05-03 Thread Michael Wang

On 05/03/2012 09:00 PM, Takuya Yoshikawa wrote:

> On Thu, 03 May 2012 14:29:10 +0200
> Peter Zijlstra  wrote:
> 
>> On Thu, 2012-05-03 at 21:22 +0900, Takuya Yoshikawa wrote:
>>> Although the real use case is out of this RFC patch, we are now discussing
>>> a case in which we may hold a spin_lock for long time, ms order, depending
>>> on workload;  and in that case, other threads -- VCPU threads -- should be
>>> given higher priority for that problematic lock. 
>>
>> Firstly, if you can hold a lock that long, it shouldn't be a spinlock,
> 
> I agree with you in principle, but isn't cond_resched_lock() there for that?
> 
>> secondly why isn't TIF_RESCHED being set if its running that long? That
>> should still make cond_resched_lock() break.
> 
> I see.
> 
> I did some tests using spin_is_contended() and need_resched() and saw
> that need_resched() was called as often as spin_is_contended(), so
> experimentally I understand your point.
> 
> But as I could not see why spin_needbreak() was differently implemented
> depending on CONFIG_PREEMPT, I wanted to understand the meaning.


I think enable CONFIG_PREEMPT means allow preemption in kernel, so if
disabled, we can't reschedule a task if it is running in kernel not the
user space at a given time.

As the cond_resched_lock() was invoked in kernel, and looks like
cpu_relax() will give up cpu(I'm not sure whether this will invoke
schedule on some arch, just because that name...), so we can't do break
if CONFIG_PREEMPT disabled, because that will cause kernel preemption
while not allowed.

May be that's the reason why we need to consider CONFIG_PREEMPT in
spin_needbreak().

Regards,
Michael Wang

> 
> Thanks,
>   Takuya
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

2012-05-03 Thread Nikunj A Dadhania

On Wed, 02 May 2012 12:20:40 +0200, Peter Zijlstra  wrote:
[...] 
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index f96a5b5..8ca33e9 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -19,6 +19,8 @@
>  #include 
>  #include 
>  
> +static inline void tlb_remove_page(struct mmu_gather *tlb, struct page 
> *page);
> +
>  #ifdef CONFIG_HAVE_RCU_TABLE_FREE
>  /*
>   * Semi RCU freeing of the page directories.
> @@ -60,6 +62,13 @@ struct mmu_table_batch {
>  extern void tlb_table_flush(struct mmu_gather *tlb);
>  extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
>  
> +#else
> +
> +static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
> +{
> + tlb_remove_page(tlb, page);
>
tlb_remove_page(tlb, table);

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

42 matches

Mail list logo