[Qemu-devel] [PATCH] kvm_irqchip_assign_irqfd: just set irqfd in case of kvm_irqfds_enabled()

2014-12-26 Thread Tiejun Chen
We should avoid to set irqfd{} unconditionally.

Signed-off-by: Tiejun Chen 
---
 kvm-all.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 18cc6b4..5b9786b 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1257,21 +1257,21 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, 
MSIMessage msg)
 static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int rfd, int virq,
 bool assign)
 {
-struct kvm_irqfd irqfd = {
-.fd = fd,
-.gsi = virq,
-.flags = assign ? 0 : KVM_IRQFD_FLAG_DEASSIGN,
-};
+struct kvm_irqfd irqfd = {};
+
+if (!kvm_irqfds_enabled()) {
+return -ENOSYS;
+}
+
+irqfd.fd = fd;
+irqfd.gsi = virq;
+irqfd.flags = assign ? 0 : KVM_IRQFD_FLAG_DEASSIGN;
 
 if (rfd != -1) {
 irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
 irqfd.resamplefd = rfd;
 }
 
-if (!kvm_irqfds_enabled()) {
-return -ENOSYS;
-}
-
 return kvm_vm_ioctl(s, KVM_IRQFD, &irqfd);
 }
 
-- 
1.9.1




Re: [Qemu-devel] Hyvää Joulua & Kiitos!

2014-12-26 Thread Stefan Hajnoczi
On Thu, Dec 25, 2014 at 2:29 AM, Andrew "Truck" Holland
 wrote:
> Again, thank you for putting this together.  I've enjoyed watching
> each day unfold, several times looking at the site in anticipation
> before the "next day" came up and being dissapointed it wasn't out
> yet.  I now have finally downloaded all the images and source, and
> will be actually spending some of today testing them.  (Yes, I waited
> (:  )

Thanks for the kind email!

Regarding next year, we got a lot of the low-hanging fruit so it may
be harder next time around.  But let's see what happens... :)

Stefan



Re: [Qemu-devel] [PATCH 1/2] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes

2014-12-26 Thread Roman Kagan
On Thu, Dec 25, 2014 at 08:37:29AM +0300, Denis V. Lunev wrote:
> +#ifdef CONFIG_FALLOCATE_ZERO_RANGE
> +do {
> +if (fallocate(s->fd, CONFIG_FALLOCATE_ZERO_RANGE,

Must be a typo, FALLOC_FL_ZERO_RANGE is what you mean.

Roman.



Re: [Qemu-devel] [PATCH 2/2] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes

2014-12-26 Thread Roman Kagan
On Thu, Dec 25, 2014 at 08:37:30AM +0300, Denis V. Lunev wrote:
> This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.
> The idea is that FALLOC_FL_PUNCH_HOLE could not increase file size
> but it cleans already allocated blocks inside the file. If we have to
> create something new, simple fallocate will do the job.
> 
> This should increase performance a bit for not-so-modern kernels or for
> filesystems which do not support FALLOC_FL_ZERO_RANGE.
> 
> Signed-off-by: Denis V. Lunev 
> CC: Kevin Wolf 
> CC: Stefan Hajnoczi 
> ---
>  block/raw-posix.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 9e66cb7..60972a1 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -930,6 +930,18 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData 
> *aiocb)
>  
>  ret = -errno;
>  #endif
> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +do {
> +if (fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +  aiocb->aio_offset, aiocb->aio_nbytes) == 0 &&
> +fallocate(s->fd, 0,
> +  aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
> +return 0;
> +}
> +} while (errno == EINTR);
> +
> +ret = -errno;
> +#endif

This is suboptimal in that fallocate(FALLOC_FL_ZERO_RANGE) would always
be called in vain for such systems.  Might be worth another flag in
BDRVRawState?

Roman.



Re: [Qemu-devel] [PATCH] kvm_irqchip_assign_irqfd: just set irqfd in case of kvm_irqfds_enabled()

2014-12-26 Thread Peter Maydell
On 26 December 2014 at 08:05, Tiejun Chen  wrote:
> We should avoid to set irqfd{} unconditionally.
>
> Signed-off-by: Tiejun Chen 

Is there a hot path that we use this on such that the difference
in code order matters at all?

thanks
-- PMM



Re: [Qemu-devel] [PATCH] kvm_irqchip_assign_irqfd: just set irqfd in case of kvm_irqfds_enabled()

2014-12-26 Thread Denis V. Lunev

On 26/12/14 13:00, Peter Maydell wrote:

On 26 December 2014 at 08:05, Tiejun Chen  wrote:

We should avoid to set irqfd{} unconditionally.

Signed-off-by: Tiejun Chen 


Is there a hot path that we use this on such that the difference
in code order matters at all?

thanks
-- PMM



IMHO the patch does not change anything even on hot-hot path.
the declaration 'struct kvm_irqfd irqfd = {};' will
result in memset inside.

Thus in order to achieve declared goal author should
declare
   struct kvm_irqfd irqfd;
and perform
   memset(&irqfd, 0, sizeof(irqfd));
later after the check.

Regards,
Den



[Qemu-devel] [Bug 1405385] Re: QEMU crashes when virtio network cards are used together with e1000 network cards

2014-12-26 Thread Michael Tokarev
I can't reproduce this, no matter how hard I try. Tried 4 virtio devices
and 4 e1000 devices (8 network devices in total).  Tried 2.1 and 2.2 and
current git.  It all Just Works (tm).  What I'm doing wrong? :)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1405385

Title:
  QEMU crashes when virtio network cards are used together with e1000
  network cards

Status in QEMU:
  New

Bug description:
  QEMU version: QEMU emulator version 2.2.50, Copyright (c) 2003-2008 Fabrice 
Bellard
  QEMU GIT version: ab0302ee764fd702465aef6d88612cdff4302809
  Configure flags: ./configure --enable-kvm --prefix=/opt/qemu-devel
  Linux version: Ubuntu 14.04.1 LTS
  Kernel version: 3.13.0-43-generic #72-Ubuntu SMP Mon Dec 8 19:35:06 UTC 2014 
x86_64 x86_64 x86_64 GNU/Linux

  Problem:

  QEMU crashes when using one (or more) virtio network cards
  together with one (or more) e1000 (and possibly others) network cards
  when those cards are bound to a linux bridge. When the cards are *not*
  bound to a bridge QEMU does not crash.

  Bridge configuration:

iface bridge0 inet dhcp
bridge_ports eth1
bridge_stp off
bridge_fd 0

  Start-up command (including binding the network cards to the bridge +
  strace logging):

  ./qemu-system-x86_64 -daemonize -smp 1 -m 128 -vnc 0.0.0.0:0 \
  -netdev tap,id=tap_1,script=no,downscript=no,ifname=net_1_1,vhost=on \
  -device 
virtio-net-pci,bootindex=1,id=nic_1,netdev=tap_1,mac=02:16:3F:00:00:FA \
  -netdev tap,id=tap_2,script=no,downscript=no,ifname=net_1_2 \
  -device e1000,bootindex=2,id=nic_2,netdev=tap_2,mac=02:16:3F:00:00:FB; \
  brctl addif bridge0 net_1_1; \
  brctl addif bridge0 net_1_2; \
  ifconfig net_1_1 0.0.0.0 up; \
  ifconfig net_1_2 0.0.0.0 up; \
  sleep 2; \
  strace -p `ps x |grep qemu-system-x86_64 |grep -v grep|awk '{print $1}'` -o 
/tmp/qemu-devel-trace.txt 

  Kernel log:

  Dec 24 11:12:08 bramws kernel: [12466.885581] device net_1_1 entered 
promiscuous mode
  Dec 24 11:12:08 bramws kernel: [12466.886238] device net_1_2 entered 
promiscuous mode
  Dec 24 11:12:08 bramws kernel: [12466.887084] bridge0: port 2(net_1_1) 
entered forwarding state
  Dec 24 11:12:08 bramws kernel: [12466.887089] bridge0: port 2(net_1_1) 
entered forwarding state
  Dec 24 11:12:08 bramws kernel: [12466.888940] bridge0: port 3(net_1_2) 
entered forwarding state
  Dec 24 11:12:08 bramws kernel: [12466.888947] bridge0: port 3(net_1_2) 
entered forwarding state
  Dec 24 11:12:29 bramws kernel: [12488.026376] bridge0: port 2(net_1_1) 
entered disabled state
  Dec 24 11:12:29 bramws kernel: [12488.026820] device net_1_1 left promiscuous 
mode
  Dec 24 11:12:29 bramws kernel: [12488.026832] bridge0: port 2(net_1_1) 
entered disabled state
  Dec 24 11:12:29 bramws kernel: [12488.049636] bridge0: port 3(net_1_2) 
entered disabled state
  Dec 24 11:12:29 bramws kernel: [12488.050058] device net_1_2 left promiscuous 
mode
  Dec 24 11:12:29 bramws kernel: [12488.050074] bridge0: port 3(net_1_2) 
entered disabled state

  Strace log: (full log attached)

  ppoll([{fd=13, events=POLLIN|POLLERR|POLLHUP}, {fd=7, 
events=POLLIN|POLLERR|POLLHUP}, {fd=12, events=POLLIN|POLLERR|POLLHUP}, {fd=3, 
events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}], 
6, {0, 28646613}, NULL, 8) = 0 (Timeout)
  write(5, "\1\0\0\0\0\0\0\0", 8) = 8
  ppoll([{fd=13, events=POLLIN|POLLERR|POLLHUP}, {fd=7, 
events=POLLIN|POLLERR|POLLHUP}, {fd=12, events=POLLIN|POLLERR|POLLHUP}, {fd=3, 
events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}], 
6, {0, 10899760}, NULL, 8) = 1 ([{fd=5, revents=POLLIN}], left {0, 10895457})
  write(6, "\1\0\0\0\0\0\0\0", 8) = 8
  read(5, "\1\0\0\0\0\0\0\0", 512)= 8
  write(6, "\1\0\0\0\0\0\0\0", 8) = 8
  ppoll([{fd=13, events=POLLIN|POLLERR|POLLHUP}, {fd=7, 
events=POLLIN|POLLERR|POLLHUP}, {fd=12, events=POLLIN|POLLERR|POLLHUP}, {fd=3, 
events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}], 
6, {0, 0}, NULL, 8) = 1 ([{fd=6, revents=POLLIN}], left {0, 0})
  ppoll([{fd=13, events=POLLIN|POLLERR|POLLHUP}, {fd=7, 
events=POLLIN|POLLERR|POLLHUP}, {fd=12, events=POLLIN|POLLERR|POLLHUP}, {fd=3, 
events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}], 
6, {0, 0}, NULL, 8) = 1 ([{fd=6, revents=POLLIN}], left {0, 0})
  read(6, "\2\0\0\0\0\0\0\0", 16) = 8
  ppoll([{fd=13, events=POLLIN|POLLERR|POLLHUP}, {fd=7, 
events=POLLIN|POLLERR|POLLHUP}, {fd=12, events=POLLIN|POLLERR|POLLHUP}, {fd=3, 
events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}], 
6, {0, 0}, NULL, 8) = 0 (Timeout)
  read(6, 0x7fff697320e0, 16) = -1 EAGAIN (Resource temporarily 
unavailable)
  ppoll([{fd=13, events=POLLIN|POLLERR|POLLHUP}, {fd=7, 
events=POLLIN|POLLERR|POLLHUP}, {fd=12, events=POLLIN|POLLERR|POLLHUP}, {fd=3, 
events=POLLIN|POLLERR|POLLHU

[Qemu-devel] [PATCH 3/7] block/raw-posix: create do_fallocate helper

2014-12-26 Thread Denis V. Lunev
The pattern
do {
if (fallocate(s->fd, mode, offset, len) == 0) {
return 0;
}
} while (errno == EINTR);
is used twice at the moment and I am going to add more usages. Move it
to the helper function.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Stefan Hajnoczi 
---
 block/raw-posix.c | 33 +
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 66ebaab..a7c8816 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -893,6 +893,19 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, 
uint64_t bytes)
 }
 #endif
 
+
+#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || 
defined(CONFIG_FALLOCATE_ZERO_RANGE)
+static int do_fallocate(int fd, int mode, off_t offset, off_t len)
+{
+do {
+if (fallocate(fd, mode, offset, len) == 0) {
+return 0;
+}
+} while (errno == EINTR);
+return -errno;
+}
+#endif
+
 static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 {
 int ret = -EOPNOTSUPP;
@@ -921,14 +934,8 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData 
*aiocb)
 #endif
 
 #ifdef CONFIG_FALLOCATE_ZERO_RANGE
-do {
-if (fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
-  aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
-return 0;
-}
-} while (errno == EINTR);
-
-ret = -errno;
+ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
+   aiocb->aio_offset, aiocb->aio_nbytes);
 #endif
 }
 
@@ -968,14 +975,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 #endif
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
-do {
-if (fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-  aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
-return 0;
-}
-} while (errno == EINTR);
-
-ret = -errno;
+ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+   aiocb->aio_offset, aiocb->aio_nbytes);
 #endif
 }
 
-- 
1.9.1




[Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs

2014-12-26 Thread Denis V. Lunev
The check for maximum length was added by
  commit c31cb70728d2c0c8900b35a66784baa446fd5147
  Author: Peter Lieven 
  Date:   Thu Oct 24 12:06:58 2013 +0200
block: honour BlockLimits in bdrv_co_do_write_zeroes

but actually if driver provides .bdrv_co_write_zeroes callback, there is
no need to limit the size to 32 MB. Callback should provide effective
implementation which normally should not write any zeroes in comparable
amount.

Correct check should be as follows to honour BlockLimits for slow writes:
- if bs->bl.max_write_zeroes is specified, it should be applied to all
  paths (fast and slow)
- MAX_WRITE_ZEROES_DEFAULT should be applied for slow path only

Signed-off-by: Denis V. Lunev 
CC: Eric Blake 
CC: Peter Lieven 
CC: Kevin Wolf 
CC: Stefan Hajnoczi 
---
 block.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index 4165d42..0ec8b15 100644
--- a/block.c
+++ b/block.c
@@ -3182,8 +3182,12 @@ static int coroutine_fn 
bdrv_co_do_write_zeroes(BlockDriverState *bs,
 struct iovec iov = {0};
 int ret = 0;
 
-int max_write_zeroes = bs->bl.max_write_zeroes ?
-   bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
+int max_fast_write_size = nb_sectors;
+int max_slow_write_size = MAX_WRITE_ZEROES_DEFAULT;
+if (bs->bl.max_write_zeroes != 0) {
+max_fast_write_size = bs->bl.max_write_zeroes;
+max_slow_write_size = bs->bl.max_write_zeroes;
+}
 
 while (nb_sectors > 0 && !ret) {
 int num = nb_sectors;
@@ -3205,9 +3209,9 @@ static int coroutine_fn 
bdrv_co_do_write_zeroes(BlockDriverState *bs,
 }
 }
 
-/* limit request size */
-if (num > max_write_zeroes) {
-num = max_write_zeroes;
+/* limit request size for fast path */
+if (num > max_fast_write_size) {
+num = max_fast_write_size;
 }
 
 ret = -ENOTSUP;
@@ -3217,6 +3221,11 @@ static int coroutine_fn 
bdrv_co_do_write_zeroes(BlockDriverState *bs,
 }
 
 if (ret == -ENOTSUP) {
+/* limit request size further for slow path */
+if (num > max_slow_write_size) {
+num = max_slow_write_size;
+}
+
 /* Fall back to bounce buffer if write zeroes is unsupported */
 iov.iov_len = num * BDRV_SECTOR_SIZE;
 if (iov.iov_base == NULL) {
@@ -3234,7 +3243,7 @@ static int coroutine_fn 
bdrv_co_do_write_zeroes(BlockDriverState *bs,
 /* Keep bounce buffer around if it is big enough for all
  * all future requests.
  */
-if (num < max_write_zeroes) {
+if (num < max_slow_write_size) {
 qemu_vfree(iov.iov_base);
 iov.iov_base = NULL;
 }
-- 
1.9.1




[Qemu-devel] [PATCH 7/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes

2014-12-26 Thread Denis V. Lunev
There is a possibility that we are extending our image and thus writing
zeroes beyond end of the file. In this case we do not need to care
about the hole to make sure that there is no data in the file under
this offset (pre-condition to fallocate(0) to work). We could simply call
fallocate(0).

This improves the performance of writing zeroes even on really old
platforms which do not have even FALLOC_FL_PUNCH_HOLE.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Stefan Hajnoczi 
---
 block/raw-posix.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 96a8678..334f818 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -60,7 +60,7 @@
 #define FS_NOCOW_FL 0x0080 /* Do not cow file */
 #endif
 #endif
-#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || 
defined(CONFIG_FALLOCATE_ZERO_RANGE)
+#ifdef CONFIG_FALLOCATE
 #include 
 #endif
 #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -903,7 +903,7 @@ static int translate_err(int err)
 return err;
 }
 
-#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || 
defined(CONFIG_FALLOCATE_ZERO_RANGE)
+#ifdef CONFIG_FALLOCATE
 static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 {
 do {
@@ -957,6 +957,10 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData 
*aiocb)
 
 s = aiocb->bs->opaque;
 
+if (aiocb->aio_offset >= aiocb->bs->total_sectors << BDRV_SECTOR_BITS) {
+return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
+}
+
 #ifdef CONFIG_FALLOCATE_ZERO_RANGE
 if (s->has_write_zeroes) {
 int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
-- 
1.9.1




[Qemu-devel] [PATCH v2 0/7] eliminate data write in bdrv_write_zeroes on Linux

2014-12-26 Thread Denis V. Lunev
These patches eliminate data writes completely on Linux if fallocate
FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are  supported on
underlying filesystem.

This should seriously increase performance in some special cases.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Stefan Hajnoczi 




[Qemu-devel] [PATCH 6/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes

2014-12-26 Thread Denis V. Lunev
This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.

Simple fallocate(0) will extend file with zeroes when appropriate in the
middle of the file if there is a hole there and at the end of the file.
Unfortunately fallocate(0) does not drop the content of the file if
there is a data on this offset. Therefore to make the situation consistent
we should drop the data beforehand. This is done using FALLOC_FL_PUNCH_HOLE

This should increase the performance a bit for not-so-modern kernels or for
filesystems which do not support FALLOC_FL_ZERO_RANGE.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Stefan Hajnoczi 
---
 block/raw-posix.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 7866d31..96a8678 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -968,6 +968,23 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData 
*aiocb)
 #endif
 
 s->has_write_zeroes = false;
+
+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+if (s->has_discard) {
+int ret;
+ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+   aiocb->aio_offset, aiocb->aio_nbytes);
+if (ret < 0) {
+if (ret == -ENOTSUP) {
+s->has_discard = false;
+}
+return ret;
+}
+return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
+}
+#endif
+
+s->has_discard = false;
 return -ENOTSUP;
 }
 
-- 
1.9.1




[Qemu-devel] [PATCH 4/7] block/raw-posix: create translate_err helper to merge errno values

2014-12-26 Thread Denis V. Lunev
actually the code
if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
ret == -ENOTTY) {
ret = -ENOTSUP;
}
is present twice and will be added a couple more times. Create helper
for this. Place it into do_fallocate() for further convinience.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Stefan Hajnoczi 
---
 block/raw-posix.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index a7c8816..25a6947 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -894,6 +894,15 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, 
uint64_t bytes)
 #endif
 
 
+static int translate_err(int err)
+{
+if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP ||
+err == -ENOTTY) {
+err = -ENOTSUP;
+}
+return err;
+}
+
 #if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || 
defined(CONFIG_FALLOCATE_ZERO_RANGE)
 static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 {
@@ -902,7 +911,7 @@ static int do_fallocate(int fd, int mode, off_t offset, 
off_t len)
 return 0;
 }
 } while (errno == EINTR);
-return -errno;
+return translate_err(-errno);
 }
 #endif
 
@@ -939,10 +948,9 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData 
*aiocb)
 #endif
 }
 
-if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
-ret == -ENOTTY) {
+ret = translate_err(ret);
+if (ret == -ENOTSUP) {
 s->has_write_zeroes = false;
-ret = -ENOTSUP;
 }
 return ret;
 }
@@ -980,10 +988,9 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 #endif
 }
 
-if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
-ret == -ENOTTY) {
+ret = translate_err(ret);
+if (ret == -ENOTSUP) {
 s->has_discard = false;
-ret = -ENOTSUP;
 }
 return ret;
 }
-- 
1.9.1




[Qemu-devel] [PATCH 2/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes

2014-12-26 Thread Denis V. Lunev
This efficiently writes zeroes on Linux if the kernel is capable enough.
FALLOC_FL_ZERO_RANGE correctly handles all cases, including and not
including file expansion.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Stefan Hajnoczi 
---
 block/raw-posix.c | 13 -
 configure | 19 +++
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index e51293a..66ebaab 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -60,7 +60,7 @@
 #define FS_NOCOW_FL 0x0080 /* Do not cow file */
 #endif
 #endif
-#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || 
defined(CONFIG_FALLOCATE_ZERO_RANGE)
 #include 
 #endif
 #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -919,6 +919,17 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData 
*aiocb)
 return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
 }
 #endif
+
+#ifdef CONFIG_FALLOCATE_ZERO_RANGE
+do {
+if (fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
+  aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
+return 0;
+}
+} while (errno == EINTR);
+
+ret = -errno;
+#endif
 }
 
 if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
diff --git a/configure b/configure
index cae588c..dfcf7b3 100755
--- a/configure
+++ b/configure
@@ -3309,6 +3309,22 @@ if compile_prog "" "" ; then
   fallocate_punch_hole=yes
 fi
 
+# check that fallocate supports range zeroing inside the file
+fallocate_zero_range=no
+cat > $TMPC << EOF
+#include 
+#include 
+
+int main(void)
+{
+fallocate(0, FALLOC_FL_ZERO_RANGE, 0, 0);
+return 0;
+}
+EOF
+if compile_prog "" "" ; then
+  fallocate_zero_range=yes
+fi
+
 # check for posix_fallocate
 posix_fallocate=no
 cat > $TMPC << EOF
@@ -4538,6 +4554,9 @@ fi
 if test "$fallocate_punch_hole" = "yes" ; then
   echo "CONFIG_FALLOCATE_PUNCH_HOLE=y" >> $config_host_mak
 fi
+if test "$fallocate_zero_range" = "yes" ; then
+  echo "CONFIG_FALLOCATE_ZERO_RANGE=y" >> $config_host_mak
+fi
 if test "$posix_fallocate" = "yes" ; then
   echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak
 fi
-- 
1.9.1




[Qemu-devel] [PATCH 5/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit

2014-12-26 Thread Denis V. Lunev
move code dealing with a block device to a separate function. This will
allow to implement additional processing for an ordinary files.

Pls note, that xfs_code has been moved before checking for
s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside.
This makes code a bit more consistent.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Stefan Hajnoczi 
---
 block/raw-posix.c | 60 +++
 1 file changed, 38 insertions(+), 22 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 25a6947..7866d31 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -915,46 +915,62 @@ static int do_fallocate(int fd, int mode, off_t offset, 
off_t len)
 }
 #endif
 
-static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
+static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
 {
 int ret = -EOPNOTSUPP;
 BDRVRawState *s = aiocb->bs->opaque;
 
-if (s->has_write_zeroes == 0) {
+if (!s->has_write_zeroes) {
 return -ENOTSUP;
 }
 
-if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
 #ifdef BLKZEROOUT
-do {
-uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
-if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
-return 0;
-}
-} while (errno == EINTR);
-
-ret = -errno;
-#endif
-} else {
-#ifdef CONFIG_XFS
-if (s->is_xfs) {
-return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+do {
+uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
+if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
+return 0;
 }
-#endif
+} while (errno == EINTR);
 
-#ifdef CONFIG_FALLOCATE_ZERO_RANGE
-ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
-   aiocb->aio_offset, aiocb->aio_nbytes);
+ret = translate_err(-errno);
 #endif
-}
 
-ret = translate_err(ret);
 if (ret == -ENOTSUP) {
 s->has_write_zeroes = false;
 }
 return ret;
 }
 
+static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
+{
+BDRVRawState *s;
+
+if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
+return handle_aiocb_write_zeroes_block(aiocb);
+}
+
+#ifdef CONFIG_XFS
+if (s->is_xfs) {
+return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+}
+#endif
+
+s = aiocb->bs->opaque;
+
+#ifdef CONFIG_FALLOCATE_ZERO_RANGE
+if (s->has_write_zeroes) {
+int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
+   aiocb->aio_offset, aiocb->aio_nbytes);
+if (ret == 0 && ret != -ENOTSUP) {
+return ret;
+}
+}
+#endif
+
+s->has_write_zeroes = false;
+return -ENOTSUP;
+}
+
 static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 {
 int ret = -EOPNOTSUPP;
-- 
1.9.1




Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs

2014-12-26 Thread Peter Lieven
Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:
> The check for maximum length was added by
>   commit c31cb70728d2c0c8900b35a66784baa446fd5147
>   Author: Peter Lieven 
>   Date:   Thu Oct 24 12:06:58 2013 +0200
> block: honour BlockLimits in bdrv_co_do_write_zeroes
>
> but actually if driver provides .bdrv_co_write_zeroes callback, there is
> no need to limit the size to 32 MB. Callback should provide effective
> implementation which normally should not write any zeroes in comparable
> amount.

NACK.

First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation.
This heaviliy depends on several circumstances that the block layer is not 
aware of.
If a specific protocol knows it is very fast in writing zeroes under any 
circumstance
it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed 
to
return -ENOTSUP if the request size or alignment doesn't fit.

There are known backends e.g. anything that deals with SCSI that have a known
limitation of the maximum number of zeroes they can write fast in a single 
request.
This number MUST NOT be exceeded. The below patch would break all those 
backends.

What issue are you trying to fix with this patch? Maybe there is a better way 
to fix
it at another point in the code.

Peter

>
> Correct check should be as follows to honour BlockLimits for slow writes:
> - if bs->bl.max_write_zeroes is specified, it should be applied to all
>   paths (fast and slow)
> - MAX_WRITE_ZEROES_DEFAULT should be applied for slow path only
>
> Signed-off-by: Denis V. Lunev 
> CC: Eric Blake 
> CC: Peter Lieven 
> CC: Kevin Wolf 
> CC: Stefan Hajnoczi 
> ---
>  block.c | 21 +++--
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/block.c b/block.c
> index 4165d42..0ec8b15 100644
> --- a/block.c
> +++ b/block.c
> @@ -3182,8 +3182,12 @@ static int coroutine_fn 
> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>  struct iovec iov = {0};
>  int ret = 0;
>  
> -int max_write_zeroes = bs->bl.max_write_zeroes ?
> -   bs->bl.max_write_zeroes : 
> MAX_WRITE_ZEROES_DEFAULT;
> +int max_fast_write_size = nb_sectors;
> +int max_slow_write_size = MAX_WRITE_ZEROES_DEFAULT;
> +if (bs->bl.max_write_zeroes != 0) {
> +max_fast_write_size = bs->bl.max_write_zeroes;
> +max_slow_write_size = bs->bl.max_write_zeroes;
> +}
>  
>  while (nb_sectors > 0 && !ret) {
>  int num = nb_sectors;
> @@ -3205,9 +3209,9 @@ static int coroutine_fn 
> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>  }
>  }
>  
> -/* limit request size */
> -if (num > max_write_zeroes) {
> -num = max_write_zeroes;
> +/* limit request size for fast path */
> +if (num > max_fast_write_size) {
> +num = max_fast_write_size;
>  }
>  
>  ret = -ENOTSUP;
> @@ -3217,6 +3221,11 @@ static int coroutine_fn 
> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>  }
>  
>  if (ret == -ENOTSUP) {
> +/* limit request size further for slow path */
> +if (num > max_slow_write_size) {
> +num = max_slow_write_size;
> +}
> +
>  /* Fall back to bounce buffer if write zeroes is unsupported */
>  iov.iov_len = num * BDRV_SECTOR_SIZE;
>  if (iov.iov_base == NULL) {
> @@ -3234,7 +3243,7 @@ static int coroutine_fn 
> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>  /* Keep bounce buffer around if it is big enough for all
>   * all future requests.
>   */
> -if (num < max_write_zeroes) {
> +if (num < max_slow_write_size) {
>  qemu_vfree(iov.iov_base);
>  iov.iov_base = NULL;
>  }




[Qemu-devel] [Bug 1405385] Re: QEMU crashes when virtio network cards are used together with e1000 network cards

2014-12-26 Thread Bram Klein Gunnewiek
I think your just not trying hard enough ;-)

I have reproduced this on four different (bare metal) machines. I used
the default ubuntu QEMU (2.0.0) and the latest GIT version. All machines
where running ubuntu 14.04. I also tried to reproduce it on a virtualbox
VM and I could only get it to crash when I put the network card of my
virtual machine (the virtualbox one) in promiscuous modus. If I would
set promiscuous modus to "deny all" in virtual box QEMU would not crash.

I will do the gdb debug trace when I get back at work, I don't have a
suitable system available at home to test this with.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1405385

Title:
  QEMU crashes when virtio network cards are used together with e1000
  network cards

Status in QEMU:
  New

Bug description:
  QEMU version: QEMU emulator version 2.2.50, Copyright (c) 2003-2008 Fabrice 
Bellard
  QEMU GIT version: ab0302ee764fd702465aef6d88612cdff4302809
  Configure flags: ./configure --enable-kvm --prefix=/opt/qemu-devel
  Linux version: Ubuntu 14.04.1 LTS
  Kernel version: 3.13.0-43-generic #72-Ubuntu SMP Mon Dec 8 19:35:06 UTC 2014 
x86_64 x86_64 x86_64 GNU/Linux

  Problem:

  QEMU crashes when using one (or more) virtio network cards
  together with one (or more) e1000 (and possibly others) network cards
  when those cards are bound to a linux bridge. When the cards are *not*
  bound to a bridge QEMU does not crash.

  Bridge configuration:

iface bridge0 inet dhcp
bridge_ports eth1
bridge_stp off
bridge_fd 0

  Start-up command (including binding the network cards to the bridge +
  strace logging):

  ./qemu-system-x86_64 -daemonize -smp 1 -m 128 -vnc 0.0.0.0:0 \
  -netdev tap,id=tap_1,script=no,downscript=no,ifname=net_1_1,vhost=on \
  -device 
virtio-net-pci,bootindex=1,id=nic_1,netdev=tap_1,mac=02:16:3F:00:00:FA \
  -netdev tap,id=tap_2,script=no,downscript=no,ifname=net_1_2 \
  -device e1000,bootindex=2,id=nic_2,netdev=tap_2,mac=02:16:3F:00:00:FB; \
  brctl addif bridge0 net_1_1; \
  brctl addif bridge0 net_1_2; \
  ifconfig net_1_1 0.0.0.0 up; \
  ifconfig net_1_2 0.0.0.0 up; \
  sleep 2; \
  strace -p `ps x |grep qemu-system-x86_64 |grep -v grep|awk '{print $1}'` -o 
/tmp/qemu-devel-trace.txt 

  Kernel log:

  Dec 24 11:12:08 bramws kernel: [12466.885581] device net_1_1 entered 
promiscuous mode
  Dec 24 11:12:08 bramws kernel: [12466.886238] device net_1_2 entered 
promiscuous mode
  Dec 24 11:12:08 bramws kernel: [12466.887084] bridge0: port 2(net_1_1) 
entered forwarding state
  Dec 24 11:12:08 bramws kernel: [12466.887089] bridge0: port 2(net_1_1) 
entered forwarding state
  Dec 24 11:12:08 bramws kernel: [12466.888940] bridge0: port 3(net_1_2) 
entered forwarding state
  Dec 24 11:12:08 bramws kernel: [12466.888947] bridge0: port 3(net_1_2) 
entered forwarding state
  Dec 24 11:12:29 bramws kernel: [12488.026376] bridge0: port 2(net_1_1) 
entered disabled state
  Dec 24 11:12:29 bramws kernel: [12488.026820] device net_1_1 left promiscuous 
mode
  Dec 24 11:12:29 bramws kernel: [12488.026832] bridge0: port 2(net_1_1) 
entered disabled state
  Dec 24 11:12:29 bramws kernel: [12488.049636] bridge0: port 3(net_1_2) 
entered disabled state
  Dec 24 11:12:29 bramws kernel: [12488.050058] device net_1_2 left promiscuous 
mode
  Dec 24 11:12:29 bramws kernel: [12488.050074] bridge0: port 3(net_1_2) 
entered disabled state

  Strace log: (full log attached)

  ppoll([{fd=13, events=POLLIN|POLLERR|POLLHUP}, {fd=7, 
events=POLLIN|POLLERR|POLLHUP}, {fd=12, events=POLLIN|POLLERR|POLLHUP}, {fd=3, 
events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}], 
6, {0, 28646613}, NULL, 8) = 0 (Timeout)
  write(5, "\1\0\0\0\0\0\0\0", 8) = 8
  ppoll([{fd=13, events=POLLIN|POLLERR|POLLHUP}, {fd=7, 
events=POLLIN|POLLERR|POLLHUP}, {fd=12, events=POLLIN|POLLERR|POLLHUP}, {fd=3, 
events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}], 
6, {0, 10899760}, NULL, 8) = 1 ([{fd=5, revents=POLLIN}], left {0, 10895457})
  write(6, "\1\0\0\0\0\0\0\0", 8) = 8
  read(5, "\1\0\0\0\0\0\0\0", 512)= 8
  write(6, "\1\0\0\0\0\0\0\0", 8) = 8
  ppoll([{fd=13, events=POLLIN|POLLERR|POLLHUP}, {fd=7, 
events=POLLIN|POLLERR|POLLHUP}, {fd=12, events=POLLIN|POLLERR|POLLHUP}, {fd=3, 
events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}], 
6, {0, 0}, NULL, 8) = 1 ([{fd=6, revents=POLLIN}], left {0, 0})
  ppoll([{fd=13, events=POLLIN|POLLERR|POLLHUP}, {fd=7, 
events=POLLIN|POLLERR|POLLHUP}, {fd=12, events=POLLIN|POLLERR|POLLHUP}, {fd=3, 
events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}], 
6, {0, 0}, NULL, 8) = 1 ([{fd=6, revents=POLLIN}], left {0, 0})
  read(6, "\2\0\0\0\0\0\0\0", 16) = 8
  ppoll([{fd=13, events=POLLIN|POLLERR|POLLHUP}, {fd=7, 
events=POLLIN|POLLERR|POLLHUP}, {fd=12, events=POLLIN|POLLE

Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs

2014-12-26 Thread Denis V. Lunev

On 26/12/14 16:13, Peter Lieven wrote:

Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:

The check for maximum length was added by
   commit c31cb70728d2c0c8900b35a66784baa446fd5147
   Author: Peter Lieven 
   Date:   Thu Oct 24 12:06:58 2013 +0200
 block: honour BlockLimits in bdrv_co_do_write_zeroes

but actually if driver provides .bdrv_co_write_zeroes callback, there is
no need to limit the size to 32 MB. Callback should provide effective
implementation which normally should not write any zeroes in comparable
amount.


NACK.

First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation.
This heaviliy depends on several circumstances that the block layer is not 
aware of.
If a specific protocol knows it is very fast in writing zeroes under any 
circumstance
it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed 
to
return -ENOTSUP if the request size or alignment doesn't fit.


the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is
specified, the cost is almost the same for any amount of zeroes
written. This is true for fallocate from my point of view. The amount
of actually written data will be in several orders less than specified
except slow path, which honors 32 MB limit.

If the operation is complex in realization, then it will be rate-limited
below, in actual implementation.


There are known backends e.g. anything that deals with SCSI that have a known
limitation of the maximum number of zeroes they can write fast in a single 
request.
This number MUST NOT be exceeded. The below patch would break all those 
backends.


could you pls point me this backends. Actually, from my point of
view, they should properly setup max_write_zeroes for themselves.
This is done at least in block/iscsi.c and it would be consistent
way of doing so.



What issue are you trying to fix with this patch? Maybe there is a better way 
to fix
it at another point in the code.



I am trying to minimize amount of metadata updates for a file.
This provides some speedup even on ext4 and this will provide
even more speedup with a distributed filesystem like CEPH
where size updates of the files and block allocation are
costly.

Regards,
Den



[Qemu-devel] [PATCH v3 2/5] QJSON: Add JSON writer

2014-12-26 Thread Alexander Graf
To support programmatic JSON assembly while keeping the code that generates it
readable, this patch introduces a simple JSON writer. It emits JSON serially
into a buffer in memory.

The nice thing about this writer is its simplicity and low memory overhead.
Unlike the QMP JSON writer, this one does not need to spawn QObjects for every
element it wants to represent.

This is a prerequisite for the migration stream format description generator.

Signed-off-by: Alexander Graf 

---

v2 -> v3:

  - QOMify the QJSON object, makes for easier destruction
---
 include/qjson.h |  3 ++-
 qjson.c | 39 ---
 2 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/include/qjson.h b/include/qjson.h
index 8f8c145..7c54fdf 100644
--- a/include/qjson.h
+++ b/include/qjson.h
@@ -4,7 +4,7 @@
  * Copyright Alexander Graf
  *
  * Authors:
- *  Alexander Graf 
  *
  * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
  * See the COPYING.LIB file in the top-level directory.
@@ -13,6 +13,7 @@
 #ifndef QEMU_QJSON_H
 #define QEMU_QJSON_H
 
+#define TYPE_QJSON "QJSON"
 typedef struct QJSON QJSON;
 
 QJSON *qjson_new(void);
diff --git a/qjson.c b/qjson.c
index 7a7cd72..8d911d0 100644
--- a/qjson.c
+++ b/qjson.c
@@ -15,8 +15,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 struct QJSON {
+Object obj;
 QString *str;
 bool omit_comma;
 unsigned long self_size_offset;
@@ -85,9 +88,7 @@ const char *qjson_get_str(QJSON *json)
 
 QJSON *qjson_new(void)
 {
-QJSON *json = g_new(QJSON, 1);
-json->str = qstring_from_str("{ ");
-json->omit_comma = true;
+QJSON *json = (QJSON *)object_new(TYPE_QJSON);
 return json;
 }
 
@@ -95,3 +96,35 @@ void qjson_finish(QJSON *json)
 {
 json_end_object(json);
 }
+
+static void qjson_initfn(Object *obj)
+{
+QJSON *json = (QJSON *)object_dynamic_cast(obj, TYPE_QJSON);
+assert(json);
+
+json->str = qstring_from_str("{ ");
+json->omit_comma = true;
+}
+
+static void qjson_finalizefn(Object *obj)
+{
+QJSON *json = (QJSON *)object_dynamic_cast(obj, TYPE_QJSON);
+
+assert(json);
+qobject_decref(QOBJECT(json->str));
+}
+
+static const TypeInfo qjson_type_info = {
+.name = TYPE_QJSON,
+.parent = TYPE_OBJECT,
+.instance_size = sizeof(QJSON),
+.instance_init = qjson_initfn,
+.instance_finalize = qjson_finalizefn,
+};
+
+static void qjson_register_types(void)
+{
+type_register_static(&qjson_type_info);
+}
+
+type_init(qjson_register_types)
-- 
1.7.12.4




[Qemu-devel] [PATCH v3 3/5] qemu-file: Add fast ftell code path

2014-12-26 Thread Alexander Graf
For ftell we flush the output buffer to ensure that we don't have anything
lingering in our internal buffers. This is a very safe thing to do.

However, with the dynamic size measurement that the dynamic vmstate
description will bring this would turn out quite slow.

Instead, we can fast path this specific measurement and just take the
internal buffers into account when telling the kernel our position.

I'm sure I overlooked some corner cases where this doesn't work, so
instead of tuning the safe, existing version, this patch adds a fast
variant of ftell that gets used by the dynamic vmstate description code
which isn't critical when it fails.

Signed-off-by: Alexander Graf 

---

v2 -> v3:

  - improve ftell_fast, now works with bdrv too
---
 include/migration/qemu-file.h |  1 +
 migration/qemu-file.c | 16 
 2 files changed, 17 insertions(+)

diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
index 401676b..6b6772b 100644
--- a/include/migration/qemu-file.h
+++ b/include/migration/qemu-file.h
@@ -112,6 +112,7 @@ QEMUFile *qemu_bufopen(const char *mode, QEMUSizedBuffer 
*input);
 int qemu_get_fd(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 int64_t qemu_ftell(QEMUFile *f);
+int64_t qemu_ftell_fast(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
 /*
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index d2d4007..cdfec56 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -440,6 +440,22 @@ int qemu_get_byte(QEMUFile *f)
 return result;
 }
 
+int64_t qemu_ftell_fast(QEMUFile *f)
+{
+int64_t ret = f->pos;
+int i;
+
+if (f->ops->writev_buffer) {
+for (i = 0; i < f->iovcnt; i++) {
+ret += f->iov[i].iov_len;
+}
+} else {
+ret += f->buf_index;
+}
+
+return ret;
+}
+
 int64_t qemu_ftell(QEMUFile *f)
 {
 qemu_fflush(f);
-- 
1.7.12.4




[Qemu-devel] [PATCH v3 0/5] Migration Deciphering aid

2014-12-26 Thread Alexander Graf
Migration is a black hole to most people. One of the biggest reasons for
this is that its protocol is a secret, undocumented sauce of code rolling
around random parts of the QEMU code base.

But what if we simply exposed the description of how the format looks like
alongside the actual migration stream? This is what this patch set does.

It adds a new section that comes after the end of stream marker (so that it
doesn't slow down migration) that contains a JSON description of the device
state description.

Along with this patch set also comes a python script that can read said JSON
from a migration dump and decipher the device state and ram contents of the
migration dump using it.

With this, you can now fully examine all glorious details that go over the
wire when virtual machine state gets dumped, such as during live migration.

We discussed the approach taken here during KVM Forum 2013. Originally, my idea
was to include a special device that contains the JSON data which can be enabled
on demand. Anthony suggested however to just always include the description data
after the end marker which I think is a great idea.

  Example decoded migration: http://csgraf.de/mig/mig.txt
  Example migration description: http://csgraf.de/mig/mig.desc.txt
  Presentation: https://www.youtube.com/watch?v=iq1x40Qsrew
  Slides: https://www.dropbox.com/s/otp2pk2n3g087zp/Live%20Migration.pdf

v1 -> v2:

  - a lot, v1 was from 2013

v2 -> v3:

  - QOMify the QJSON object, makes for easier destruction
  - improve ftell_fast, now works with bdrv too
  - Dont compress objects with subsections
  - Destroy QJSON object
  - Add tell function to MigrationFile
  - Report where subsections were not found
  - Print ram sections with size
  - Remove foreign desc file support
  - Add memory dump support (-m option)
  - Add -x option to extract all sections into files

Alexander Graf (5):
  QJSON: Add JSON writer
  QJSON: Add JSON writer
  qemu-file: Add fast ftell code path
  migration: Append JSON description of migration stream
  Add migration stream analyzation script

 Makefile.objs |   1 +
 hw/pci/pci.c  |   2 +-
 hw/scsi/spapr_vscsi.c |   2 +-
 hw/virtio/virtio.c|   2 +-
 include/migration/migration.h |   1 +
 include/migration/qemu-file.h |   1 +
 include/migration/vmstate.h   |   3 +-
 include/qjson.h   |  29 +++
 migration/qemu-file.c |  16 ++
 migration/vmstate.c   | 186 -
 qjson.c   | 130 ++
 savevm.c  |  54 +++-
 scripts/analyze-migration.py  | 592 ++
 tests/test-vmstate.c  |   6 +-
 14 files changed, 1006 insertions(+), 19 deletions(-)
 create mode 100644 include/qjson.h
 create mode 100644 qjson.c
 create mode 100755 scripts/analyze-migration.py

-- 
1.7.12.4




[Qemu-devel] [PATCH v3 1/5] QJSON: Add JSON writer

2014-12-26 Thread Alexander Graf
To support programmatic JSON assembly while keeping the code that generates it
readable, this patch introduces a simple JSON writer. It emits JSON serially
into a buffer in memory.

The nice thing about this writer is its simplicity and low memory overhead.
Unlike the QMP JSON writer, this one does not need to spawn QObjects for every
element it wants to represent.

This is a prerequisite for the migration stream format description generator.

Signed-off-by: Alexander Graf 
---
 Makefile.objs   |  1 +
 include/qjson.h | 28 +
 qjson.c | 97 +
 3 files changed, 126 insertions(+)
 create mode 100644 include/qjson.h
 create mode 100644 qjson.c

diff --git a/Makefile.objs b/Makefile.objs
index abeb902..28999d3 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -51,6 +51,7 @@ common-obj-$(CONFIG_LINUX) += fsdev/
 common-obj-y += migration/
 common-obj-y += qemu-char.o #aio.o
 common-obj-y += page_cache.o
+common-obj-y += qjson.o
 
 common-obj-$(CONFIG_SPICE) += spice-qemu-char.o
 
diff --git a/include/qjson.h b/include/qjson.h
new file mode 100644
index 000..8f8c145
--- /dev/null
+++ b/include/qjson.h
@@ -0,0 +1,28 @@
+/*
+ * QEMU JSON writer
+ *
+ * Copyright Alexander Graf
+ *
+ * Authors:
+ *  Alexander Graf 
+#include 
+#include 
+#include 
+
+struct QJSON {
+QString *str;
+bool omit_comma;
+unsigned long self_size_offset;
+};
+
+static void json_emit_element(QJSON *json, const char *name)
+{
+/* Check whether we need to print a , before an element */
+if (json->omit_comma) {
+json->omit_comma = false;
+} else {
+qstring_append(json->str, ", ");
+}
+
+if (name) {
+qstring_append(json->str, "\"");
+qstring_append(json->str, name);
+qstring_append(json->str, "\" : ");
+}
+}
+
+void json_start_object(QJSON *json, const char *name)
+{
+json_emit_element(json, name);
+qstring_append(json->str, "{ ");
+json->omit_comma = true;
+}
+
+void json_end_object(QJSON *json)
+{
+qstring_append(json->str, " }");
+json->omit_comma = false;
+}
+
+void json_start_array(QJSON *json, const char *name)
+{
+json_emit_element(json, name);
+qstring_append(json->str, "[ ");
+json->omit_comma = true;
+}
+
+void json_end_array(QJSON *json)
+{
+qstring_append(json->str, " ]");
+json->omit_comma = false;
+}
+
+void json_prop_int(QJSON *json, const char *name, int64_t val)
+{
+json_emit_element(json, name);
+qstring_append_int(json->str, val);
+}
+
+void json_prop_str(QJSON *json, const char *name, const char *str)
+{
+json_emit_element(json, name);
+qstring_append_chr(json->str, '"');
+qstring_append(json->str, str);
+qstring_append_chr(json->str, '"');
+}
+
+const char *qjson_get_str(QJSON *json)
+{
+return qstring_get_str(json->str);
+}
+
+QJSON *qjson_new(void)
+{
+QJSON *json = g_new(QJSON, 1);
+json->str = qstring_from_str("{ ");
+json->omit_comma = true;
+return json;
+}
+
+void qjson_finish(QJSON *json)
+{
+json_end_object(json);
+}
-- 
1.7.12.4




[Qemu-devel] [PATCH v3 4/5] migration: Append JSON description of migration stream

2014-12-26 Thread Alexander Graf
One of the annoyances of the current migration format is the fact that
it's not self-describing. In fact, it's not properly describing at all.
Some code randomly scattered throughout QEMU elaborates roughly how to
read and write a stream of bytes.

We discussed an idea during KVM Forum 2013 to add a JSON description of
the migration protocol itself to the migration stream. This patch
adds a section after the VM_END migration end marker that contains
description data on what the device sections of the stream are composed of.

With an additional external program this allows us to decipher the
contents of any migration stream and hopefully make migration bugs easier
to track down.

Signed-off-by: Alexander Graf 

---

v2 -> v3:

  - Dont compress objects with subsections
  - Destroy QJSON object
---
 hw/pci/pci.c  |   2 +-
 hw/scsi/spapr_vscsi.c |   2 +-
 hw/virtio/virtio.c|   2 +-
 include/migration/migration.h |   1 +
 include/migration/vmstate.h   |   3 +-
 migration/vmstate.c   | 186 --
 savevm.c  |  54 ++--
 tests/test-vmstate.c  |   6 +-
 8 files changed, 237 insertions(+), 19 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 371699c..38d0b7b 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -512,7 +512,7 @@ void pci_device_save(PCIDevice *s, QEMUFile *f)
  * This makes us compatible with old devices
  * which never set or clear this bit. */
 s->config[PCI_STATUS] &= ~PCI_STATUS_INTERRUPT;
-vmstate_save_state(f, pci_get_vmstate(s), s);
+vmstate_save_state(f, pci_get_vmstate(s), s, NULL);
 /* Restore the interrupt status bit. */
 pci_update_irq_status(s);
 }
diff --git a/hw/scsi/spapr_vscsi.c b/hw/scsi/spapr_vscsi.c
index 20b20f0..3639235 100644
--- a/hw/scsi/spapr_vscsi.c
+++ b/hw/scsi/spapr_vscsi.c
@@ -630,7 +630,7 @@ static void vscsi_save_request(QEMUFile *f, SCSIRequest 
*sreq)
 vscsi_req *req = sreq->hba_private;
 assert(req->active);
 
-vmstate_save_state(f, &vmstate_spapr_vscsi_req, req);
+vmstate_save_state(f, &vmstate_spapr_vscsi_req, req, NULL);
 
 DPRINTF("VSCSI: saving tag=%u, current desc#%d, offset=%x\n",
 req->qtag, req->cur_desc_num, req->cur_desc_offset);
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 013979a..d735343 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -955,7 +955,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
 }
 
 /* Subsections */
-vmstate_save_state(f, &vmstate_virtio, vdev);
+vmstate_save_state(f, &vmstate_virtio, vdev, NULL);
 }
 
 int virtio_set_features(VirtIODevice *vdev, uint32_t val)
diff --git a/include/migration/migration.h b/include/migration/migration.h
index 3cb5ba8..f37348b 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -33,6 +33,7 @@
 #define QEMU_VM_SECTION_END  0x03
 #define QEMU_VM_SECTION_FULL 0x04
 #define QEMU_VM_SUBSECTION   0x05
+#define QEMU_VM_VMDESCRIPTION0x06
 
 struct MigrationParams {
 bool blk;
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index e45fc49..1ed5311 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -29,6 +29,7 @@
 #ifndef CONFIG_USER_ONLY
 #include 
 #endif
+#include 
 
 typedef void SaveStateHandler(QEMUFile *f, void *opaque);
 typedef int LoadStateHandler(QEMUFile *f, void *opaque, int version_id);
@@ -779,7 +780,7 @@ extern const VMStateInfo vmstate_info_bitmap;
 int vmstate_load_state(QEMUFile *f, const VMStateDescription *vmsd,
void *opaque, int version_id);
 void vmstate_save_state(QEMUFile *f, const VMStateDescription *vmsd,
-void *opaque);
+void *opaque, QJSON *vmdesc);
 
 int vmstate_register_with_alias_id(DeviceState *dev, int instance_id,
const VMStateDescription *vmsd,
diff --git a/migration/vmstate.c b/migration/vmstate.c
index 3dde574..2a91f1f 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -4,9 +4,10 @@
 #include "migration/vmstate.h"
 #include "qemu/bitops.h"
 #include "trace.h"
+#include "qjson.h"
 
 static void vmstate_subsection_save(QEMUFile *f, const VMStateDescription 
*vmsd,
-void *opaque);
+void *opaque, QJSON *vmdesc);
 static int vmstate_subsection_load(QEMUFile *f, const VMStateDescription *vmsd,
void *opaque);
 
@@ -138,32 +139,181 @@ int vmstate_load_state(QEMUFile *f, const 
VMStateDescription *vmsd,
 return 0;
 }
 
+static int vmfield_name_num(VMStateField *start, VMStateField *search)
+{
+VMStateField *field;
+int found = 0;
+
+for (field = start; field->name; field++) {
+if (!strcmp(field->name, search->name)) {
+if (field == search) {
+return found;
+}

[Qemu-devel] [PATCH v3 5/5] Add migration stream analyzation script

2014-12-26 Thread Alexander Graf
This patch adds a python tool to the scripts directory that can read
a dumped migration stream if it contains the JSON description of the
device states. I constructs a human readable JSON stream out of it.

It's very simple to use:

  $ qemu-system-x86_64
(qemu) migrate "exec:cat > mig"
  $ ./scripts/analyze_migration.py -f mig

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - Remove support for multiple vmsd versions
  - Add support for HTAB
  - Add support for array_len
  - Move to new naming schema for fields
  - Use dynamic page size

v2 -> v3:

  - Add tell function to MigrationFile
  - Report where subsections were not found
  - Print ram sections with size
  - Remove foreign desc file
  - Add memory dump support (-m option)
  - Add -x option to extract all sections into files
---
 scripts/analyze-migration.py | 592 +++
 1 file changed, 592 insertions(+)
 create mode 100755 scripts/analyze-migration.py

diff --git a/scripts/analyze-migration.py b/scripts/analyze-migration.py
new file mode 100755
index 000..3363172
--- /dev/null
+++ b/scripts/analyze-migration.py
@@ -0,0 +1,592 @@
+#!/usr/bin/env python
+#
+#  Migration Stream Analyzer
+#
+#  Copyright (c) 2013 Alexander Graf 
+#
+# This library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2 of the License, or (at your option) any later version.
+#
+# This library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with this library; if not, see .
+
+import numpy as np
+import json
+import os
+import argparse
+import collections
+import pprint
+
+def mkdir_p(path):
+try:
+os.makedirs(path)
+except OSError:
+pass
+
+class MigrationFile(object):
+def __init__(self, filename):
+self.filename = filename
+self.file = open(self.filename, "rb")
+
+def read64(self):
+return np.asscalar(np.fromfile(self.file, count=1, dtype='>i8')[0])
+
+def read32(self):
+return np.asscalar(np.fromfile(self.file, count=1, dtype='>i4')[0])
+
+def read16(self):
+return np.asscalar(np.fromfile(self.file, count=1, dtype='>i2')[0])
+
+def read8(self):
+return np.asscalar(np.fromfile(self.file, count=1, dtype='>i1')[0])
+
+def readstr(self, len = None):
+if len is None:
+len = self.read8()
+if len == 0:
+return ""
+return np.fromfile(self.file, count=1, dtype=('S%d' % len))[0]
+
+def readvar(self, size = None):
+if size is None:
+size = self.read8()
+if size == 0:
+return ""
+value = self.file.read(size)
+if len(value) != size:
+raise Exception("Unexpected end of %s at 0x%x" % (self.filename, 
self.file.tell()))
+return value
+
+def tell(self):
+return self.file.tell()
+
+# The VMSD description is at the end of the file, after EOF. Look for
+# the last NULL byte, then for the beginning brace of JSON.
+def read_migration_debug_json(self):
+QEMU_VM_VMDESCRIPTION = 0x06
+
+# Remember the offset in the file when we started
+entrypos = self.file.tell()
+
+# Read the last 10MB
+self.file.seek(0, os.SEEK_END)
+endpos = self.file.tell()
+self.file.seek(max(-endpos, -10 * 1024 * 1024), os.SEEK_END)
+datapos = self.file.tell()
+data = self.file.read()
+# The full file read closed the file as well, reopen it
+self.file = open(self.filename, "rb")
+
+# Find the last NULL byte, then the first brace after that. This should
+# be the beginning of our JSON data.
+nulpos = data.rfind("\0")
+jsonpos = data.find("{", nulpos)
+
+# Check backwards from there and see whether we guessed right
+self.file.seek(datapos + jsonpos - 5, 0)
+if self.read8() != QEMU_VM_VMDESCRIPTION:
+raise Exception("No Debug Migration device found")
+
+jsonlen = self.read32()
+
+# Seek back to where we were at the beginning
+self.file.seek(entrypos, 0)
+
+return data[jsonpos:jsonpos + jsonlen]
+
+def close(self):
+self.file.close()
+
+class RamSection(object):
+RAM_SAVE_FLAG_COMPRESS = 0x02
+RAM_SAVE_FLAG_MEM_SIZE = 0x04
+RAM_SAVE_FLAG_PAGE = 0x08
+RAM_SAVE_FLAG_EOS  = 0x10
+RAM_SAVE_FLAG_CONTINUE = 0x20
+RAM_SAVE_FLAG_XBZRLE   = 0x40
+RAM_SAVE_FLAG_HOOK = 0x80
+
+def __init__(self, file, version_id, ramargs, section_key):
+if version_id != 4:
+  

Re: [Qemu-devel] [PATCH] kvm_irqchip_assign_irqfd: just set irqfd in case of kvm_irqfds_enabled()

2014-12-26 Thread Peter Maydell
On 26 December 2014 at 10:16, Denis V. Lunev  wrote:
> IMHO the patch does not change anything even on hot-hot path.
> the declaration 'struct kvm_irqfd irqfd = {};' will
> result in memset inside.
>
> Thus in order to achieve declared goal author should
> declare
>struct kvm_irqfd irqfd;
> and perform
>memset(&irqfd, 0, sizeof(irqfd));
> later after the check.

Mm, but once you're into such microoptimisations as this you really
need to have a good justification for the effort, in the form of
profiling measurements that indicate that this is a hot path.
In this case that seems pretty unlikely, because I'd expect all
the systems where we care about performance will support irqfds,
so we won't be taking the early-exit code path anyhow. (And
not supporting irqfds is leaving much more performance on the
table than we could possibly be talking about in this function.)

thanks
-- PMM



Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs

2014-12-26 Thread Denis V. Lunev

On 26/12/14 16:32, Denis V. Lunev wrote:

On 26/12/14 16:13, Peter Lieven wrote:

Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:

The check for maximum length was added by
   commit c31cb70728d2c0c8900b35a66784baa446fd5147
   Author: Peter Lieven 
   Date:   Thu Oct 24 12:06:58 2013 +0200
 block: honour BlockLimits in bdrv_co_do_write_zeroes

but actually if driver provides .bdrv_co_write_zeroes callback, 
there is

no need to limit the size to 32 MB. Callback should provide effective
implementation which normally should not write any zeroes in comparable
amount.


NACK.

First there is no guarantee that bdrv_co_do_write_zeroes is a fast 
operation.
This heaviliy depends on several circumstances that the block layer 
is not aware of.
If a specific protocol knows it is very fast in writing zeroes under 
any circumstance
it should provide INT_MAX in bs->bl.max_write_zeroes. It is then 
still allowed to

return -ENOTSUP if the request size or alignment doesn't fit.


the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is
specified, the cost is almost the same for any amount of zeroes
written. This is true for fallocate from my point of view. The amount
of actually written data will be in several orders less than specified
except slow path, which honors 32 MB limit.

If the operation is complex in realization, then it will be rate-limited
below, in actual implementation.

There are known backends e.g. anything that deals with SCSI that have 
a known
limitation of the maximum number of zeroes they can write fast in a 
single request.
This number MUST NOT be exceeded. The below patch would break all 
those backends.


could you pls point me this backends. Actually, from my point of
view, they should properly setup max_write_zeroes for themselves.
This is done at least in block/iscsi.c and it would be consistent
way of doing so.



What issue are you trying to fix with this patch? Maybe there is a 
better way to fix

it at another point in the code.



I am trying to minimize amount of metadata updates for a file.
This provides some speedup even on ext4 and this will provide
even more speedup with a distributed filesystem like CEPH
where size updates of the files and block allocation are
costly.

Regards,
Den

First of all, the patch is really wrong :) It was written using
wrong assumptions.

OK. I have spent some time reading your original patchset and
and did not found any useful justification for default limit
for both discard and write zero.

Yes, there are drivers which requires block level to call
.bdrv_co_do_write_zeroes with alignment and with upper limit.
But in this case driver setups max_write_zeroes. All buggy
drivers should do that not to affect not buggy ones from
my opinion.

This is the only purpose of the original patches for limits.
I have wrongly interpret BlockLimits as something connected
with time of the operation. Sorry for that.

Therefore there is no good reason for limiting the amount of
data sent to drv->bdrv_co_writev with any data size. The only
thing is that it would be good not to allocate too many memory
at once. We could do something like

base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE);
added = 0;
for (added = 0; added < num; add += MIN(2048, num)) {
qemu_iovec_add(qiov, base, MIN(2048, num));
}

to avoid really big allocations here even if .max_write_zeroes is
very high. Do you think that this might be useful?

As for .bdrv_co_do_write_zeroes itself, can we still drop
default 32 Mb limit? If there are some buggy drivers, they
should have .max_write_zeroes specified.

The same applies to .max_discard

Regards,
Den



[Qemu-devel] implement lvm-aware P2V to reduce time cost significantly for linux server

2014-12-26 Thread Haoyu Zhang
Hi,

I want to P2V a redhat server to kvm vm, and lvm was used to manage disks
in the redhat server.
I want to only migrate the really used storage to vm image, which can
reduce the time cost significantly sometimes,
so I need the information of logical volume to physical disks bitmap, to
know which physical sectors were really used,
any ideas?
Is there a tool off-the-shelf have implemented the target?


Thanks,
Zhang Haoyu