Re: [Qemu-devel] [PATCH 06/11] Goldfish: Added a nand controller.

2011-08-23 Thread Paolo Bonzini

On 08/22/2011 10:41 PM, Stefan Hajnoczi wrote:

On Mon, Aug 22, 2011 at 3:16 PM, Paolo Bonzini  wrote:

On 08/22/2011 02:04 PM, Stefan Hajnoczi wrote:


Please use the block layer instead of reinventing portable file I/O.

Also, please check which of the utility functions already exist in
QEMU.  If a utility function in this patch is needed it should be made
generic for all of QEMU, not just goldfish or nand.


This was part of the plans, but there was no time for it.


Does that mean this series is Request For Comments and not something
you expect to be merged into qemu.git?


Yes, more or less.

Paolo




Re: [Qemu-devel] Compilation error of coroutine-win32.c with gcc version 3.4.5 (mingw-vista special r3)

2011-08-23 Thread Roy Tam
2011/8/20 Stefan Weil :
> Am 16.08.2011 20:50, schrieb Paolo Bonzini:
>>
>> On 08/16/2011 05:45 AM, Stefan Hajnoczi wrote:
>>>
>>> Roy,
>>> This stack trace does not reveal much.
>>>
>>> Is there any MinGW gcc user that has successfully built and run
>>> qemu.git?
>>
>> I would be surprised if Stefan Weil hasn't.
>
> How did you know that?
>
>>> Which version/architecture of Windows and which MinGW
>>> version?
>
> QEMU currently only supports 32 bit Windows executables
> which work on any current Windows version
> (XP and later, 32 and 64 bit).
>
> I use the latest MinGW from http://www.mingw.org/ (native compilation
> on Windows) and Debian cross compilations. Those are available
> from http://qemu.weilnetz.de/w32/.
>
>

Did you cross compile it to get those binaries? They run fine here,
but mine do not.
I tried rebuilding all dependencies (gettext, libiconv, glib, zlib)
under gcc 4.6.1 but I still get the same error with the latest git revision.
I wonder whether anyone (other than me) can build and run QEMU
successfully in a Windows (MinGW+MSYS) environment.



Re: [Qemu-devel] [PATCH 2/2] [RFC] time: refactor QEMU timer to use GHRTimer

2011-08-23 Thread Paolo Bonzini

On 08/22/2011 09:21 PM, Anthony Liguori wrote:

-ticks = cpu_get_real_ticks();
-if (timers_state.cpu_ticks_prev > ticks) {
-/* Note: non increasing ticks may happen if the host uses
-   software suspend */
-timers_state.cpu_ticks_offset += timers_state.cpu_ticks_prev - ticks;
-}
+ticks = get_clock();

[...]

-static inline int64_t cpu_get_real_ticks(void)
-{
-int64_t val;
-asm volatile ("rdtsc" : "=A" (val));
-return val;
-}
-


cpu_get_ticks is used also to emulate the guest TSC, are you sure you 
want to change that uniformly to a 1 GHz rate?


I had some more cleanups in this area, I'll try to get them tested and 
submitted but I have little time for this right now.


Paolo



Re: [Qemu-devel] [PATCH v2] rbd: allow client id to be specified in config string

2011-08-23 Thread Stefan Hajnoczi
On Mon, Aug 22, 2011 at 02:45:48PM -0700, Sage Weil wrote:
> diff --git a/block/rbd.c b/block/rbd.c
> index d5659cd..b0cd750 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -169,6 +169,34 @@ done:
>  return ret;
>  }
>  
> +static char *qemu_rbd_parse_clientname(const char *conf, char *clientname)
> +{
> +const char *p = conf;
> +
> +while (*p) {
> + int len;
> + const char *end = strchr(p, ':');
> +
> + if (end) {
> + len = end - p;
> + } else {
> + len = strlen(p);
> + }
> +
> + if (strncmp(p, "id=", 3) == 0) {
> + len -= 3;
> + strncpy(clientname, p + 3, len);
> + clientname[len] = '\0';
> + return clientname;
> + }
> + if (end == NULL) {
> + break;
> + }
> + p = end + 1;
> +}
> +return NULL;
> +}

Please run scripts/checkpatch.pl to check coding style.  You are using
tab characters; QEMU uses 4 spaces for indentation instead of tabs.

> @@ -398,7 +433,8 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
> *filename, int flags)
>  s->snap = qemu_strdup(snap_buf);
>  }
>  
> -r = rados_create(&s->cluster, NULL);
> +clientname = qemu_rbd_parse_clientname(conf, clientname_buf);
> +r = rados_create(&s->cluster, clientname);
>  if (r < 0) {
>  error_report("error initializing");
>  return r;

s->snap may be leaked here.  If .bdrv_file_open() fails, .bdrv_close()
will not be called to clean things up.
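
For illustration, one way to avoid the leak would be to free the string on
the error path (a sketch only, not tested):

    clientname = qemu_rbd_parse_clientname(conf, clientname_buf);
    r = rados_create(&s->cluster, clientname);
    if (r < 0) {
        error_report("error initializing");
        qemu_free(s->snap);   /* undo the earlier qemu_strdup() */
        s->snap = NULL;
        return r;
    }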

Stefan



Re: [Qemu-devel] Block layer roadmap on wiki

2011-08-23 Thread Stefan Hajnoczi
On Mon, Aug 22, 2011 at 04:01:08PM -0500, Anthony Liguori wrote:
> On 08/22/2011 03:48 PM, Ryan Harper wrote:
> >* Stefan Hajnoczi  [2011-08-22 15:32]:
> >>We wouldn't rm -rf block/* because we still need qemu-nbd.  It
> >>probably makes sense to keep what we have today.  I'm talking more
> >>about a shift from writing our own image format to integrating
> >>existing storage support.
> >
> >I think this is a key point.  While I do like the idea of keeping QEMU
> >focused on single VM, I think we don't help ourselves by not consuming
> >the hypervisor platform services and integrating/exploiting those
> >features to make using QEMU easier.
> 
> Let's avoid the h-word here as it's not terribly relevant to the discussion.
> 
> Configuring block devices is fundamentally a privileged operation.
> QEMU fundamentally is designed to be useful as an unprivileged user.
> 
> That's the trouble with something like LVM.  Only root can create
> LVM snapshots and it's an all-or-nothing security model.
> 
> If you want to get QEMU out of the snapshot business, you need a
> file system that's widely available that allows non-privileged users
> to take snapshots of individual files.

I don't think we should remove qcow2 internal snapshots or
blockdev_snapshot.  But they have performance limitations where it makes
sense to start using existing storage support instead of reimplementing
efficient and scalable snapshots ourselves.

btrfs is maturing and its BTRFS_IOC_CLONE ioctl is unprivileged.  So we
can offer that option for unprivileged users.
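
For illustration, a minimal sketch of an unprivileged per-file snapshot with
that ioctl (the ioctl number is the usual btrfs definition; the helper and
file names are made up, and both files must live on the same btrfs
filesystem):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

#ifndef BTRFS_IOC_CLONE
#define BTRFS_IOC_CLONE _IOW(0x94, 9, int)   /* share extents, O(1) "copy" */
#endif

/* Clone src into dst without needing root; error handling omitted. */
static int clone_file(const char *src, const char *dst)
{
    int src_fd = open(src, O_RDONLY);
    int dst_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int ret = ioctl(dst_fd, BTRFS_IOC_CLONE, src_fd);

    close(src_fd);
    close(dst_fd);
    return ret;
}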

Stefan



Re: [Qemu-devel] [PATCH 2/2] [RFC] time: refactor QEMU timer to use GHRTimer

2011-08-23 Thread Paolo Bonzini

On 08/22/2011 10:28 PM, Jan Kiszka wrote:

  - QEMU_CLOCK_VIRTUAL: Without -icount, same as above, but stops when
the guest is stopped. The offset to compensate for stopped
times is based on TSC, not sure why. With -icount, things get more
complicated, Paolo had some nice explanations for the details.


The TSC is actually out of the picture.  However, it is easy to get 
confused, because the same code handles stopping both the vm_clock and 
the TSC when the guest is paused.  cpu_get_ticks handles the TSC, 
cpu_get_clock handles the clock:


Removing some "uninteresting" details, you have:

/* return the host CPU cycle counter and handle stop/restart */
int64_t cpu_get_ticks(void)
{
    if (!vm_running) {
        return timers_state.cpu_ticks_offset;
    } else {
        int64_t ticks;
        ticks = cpu_get_real_ticks();
        return ticks + timers_state.cpu_ticks_offset;
    }
}

/* return the host CPU monotonic timer and handle stop/restart */
static int64_t cpu_get_clock(void)
{
    int64_t ti;
    if (!vm_running) {
        return timers_state.cpu_clock_offset;
    } else {
        ti = get_clock();
        return ti + timers_state.cpu_clock_offset;
    }
}

which use the same algorithm but with different base clocks (TSC vs. 
CLOCK_MONOTONIC).


With -icount, things get indeed more complicated.  I'll cover only the 
iothread case since all my attempts at understanding the non-iothread 
case failed. :)  The "-icount N" vm_clock has nanosecond resolution just 
like the normal vm_clock, but those are not real nanoseconds.  While the 
CPU is running, each instruction increments the vm_clock by 2^N 
nanoseconds (yes, this is completely bollocks for SMP. :).  When the CPU 
is not running, instead, the vm_clock follows CLOCK_MONOTONIC; the 
rationale is there in qemu-timer.c.
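
As a rough sketch of the arithmetic (the real code lives in qemu-timer.c and
cpu-exec.c; the names below are illustrative, not the actual ones):

/* With "-icount N", every executed instruction advances the virtual
 * clock by 2^N nanoseconds while the CPU is running. */
static int64_t icount_to_ns(int64_t executed_insns, int shift /* the N */)
{
    return executed_insns << shift;
}

/* roughly: vm_clock = icount_to_ns(insns_executed, N) + icount_offset */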


"-icount auto" is the same, except that we try to keep vm_clock roughly 
the same as CLOCK_MONOTONIC by oscillating the clock frequency between 
"-icount N" values that are more representative of the actual execution 
frequency.


On top of this, the VM has to do two things in icount mode. The first is 
to stop execution at the next vm_clock deadline, which means breaking 
translation blocks after executing the appropriate number of 
instructions; this is quite obvious.  The second is to stop execution of 
a TB after any MMIO instruction, in order to recalculate deadlines if 
necessary.  The latter is responsible for most of the icount black magic 
spread all over the tree.  However, it's not that bad: long term, it 
sounds at least plausible to reuse this machinery to run the CPU threads 
outside the iothread lock (and only take it when doing MMIO, just like 
KVM does).


Paolo



Re: [Qemu-devel] [PATCH 2/2] [RFC] time: refactor QEMU timer to use GHRTimer

2011-08-23 Thread Edgar E. Iglesias
On Tue, Aug 23, 2011 at 10:12:05AM +0200, Paolo Bonzini wrote:
> On 08/22/2011 10:28 PM, Jan Kiszka wrote:
> >   - QEMU_CLOCK_VIRTUAL: Without -icount, same as above, but stops when
> > the guest is stopped. The offset to compensate for stopped
> > times is based on TSC, not sure why. With -icount, things get more
> > complicated, Paolo had some nice explanations for the details.
> 
> The TSC is actually out of the picture.  However, it is easy to get 
> confused, because the same code handles stopping both the vm_clock and 
> the TSC when the guest is paused.  cpu_get_ticks handles the TSC, 
> cpu_get_clock handles the clock:
> 
> Removing some "uninteresting" details, you have:
> 
> /* return the host CPU cycle counter and handle stop/restart */
> int64_t cpu_get_ticks(void)
> {
>  if (!vm_running) {
>  return timers_state.cpu_ticks_offset;
>  } else {
>  int64_t ticks;
>  ticks = cpu_get_real_ticks();
>  return ticks + timers_state.cpu_ticks_offset;
>  }
> }
> 
> /* return the host CPU monotonic timer and handle stop/restart */
> static int64_t cpu_get_clock(void)
> {
>  int64_t ti;
>  if (!vm_running) {
>  return timers_state.cpu_clock_offset;
>  } else {
>  ti = get_clock();
>  return ti + timers_state.cpu_clock_offset;
>  }
> }
> 
> which use the same algorithm but with different base clocks (TSC vs. 
> CLOCK_MONOTONIC).
> 
> With -icount, things get indeed more complicated.  I'll cover only the 
> iothread case since all my attempts at understanding the non-iothread 
> case failed. :)  The "-icount N" vm_clock has nanosecond resolution just 
> like the normal vm_clock, but those are not real nanoseconds.  While the 
> CPU is running, each instruction increments the vm_clock by 2^N 
> nanoseconds (yes, this is completely bollocks for SMP. :).  When the CPU 
> is not running, instead, the vm_clock follows CLOCK_MONOTONIC; the 
> rationale is there in qemu-timer.c.
> 
> "-icount auto" is the same, except that we try to keep vm_clock roughly 
> the same as CLOCK_MONOTONIC by oscillating the clock frequency between 
> "-icount N" values that are more representative of the actual execution 
> frequency.
> 
> On top of this, the VM has to do two things in icount mode. The first is 
> to stop execution at the next vm_clock deadline, which means breaking 
> translation blocks after executing the appropriate number of 
> instructions; this is quite obvious.  The second is to stop execution of 
> a TB after any MMIO instruction, in order to recalculate deadlines if 
> necessary.  The latter is responsible for most of the icount black magic 
> spread all over the tree.  However, it's not that bad: long term, it 
> sounds at least plausible to reuse this machinery to run the CPU threads 
> outside the iothread lock (and only take it when doing MMIO, just like 
> KVM does).

Interesting idea :)

Cheers



Re: [Qemu-devel] [RFC PATCH v4 2/5] ramlist mutex

2011-08-23 Thread Marcelo Tosatti
On Tue, Aug 23, 2011 at 06:15:33AM -0300, Marcelo Tosatti wrote:
> On Tue, Aug 16, 2011 at 11:56:37PM -0400, Umesh Deshpande wrote:
> > The ramlist mutex is implemented to protect RAMBlock list traversal in the
> > migration thread from block addition/removal in the iothread.
> > 
> > Signed-off-by: Umesh Deshpande 
> > ---
> >  cpu-all.h |2 ++
> >  exec.c|   19 +++
> >  qemu-common.h |2 ++
> >  3 files changed, 23 insertions(+), 0 deletions(-)
> > 
> > diff --git a/cpu-all.h b/cpu-all.h
> > index 6b217a2..eab9803 100644
> > --- a/cpu-all.h
> > +++ b/cpu-all.h
> > @@ -21,6 +21,7 @@
> >  
> >  #include "qemu-common.h"
> >  #include "cpu-common.h"
> > +#include "qemu-thread.h"
> >  
> >  /* some important defines:
> >   *
> > @@ -932,6 +933,7 @@ typedef struct RAMBlock {
> >  } RAMBlock;
> >  
> >  typedef struct RAMList {
> > +QemuMutex mutex;
> >  uint8_t *phys_dirty;
> >  QLIST_HEAD(ram, RAMBlock) blocks;
> >  QLIST_HEAD(, RAMBlock) blocks_mru;
> 
> A comment on what the mutex protects would be good.

And on the lock ordering.




Re: [Qemu-devel] [RFC PATCH v4 2/5] ramlist mutex

2011-08-23 Thread Marcelo Tosatti
On Tue, Aug 16, 2011 at 11:56:37PM -0400, Umesh Deshpande wrote:
> The ramlist mutex is implemented to protect RAMBlock list traversal in the
> migration thread from block addition/removal in the iothread.
> 
> Signed-off-by: Umesh Deshpande 
> ---
>  cpu-all.h |2 ++
>  exec.c|   19 +++
>  qemu-common.h |2 ++
>  3 files changed, 23 insertions(+), 0 deletions(-)
> 
> diff --git a/cpu-all.h b/cpu-all.h
> index 6b217a2..eab9803 100644
> --- a/cpu-all.h
> +++ b/cpu-all.h
> @@ -21,6 +21,7 @@
>  
>  #include "qemu-common.h"
>  #include "cpu-common.h"
> +#include "qemu-thread.h"
>  
>  /* some important defines:
>   *
> @@ -932,6 +933,7 @@ typedef struct RAMBlock {
>  } RAMBlock;
>  
>  typedef struct RAMList {
> +QemuMutex mutex;
>  uint8_t *phys_dirty;
>  QLIST_HEAD(ram, RAMBlock) blocks;
>  QLIST_HEAD(, RAMBlock) blocks_mru;

A comment on what the mutex protects would be good.



Re: [Qemu-devel] [PATCH] KVM: emulate lapic tsc deadline timer for hvm

2011-08-23 Thread Marcelo Tosatti
On Mon, Aug 22, 2011 at 12:18:49PM +0300, Avi Kivity wrote:
> On 08/17/2011 07:19 AM, Liu, Jinsong wrote:
> > From a9670ddff84080c56183e2d678189e100f891174 Mon Sep 17 00:00:00 2001
> >From: Liu, Jinsong
> >Date: Wed, 17 Aug 2011 11:36:28 +0800
> >Subject: [PATCH] KVM: emulate lapic tsc deadline timer for hvm
> 
> kvm doesn't have hvm.
> 
> >This patch emulates the lapic tsc deadline timer for hvm:
> >Enumerate tsc deadline timer capability by CPUID;
> >Enable tsc deadline timer mode by LAPIC MMIO;
> >Start tsc deadline timer by MSR;
> 
> >diff --git a/arch/x86/include/asm/cpufeature.h 
> >b/arch/x86/include/asm/cpufeature.h
> >index 4258aac..28bcf48 100644
> >--- a/arch/x86/include/asm/cpufeature.h
> >+++ b/arch/x86/include/asm/cpufeature.h
> >@@ -120,6 +120,7 @@
> >  #define X86_FEATURE_X2APIC (4*32+21) /* x2APIC */
> >  #define X86_FEATURE_MOVBE  (4*32+22) /* MOVBE instruction */
> >  #define X86_FEATURE_POPCNT  (4*32+23) /* POPCNT instruction */
> >+#define X86_FEATURE_TSC_DEADLINE_TIMER (4*32+24) /* Tsc deadline timer */
> >  #define X86_FEATURE_AES(4*32+25) /* AES instructions */
> >  #define X86_FEATURE_XSAVE  (4*32+26) /* XSAVE/XRSTOR/XSETBV/XGETBV */
> >  #define X86_FEATURE_OSXSAVE(4*32+27) /* "" XSAVE enabled in the OS 
> > */
> >diff --git a/arch/x86/include/asm/kvm_host.h 
> >b/arch/x86/include/asm/kvm_host.h
> >index 307e3cf..28f7128 100644
> >--- a/arch/x86/include/asm/kvm_host.h
> >+++ b/arch/x86/include/asm/kvm_host.h
> >@@ -635,6 +635,7 @@ struct kvm_x86_ops {
> > int (*check_intercept)(struct kvm_vcpu *vcpu,
> >struct x86_instruction_info *info,
> >enum x86_intercept_stage stage);
> >+u64 (*guest_to_host_tsc)(u64 guest_tsc);
> >  };
> 
> Please put this near the other tsc functions.  Add a comment
> explaining what value is returned under nested virtualization.
> 
> Please add the svm callback implementation.
> 
> >
> >--- a/arch/x86/include/asm/msr-index.h
> >+++ b/arch/x86/include/asm/msr-index.h
> >@@ -229,6 +229,8 @@
> >  #define MSR_IA32_APICBASE_ENABLE   (1<<11)
> >  #define MSR_IA32_APICBASE_BASE (0xf<<12)
> >
> >+#define MSR_IA32_TSCDEADLINE 0x06e0
> >+
> >  #define MSR_IA32_UCODE_WRITE   0x0079
> >  #define MSR_IA32_UCODE_REV 0x008b
> 
> Need to add to msrs_to_save so live migration works.

MSR must be explicitly listed in qemu, also.

> >+if (!apic->lapic_timer.tscdeadline)
> >+return;
> >+
> >+tsc_target = kvm_x86_ops->
> >+guest_to_host_tsc(apic->lapic_timer.tscdeadline);
> >+rdtscll(tsc_now);
> >+tsc_delta = tsc_target - tsc_now;
> 
> This only works if we have a constant tsc; that's not true for large
> multiboard machines.  Need to do this with irqs disabled as well
> (reading both 'now' and 'tsc_now' in the same critical section).

Should look like this:

local_irq_disable();
u64 guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
if (guest_tsc <= tscdeadline)
        hrtimer_start(now);
else {
        ns = convert_to_ns(guest_tsc - tscdeadline);
        hrtimer_start(now + ns);
}
local_irq_enable();

Note the vcpu's tsc can have a different frequency than the host's, so
vcpu_tsc_khz() should be used to convert to nanoseconds, not tsc_khz.
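
For illustration, the conversion would look roughly like this (using the
vcpu_tsc_khz() helper named above and the kernel's div64_u64(); a sketch,
not the final patch):

/* Convert a guest-TSC delta to nanoseconds with the per-vcpu frequency. */
u64 delta = tscdeadline - guest_tsc;            /* guest TSC cycles */
u64 this_tsc_khz = vcpu_tsc_khz(vcpu);          /* guest kHz, not host tsc_khz */
u64 ns = div64_u64(delta * 1000000ULL, this_tsc_khz);

hrtimer_start(&apic->lapic_timer.timer,
              ktime_add_ns(now, ns), HRTIMER_MODE_ABS);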

> >+if (tsc_delta<  0)
> >+tsc_delta = 0;
> >+
> >+nsec = tsc_delta * 100L / tsc_khz;
> >+hrtimer_start(&apic->lapic_timer.timer,
> >+ktime_add_ns(now, nsec), HRTIMER_MODE_ABS);
> >+}
> >  }
> >
> >@@ -883,6 +936,28 @@ void kvm_free_lapic(struct kvm_vcpu *vcpu)
> >   *--
> >   */
> >
> >+u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu)
> >+{
> >+struct kvm_lapic *apic = vcpu->arch.apic;
> >+
> >+if (apic_lvtt_oneshot(apic) || apic_lvtt_period(apic))
> >+return 0;
> 
> Why?

The hardware reset value of the IA32_TSC_DEADLINE MSR is 0. In other
timer modes (LVT bit 18 = 0), the IA32_TSC_DEADLINE MSR reads zero and
writes are ignored.

> 
> >+
> >+return apic->lapic_timer.tscdeadline;
> >+}
> >+
> >+void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
> >+{
> >+struct kvm_lapic *apic = vcpu->arch.apic;
> >+
> >+if (apic_lvtt_oneshot(apic) || apic_lvtt_period(apic))
> >+return;
> >+
> >+hrtimer_cancel(&apic->lapic_timer.timer);
> >+apic->lapic_timer.tscdeadline = data;
> >+start_apic_timer(apic);
> 
> Shouldn't the msr value be updated even if we're outside tsc-deadline mode?



Re: [Qemu-devel] [PULL] slirp: Fix issues with -mms-bitfields

2011-08-23 Thread TeLeMan
On Sun, Aug 21, 2011 at 04:00, Stefan Weil  wrote:
> Am 15.08.2011 08:39, schrieb Jan Kiszka:
>>
>> The following changes since commit
>> 3b6ffe50300f13240e1b46420ad05da1116df410:
>>
>> hw/scsi-bus.c: Fix use of uninitialised variable (2011-08-14 19:34:25
>> +)
>>
>> are available in the git repository at:
>> git://git.kiszka.org/qemu.git queues/slirp
>>
>> Jan Kiszka (1):
>> slirp: Fix bit field types in IP header structs
>>
>> slirp/ip.h | 8 
>> slirp/tcp.h | 4 ++--
>> 2 files changed, 6 insertions(+), 6 deletions(-)
>>
>> ---
>>
>> slirp: Fix bit field types in IP header structs
>>
>> -mms-bitfields prevents the bitfields in the current IP header structs
>> from being packed into a single byte as required. Fix this by using
>> uint8_t as the backing type.
>>
>> Signed-off-by: Jan Kiszka 
>> ---
>> slirp/ip.h | 8 
>> slirp/tcp.h | 4 ++--
>> 2 files changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/slirp/ip.h b/slirp/ip.h
>> index 48ea38e..72dbe9a 100644
>> --- a/slirp/ip.h
>> +++ b/slirp/ip.h
>> @@ -74,10 +74,10 @@ typedef uint32_t n_long; /* long as received from the
>> net */
>> */
>> struct ip {
>> #ifdef HOST_WORDS_BIGENDIAN
>> - u_int ip_v:4, /* version */
>> + uint8_t ip_v:4, /* version */
>> ip_hl:4; /* header length */
>> #else
>> - u_int ip_hl:4, /* header length */
>> + uint8_t ip_hl:4, /* header length */
>> ip_v:4; /* version */
>> #endif
>> uint8_t ip_tos; /* type of service */
>> @@ -140,10 +140,10 @@ struct ip_timestamp {
>> uint8_t ipt_len; /* size of structure (variable) */
>> uint8_t ipt_ptr; /* index of current entry */
>> #ifdef HOST_WORDS_BIGENDIAN
>> - u_int ipt_oflw:4, /* overflow counter */
>> + uint8_t ipt_oflw:4, /* overflow counter */
>> ipt_flg:4; /* flags, see below */
>> #else
>> - u_int ipt_flg:4, /* flags, see below */
>> + uint8_t ipt_flg:4, /* flags, see below */
>> ipt_oflw:4; /* overflow counter */
>> #endif
>> union ipt_timestamp {
>> diff --git a/slirp/tcp.h b/slirp/tcp.h
>> index 9d06836..b3817cb 100644
>> --- a/slirp/tcp.h
>> +++ b/slirp/tcp.h
>> @@ -51,10 +51,10 @@ struct tcphdr {
>> tcp_seq th_seq; /* sequence number */
>> tcp_seq th_ack; /* acknowledgement number */
>> #ifdef HOST_WORDS_BIGENDIAN
>> - u_int th_off:4, /* data offset */
>> + uint8_t th_off:4, /* data offset */
>> th_x2:4; /* (unused) */
>> #else
>> - u_int th_x2:4, /* (unused) */
>> + uint8_t th_x2:4, /* (unused) */
>> th_off:4; /* data offset */
>> #endif
>> uint8_t th_flags;
>
> Tested-by: Stefan Weil 
>
slirp is still broken on my mingw32. I used "#pragma
pack(push,1)/#pragma pack(pop)" to resolve this issue.
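
For reference, that workaround just wraps the affected struct definitions,
e.g. (a sketch based on the struct ip quoted above, not the committed fix):

/* Force 1-byte packing so -mms-bitfields cannot widen the headers. */
#pragma pack(push, 1)
struct ip {
#ifdef HOST_WORDS_BIGENDIAN
    uint8_t ip_v:4,         /* version */
            ip_hl:4;        /* header length */
#else
    uint8_t ip_hl:4,        /* header length */
            ip_v:4;         /* version */
#endif
    uint8_t ip_tos;         /* type of service */
    /* ... rest of the header unchanged ... */
};
#pragma pack(pop)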

> There remain some issues with other packed structs
> which are obviously no longer packed with -mms-bitfields,
> for example those from bt.h, but slirp looks good with
> this patch.
>
> I used pahole (thanks Blue Swirl for this hint) and codiff
> to investigate and compare the structs with and without
> -mms-bitfields:
>
> * create qemu.exe (or any binary which you want)
> * convert it to elf format using objcopy -Oelf32-i386
>  (or the cross variant)
> * apply pahole -a to the resulting elf file
>  (without -a some structs are missing)
>
> Cheers,
> Stefan W.
>
>
>



Re: [Qemu-devel] [PATCH v2] posix-aio-compat: fix latency issues

2011-08-23 Thread Stefan Hajnoczi
On Mon, Aug 22, 2011 at 6:29 PM, Jan Kiszka  wrote:
> On 2011-08-14 06:04, Avi Kivity wrote:
>> In certain circumstances, posix-aio-compat can incur a lot of latency:
>>  - threads are created by vcpu threads, so if vcpu affinity is set,
>>    aio threads inherit vcpu affinity.  This can cause many aio threads
>>    to compete for one cpu.
>>  - we can create up to max_threads (64) aio threads in one go; since a
>>    pthread_create can take around 30μs, we have up to 2ms of cpu time
>>    under a global lock.
>>
>> Fix by:
>>  - moving thread creation to the main thread, so we inherit the main
>>    thread's affinity instead of the vcpu thread's affinity.
>>  - if a thread is currently being created, and we need to create yet
>>    another thread, let thread being born create the new thread, reducing
>>    the amount of time we spend under the main thread.
>>  - drop the local lock while creating a thread (we may still hold the
>>    global mutex, though)
>>
>> Note this doesn't eliminate latency completely; scheduler artifacts or
>> lack of host cpu resources can still cause it.  We may want pre-allocated
>> threads when this cannot be tolerated.
>>
>> Thanks to Uli Obergfell of Red Hat for his excellent analysis and 
>> suggestions.
>
> At this chance: What is the state of getting rid of the remaining delta
> between upstream's version and qemu-kvm?

That would be nice.  qemu-kvm.git uses a signalfd to handle I/O
completion whereas qemu.git uses a signal, writes to a pipe from the
signal handler, and uses qemu_notify_event() to break the vcpu.  Once
the force iothread patch is merged we should be able to move to
qemu-kvm.git's signalfd approach.
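
Roughly, the signalfd pattern looks like this (a sketch only; the signal
number, handler names and details of the actual qemu-kvm.git code differ):

#include <signal.h>
#include <unistd.h>
#include <sys/signalfd.h>

static int aio_signalfd;

static void aio_signal_ready(void *opaque)
{
    struct signalfd_siginfo info;

    /* drain pending notifications, then walk the completed requests */
    while (read(aio_signalfd, &info, sizeof(info)) == sizeof(info)) {
        /* posix_aio_process_queue(...); */
    }
}

static void aio_signalfd_init(int signum)
{
    sigset_t mask;

    sigemptyset(&mask);
    sigaddset(&mask, signum);
    sigprocmask(SIG_BLOCK, &mask, NULL);    /* stop async delivery */
    aio_signalfd = signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
    qemu_set_fd_handler(aio_signalfd, aio_signal_ready, NULL, NULL);
}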

Stefan



[Qemu-devel] Execute QEMU "the same"

2011-08-23 Thread Lyu Mitnick
Hello all,

I am wondering whether QEMU could run "the same" twice, i.e. produce the same
execution trace.
I read the user manual and found options like:  "-clock dynticks -rtc
base=2006-06-17T16:01:21,clock=vm -icount 1".
Would QEMU run "the same" with these options?

PS. I also found the function cpu_io_recompile() with comments:
  /* in deterministic execution mode, instructions doing device I/Os
 must be at the end of the TB */
  But I don't know what deterministic execution mode means or how to
turn it on.

thanks a lot

Mitnick


Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-23 Thread Joerg Roedel
On Mon, Aug 22, 2011 at 08:52:18PM -0400, aafabbri wrote:
> You have to enforce group/iommu domain assignment whether you have the
> existing uiommu API, or if you change it to your proposed
> ioctl(inherit_iommu) API.
> 
> The only change needed to VFIO here should be to make uiommu fd assignment
> happen on the groups instead of on device fds.  That operation fails or
> succeeds according to the group semantics (all-or-none assignment/same
> uiommu).

That makes uiommu basically the same as the meta-groups, right?

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632




Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-23 Thread Joerg Roedel
On Tue, Aug 23, 2011 at 02:54:43AM -0400, Benjamin Herrenschmidt wrote:
> Possibly, the question that interest me the most is what interface will
> KVM end up using. I'm also not terribly fan with the (perceived)
> discrepancy between using uiommu to create groups but using the group fd
> to actually do the mappings, at least if that is still the plan.
> 
> If the separate uiommu interface is kept, then anything that wants to be
> able to benefit from the ability to put multiple devices (or existing
> groups) into such a "meta group" would need to be explicitly modified to
> deal with the uiommu APIs.
> 
> I tend to prefer such "meta groups" as being something you create
> statically using a configuration interface, either via sysfs, netlink or
> ioctl's to a "control" vfio device driven by a simple command line tool
> (which can have the configuration stored in /etc and re-apply it at
> boot).

Hmm, I don't think that these groups are static for the system's
run-time. They only exist for the lifetime of a guest by default, at
least on x86. That's why I prefer to do this grouping using VFIO and not
some sysfs interface (which would be the third interface, besides the
ioctls and netlink, that a VFIO user needs to be aware of). Doing this in
the ioctl interface just makes things easier.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632




Re: [Qemu-devel] Block layer roadmap on wiki

2011-08-23 Thread Kevin Wolf
Am 22.08.2011 23:01, schrieb Anthony Liguori:
> On 08/22/2011 03:48 PM, Ryan Harper wrote:
>> * Stefan Hajnoczi  [2011-08-22 15:32]:
>>> We wouldn't rm -rf block/* because we still need qemu-nbd.  It
>>> probably makes sense to keep what we have today.  I'm talking more
>>> about a shift from writing our own image format to integrating
>>> existing storage support.
>>
>> I think this is a key point.  While I do like the idea of keeping QEMU
>> focused on single VM, I think we don't help ourselves by not consuming
>> the hypervisor platform services and integrating/exploiting those
>> features to make using QEMU easier.
> 
> Let's avoid the h-word here as it's not terribly relevant to the discussion.
> 
> Configuring block devices is fundamentally a privileged operation.  QEMU 
> fundamentally is designed to be useful as an unprivileged user.
> 
> That's the trouble with something like LVM.  Only root can create LVM 
> snapshots and it's an all-or-nothing security model.
> 
> If you want to get QEMU out of the snapshot business, you need a file 
> system that's widely available that allows non-privileged users to take 
> snapshots of individual files.

I agree with you there (and it's interesting how different perception of
the BoF results can be ;-))

It's probably true that there are ways to do certain things on host
block devices and we should definitely support such use cases better
(where we means mostly the management layer, but we can possibly
integrate things into qemu like a file-btrfs or lvm_device backend that
supports snapshots or something).

It isn't for everyone, though, and this is why I tried to point out in
the BoF that image formats aren't going to go away and we still need
good support for them. Providing only raw for running VMs and declaring
the rest of the formats to be intended for import/export only doesn't work.

Kevin



Re: [Qemu-devel] [RFC PATCH v4 2/5] ramlist mutex

2011-08-23 Thread Paolo Bonzini

On 08/23/2011 11:17 AM, Marcelo Tosatti wrote:

>  >typedef struct RAMList {
>  >  +QemuMutex mutex;
>  >uint8_t *phys_dirty;
>  >QLIST_HEAD(ram, RAMBlock) blocks;
>  >QLIST_HEAD(, RAMBlock) blocks_mru;

>
>  A comment on what the mutex protects would be good.


Indeed, especially because Umesh wanted to use the ramlist+iothread 
combo as a rw-lock: iothread = read-lock for the I/O thread, ramlist = 
read-lock for the migration thread, together = exclusive (write) lock. 
But I think I talked him out of this. :)  It's not a bad idea in 
general, it just sounds like overkill in this case.



And on the lock ordering.


I think when only two locks are involved, we can always assume iothread 
is outer and the other is inner.  Do you agree?
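
As a sketch, call sites would then follow this order (the mutex field is the
one from Umesh's patch; the iothread helpers already exist):

/* Lock ordering: iothread mutex (outer) -> ramlist mutex (inner).
 * Never take them in the opposite order. */
qemu_mutex_lock_iothread();
qemu_mutex_lock(&ram_list.mutex);

/* ... walk or modify ram_list.blocks ... */

qemu_mutex_unlock(&ram_list.mutex);
qemu_mutex_unlock_iothread();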


Paolo



Re: [Qemu-devel] Block layer roadmap on wiki

2011-08-23 Thread Stefan Hajnoczi
On Tue, Aug 23, 2011 at 12:25 PM, Kevin Wolf  wrote:
> Am 22.08.2011 23:01, schrieb Anthony Liguori:
>> On 08/22/2011 03:48 PM, Ryan Harper wrote:
>>> * Stefan Hajnoczi  [2011-08-22 15:32]:
 We wouldn't rm -rf block/* because we still need qemu-nbd.  It
 probably makes sense to keep what we have today.  I'm talking more
 about a shift from writing our own image format to integrating
 existing storage support.
>>>
>>> I think this is a key point.  While I do like the idea of keeping QEMU
>>> focused on single VM, I think we don't help ourselves by not consuming
>>> the hypervisor platform services and integrating/exploiting those
>>> features to make using QEMU easier.
>>
>> Let's avoid the h-word here as it's not terribly relevant to the discussion.
>>
>> Configuring block devices is fundamentally a privileged operation.  QEMU
>> fundamentally is designed to be useful as an unprivileged user.
>>
>> That's the trouble with something like LVM.  Only root can create LVM
>> snapshots and it's an all-or-nothing security model.
>>
>> If you want to get QEMU out of the snapshot business, you need a file
>> system that's widely available that allows non-privileged users to take
>> snapshots of individual files.
>
> I agree with you there (and it's interesting how different perception of
> the BoF results can be ;-))
>
> It's probably true that there are ways to do certain things on host
> block devices and we should definitely support such use cases better
> (where we means mostly the management layer, but we can possibly
> integrate things into qemu like a file-btrfs or lvm_device backend that
> supports snapshots or something).
>
> It isn't for everyone, though, and this is why I tried to point out in
> the BoF that image formats aren't going to go away and we still need
> good support for them. Providing only raw for running VMs and declaring
> the rest of the formats to be intended for import/export only doesn't work.

I have said that block/*.c doesn't go away.  But we need to look at
exploiting storage features rather than reinventing them.

Snapshots are an example: we do not have a scalable snapshot mechanism
in QEMU.  External snapshots are inefficient when you build up
multiple levels (due to having to follow the backing file chain) and
when you delete a snapshot (due to copying data back into the backing
file).  Internal snapshots in qcow2 involve operations that traverse
the image metadata.  This traversal becomes a problem when image files
grow large (e.g. 1 TB and beyond) because the I/O required can take
more than 1 second which is problematic for taking snapshots while the
VM is running.

There are known ways of doing better internal snapshots along the
lines of what ZFS, btrfs, and thin-dev do.  But that means redesigning
the image metadata and reimplementing these storage systems in
userspace.

What I'm suggesting is that we draw the line here.  Keep what we've
got and continue the optimizations that we have in the pipeline.  But
when we hit significant new features, work with existing storage
systems.  Why?  Because we need to support existing storage anyway and
therefore reinventing our own is not a good use of resources.

Stefan



Re: [Qemu-devel] [PATCH] sheepdog: use coroutines

2011-08-23 Thread Kevin Wolf
Am 12.08.2011 14:33, schrieb MORITA Kazutaka:
> This makes the sheepdog block driver support bdrv_co_readv/writev
> instead of bdrv_aio_readv/writev.
> 
> With this patch, Sheepdog network I/O becomes fully asynchronous.  The
> block driver yields back when send/recv returns EAGAIN, and is resumed
> when the sheepdog network connection is ready for the operation.
> 
> Signed-off-by: MORITA Kazutaka 
> ---
>  block/sheepdog.c |  150 +
>  1 files changed, 93 insertions(+), 57 deletions(-)
> 
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index e150ac0..e283c34 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -274,7 +274,7 @@ struct SheepdogAIOCB {
>  int ret;
>  enum AIOCBState aiocb_type;
>  
> -QEMUBH *bh;
> +Coroutine *coroutine;
>  void (*aio_done_func)(SheepdogAIOCB *);
>  
>  int canceled;
> @@ -295,6 +295,10 @@ typedef struct BDRVSheepdogState {
>  char *port;
>  int fd;
>  
> +CoMutex lock;
> +Coroutine *co_send;
> +Coroutine *co_recv;
> +
>  uint32_t aioreq_seq_num;
>  QLIST_HEAD(outstanding_aio_head, AIOReq) outstanding_aio_head;
>  } BDRVSheepdogState;
> @@ -346,19 +350,16 @@ static const char * sd_strerror(int err)
>  /*
>   * Sheepdog I/O handling:
>   *
> - * 1. In the sd_aio_readv/writev, read/write requests are added to the
> - *QEMU Bottom Halves.
> - *
> - * 2. In sd_readv_writev_bh_cb, the callbacks of BHs, we send the I/O
> - *requests to the server and link the requests to the
> - *outstanding_list in the BDRVSheepdogState.  we exits the
> - *function without waiting for receiving the response.
> + * 1. In sd_co_rw_vector, we send the I/O requests to the server and
> + *link the requests to the outstanding_list in the
> + *BDRVSheepdogState.  The function exits without waiting for
> + *receiving the response.
>   *
> - * 3. We receive the response in aio_read_response, the fd handler to
> + * 2. We receive the response in aio_read_response, the fd handler to
>   *the sheepdog connection.  If metadata update is needed, we send
>   *the write request to the vdi object in sd_write_done, the write
> - *completion function.  The AIOCB callback is not called until all
> - *the requests belonging to the AIOCB are finished.
> + *completion function.  We switch back to sd_co_readv/writev after
> + *all the requests belonging to the AIOCB are finished.
>   */
>  
>  static inline AIOReq *alloc_aio_req(BDRVSheepdogState *s, SheepdogAIOCB *acb,
> @@ -398,7 +399,7 @@ static inline int free_aio_req(BDRVSheepdogState *s, 
> AIOReq *aio_req)
>  static void sd_finish_aiocb(SheepdogAIOCB *acb)
>  {
>  if (!acb->canceled) {
> -acb->common.cb(acb->common.opaque, acb->ret);
> +qemu_coroutine_enter(acb->coroutine, NULL);
>  }
>  qemu_aio_release(acb);
>  }
> @@ -411,7 +412,8 @@ static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
>   * Sheepdog cannot cancel the requests which are already sent to
>   * the servers, so we just complete the request with -EIO here.
>   */
> -acb->common.cb(acb->common.opaque, -EIO);
> +acb->ret = -EIO;
> +qemu_coroutine_enter(acb->coroutine, NULL);
>  acb->canceled = 1;
>  }
>  
> @@ -435,24 +437,12 @@ static SheepdogAIOCB *sd_aio_setup(BlockDriverState 
> *bs, QEMUIOVector *qiov,
>  
>  acb->aio_done_func = NULL;
>  acb->canceled = 0;
> -acb->bh = NULL;
> +acb->coroutine = qemu_coroutine_self();
>  acb->ret = 0;
>  QLIST_INIT(&acb->aioreq_head);
>  return acb;
>  }
>  
> -static int sd_schedule_bh(QEMUBHFunc *cb, SheepdogAIOCB *acb)
> -{
> -if (acb->bh) {
> -error_report("bug: %d %d", acb->aiocb_type, acb->aiocb_type);
> -return -EIO;
> -}
> -
> -acb->bh = qemu_bh_new(cb, acb);
> -qemu_bh_schedule(acb->bh);
> -return 0;
> -}
> -
>  #ifdef _WIN32
>  
>  struct msghdr {
> @@ -635,7 +625,13 @@ static int do_readv_writev(int sockfd, struct iovec 
> *iov, int len,
>  again:
>  ret = do_send_recv(sockfd, iov, len, iov_offset, write);
>  if (ret < 0) {
> -if (errno == EINTR || errno == EAGAIN) {
> +if (errno == EINTR) {
> +goto again;
> +}
> +if (errno == EAGAIN) {
> +if (qemu_in_coroutine()) {
> +qemu_coroutine_yield();
> +}

Who reenters the coroutine if we yield here?

And of course for a coroutine-based block driver I would like to see
much less code in callback functions. But it's a good start anyway.
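
For what it's worth, a rough sketch of one way the yielded coroutine could be
resumed, with the socket's write-ready handler re-entering it (names follow
the patch; this is not the actual fix):

/* send path: remember the current coroutine before yielding on EAGAIN */
s->co_send = qemu_coroutine_self();
qemu_coroutine_yield();

/* write-ready handler registered on s->fd with qemu_aio_set_fd_handler(): */
static void co_write_request(void *opaque)
{
    BDRVSheepdogState *s = opaque;

    qemu_coroutine_enter(s->co_send, NULL);
}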

Kevin



Re: [Qemu-devel] [RFC][PATCH 000/111] QEMU m68k core additions

2011-08-23 Thread Rob Landley
On 08/21/2011 09:15 PM, Natalia Portillo wrote:
> Definitively you don't know how a Mac works, you don't want to know
> and you don't need to.

I do not care about the half of MacOS that Apple burns into ROM which
Linux does not use, true.

 On 08/20/2011 07:23 PM, Natalia Portillo wrote:
> 
> I hate huge emails. Anyway I can't answer a lot of things you said
> because of NDAs so doesn't matter.

Right, so the answer to why the emulator you suggested doesn't actually
emulate any real hardware devices is either "tl;dr", "I can't tell you
because of NDA", or "you don't understand how macs work".

*shrug*  Have fun with that.

> You can create your own virtual non-existent hardware (it's done
> extensively in this world) and patch Linux to boot of it inside
> qemu. Or you can check for Linux's support for development boards
> (sure there are one or two) and implement it on qemu based on what
> Linux's source says.

I have, in fact, _seen_ Linux run on an m68k mac, although it was many
moons ago.  Obviously it _can_ be done.

(And I could play with coldfire now, but doing nommu in userspace is
fiddly and irritating.  Stuff like cris is higher on my todo list than
that.)

> And FOR GOD'S SAKE CHECK THE ** COMPATIBILITY LIST ON
> LINUX-M68K.
> 
> No Mac m68k suits your needs for Linux NONE NINGUNO AUCUN KEINER
> NESSUNO NENHUM.

Which is of course why Linux has a mac_defconfig under the m68k
directory specifying numerous drivers for actual hardware.

It seems unlikely that the powerpc mac99 and g3beige worked just fine
for me, but before they swapped the CPU out it all ran by magic.

Rob

P.S. For context as to what counts as "intimidating" and "magic", Jeri
Ellsworth reverse engineered the Amiga chips in an FPGA:
http://www.youtube.com/watch?v=5uaDzF99a80



Re: [Qemu-devel] [PATCH 2/2] [RFC] time: refactor QEMU timer to use GHRTimer

2011-08-23 Thread Anthony Liguori

On 08/23/2011 02:43 AM, Paolo Bonzini wrote:

On 08/22/2011 09:21 PM, Anthony Liguori wrote:

- ticks = cpu_get_real_ticks();
- if (timers_state.cpu_ticks_prev > ticks) {
- /* Note: non increasing ticks may happen if the host uses
- software suspend */
- timers_state.cpu_ticks_offset += timers_state.cpu_ticks_prev - ticks;
- }
+ ticks = get_clock();

[...]

-static inline int64_t cpu_get_real_ticks(void)
-{
- int64_t val;
- asm volatile ("rdtsc" : "=A" (val));
- return val;
-}
-


cpu_get_ticks is used also to emulate the guest TSC, are you sure you
want to change that uniformly to a 1 GHz rate?


Where possible, CLOCK_MONOTONIC_RAW should be implemented in terms of
rdtscp.  It will provide the highest accuracy time source that you can get.


So I don't think there's any difference in timer granularity between
using CLOCK_MONOTONIC_RAW and rdtsc directly, other than that the former
is more correct.


Regards,

Anthony Liguori



I had some more cleanups in this area, I'll try to get them tested and
submitted but I have little time for this right now.

Paolo





Re: [Qemu-devel] [PATCH] qed: make qed_alloc_clusters round up offset to nearest cluster

2011-08-23 Thread Kevin Wolf
Am 16.08.2011 01:16, schrieb Devin Nakamura:
> Signed-off-by: Devin Nakamura 
> ---
>  block/qed.c |2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/block/qed.c b/block/qed.c
> index 333f067..9a1e49c 100644
> --- a/block/qed.c
> +++ b/block/qed.c
> @@ -263,6 +263,8 @@ static int qed_read_string(BlockDriverState *file, 
> uint64_t offset, size_t n,
>   */
>  static uint64_t qed_alloc_clusters(BDRVQEDState *s, unsigned int n)
>  {
> +s->file_size =  qed_start_of_cluster(s, s->file_size +
> + s->header.cluster_size - 1);
>  uint64_t offset = s->file_size;
>  s->file_size += n * s->header.cluster_size;
>  return offset;

Stefan, Devin, have you come to a conclusion about this patch?

Kevin



Re: [Qemu-devel] [RFC PATCH v4 2/5] ramlist mutex

2011-08-23 Thread Marcelo Tosatti
On Tue, Aug 23, 2011 at 01:41:48PM +0200, Paolo Bonzini wrote:
> On 08/23/2011 11:17 AM, Marcelo Tosatti wrote:
>   >typedef struct RAMList {
>   >  +QemuMutex mutex;
>   >uint8_t *phys_dirty;
>   >QLIST_HEAD(ram, RAMBlock) blocks;
>   >QLIST_HEAD(, RAMBlock) blocks_mru;
> >>>
> >>>  A comment on what the mutex protects would be good.
> 
> Indeed, especially because Umesh wanted to use the ramlist+iothread
> combo as a rw-lock: iothread = read-lock for the I/O thread, ramlist
> = read-lock for the migration thread, together = exclusive (write)
> lock. But I think I talked him out of this. :)  It's not a bad idea
> in general, it just sounds like overkill in this case.
> 
> >And on the lock ordering.
> 
> I think when only two locks are involved, we can always assume
> iothread is outer and the other is inner.  Do you agree?
> 
> Paolo

Yep. 



Re: [Qemu-devel] [PATCH v2] posix-aio-compat: fix latency issues

2011-08-23 Thread Anthony Liguori

On 08/23/2011 06:01 AM, Stefan Hajnoczi wrote:

On Mon, Aug 22, 2011 at 6:29 PM, Jan Kiszka  wrote:

On 2011-08-14 06:04, Avi Kivity wrote:

In certain circumstances, posix-aio-compat can incur a lot of latency:
  - threads are created by vcpu threads, so if vcpu affinity is set,
aio threads inherit vcpu affinity.  This can cause many aio threads
to compete for one cpu.
  - we can create up to max_threads (64) aio threads in one go; since a
pthread_create can take around 30μs, we have up to 2ms of cpu time
under a global lock.

Fix by:
  - moving thread creation to the main thread, so we inherit the main
thread's affinity instead of the vcpu thread's affinity.
  - if a thread is currently being created, and we need to create yet
another thread, let thread being born create the new thread, reducing
the amount of time we spend under the main thread.
  - drop the local lock while creating a thread (we may still hold the
global mutex, though)

Note this doesn't eliminate latency completely; scheduler artifacts or
lack of host cpu resources can still cause it.  We may want pre-allocated
threads when this cannot be tolerated.

Thanks to Uli Obergfell of Red Hat for his excellent analysis and suggestions.


At this chance: What is the state of getting rid of the remaining delta
between upstream's version and qemu-kvm?


That would be nice.  qemu-kvm.git uses a signalfd to handle I/O
completion whereas qemu.git uses a signal, writes to a pipe from the
signal handler, and uses qemu_notify_event() to break the vcpu.  Once
the force iothread patch is merged we should be able to move to
qemu-kvm.git's signalfd approach.


No need to use a signal at all actually.  The use of a signal is 
historic and was required to work around the TCG race that I referred to 
in another thread.


You should be able to just use an eventfd or pipe.

Better yet, we should look at using GThreadPool to replace posix-aio-compat.
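
For reference, the GThreadPool API is small; a rough sketch of what the
submission path could look like (the names here are made up, not a
worked-out replacement for posix-aio-compat.c):

#include <glib.h>

static GThreadPool *aio_pool;

static void aio_worker(gpointer data, gpointer user_data)
{
    struct qemu_paiocb *aiocb = data;
    /* do the preadv/pwritev/fsync work, then signal completion */
}

static void paio_pool_init(void)
{
    /* up to 64 worker threads, shared (non-exclusive) pool */
    aio_pool = g_thread_pool_new(aio_worker, NULL, 64, FALSE, NULL);
}

static void paio_submit_to_pool(struct qemu_paiocb *aiocb)
{
    g_thread_pool_push(aio_pool, aiocb, NULL);
}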

Regards,

Anthony Liguori



Stefan






Re: [Qemu-devel] [PATCH 2/2] [RFC] time: refactor QEMU timer to use GHRTimer

2011-08-23 Thread Paolo Bonzini

On 08/23/2011 02:33 PM, Anthony Liguori wrote:

cpu_get_ticks is used also to emulate the guest TSC, are you sure you
want to change that uniformly to a 1 GHz rate?


Where possible, CLOCK_MONOTONIC_RAW should be implemented in terms of
rdtscp.  It will provide the highest accuracy time source that you can get.


Not all machines have rdtscp.  People do not want to see their rdtsc 
suddenly 20 times slower or so (that's more or less the speed difference 
between HPET access and rdtsc).  Many also would likely be surprised to 
see a guest TSC that is 1 GHz.  Xen did the same at some point, and 
backed it out.


And you're not using CLOCK_MONOTONIC_RAW anyway (yet?).  If you used it, 
you should also change the CPUID features since QEMU would be providing 
a constant, invariant TSC to all guests.



So I don't think there's any difference in timer granularity between
using CLOCK_MONOTONIC_RAW and rdtsc directly, other than that the former
is more correct.


CLOCK_MONOTONIC_RAW has its resolution capped to 1 ns.  It is as correct 
and as precise, but less accurate than the TSC if your machine's TSC is 
constant and invariant.
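
For reference, a get_clock()-style helper on top of it would just be
(Linux-specific sketch; not what qemu-timer.h does today):

#include <stdint.h>
#include <time.h>

/* Nanosecond clock based on CLOCK_MONOTONIC_RAW; resolution is capped
 * at 1 ns, as discussed above. */
static inline int64_t get_clock_raw(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}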


Paolo



[Qemu-devel] [PATCH 4/4] qmp: add query-block-jobs

2011-08-23 Thread Stefan Hajnoczi
This patch introduces the query-block-jobs HMP/QMP command.  It is
currently unimplemented and returns an empty dict.

query-block-jobs
----------------

Show progress of ongoing block device operations.

Return a json-array of all block device operations.  If no operation is
active then return an empty array.  Each operation is a json-object with
the following data:

- type: job type ("stream" for image streaming, json-string)
- device:   device name (json-string)
- end:  maximum progress value (json-int)
- position: current progress value (json-int)
- speed:rate limit, bytes per second (json-int)

Progress can be observed as position increases and it reaches end when
the operation completes.  Position and end have undefined units but can
be used to calculate a percentage indicating the progress that has been
made.

Example:

-> { "execute": "query-block-jobs" }
<- { "return":[
  { "type": "stream", "device": "virtio0",
"end": 10737418240, "position": 709632,
"speed": 0 }
   ]
 }

Signed-off-by: Stefan Hajnoczi 
---
 blockdev.c  |   10 ++
 blockdev.h  |2 ++
 monitor.c   |   16 
 qmp-commands.hx |   32 
 4 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 036b7eb..e9098f6 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -835,3 +835,13 @@ int do_block_job_cancel(Monitor *mon, const QDict *params, 
QObject **ret_data)
 qerror_report(QERR_DEVICE_NOT_ACTIVE, device);
 return -1;
 }
+
+void monitor_print_block_jobs(Monitor *mon, const QObject *data)
+{
+monitor_printf(mon, "No active jobs\n");
+}
+
+void do_info_block_jobs(Monitor *mon, QObject **ret_data)
+{
+*ret_data = QOBJECT(qdict_new());
+}
diff --git a/blockdev.h b/blockdev.h
index a132d36..f199d7a 100644
--- a/blockdev.h
+++ b/blockdev.h
@@ -69,5 +69,7 @@ int do_block_stream(Monitor *mon, const QDict *qdict, QObject 
**ret_data);
 int do_block_job_set_speed(Monitor *mon, const QDict *qdict,
QObject **ret_data);
 int do_block_job_cancel(Monitor *mon, const QDict *qdict, QObject **ret_data);
+void monitor_print_block_jobs(Monitor *mon, const QObject *data);
+void do_info_block_jobs(Monitor *mon, QObject **ret_data);
 
 #endif
diff --git a/monitor.c b/monitor.c
index dc55fca..e832d10 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2907,6 +2907,14 @@ static const mon_cmd_t info_cmds[] = {
 .mhandler.info_new = bdrv_info_stats,
 },
 {
+.name   = "block-jobs",
+.args_type  = "",
+.params = "",
+.help   = "show progress of ongoing block device operations",
+.user_print = monitor_print_block_jobs,
+.mhandler.info_new = do_info_block_jobs,
+},
+{
 .name   = "registers",
 .args_type  = "",
 .params = "",
@@ -3206,6 +3214,14 @@ static const mon_cmd_t qmp_query_cmds[] = {
 .mhandler.info_new = bdrv_info_stats,
 },
 {
+.name   = "block-jobs",
+.args_type  = "",
+.params = "",
+.help   = "show progress of ongoing block device operations",
+.user_print = monitor_print_block_jobs,
+.mhandler.info_new = do_info_block_jobs,
+},
+{
 .name   = "cpus",
 .args_type  = "",
 .params = "",
diff --git a/qmp-commands.hx b/qmp-commands.hx
index de442f7..ffac014 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -2006,3 +2006,35 @@ Example:
 
 EQMP
 
+SQMP
+
+query-block-jobs
+----------------
+
+Show progress of ongoing block device operations.
+
+Return a json-array of all block device operations.  If no operation is active
+then return an empty array.  Each operation is a json-object with the following
+data:
+
+- type: job type ("stream" for image streaming, json-string)
+- device:   device name (json-string)
+- end:  maximum progress value (json-int)
+- position: current progress value (json-int)
+- speed:rate limit, bytes per second (json-int)
+
+Progress can be observed as position increases and it reaches end when the
+operation completes.  Position and end have undefined units but can be used to
+calculate a percentage indicating the progress that has been made.
+
+Example:
+
+-> { "execute": "query-block-jobs" }
+<- { "return":[
+  { "type": "stream", "device": "virtio0",
+"end": 10737418240, "position": 709632,
+"speed": 0 }
+   ]
+ }
+
+EQMP
-- 
1.7.5.4




[Qemu-devel] [PATCH 3/4] qmp: add block_job_cancel command

2011-08-23 Thread Stefan Hajnoczi
This patch introduces the block_job_cancel HMP/QMP command.  It is
currently unimplemented and returns the 'DeviceNotActive' error.

block_job_cancel
----------------

Stop an active block streaming operation.

This command returns once the active block streaming operation has been
stopped.  It is an error to call this command if no operation is in
progress.

The image file retains its backing file unless the streaming operation
happens to complete just as it is being cancelled.

A new block streaming operation can be started at a later time to finish
copying all data from the backing file.

Arguments:

- device: device name (json-string)

Errors:

DeviceNotActive: streaming is not active on this device
DeviceInUse: cancellation already in progress

Examples:

-> { "execute": "block_job_cancel", "arguments": { "device": "virtio0" }
}
<- { "return":  {} }

Signed-off-by: Stefan Hajnoczi 
---
 blockdev.c  |8 
 blockdev.h  |1 +
 hmp-commands.hx |   15 +++
 qmp-commands.hx |   40 
 4 files changed, 64 insertions(+), 0 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index ee40263..036b7eb 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -827,3 +827,11 @@ int do_block_job_set_speed(Monitor *mon, const QDict 
*params,
 qerror_report(QERR_NOT_SUPPORTED);
 return -1;
 }
+
+int do_block_job_cancel(Monitor *mon, const QDict *params, QObject **ret_data)
+{
+const char *device = qdict_get_str(params, "device");
+
+qerror_report(QERR_DEVICE_NOT_ACTIVE, device);
+return -1;
+}
diff --git a/blockdev.h b/blockdev.h
index 6b48405..a132d36 100644
--- a/blockdev.h
+++ b/blockdev.h
@@ -68,5 +68,6 @@ int do_block_resize(Monitor *mon, const QDict *qdict, QObject 
**ret_data);
 int do_block_stream(Monitor *mon, const QDict *qdict, QObject **ret_data);
 int do_block_job_set_speed(Monitor *mon, const QDict *qdict,
QObject **ret_data);
+int do_block_job_cancel(Monitor *mon, const QDict *qdict, QObject **ret_data);
 
 #endif
diff --git a/hmp-commands.hx b/hmp-commands.hx
index a5fb4d5..ff1a54b 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -101,6 +101,21 @@ Set maximum speed for a background block operation.
 ETEXI
 
 {
+.name   = "block_job_cancel",
+.args_type  = "device:B",
+.params = "device",
+.help   = "stop an active block streaming operation",
+.user_print = monitor_user_noop,
+.mhandler.cmd_new = do_block_job_cancel,
+},
+
+STEXI
+@item block_job_cancel
+@findex block_job_cancel
+Stop an active block streaming operation.
+ETEXI
+
+{
 .name   = "eject",
 .args_type  = "force:-f,device:B",
 .params = "[-f] device",
diff --git a/qmp-commands.hx b/qmp-commands.hx
index eface05..de442f7 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -794,6 +794,46 @@ Example:
 EQMP
 
 {
+.name   = "block_job_cancel",
+.args_type  = "device:B",
+.params = "device",
+.user_print = monitor_user_noop,
+.mhandler.cmd_new = do_block_job_cancel,
+},
+
+SQMP
+
+block_job_cancel
+----------------
+
+Stop an active block streaming operation.
+
+This command returns once the active block streaming operation has been
+stopped.  It is an error to call this command if no operation is in progress.
+
+The image file retains its backing file unless the streaming operation happens
+to complete just as it is being cancelled.
+
+A new block streaming operation can be started at a later time to finish
+copying all data from the backing file.
+
+Arguments:
+
+- device: device name (json-string)
+
+Errors:
+
+DeviceNotActive: streaming is not active on this device
+DeviceInUse: cancellation already in progress
+
+Examples:
+
+-> { "execute": "block_job_cancel", "arguments": { "device": "virtio0" } }
+<- { "return":  {} }
+
+EQMP
+
+{
 .name   = "blockdev-snapshot-sync",
 .args_type  = "device:B,snapshot-file:s?,format:s?",
 .params = "device [new-image-file] [format]",
-- 
1.7.5.4




[Qemu-devel] [PATCH 0/4] Image Streaming API

2011-08-23 Thread Stefan Hajnoczi
These patches put in place the image streaming QMP/HMP commands and
documentation.  Image streaming itself is not implemented by this patch series
but the HMP/QMP commands that libvirt uses are implemented to return
NotSupported.

The Image Streaming API can be used to copy the contents of a backing file into
the image file while the guest is running.  The API is described on the wiki:
http://wiki.qemu.org/Features/LiveBlockMigration/ImageStreamingAPI

The point of this series is to commit QEMU to the API that we have worked out
with libvirt.  The QED Image Streaming series that I posted earlier provides an
implementation for the QED image format only.  I am currently working on a
generic block layer implementation so that any format with backing file support
can do image streaming.

For reference, the QED-specific implementation lives here:
http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/stream-command

Stefan Hajnoczi (4):
  qmp: add block_stream command
  qmp: add block_job_set_speed command
  qmp: add block_job_cancel command
  qmp: add query-block-jobs

 blockdev.c  |   55 ++
 blockdev.h  |6 ++
 hmp-commands.hx |   44 ++
 monitor.c   |   19 ++
 monitor.h   |1 +
 qerror.h|3 +
 qmp-commands.hx |  172 +++
 7 files changed, 300 insertions(+), 0 deletions(-)

-- 
1.7.5.4




[Qemu-devel] [PATCH 2/4] qmp: add block_job_set_speed command

2011-08-23 Thread Stefan Hajnoczi
This patch introduces the block_job_set_speed HMP/QMP command.  It is
currently unimplemented and returns the 'NotSupported' error.

block_job_set_speed
-------------------

Set maximum speed for a background block operation.

This command can only be issued when there is an active block job.

Throttling can be disabled by setting the speed to 0.

Arguments:

- device: device name (json-string)
- value:  maximum speed, in bytes per second (json-int)

Errors:
NotSupported:job type does not support speed setting
DeviceNotActive: streaming is not active on this device

Example:

-> { "execute": "block_job_set_speed",
"arguments": { "device": "virtio0", "value": 1024 } }

Signed-off-by: Stefan Hajnoczi 
---
 blockdev.c  |   19 +++
 blockdev.h  |2 ++
 hmp-commands.hx |   15 +++
 qmp-commands.hx |   35 +++
 4 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 208bfc9..ee40263 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -808,3 +808,22 @@ int do_block_stream(Monitor *mon, const QDict *params, 
QObject **ret_data)
 qerror_report(QERR_NOT_SUPPORTED);
 return -1;
 }
+
+int do_block_job_set_speed(Monitor *mon, const QDict *params,
+   QObject **ret_data)
+{
+const char *device = qdict_get_str(params, "device");
+BlockDriverState *bs;
+
+bs = bdrv_find(device);
+if (!bs) {
+qerror_report(QERR_DEVICE_NOT_ACTIVE, device);
+return -1;
+}
+
+/* This command is not yet implemented.  The device not found check above
+ * is done so that error ordering will not change when fully implemented.
+ */
+qerror_report(QERR_NOT_SUPPORTED);
+return -1;
+}
diff --git a/blockdev.h b/blockdev.h
index ad98d37..6b48405 100644
--- a/blockdev.h
+++ b/blockdev.h
@@ -66,5 +66,7 @@ int do_drive_del(Monitor *mon, const QDict *qdict, QObject 
**ret_data);
 int do_snapshot_blkdev(Monitor *mon, const QDict *qdict, QObject **ret_data);
 int do_block_resize(Monitor *mon, const QDict *qdict, QObject **ret_data);
 int do_block_stream(Monitor *mon, const QDict *qdict, QObject **ret_data);
+int do_block_job_set_speed(Monitor *mon, const QDict *qdict,
+   QObject **ret_data);
 
 #endif
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 2a16fd9..a5fb4d5 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -86,6 +86,21 @@ Copy data from a backing file into a block device.
 ETEXI
 
 {
+.name   = "block_job_set_speed",
+.args_type  = "device:B,value:o",
+.params = "device value",
+.help   = "set maximum speed for a background block operation",
+.user_print = monitor_user_noop,
+.mhandler.cmd_new = do_block_job_set_speed,
+},
+
+STEXI
+@item block_job_set_speed
+@findex block_job_set_speed
+Set maximum speed for a background block operation.
+ETEXI
+
+{
 .name   = "eject",
 .args_type  = "force:-f,device:B",
 .params = "[-f] device",
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 60c9bdf..eface05 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -759,6 +759,41 @@ Examples:
 EQMP
 
 {
+.name   = "block_job_set_speed",
+.args_type  = "device:B,value:o",
+.params = "device value",
+.user_print = monitor_user_noop,
+.mhandler.cmd_new = do_block_job_set_speed,
+},
+
+SQMP
+
+block_job_set_speed
+-------------------
+
+Set maximum speed for a background block operation.
+
+This command can only be issued when there is an active block job.
+
+Throttling can be disabled by setting the speed to 0.
+
+Arguments:
+
+- device: device name (json-string)
+- value:  maximum speed, in bytes per second (json-int)
+
+Errors:
+NotSupported:    job type does not support speed setting
+DeviceNotActive: streaming is not active on this device
+
+Example:
+
+-> { "execute": "block_job_set_speed",
+"arguments": { "device": "virtio0", "value": 1024 } }
+
+EQMP
+
+{
 .name   = "blockdev-snapshot-sync",
 .args_type  = "device:B,snapshot-file:s?,format:s?",
 .params = "device [new-image-file] [format]",
-- 
1.7.5.4




[Qemu-devel] [PATCH 1/4] qmp: add block_stream command

2011-08-23 Thread Stefan Hajnoczi
This patch introduces the block_stream HMP/QMP command.  It is currently
unimplemented and returns the 'NotSupported' error.

block_stream
------------

Copy data from a backing file into a block device.

The block streaming operation is performed in the background until the
entire backing file has been copied.  This command returns immediately
once streaming has started.  The status of ongoing block streaming
operations can be checked with query-block-jobs.  The operation can be
stopped before it has completed using the block_job_cancel command.
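
For example, a client could poll progress with query-block-jobs; the reply
might look something like this (field names assumed to mirror the
BLOCK_JOB_COMPLETED event described below, values purely illustrative):

-> { "execute": "query-block-jobs" }
<- { "return": [ { "type": "stream", "device": "virtio0",
                   "end": 1073741824, "position": 536870912, "speed": 0 } ] }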

If a base file is specified then sectors are not copied from that base
file and its backing chain.  When streaming completes the image file
will have the base file as its backing file.  This can be used to stream
a subset of the backing file chain instead of flattening the entire
image.
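
A partial stream down to a common base might be requested like this (the
base path is purely illustrative):

-> { "execute": "block_stream",
     "arguments": { "device": "virtio0", "base": "/mnt/base.img" } }
<- { "return": {} }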

On successful completion the image file is updated to drop the backing
file.

Arguments:

- device: device name (json-string)
- base:   common backing file (json-string, optional)

Errors:

DeviceInUse:    streaming is already active on this device
DeviceNotFound: device name is invalid
NotSupported:   image streaming is not supported by this device

Events:

On completion the BLOCK_JOB_COMPLETED event is raised with the following
fields:

- type: job type ("stream" for image streaming, json-string)
- device:   device name (json-string)
- end:  maximum progress value (json-int)
- position: current progress value (json-int)
- speed:rate limit, bytes per second (json-int)
- error:error message (json-string, only on error)

The completion event is raised both on success and on failure.  On
success position is equal to end.  On failure position and end can be
used to indicate at which point the operation failed.
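
As an illustration only (all values made up), a successful completion event
could look like:

<- { "event": "BLOCK_JOB_COMPLETED",
     "data": { "type": "stream", "device": "virtio0",
               "end": 1073741824, "position": 1073741824, "speed": 0 } }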

On failure the error field contains a human-readable error message.
There are no semantics other than that streaming has failed and clients
should not try to interpret the error string.

Examples:

-> { "execute": "block_stream", "arguments": { "device": "virtio0" } }
<- { "return":  {} }

Signed-off-by: Stefan Hajnoczi 
---
 blockdev.c  |   18 +++
 blockdev.h  |1 +
 hmp-commands.hx |   14 +++
 monitor.c   |3 ++
 monitor.h   |1 +
 qerror.h|3 ++
 qmp-commands.hx |   65 +++
 7 files changed, 105 insertions(+), 0 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index d272659..208bfc9 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -790,3 +790,21 @@ int do_block_resize(Monitor *mon, const QDict *qdict, QObject **ret_data)
 
 return 0;
 }
+
+int do_block_stream(Monitor *mon, const QDict *params, QObject **ret_data)
+{
+const char *device = qdict_get_str(params, "device");
+BlockDriverState *bs;
+
+bs = bdrv_find(device);
+if (!bs) {
+qerror_report(QERR_DEVICE_NOT_FOUND, device);
+return -1;
+}
+
+/* This command is not yet implemented.  The device not found check above
+ * is done so that error ordering will not change when fully implemented.
+ */
+qerror_report(QERR_NOT_SUPPORTED);
+return -1;
+}
diff --git a/blockdev.h b/blockdev.h
index 3587786..ad98d37 100644
--- a/blockdev.h
+++ b/blockdev.h
@@ -65,5 +65,6 @@ int do_change_block(Monitor *mon, const char *device,
 int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data);
 int do_snapshot_blkdev(Monitor *mon, const QDict *qdict, QObject **ret_data);
 int do_block_resize(Monitor *mon, const QDict *qdict, QObject **ret_data);
+int do_block_stream(Monitor *mon, const QDict *qdict, QObject **ret_data);
 
 #endif
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 0ccfb28..2a16fd9 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -70,6 +70,20 @@ but should be used with extreme caution.  Note that this command only
 resizes image files, it can not resize block devices like LVM volumes.
 ETEXI
 
+{
+.name   = "block_stream",
+.args_type  = "device:B,base:s?",
+.params = "device [base]",
+.help   = "copy data from a backing file into a block device",
+.user_print = monitor_user_noop,
+.mhandler.cmd_new = do_block_stream,
+},
+
+STEXI
+@item block_stream
+@findex block_stream
+Copy data from a backing file into a block device.
+ETEXI
 
 {
 .name   = "eject",
diff --git a/monitor.c b/monitor.c
index ada51d0..dc55fca 100644
--- a/monitor.c
+++ b/monitor.c
@@ -468,6 +468,9 @@ void monitor_protocol_event(MonitorEvent event, QObject *data)
 case QEVENT_SPICE_DISCONNECTED:
 event_name = "SPICE_DISCONNECTED";
 break;
+case QEVENT_BLOCK_JOB_COMPLETED:
+event_name = "BLOCK_JOB_COMPLETED";
+break;
 default:
 abort();
 break;
diff --git a/monitor.h b/monitor.h
index 4f2d328..135c927 100644
--- a/monitor.h
+++ b/monitor.h
@@ -35,6 +35,7 @@ typedef enum MonitorEvent {

Re: [Qemu-devel] [PATCH v2] posix-aio-compat: fix latency issues

2011-08-23 Thread Jan Kiszka
On 2011-08-23 14:40, Anthony Liguori wrote:
> On 08/23/2011 06:01 AM, Stefan Hajnoczi wrote:
>> On Mon, Aug 22, 2011 at 6:29 PM, Jan Kiszka  wrote:
>>> On 2011-08-14 06:04, Avi Kivity wrote:
 In certain circumstances, posix-aio-compat can incur a lot of latency:
   - threads are created by vcpu threads, so if vcpu affinity is set,
 aio threads inherit vcpu affinity.  This can cause many aio threads
 to compete for one cpu.
   - we can create up to max_threads (64) aio threads in one go; since a
 pthread_create can take around 30μs, we have up to 2ms of cpu time
 under a global lock.

 Fix by:
   - moving thread creation to the main thread, so we inherit the main
 thread's affinity instead of the vcpu thread's affinity.
   - if a thread is currently being created, and we need to create yet
 another thread, let thread being born create the new thread, reducing
 the amount of time we spend under the main thread.
   - drop the local lock while creating a thread (we may still hold the
 global mutex, though)

 Note this doesn't eliminate latency completely; scheduler artifacts or
 lack of host cpu resources can still cause it.  We may want pre-allocated
 threads when this cannot be tolerated.

 Thanks to Uli Obergfell of Red Hat for his excellent analysis and 
 suggestions.
>>>
>>> At this chance: What is the state of getting rid of the remaining delta
>>> between upstream's version and qemu-kvm?
>>
>> That would be nice.  qemu-kvm.git uses a signalfd to handle I/O
>> completion whereas qemu.git uses a signal, writes to a pipe from the
>> signal handler, and uses qemu_notify_event() to break the vcpu.  Once
>> the force iothread patch is merged we should be able to move to
>> qemu-kvm.git's signalfd approach.
> 
> No need to use a signal at all actually.  The use of a signal is 
> historic and was required to work around the TCG race that I referred to 
> in another thread.
> 
> You should be able to just use an eventfd or pipe.
> 
> Better yet, we should look at using GThreadPool to replace posix-aio-compat.

When interacting with the thread pool is part of some time-critical path
(easily possible with a real-time Linux guest), general-purpose
implementations like what glib offers are typically out of the game.
They do not provide sufficient customizability, specifically control
over their internal synchronization and allocation policies. That
applies to the other rather primitive glib threading and locking
services as well.
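
For reference, the simplest possible GThreadPool-based pool looks roughly
like the sketch below (the request structure and worker function are
invented for illustration, not the existing posix-aio-compat code). Note
that glib offers no hook to control worker affinity, creation timing or
internal locking, which is exactly the limitation described above:

    #include <glib.h>

    typedef struct AioReq {        /* hypothetical request */
        int fd;
        /* ... */
    } AioReq;

    static GThreadPool *pool;

    static void aio_worker(gpointer data, gpointer user_data)
    {
        AioReq *req = data;
        /* do the blocking preadv/pwritev/fsync here, then signal
         * completion through a pipe or eventfd */
        (void)req;
    }

    static void aio_pool_init(void)
    {
        /* up to 64 shared worker threads, created lazily by glib */
        pool = g_thread_pool_new(aio_worker, NULL, 64, FALSE, NULL);
    }

    static void aio_submit(AioReq *req)
    {
        g_thread_pool_push(pool, req, NULL);
    }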

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-23 Thread Roedel, Joerg
On Mon, Aug 22, 2011 at 05:03:53PM -0400, Benjamin Herrenschmidt wrote:
> 
> > I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> > assigned to a guest, there can also be an ioctl to bind a group to an
> > address-space of another group (certainly needs some care to not allow
> > that both groups belong to different processes).
> > 
> > Btw, a problem we havn't talked about yet entirely is
> > driver-deassignment. User space can decide to de-assign the device from
> > vfio while a fd is open on it. With PCI there is no way to let this fail
> > (the .release function returns void last time i checked). Is this a
> > problem, and yes, how we handle that?
> 
> We can treat it as a hard unplug (like a cardbus gone away).
> 
> IE. Dispose of the direct mappings (switch to MMIO emulation) and return
> all ff's from reads (& ignore writes).
> 
> Then send an unplug event via whatever mechanism the platform provides
> (ACPI hotplug controller on x86 for example, we haven't quite sorted out
> what to do on power for hotplug yet).

Hmm, good idea. But as far as I know the hotplug-event needs to be in
the guest _before_ the device is actually unplugged (so that the guest
can unbind its driver first). That somehow brings back the sleep-idea
and the timeout in the .release function.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632




[Qemu-devel] [PATCH v3 00/15] qcow/qcow2 cleanups

2011-08-23 Thread Frediano Ziglio
These patches mostly clean up some AIO code using coroutines.
Mostly they use the stack instead of an allocated AIO structure.
Feel free to collapse them if they are too short.

Frediano Ziglio (15):
  qcow: allocate QCowAIOCB structure using stack
  qcow: QCowAIOCB field cleanup
  qcow: move some blocks of code to avoid useless variable
initialization
  qcow: embed qcow_aio_read_cb into qcow_co_readv and qcow_aio_write_cb
into qcow_co_writev
  qcow: remove old #undefined code
  qcow2: removed unused fields
  qcow2: removed cur_nr_sectors field in QCowAIOCB
  qcow2: remove l2meta from QCowAIOCB
  qcow2: remove cluster_offset from QCowAIOCB
  qcow2: remove common from QCowAIOCB
  qcow2: reindent and use while before the big jump
  qcow2: removed QCowAIOCB entirely
  qcow2: remove memory leak
  qcow2: small math optimization
  qcow2: small optimization

 block/qcow.c   |  378 ++--
 block/qcow2-refcount.c |   16 +--
 block/qcow2.c  |  412 +++
 3 files changed, 294 insertions(+), 512 deletions(-)




[Qemu-devel] [PATCH v3 01/15] qcow: allocate QCowAIOCB structure using stack

2011-08-23 Thread Frediano Ziglio
Instead of calling qemu_aio_get, allocate the structure on the stack.

Signed-off-by: Frediano Ziglio 
---
 block/qcow.c  |   52 
 block/qcow2.c |   38 +++---
 2 files changed, 27 insertions(+), 63 deletions(-)

diff --git a/block/qcow.c b/block/qcow.c
index e155d3c..b4506b4 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -503,28 +503,12 @@ typedef struct QCowAIOCB {
 BlockDriverAIOCB *hd_aiocb;
 } QCowAIOCB;
 
-static void qcow_aio_cancel(BlockDriverAIOCB *blockacb)
-{
-QCowAIOCB *acb = container_of(blockacb, QCowAIOCB, common);
-if (acb->hd_aiocb)
-bdrv_aio_cancel(acb->hd_aiocb);
-qemu_aio_release(acb);
-}
-
-static AIOPool qcow_aio_pool = {
-.aiocb_size = sizeof(QCowAIOCB),
-.cancel = qcow_aio_cancel,
-};
-
 static QCowAIOCB *qcow_aio_setup(BlockDriverState *bs,
 int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
-int is_write)
+int is_write, QCowAIOCB *acb)
 {
-QCowAIOCB *acb;
-
-acb = qemu_aio_get(&qcow_aio_pool, bs, NULL, NULL);
-if (!acb)
-return NULL;
+memset(acb, 0, sizeof(*acb));
+acb->common.bs = bs;
 acb->hd_aiocb = NULL;
 acb->sector_num = sector_num;
 acb->qiov = qiov;
@@ -543,9 +527,8 @@ static QCowAIOCB *qcow_aio_setup(BlockDriverState *bs,
 return acb;
 }
 
-static int qcow_aio_read_cb(void *opaque)
+static int qcow_aio_read_cb(QCowAIOCB *acb)
 {
-QCowAIOCB *acb = opaque;
 BlockDriverState *bs = acb->common.bs;
 BDRVQcowState *s = bs->opaque;
 int index_in_cluster;
@@ -634,29 +617,27 @@ static int qcow_co_readv(BlockDriverState *bs, int64_t 
sector_num,
  int nb_sectors, QEMUIOVector *qiov)
 {
 BDRVQcowState *s = bs->opaque;
-QCowAIOCB *acb;
+QCowAIOCB acb;
 int ret;
 
-acb = qcow_aio_setup(bs, sector_num, qiov, nb_sectors, 0);
+qcow_aio_setup(bs, sector_num, qiov, nb_sectors, 0, &acb);
 
 qemu_co_mutex_lock(&s->lock);
 do {
-ret = qcow_aio_read_cb(acb);
+ret = qcow_aio_read_cb(&acb);
 } while (ret > 0);
 qemu_co_mutex_unlock(&s->lock);
 
-if (acb->qiov->niov > 1) {
-qemu_iovec_from_buffer(acb->qiov, acb->orig_buf, acb->qiov->size);
-qemu_vfree(acb->orig_buf);
+if (acb.qiov->niov > 1) {
+qemu_iovec_from_buffer(acb.qiov, acb.orig_buf, acb.qiov->size);
+qemu_vfree(acb.orig_buf);
 }
-qemu_aio_release(acb);
 
 return ret;
 }
 
-static int qcow_aio_write_cb(void *opaque)
+static int qcow_aio_write_cb(QCowAIOCB *acb)
 {
-QCowAIOCB *acb = opaque;
 BlockDriverState *bs = acb->common.bs;
 BDRVQcowState *s = bs->opaque;
 int index_in_cluster;
@@ -714,23 +695,22 @@ static int qcow_co_writev(BlockDriverState *bs, int64_t 
sector_num,
   int nb_sectors, QEMUIOVector *qiov)
 {
 BDRVQcowState *s = bs->opaque;
-QCowAIOCB *acb;
+QCowAIOCB acb;
 int ret;
 
 s->cluster_cache_offset = -1; /* disable compressed cache */
 
-acb = qcow_aio_setup(bs, sector_num, qiov, nb_sectors, 1);
+qcow_aio_setup(bs, sector_num, qiov, nb_sectors, 1, &acb);
 
 qemu_co_mutex_lock(&s->lock);
 do {
-ret = qcow_aio_write_cb(acb);
+ret = qcow_aio_write_cb(&acb);
 } while (ret > 0);
 qemu_co_mutex_unlock(&s->lock);
 
-if (acb->qiov->niov > 1) {
-qemu_vfree(acb->orig_buf);
+if (acb.qiov->niov > 1) {
+qemu_vfree(acb.orig_buf);
 }
-qemu_aio_release(acb);
 
 return ret;
 }
diff --git a/block/qcow2.c b/block/qcow2.c
index bfff6cd..bb6c75e 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -388,17 +388,6 @@ typedef struct QCowAIOCB {
 QLIST_ENTRY(QCowAIOCB) next_depend;
 } QCowAIOCB;
 
-static void qcow2_aio_cancel(BlockDriverAIOCB *blockacb)
-{
-QCowAIOCB *acb = container_of(blockacb, QCowAIOCB, common);
-qemu_aio_release(acb);
-}
-
-static AIOPool qcow2_aio_pool = {
-.aiocb_size = sizeof(QCowAIOCB),
-.cancel = qcow2_aio_cancel,
-};
-
 /*
  * Returns 0 when the request is completed successfully, 1 when there is still
  * a part left to do and -errno in error cases.
@@ -528,13 +517,10 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 static QCowAIOCB *qcow2_aio_setup(BlockDriverState *bs, int64_t sector_num,
   QEMUIOVector *qiov, int nb_sectors,
   BlockDriverCompletionFunc *cb,
-  void *opaque, int is_write)
+  void *opaque, int is_write, QCowAIOCB *acb)
 {
-QCowAIOCB *acb;
-
-acb = qemu_aio_get(&qcow2_aio_pool, bs, cb, opaque);
-if (!acb)
-return NULL;
+memset(acb, 0, sizeof(*acb));
+acb->common.bs = bs;
 acb->sector_num = sector_num;
 acb->qiov = qiov;
 acb->is_write = is_write;
@@ -554,19 +540,18 @@ static int qcow2_co_readv(BlockDriverState *bs, int64_t 
sector_num,
 

[Qemu-devel] [PATCH v3 02/15] qcow: QCowAIOCB field cleanup

2011-08-23 Thread Frediano Ziglio
Remove unused fields from this structure and turn some of them into local
variables in qcow_aio_read_cb and qcow_aio_write_cb.

Signed-off-by: Frediano Ziglio 
---
 block/qcow.c |  137 +++--
 1 files changed, 65 insertions(+), 72 deletions(-)

diff --git a/block/qcow.c b/block/qcow.c
index b4506b4..9754ca9 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -487,72 +487,61 @@ static int qcow_read(BlockDriverState *bs, int64_t 
sector_num,
 #endif
 
 typedef struct QCowAIOCB {
-BlockDriverAIOCB common;
+BlockDriverState *bs;
 int64_t sector_num;
 QEMUIOVector *qiov;
 uint8_t *buf;
 void *orig_buf;
 int nb_sectors;
-int n;
-uint64_t cluster_offset;
-uint8_t *cluster_data;
-struct iovec hd_iov;
-bool is_write;
-QEMUBH *bh;
-QEMUIOVector hd_qiov;
-BlockDriverAIOCB *hd_aiocb;
 } QCowAIOCB;
 
 static QCowAIOCB *qcow_aio_setup(BlockDriverState *bs,
 int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
 int is_write, QCowAIOCB *acb)
 {
-memset(acb, 0, sizeof(*acb));
-acb->common.bs = bs;
-acb->hd_aiocb = NULL;
+acb->bs = bs;
 acb->sector_num = sector_num;
 acb->qiov = qiov;
-acb->is_write = is_write;
 
 if (qiov->niov > 1) {
 acb->buf = acb->orig_buf = qemu_blockalign(bs, qiov->size);
 if (is_write)
 qemu_iovec_to_buffer(qiov, acb->buf);
 } else {
+acb->orig_buf = NULL;
 acb->buf = (uint8_t *)qiov->iov->iov_base;
 }
 acb->nb_sectors = nb_sectors;
-acb->n = 0;
-acb->cluster_offset = 0;
 return acb;
 }
 
 static int qcow_aio_read_cb(QCowAIOCB *acb)
 {
-BlockDriverState *bs = acb->common.bs;
+BlockDriverState *bs = acb->bs;
 BDRVQcowState *s = bs->opaque;
 int index_in_cluster;
-int ret;
-
-acb->hd_aiocb = NULL;
+int ret, n = 0;
+uint64_t cluster_offset = 0;
+struct iovec hd_iov;
+QEMUIOVector hd_qiov;
 
  redo:
 /* post process the read buffer */
-if (!acb->cluster_offset) {
+if (!cluster_offset) {
 /* nothing to do */
-} else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) {
+} else if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
 /* nothing to do */
 } else {
 if (s->crypt_method) {
 encrypt_sectors(s, acb->sector_num, acb->buf, acb->buf,
-acb->n, 0,
+n, 0,
 &s->aes_decrypt_key);
 }
 }
 
-acb->nb_sectors -= acb->n;
-acb->sector_num += acb->n;
-acb->buf += acb->n * 512;
+acb->nb_sectors -= n;
+acb->sector_num += n;
+acb->buf += n * 512;
 
 if (acb->nb_sectors == 0) {
 /* request completed */
@@ -560,57 +549,58 @@ static int qcow_aio_read_cb(QCowAIOCB *acb)
 }
 
 /* prepare next AIO request */
-acb->cluster_offset = get_cluster_offset(bs, acb->sector_num << 9,
+cluster_offset = get_cluster_offset(bs, acb->sector_num << 9,
  0, 0, 0, 0);
 index_in_cluster = acb->sector_num & (s->cluster_sectors - 1);
-acb->n = s->cluster_sectors - index_in_cluster;
-if (acb->n > acb->nb_sectors)
-acb->n = acb->nb_sectors;
+n = s->cluster_sectors - index_in_cluster;
+if (n > acb->nb_sectors) {
+n = acb->nb_sectors;
+}
 
-if (!acb->cluster_offset) {
+if (!cluster_offset) {
 if (bs->backing_hd) {
 /* read from the base image */
-acb->hd_iov.iov_base = (void *)acb->buf;
-acb->hd_iov.iov_len = acb->n * 512;
-qemu_iovec_init_external(&acb->hd_qiov, &acb->hd_iov, 1);
+hd_iov.iov_base = (void *)acb->buf;
+hd_iov.iov_len = n * 512;
+qemu_iovec_init_external(&hd_qiov, &hd_iov, 1);
 qemu_co_mutex_unlock(&s->lock);
 ret = bdrv_co_readv(bs->backing_hd, acb->sector_num,
-acb->n, &acb->hd_qiov);
+n, &hd_qiov);
 qemu_co_mutex_lock(&s->lock);
 if (ret < 0) {
 return -EIO;
 }
 } else {
 /* Note: in this case, no need to wait */
-memset(acb->buf, 0, 512 * acb->n);
+memset(acb->buf, 0, 512 * n);
 goto redo;
 }
-} else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) {
+} else if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
 /* add AIO support for compressed blocks ? */
-if (decompress_cluster(bs, acb->cluster_offset) < 0) {
+if (decompress_cluster(bs, cluster_offset) < 0) {
 return -EIO;
 }
 memcpy(acb->buf,
-   s->cluster_cache + index_in_cluster * 512, 512 * acb->n);
+   s->cluster_cache + index_in_cluster * 512, 512 * n);
 goto redo;
 } else {
-if ((acb->cluster_offset & 511) != 0) {
+if ((cluster_offset & 511) != 0) {

[Qemu-devel] [PATCH v3 03/15] qcow: move some blocks of code to avoid useless variable initialization

2011-08-23 Thread Frediano Ziglio

Signed-off-by: Frediano Ziglio 
---
 block/qcow.c |   53 ++---
 1 files changed, 26 insertions(+), 27 deletions(-)

diff --git a/block/qcow.c b/block/qcow.c
index 9754ca9..4ede7f3 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -520,35 +520,18 @@ static int qcow_aio_read_cb(QCowAIOCB *acb)
 BlockDriverState *bs = acb->bs;
 BDRVQcowState *s = bs->opaque;
 int index_in_cluster;
-int ret, n = 0;
-uint64_t cluster_offset = 0;
+int ret, n;
+uint64_t cluster_offset;
 struct iovec hd_iov;
 QEMUIOVector hd_qiov;
 
  redo:
-/* post process the read buffer */
-if (!cluster_offset) {
-/* nothing to do */
-} else if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
-/* nothing to do */
-} else {
-if (s->crypt_method) {
-encrypt_sectors(s, acb->sector_num, acb->buf, acb->buf,
-n, 0,
-&s->aes_decrypt_key);
-}
-}
-
-acb->nb_sectors -= n;
-acb->sector_num += n;
-acb->buf += n * 512;
-
 if (acb->nb_sectors == 0) {
 /* request completed */
 return 0;
 }
 
-/* prepare next AIO request */
+/* prepare next request */
 cluster_offset = get_cluster_offset(bs, acb->sector_num << 9,
  0, 0, 0, 0);
 index_in_cluster = acb->sector_num & (s->cluster_sectors - 1);
@@ -573,7 +556,6 @@ static int qcow_aio_read_cb(QCowAIOCB *acb)
 } else {
 /* Note: in this case, no need to wait */
 memset(acb->buf, 0, 512 * n);
-goto redo;
 }
 } else if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
 /* add AIO support for compressed blocks ? */
@@ -582,7 +564,6 @@ static int qcow_aio_read_cb(QCowAIOCB *acb)
 }
 memcpy(acb->buf,
s->cluster_cache + index_in_cluster * 512, 512 * n);
-goto redo;
 } else {
 if ((cluster_offset & 511) != 0) {
 return -EIO;
@@ -600,6 +581,23 @@ static int qcow_aio_read_cb(QCowAIOCB *acb)
 }
 }
 
+/* post process the read buffer */
+if (!cluster_offset) {
+/* nothing to do */
+} else if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
+/* nothing to do */
+} else {
+if (s->crypt_method) {
+encrypt_sectors(s, acb->sector_num, acb->buf, acb->buf,
+n, 0,
+&s->aes_decrypt_key);
+}
+}
+
+acb->nb_sectors -= n;
+acb->sector_num += n;
+acb->buf += n * 512;
+
 goto redo;
 }
 
@@ -631,16 +629,12 @@ static int qcow_aio_write_cb(QCowAIOCB *acb)
 int index_in_cluster;
 uint64_t cluster_offset;
 const uint8_t *src_buf;
-int ret, n = 0;
+int ret, n;
 uint8_t *cluster_data = NULL;
 struct iovec hd_iov;
 QEMUIOVector hd_qiov;
 
 redo:
-acb->nb_sectors -= n;
-acb->sector_num += n;
-acb->buf += n * 512;
-
 if (acb->nb_sectors == 0) {
 /* request completed */
 return 0;
@@ -683,6 +677,11 @@ redo:
 if (ret < 0) {
 return ret;
 }
+
+acb->nb_sectors -= n;
+acb->sector_num += n;
+acb->buf += n * 512;
+
 goto redo;
 }
 
-- 
1.7.1




[Qemu-devel] [PATCH v3 05/15] qcow: remove old #undefined code

2011-08-23 Thread Frediano Ziglio

Signed-off-by: Frediano Ziglio 
---
 block/qcow.c |   63 --
 1 files changed, 0 insertions(+), 63 deletions(-)

diff --git a/block/qcow.c b/block/qcow.c
index f28c821..8cbabdd 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -190,24 +190,6 @@ static int qcow_set_key(BlockDriverState *bs, const char 
*key)
 return -1;
 if (AES_set_decrypt_key(keybuf, 128, &s->aes_decrypt_key) != 0)
 return -1;
-#if 0
-/* test */
-{
-uint8_t in[16];
-uint8_t out[16];
-uint8_t tmp[16];
-for(i=0;i<16;i++)
-in[i] = i;
-AES_encrypt(in, tmp, &s->aes_encrypt_key);
-AES_decrypt(tmp, out, &s->aes_decrypt_key);
-for(i = 0; i < 16; i++)
-printf(" %02x", tmp[i]);
-printf("\n");
-for(i = 0; i < 16; i++)
-printf(" %02x", out[i]);
-printf("\n");
-}
-#endif
 return 0;
 }
 
@@ -441,51 +423,6 @@ static int decompress_cluster(BlockDriverState *bs, 
uint64_t cluster_offset)
 return 0;
 }
 
-#if 0
-
-static int qcow_read(BlockDriverState *bs, int64_t sector_num,
- uint8_t *buf, int nb_sectors)
-{
-BDRVQcowState *s = bs->opaque;
-int ret, index_in_cluster, n;
-uint64_t cluster_offset;
-
-while (nb_sectors > 0) {
-cluster_offset = get_cluster_offset(bs, sector_num << 9, 0, 0, 0, 0);
-index_in_cluster = sector_num & (s->cluster_sectors - 1);
-n = s->cluster_sectors - index_in_cluster;
-if (n > nb_sectors)
-n = nb_sectors;
-if (!cluster_offset) {
-if (bs->backing_hd) {
-/* read from the base image */
-ret = bdrv_read(bs->backing_hd, sector_num, buf, n);
-if (ret < 0)
-return -1;
-} else {
-memset(buf, 0, 512 * n);
-}
-} else if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
-if (decompress_cluster(bs, cluster_offset) < 0)
-return -1;
-memcpy(buf, s->cluster_cache + index_in_cluster * 512, 512 * n);
-} else {
-ret = bdrv_pread(bs->file, cluster_offset + index_in_cluster * 
512, buf, n * 512);
-if (ret != n * 512)
-return -1;
-if (s->crypt_method) {
-encrypt_sectors(s, sector_num, buf, buf, n, 0,
-&s->aes_decrypt_key);
-}
-}
-nb_sectors -= n;
-sector_num += n;
-buf += n * 512;
-}
-return 0;
-}
-#endif
-
 static int qcow_co_readv(BlockDriverState *bs, int64_t sector_num,
  int nb_sectors, QEMUIOVector *qiov)
 {
-- 
1.7.1




[Qemu-devel] [PATCH v3 12/15] qcow2: removed QCowAIOCB entirely

2011-08-23 Thread Frediano Ziglio

Signed-off-by: Frediano Ziglio 
---
 block/qcow2.c |  207 ++---
 1 files changed, 80 insertions(+), 127 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 287dd99..d34bd72 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -372,83 +372,77 @@ int qcow2_backing_read1(BlockDriverState *bs, 
QEMUIOVector *qiov,
 return n1;
 }
 
-typedef struct QCowAIOCB {
-BlockDriverState *bs;
-int64_t sector_num;
-QEMUIOVector *qiov;
-int remaining_sectors;
-uint64_t bytes_done;
-uint8_t *cluster_data;
-QEMUIOVector hd_qiov;
-} QCowAIOCB;
-
-/*
- * Returns 0 when the request is completed successfully, 1 when there is still
- * a part left to do and -errno in error cases.
- */
-static int qcow2_aio_read_cb(QCowAIOCB *acb)
+static int qcow2_co_readv(BlockDriverState *bs, int64_t sector_num,
+  int remaining_sectors, QEMUIOVector *qiov)
 {
-BlockDriverState *bs = acb->bs;
 BDRVQcowState *s = bs->opaque;
 int index_in_cluster, n1;
 int ret;
 int cur_nr_sectors; /* number of sectors in current iteration */
 uint64_t cluster_offset = 0;
+uint64_t bytes_done = 0;
+QEMUIOVector hd_qiov;
+uint8_t *cluster_data = NULL;
 
-while (acb->remaining_sectors != 0) {
+qemu_iovec_init(&hd_qiov, qiov->niov);
+
+qemu_co_mutex_lock(&s->lock);
+
+while (remaining_sectors != 0) {
 
 /* prepare next request */
-cur_nr_sectors = acb->remaining_sectors;
+cur_nr_sectors = remaining_sectors;
 if (s->crypt_method) {
 cur_nr_sectors = MIN(cur_nr_sectors,
 QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors);
 }
 
-ret = qcow2_get_cluster_offset(bs, acb->sector_num << 9,
+ret = qcow2_get_cluster_offset(bs, sector_num << 9,
 &cur_nr_sectors, &cluster_offset);
 if (ret < 0) {
-return ret;
+goto fail;
 }
 
-index_in_cluster = acb->sector_num & (s->cluster_sectors - 1);
+index_in_cluster = sector_num & (s->cluster_sectors - 1);
 
-qemu_iovec_reset(&acb->hd_qiov);
-qemu_iovec_copy(&acb->hd_qiov, acb->qiov, acb->bytes_done,
+qemu_iovec_reset(&hd_qiov);
+qemu_iovec_copy(&hd_qiov, qiov, bytes_done,
 cur_nr_sectors * 512);
 
 if (!cluster_offset) {
 
 if (bs->backing_hd) {
 /* read from the base image */
-n1 = qcow2_backing_read1(bs->backing_hd, &acb->hd_qiov,
-acb->sector_num, cur_nr_sectors);
+n1 = qcow2_backing_read1(bs->backing_hd, &hd_qiov,
+sector_num, cur_nr_sectors);
 if (n1 > 0) {
 BLKDBG_EVENT(bs->file, BLKDBG_READ_BACKING_AIO);
 qemu_co_mutex_unlock(&s->lock);
-ret = bdrv_co_readv(bs->backing_hd, acb->sector_num,
-n1, &acb->hd_qiov);
+ret = bdrv_co_readv(bs->backing_hd, sector_num,
+n1, &hd_qiov);
 qemu_co_mutex_lock(&s->lock);
 if (ret < 0) {
-return ret;
+goto fail;
 }
 }
 } else {
 /* Note: in this case, no need to wait */
-qemu_iovec_memset(&acb->hd_qiov, 0, 512 * cur_nr_sectors);
+qemu_iovec_memset(&hd_qiov, 0, 512 * cur_nr_sectors);
 }
 } else if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
 /* add AIO support for compressed blocks ? */
 ret = qcow2_decompress_cluster(bs, cluster_offset);
 if (ret < 0) {
-return ret;
+goto fail;
 }
 
-qemu_iovec_from_buffer(&acb->hd_qiov,
+qemu_iovec_from_buffer(&hd_qiov,
 s->cluster_cache + index_in_cluster * 512,
 512 * cur_nr_sectors);
 } else {
 if ((cluster_offset & 511) != 0) {
-return -EIO;
+ret = -EIO;
+goto fail;
 }
 
 if (s->crypt_method) {
@@ -456,15 +450,15 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
  * For encrypted images, read everything into a temporary
  * contiguous buffer on which the AES functions can work.
  */
-if (!acb->cluster_data) {
-acb->cluster_data =
+if (!cluster_data) {
+cluster_data =
 g_malloc0(QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
 }
 
 assert(cur_nr_sectors <=
 QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors);
-qemu_iovec_reset(&acb->hd_qiov);
-qemu_iovec_add(&acb->hd_qiov, acb->cluster_data,
+ 

[Qemu-devel] [PATCH v3 13/15] qcow2: remove memory leak

2011-08-23 Thread Frediano Ziglio

Signed-off-by: Frediano Ziglio 
---
 block/qcow2.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index d34bd72..07529b1 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -492,6 +492,7 @@ fail:
 qemu_co_mutex_unlock(&s->lock);
 
 qemu_iovec_destroy(&hd_qiov);
+g_free(cluster_data);
 
 return ret;
 }
@@ -604,6 +605,7 @@ fail:
 qemu_co_mutex_unlock(&s->lock);
 
 qemu_iovec_destroy(&hd_qiov);
+g_free(cluster_data);
 
 return ret;
 }
-- 
1.7.1




[Qemu-devel] [PATCH v3 14/15] qcow2: small math optimization

2011-08-23 Thread Frediano Ziglio

Signed-off-by: Frediano Ziglio 
---
 block/qcow2-refcount.c |5 +
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 2a915be..0f9a64a 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -140,10 +140,7 @@ static unsigned int next_refcount_table_size(BDRVQcowState 
*s,
 static int in_same_refcount_block(BDRVQcowState *s, uint64_t offset_a,
 uint64_t offset_b)
 {
-uint64_t block_a = offset_a >> (2 * s->cluster_bits - REFCOUNT_SHIFT);
-uint64_t block_b = offset_b >> (2 * s->cluster_bits - REFCOUNT_SHIFT);
-
-return (block_a == block_b);
+return ((offset_a ^ offset_b) >> (2 * s->cluster_bits - REFCOUNT_SHIFT)) == 0;
 }
 
 /*
-- 
1.7.1
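
The equivalence behind this change: two offsets fall in the same refcount
block exactly when all bits above the (2 * cluster_bits - REFCOUNT_SHIFT)
boundary agree, and XOR clears precisely the bits that agree. A minimal
standalone check (the shift value is picked arbitrarily for illustration):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        const int shift = 15;   /* stands in for 2*cluster_bits - REFCOUNT_SHIFT */
        uint64_t a = 0x12345678, b = 0x12340000;

        int old_way = (a >> shift) == (b >> shift);
        int new_way = ((a ^ b) >> shift) == 0;
        assert(old_way == new_way);
        return 0;
    }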




[Qemu-devel] [PATCH v3 15/15] qcow2: small optimization

2011-08-23 Thread Frediano Ziglio

Signed-off-by: Frediano Ziglio 
---
 block/qcow2-refcount.c |   11 +--
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 0f9a64a..243c24b 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -681,14 +681,13 @@ void qcow2_create_refcount_update(QCowCreateState *s, 
int64_t offset,
 int64_t size)
 {
 int refcount;
-int64_t start, last, cluster_offset;
+int64_t start, last, cluster;
 uint16_t *p;
 
-start = offset & ~(s->cluster_size - 1);
-last = (offset + size - 1)  & ~(s->cluster_size - 1);
-for(cluster_offset = start; cluster_offset <= last;
-cluster_offset += s->cluster_size) {
-p = &s->refcount_block[cluster_offset >> s->cluster_bits];
+start = offset >> s->cluster_bits;
+last = (offset + size - 1)  >> s->cluster_bits;
+for(cluster = start; cluster <= last; ++cluster) {
+p = &s->refcount_block[cluster];
 refcount = be16_to_cpu(*p);
 refcount++;
 *p = cpu_to_be16(refcount);
-- 
1.7.1




Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-23 Thread Roedel, Joerg
On Mon, Aug 22, 2011 at 03:17:00PM -0400, Alex Williamson wrote:
> On Mon, 2011-08-22 at 19:25 +0200, Joerg Roedel wrote:

> > I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> > assigned to a guest, there can also be an ioctl to bind a group to an
> > address-space of another group (certainly needs some care to not allow
> > that both groups belong to different processes).
> 
> That's an interesting idea.  Maybe an interface similar to the current
> uiommu interface, where you open() the 2nd group fd and pass the fd via
> ioctl to the primary group.  IOMMUs that don't support this would fail
> the attach device callback, which would fail the ioctl to bind them.  It
> will need to be designed so any group can be removed from the super-set
> and the remaining group(s) still works.  This feels like something that
> can be added after we get an initial implementation.

Handling it through fds is a good idea. This makes sure that everything
belongs to one process. I am not really sure yet if we go the way to
just bind plain groups together or if we create meta-groups. The
meta-groups thing seems somewhat cleaner, though.
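
To make the fd-binding variant concrete, the userspace flow under discussion
might look something like this (the ioctl name and number below are entirely
made up; no such interface exists yet):

    #include <fcntl.h>
    #include <sys/ioctl.h>

    /* hypothetical request: "merge the address space of the group passed
     * as argument into the group the ioctl is issued on" */
    #define VFIO_GROUP_MERGE  _IOW(';', 101, int)

    static void bind_groups_example(void)
    {
        int g1 = open("/dev/vfio/42", O_RDWR);
        int g2 = open("/dev/vfio/43", O_RDWR);

        /* the kernel may refuse if the IOMMU cannot share the domain */
        if (ioctl(g1, VFIO_GROUP_MERGE, g2) < 0) {
            /* fall back to keeping the groups in separate address spaces */
        }
    }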

> > Btw, a problem we havn't talked about yet entirely is
> > driver-deassignment. User space can decide to de-assign the device from
> > vfio while a fd is open on it. With PCI there is no way to let this fail
> > (the .release function returns void last time i checked). Is this a
> > problem, and yes, how we handle that?
> 
> The current vfio has the same problem, we can't unbind a device from
> vfio while it's attached to a guest.  I think we'd use the same solution
> too; send out a netlink packet for a device removal and have the .remove
> call sleep on a wait_event(, refcnt == 0).  We could also set a timeout
> and SIGBUS the PIDs holding the device if they don't return it
> willingly.  Thanks,

Putting the process to sleep (which would be uninterruptible) seems bad.
The process would sleep until the guest releases the device-group, which
can take days or months.
The best thing (and the most intrusive :-) ) is to change PCI core to
allow unbindings to fail, I think. But this probably further complicates
the way to upstream VFIO...

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632




[Qemu-devel] [PATCH v3 04/15] qcow: embed qcow_aio_read_cb into qcow_co_readv and qcow_aio_write_cb into qcow_co_writev

2011-08-23 Thread Frediano Ziglio

Signed-off-by: Frediano Ziglio 
---
 block/qcow.c |  291 -
 1 files changed, 123 insertions(+), 168 deletions(-)

diff --git a/block/qcow.c b/block/qcow.c
index 4ede7f3..f28c821 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -486,223 +486,178 @@ static int qcow_read(BlockDriverState *bs, int64_t 
sector_num,
 }
 #endif
 
-typedef struct QCowAIOCB {
-BlockDriverState *bs;
-int64_t sector_num;
-QEMUIOVector *qiov;
-uint8_t *buf;
-void *orig_buf;
-int nb_sectors;
-} QCowAIOCB;
-
-static QCowAIOCB *qcow_aio_setup(BlockDriverState *bs,
-int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
-int is_write, QCowAIOCB *acb)
-{
-acb->bs = bs;
-acb->sector_num = sector_num;
-acb->qiov = qiov;
-
-if (qiov->niov > 1) {
-acb->buf = acb->orig_buf = qemu_blockalign(bs, qiov->size);
-if (is_write)
-qemu_iovec_to_buffer(qiov, acb->buf);
-} else {
-acb->orig_buf = NULL;
-acb->buf = (uint8_t *)qiov->iov->iov_base;
-}
-acb->nb_sectors = nb_sectors;
-return acb;
-}
-
-static int qcow_aio_read_cb(QCowAIOCB *acb)
+static int qcow_co_readv(BlockDriverState *bs, int64_t sector_num,
+ int nb_sectors, QEMUIOVector *qiov)
 {
-BlockDriverState *bs = acb->bs;
 BDRVQcowState *s = bs->opaque;
 int index_in_cluster;
-int ret, n;
+int ret = 0, n;
 uint64_t cluster_offset;
 struct iovec hd_iov;
 QEMUIOVector hd_qiov;
+uint8_t *buf;
+void *orig_buf;
 
- redo:
-if (acb->nb_sectors == 0) {
-/* request completed */
-return 0;
+if (qiov->niov > 1) {
+buf = orig_buf = qemu_blockalign(bs, qiov->size);
+} else {
+orig_buf = NULL;
+buf = (uint8_t *)qiov->iov->iov_base;
 }
 
-/* prepare next request */
-cluster_offset = get_cluster_offset(bs, acb->sector_num << 9,
- 0, 0, 0, 0);
-index_in_cluster = acb->sector_num & (s->cluster_sectors - 1);
-n = s->cluster_sectors - index_in_cluster;
-if (n > acb->nb_sectors) {
-n = acb->nb_sectors;
-}
+qemu_co_mutex_lock(&s->lock);
+
+while (nb_sectors != 0) {
+/* prepare next request */
+cluster_offset = get_cluster_offset(bs, sector_num << 9,
+ 0, 0, 0, 0);
+index_in_cluster = sector_num & (s->cluster_sectors - 1);
+n = s->cluster_sectors - index_in_cluster;
+if (n > nb_sectors) {
+n = nb_sectors;
+}
 
-if (!cluster_offset) {
-if (bs->backing_hd) {
-/* read from the base image */
-hd_iov.iov_base = (void *)acb->buf;
+if (!cluster_offset) {
+if (bs->backing_hd) {
+/* read from the base image */
+hd_iov.iov_base = (void *)buf;
+hd_iov.iov_len = n * 512;
+qemu_iovec_init_external(&hd_qiov, &hd_iov, 1);
+qemu_co_mutex_unlock(&s->lock);
+ret = bdrv_co_readv(bs->backing_hd, sector_num,
+n, &hd_qiov);
+qemu_co_mutex_lock(&s->lock);
+if (ret < 0) {
+goto fail;
+}
+} else {
+/* Note: in this case, no need to wait */
+memset(buf, 0, 512 * n);
+}
+} else if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
+/* add AIO support for compressed blocks ? */
+if (decompress_cluster(bs, cluster_offset) < 0) {
+goto fail;
+}
+memcpy(buf,
+   s->cluster_cache + index_in_cluster * 512, 512 * n);
+} else {
+if ((cluster_offset & 511) != 0) {
+goto fail;
+}
+hd_iov.iov_base = (void *)buf;
 hd_iov.iov_len = n * 512;
 qemu_iovec_init_external(&hd_qiov, &hd_iov, 1);
 qemu_co_mutex_unlock(&s->lock);
-ret = bdrv_co_readv(bs->backing_hd, acb->sector_num,
+ret = bdrv_co_readv(bs->file,
+(cluster_offset >> 9) + index_in_cluster,
 n, &hd_qiov);
 qemu_co_mutex_lock(&s->lock);
 if (ret < 0) {
-return -EIO;
+break;
+}
+if (s->crypt_method) {
+encrypt_sectors(s, sector_num, buf, buf,
+n, 0,
+&s->aes_decrypt_key);
 }
-} else {
-/* Note: in this case, no need to wait */
-memset(acb->buf, 0, 512 * n);
-}
-} else if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
-/* add AIO support for compressed blocks ? */
-if (decompress_cluster(bs, cluster_offset) < 0) {
-return -EIO;
-

[Qemu-devel] [PATCH v3 09/15] qcow2: remove cluster_offset from QCowAIOCB

2011-08-23 Thread Frediano Ziglio

Signed-off-by: Frediano Ziglio 
---
 block/qcow2.c |   22 +++---
 1 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 4b9ec29..7e13283 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -378,7 +378,6 @@ typedef struct QCowAIOCB {
 QEMUIOVector *qiov;
 int remaining_sectors;
 uint64_t bytes_done;
-uint64_t cluster_offset;
 uint8_t *cluster_data;
 QEMUIOVector hd_qiov;
 } QCowAIOCB;
@@ -394,6 +393,7 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 int index_in_cluster, n1;
 int ret;
 int cur_nr_sectors; /* number of sectors in current iteration */
+uint64_t cluster_offset = 0;
 
 if (acb->remaining_sectors == 0) {
 /* request completed */
@@ -408,7 +408,7 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 }
 
 ret = qcow2_get_cluster_offset(bs, acb->sector_num << 9,
-&cur_nr_sectors, &acb->cluster_offset);
+&cur_nr_sectors, &cluster_offset);
 if (ret < 0) {
 return ret;
 }
@@ -419,7 +419,7 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 qemu_iovec_copy(&acb->hd_qiov, acb->qiov, acb->bytes_done,
 cur_nr_sectors * 512);
 
-if (!acb->cluster_offset) {
+if (!cluster_offset) {
 
 if (bs->backing_hd) {
 /* read from the base image */
@@ -439,9 +439,9 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 /* Note: in this case, no need to wait */
 qemu_iovec_memset(&acb->hd_qiov, 0, 512 * cur_nr_sectors);
 }
-} else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) {
+} else if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
 /* add AIO support for compressed blocks ? */
-ret = qcow2_decompress_cluster(bs, acb->cluster_offset);
+ret = qcow2_decompress_cluster(bs, cluster_offset);
 if (ret < 0) {
 return ret;
 }
@@ -450,7 +450,7 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 s->cluster_cache + index_in_cluster * 512,
 512 * cur_nr_sectors);
 } else {
-if ((acb->cluster_offset & 511) != 0) {
+if ((cluster_offset & 511) != 0) {
 return -EIO;
 }
 
@@ -474,7 +474,7 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
 qemu_co_mutex_unlock(&s->lock);
 ret = bdrv_co_readv(bs->file,
-(acb->cluster_offset >> 9) + index_in_cluster,
+(cluster_offset >> 9) + index_in_cluster,
 cur_nr_sectors, &acb->hd_qiov);
 qemu_co_mutex_lock(&s->lock);
 if (ret < 0) {
@@ -512,7 +512,6 @@ static QCowAIOCB *qcow2_aio_setup(BlockDriverState *bs, 
int64_t sector_num,
 
 acb->bytes_done = 0;
 acb->remaining_sectors = nb_sectors;
-acb->cluster_offset = 0;
 return acb;
 }
 
@@ -564,6 +563,7 @@ static int qcow2_aio_write_cb(QCowAIOCB *acb)
 int ret;
 int cur_nr_sectors; /* number of sectors in current iteration */
 QCowL2Meta l2meta;
+uint64_t cluster_offset;
 
 l2meta.nb_clusters = 0;
 qemu_co_queue_init(&l2meta.dependent_requests);
@@ -585,8 +585,8 @@ static int qcow2_aio_write_cb(QCowAIOCB *acb)
 return ret;
 }
 
-acb->cluster_offset = l2meta.cluster_offset;
-assert((acb->cluster_offset & 511) == 0);
+cluster_offset = l2meta.cluster_offset;
+assert((cluster_offset & 511) == 0);
 
 qemu_iovec_reset(&acb->hd_qiov);
 qemu_iovec_copy(&acb->hd_qiov, acb->qiov, acb->bytes_done,
@@ -612,7 +612,7 @@ static int qcow2_aio_write_cb(QCowAIOCB *acb)
 BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
 qemu_co_mutex_unlock(&s->lock);
 ret = bdrv_co_writev(bs->file,
- (acb->cluster_offset >> 9) + index_in_cluster,
+ (cluster_offset >> 9) + index_in_cluster,
  cur_nr_sectors, &acb->hd_qiov);
 qemu_co_mutex_lock(&s->lock);
 if (ret < 0) {
-- 
1.7.1




[Qemu-devel] [PATCH v3 10/15] qcow2: remove common from QCowAIOCB

2011-08-23 Thread Frediano Ziglio

Signed-off-by: Frediano Ziglio 
---
 block/qcow2.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 7e13283..519fc8b 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -373,7 +373,7 @@ int qcow2_backing_read1(BlockDriverState *bs, QEMUIOVector 
*qiov,
 }
 
 typedef struct QCowAIOCB {
-BlockDriverAIOCB common;
+BlockDriverState *bs;
 int64_t sector_num;
 QEMUIOVector *qiov;
 int remaining_sectors;
@@ -388,7 +388,7 @@ typedef struct QCowAIOCB {
  */
 static int qcow2_aio_read_cb(QCowAIOCB *acb)
 {
-BlockDriverState *bs = acb->common.bs;
+BlockDriverState *bs = acb->bs;
 BDRVQcowState *s = bs->opaque;
 int index_in_cluster, n1;
 int ret;
@@ -504,7 +504,7 @@ static QCowAIOCB *qcow2_aio_setup(BlockDriverState *bs, 
int64_t sector_num,
   void *opaque, QCowAIOCB *acb)
 {
 memset(acb, 0, sizeof(*acb));
-acb->common.bs = bs;
+acb->bs = bs;
 acb->sector_num = sector_num;
 acb->qiov = qiov;
 
@@ -556,7 +556,7 @@ static void run_dependent_requests(BDRVQcowState *s, 
QCowL2Meta *m)
  */
 static int qcow2_aio_write_cb(QCowAIOCB *acb)
 {
-BlockDriverState *bs = acb->common.bs;
+BlockDriverState *bs = acb->bs;
 BDRVQcowState *s = bs->opaque;
 int index_in_cluster;
 int n_end;
-- 
1.7.1




Re: [Qemu-devel] [PATCH v2 00/15] qcow/qcow2 cleanups

2011-08-23 Thread Frediano Ziglio
2011/8/22 Kevin Wolf :
> Am 09.08.2011 09:46, schrieb Frediano Ziglio:
>> These patches mostly cleanup some AIO code using coroutines.
>> Mostly they use stack instead of allocated AIO structure.
>> Feel free to collapse it too short.
>>
>> Frediano Ziglio (15):
>>   qcow: allocate QCowAIOCB structure using stack
>>   qcow: QCowAIOCB field cleanup
>>   qcow: move some blocks of code to avoid useless variable
>>     initialization
>>   qcow: embed qcow_aio_read_cb into qcow_co_readv and qcow_aio_write_cb
>>     into qcow_co_writev
>>   qcow: remove old #undefined code
>>   qcow2: removed unused fields
>>   qcow2: removed cur_nr_sectors field in QCowAIOCB
>>   qcow2: remove l2meta from QCowAIOCB
>>   qcow2: remove cluster_offset from QCowAIOCB
>>   qcow2: remove common from QCowAIOCB
>>   qcow2: reindent and use while before the big jump
>>   qcow2: removed QCowAIOCB entirely
>>   qcow2: remove memory leak
>>   qcow2: small math optimization
>>   qcow2: small optimization
>>
>>  block/qcow.c           |  378 ++--
>>  block/qcow2-refcount.c |   16 +--
>>  block/qcow2.c          |  412 
>> +++
>>  3 files changed, 294 insertions(+), 512 deletions(-)
>
> Can you please rebase this series to current master? I expect that most
> conflicts are related to the qemu_malloc -> g_malloc change, so they
> should be easy to fix.
>

Done. Yes, the glib rename was the problem. I rebased all the patches and
tested the result with qemu-iotests.

> One thing I noticed when reading the first two patches is that there is
> some state in ACBs that previously was shared across multiple requests
> (e.g. the bounce buffer for encryption). It is no longer shared after
> you move the ACB to the stack, so each request allocates a new buffer.
> Have you checked if this makes a noticeable difference in performance
> for encrypted images?
>
> Kevin
>

I didn't test encryption performance. Looking at the master code, the only
cached state that gets reused is cluster_data, so only encryption is
affected.

Frediano



Re: [Qemu-devel] [PATCH v2] posix-aio-compat: fix latency issues

2011-08-23 Thread Anthony Liguori

On 08/23/2011 08:02 AM, Jan Kiszka wrote:

On 2011-08-23 14:40, Anthony Liguori wrote:

You should be able to just use an eventfd or pipe.

Better yet, we should look at using GThreadPool to replace posix-aio-compat.


When interacting with the thread pool is part of some time-critical path
(easily possible with a real-time Linux guest), general-purpose
implementations like what glib offers are typically out of the game.
They do not provide sufficient customizability, specifically control
over their internal synchronization and allocation policies. That
applies to the other rather primitive glib threading and locking
services as well.


We can certainly enhance glib.  glib is a cross platform library.  I 
don't see a compelling reason to invent a new cross platform library 
just for QEMU especially if the justification is future features, not 
current features.


Regards,

Anthony Liguori



Jan






[Qemu-devel] [PATCH v3 06/15] qcow2: removed unused fields

2011-08-23 Thread Frediano Ziglio

Signed-off-by: Frediano Ziglio 
---
 block/qcow2.c |   10 +++---
 1 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index bb6c75e..4171b47 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -381,11 +381,8 @@ typedef struct QCowAIOCB {
 uint64_t bytes_done;
 uint64_t cluster_offset;
 uint8_t *cluster_data;
-bool is_write;
 QEMUIOVector hd_qiov;
-QEMUBH *bh;
 QCowL2Meta l2meta;
-QLIST_ENTRY(QCowAIOCB) next_depend;
 } QCowAIOCB;
 
 /*
@@ -517,13 +514,12 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 static QCowAIOCB *qcow2_aio_setup(BlockDriverState *bs, int64_t sector_num,
   QEMUIOVector *qiov, int nb_sectors,
   BlockDriverCompletionFunc *cb,
-  void *opaque, int is_write, QCowAIOCB *acb)
+  void *opaque, QCowAIOCB *acb)
 {
 memset(acb, 0, sizeof(*acb));
 acb->common.bs = bs;
 acb->sector_num = sector_num;
 acb->qiov = qiov;
-acb->is_write = is_write;
 
 qemu_iovec_init(&acb->hd_qiov, qiov->niov);
 
@@ -543,7 +539,7 @@ static int qcow2_co_readv(BlockDriverState *bs, int64_t 
sector_num,
 QCowAIOCB acb;
 int ret;
 
-qcow2_aio_setup(bs, sector_num, qiov, nb_sectors, NULL, NULL, 0, &acb);
+qcow2_aio_setup(bs, sector_num, qiov, nb_sectors, NULL, NULL, &acb);
 
 qemu_co_mutex_lock(&s->lock);
 do {
@@ -658,7 +654,7 @@ static int qcow2_co_writev(BlockDriverState *bs,
 QCowAIOCB acb;
 int ret;
 
-qcow2_aio_setup(bs, sector_num, qiov, nb_sectors, NULL, NULL, 1, &acb);
+qcow2_aio_setup(bs, sector_num, qiov, nb_sectors, NULL, NULL, &acb);
 s->cluster_cache_offset = -1; /* disable compressed cache */
 
 qemu_co_mutex_lock(&s->lock);
-- 
1.7.1




[Qemu-devel] [PATCH v3 07/15] qcow2: removed cur_nr_sectors field in QCowAIOCB

2011-08-23 Thread Frediano Ziglio

Signed-off-by: Frediano Ziglio 
---
 block/qcow2.c |   98 +
 1 files changed, 43 insertions(+), 55 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 4171b47..f920dbe 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -377,7 +377,6 @@ typedef struct QCowAIOCB {
 int64_t sector_num;
 QEMUIOVector *qiov;
 int remaining_sectors;
-int cur_nr_sectors;/* number of sectors in current iteration */
 uint64_t bytes_done;
 uint64_t cluster_offset;
 uint8_t *cluster_data;
@@ -395,42 +394,22 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 BDRVQcowState *s = bs->opaque;
 int index_in_cluster, n1;
 int ret;
-
-/* post process the read buffer */
-if (!acb->cluster_offset) {
-/* nothing to do */
-} else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) {
-/* nothing to do */
-} else {
-if (s->crypt_method) {
-qcow2_encrypt_sectors(s, acb->sector_num,  acb->cluster_data,
-acb->cluster_data, acb->cur_nr_sectors, 0, 
&s->aes_decrypt_key);
-qemu_iovec_reset(&acb->hd_qiov);
-qemu_iovec_copy(&acb->hd_qiov, acb->qiov, acb->bytes_done,
-acb->cur_nr_sectors * 512);
-qemu_iovec_from_buffer(&acb->hd_qiov, acb->cluster_data,
-512 * acb->cur_nr_sectors);
-}
-}
-
-acb->remaining_sectors -= acb->cur_nr_sectors;
-acb->sector_num += acb->cur_nr_sectors;
-acb->bytes_done += acb->cur_nr_sectors * 512;
+int cur_nr_sectors; /* number of sectors in current iteration */
 
 if (acb->remaining_sectors == 0) {
 /* request completed */
 return 0;
 }
 
-/* prepare next AIO request */
-acb->cur_nr_sectors = acb->remaining_sectors;
+/* prepare next request */
+cur_nr_sectors = acb->remaining_sectors;
 if (s->crypt_method) {
-acb->cur_nr_sectors = MIN(acb->cur_nr_sectors,
+cur_nr_sectors = MIN(cur_nr_sectors,
 QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors);
 }
 
 ret = qcow2_get_cluster_offset(bs, acb->sector_num << 9,
-&acb->cur_nr_sectors, &acb->cluster_offset);
+&cur_nr_sectors, &acb->cluster_offset);
 if (ret < 0) {
 return ret;
 }
@@ -439,14 +418,14 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 
 qemu_iovec_reset(&acb->hd_qiov);
 qemu_iovec_copy(&acb->hd_qiov, acb->qiov, acb->bytes_done,
-acb->cur_nr_sectors * 512);
+cur_nr_sectors * 512);
 
 if (!acb->cluster_offset) {
 
 if (bs->backing_hd) {
 /* read from the base image */
 n1 = qcow2_backing_read1(bs->backing_hd, &acb->hd_qiov,
-acb->sector_num, acb->cur_nr_sectors);
+acb->sector_num, cur_nr_sectors);
 if (n1 > 0) {
 BLKDBG_EVENT(bs->file, BLKDBG_READ_BACKING_AIO);
 qemu_co_mutex_unlock(&s->lock);
@@ -457,11 +436,9 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 return ret;
 }
 }
-return 1;
 } else {
 /* Note: in this case, no need to wait */
-qemu_iovec_memset(&acb->hd_qiov, 0, 512 * acb->cur_nr_sectors);
-return 1;
+qemu_iovec_memset(&acb->hd_qiov, 0, 512 * cur_nr_sectors);
 }
 } else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) {
 /* add AIO support for compressed blocks ? */
@@ -472,9 +449,7 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 
 qemu_iovec_from_buffer(&acb->hd_qiov,
 s->cluster_cache + index_in_cluster * 512,
-512 * acb->cur_nr_sectors);
-
-return 1;
+512 * cur_nr_sectors);
 } else {
 if ((acb->cluster_offset & 511) != 0) {
 return -EIO;
@@ -490,24 +465,37 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 g_malloc0(QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
 }
 
-assert(acb->cur_nr_sectors <=
+assert(cur_nr_sectors <=
 QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors);
 qemu_iovec_reset(&acb->hd_qiov);
 qemu_iovec_add(&acb->hd_qiov, acb->cluster_data,
-512 * acb->cur_nr_sectors);
+512 * cur_nr_sectors);
 }
 
 BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
 qemu_co_mutex_unlock(&s->lock);
 ret = bdrv_co_readv(bs->file,
 (acb->cluster_offset >> 9) + index_in_cluster,
-acb->cur_nr_sectors, &acb->hd_qiov);
+cur_nr_sectors, &acb->hd_qiov);
 qemu_co_mutex_lock(&s->lock);
 if (ret < 0) {
 return ret;
 }
+if (s->crypt_method) {
+qcow2_encrypt_sectors(s, acb->sector_num,  acb->cluster_data,
+acb->cluster_data, cur_nr_sectors, 0, &s->aes_decrypt_key);
+  

[Qemu-devel] [PATCH v3 11/15] qcow2: reindent and use while before the big jump

2011-08-23 Thread Frediano Ziglio
prepare to remove read/write callbacks

Signed-off-by: Frediano Ziglio 
---
 block/qcow2.c |  272 -
 1 files changed, 135 insertions(+), 137 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 519fc8b..287dd99 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -395,107 +395,105 @@ static int qcow2_aio_read_cb(QCowAIOCB *acb)
 int cur_nr_sectors; /* number of sectors in current iteration */
 uint64_t cluster_offset = 0;
 
-if (acb->remaining_sectors == 0) {
-/* request completed */
-return 0;
-}
-
-/* prepare next request */
-cur_nr_sectors = acb->remaining_sectors;
-if (s->crypt_method) {
-cur_nr_sectors = MIN(cur_nr_sectors,
-QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors);
-}
+while (acb->remaining_sectors != 0) {
 
-ret = qcow2_get_cluster_offset(bs, acb->sector_num << 9,
-&cur_nr_sectors, &cluster_offset);
-if (ret < 0) {
-return ret;
-}
-
-index_in_cluster = acb->sector_num & (s->cluster_sectors - 1);
-
-qemu_iovec_reset(&acb->hd_qiov);
-qemu_iovec_copy(&acb->hd_qiov, acb->qiov, acb->bytes_done,
-cur_nr_sectors * 512);
-
-if (!cluster_offset) {
-
-if (bs->backing_hd) {
-/* read from the base image */
-n1 = qcow2_backing_read1(bs->backing_hd, &acb->hd_qiov,
-acb->sector_num, cur_nr_sectors);
-if (n1 > 0) {
-BLKDBG_EVENT(bs->file, BLKDBG_READ_BACKING_AIO);
-qemu_co_mutex_unlock(&s->lock);
-ret = bdrv_co_readv(bs->backing_hd, acb->sector_num,
-n1, &acb->hd_qiov);
-qemu_co_mutex_lock(&s->lock);
-if (ret < 0) {
-return ret;
-}
-}
-} else {
-/* Note: in this case, no need to wait */
-qemu_iovec_memset(&acb->hd_qiov, 0, 512 * cur_nr_sectors);
+/* prepare next request */
+cur_nr_sectors = acb->remaining_sectors;
+if (s->crypt_method) {
+cur_nr_sectors = MIN(cur_nr_sectors,
+QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors);
 }
-} else if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
-/* add AIO support for compressed blocks ? */
-ret = qcow2_decompress_cluster(bs, cluster_offset);
+
+ret = qcow2_get_cluster_offset(bs, acb->sector_num << 9,
+&cur_nr_sectors, &cluster_offset);
 if (ret < 0) {
 return ret;
 }
 
-qemu_iovec_from_buffer(&acb->hd_qiov,
-s->cluster_cache + index_in_cluster * 512,
-512 * cur_nr_sectors);
-} else {
-if ((cluster_offset & 511) != 0) {
-return -EIO;
-}
+index_in_cluster = acb->sector_num & (s->cluster_sectors - 1);
 
-if (s->crypt_method) {
-/*
- * For encrypted images, read everything into a temporary
- * contiguous buffer on which the AES functions can work.
- */
-if (!acb->cluster_data) {
-acb->cluster_data =
-g_malloc0(QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
+qemu_iovec_reset(&acb->hd_qiov);
+qemu_iovec_copy(&acb->hd_qiov, acb->qiov, acb->bytes_done,
+cur_nr_sectors * 512);
+
+if (!cluster_offset) {
+
+if (bs->backing_hd) {
+/* read from the base image */
+n1 = qcow2_backing_read1(bs->backing_hd, &acb->hd_qiov,
+acb->sector_num, cur_nr_sectors);
+if (n1 > 0) {
+BLKDBG_EVENT(bs->file, BLKDBG_READ_BACKING_AIO);
+qemu_co_mutex_unlock(&s->lock);
+ret = bdrv_co_readv(bs->backing_hd, acb->sector_num,
+n1, &acb->hd_qiov);
+qemu_co_mutex_lock(&s->lock);
+if (ret < 0) {
+return ret;
+}
+}
+} else {
+/* Note: in this case, no need to wait */
+qemu_iovec_memset(&acb->hd_qiov, 0, 512 * cur_nr_sectors);
+}
+} else if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
+/* add AIO support for compressed blocks ? */
+ret = qcow2_decompress_cluster(bs, cluster_offset);
+if (ret < 0) {
+return ret;
 }
 
-assert(cur_nr_sectors <=
-QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors);
-qemu_iovec_reset(&acb->hd_qiov);
-qemu_iovec_add(&acb->hd_qiov, acb->cluster_data,
+qemu_iovec_from_buffer(&acb->hd_qiov,
+s->cluster_cache + index_in_cluster * 512,
 512 * cur_nr_sectors);
-}
+} else {
+if ((cluster_offset & 511) != 0) {
+

[Qemu-devel] [PATCH v3 08/15] qcow2: remove l2meta from QCowAIOCB

2011-08-23 Thread Frediano Ziglio

Signed-off-by: Frediano Ziglio 
---
 block/qcow2.c |   15 ---
 1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index f920dbe..4b9ec29 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -381,7 +381,6 @@ typedef struct QCowAIOCB {
 uint64_t cluster_offset;
 uint8_t *cluster_data;
 QEMUIOVector hd_qiov;
-QCowL2Meta l2meta;
 } QCowAIOCB;
 
 /*
@@ -514,8 +513,6 @@ static QCowAIOCB *qcow2_aio_setup(BlockDriverState *bs, 
int64_t sector_num,
 acb->bytes_done = 0;
 acb->remaining_sectors = nb_sectors;
 acb->cluster_offset = 0;
-acb->l2meta.nb_clusters = 0;
-qemu_co_queue_init(&acb->l2meta.dependent_requests);
 return acb;
 }
 
@@ -566,6 +563,10 @@ static int qcow2_aio_write_cb(QCowAIOCB *acb)
 int n_end;
 int ret;
 int cur_nr_sectors; /* number of sectors in current iteration */
+QCowL2Meta l2meta;
+
+l2meta.nb_clusters = 0;
+qemu_co_queue_init(&l2meta.dependent_requests);
 
 if (acb->remaining_sectors == 0) {
 /* request completed */
@@ -579,12 +580,12 @@ static int qcow2_aio_write_cb(QCowAIOCB *acb)
 n_end = QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors;
 
 ret = qcow2_alloc_cluster_offset(bs, acb->sector_num << 9,
-index_in_cluster, n_end, &cur_nr_sectors, &acb->l2meta);
+index_in_cluster, n_end, &cur_nr_sectors, &l2meta);
 if (ret < 0) {
 return ret;
 }
 
-acb->cluster_offset = acb->l2meta.cluster_offset;
+acb->cluster_offset = l2meta.cluster_offset;
 assert((acb->cluster_offset & 511) == 0);
 
 qemu_iovec_reset(&acb->hd_qiov);
@@ -618,9 +619,9 @@ static int qcow2_aio_write_cb(QCowAIOCB *acb)
 return ret;
 }
 
-ret = qcow2_alloc_cluster_link_l2(bs, &acb->l2meta);
+ret = qcow2_alloc_cluster_link_l2(bs, &l2meta);
 
-run_dependent_requests(s, &acb->l2meta);
+run_dependent_requests(s, &l2meta);
 
 if (ret < 0) {
 return ret;
-- 
1.7.1




Re: [Qemu-devel] [PATCH v2] posix-aio-compat: fix latency issues

2011-08-23 Thread Jan Kiszka
On 2011-08-23 16:02, Anthony Liguori wrote:
> On 08/23/2011 08:02 AM, Jan Kiszka wrote:
>> On 2011-08-23 14:40, Anthony Liguori wrote:
>>> You should be able to just use an eventfd or pipe.
>>>
>>> Better yet, we should look at using GThreadPool to replace posix-aio-compat.
>>
>> When interacting with the thread pool is part of some time-critical path
>> (easily possible with a real-time Linux guest), general-purpose
>> implementations like what glib offers are typically out of the game.
>> They do not provide sufficient customizability, specifically control
>> over their internal synchronization and allocation policies. That
>> applies to the other rather primitive glib threading and locking
>> services as well.
> 
> We can certainly enhance glib.  glib is a cross platform library.  I 

Do you want to carry forked glib bits in QEMU?

> don't see a compelling reason to invent a new cross platform library 
> just for QEMU especially if the justification is future features, not 
> current features.

Tweaking affinity of aio threads is already a current requirement.

And we already have a working threading and locking system. One that is
growing beyond glib's level of support quickly (think of RCU).
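To make the affinity requirement concrete, here is a minimal sketch (not
QEMU code) of the kind of per-thread policy a hand-rolled worker pool can
apply when it spawns its threads.  The CPU number and the empty worker
body are assumptions for illustration; GThreadPool owns thread creation,
so there is no obvious place to hang this kind of setup.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void *aio_worker(void *opaque)
{
    cpu_set_t set;

    /* Pin this worker to CPU 2 before it starts servicing requests */
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* a real worker would loop here, popping requests from a queue */
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, aio_worker, NULL);
    pthread_join(tid, NULL);
    return 0;
}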

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [libvirt] [PATCH v4] Add support for fd: protocol

2011-08-23 Thread Corey Bryant

On 08/22/2011 03:25 PM, Anthony Liguori wrote:

On 08/22/2011 01:22 PM, Daniel P. Berrange wrote:

On Mon, Aug 22, 2011 at 12:25:25PM -0500, Anthony Liguori wrote:

On 08/22/2011 11:50 AM, Daniel P. Berrange wrote:

On Mon, Aug 22, 2011 at 11:29:12AM -0500, Anthony Liguori wrote:

I don't think it makes sense to have qemu-fe do dynamic labelling.
You certainly could avoid the fd passing by having qemu-fe do the
open though and just let qemu-fe run without the restricted security
context.


qemu-fe would also not be entirely simple,


Indeed.


because it will need to act
as a proxy for the monitor, in order to make hotplug work. ie the mgmt
app would be sending 'drive_add file:/foo/bar' to qemu-fe, which would
then have to open the file and send 'drive_add fd:NN' onto the real
QEMU,
and then pass the results on back.

In addition qemu-fe would still have to be under some kind of
restricted
security context for it to be acceptable. This is going to want to
be as
locked down as possible.


I think there's got to be some give and take here.

It should at least be as locked down as libvirtd. From a security
point of view, we should be able to agree that we want libvirtd to
be as locked down as possible.

But there shouldn't be a hard requirement to lock down qemu-fe more
than libvirtd. Instead, the requirement should be for qemu-fe to be
as/more vigilant in not trusting qemu-system-x86_64 as libvirtd is.

The fundamental problem here, is that there is some logic in
libvirtd that rightly belongs in QEMU. In order to preserve the
security model, that means that we're going to have to take a
subsection of QEMU and trust it more.


Well we have a process that makes security decisions, and a process
which applies those security decisions and a process which is confined
by those decisions. Currently libvirtd makes & applies the decisions,
and qemu is confined. A qemu-fe model would mean that libvirt is making
the decisions, but is then relying on qemu-fe to apply them. IMHO that
split is undesirable, but that's besides the point, since this is not
a decision that needs to be made now.

'qemu-fe' needs to have a way to communicate with the confined process
('qemu-system-XXX') to supply it the resources (file FDs) it needs to
access. The requirements of such a comms channel for qemu-fe are going
to be the same as those needed by libvirtd talking to QEMU today, or
indeed by any process that is applying security decisions to QEMU.


But the fundamental difference is that libvirtd uses what's ostensibly a
public, supported interface. That means when we add things like this,
we're stuck supporting it for general use cases.

It's much more palatable to do these things using a private interface
such that we can change these things down the road without worrying
about compatibility with third-party tools.

Regards,

Anthony Liguori



Is this a nack for the fd: protocol?  Or do we want to implement the fd: 
protocol as a stepping stone on the way to a privilege-separated qemu 
model?  I know the fd: protocol is not ideal, but it does provide NFS 
image isolation, perhaps much sooner than privilege-separated qemu can.


--
Regards,
Corey



Re: [Qemu-devel] [libvirt] [PATCH v4] Add support for fd: protocol

2011-08-23 Thread Anthony Liguori

On 08/23/2011 09:26 AM, Corey Bryant wrote:

On 08/22/2011 03:25 PM, Anthony Liguori wrote:

On 08/22/2011 01:22 PM, Daniel P. Berrange wrote:

On Mon, Aug 22, 2011 at 12:25:25PM -0500, Anthony Liguori wrote:

On 08/22/2011 11:50 AM, Daniel P. Berrange wrote:

On Mon, Aug 22, 2011 at 11:29:12AM -0500, Anthony Liguori wrote:

I don't think it makes sense to have qemu-fe do dynamic labelling.
You certainly could avoid the fd passing by having qemu-fe do the
open though and just let qemu-fe run without the restricted security
context.


qemu-fe would also not be entirely simple,


Indeed.


because it will need to act
as a proxy for the monitor, in order to make hotplug work. ie the mgmt
app would be sending 'drive_add file:/foo/bar' to qemu-fe, which would
then have to open the file and send 'drive_add fd:NN' onto the real
QEMU,
and then pass the results on back.

In addition qemu-fe would still have to be under some kind of
restricted
security context for it to be acceptable. This is going to want to
be as
locked down as possible.


I think there's got to be some give and take here.

It should at least be as locked down as libvirtd. From a security
point of view, we should be able to agree that we want libvirtd to
be as locked down as possible.

But there shouldn't be a hard requirement to lock down qemu-fe more
than libvirtd. Instead, the requirement should be for qemu-fe to be
as/more vigilant in not trusting qemu-system-x86_64 as libvirtd is.

The fundamental problem here, is that there is some logic in
libvirtd that rightly belongs in QEMU. In order to preserve the
security model, that means that we're going to have to take a
subsection of QEMU and trust it more.


Well we have a process that makes security decisions, and a process
which applies those security decisions and a process which is confined
by those decisions. Currently libvirtd makes & applies the decisions,
and qemu is confined. A qemu-fe model would mean that libvirt is making
the decisions, but is then relying on qemu-fe to apply them. IMHO that
split is undesirable, but that's besides the point, since this is not
a decision that needs to be made now.

'qemu-fe' needs to have a way to communicate with the confined process
('qemu-system-XXX') to supply it the resources (file FDs) it needs to
access. The requirements of such a comms channel for qemu-fe are going
to be the same as those needed by libvirtd talking to QEMU today, or
indeed by any process that is applying security decisions to QEMU.


But the fundamental difference is that libvirtd uses what's ostensibly a
public, supported interface. That means when we add things like this,
we're stuck supporting it for general use cases.

It's much more palatable to do these things using a private interface
such that we can change these things down the road without worrying
about compatibility with third-party tools.

Regards,

Anthony Liguori



Is this a nack for the fd: protocol?


No, I think we're trying to understand what the options are.

Regards,

Anthony Liguori

 Or do we want to implement the fd:

protocol as a stepping stone on the way to a privilege-separated qemu
model? I know the fd: protocol is not ideal, but it does provide NFS
image isolation, perhaps much sooner than privilege-separated qemu can.






Re: [Qemu-devel] Grand Unified Tracing

2011-08-23 Thread Blue Swirl
On Tue, Aug 23, 2011 at 5:35 AM, Stefan Hajnoczi  wrote:
> On Mon, Aug 22, 2011 at 9:59 PM, Blue Swirl  wrote:
>> On Mon, Aug 22, 2011 at 8:23 PM, Stefan Hajnoczi  wrote:
>>> On Mon, Aug 22, 2011 at 7:14 PM, Blue Swirl  wrote:
 OpenBIOS uses traditional DPRINTFs for debugging. I was considering
 replacing those with tracepoints, which would output to serial device
 or whatever DPRINTFs are using currently. This would not be extremely
 useful by itself, except maybe that configuration for debugging would
 be concentrated to single 'trace-events' file, but this would not be a
 major improvement over current XML configuration.

 But developing this further, maybe OpenBIOS could also pass the
 tracepoint data back to QEMU? Then all tracepoint data would be
 synchronized, coherent and all gathered to single place.

 The implementation could be that fw_cfg would be used to pass
 simpletrace style data. An offset should be added to event IDs and
 data would then be output as usual. On OpenBIOS side, the
 implementation would be pretty similar to current QEMU tracepoints but
 in place of file output there would be fw_cfg output.

 Syntax for trace-events file should be augmented with include
 directive, so that QEMU knows also OpenBIOS tracepoints. I think the
 only change to simpletrace.py would be to parse this directive.

 Controlling OpenBIOS tracepoints from QEMU monitor would be cool too.

 Going even further, other targets like kernels could use something
 similar, probably not using fw_cfg though.

 What do you think?
>>>
>>> Dhaval showed me a demo of unified host/guest Linux tracing last week.
>>>  He is doing something similar except using a hypercall to pass a
>>> string to the host kernel.  In his case kvm.ko handles the hypercall
>>> and qemu is not involved.
>>
>> So the events flow directly from guest kernel to host kernel?
>
> Yes.

How is that used then, with dmesg?

>>> One issue with QEMU tracing is that trace call sites are static.  You
>>> need to compile in a trace_*() call, which means that there are two
>>> choices for how to tunnel OpenBIOS trace events:
>>>
>>> 1. Define a tunnel trace event:
>>> openbios_event(uint64_t event_id, uint64_t arg1, uint64_t arg2, ...)
>>>
>>> QEMU only has one trace event to tunnel OpenBIOS trace events.  Then
>>> the host is unable to pretty-print OpenBIOS traces automatically and
>>> the max arguments becomes 6 - 1 (for the openbios_event tunnel event
>>> id).
>>>
>>> 2. Generate a switch statement to demultiplex trace events:
>>> void hypercall(uint64_t event_id, uint64_t arg1, ...)
>>> {
>>>    /* This is auto-generated by tracetool */
>>>    switch (event_id) {
>>>    case TRACE_EVENT_OPENBIOS_FOO:
>>>        trace_openbios_foo(arg1, arg2, arg3);
>>>        break;
>>>    case TRACE_EVENT_OPENBIOS_BAR:
>>>        trace_openbios_bar(arg1);
>>>        break;
>>>    ...
>>>    }
>>> }
>>>
>>> With this approach the user can toggle trace events at runtime and it
>>> works out much nicer.
>>
>> Maybe I'm missing something, but why would we have to multiplex
>> anything? Since the trace IDs are 64 bits, we can easily allocate a
>> static range (starting from 0x8000) to non-native events
>> so that QEMU IDs and OpenBIOS IDs do not overlap. QEMU would add this
>> offset to IDs coming from fw_cfg. Then QEMU would just pass the data
>> (8*64 bit words) to tracing back end. There obviously something needs
>> to be changed so that OpenBIOS messages can be generated. For
>> simpletrace this should be easy, for stderr and other back ends that
>> may be more complicated.
>
> I meant that the interface generated by tracetool is a header file
> with trace_*() functions.  If you want to support any tracing backend
> then you need to use this interface and cannot call simpletrace.c
> trace0(), trace1(), ... directly.
>
> Are you thinking of only supporting the simple trace backend?

That would be simplest, then fw_cfg.c (or some intermediate entity)
could just call trace6().

But stderr would be useful too. I think that could be done by adding
code to read the trace-events file to grab the format string. Then a
modified vfprintf would parse it, but instead of va_arg(args, type) it
would just grab the next uint64_t argument. Your approach would be much
simpler.
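A rough sketch of that stderr idea (my own illustration, not an existing
QEMU interface), assuming the format string has already been looked up
from the trace-events file and that every argument arrives as a uint64_t
word, e.g. read from fw_cfg.  Length modifiers are not handled; each
conversion is simply replaced by the next word, printed in hex:

#include <ctype.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static void trace_stderr_print(const char *fmt, const uint64_t *args,
                               int nargs)
{
    int i = 0;

    while (*fmt) {
        if (*fmt == '%' && fmt[1] != '%' && i < nargs) {
            fmt++;                                 /* skip '%' */
            while (*fmt && !isalpha((unsigned char)*fmt)) {
                fmt++;                             /* skip flags/width */
            }
            if (*fmt) {
                fmt++;                             /* skip conversion letter */
            }
            fprintf(stderr, "0x%" PRIx64, args[i++]);
        } else {
            fputc(*fmt++, stderr);
        }
    }
    fputc('\n', stderr);
}

int main(void)
{
    uint64_t args[] = { 0x1000, 16 };

    trace_stderr_print("dma map addr %x len %d", args, 2);
    return 0;
}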



Re: [Qemu-devel] Grand Unified Tracing

2011-08-23 Thread Stefan Hajnoczi
On Tue, Aug 23, 2011 at 3:34 PM, Blue Swirl  wrote:
> On Tue, Aug 23, 2011 at 5:35 AM, Stefan Hajnoczi  wrote:
>> On Mon, Aug 22, 2011 at 9:59 PM, Blue Swirl  wrote:
>>> On Mon, Aug 22, 2011 at 8:23 PM, Stefan Hajnoczi  wrote:
 On Mon, Aug 22, 2011 at 7:14 PM, Blue Swirl  wrote:
> OpenBIOS uses traditional DPRINTFs for debugging. I was considering
> replacing those with tracepoints, which would output to serial device
> or whatever DPRINTFs are using currently. This would not be extremely
> useful by itself, except maybe that configuration for debugging would
> be concentrated to single 'trace-events' file, but this would not be a
> major improvement over current XML configuration.
>
> But developing this further, maybe OpenBIOS could also pass the
> tracepoint data back to QEMU? Then all tracepoint data would be
> synchronized, coherent and all gathered to single place.
>
> The implementation could be that fw_cfg would be used to pass
> simpletrace style data. An offset should be added to event IDs and
> data would then be output as usual. On OpenBIOS side, the
> implementation would be pretty similar to current QEMU tracepoints but
> in place of file output there would be fw_cfg output.
>
> Syntax for trace-events file should be augmented with include
> directive, so that QEMU knows also OpenBIOS tracepoints. I think the
> only change to simpletrace.py would be to parse this directive.
>
> Controlling OpenBIOS tracepoints from QEMU monitor would be cool too.
>
> Going even further, other targets like kernels could use something
> similar, probably not using fw_cfg though.
>
> What do you think?

 Dhaval showed me a demo of unified host/guest Linux tracing last week.
  He is doing something similar except using a hypercall to pass a
 string to the host kernel.  In his case kvm.ko handles the hypercall
 and qemu is not involved.
>>>
>>> So the events flow directly from guest kernel to host kernel?
>>
>> Yes.
>
> How is that used then, with dmesg?

When the host processes the trace hypercall it simply forwards the
trace event on to the host's Linux tracing infrastructure.  It
currently logs the string from the guest and prefixes it with the guest
program counter, I believe.  You can dump out the trace using the usual
Linux ftrace debugfs files.

But the longer term goal, I believe, is to really integrate guest and
host tracing so that host userspace (e.g. perf(1)) can pull out trace
events from the guest and just sees them tagged as belonging to a
particular qemu-kvm process.
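Purely as an illustration of the mechanism (x86-only, and not the actual
ABI of the prototype discussed here, which is not shown in this thread),
the guest side of such a trace hypercall could look roughly like this.
The hypercall number is a placeholder, and a Linux guest kernel would
normally use the kvm_hypercall*() helpers from <asm/kvm_para.h> rather
than raw inline asm:

#include <stdint.h>
#include <string.h>

#define KVM_HC_GUEST_TRACE 42        /* placeholder hypercall number */

static inline long guest_trace(const char *msg)
{
    long ret;

    /* nr in RAX, buffer address in RBX, length in RCX; VMCALL exits to
     * the host, where kvm.ko can copy the string and feed it to ftrace. */
    asm volatile("vmcall"
                 : "=a"(ret)
                 : "a"((long)KVM_HC_GUEST_TRACE),
                   "b"((long)(uintptr_t)msg),
                   "c"((long)strlen(msg))
                 : "memory");
    return ret;
}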

>
 One issue with QEMU tracing is that trace call sites are static.  You
 need to compile in a trace_*() call, which means that there are two
 choices for how to tunnel OpenBIOS trace events:

 1. Define a tunnel trace event:
 openbios_event(uint64_t event_id, uint64_t arg1, uint64_t arg2, ...)

 QEMU only has one trace event to tunnel OpenBIOS trace events.  Then
 the host is unable to pretty-print OpenBIOS traces automatically and
 the max arguments becomes 6 - 1 (for the openbios_event tunnel event
 id).

 2. Generate a switch statement to demultiplex trace events:
 void hypercall(uint64_t event_id, uint64_t arg1, ...)
 {
    /* This is auto-generated by tracetool */
    switch (event_id) {
    case TRACE_EVENT_OPENBIOS_FOO:
        trace_openbios_foo(arg1, arg2, arg3);
        break;
    case TRACE_EVENT_OPENBIOS_BAR:
        trace_openbios_bar(arg1);
        break;
    ...
    }
 }

 With this approach the user can toggle trace events at runtime and it
 works out much nicer.
>>>
>>> Maybe I'm missing something, but why would we have to multiplex
>>> anything? Since the trace IDs are 64 bits, we can easily allocate a
>>> static range (starting from 0x8000) to non-native events
>>> so that QEMU IDs and OpenBIOS IDs do not overlap. QEMU would add this
>>> offset to IDs coming from fw_cfg. Then QEMU would just pass the data
>>> (8*64 bit words) to tracing back end. There obviously something needs
>>> to be changed so that OpenBIOS messages can be generated. For
>>> simpletrace this should be easy, for stderr and other back ends that
>>> may be more complicated.
>>
>> I meant that the interface generated by tracetool is a header file
>> with trace_*() functions.  If you want to support any tracing backend
>> then you need to use this interface and cannot call simpletrace.c
>> trace0(), trace1(), ... directly.
>>
>> Are you thinking of only supporting the simple trace backend?
>
> That would be simplest, then fw_cfg.c (or some intermediate entity)
> could just call trace6().
>
> But stderr would be useful too. I think that could be done by adding
> code to read trace-events file to grab the format string. Then a
> modified vfprintf would parse it but instead of va_arg(args, type) it
> would just grab 

[Qemu-devel] MPC5200 + BestComm support in QEMU

2011-08-23 Thread steve . belanger
Hi, 

I'm Steve, an embedded software developer for Bombardier Transportation 
Canada. We use the MPC5200 for most of our onboard computers inside train 
control systems. To enhance our SW engineering process, we would like to 
emulate the MPC5200 processor using QEMU, an open source software CPU 
emulator, which already supports MPC5200 CPU emulation. 

However, the network interface is handled by the BestComm DMA engine and 
it seems very difficult to simulate this co-processor with our current 
knowledge level. In that sense, I would like to know: has anyone been 
able to correctly emulate the MPC5200, including the BestComm DMA, with 
QEMU? 

Steve Bélanger, ing. / Eng.
TCMS - Développeur logiciel / Embedded Software Developer
Bombardier Transport- St-Bruno
450-441-2020 ext.6148

 
  




Re: [Qemu-devel] Grand Unified Tracing

2011-08-23 Thread Blue Swirl
On Tue, Aug 23, 2011 at 2:45 PM, Stefan Hajnoczi  wrote:
> On Tue, Aug 23, 2011 at 3:34 PM, Blue Swirl  wrote:
>> On Tue, Aug 23, 2011 at 5:35 AM, Stefan Hajnoczi  wrote:
>>> On Mon, Aug 22, 2011 at 9:59 PM, Blue Swirl  wrote:
 On Mon, Aug 22, 2011 at 8:23 PM, Stefan Hajnoczi  
 wrote:
> On Mon, Aug 22, 2011 at 7:14 PM, Blue Swirl  wrote:
>> OpenBIOS uses traditional DPRINTFs for debugging. I was considering
>> replacing those with tracepoints, which would output to serial device
>> or whatever DPRINTFs are using currently. This would not be extremely
>> useful by itself, except maybe that configuration for debugging would
>> be concentrated to single 'trace-events' file, but this would not be a
>> major improvement over current XML configuration.
>>
>> But developing this further, maybe OpenBIOS could also pass the
>> tracepoint data back to QEMU? Then all tracepoint data would be
>> synchronized, coherent and all gathered to single place.
>>
>> The implementation could be that fw_cfg would be used to pass
>> simpletrace style data. An offset should be added to event IDs and
>> data would then be output as usual. On OpenBIOS side, the
>> implementation would be pretty similar to current QEMU tracepoints but
>> in place of file output there would be fw_cfg output.
>>
>> Syntax for trace-events file should be augmented with include
>> directive, so that QEMU knows also OpenBIOS tracepoints. I think the
>> only change to simpletrace.py would be to parse this directive.
>>
>> Controlling OpenBIOS tracepoints from QEMU monitor would be cool too.
>>
>> Going even further, other targets like kernels could use something
>> similar, probably not using fw_cfg though.
>>
>> What do you think?
>
> Dhaval showed me a demo of unified host/guest Linux tracing last week.
>  He is doing something similar except using a hypercall to pass a
> string to the host kernel.  In his case kvm.ko handles the hypercall
> and qemu is not involved.

 So the events flow directly from guest kernel to host kernel?
>>>
>>> Yes.
>>
>> How is that used then, with dmesg?
>
> When the host processes the trace hypercall it simply forwards the
> trace event on to host Linux's trace.  It currently logs the string
> from the guest and prefixes it with the guest program counter, I
> believe.  You can dump out the trace using the usual Linux ftrace
> debugfs files.
>
> But the longer term goal, I believe, is to really integrate guest and
> host tracing so that host userspace (e.g. perf(1)) can pull out trace
> events from the guest and just sees them tagged as belonging to a
> particular qemu-kvm process.
>
>>
> One issue with QEMU tracing is that trace call sites are static.  You
> need to compile in a trace_*() call, which means that there are two
> choices for how to tunnel OpenBIOS trace events:
>
> 1. Define a tunnel trace event:
> openbios_event(uint64_t event_id, uint64_t arg1, uint64_t arg2, ...)
>
> QEMU only has one trace event to tunnel OpenBIOS trace events.  Then
> the host is unable to pretty-print OpenBIOS traces automatically and
> the max arguments becomes 6 - 1 (for the openbios_event tunnel event
> id).
>
> 2. Generate a switch statement to demultiplex trace events:
> void hypercall(uint64_t event_id, uint64_t arg1, ...)
> {
>    /* This is auto-generated by tracetool */
>    switch (event_id) {
>    case TRACE_EVENT_OPENBIOS_FOO:
>        trace_openbios_foo(arg1, arg2, arg3);
>        break;
>    case TRACE_EVENT_OPENBIOS_BAR:
>        trace_openbios_bar(arg1);
>        break;
>    ...
>    }
> }
>
> With this approach the user can toggle trace events at runtime and it
> works out much nicer.

 Maybe I'm missing something, but why would we have to multiplex
 anything? Since the trace IDs are 64 bits, we can easily allocate a
 static range (starting from 0x8000) to non-native events
 so that QEMU IDs and OpenBIOS IDs do not overlap. QEMU would add this
 offset to IDs coming from fw_cfg. Then QEMU would just pass the data
 (8*64 bit words) to tracing back end. There obviously something needs
 to be changed so that OpenBIOS messages can be generated. For
 simpletrace this should be easy, for stderr and other back ends that
 may be more complicated.
>>>
>>> I meant that the interface generated by tracetool is a header file
>>> with trace_*() functions.  If you want to support any tracing backend
>>> then you need to use this interface and cannot call simpletrace.c
>>> trace0(), trace1(), ... directly.
>>>
>>> Are you thinking of only supporting the simple trace backend?
>>
>> That would be simplest, then fw_cfg.c (or some intermediate entity)
>> could just call trace6().
>>
>> But stderr would be useful too. I think that c

Re: [Qemu-devel] RFCv2: virDomainSnapshotCreateXML enhancements

2011-08-23 Thread Stefan Hajnoczi
On Tue, Aug 23, 2011 at 2:40 PM, Eric Blake  wrote:
> On 08/23/2011 03:52 AM, Stefan Hajnoczi wrote:
>>
>> On Wed, Aug 10, 2011 at 11:08 PM, Eric Blake  wrote:
>>>
>>> disk snapshot: the state of a virtual disk used at a given time; once a
>>> snapshot exists, then it is possible to track a delta of changes that
>>> have
>>> happened since that time.
>>
>> Did you go into details of the delta API anywhere?  I don't see it.
>
> It's not available yet, because qemu doesn't provide anything yet.  I think
> that APIs to inspect the actual delta disk contents between the current
> state and a prior snapshot will be similar to block pull, but we can't
> implement anything without support from the underlying tools.

Excellent, this is an opportunity where we need to think things
through on the QEMU side and come up with a proposal that you can give
feedback on.

There is no active work in implementing a dirty block tracking API in QEMU.

We already have the bs->dirty_bitmap for block-migration.c.  Jagane
has also implemented a dirty block feature for his LiveBackup API:
https://github.com/jagane/qemu-livebackup/blob/master/livebackup.h#L95

We also have the actual bdrv_is_allocated() information to determine
whether a qcow2/qed/etc image file has the sectors allocated or not.

As a starting point we could provide a way to enable bs->dirty_bitmap
for a block device and query its status.  This is not persistent (the
bitmap is in RAM), so the question becomes whether or not to persist
it.  And if we do persist it, do we want to take the cheap route of
syncing the bitmap to disk only when cleanly terminating QEMU, or do we
want a crash-safe bitmap?

If we specify that the dirty bitmap is not guaranteed to persist
(because it is simply an advisory feature for incremental backup and
similar applications), then we can start simple and consider doing a
persistent implementation later.
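As a sketch of the rough shape such a non-persistent interface could
take (this is not the existing QEMU API; the names and the
one-bit-per-chunk granularity are assumptions for illustration):

#include <stdint.h>
#include <stdlib.h>

typedef struct DirtyBitmap {
    uint64_t *bits;
    int64_t nb_chunks;
    int chunk_bits;               /* sectors per chunk = 1 << chunk_bits */
} DirtyBitmap;

static DirtyBitmap *dirty_bitmap_new(int64_t nb_sectors, int chunk_bits)
{
    DirtyBitmap *bm = calloc(1, sizeof(*bm));

    bm->chunk_bits = chunk_bits;
    bm->nb_chunks = (nb_sectors + (1 << chunk_bits) - 1) >> chunk_bits;
    bm->bits = calloc((bm->nb_chunks + 63) / 64, sizeof(uint64_t));
    return bm;
}

/* called from the write path to mark a sector's chunk dirty */
static void dirty_bitmap_set(DirtyBitmap *bm, int64_t sector)
{
    int64_t chunk = sector >> bm->chunk_bits;

    bm->bits[chunk / 64] |= UINT64_C(1) << (chunk % 64);
}

/* what a query command would read out for a backup application */
static int dirty_bitmap_get(DirtyBitmap *bm, int64_t sector)
{
    int64_t chunk = sector >> bm->chunk_bits;

    return (bm->bits[chunk / 64] >> (chunk % 64)) & 1;
}

int main(void)
{
    /* 512 MB disk (1M sectors), 1 MB chunks (2048 sectors) */
    DirtyBitmap *bm = dirty_bitmap_new(1 << 20, 11);

    dirty_bitmap_set(bm, 4096);
    return !dirty_bitmap_get(bm, 4096);
}

A query command would then simply walk dirty_bitmap_get() over the
device and report dirty extents to the application.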

Stefan



Re: [Qemu-devel] [libvirt] [PATCH v4] Add support for fd: protocol

2011-08-23 Thread Corey Bryant



On 08/22/2011 02:39 PM, Blue Swirl wrote:

On Mon, Aug 22, 2011 at 5:42 PM, Corey Bryant  wrote:

>
>
>  On 08/22/2011 01:25 PM, Anthony Liguori wrote:

>>
>>  On 08/22/2011 11:50 AM, Daniel P. Berrange wrote:

>>>
>>>  On Mon, Aug 22, 2011 at 11:29:12AM -0500, Anthony Liguori wrote:


  I don't think it makes sense to have qemu-fe do dynamic labelling.
  You certainly could avoid the fd passing by having qemu-fe do the
  open though and just let qemu-fe run without the restricted security
  context.

>>>
>>>  qemu-fe would also not be entirely simple,

>>
>>  Indeed.
>>

>
>  I do like the idea of a privileged qemu-fe performing the open and passing
>  the fd to a restricted qemu.

Me too.


>However, I get the impression that this won't
>  get delivered nearly as quickly as fd: passing could be.  How soon do we
>  need image isolation for NFS?
>
>  Btw, this sounds similar to what Blue Swirl recommended here on v1 of this
>  patch:http://lists.gnu.org/archive/html/qemu-devel/2011-05/msg02187.html

I was thinking about simply doing fork() + setuid() at some point and
using the FD passing structures directly. But would it bring
advantages to have two separate executables, are they different from
access control point of view vs. single but forked one?



We could put together an SELinux policy that would transition qemu-fe to 
a more restricted domain (ie. no open privilege on NFS files) when it 
executes qemu-system-x86_64.


--
Regards,
Corey


>  Regards,
>  Corey
>

>>>  because it will need to act
>>>  as a proxy for the monitor, in order to make hotplug work. ie the mgmt
>>>  app would be sending 'drive_add file:/foo/bar' to qemu-fe, which would
>>>  then have to open the file and send 'drive_add fd:NN' onto the real QEMU,
>>>  and then pass the results on back.
>>>
>>>  In addition qemu-fe would still have to be under some kind of restricted
>>>  security context for it to be acceptable. This is going to want to be as
>>>  locked down as possible.

>>
>>  I think there's got to be some give and take here.
>>
>>  It should at least be as locked down as libvirtd. From a security point
>>  of view, we should be able to agree that we want libvirtd to be as
>>  locked down as possible.
>>
>>  But there shouldn't be a hard requirement to lock down qemu-fe more than
>>  libvirtd. Instead, the requirement should be for qemu-fe to be as/more
>>  vigilant in not trusting qemu-system-x86_64 as libvirtd is.
>>
>>  The fundamental problem here, is that there is some logic in libvirtd
>>  that rightly belongs in QEMU. In order to preserve the security model,
>>  that means that we're going to have to take a subsection of QEMU and
>>  trust it more.
>>

>>>  So I'd see that you'd likely end up with the
>>>  qemu-fe security policy being identical to the qemu security policy,

>>
>>  Then there's no point in doing qemu-fe. qemu-fe should be thought of as
>>  QEMU supplied libvirtd plugin.
>>

>>>  with the exception that it would be allowed to open files on NFS without
>>>  needing them to be labelled. So I don't really see that all this gives us
>>>  any tangible benefits over just allowing the mgmt app to pass in the FDs
>>>  directly.
>>>

  But libvirt would still need to parse image files.

>>>
>>>  Not neccessarily. As mentioned below, it is entirely possible to
>>>  enable the mgmt app to pass in details of the backing files, at
>>>  which point no image parsing is required by libvirt. Hence my
>>>  assertion that the question of who does image parsing is irrelevant
>>>  to this discussion.

>>
>>  That's certainly true.
>>
>>  Regards,
>>
>>  Anthony Liguori

>
>
>




Re: [Qemu-devel] [libvirt] [PATCH v4] Add support for fd: protocol

2011-08-23 Thread Daniel P. Berrange
On Tue, Aug 23, 2011 at 11:13:34AM -0400, Corey Bryant wrote:
> 
> 
> On 08/22/2011 02:39 PM, Blue Swirl wrote:
> >On Mon, Aug 22, 2011 at 5:42 PM, Corey Bryant  
> >wrote:
> >>>
> >>>
> >>>  On 08/22/2011 01:25 PM, Anthony Liguori wrote:
> >
> >  On 08/22/2011 11:50 AM, Daniel P. Berrange wrote:
> >>>
> >>>  On Mon, Aug 22, 2011 at 11:29:12AM -0500, Anthony Liguori wrote:
> >
> >  I don't think it makes sense to have qemu-fe do dynamic labelling.
> >  You certainly could avoid the fd passing by having qemu-fe do the
> >  open though and just let qemu-fe run without the restricted 
> > security
> >  context.
> >>>
> >>>  qemu-fe would also not be entirely simple,
> >
> >  Indeed.
> >
> >>>
> >>>  I do like the idea of a privileged qemu-fe performing the open and 
> >>> passing
> >>>  the fd to a restricted qemu.
> >Me too.
> >
> >>>However, I get the impression that this won't
> >>>  get delivered nearly as quickly as fd: passing could be.  How soon do we
> >>>  need image isolation for NFS?
> >>>
> >>>  Btw, this sounds similar to what Blue Swirl recommended here on v1 of 
> >>> this
> >>>  patch:http://lists.gnu.org/archive/html/qemu-devel/2011-05/msg02187.html
> >I was thinking about simply doing fork() + setuid() at some point and
> >using the FD passing structures directly. But would it bring
> >advantages to have two separate executables, are they different from
> >access control point of view vs. single but forked one?
> >
> 
> We could put together an SELinux policy that would transition
> qemu-fe to a more restricted domain (ie. no open privilege on NFS
> files) when it executes qemu-system-x86_64.

Thinking about this some more, I don't really think the idea of delegating
open of NFS files to a separate qemu-fe is very desirable. Libvirt makes the
decision on the security policy that the VM will run under, and provides
audit records to log what resources are being assigned to the VM. From that
point onwards, we must be able to guarantee that MAC will be enforced on
the VM, according to what we logged via the auditd system.

In the case where we delegate opening of the files to qemu-fe, and allow
its policy to open NFS files, we no longer have a guarantee that the MAC
policy will be enforced as we originally intended. Yes, qemu-fe will very
likely honour what we tell it and open the correct files, and yes qemu-fe
has a lower attack surface wrt the guest than the real qemu does, but we
still lose the guarantee of MAC enforcement from libvirt's POV.

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [Qemu-devel] [PATCH v3 14/15] qcow2: small math optimization

2011-08-23 Thread Kevin Wolf
Am 23.08.2011 15:21, schrieb Frediano Ziglio:
> Signed-off-by: Frediano Ziglio 
> ---
>  block/qcow2-refcount.c |5 +
>  1 files changed, 1 insertions(+), 4 deletions(-)
> 
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 2a915be..0f9a64a 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -140,10 +140,7 @@ static unsigned int 
> next_refcount_table_size(BDRVQcowState *s,
>  static int in_same_refcount_block(BDRVQcowState *s, uint64_t offset_a,
>  uint64_t offset_b)
>  {
> -uint64_t block_a = offset_a >> (2 * s->cluster_bits - REFCOUNT_SHIFT);
> -uint64_t block_b = offset_b >> (2 * s->cluster_bits - REFCOUNT_SHIFT);
> -
> -return (block_a == block_b);
> +return ((offset_a ^ offset_b) >> (2 * s->cluster_bits - REFCOUNT_SHIFT)) 
> == 0;
>  }

Depending on whether the compiler is smart enough this will or will not
change performance. However, even if we assume that it's a slight
improvement, this is in a function that is hardly ever run and the
optimisation comes with a high cost in terms of readability.

I wouldn't do this.

Kevin



Re: [Qemu-devel] [PATCH v3 15/15] qcow2: small optimization

2011-08-23 Thread Kevin Wolf
Am 23.08.2011 15:21, schrieb Frediano Ziglio:
> Signed-off-by: Frediano Ziglio 
> ---
>  block/qcow2-refcount.c |   11 +--
>  1 files changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 0f9a64a..243c24b 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -681,14 +681,13 @@ void qcow2_create_refcount_update(QCowCreateState *s, 
> int64_t offset,
>  int64_t size)
>  {
>  int refcount;
> -int64_t start, last, cluster_offset;
> +int64_t start, last, cluster;
>  uint16_t *p;
>  
> -start = offset & ~(s->cluster_size - 1);
> -last = (offset + size - 1)  & ~(s->cluster_size - 1);
> -for(cluster_offset = start; cluster_offset <= last;
> -cluster_offset += s->cluster_size) {
> -p = &s->refcount_block[cluster_offset >> s->cluster_bits];
> +start = offset >> s->cluster_bits;
> +last = (offset + size - 1)  >> s->cluster_bits;
> +for(cluster = start; cluster <= last; ++cluster) {
> +p = &s->refcount_block[cluster];
>  refcount = be16_to_cpu(*p);
>  refcount++;
>  *p = cpu_to_be16(refcount);

This function is unused nowadays. I'd prefer a patch that removes it
altogether. :-)

Kevin



Re: [Qemu-devel] [PATCH v3 00/15] qcow/qcow2 cleanups

2011-08-23 Thread Kevin Wolf
Am 23.08.2011 15:21, schrieb Frediano Ziglio:
> These patches mostly cleanup some AIO code using coroutines.
> Mostly they use stack instead of allocated AIO structure.
> Feel free to collapse it too short.
> 
> Frediano Ziglio (15):
>   qcow: allocate QCowAIOCB structure using stack
>   qcow: QCowAIOCB field cleanup
>   qcow: move some blocks of code to avoid useless variable
> initialization
>   qcow: embed qcow_aio_read_cb into qcow_co_readv and qcow_aio_write_cb
> into qcow_co_writev
>   qcow: remove old #undefined code
>   qcow2: removed unused fields
>   qcow2: removed cur_nr_sectors field in QCowAIOCB
>   qcow2: remove l2meta from QCowAIOCB
>   qcow2: remove cluster_offset from QCowAIOCB
>   qcow2: remove common from QCowAIOCB
>   qcow2: reindent and use while before the big jump
>   qcow2: removed QCowAIOCB entirely
>   qcow2: remove memory leak
>   qcow2: small math optimization
>   qcow2: small optimization
> 
>  block/qcow.c   |  378 ++--
>  block/qcow2-refcount.c |   16 +--
>  block/qcow2.c  |  412 +++
>  3 files changed, 294 insertions(+), 512 deletions(-)
> 

Thanks, applied patches 1-13 to the block branch.

Kevin



Re: [Qemu-devel] [libvirt] [PATCH v4] Add support for fd: protocol

2011-08-23 Thread Kevin Wolf
Am 23.08.2011 17:26, schrieb Daniel P. Berrange:
> On Tue, Aug 23, 2011 at 11:13:34AM -0400, Corey Bryant wrote:
>>
>>
>> On 08/22/2011 02:39 PM, Blue Swirl wrote:
>>> On Mon, Aug 22, 2011 at 5:42 PM, Corey Bryant  
>>> wrote:
>
>
>  On 08/22/2011 01:25 PM, Anthony Liguori wrote:
>>>
>>>  On 08/22/2011 11:50 AM, Daniel P. Berrange wrote:
>
>  On Mon, Aug 22, 2011 at 11:29:12AM -0500, Anthony Liguori wrote:
>>>
>>>  I don't think it makes sense to have qemu-fe do dynamic labelling.
>>>  You certainly could avoid the fd passing by having qemu-fe do the
>>>  open though and just let qemu-fe run without the restricted 
>>> security
>>>  context.
>
>  qemu-fe would also not be entirely simple,
>>>
>>>  Indeed.
>>>
>
>  I do like the idea of a privileged qemu-fe performing the open and 
> passing
>  the fd to a restricted qemu.
>>> Me too.
>>>
>However, I get the impression that this won't
>  get delivered nearly as quickly as fd: passing could be.  How soon do we
>  need image isolation for NFS?
>
>  Btw, this sounds similar to what Blue Swirl recommended here on v1 of 
> this
>  patch:http://lists.gnu.org/archive/html/qemu-devel/2011-05/msg02187.html
>>> I was thinking about simply doing fork() + setuid() at some point and
>>> using the FD passing structures directly. But would it bring
>>> advantages to have two separate executables, are they different from
>>> access control point of view vs. single but forked one?
>>>
>>
>> We could put together an SELinux policy that would transition
>> qemu-fe to a more restricted domain (ie. no open privilege on NFS
>> files) when it executes qemu-system-x86_64.
> 
> Thinking about this some more, I don't really think the idea of delegating
> open of NFS files to a separate qemu-fe is very desirable. Libvirt makes the
> decision on the security policy that the VM will run under, and provides
> audit records to log what resources are being assigned to the VM. From that
> point onwards, we must be able to guarantee that MAC will be enforced on
> the VM, according to what we logged via the auditd system.
> 
> In the case where we delegate opening of the files to qemu-fe, and allow
> its policy to open NFS files, we no longer have a guarentee that the MAC
> policy will be enforced as we originally intended. Yes, qemu-fe will very
> likely honour what we tell it and open the correct files, and yes qmeu-fe
> has lower attack surface wrt the guest than the real qemu does, but we
> still loose the guarentee of MAC enforcement from libvirt's POV.

On the other hand, from a qemu POV libvirt is only one possible
management tool. In practice, another very popular "management tool" for
qemu is bash. With qemu-fe all the other tools, including direct
invocation from the command line, would get some protection, too.

Kevin



Re: [Qemu-devel] [PATCH 2/2] [FYI] coverage test for Linux installs

2011-08-23 Thread Ryan Harper
* Anthony Liguori  [2011-08-08 14:33]:
> This is a simple tool that I've been using for the past couple months to help
> with day-to-day testing of changes.  It may seem like it's similar to
> kvm-autotest but it's actually quite different.  Most noticably:
> 
>  - It's a coverage test instead of an acceptance test.  Each time it runs it
>uses a semi-random configuration for the guest.  This means that
>the more you run it, the more coverage you get.  This is a good fit for
>maintainer testing because you get wide testing without having to wait for
>48-hour test cycles.
> 
>  - It runs in the foreground as an unprivileged user.  This means I can kick 
> it
>off in a terminal while I go off and do something else, but still keep an
>eye on what's going on.
> 
>  - It works with TCG guests too and includes a TCG test case for the pseries
>Power machine.
> 
> That said, it's *not* a replacement for KVM autotest.  It is not a good tool
> for doing acceptance testing, like you would do to cut a new QEMU release.  
> But
> perhaps there's some behavior that could be leveraged by KVM autotest in the
> future here.
> 
> I'm not proposing this for tree inclusion right now.  Just sharing a tool that
> I've found to be useful.  I really just want the previous patch to go in so 
> that
> I can stop carrying that patch privately.
> 
> Right now, you need to setup an ISO directory to use the test tool.  After
> copy-on-read lands in the tree, I plan on making it create CoR files backing 
> to
> http so that no explicit setup is required.
> ---
>  test-linux.c |  549 
> ++
>  1 files changed, 549 insertions(+), 0 deletions(-)
>  create mode 100644 test-linux.c

I needed the following changes to the Makefile to build this:


diff --git a/Makefile b/Makefile
index 8606849..b7df0e1 100644
--- a/Makefile
+++ b/Makefile
@@ -162,6 +162,7 @@ check-qlist: check-qlist.o qlist.o qint.o $(CHECK_PROG_DEPS)
 check-qfloat: check-qfloat.o qfloat.o $(CHECK_PROG_DEPS)
 check-qjson: check-qjson.o qfloat.o qint.o qdict.o qstring.o qlist.o qbool.o 
qjson.o json-streamer.o json-lexer.o js
 test-coroutine: test-coroutine.o qemu-timer-common.o async.o 
$(coroutine-obj-y) $(CHECK_PROG_DEPS)
+test-linux: test-linux.o $(CHECK_PROG_DEPS)
 
 $(qapi-obj-y): $(GENERATED_HEADERS)
 qapi-dir := qapi-generated


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com



Re: [Qemu-devel] [libvirt] [PATCH v4] Add support for fd: protocol

2011-08-23 Thread Daniel P. Berrange
On Tue, Aug 23, 2011 at 05:50:03PM +0200, Kevin Wolf wrote:
> Am 23.08.2011 17:26, schrieb Daniel P. Berrange:
> > On Tue, Aug 23, 2011 at 11:13:34AM -0400, Corey Bryant wrote:
> >>
> >>
> >> On 08/22/2011 02:39 PM, Blue Swirl wrote:
> >>> On Mon, Aug 22, 2011 at 5:42 PM, Corey Bryant  
> >>> wrote:
> >
> >
> >  On 08/22/2011 01:25 PM, Anthony Liguori wrote:
> >>>
> >>>  On 08/22/2011 11:50 AM, Daniel P. Berrange wrote:
> >
> >  On Mon, Aug 22, 2011 at 11:29:12AM -0500, Anthony Liguori wrote:
> >>>
> >>>  I don't think it makes sense to have qemu-fe do dynamic 
> >>> labelling.
> >>>  You certainly could avoid the fd passing by having qemu-fe do the
> >>>  open though and just let qemu-fe run without the restricted 
> >>> security
> >>>  context.
> >
> >  qemu-fe would also not be entirely simple,
> >>>
> >>>  Indeed.
> >>>
> >
> >  I do like the idea of a privileged qemu-fe performing the open and 
> > passing
> >  the fd to a restricted qemu.
> >>> Me too.
> >>>
> >However, I get the impression that this won't
> >  get delivered nearly as quickly as fd: passing could be.  How soon do 
> > we
> >  need image isolation for NFS?
> >
> >  Btw, this sounds similar to what Blue Swirl recommended here on v1 of 
> > this
> >  
> > patch:http://lists.gnu.org/archive/html/qemu-devel/2011-05/msg02187.html
> >>> I was thinking about simply doing fork() + setuid() at some point and
> >>> using the FD passing structures directly. But would it bring
> >>> advantages to have two separate executables, are they different from
> >>> access control point of view vs. single but forked one?
> >>>
> >>
> >> We could put together an SELinux policy that would transition
> >> qemu-fe to a more restricted domain (ie. no open privilege on NFS
> >> files) when it executes qemu-system-x86_64.
> > 
> > Thinking about this some more, I don't really think the idea of delegating
> > open of NFS files to a separate qemu-fe is very desirable. Libvirt makes the
> > decision on the security policy that the VM will run under, and provides
> > audit records to log what resources are being assigned to the VM. From that
> > point onwards, we must be able to guarantee that MAC will be enforced on
> > the VM, according to what we logged via the auditd system.
> > 
> > In the case where we delegate opening of the files to qemu-fe, and allow
> > its policy to open NFS files, we no longer have a guarentee that the MAC
> > policy will be enforced as we originally intended. Yes, qemu-fe will very
> > likely honour what we tell it and open the correct files, and yes qmeu-fe
> > has lower attack surface wrt the guest than the real qemu does, but we
> > still loose the guarentee of MAC enforcement from libvirt's POV.
> 
> On the other hand, from a qemu POV libvirt is only one possible
> management tool. In practice, another very popular "management tool" for
> qemu is bash. With qemu-fe all the other tools, including direct
> invocation from the command line, would get some protection, too.

That's why I said a qemu-fe-like tool need not be mutually exclusive
with exposing FD passing to a management tool like libvirt. Both
qemu-fe and libvirt need some mechanism to pass open FDs to the
real QEMU.  We could needlessly invent a new communication channel
between qemu-fe and qemu that only it can use, or we can use the
channel we already have - QMP - to provide an interface that either
qemu-fe or libvirt can use to pass FDs into the real QEMU.
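As a sketch of what that looks like on the management side, riding on
the monitor's existing 'getfd' command (the QMP greeting and
qmp_capabilities handshake, reply parsing and all error handling are
omitted, and the fd name "drive0" is an arbitrary example):

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

static ssize_t qmp_send_fd(int qmp_sock, int fd_to_pass)
{
    const char *cmd =
        "{ \"execute\": \"getfd\", "
        "\"arguments\": { \"fdname\": \"drive0\" } }\r\n";
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } control;
    struct iovec iov = { .iov_base = (void *)cmd, .iov_len = strlen(cmd) };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = control.buf, .msg_controllen = sizeof(control.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    /* The open fd travels as ancillary data alongside the getfd command;
     * QEMU then keeps it under the name "drive0" for later commands. */
    return sendmsg(qmp_sock, &msg, 0);
}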

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [Qemu-devel] [libvirt] [PATCH v4] Add support for fd: protocol

2011-08-23 Thread Daniel P. Berrange
On Tue, Aug 23, 2011 at 04:51:31PM +0100, Daniel P. Berrange wrote:
> On Tue, Aug 23, 2011 at 05:50:03PM +0200, Kevin Wolf wrote:
> > Am 23.08.2011 17:26, schrieb Daniel P. Berrange:
> > > On Tue, Aug 23, 2011 at 11:13:34AM -0400, Corey Bryant wrote:
> > >>
> > >>
> > >> On 08/22/2011 02:39 PM, Blue Swirl wrote:
> > >>> On Mon, Aug 22, 2011 at 5:42 PM, Corey 
> > >>> Bryant  wrote:
> > >
> > >
> > >  On 08/22/2011 01:25 PM, Anthony Liguori wrote:
> > >>>
> > >>>  On 08/22/2011 11:50 AM, Daniel P. Berrange wrote:
> > >
> > >  On Mon, Aug 22, 2011 at 11:29:12AM -0500, Anthony Liguori wrote:
> > >>>
> > >>>  I don't think it makes sense to have qemu-fe do dynamic 
> > >>> labelling.
> > >>>  You certainly could avoid the fd passing by having qemu-fe do 
> > >>> the
> > >>>  open though and just let qemu-fe run without the restricted 
> > >>> security
> > >>>  context.
> > >
> > >  qemu-fe would also not be entirely simple,
> > >>>
> > >>>  Indeed.
> > >>>
> > >
> > >  I do like the idea of a privileged qemu-fe performing the open and 
> > > passing
> > >  the fd to a restricted qemu.
> > >>> Me too.
> > >>>
> > >However, I get the impression that this won't
> > >  get delivered nearly as quickly as fd: passing could be.  How soon 
> > > do we
> > >  need image isolation for NFS?
> > >
> > >  Btw, this sounds similar to what Blue Swirl recommended here on v1 
> > > of this
> > >  
> > > patch:http://lists.gnu.org/archive/html/qemu-devel/2011-05/msg02187.html
> > >>> I was thinking about simply doing fork() + setuid() at some point and
> > >>> using the FD passing structures directly. But would it bring
> > >>> advantages to have two separate executables, are they different from
> > >>> access control point of view vs. single but forked one?
> > >>>
> > >>
> > >> We could put together an SELinux policy that would transition
> > >> qemu-fe to a more restricted domain (ie. no open privilege on NFS
> > >> files) when it executes qemu-system-x86_64.
> > > 
> > > Thinking about this some more, I don't really think the idea of delegating
> > > open of NFS files to a separate qemu-fe is very desirable. Libvirt makes 
> > > the
> > > decision on the security policy that the VM will run under, and provides
> > > audit records to log what resources are being assigned to the VM. From 
> > > that
> > > point onwards, we must be able to guarantee that MAC will be enforced on
> > > the VM, according to what we logged via the auditd system.
> > > 
> > > In the case where we delegate opening of the files to qemu-fe, and allow
> > > its policy to open NFS files, we no longer have a guarentee that the MAC
> > > policy will be enforced as we originally intended. Yes, qemu-fe will very
> > > likely honour what we tell it and open the correct files, and yes qmeu-fe
> > > has lower attack surface wrt the guest than the real qemu does, but we
> > > still loose the guarentee of MAC enforcement from libvirt's POV.
> > 
> > On the other hand, from a qemu POV libvirt is only one possible
> > management tool. In practice, another very popular "management tool" for
> > qemu is bash. With qemu-fe all the other tools, including direct
> > invocation from the command line, would get some protection, too.
> 
> That's why I said a qemu-fe like tool need not be mutually exclusive
> with exposing FD passing to a management tool like libvirt. Both
> qemu-fe and libvirt need to some mechanism to pass open FDs to the
> real QEMU.  We could needlessly invent a new communication channel
> between qemu-fe and qemu that only it can use, or we can use the
> channel we already have - QMP - to provide an interface that either
> qemu-fe or libvirt, can use to pass FDs into the real QEMU.

Or to put it another way...

It should be possible to build a 'qemu-fe' tool which does sandboxing
using solely the QEMU command line + QMP monitor. If this is not possible
then, IMHO, QMP should be considered incomplete / a failure, or it may
point to other holes in QEMU's mgmt app APIs, e.g. perhaps it would
demonstrate that we do in fact need a libblockdriver.so for mgmt apps to
query info about disks.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [Qemu-devel] [PATCH 2/2] Added target to build libvdisk

2011-08-23 Thread Daniel P. Berrange
On Mon, Aug 22, 2011 at 02:29:00PM -0500, Anthony Liguori wrote:
> On 08/22/2011 12:06 PM, Saggi Mizrahi wrote:
> >libvdisk is a library that packages qemu's handling of disk images. This
> >allows for other programs to link to it and get access to qemu image
> >file abstractions.
> >
> >To use install the lib and #include
> >all the bdrv_* functions work as expected.
> >
> >Signed-off-by: Saggi Mizrahi
> 
> It's a good idea in principle but the approach is far too naive.
> 
> The block layer needs a good bit of modularization first.  Test
> cases need to be written, and most importantly, using the library
> shouldn't require writing a bunch of dummy functions.

There are also licensing practicalities which can cause us
some trouble/pain. For example:

> >  ##
> >+# libvdisk
> >+
> >+vdisk-obj-y = $(block-obj-y)
> >+
> >+vdisk-obj-y += qemu-tool.o qemu-error.o
> >+vdisk-obj-y += $(oslib-obj-y) $(trace-obj-y) $(block-obj-y)
> >+vdisk-obj-y += $(qobject-obj-y) $(version-obj-y) qemu-timer-common.o

$(block-obj-y) pulls in 'aio.o', which is built from aio.c, which
is licensed "GPLv2 only". So even though many files are BSD
licensed, the combined work will be GPLv2-only. Unfortunately ending
up with a libqemublock.so which is GPLv2-only is as good as useless
for libs/apps since it is incompatible with both LGPLv2(+) and GPLv3.

Now in this case aio.c is labelled as Copyright IBM / Anthony,
so IBM could likely resolve this licensing to be more widely
compatible. This could^H^Hwould become a non-trivial task if we
need to look at many files & then also any patches accepted to
those files from 3rd parties over the years :-(

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [Qemu-devel] [PATCH 2/2] Added target to build libvdisk

2011-08-23 Thread Anthony Liguori

On 08/23/2011 11:12 AM, Daniel P. Berrange wrote:

On Mon, Aug 22, 2011 at 02:29:00PM -0500, Anthony Liguori wrote:

On 08/22/2011 12:06 PM, Saggi Mizrahi wrote:

libvdisk is a library that packages qemu's handling of disk images. This
allows for other programs to link to it and get access to qemu image
file abstractions.

To use install the lib and #include
all the bdrv_* functions work as expected.

Signed-off-by: Saggi Mizrahi


It's a good idea in principle but the approach is far too naive.

The block layer needs a good bit of modularization first.  Test
cases need to be written, and most importantly, using the library
shouldn't require writing a bunch of dummy functions.


There are also licensing practicalities which can cause us
some trouble/pain. For example:


  ##
+# libvdisk
+
+vdisk-obj-y = $(block-obj-y)
+
+vdisk-obj-y += qemu-tool.o qemu-error.o
+vdisk-obj-y += $(oslib-obj-y) $(trace-obj-y) $(block-obj-y)
+vdisk-obj-y += $(qobject-obj-y) $(version-obj-y) qemu-timer-common.o


$(block-obj-y) pulls in  'aio.o' which is built from aio.c which
is licensed  "GPLv2 only". So even those many files are BSD
licenses, the combined work will be GPLv2-only. Unfortunately ending
up with a libqemublock.so which is GPLv2-only is as good as useless
for libs/apps since it is incompatible with both LGPLv2(+) and GPLv3.

Now in this case aio.c is labelled as Copyright IBM / Anthony,
so IBM could likely resolve this licensing to be more widely
compatible. This could^H^Hwould become a non-trivial task if we
need to look at many files&  then also any patches accepted to
those files from 3rd parties over the years :-(


If there was a block driver library, I would expect it to be GPL, not LGPL.

Regards,

Anthony Liguori



Regards,
Daniel





Re: [Qemu-devel] [libvirt] [PATCH v4] Add support for fd: protocol

2011-08-23 Thread Corey Bryant

On 08/23/2011 11:50 AM, Kevin Wolf wrote:

Am 23.08.2011 17:26, schrieb Daniel P. Berrange:

>  On Tue, Aug 23, 2011 at 11:13:34AM -0400, Corey Bryant wrote:

>>
>>
>>  On 08/22/2011 02:39 PM, Blue Swirl wrote:

>>>  On Mon, Aug 22, 2011 at 5:42 PM, Corey Bryant   
wrote:

>
>
>On 08/22/2011 01:25 PM, Anthony Liguori wrote:

>>>
>>>On 08/22/2011 11:50 AM, Daniel P. Berrange wrote:

>
>On Mon, Aug 22, 2011 at 11:29:12AM -0500, Anthony Liguori wrote:

>>>
>>>I don't think it makes sense to have qemu-fe do dynamic 
labelling.
>>>You certainly could avoid the fd passing by having qemu-fe do the
>>>open though and just let qemu-fe run without the restricted 
security
>>>context.

>
>qemu-fe would also not be entirely simple,

>>>
>>>Indeed.
>>>

>
>I do like the idea of a privileged qemu-fe performing the open and 
passing
>the fd to a restricted qemu.

>>>  Me too.
>>>

>  However, I get the impression that this won't
>get delivered nearly as quickly as fd: passing could be.  How soon do 
we
>need image isolation for NFS?
>
>Btw, this sounds similar to what Blue Swirl recommended here on v1 of 
this
>
patch:http://lists.gnu.org/archive/html/qemu-devel/2011-05/msg02187.html

>>>  I was thinking about simply doing fork() + setuid() at some point and
>>>  using the FD passing structures directly. But would it bring
>>>  advantages to have two separate executables, are they different from
>>>  access control point of view vs. single but forked one?
>>>

>>
>>  We could put together an SELinux policy that would transition
>>  qemu-fe to a more restricted domain (ie. no open privilege on NFS
>>  files) when it executes qemu-system-x86_64.

>
>  Thinking about this some more, I don't really think the idea of delegating
>  open of NFS files to a separate qemu-fe is very desirable. Libvirt makes the
>  decision on the security policy that the VM will run under, and provides
>  audit records to log what resources are being assigned to the VM. From that
>  point onwards, we must be able to guarantee that MAC will be enforced on
>  the VM, according to what we logged via the auditd system.
>
>  In the case where we delegate opening of the files to qemu-fe, and allow
>  its policy to open NFS files, we no longer have a guarentee that the MAC
>  policy will be enforced as we originally intended. Yes, qemu-fe will very
>  likely honour what we tell it and open the correct files, and yes qmeu-fe
>  has lower attack surface wrt the guest than the real qemu does, but we
>  still loose the guarentee of MAC enforcement from libvirt's POV.

On the other hand, from a qemu POV libvirt is only one possible
management tool. In practice, another very popular "management tool" for
qemu is bash. With qemu-fe all the other tools, including direct
invocation from the command line, would get some protection, too.

Kevin


True, but like you said it provides just some protection.  To really be 
useful qemu-fe would need the ability to label qemu guest processes and 
image files to provide MAC isolation.


--
Regards,
Corey



Re: [Qemu-devel] [PATCH 2/2] Added target to build libvdisk

2011-08-23 Thread Daniel P. Berrange
On Tue, Aug 23, 2011 at 11:14:20AM -0500, Anthony Liguori wrote:
> On 08/23/2011 11:12 AM, Daniel P. Berrange wrote:
> >$(block-obj-y) pulls in  'aio.o' which is built from aio.c which
> >is licensed  "GPLv2 only". So even those many files are BSD
> >licenses, the combined work will be GPLv2-only. Unfortunately ending
> >up with a libqemublock.so which is GPLv2-only is as good as useless
> >for libs/apps since it is incompatible with both LGPLv2(+) and GPLv3.
> >
> >Now in this case aio.c is labelled as Copyright IBM / Anthony,
> >so IBM could likely resolve this licensing to be more widely
> >compatible. This could^H^Hwould become a non-trivial task if we
> >need to look at many files & then also any patches accepted to
> >those files from 3rd parties over the years :-(
> 
> If there was a block driver library, I would expect it to be GPL, not LGPL.

This would prevent us from using it in libvirt, unless we wrote a
helper program which we spawned anytime we wanted to use some
library functionality :-(

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [Qemu-devel] [PATCH 2/2] Added target to build libvdisk

2011-08-23 Thread Anthony Liguori

On 08/23/2011 11:18 AM, Daniel P. Berrange wrote:

On Tue, Aug 23, 2011 at 11:14:20AM -0500, Anthony Liguori wrote:

On 08/23/2011 11:12 AM, Daniel P. Berrange wrote:

$(block-obj-y) pulls in  'aio.o' which is built from aio.c which
is licensed  "GPLv2 only". So even those many files are BSD
licenses, the combined work will be GPLv2-only. Unfortunately ending
up with a libqemublock.so which is GPLv2-only is as good as useless
for libs/apps since it is incompatible with both LGPLv2(+) and GPLv3.

Now in this case aio.c is labelled as Copyright IBM / Anthony,
so IBM could likely resolve this licensing to be more widely
compatible. This could^H^Hwould become a non-trivial task if we
need to look at many files & then also any patches accepted to
those files from 3rd parties over the years :-(


If there was a block driver library, I would expect it to be GPL, not LGPL.


This would prevent us from using it in libvirt, unless we wrote a
helper program which we spawned anytime we wanted to use some
library functionality :-(


libvirtd is GPL, no?

But QEMU is GPL.  Libraries derived from QEMU will also be GPL.

Regards,

Anthony Liguori



Regards,
Daniel





Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-23 Thread Alex Williamson
On Tue, 2011-08-23 at 12:38 +1000, David Gibson wrote:
> On Mon, Aug 22, 2011 at 09:45:48AM -0600, Alex Williamson wrote:
> > On Mon, 2011-08-22 at 15:55 +1000, David Gibson wrote:
> > > On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote:
> > > > We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> > > > capture the plan that I think we agreed to:
> > > > 
> > > > We need to address both the description and enforcement of device
> > > > groups.  Groups are formed any time the iommu does not have resolution
> > > > between a set of devices.  On x86, this typically happens when a
> > > > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > > > Power, partitionable endpoints define a group.  Grouping information
> > > > needs to be exposed for both userspace and kernel internal usage.  This
> > > > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> > > > 
> > > > # cat /sys/devices/pci:00/:00:19.0/iommu_group
> > > > 42
> > > > 
> > > > (I use a PCI example here, but attribute should not be PCI specific)
> > > 
> > > Ok.  Am I correct in thinking these group IDs are representing the
> > > minimum granularity, and are therefore always static, defined only by
> > > the connected hardware, not by configuration?
> > 
> > Yes, that's the idea.  An open question I have towards the configuration
> > side is whether we might add iommu driver specific options to the
> > groups.  For instance on x86 where we typically have B:D.F granularity,
> > should we have an option not to trust multi-function devices and use a
> > B:D granularity for grouping?
> 
> Right.  And likewise I can see a place for configuration parameters
> like the present 'allow_unsafe_irqs'.  But these would be more-or-less
> global options which affected the overall granularity, rather than
> detailed configuration such as explicitly binding some devices into a
> group, yes?

Yes, currently the interrupt remapping support is a global iommu
capability.  I suppose it's possible that this could be an iommu option,
where the iommu driver would not advertise a group if the interrupt
remapping constraint isn't met.

> > > > From there we have a few options.  In the BoF we discussed a model where
> > > > binding a device to vfio creates a /dev/vfio$GROUP character device
> > > > file.  This "group" fd provides dma mapping ioctls as well as
> > > > ioctls to enumerate and return a "device" fd for each attached member of
> > > > the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> > > > returning an error on open() of the group fd if there are members of the
> > > > group not bound to the vfio driver.  Each device fd would then support a
> > > > similar set of ioctls and mapping (mmio/pio/config) interface as current
> > > > vfio, except for the obvious domain and dma ioctls superseded by the
> > > > group fd.
> > > 
> > > It seems a slightly strange distinction that the group device appears
> > > when any device in the group is bound to vfio, but only becomes usable
> > > when all devices are bound.
> > > 
> > > > Another valid model might be that /dev/vfio/$GROUP is created for all
> > > > groups when the vfio module is loaded.  The group fd would allow open()
> > > > and some set of iommu querying and device enumeration ioctls, but would
> > > > error on dma mapping and retrieving device fds until all of the group
> > > > devices are bound to the vfio driver.
> > > 
> > > Which is why I marginally prefer this model, although it's not a big
> > > deal.
> > 
> > Right, we can also combine models.  Binding a device to vfio
> > creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
> > device access until all the group devices are also bound.  I think
> > the /dev/vfio/$GROUP might help provide an enumeration interface as well
> > though, which could be useful.
> 
> I'm not entirely sure what you mean here.  But, that's now several
> weak votes in favour of the always-present group devices, and none in
> favour of the created-when-first-device-bound model, so I suggest we
> take the /dev/vfio/$GROUP as our tentative approach.

Yep

> > > > In either case, the uiommu interface is removed entirely since dma
> > > > mapping is done via the group fd.  As necessary in the future, we can
> > > > define a more high performance dma mapping interface for streaming dma
> > > > via the group fd.  I expect we'll also include architecture specific
> > > > group ioctls to describe features and capabilities of the iommu.  The
> > > > group fd will need to prevent concurrent open()s to maintain a 1:1 group
> > > > to userspace process ownership model.
> > > 
> > > A 1:1 group<->process correspondence seems wrong to me. But there are
> > > many ways you could legitimately write the userspace side of the code,
> > > many of them involving some sort of concurrency.  Implementing that
> > > concurrency as multiple processes (using explicit shared memory 

Re: [Qemu-devel] [PATCH 4/4] qmp: add query-block-jobs

2011-08-23 Thread Adam Litke
On Tue, 2011-08-23 at 13:58 +0100, Stefan Hajnoczi wrote:
> diff --git a/blockdev.c b/blockdev.c
> index 036b7eb..e9098f6 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -835,3 +835,13 @@ int do_block_job_cancel(Monitor *mon, const QDict 
> *params, QObject **ret_data)
>  qerror_report(QERR_DEVICE_NOT_ACTIVE, device);
>  return -1;
>  }
> +
> +void monitor_print_block_jobs(Monitor *mon, const QObject *data)
> +{
> +monitor_printf(mon, "No active jobs\n");
> +}
> +
> +void do_info_block_jobs(Monitor *mon, QObject **ret_data)
> +{
> +*ret_data = QOBJECT(qdict_new());
> +}

This should return an empty qlist, not a qdict.
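
Something along these lines would do it -- a minimal sketch of the suggested
change, reusing the existing qlist_new() helper from qlist.h:

    void do_info_block_jobs(Monitor *mon, QObject **ret_data)
    {
        /* query-block-jobs reports a list of jobs; with none active it is empty */
        *ret_data = QOBJECT(qlist_new());
    }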

-- 
Thanks,
Adam




[Qemu-devel] [PATCH 2/3] rbd: allow client id to be specified in config string

2011-08-23 Thread Sage Weil
Allow the client id to be specified in the config string via 'id=' so that
users can control who they authenticate as.  Currently they are stuck with
the default ('admin').  This is necessary for anyone using authentication
in their environment.
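
For illustration only (pool, image and client names below are made up), the id
is passed as one of the options in the image filename, e.g.:

    qemu-img create -f rbd rbd:rbd/myimage:id=qemuuser:conf=/etc/ceph/ceph.conf 1G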

Signed-off-by: Sage Weil 
---
 block/rbd.c |   52 
 1 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 52b79fa..e239e04 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -169,6 +169,34 @@ done:
 return ret;
 }
 
+static char *qemu_rbd_parse_clientname(const char *conf, char *clientname)
+{
+const char *p = conf;
+
+while (*p) {
+int len;
+const char *end = strchr(p, ':');
+
+if (end) {
+len = end - p;
+} else {
+len = strlen(p);
+}
+
+if (strncmp(p, "id=", 3) == 0) {
+len -= 3;
+strncpy(clientname, p + 3, len);
+clientname[len] = '\0';
+return clientname;
+}
+if (end == NULL) {
+break;
+}
+p = end + 1;
+}
+return NULL;
+}
+
 static int qemu_rbd_set_conf(rados_t cluster, const char *conf)
 {
 char *p, *buf;
@@ -198,17 +226,19 @@ static int qemu_rbd_set_conf(rados_t cluster, const char 
*conf)
 break;
 }
 
-if (strcmp(name, "conf")) {
-ret = rados_conf_set(cluster, name, value);
+if (strcmp(name, "conf") == 0) {
+ret = rados_conf_read_file(cluster, value);
 if (ret < 0) {
-error_report("invalid conf option %s", name);
-ret = -EINVAL;
+error_report("error reading conf file %s", value);
 break;
 }
+} else if (strcmp(name, "id") == 0) {
+/* ignore, this is parsed by qemu_rbd_parse_clientname() */
 } else {
-ret = rados_conf_read_file(cluster, value);
+ret = rados_conf_set(cluster, name, value);
 if (ret < 0) {
-error_report("error reading conf file %s", value);
+error_report("invalid conf option %s", name);
+ret = -EINVAL;
 break;
 }
 }
@@ -227,6 +257,8 @@ static int qemu_rbd_create(const char *filename, 
QEMUOptionParameter *options)
 char name[RBD_MAX_IMAGE_NAME_SIZE];
 char snap_buf[RBD_MAX_SNAP_NAME_SIZE];
 char conf[RBD_MAX_CONF_SIZE];
+char clientname_buf[RBD_MAX_CONF_SIZE];
+char *clientname;
 rados_t cluster;
 rados_ioctx_t io_ctx;
 int ret;
@@ -259,7 +291,8 @@ static int qemu_rbd_create(const char *filename, 
QEMUOptionParameter *options)
 options++;
 }
 
-if (rados_create(&cluster, NULL) < 0) {
+clientname = qemu_rbd_parse_clientname(conf, clientname_buf);
+if (rados_create(&cluster, clientname) < 0) {
 error_report("error initializing");
 return -EIO;
 }
@@ -385,6 +418,8 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 char pool[RBD_MAX_POOL_NAME_SIZE];
 char snap_buf[RBD_MAX_SNAP_NAME_SIZE];
 char conf[RBD_MAX_CONF_SIZE];
+char clientname_buf[RBD_MAX_CONF_SIZE];
+char *clientname;
 int r;
 
 if (qemu_rbd_parsename(filename, pool, sizeof(pool),
@@ -394,7 +429,8 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 return -EINVAL;
 }
 
-r = rados_create(&s->cluster, NULL);
+clientname = qemu_rbd_parse_clientname(conf, clientname_buf);
+r = rados_create(&s->cluster, clientname);
 if (r < 0) {
 error_report("error initializing");
 return r;
-- 
1.7.0




[Qemu-devel] [PATCH 3/3] rbd: clean up, fix style

2011-08-23 Thread Sage Weil
No assignment in condition.  Remove duplicate ret > 0 check.

Signed-off-by: Sage Weil 
---
 block/rbd.c |   17 -
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index e239e04..5423a2d 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -391,15 +391,14 @@ static void qemu_rbd_aio_event_reader(void *opaque)
 char *p = (char *)&s->event_rcb;
 
 /* now read the rcb pointer that was sent from a non qemu thread */
-if ((ret = read(s->fds[RBD_FD_READ], p + s->event_reader_pos,
-sizeof(s->event_rcb) - s->event_reader_pos)) > 0) {
-if (ret > 0) {
-s->event_reader_pos += ret;
-if (s->event_reader_pos == sizeof(s->event_rcb)) {
-s->event_reader_pos = 0;
-qemu_rbd_complete_aio(s->event_rcb);
-s->qemu_aio_count--;
-}
+ret = read(s->fds[RBD_FD_READ], p + s->event_reader_pos,
+   sizeof(s->event_rcb) - s->event_reader_pos);
+if (ret > 0) {
+s->event_reader_pos += ret;
+if (s->event_reader_pos == sizeof(s->event_rcb)) {
+s->event_reader_pos = 0;
+qemu_rbd_complete_aio(s->event_rcb);
+s->qemu_aio_count--;
 }
 }
 } while (ret < 0 && errno == EINTR);
-- 
1.7.0




[Qemu-devel] [PATCH 1/3] rbd: fix leak in failure path

2011-08-23 Thread Sage Weil
Fix leak of s->snap when rados_create fails.

Reported-by: Stefan Hajnoczi 
Signed-off-by: Sage Weil 
---
 block/rbd.c |9 +
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index d5659cd..52b79fa 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -393,10 +393,6 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
conf, sizeof(conf)) < 0) {
 return -EINVAL;
 }
-s->snap = NULL;
-if (snap_buf[0] != '\0') {
-s->snap = qemu_strdup(snap_buf);
-}
 
 r = rados_create(&s->cluster, NULL);
 if (r < 0) {
@@ -404,6 +400,11 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 return r;
 }
 
+s->snap = NULL;
+if (snap_buf[0] != '\0') {
+s->snap = qemu_strdup(snap_buf);
+}
+
 if (strstr(conf, "conf=") == NULL) {
 r = rados_conf_read_file(s->cluster, NULL);
 if (r < 0) {
-- 
1.7.0




Re: [Qemu-devel] [PATCH 1/4] qmp: add block_stream command

2011-08-23 Thread Adam Litke
Under libvirt, I get the following error when trying to start a block
stream:

qerror: bad call in function 'do_block_stream':
qerror: -> error format '{ 'class': 'NotSupported', 'data': {} }' not found
qerror: call at blockdev.c:808
2011-08-23 11:30:09.974: shutting down

Is this patch missing a part of the qerror definition?
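
For reference, qerror.c keeps a static QErrorStringTable mapping each error
format to a description, so my guess is the missing piece is an entry roughly
like this (a sketch only -- the desc text is an assumption, not from the patch):

    {
        .error_fmt = QERR_NOT_SUPPORTED,
        .desc      = "Not supported",
    },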

On Tue, 2011-08-23 at 13:58 +0100, Stefan Hajnoczi wrote:
> This patch introduces the block_stream HMP/QMP command.  It is currently
> unimplemented and returns the 'NotSupported' error.
> 
> block_stream
> 
> 
> Copy data from a backing file into a block device.
> 
> The block streaming operation is performed in the background until the
> entire backing file has been copied.  This command returns immediately
> once streaming has started.  The status of ongoing block streaming
> operations can be checked with query-block-jobs.  The operation can be
> stopped before it has completed using the block_job_cancel command.
> 
> If a base file is specified then sectors are not copied from that base
> file and its backing chain.  When streaming completes the image file
> will have the base file as its backing file.  This can be used to stream
> a subset of the backing file chain instead of flattening the entire
> image.
> 
> On successful completion the image file is updated to drop the backing
> file.
> 
> Arguments:
> 
> - device: device name (json-string)
> - base:   common backing file (json-string, optional)
> 
> Errors:
> 
> DeviceInUse:    streaming is already active on this device
> DeviceNotFound: device name is invalid
> NotSupported:   image streaming is not supported by this device
> 
> Events:
> 
> On completion the BLOCK_JOB_COMPLETED event is raised with the following
> fields:
> 
> - type: job type ("stream" for image streaming, json-string)
> - device:   device name (json-string)
> - end:  maximum progress value (json-int)
> - position: current progress value (json-int)
> - speed:rate limit, bytes per second (json-int)
> - error:error message (json-string, only on error)
> 
> The completion event is raised both on success and on failure.  On
> success position is equal to end.  On failure position and end can be
> used to indicate at which point the operation failed.
> 
> On failure the error field contains a human-readable error message.
> There are no semantics other than that streaming has failed and clients
> should not try to interpret the error string.
> 
> Examples:
> 
> -> { "execute": "block_stream", "arguments": { "device": "virtio0" } }
> <- { "return":  {} }
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  blockdev.c  |   18 +++
>  blockdev.h  |1 +
>  hmp-commands.hx |   14 +++
>  monitor.c   |3 ++
>  monitor.h   |1 +
>  qerror.h|3 ++
>  qmp-commands.hx |   65 
> +++
>  7 files changed, 105 insertions(+), 0 deletions(-)
> 
> diff --git a/blockdev.c b/blockdev.c
> index d272659..208bfc9 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -790,3 +790,21 @@ int do_block_resize(Monitor *mon, const QDict *qdict, 
> QObject **ret_data)
> 
>  return 0;
>  }
> +
> +int do_block_stream(Monitor *mon, const QDict *params, QObject **ret_data)
> +{
> +const char *device = qdict_get_str(params, "device");
> +BlockDriverState *bs;
> +
> +bs = bdrv_find(device);
> +if (!bs) {
> +qerror_report(QERR_DEVICE_NOT_FOUND, device);
> +return -1;
> +}
> +
> +/* This command is not yet implemented.  The device not found check above
> + * is done so that error ordering will not change when fully implemented.
> + */
> +qerror_report(QERR_NOT_SUPPORTED);
> +return -1;
> +}
> diff --git a/blockdev.h b/blockdev.h
> index 3587786..ad98d37 100644
> --- a/blockdev.h
> +++ b/blockdev.h
> @@ -65,5 +65,6 @@ int do_change_block(Monitor *mon, const char *device,
>  int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data);
>  int do_snapshot_blkdev(Monitor *mon, const QDict *qdict, QObject **ret_data);
>  int do_block_resize(Monitor *mon, const QDict *qdict, QObject **ret_data);
> +int do_block_stream(Monitor *mon, const QDict *qdict, QObject **ret_data);
> 
>  #endif
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index 0ccfb28..2a16fd9 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -70,6 +70,20 @@ but should be used with extreme caution.  Note that this 
> command only
>  resizes image files, it can not resize block devices like LVM volumes.
>  ETEXI
> 
> +{
> +.name   = "block_stream",
> +.args_type  = "device:B,base:s?",
> +.params = "device [base]",
> +.help   = "copy data from a backing file into a block device",
> +.user_print = monitor_user_noop,
> +.mhandler.cmd_new = do_block_stream,
> +},
> +
> +STEXI
> +@item block_stream
> +@findex block_stream
> +Copy data from a backing file into a block device.
> +

Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-23 Thread aafabbri



On 8/23/11 4:04 AM, "Joerg Roedel"  wrote:

> On Mon, Aug 22, 2011 at 08:52:18PM -0400, aafabbri wrote:
>> You have to enforce group/iommu domain assignment whether you have the
>> existing uiommu API, or if you change it to your proposed
>> ioctl(inherit_iommu) API.
>> 
>> The only change needed to VFIO here should be to make uiommu fd assignment
>> happen on the groups instead of on device fds.  That operation fails or
>> succeeds according to the group semantics (all-or-none assignment/same
>> uiommu).
> 
> That makes uiommu basically the same as the meta-groups, right?

Yes, functionality seems the same, thus my suggestion to keep uiommu
explicit.  Is there some need for group-groups besides defining sets of
groups which share IOMMU resources?

I do all this stuff (bringing up sets of devices which may share IOMMU
domain) dynamically from C applications.  I don't really want some static
(boot-time or sysfs fiddling) supergroup config unless there is a good
reason KVM/power needs it.

As you say in your next email, doing it all from ioctls is very easy,
programmatically.

-Aaron Fabbri




Re: [Qemu-devel] [PATCH 0/3] PPC: E500: Support SPE guest

2011-08-23 Thread Jason Wessel
On 08/22/2011 11:55 PM, Alexander Graf wrote:
> When I bumped into Jason on Linuxcon, we tried out to run the e500 target
> on his Windriver build that used SPE and immediately ran into emulation
> issues. Fortunately there weren't too many, so here are the patches to get
> a guest using SPE instructions working just fine.
>

The patch set looks good to me.  I tried it out this morning.

I have a patch that implements enough of the dbcr0, dbsr, and msr DE bit in
order to single step, but I am seeing random fatal mmu faults.   Before we go 
down the route of implementing more pieces, I am interested to know if you see 
the same behavior, or if you had any ideas around how to further debug it.

Using just your patch series + the QEMU HEAD + the SPE enabled rootfs + qemu 
NFS mounting the rootfs, here is the "litmus test" to see if you experience the 
fatal mmu faults to user space.

ulimit -c unlimited

while [ 1 ] ; do
   echo 3 > /proc/sys/vm/drop_caches
   (echo quit ; sleep 2) | gdb /bin/ls || break
done


I find that this runs between 1-15 times and crashes with a core file.  Looking 
at the core file it is usually in a malloc or free operation in the user space, 
and always in the page fault handler in the kernel.  This really amounts to 
running a medium sized app through the startup, which does a whole bunch of 
malloc and free because it maps in the elf debug info for /bin/ls , and gdb is 
not really getting used for anything to ptrace.

Jason.


Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-23 Thread Alex Williamson
On Tue, 2011-08-23 at 16:54 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote:
> 
> > I'm not following you.
> > 
> > You have to enforce group/iommu domain assignment whether you have the
> > existing uiommu API, or if you change it to your proposed
> > ioctl(inherit_iommu) API.
> > 
> > The only change needed to VFIO here should be to make uiommu fd assignment
> > happen on the groups instead of on device fds.  That operation fails or
> > succeeds according to the group semantics (all-or-none assignment/same
> > uiommu).
> 
> Ok, so I missed that part where you change uiommu to operate on group
> fd's rather than device fd's, my apologies if you actually wrote that
down :-) It might be obvious ... bear with me I just flew back from the
> US and I am badly jet lagged ...

I missed it too, the model I'm proposing entirely removes the uiommu
concept.

> So I see what you mean, however...
> 
> > I think the question is: do we force 1:1 iommu/group mapping, or do we allow
> > arbitrary mapping (satisfying group constraints) as we do today.
> > 
> > I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
> > ability and definitely think the uiommu approach is cleaner than the
> > ioctl(inherit_iommu) approach.  We considered that approach before but it
> > seemed less clean so we went with the explicit uiommu context.
> 
> Possibly, the question that interest me the most is what interface will
> KVM end up using. I'm also not terribly fan with the (perceived)
> discrepancy between using uiommu to create groups but using the group fd
> to actually do the mappings, at least if that is still the plan.

Current code: uiommu creates the domain, we bind a vfio device to that
domain via a SET_UIOMMU_DOMAIN ioctl on the vfio device, then do
mappings via MAP_DMA on the vfio device (affecting all the vfio devices
bound to the domain)

My current proposal: "groups" are predefined.  groups ~= iommu domain.
The iommu domain would probably be allocated when the first device is
bound to vfio.  As each device is bound, it gets attached to the group.
DMAs are done via an ioctl on the group.
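
Roughly, as user-space pseudo-code (every ioctl name below is a placeholder
for the proposal, nothing here exists yet):

    int group = open("/dev/vfio/42", O_RDWR);       /* the group fd     */
    ioctl(group, VFIO_GROUP_MAP_DMA, &dma_map);     /* maps whole group */
    int dev = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:00:19.0");
    /* dev then exposes the usual mmio/pio/config interface */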

I think group + uiommu leads to effectively reliving most of the
problems with the current code.  The only benefit is the group
assignment to enforce hardware restrictions.  We still have the problem
that uiommu open() = iommu_domain_alloc(), whose properties are
meaningless without attached devices (groups).  Which I think leads to
the same awkward model of attaching groups to define the domain, then we
end up doing mappings via the group to enforce ordering.

> If the separate uiommu interface is kept, then anything that wants to be
> able to benefit from the ability to put multiple devices (or existing
> groups) into such a "meta group" would need to be explicitly modified to
> deal with the uiommu APIs.
> 
> I tend to prefer such "meta groups" as being something you create
> statically using a configuration interface, either via sysfs, netlink or
> ioctl's to a "control" vfio device driven by a simple command line tool
> (which can have the configuration stored in /etc and re-apply it at
> boot).

I cringe anytime there's a mention of "static".  IMHO, we have to
support hotplug.  That means "meta groups" change dynamically.  Maybe
this supports the idea that we should be able to retrieve a new fd from
the group to do mappings.  Any groups bound together will return the
same fd and the fd will persist so long as any member of the group is
open.

> That way, any program capable of exploiting VFIO "groups" will
> automatically be able to exploit those "meta groups" (or groups of
> groups) as well as long as they are supported on the system.
> 
> If we ever have system specific constraints as to how such groups can be
> created, then it can all be handled at the level of that configuration
> tool without impact on whatever programs know how to exploit them via
> the VFIO interfaces.

I'd prefer to have the constraints be represented in the ioctl to bind
groups.  It works or not and the platform gets to define what it
considers compatible.  Thanks,

Alex




Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-23 Thread Alex Williamson
On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:
> On Mon, Aug 22, 2011 at 03:17:00PM -0400, Alex Williamson wrote:
> > On Mon, 2011-08-22 at 19:25 +0200, Joerg Roedel wrote:
> 
> > > I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> > > assigned to a guest, there can also be an ioctl to bind a group to an
> > > address-space of another group (certainly needs some care to not allow
> > > that both groups belong to different processes).
> > 
> > That's an interesting idea.  Maybe an interface similar to the current
> > uiommu interface, where you open() the 2nd group fd and pass the fd via
> > ioctl to the primary group.  IOMMUs that don't support this would fail
> > the attach device callback, which would fail the ioctl to bind them.  It
> > will need to be designed so any group can be removed from the super-set
> > and the remaining group(s) still works.  This feels like something that
> > can be added after we get an initial implementation.
> 
> Handling it through fds is a good idea. This makes sure that everything
> belongs to one process. I am not really sure yet if we go the way to
> just bind plain groups together or if we create meta-groups. The
> meta-groups thing seems somewhat cleaner, though.

I'm leaning towards binding because we need to make it dynamic, but I
don't really have a good picture of the lifecycle of a meta-group.

> > > Btw, a problem we haven't talked about yet entirely is
> > > driver-deassignment. User space can decide to de-assign the device from
> > > vfio while a fd is open on it. With PCI there is no way to let this fail
> > > (the .release function returns void last time I checked). Is this a
> > > problem, and if yes, how do we handle that?
> > 
> > The current vfio has the same problem, we can't unbind a device from
> > vfio while it's attached to a guest.  I think we'd use the same solution
> > too; send out a netlink packet for a device removal and have the .remove
> > call sleep on a wait_event(, refcnt == 0).  We could also set a timeout
> > and SIGBUS the PIDs holding the device if they don't return it
> > willingly.  Thanks,
> 
> Putting the process to sleep (which would be uninterruptible) seems bad.
> The process would sleep until the guest releases the device-group, which
> can take days or months.
> The best thing (and the most intrusive :-) ) is to change PCI core to
> allow unbindings to fail, I think. But this probably further complicates
> the way to upstream VFIO...

Yes, it's not ideal but I think it's sufficient for now and if we later
get support for returning an error from release, we can set a timeout
after notifying the user to make use of that.  Thanks,

Alex




Re: [Qemu-devel] [PATCH 2/6] tcg: Always define all of the TCGOpcode enum members.

2011-08-23 Thread Peter Maydell
On 17 August 2011 22:11, Richard Henderson  wrote:
> By always defining these symbols, we can eliminate a lot of ifdefs.
>
> To allow this to be checked reliably, the semantics of the
> TCG_TARGET_HAS_* macros must be changed from def/undef to true/false.
> This allows even more ifdefs to be removed, converting them into
> C if statements.

This breaks x86-64 hosts if built with --enable-debug:

petmay01@LinaroE102767:~/git/qemu$ ./arm-softmmu/qemu-system-arm
Missing op definition for div_i32
Missing op definition for divu_i32
Missing op definition for rem_i32
Missing op definition for remu_i32
Missing op definition for deposit_i32
Missing op definition for add2_i32
Missing op definition for sub2_i32
Missing op definition for brcond2_i32
Missing op definition for mulu2_i32
Missing op definition for setcond2_i32
Missing op definition for andc_i32
Missing op definition for orc_i32
Missing op definition for eqv_i32
Missing op definition for nand_i32
Missing op definition for nor_i32
Missing op definition for div_i64
Missing op definition for divu_i64
Missing op definition for rem_i64
Missing op definition for remu_i64
Missing op definition for deposit_i64
Missing op definition for andc_i64
Missing op definition for orc_i64
Missing op definition for eqv_i64
Missing op definition for nand_i64
Missing op definition for nor_i64
/home/petmay01/git/qemu/tcg/tcg.c:1148: tcg fatal error
Aborted

A compile-time check that the tcg target has #defined all the
TCG_TARGET_HAS_foo to 0/1 and not left any undefined might be
useful?

[thanks to mmu_man on irc for the report.]

-- PMM



Re: [Qemu-devel] [PATCH] sheepdog: use coroutines

2011-08-23 Thread MORITA Kazutaka
At Tue, 23 Aug 2011 14:29:50 +0200,
Kevin Wolf wrote:
> 
> Am 12.08.2011 14:33, schrieb MORITA Kazutaka:
> > This makes the sheepdog block driver support bdrv_co_readv/writev
> > instead of bdrv_aio_readv/writev.
> > 
> > With this patch, Sheepdog network I/O becomes fully asynchronous.  The
> > block driver yields back when send/recv returns EAGAIN, and is resumed
> > when the sheepdog network connection is ready for the operation.
> > 
> > Signed-off-by: MORITA Kazutaka 
> > ---
> >  block/sheepdog.c |  150 
> > +
> >  1 files changed, 93 insertions(+), 57 deletions(-)
> > 
> > diff --git a/block/sheepdog.c b/block/sheepdog.c
> > index e150ac0..e283c34 100644
> > --- a/block/sheepdog.c
> > +++ b/block/sheepdog.c
> > @@ -274,7 +274,7 @@ struct SheepdogAIOCB {
> >  int ret;
> >  enum AIOCBState aiocb_type;
> >  
> > -QEMUBH *bh;
> > +Coroutine *coroutine;
> >  void (*aio_done_func)(SheepdogAIOCB *);
> >  
> >  int canceled;
> > @@ -295,6 +295,10 @@ typedef struct BDRVSheepdogState {
> >  char *port;
> >  int fd;
> >  
> > +CoMutex lock;
> > +Coroutine *co_send;
> > +Coroutine *co_recv;
> > +
> >  uint32_t aioreq_seq_num;
> >  QLIST_HEAD(outstanding_aio_head, AIOReq) outstanding_aio_head;
> >  } BDRVSheepdogState;
> > @@ -346,19 +350,16 @@ static const char * sd_strerror(int err)
> >  /*
> >   * Sheepdog I/O handling:
> >   *
> > - * 1. In the sd_aio_readv/writev, read/write requests are added to the
> > - *QEMU Bottom Halves.
> > - *
> > - * 2. In sd_readv_writev_bh_cb, the callbacks of BHs, we send the I/O
> > - *requests to the server and link the requests to the
> > - *outstanding_list in the BDRVSheepdogState.  we exits the
> > - *function without waiting for receiving the response.
> > + * 1. In sd_co_rw_vector, we send the I/O requests to the server and
> > + *link the requests to the outstanding_list in the
> > + *BDRVSheepdogState.  The function exits without waiting for
> > + *receiving the response.
> >   *
> > - * 3. We receive the response in aio_read_response, the fd handler to
> > + * 2. We receive the response in aio_read_response, the fd handler to
> >   *the sheepdog connection.  If metadata update is needed, we send
> >   *the write request to the vdi object in sd_write_done, the write
> > - *completion function.  The AIOCB callback is not called until all
> > - *the requests belonging to the AIOCB are finished.
> > + *completion function.  We switch back to sd_co_readv/writev after
> > + *all the requests belonging to the AIOCB are finished.
> >   */
> >  
> >  static inline AIOReq *alloc_aio_req(BDRVSheepdogState *s, SheepdogAIOCB 
> > *acb,
> > @@ -398,7 +399,7 @@ static inline int free_aio_req(BDRVSheepdogState *s, 
> > AIOReq *aio_req)
> >  static void sd_finish_aiocb(SheepdogAIOCB *acb)
> >  {
> >  if (!acb->canceled) {
> > -acb->common.cb(acb->common.opaque, acb->ret);
> > +qemu_coroutine_enter(acb->coroutine, NULL);
> >  }
> >  qemu_aio_release(acb);
> >  }
> > @@ -411,7 +412,8 @@ static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
> >   * Sheepdog cannot cancel the requests which are already sent to
> >   * the servers, so we just complete the request with -EIO here.
> >   */
> > -acb->common.cb(acb->common.opaque, -EIO);
> > +acb->ret = -EIO;
> > +qemu_coroutine_enter(acb->coroutine, NULL);
> >  acb->canceled = 1;
> >  }
> >  
> > @@ -435,24 +437,12 @@ static SheepdogAIOCB *sd_aio_setup(BlockDriverState 
> > *bs, QEMUIOVector *qiov,
> >  
> >  acb->aio_done_func = NULL;
> >  acb->canceled = 0;
> > -acb->bh = NULL;
> > +acb->coroutine = qemu_coroutine_self();
> >  acb->ret = 0;
> >  QLIST_INIT(&acb->aioreq_head);
> >  return acb;
> >  }
> >  
> > -static int sd_schedule_bh(QEMUBHFunc *cb, SheepdogAIOCB *acb)
> > -{
> > -if (acb->bh) {
> > -error_report("bug: %d %d", acb->aiocb_type, acb->aiocb_type);
> > -return -EIO;
> > -}
> > -
> > -acb->bh = qemu_bh_new(cb, acb);
> > -qemu_bh_schedule(acb->bh);
> > -return 0;
> > -}
> > -
> >  #ifdef _WIN32
> >  
> >  struct msghdr {
> > @@ -635,7 +625,13 @@ static int do_readv_writev(int sockfd, struct iovec 
> > *iov, int len,
> >  again:
> >  ret = do_send_recv(sockfd, iov, len, iov_offset, write);
> >  if (ret < 0) {
> > -if (errno == EINTR || errno == EAGAIN) {
> > +if (errno == EINTR) {
> > +goto again;
> > +}
> > +if (errno == EAGAIN) {
> > +if (qemu_in_coroutine()) {
> > +qemu_coroutine_yield();
> > +}
> 
> Who reenters the coroutine if we yield here?

The fd handlers, co_write_request() and co_read_response(), will call
qemu_coroutine_enter() for the coroutine.  It will be restarted after
the sheepdog network connection becomes ready.
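
Roughly like this (a simplified sketch of the idea, not the exact code from
the patch):

    static void co_read_response(void *opaque)
    {
        BDRVSheepdogState *s = opaque;

        /* the socket is readable again: resume the request coroutine that
         * yielded in do_readv_writev() when recv returned EAGAIN */
        if (s->co_recv) {
            qemu_coroutine_enter(s->co_recv, NULL);
        }
    }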

> 
> And of course for a coroutine 

Re: [Qemu-devel] [PATCH 2/6] tcg: Always define all of the TCGOpcode enum members.

2011-08-23 Thread Richard Henderson
On 08/23/2011 10:11 AM, Peter Maydell wrote:
> A compile-time check that the tcg target has #defined all the
> TCG_TARGET_HAS_foo to 0/1 and not left any undefined might be
> useful?

That compile-time check is provided by the uses in tcg-op.h.  If they're
not defined you'll get undefined symbol errors in that file.

I'll look into the problem...


r~




[Qemu-devel] [PATCH] pci: Error on PCI capability collisions

2011-08-23 Thread Jan Kiszka
From: Alex Williamson 

Nothing good can happen when we overlap capabilities

[ Jan: rebased over qemu, minor formatting ]

Signed-off-by: Jan Kiszka 
---
 hw/pci.c |   14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 6124790..ff20631 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -1952,11 +1952,25 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
uint8_t offset, uint8_t size)
 {
 uint8_t *config;
+int i;
+
 if (!offset) {
 offset = pci_find_space(pdev, size);
 if (!offset) {
 return -ENOSPC;
 }
+} else {
+for (i = offset; i < offset + size; i++) {
+if (pdev->used[i]) {
+fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
+"Attempt to add PCI capability %x at offset "
+"%x overlaps existing capability %x at offset %x\n",
+pci_find_domain(pdev->bus), pci_bus_num(pdev->bus),
+PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
+cap_id, offset, pdev->config_map[i], i);
+return -EFAULT;
+}
+}
 }
 
 config = pdev->config + offset;
-- 
1.7.3.4



Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-23 Thread Aaron Fabbri



On 8/23/11 10:01 AM, "Alex Williamson"  wrote:

> On Tue, 2011-08-23 at 16:54 +1000, Benjamin Herrenschmidt wrote:
>> On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote:
>> 
>>> I'm not following you.
>>> 
>>> You have to enforce group/iommu domain assignment whether you have the
>>> existing uiommu API, or if you change it to your proposed
>>> ioctl(inherit_iommu) API.
>>> 
>>> The only change needed to VFIO here should be to make uiommu fd assignment
>>> happen on the groups instead of on device fds.  That operation fails or
>>> succeeds according to the group semantics (all-or-none assignment/same
>>> uiommu).
>> 
>> Ok, so I missed that part where you change uiommu to operate on group
>> fd's rather than device fd's, my apologies if you actually wrote that
>> down :-) It might be obvious ... bear with me I just flew back from the
>> US and I am badly jet lagged ...
> 
> I missed it too, the model I'm proposing entirely removes the uiommu
> concept.
> 
>> So I see what you mean, however...
>> 
>>> I think the question is: do we force 1:1 iommu/group mapping, or do we allow
>>> arbitrary mapping (satisfying group constraints) as we do today.
>>> 
>>> I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
>>> ability and definitely think the uiommu approach is cleaner than the
>>> ioctl(inherit_iommu) approach.  We considered that approach before but it
>>> seemed less clean so we went with the explicit uiommu context.
>> 
>> Possibly, the question that interest me the most is what interface will
>> KVM end up using. I'm also not terribly fan with the (perceived)
>> discrepancy between using uiommu to create groups but using the group fd
>> to actually do the mappings, at least if that is still the plan.
> 
> Current code: uiommu creates the domain, we bind a vfio device to that
> domain via a SET_UIOMMU_DOMAIN ioctl on the vfio device, then do
> mappings via MAP_DMA on the vfio device (affecting all the vfio devices
> bound to the domain)
> 
> My current proposal: "groups" are predefined.  groups ~= iommu domain.

This is my main objection.  I'd rather not lose the ability to have multiple
devices (which are all predefined as singleton groups on x86 w/o PCI
bridges) share IOMMU resources.  Otherwise, 20 devices sharing buffers would
require 20x the IOMMU/ioTLB resources.  KVM doesn't care about this case?

> The iommu domain would probably be allocated when the first device is
> bound to vfio.  As each device is bound, it gets attached to the group.
> DMAs are done via an ioctl on the group.
> 
> I think group + uiommu leads to effectively reliving most of the
> problems with the current code.  The only benefit is the group
> assignment to enforce hardware restrictions.  We still have the problem
> that uiommu open() = iommu_domain_alloc(), whose properties are
> meaningless without attached devices (groups).  Which I think leads to
> the same awkward model of attaching groups to define the domain, then we
> end up doing mappings via the group to enforce ordering.

Is there a better way to allow groups to share an IOMMU domain?

Maybe, instead of having an ioctl to allow a group A to inherit the same
iommu domain as group B, we could have an ioctl to fully merge two groups
(could be what Ben was thinking):

A.ioctl(MERGE_TO_GROUP, B)

The group A now goes away and its devices join group B.  If A ever had an
iommu domain assigned (and buffers mapped?) we fail.

Groups cannot get smaller (they are defined as minimum granularity of an
IOMMU, initially).  They can get bigger if you want to share IOMMU
resources, though.

Any downsides to this approach?

-AF

> 
>> If the separate uiommu interface is kept, then anything that wants to be
>> able to benefit from the ability to put multiple devices (or existing
>> groups) into such a "meta group" would need to be explicitly modified to
>> deal with the uiommu APIs.
>> 
>> I tend to prefer such "meta groups" as being something you create
>> statically using a configuration interface, either via sysfs, netlink or
>> ioctl's to a "control" vfio device driven by a simple command line tool
>> (which can have the configuration stored in /etc and re-apply it at
>> boot).
> 
> I cringe anytime there's a mention of "static".  IMHO, we have to
> support hotplug.  That means "meta groups" change dynamically.  Maybe
> this supports the idea that we should be able to retrieve a new fd from
> the group to do mappings.  Any groups bound together will return the
> same fd and the fd will persist so long as any member of the group is
> open.
> 
>> That way, any program capable of exploiting VFIO "groups" will
>> automatically be able to exploit those "meta groups" (or groups of
>> groups) as well as long as they are supported on the system.
>> 
>> If we ever have system specific constraints as to how such groups can be
>> created, then it can all be handled at the level of that configuration
>> tool without impact on whatever programs know how to exploit t

[Qemu-devel] [PATCH] tcg: Update --enable-debug for TCG_OPF_NOT_PRESENT.

2011-08-23 Thread Richard Henderson
Signed-off-by: Richard Henderson 
---
 tcg/tcg.c |   15 ---
 1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index 06ce214..411f971 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -1128,18 +1128,19 @@ void tcg_add_target_add_op_defs(const TCGTargetOpDef 
*tdefs)
 #if defined(CONFIG_DEBUG_TCG)
 i = 0;
 for (op = 0; op < ARRAY_SIZE(tcg_op_defs); op++) {
-if (op < INDEX_op_call || op == INDEX_op_debug_insn_start) {
+const TCGOpDef *def = &tcg_op_defs[op];
+if (op < INDEX_op_call
+|| op == INDEX_op_debug_insn_start
+|| (def->flags & TCG_OPF_NOT_PRESENT)) {
 /* Wrong entry in op definitions? */
-if (tcg_op_defs[op].used) {
-fprintf(stderr, "Invalid op definition for %s\n",
-tcg_op_defs[op].name);
+if (def->used) {
+fprintf(stderr, "Invalid op definition for %s\n", def->name);
 i = 1;
 }
 } else {
 /* Missing entry in op definitions? */
-if (!tcg_op_defs[op].used) {
-fprintf(stderr, "Missing op definition for %s\n",
-tcg_op_defs[op].name);
+if (!def->used) {
+fprintf(stderr, "Missing op definition for %s\n", def->name);
 i = 1;
 }
 }
-- 
1.7.4.4




[Qemu-devel] PPC* and Sparc32 crash

2011-08-23 Thread Blue Swirl
qemu-system-ppc: /src/qemu/memory.c:1183:
memory_region_add_subregion_common: Assertion `!subregion->parent'
failed.
Aborted
qemu-system-ppc64: /src/qemu/memory.c:1183:
memory_region_add_subregion_common: Assertion `!subregion->parent'
failed.
Aborted
qemu-system-sparc: /src/qemu/hw/sysbus.c:156:
sysbus_register_withprop: Assertion `info->qdev.size >=
sizeof(SysBusDevice)' failed.
Aborted

This is with b861b7419c49ad53e786062b4fbf6da53468f130. Other targets
seem to work.
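
(For what it's worth, the sysbus assertion fires when a device registers a
SysBusDeviceInfo whose .qdev.size is unset or smaller than SysBusDevice; the
usual correct pattern looks like this sketch, with invented names:)

    static SysBusDeviceInfo example_info = {
        .init       = example_init,
        .qdev.name  = "example-device",
        .qdev.size  = sizeof(ExampleDeviceState), /* must cover SysBusDevice */
    };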



Re: [Qemu-devel] PPC* and Sparc32 crash

2011-08-23 Thread Peter Maydell
On 23 August 2011 18:55, Blue Swirl  wrote:
> qemu-system-ppc: /src/qemu/memory.c:1183:
> memory_region_add_subregion_common: Assertion `!subregion->parent'
> failed.
> Aborted
> qemu-system-ppc64: /src/qemu/memory.c:1183:
> memory_region_add_subregion_common: Assertion `!subregion->parent'
> failed.
> Aborted
> qemu-system-sparc: /src/qemu/hw/sysbus.c:156:
> sysbus_register_withprop: Assertion `info->qdev.size >=
> sizeof(SysBusDevice)' failed.
> Aborted
>
> This is with b861b7419c49ad53e786062b4fbf6da53468f130. Other targets
> seem to work.

n810 is also broken with a similar assertion to ppc, which I have a
patch in progress to fix.

-- PMM



Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-23 Thread Alex Williamson
On Tue, 2011-08-23 at 10:33 -0700, Aaron Fabbri wrote:
> 
> 
> On 8/23/11 10:01 AM, "Alex Williamson"  wrote:
> 
> > On Tue, 2011-08-23 at 16:54 +1000, Benjamin Herrenschmidt wrote:
> >> On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote:
> >> 
> >>> I'm not following you.
> >>> 
> >>> You have to enforce group/iommu domain assignment whether you have the
> >>> existing uiommu API, or if you change it to your proposed
> >>> ioctl(inherit_iommu) API.
> >>> 
> >>> The only change needed to VFIO here should be to make uiommu fd assignment
> >>> happen on the groups instead of on device fds.  That operation fails or
> >>> succeeds according to the group semantics (all-or-none assignment/same
> >>> uiommu).
> >> 
> >> Ok, so I missed that part where you change uiommu to operate on group
> >> fd's rather than device fd's, my apologies if you actually wrote that
> >> down :-) It might be obvious ... bear with me I just flew back from the
> >> US and I am badly jet lagged ...
> > 
> > I missed it too, the model I'm proposing entirely removes the uiommu
> > concept.
> > 
> >> So I see what you mean, however...
> >> 
> >>> I think the question is: do we force 1:1 iommu/group mapping, or do we 
> >>> allow
> >>> arbitrary mapping (satisfying group constraints) as we do today.
> >>> 
> >>> I'm saying I'm an existing user who wants the arbitrary iommu/group 
> >>> mapping
> >>> ability and definitely think the uiommu approach is cleaner than the
> >>> ioctl(inherit_iommu) approach.  We considered that approach before but it
> >>> seemed less clean so we went with the explicit uiommu context.
> >> 
> >> Possibly, the question that interest me the most is what interface will
> >> KVM end up using. I'm also not terribly fan with the (perceived)
> >> discrepancy between using uiommu to create groups but using the group fd
> >> to actually do the mappings, at least if that is still the plan.
> > 
> > Current code: uiommu creates the domain, we bind a vfio device to that
> > domain via a SET_UIOMMU_DOMAIN ioctl on the vfio device, then do
> > mappings via MAP_DMA on the vfio device (affecting all the vfio devices
> > bound to the domain)
> > 
> > My current proposal: "groups" are predefined.  groups ~= iommu domain.
> 
> This is my main objection.  I'd rather not lose the ability to have multiple
> devices (which are all predefined as singleton groups on x86 w/o PCI
> bridges) share IOMMU resources.  Otherwise, 20 devices sharing buffers would
> require 20x the IOMMU/ioTLB resources.  KVM doesn't care about this case?

We do care, I just wasn't prioritizing it as heavily since I think the
typical model is probably closer to 1 device per guest.

> > The iommu domain would probably be allocated when the first device is
> > bound to vfio.  As each device is bound, it gets attached to the group.
> > DMAs are done via an ioctl on the group.
> > 
> > I think group + uiommu leads to effectively reliving most of the
> > problems with the current code.  The only benefit is the group
> > assignment to enforce hardware restrictions.  We still have the problem
> > that uiommu open() = iommu_domain_alloc(), whose properties are
> > meaningless without attached devices (groups).  Which I think leads to
> > the same awkward model of attaching groups to define the domain, then we
> > end up doing mappings via the group to enforce ordering.
> 
> Is there a better way to allow groups to share an IOMMU domain?
> 
> Maybe, instead of having an ioctl to allow a group A to inherit the same
> iommu domain as group B, we could have an ioctl to fully merge two groups
> (could be what Ben was thinking):
> 
> A.ioctl(MERGE_TO_GROUP, B)
> 
> The group A now goes away and its devices join group B.  If A ever had an
> iommu domain assigned (and buffers mapped?) we fail.
> 
> Groups cannot get smaller (they are defined as minimum granularity of an
> IOMMU, initially).  They can get bigger if you want to share IOMMU
> resources, though.
> 
> Any downsides to this approach?

That's sort of the way I'm picturing it.  When groups are bound
together, they effectively form a pool, where all the groups are peers.
When the MERGE/BIND ioctl is called on group A and passed the group B
fd, A can check compatibility of the domain associated with B, unbind
devices from the B domain and attach them to the A domain.  The B domain
would then be freed and it would bump the refcnt on the A domain.  If we
need to remove A from the pool, we call UNMERGE/UNBIND on B with the A
fd, it will remove the A devices from the shared object, disassociate A
with the shared object, re-alloc a domain for A and rebind A devices to
that domain. 
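
In hypothetical calls (none of these ioctls exist yet, the names are just
placeholders for the flow described above):

    int a = open("/dev/vfio/7", O_RDWR);
    int b = open("/dev/vfio/42", O_RDWR);
    ioctl(a, VFIO_GROUP_MERGE, b);    /* B's devices join A's domain/pool */
    /* ... mappings on any group fd now act on the shared pool ... */
    ioctl(b, VFIO_GROUP_UNMERGE, a);  /* A leaves and gets a fresh domain */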

This is where it seems like it might be helpful to make a GET_IOMMU_FD
ioctl so that an iommu object is ubiquitous and persistent across the
pool.  Operations on any group fd work on the pool as a whole.  Thanks,

Alex




[Qemu-devel] [Bug 821078] Re: virtio-serial-bus: Unexpected port id

2011-08-23 Thread Rick Vernam
Using qemu-kvm-0.15.0 for a few weeks now, I'd like to report that maybe
1/10 times it works as expected: ie, there is no virtio-serial-bus
error, vdagent in the windows guest starts, etc.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/821078

Title:
  virtio-serial-bus: Unexpected port id

Status in QEMU:
  New

Bug description:
  With qemu-kvm-0.15.0-rc1 virtio-serial-bus reports an error, and windows 
vdagent can not start.  qemu-0.15.0-rc1 behaves as expected, ie vdagent runs in 
the guest, mouse passes seamlessly between spicec and host and copy/paste works 
between guest and host.
  qemu-kvm has been configured with
  ./configure --target-list=x86_64-softmmu --disable-curses  --disable-curl 
--audio-drv-list=alsa --audio-card-list=sb16,ac97,hda --enable-vnc-thread 
--disable-bluez --enable-vhost-net --enable-spice
  and is started with
  qemu-system-x86_64 -cpu host -enable-kvm -pidfile /home/rick/qemu/hds/wxp.pid 
-drive file=/home/rick/qemu/hds/wxp.raw,if=virtio,aio=native -m 1536 -name 
WinXP -net nic,model -net user -localtime -usb -vga qxl -device 
virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -chardev 
spicevmc,name=vdagent,id=vdagent -device 
virtserialport,nr=1,bus=virtio-serial0.0,chardev=vdagent,name=com.redhat.spice.0
 -spice port=1234,disable-ticketing -monitor stdio

  I've also tried starting qemu like
  qemu-system-x86_64 -cpu host -enable-kvm -pidfile /home/rick/qemu/hds/wxp.pid 
-drive file=/home/rick/qemu/hds/wxp.raw,if=virtio -m 768 -name WinXP -net 
nic,model=virtio -net user -localtime -usb -vga qxl -device virtio-serial 
-chardev spicevmc,name=vdagent,id=vdagent -device 
virtserialport,chardev=vdagent,name=com.redhat.spice.0 -spice 
port=1234,disable-ticketing -monitor stdio
  and observed the same results.

  the host runs 2.6.39.4 vanilla kernel.  the guest uses the most recent
  virtio-serial, vga-qxl and vdagent from spice-space.org

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/821078/+subscriptions



Re: [Qemu-devel] [PATCH] pci: Error on PCI capability collisions

2011-08-23 Thread Michael S. Tsirkin
On Tue, Aug 23, 2011 at 07:28:08PM +0200, Jan Kiszka wrote:
> From: Alex Williamson 
> 
> Nothing good can happen when we overlap capabilities
> 
> [ Jan: rebased over qemu, minor formatting ]
> 
> Signed-off-by: Jan Kiszka 

I'll stick an assert there instead. Normal devices
don't generate overlapping caps unless there's a bug,
and device assignment should do its own checks.

I really have a mind to rip out the used array too.

> ---
>  hw/pci.c |   14 ++
>  1 files changed, 14 insertions(+), 0 deletions(-)
> 
> diff --git a/hw/pci.c b/hw/pci.c
> index 6124790..ff20631 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -1952,11 +1952,25 @@ int pci_add_capability(PCIDevice *pdev, uint8_t 
> cap_id,
> uint8_t offset, uint8_t size)
>  {
>  uint8_t *config;
> +int i;
> +
>  if (!offset) {
>  offset = pci_find_space(pdev, size);
>  if (!offset) {
>  return -ENOSPC;
>  }
> +} else {
> +for (i = offset; i < offset + size; i++) {
> +if (pdev->used[i]) {
> +fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
> +"Attempt to add PCI capability %x at offset "
> +"%x overlaps existing capability %x at offset %x\n",
> +pci_find_domain(pdev->bus), pci_bus_num(pdev->bus),
> +PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
> +cap_id, offset, pdev->config_map[i], i);
> +return -EFAULT;
> +}
> +}
>  }
>  
>  config = pdev->config + offset;
> -- 
> 1.7.3.4



Re: [Qemu-devel] [PATCH] pci: Error on PCI capability collisions

2011-08-23 Thread Alex Williamson
On Tue, 2011-08-23 at 21:17 +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 23, 2011 at 07:28:08PM +0200, Jan Kiszka wrote:
> > From: Alex Williamson 
> > 
> > Nothing good can happen when we overlap capabilities
> > 
> > [ Jan: rebased over qemu, minor formatting ]
> > 
> > Signed-off-by: Jan Kiszka 
> 
> I'll stick an assert there instead. Normal devices
> don't generate overlapping caps unless there's a bug,
> and device assignment should do its own checks.
> 
> I really have a mind to rip out the used array too.

So you'd rather kill qemu rather than have a reasonable error return
path... great :(

Alex

> > ---
> >  hw/pci.c |   14 ++
> >  1 files changed, 14 insertions(+), 0 deletions(-)
> > 
> > diff --git a/hw/pci.c b/hw/pci.c
> > index 6124790..ff20631 100644
> > --- a/hw/pci.c
> > +++ b/hw/pci.c
> > @@ -1952,11 +1952,25 @@ int pci_add_capability(PCIDevice *pdev, uint8_t 
> > cap_id,
> > uint8_t offset, uint8_t size)
> >  {
> >  uint8_t *config;
> > +int i;
> > +
> >  if (!offset) {
> >  offset = pci_find_space(pdev, size);
> >  if (!offset) {
> >  return -ENOSPC;
> >  }
> > +} else {
> > +for (i = offset; i < offset + size; i++) {
> > +if (pdev->used[i]) {
> > +fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
> > +"Attempt to add PCI capability %x at offset "
> > +"%x overlaps existing capability %x at offset 
> > %x\n",
> > +pci_find_domain(pdev->bus), pci_bus_num(pdev->bus),
> > +PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
> > +cap_id, offset, pdev->config_map[i], i);
> > +return -EFAULT;
> > +}
> > +}
> >  }
> >  
> >  config = pdev->config + offset;
> > -- 
> > 1.7.3.4






[Qemu-devel] [PATCH] hw/omap_gpmc: Don't try to map CS0 twice on reset

2011-08-23 Thread Peter Maydell
Remove a spurious second map of the OMAP GPMC CS0 region on reset.
This fixes an assertion failure when we try to add the region to
its container when it was already added. (The old code did not
complain about mismatched map/unmap calls, but the new MemoryRegion
implementation does.)

Signed-off-by: Peter Maydell 
---
 hw/omap_gpmc.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/hw/omap_gpmc.c b/hw/omap_gpmc.c
index 922d622..673 100644
--- a/hw/omap_gpmc.c
+++ b/hw/omap_gpmc.c
@@ -135,7 +135,6 @@ void omap_gpmc_reset(struct omap_gpmc_s *s)
 s->cs_file[i].config[6] & 0x1f,/* MASKADDR */
 (s->cs_file[i].config[6] >> 8 & 0xf)); /* BASEADDR */
 }
-omap_gpmc_cs_map(s->cs_file, 0, 0xf);
 s->ecc_cs = 0;
 s->ecc_ptr = 0;
 s->ecc_cfg = 0x3fcff000;
-- 
1.7.1




Re: [Qemu-devel] [PATCH] pci: Error on PCI capability collisions

2011-08-23 Thread Michael S. Tsirkin
On Tue, Aug 23, 2011 at 12:21:47PM -0600, Alex Williamson wrote:
> On Tue, 2011-08-23 at 21:17 +0300, Michael S. Tsirkin wrote:
> > On Tue, Aug 23, 2011 at 07:28:08PM +0200, Jan Kiszka wrote:
> > > From: Alex Williamson 
> > > 
> > > Nothing good can happen when we overlap capabilities
> > > 
> > > [ Jan: rebased over qemu, minor formatting ]
> > > 
> > > Signed-off-by: Jan Kiszka 
> > 
> > I'll stick an assert there instead. Normal devices
> > don't generate overlapping caps unless there's a bug,
> > and device assignment should do its own checks.
> > 
> > I really have a mind to rip out the used array too.
> 
> So you'd rather kill qemu than have a reasonable error return
> path... great :(
> 
> Alex

Well that will make it possible to make pci_add_capability return void,
less work for callers :) Dev assignment is really the only place where
capability offsets need to be verified.

> > > ---
> > >  hw/pci.c |   14 ++
> > >  1 files changed, 14 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/hw/pci.c b/hw/pci.c
> > > index 6124790..ff20631 100644
> > > --- a/hw/pci.c
> > > +++ b/hw/pci.c
> > > @@ -1952,11 +1952,25 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
> > > uint8_t offset, uint8_t size)
> > >  {
> > >  uint8_t *config;
> > > +int i;
> > > +
> > >  if (!offset) {
> > >  offset = pci_find_space(pdev, size);
> > >  if (!offset) {
> > >  return -ENOSPC;
> > >  }
> > > +} else {
> > > +for (i = offset; i < offset + size; i++) {
> > > +if (pdev->used[i]) {
> > > +fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
> > > +"Attempt to add PCI capability %x at offset "
> > > +"%x overlaps existing capability %x at offset 
> > > %x\n",
> > > +pci_find_domain(pdev->bus), 
> > > pci_bus_num(pdev->bus),
> > > +PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
> > > +cap_id, offset, pdev->config_map[i], i);
> > > +return -EFAULT;
> > > +}
> > > +}
> > >  }
> > >  
> > >  config = pdev->config + offset;
> > > -- 
> > > 1.7.3.4
> 
> 
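A device-assignment-side check of the kind suggested above could be sketched as
follows; the type and field names here are invented for illustration and do not
exist in QEMU, only PCI_CONFIG_SPACE_SIZE is a real constant:

/* Hedged sketch of caller-side verification before pci_add_capability(). */
typedef struct {
    bool cap_used[PCI_CONFIG_SPACE_SIZE];   /* hypothetical bookkeeping */
} AssignedCapState;

static int assigned_cap_reserve(AssignedCapState *s,
                                uint8_t offset, uint8_t size)
{
    int i;

    assert(offset + size <= PCI_CONFIG_SPACE_SIZE);
    for (i = offset; i < offset + size; i++) {
        if (s->cap_used[i]) {
            return -EFAULT;     /* overlap: fail the device add cleanly */
        }
    }
    for (i = offset; i < offset + size; i++) {
        s->cap_used[i] = true;
    }
    return 0;
}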



Re: [Qemu-devel] [PATCH] tcg: Update --enable-debug for TCG_OPF_NOT_PRESENT.

2011-08-23 Thread Peter Maydell
On 23 August 2011 18:43, Richard Henderson  wrote:
> Signed-off-by: Richard Henderson 

Reviewed-by: Peter Maydell 

Confirmed that this fixes the assertion on x86-64 host.

-- PMM



Re: [Qemu-devel] [PATCH] pci: Error on PCI capability collisions

2011-08-23 Thread Alex Williamson
On Tue, 2011-08-23 at 21:26 +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 23, 2011 at 12:21:47PM -0600, Alex Williamson wrote:
> > On Tue, 2011-08-23 at 21:17 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Aug 23, 2011 at 07:28:08PM +0200, Jan Kiszka wrote:
> > > > From: Alex Williamson 
> > > > 
> > > > Nothing good can happen when we overlap capabilities
> > > > 
> > > > [ Jan: rebased over qemu, minor formatting ]
> > > > 
> > > > Signed-off-by: Jan Kiszka 
> > > 
> > > I'll stick an assert there instead. Normal devices
> > > don't generate overlapping caps unless there's a bug,
> > > and device assignment should do its own checks.
> > > 
> > > I really have a mind to rip out the used array too.
> > 
> > So you'd rather kill qemu than have a reasonable error return
> > path... great :(
> > 
> > Alex
> 
> Well that will make it possible to make pci_add_capability return void,
> less work for callers :) Dev assignment is really the only place where
> capability offsets need to be verified.

A few issues with that... Since when is error handling so difficult that
we need to pretend that nothing ever fails just to make it easy for the
caller?  Why is device assignment such a special case?  It's actually
rather ironic that we're trying to add error checking to catch bugs that
real hardware is exposing, but assuming that emulated drivers always get
it right.  How will returning void help an emulated driver that has a
coding error?

Alex

> > > > ---
> > > >  hw/pci.c |   14 ++
> > > >  1 files changed, 14 insertions(+), 0 deletions(-)
> > > > 
> > > > diff --git a/hw/pci.c b/hw/pci.c
> > > > index 6124790..ff20631 100644
> > > > --- a/hw/pci.c
> > > > +++ b/hw/pci.c
> > > > @@ -1952,11 +1952,25 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
> > > > uint8_t offset, uint8_t size)
> > > >  {
> > > >  uint8_t *config;
> > > > +int i;
> > > > +
> > > >  if (!offset) {
> > > >  offset = pci_find_space(pdev, size);
> > > >  if (!offset) {
> > > >  return -ENOSPC;
> > > >  }
> > > > +} else {
> > > > +for (i = offset; i < offset + size; i++) {
> > > > +if (pdev->used[i]) {
> > > > +fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
> > > > +"Attempt to add PCI capability %x at offset "
> > > > +"%x overlaps existing capability %x at offset 
> > > > %x\n",
> > > > +pci_find_domain(pdev->bus), 
> > > > pci_bus_num(pdev->bus),
> > > > +PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
> > > > +cap_id, offset, pdev->config_map[i], i);
> > > > +return -EFAULT;
> > > > +}
> > > > +}
> > > >  }
> > > >  
> > > >  config = pdev->config + offset;
> > > > -- 
> > > > 1.7.3.4
> > 
> > 
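From an emulated device's point of view, the error-return style being defended
looks roughly like this; the device function and the 0x48 offset are hypothetical,
only the pci_add_capability() signature matches the quoted patch:

/* Hedged sketch of an emulated device checking the return value.
 * "mydev_initfn" is a made-up device; PCI_CAP_ID_PM and PCI_PM_SIZEOF
 * are the standard PCI constants. */
static int mydev_initfn(PCIDevice *pdev)
{
    int ret;

    /* A fixed offset chosen by the device model; a typo here overlaps an
     * existing capability.  With an error return the device add fails
     * gracefully (e.g. on hotplug) instead of aborting the whole VM. */
    ret = pci_add_capability(pdev, PCI_CAP_ID_PM, 0x48, PCI_PM_SIZEOF);
    if (ret < 0) {
        return ret;
    }
    return 0;
}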





