[Qemu-devel] new pc-bios/bios.bin breaks freebsd booting

2012-12-12 Thread Luigi Rizzo
I am not sure if it has been reported already but this commit

http://git.qemu.org/?p=qemu.git;a=commitdiff;h=d7a51dbbaa70677846453f8c961590913052dd86

(replacing pc-bios/bios.bin with a newer version)
breaks booting of FreeBSD on recent qemu (starting roughly with qemu-
1.3.0-rc2).

Using a FreeBSD host, and a FreeBSD guest,
the qemu thread runs at 100% and the console is stuck
after the 'pci0' probe:


...
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 100000000 Hz quality 950
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0xb008-0xb00b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0

Reverting the bios fixes things.
Should this change perhaps be reverted?

cheers
luigi



-- 
-----------------------------------------+-------------------------------
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
 TEL      +39-050-2211611               . via Diotisalvi 2
 Mobile   +39-338-6809875               . 56122 PISA (Italy)
-----------------------------------------+-------------------------------


Re: [Qemu-devel] new pc-bios/bios.bin breaks freebsd booting

2012-12-12 Thread Luigi Rizzo
On Wed, Dec 12, 2012 at 06:47:58PM +0100, Paolo Bonzini wrote:
> On 12/12/2012 17:04, Luigi Rizzo wrote:
> > I am not sure if it has been reported already but this commit
> > 
> > http://git.qemu.org/?p=qemu.git;a=commitdiff;h=d7a51dbbaa70677846453f8c961590913052dd86
> > 
> > (replacing pc-bios/bios.bin with a newer version)
> > breaks booting of FreeBSD on recent qemu (starting roughly with
> > qemu-1.3.0-rc2).
> > 
> > Using a FreeBSD host, and a FreeBSD guest,
> > the qemu thread runs at 100% and the console is stuck
> > after the 'pci0' probe:
> > 
> >
> > ...
> > hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
> > Timecounter "HPET" frequency 100000000 Hz quality 950
> > Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
> > acpi_timer0: <24-bit timer at 3.579545MHz> port 0xb008-0xb00b on acpi0
> > pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
> > pci0: <ACPI PCI bus> on pcib0
> > 
> > Reverting the bios fixes things.
> > Should this change perhaps be reverted?
> 
> Not reverting the change (which fixes other things), but yes---we should
> get the fix into SeaBIOS.
> 
> I don't have a FreeBSD VM handy, can you try the attached BIOS so I can
> have your Tested-by?  The patch I used is after my signature.

thanks, the attached bios successfully boots a FreeBSD guest

Tested-by: Luigi Rizzo 


cheers
luigi

> Paolo
> 
> diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
> index 8019b71..b58ef62 100644
> --- a/src/acpi-dsdt.dsl
> +++ b/src/acpi-dsdt.dsl
> @@ -187,7 +187,7 @@ DefinitionBlock (
>  
> >  prt_slot0(0x0000),
> >  /* Device 1 is power mgmt device, and can only use irq 9 */
> > -Package() { 0x0001ffff, 0, 0, 9 },
> > +Package() { 0x0001ffff, 0, LNKS, 0 },
> >  Package() { 0x0001ffff, 1, LNKB, 0 },
> >  Package() { 0x0001ffff, 2, LNKC, 0 },
> >  Package() { 0x0001ffff, 3, LNKD, 0 },
> @@ -278,6 +278,22 @@ DefinitionBlock (
>  define_link(LNKB, 1, PRQ1)
>  define_link(LNKC, 2, PRQ2)
>  define_link(LNKD, 3, PRQ3)
> +
> +Device(LNKS) {
> +Name(_HID, EISAID("PNP0C0F"))
> +Name(_UID, 4)
> +Name(_PRS, ResourceTemplate() {
> +Interrupt(, Level, ActiveHigh, Shared) { 9 }
> +})
> +
> +// The SCI cannot be disabled and is always attached to GSI 9,
> +// so these are no-ops.  We only need this link to override the
> +// polarity to active high and match the content of the MADT.
> +Method(_STA, 0, NotSerialized) { Return (0x0b) }
> +Method(_DIS, 0, NotSerialized) { }
> +Method(_CRS, 0, NotSerialized) { Return (_PRS) }
> +Method(_SRS, 1, NotSerialized) { }
> +}
>  }
>  
>  #include "acpi-dsdt-cpu-hotplug.dsl"
> 





[Qemu-devel] [RFC] updated e1000 mitigation patch

2012-12-27 Thread Luigi Rizzo
Before submitting a proper patch, I'd like to hear feedback on the
following proposed change to hw/e1000.c to properly implement
interrupt mitigation. 
This is joint work with Vincenzo Maffione and Giuseppe Lettieri (in Cc),
and is a followup to a similar patch i posted in july

https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg03195.html


The patch models three of the five (sic!) e1000 registers that control
moderation, and uses qemu timers for that (the minimum delay of
these timers affects the fidelity of the emulation).

Right now there is a static variable that controls whether the
feature is enabled. I would probably like to make it a parameter
accessible from the command line in qemu, possibly extending it to
override the mitigation delay set by the device driver.
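For concreteness, here is a minimal sketch of how the static variable
could become a per-device qdev property (the property name, and moving
mit_on into E1000State, are assumptions of this sketch, not part of the
patch below):

    /* hypothetical: expose the mitigation switch as a device property,
     * settable from the command line, e.g.
     *     -device e1000,netdev=nd0,mitigation=0
     */
    static Property e1000_properties[] = {
        DEFINE_NIC_PROPERTIES(E1000State, conf),
        DEFINE_PROP_UINT32("mitigation", E1000State, mit_on, 1),
        DEFINE_PROP_END_OF_LIST(),
    };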

So far we have reached transmit speeds on the order of 200-300 Kpps
(when matched with a suitable guest device driver optimization, see
https://groups.google.com/forum/?fromgroups=#!topic/mailing.freebsd.emulator/xQHR_hleCuc )

Some performance data using a FreeBSD guest, for UDP transmissions:

                                   KVM        QEMU
standard KVM, standard driver      20 Kpps    6.3 Kpps
modified KVM, standard driver      37 Kpps    28 Kpps
modified KVM, modified driver     200 Kpps    34 Kpps
 
As you can see, on kvm this change gives a 5x speedup to the tx path,
which combines nicely with the 2x speedup that comes from supporting
interrupt mitigation alone in the hypervisor. Without kvm (or kqemu ?)
the benefits are much lower, as the guest becomes too slow.

cheers
luigi

diff -urp qemu-1.3.0-orig/hw/e1000.c qemu-1.3.0/hw/e1000.c
--- qemu-1.3.0-orig/hw/e1000.c  2012-12-03 20:37:05.0 +0100
+++ qemu-1.3.0/hw/e1000.c   2012-12-27 09:47:16.0 +0100
@@ -35,6 +35,8 @@
 
 #include "e1000_hw.h"
 
+static int mit_on = 1; /* interrupt mitigation enable */
+
 #define E1000_DEBUG
 
 #ifdef E1000_DEBUG
@@ -129,6 +131,9 @@ typedef struct E1000State_st {
 } eecd_state;
 
 QEMUTimer *autoneg_timer;
+QEMUTimer *mit_timer;  // handle for the timer
+uint32_t mit_timer_on; // mitigation timer active
+uint32_t mit_cause;// pending interrupt cause
 } E1000State;
 
 #definedefreg(x)   x = (E1000_##x>>2)
@@ -144,6 +149,7 @@ enum {
 defreg(TPR),   defreg(TPT),defreg(TXDCTL), defreg(WUFC),
 defreg(RA),defreg(MTA),defreg(CRCERRS),defreg(VFTA),
 defreg(VET),
+defreg(RDTR),  defreg(RADV),   defreg(TADV),   defreg(ITR),
 };
 
 static void
@@ -639,6 +645,68 @@ static uint64_t tx_desc_base(E1000State
 return (bah << 32) + bal;
 }
 
+/* helper function, 0 means the value is not set */
+static inline void
+mit_update_delay(uint32_t *curr, uint32_t value)
+{
+if (value && (*curr == 0 || value < *curr))
+   *curr = value;
+}
+
+/*
+ * If necessary, rearm the timer and post an interrupt.
+ * Called at the end of tx/rx routines (mit_timer_on == 0),
+ * and when the timer fires (mit_timer_on == 1).
+ * We provide a partial implementation of interrupt mitigation,
+ * emulating only RADV, TADV and ITR (lower 16 bits, 1024ns units for
+ * RADV and TADV, 256ns units for ITR). RDTR is only used to enable RADV;
+ * relative timers based on TIDV and RDTR are not implemented.
+ */
+static void
+mit_rearm_and_int(void *opaque)
+{
+E1000State *s = opaque;
+uint32_t mit_delay = 0;
+
+/*
+ * Clear the flag. It is only set when the callback fires,
+ * and we need to clear it anyways.
+ */
+s->mit_timer_on = 0;
+if (s->mit_cause == 0) /* no events pending, we are done */
+   return;
+/*
+ * Compute the next mitigation delay according to pending interrupts
+ * and the current values of RADV (provided RDTR!=0), TADV and ITR.
+ * Then rearm the timer.
+ */
+if (s->mit_cause & (E1000_ICR_TXQE | E1000_ICR_TXDW))
+   mit_update_delay(&mit_delay, s->mac_reg[TADV] * 4);
+if (s->mac_reg[RDTR] && (s->mit_cause & E1000_ICS_RXT0))
+   mit_update_delay(&mit_delay, s->mac_reg[RADV] * 4);
+mit_update_delay(&mit_delay, s->mac_reg[ITR]);  
+
+if (mit_delay) {
+   s->mit_timer_on = 1;
+   qemu_mod_timer(s->mit_timer,
+   qemu_get_clock_ns(vm_clock) + mit_delay * 256);
+}
+set_ics(s, 0, s->mit_cause);
+s->mit_cause = 0;
+}
+
+static void
+mit_set_ics(E1000State *s, uint32_t cause)
+{
+if (mit_on == 0) {
+   set_ics(s, 0, cause);
+   return;
+}
+s->mit_cause |= cause;
+if (!s->mit_timer_on)
+   mit_rearm_and_int(s);
+}
+
 static void
 start_xmit(E1000State *s)
 {
@@ -676,7 +744,7 @@ start_xmit(E1000State *s)
 break;
 }
 }
-set_ics(s, 0, cause);
+mit_set_ics(s, cause);
 }
 
 static int
@@ -894,7 +962,7 @@ e1000_receive(NetClientState *nc, const
 s->rxbuf_min_shift)
 n |= E1000_ICS_RXDMT0;
 
-set_ics(s, 0, n);
+mit_set_ics(s, n);
 

Re: [Qemu-devel] [PULL 0/1] update seabios

2013-01-02 Thread Luigi Rizzo
Are you going to distribute a 1.3.x snapshot with the updated bios that
lets FreeBSD boot?

thanks
luigi

On Wed, Jan 2, 2013 at 5:57 PM, Anthony Liguori wrote:

> Gerd Hoffmann  writes:
>
> >   Hi,
> >
> > One more seabios update, fixing the FreeBSD build failure.
> >
> > please pull,
> >   Gerd
>
>


Re: [Qemu-devel] memcpy speed (Re: [PATCH v2] netmap backend (revised))

2013-01-24 Thread Luigi Rizzo
On Thu, Jan 24, 2013 at 09:54:19AM +0100, Stefan Hajnoczi wrote:
> On Wed, Jan 23, 2013 at 06:55:59PM -0800, Luigi Rizzo wrote:
> > On Wed, Jan 23, 2013 at 8:03 AM, Luigi Rizzo  wrote:
> > 
> > > > I'm even doubtful that it's always a win on FreeBSD.  You have a
> > > > threshold to fall back to bcopy() and who knows what the "best" value
> > > > for various CPUs is.
> > >
> > > indeed.
> > > With the attached program (which however might be affected by the
> > > fact that data is not used after copying) it seems that on a recent
> > > linux (using gcc 4.6.2) the fastest is __builtin_memcpy()
> > >
> > > ./testlock -m __builtin_memcpy -l 64
> > >
> > > (by a factor of 2 or more) whereas all the other methods have
> > > approximately the same speed.
> > >
> > 
> > never mind, pilot error. in my test program i had swapped the
> > arguments to __builtin_memcpy(). With the correct ones,
> > __builtin_memcpy()  == bcopy == memcpy on both machines,
> > and never faster than the pkt_copy().
> 
> Are the bcopy()/memcpy() calls given a length that is a multiple of 64 bytes?
> 
> IIUC pkt_copy() assumes 64-byte multiple lengths, and that optimization
> can be matched with memcpy(dst, src, (len + 63) & ~63).  Maybe it helps and
> at least ensures they are doing equal amounts of byte copying.

the length is a parameter from the command line.
For short packets, at least on the i7-2600 under FreeBSD, pkt_copy()
is only slightly faster than memcpy() on multiples of 64, and *a lot*
faster when the length is not a multiple.
Again I am not sure whether it depends on the compiler/glibc or
simply on the CPU; unfortunately I have no way to swap machines.
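For reference, a minimal sketch of this kind of copy micro-benchmark
(the real testlock program is not shown in this thread; all names here
are made up). Note that it consumes one byte of the destination per
iteration, to address the "data is not used after copying" caveat above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    int main(int argc, char *argv[])
    {
        size_t len = argc > 1 ? (size_t)atoi(argv[1]) : 64;
        long i, n = 1000000;
        char *src = calloc(1, len + 64), *dst = calloc(1, len + 64);
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (i = 0; i < n; i++) {
            memcpy(dst, src, len);   /* swap in the routine under test */
            src[0] = dst[len - 1];   /* defeat dead-code elimination */
        }
        gettimeofday(&t1, NULL);
        printf("%zu bytes: %.3f ns/copy\n", len,
               ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_usec - t0.tv_usec) * 1e3) / n);
        return 0;
    }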

luigi



Re: [Qemu-devel] (change __FUNCTION__ to __func__ according to qemu coding style)

2013-01-25 Thread Luigi Rizzo
On Fri, Jan 25, 2013 at 9:23 AM, Michael Tokarev  wrote:

>
> ---
> v2: change __FUNCTION__ to __func__ according to qemu coding style
>
>
Will do. However, it does not seem to be a strictly followed convention :)

lrizzo@netmap:~/work-qemu/qemu-git-head-20130121$ grep -r __FUNCTION__ . | wc
    344    2119   25898
lrizzo@netmap:~/work-qemu/qemu-git-head-20130121$ grep -r __func__ . | wc
    759    5035   58455

cheers
luigi


Re: [Qemu-devel] memcpy speed (Re: [PATCH v2] netmap backend (revised))

2013-01-28 Thread Luigi Rizzo
On Thu, Jan 24, 2013 at 11:24 PM, Stefan Hajnoczi wrote:

> On Thu, Jan 24, 2013 at 6:35 PM, Luigi Rizzo  wrote:
>


> >> >
> >> > never mind, pilot error. in my test program i had swapped the
> >> > arguments to __builtin_memcpy(). With the correct ones,
> >> > __builtin_memcpy()  == bcopy == memcpy on both machines,
> >> > and never faster than the pkt_copy().
> >>
> >> Are the bcopy()/memcpy() calls given a length that is a multiple of 64
> >> bytes?
> >>
> >> IIUC pkt_copy() assumes 64-byte multiple lengths, and that optimization
> >> can be matched with memcpy(dst, src, (len + 63) & ~63).  Maybe it helps and
> >> at least ensures they are doing equal amounts of byte copying.
> >
> > the length is a parameter from the command line.
> > For short packets, at least on the i7-2600 and freebsd the pkt_copy()
> > is only slightly faster than memcpy on multiples of 64, and *a lot*
> > faster when the length is not a multiple.
>
> How about dropping pkt_copy() and instead rounding the memcpy() length
> up to the next 64 byte multiple?
>

> Using memcpy() is more future-proof IMO, that's why I'm pushing for this.
>
>
fair enough, I'll make this conditional and enable memcpy() rounded to
64-byte multiples by default (though, as I said, pkt_copy() is always at
least as fast as memcpy() on all machines I tried).
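A minimal sketch of the rounding being discussed (the wrapper name is
made up, and callers must guarantee up to 63 bytes of padding on both
buffers):

    #include <string.h>

    /* round the copy length up to the next 64-byte multiple, so short
     * copies always move whole cache lines, like netmap's pkt_copy() */
    static inline void
    memcpy_round64(void *dst, const void *src, size_t len)
    {
        memcpy(dst, src, (len + 63) & ~(size_t)63);
    }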

cheers
luigi


[Qemu-devel] [PATCH v2] fix qemu_flush_queued_packets() in presence of a hub

2013-02-05 Thread Luigi Rizzo
On Tue, Jan 29, 2013 at 02:08:27PM +0100, Stefan Hajnoczi wrote:

> It's cleaner to add a bool (*flush)(NetClientState *nc) function to
> NetClientInfo but given that net.c already knows about hub.c
> specifically we don't gain much by trying to decouple this callback.  If
> you want to leave it hardcoded that's fine by me.

I'd leave it like this for now. Here is the version with fixed style bugs.



DESCRIPTION:

When frontend and backend are connected through a hub as below
(showing only one direction), and the frontend (or in general, all
output ports of the hub) cannot accept more traffic, the backend
queues packets in queue-A.

When the frontend (or, in general, one output port) becomes ready again,
qemu tries to flush packets from queue-B, which is unfortunately empty.

  e1000.0 <--[queue B]-- hub0port0 (hub) hub0port1 <--[queue A]-- tap.0

To fix this I propose introducing a new function, net_hub_flush(),
which is called when trying to flush a queue connected to a hub.

cheers
luigi

Signed-off-by: Luigi Rizzo 

diff --git a/net/hub.c b/net/hub.c
index a24c9d1..df32074 100644
--- a/net/hub.c
+++ b/net/hub.c
@@ -338,3 +338,17 @@ void net_hub_check_clients(void)
 }
 }
 }
+
+bool net_hub_flush(NetClientState *nc)
+{
+NetHubPort *port;
+NetHubPort *source_port = DO_UPCAST(NetHubPort, nc, nc);
+int ret = 0;
+
+QLIST_FOREACH(port, &source_port->hub->ports, next) {
+if (port != source_port) {
+ret += qemu_net_queue_flush(port->nc.send_queue);
+}
+}
+return ret ? true : false;
+}
diff --git a/net/hub.h b/net/hub.h
index 583ada8..a625eff 100644
--- a/net/hub.h
+++ b/net/hub.h
@@ -21,5 +21,6 @@ NetClientState *net_hub_add_port(int hub_id, const char 
*name);
 NetClientState *net_hub_find_client_by_name(int hub_id, const char *name);
 void net_hub_info(Monitor *mon);
 void net_hub_check_clients(void);
+bool net_hub_flush(NetClientState *nc);
 
 #endif /* NET_HUB_H */
diff --git a/net/net.c b/net/net.c
index 9806862..9489702 100644
--- a/net/net.c
+++ b/net/net.c
@@ -439,6 +439,12 @@ void qemu_flush_queued_packets(NetClientState *nc)
 {
 nc->receive_disabled = 0;
 
+if (nc->peer && nc->peer->info->type == NET_CLIENT_OPTIONS_KIND_HUBPORT) {
+if (net_hub_flush(nc->peer)) {
+qemu_notify_event();
+}
+return;
+}
 if (qemu_net_queue_flush(nc->send_queue)) {
 /* We emptied the queue successfully, signal to the IO thread to repoll
  * the file descriptor (for tap, for example).



[Qemu-devel] [PATCH v3] fix qemu_flush_queued_packets() in presence of a hub

2013-02-05 Thread Luigi Rizzo
DESCRIPTION:

When frontend and backend are connected through a hub as below
(showing only one direction), and the frontend (or in general, all
output ports of the hub) cannot accept more traffic, the backend
queues packets in queue-A.

When the frontend (or, in general, one output port) becomes ready again,
qemu tries to flush packets from queue-B, which is unfortunately empty.

  e1000.0 <--[queue B]-- hub0port0 (hub) hub0port1 <--[queue A]-- tap.0

To fix this I propose introducing a new function, net_hub_flush(),
which is called when trying to flush a queue connected to a hub.

cheers
luigi

Signed-off-by: Luigi Rizzo 

diff --git a/net/hub.c b/net/hub.c
index a24c9d1..df32074 100644
--- a/net/hub.c
+++ b/net/hub.c
@@ -338,3 +338,17 @@ void net_hub_check_clients(void)
 }
 }
 }
+
+bool net_hub_flush(NetClientState *nc)
+{
+NetHubPort *port;
+NetHubPort *source_port = DO_UPCAST(NetHubPort, nc, nc);
+int ret = 0;
+
+QLIST_FOREACH(port, &source_port->hub->ports, next) {
+if (port != source_port) {
+ret += qemu_net_queue_flush(port->nc.send_queue);
+}
+}
+return ret ? true : false;
+}
diff --git a/net/hub.h b/net/hub.h
index 583ada8..a625eff 100644
--- a/net/hub.h
+++ b/net/hub.h
@@ -21,5 +21,6 @@ NetClientState *net_hub_add_port(int hub_id, const char 
*name);
 NetClientState *net_hub_find_client_by_name(int hub_id, const char *name);
 void net_hub_info(Monitor *mon);
 void net_hub_check_clients(void);
+bool net_hub_flush(NetClientState *nc);
 
 #endif /* NET_HUB_H */
diff --git a/net/net.c b/net/net.c
index 9806862..9489702 100644
--- a/net/net.c
+++ b/net/net.c
@@ -439,6 +439,12 @@ void qemu_flush_queued_packets(NetClientState *nc)
 {
 nc->receive_disabled = 0;
 
+if (nc->peer && nc->peer->info->type == NET_CLIENT_OPTIONS_KIND_HUBPORT) {
+if (net_hub_flush(nc->peer)) {
+qemu_notify_event();
+}
+return;
+}
 if (qemu_net_queue_flush(nc->send_queue)) {
 /* We emptied the queue successfully, signal to the IO thread to repoll
  * the file descriptor (for tap, for example).




[Qemu-devel] [PATCH v2] fix unbounded NetQueue

2013-02-05 Thread Luigi Rizzo
In the current implementation of qemu, running without a network
backend will cause the queue to grow unbounded when the guest is
transmitting traffic.

This patch fixes the problem by implementing a bounded-size NetQueue,
used with an arbitrary limit of 10000 packets, and dropping packets
when the queue is full _and_ the sender does not pass a callback.

The second condition makes sure that we never drop packets that
contain a callback (which would be tricky, because the producer
expects the callback to run once all previous packets have been
consumed, so we cannot run it when the packet is dropped).

If the documentation is correct, producers that submit a callback should
stop sending once their packet is queued, so there is no real risk
that the queue exceeds the maximum size by much.

Signed-off-by: Luigi Rizzo 

diff --git a/net/queue.c b/net/queue.c
index 6eaf5b6..859d02a 100644
--- a/net/queue.c
+++ b/net/queue.c
@@ -50,6 +50,8 @@ struct NetPacket {
 
 struct NetQueue {
 void *opaque;
+uint32_t nq_maxlen;
+uint32_t nq_count;
 
 QTAILQ_HEAD(packets, NetPacket) packets;
 
@@ -63,6 +65,8 @@ NetQueue *qemu_new_net_queue(void *opaque)
 queue = g_malloc0(sizeof(NetQueue));
 
 queue->opaque = opaque;
+queue->nq_maxlen = 10000;
+queue->nq_count = 0;
 
 QTAILQ_INIT(&queue->packets);
 
@@ -92,6 +96,9 @@ static void qemu_net_queue_append(NetQueue *queue,
 {
 NetPacket *packet;
 
+if (queue->nq_count >= queue->nq_maxlen && !sent_cb) {
+return; /* drop if queue full and no callback */
+}
 packet = g_malloc(sizeof(NetPacket) + size);
 packet->sender = sender;
 packet->flags = flags;
@@ -99,6 +106,7 @@ static void qemu_net_queue_append(NetQueue *queue,
 packet->sent_cb = sent_cb;
 memcpy(packet->data, buf, size);
 
+queue->nq_count++;
 QTAILQ_INSERT_TAIL(&queue->packets, packet, entry);
 }
 
@@ -113,6 +121,9 @@ static void qemu_net_queue_append_iov(NetQueue *queue,
 size_t max_len = 0;
 int i;
 
+if (queue->nq_count >= queue->nq_maxlen && !sent_cb) {
+return; /* drop if queue full and no callback */
+}
 for (i = 0; i < iovcnt; i++) {
 max_len += iov[i].iov_len;
 }
@@ -130,6 +141,7 @@ static void qemu_net_queue_append_iov(NetQueue *queue,
 packet->size += len;
 }
 
+queue->nq_count++;
 QTAILQ_INSERT_TAIL(&queue->packets, packet, entry);
 }
 
@@ -220,6 +232,7 @@ void qemu_net_queue_purge(NetQueue *queue, NetClientState 
*from)
 QTAILQ_FOREACH_SAFE(packet, &queue->packets, entry, next) {
 if (packet->sender == from) {
 QTAILQ_REMOVE(&queue->packets, packet, entry);
+queue->nq_count--;
 g_free(packet);
 }
 }
@@ -233,6 +246,7 @@ bool qemu_net_queue_flush(NetQueue *queue)
 
 packet = QTAILQ_FIRST(&queue->packets);
 QTAILQ_REMOVE(&queue->packets, packet, entry);
+queue->nq_count--;
 
 ret = qemu_net_queue_deliver(queue,
  packet->sender,
@@ -240,6 +254,7 @@ bool qemu_net_queue_flush(NetQueue *queue)
  packet->data,
  packet->size);
 if (ret == 0) {
+queue->nq_count++;
 QTAILQ_INSERT_HEAD(&queue->packets, packet, entry);
 return false;
 }



[Qemu-devel] [PATCH] implement moderation registers for e1000

2013-02-06 Thread Luigi Rizzo
The following patch implements the interrupt moderation registers
of the e1000 adapter. This feature is normally used by OS drivers,
and implementing it improves performance significantly, especially
on the transmit path.
The feature can be disabled through a command line option.
We have seen only benefits in throughput, while latency increases
slightly (an inherent property of interrupt moderation), because very
short delays cannot be emulated precisely.
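As a worked example of the units involved (taken from the comment in
the patch below; the register values here are arbitrary):

    ITR  = 500  ->  500 * 256 ns  = 128 us minimum interrupt gap
    TADV = 64   ->  64 * 1024 ns  = 65.5 us absolute tx delay
                    (scaled in the code as 64 * 4 = 256 units of 256 ns)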

For those curious on performance, there is a set of measurements
(of this and other changes that we will post separately) at

http://info.iet.unipi.it/~luigi/research.html#qemu

cheers
luigi


Signed-off-by: Luigi Rizzo 

diff --git a/hw/e1000.c b/hw/e1000.c
index bb150c6..b4c0f4a 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -131,6 +131,10 @@ typedef struct E1000State_st {
 } eecd_state;
 
 QEMUTimer *autoneg_timer;
+QEMUTimer *mit_timer;   /* handle for the timer  */
+uint32_t  mit_timer_on; /* mitigation timer active   */
+uint32_t  mit_cause;/* pending interrupt cause   */
+uint32_t  mit_on;   /* mitigation enable */
 } E1000State;
 
 #definedefreg(x)   x = (E1000_##x>>2)
@@ -146,6 +150,7 @@ enum {
 defreg(TPR),   defreg(TPT),defreg(TXDCTL), defreg(WUFC),
 defreg(RA),defreg(MTA),defreg(CRCERRS),defreg(VFTA),
 defreg(VET),
+defreg(RDTR),   defreg(RADV),   defreg(TADV),   defreg(ITR),
 };
 
 static void
@@ -652,6 +657,73 @@ static uint64_t tx_desc_base(E1000State *s)
 return (bah << 32) + bal;
 }
 
+/* helper function, 0 means the value is not set */
+static inline void
+mit_update_delay(uint32_t *curr, uint32_t value)
+{
+if (value && (*curr == 0 || value < *curr)) {
+*curr = value;
+}
+}
+
+/*
+ * If necessary, rearm the timer and post an interrupt.
+ * Called at the end of tx/rx routines (mit_timer_on == 0),
+ * and when the timer fires (mit_timer_on == 1).
+ * We provide a partial implementation of interrupt mitigation,
+ * emulating only RADV, TADV and ITR (lower 16 bits, 1024ns units for
+ * RADV and TADV, 256ns units for ITR). RDTR is only used to enable RADV;
+ * relative timers based on TIDV and RDTR are not implemented.
+ */
+static void
+mit_rearm_and_int(void *opaque)
+{
+E1000State *s = opaque;
+uint32_t mit_delay = 0;
+
+/*
+ * Clear the flag. It is only set when the callback fires,
+ * and we need to clear it anyways.
+ */
+s->mit_timer_on = 0;
+if (s->mit_cause == 0) { /* no events pending, we are done */
+return;
+}
+/*
+ * Compute the next mitigation delay according to pending interrupts
+ * and the current values of RADV (provided RDTR!=0), TADV and ITR.
+ * Then rearm the timer.
+ */
+if (s->mit_cause & (E1000_ICR_TXQE | E1000_ICR_TXDW)) {
+mit_update_delay(&mit_delay, s->mac_reg[TADV] * 4);
+}
+if (s->mac_reg[RDTR] && (s->mit_cause & E1000_ICS_RXT0)) {
+mit_update_delay(&mit_delay, s->mac_reg[RADV] * 4);
+}
+mit_update_delay(&mit_delay, s->mac_reg[ITR]);
+
+if (mit_delay) {
+s->mit_timer_on = 1;
+qemu_mod_timer(s->mit_timer,
+qemu_get_clock_ns(vm_clock) + mit_delay * 256);
+}
+
+set_ics(s, 0, s->mit_cause);
+s->mit_cause = 0;
+}
+
+static void
+mit_set_ics(E1000State *s, uint32_t cause)
+{
+if (s->mit_on == 0) {
+set_ics(s, 0, cause);
+return;
+}
+s->mit_cause |= cause;
+if (!s->mit_timer_on)
+mit_rearm_and_int(s);
+}
+
 static void
 start_xmit(E1000State *s)
 {
@@ -689,7 +761,7 @@ start_xmit(E1000State *s)
 break;
 }
 }
-set_ics(s, 0, cause);
+mit_set_ics(s, cause);
 }
 
 static int
@@ -908,7 +980,7 @@ e1000_receive(NetClientState *nc, const uint8_t *buf, 
size_t size)
 s->rxbuf_min_shift)
 n |= E1000_ICS_RXDMT0;
 
-set_ics(s, 0, n);
+mit_set_ics(s, n);
 
 return size;
 }
@@ -1013,6 +1085,7 @@ static uint32_t (*macreg_readops[])(E1000State *, int) = {
 getreg(RDH),   getreg(RDT),getreg(VET),getreg(ICS),
 getreg(TDBAL), getreg(TDBAH),  getreg(RDBAH),  getreg(RDBAL),
 getreg(TDLEN), getreg(RDLEN),
+getreg(RDTR),   getreg(RADV),   getreg(TADV),   getreg(ITR),
 
 [TOTH] = mac_read_clr8,[TORH] = mac_read_clr8, [GPRC] = mac_read_clr4,
 [GPTC] = mac_read_clr4,[TPR] = mac_read_clr4,  [TPT] = mac_read_clr4,
@@ -1029,6 +1102,8 @@ static void (*macreg_writeops[])(E1000State *, int, 
uint32_t) = {
 putreg(PBA),   putreg(EERD),   putreg(SWSM),   putreg(WUFC),
 putreg(TDBAL), putreg(TDBAH),  putreg(TXDCTL), putreg(RDBAH),
 putreg(RDBAL), putreg(LEDCTL), putreg(VET),
+[RDTR] = set_16bit, [RADV] = set_16bit, [TADV] = set_16bit,
+[ITR] = set_16bit,
 

Re: [Qemu-devel] [PATCH] implement moderation registers for e1000

2013-02-08 Thread Luigi Rizzo
On Fri, Feb 8, 2013 at 2:02 AM, Stefan Hajnoczi  wrote:

> On Wed, Feb 06, 2013 at 03:23:41PM +0100, Luigi Rizzo wrote:
> > The following patch implements interrupt moderation registers
> > for the e1000 adapter. These feature is normally used by OS
> > drivers, and their implementation improves performance significantly,
> > especially on the transmit path.
> > The feature can be disabled through a command line option.
> > We have seen only benefits in throughput, while latency slightly
> > increases (that is an inherent feature of interrupt moderation)
> > because very short delays cannot be emulated precisely.
> >
> > For those curious on performance, there is a set of measurements
> > (of this and other changes that we will post separately) at
> >
> > http://info.iet.unipi.it/~luigi/research.html#qemu
>
> http://info.iet.unipi.it/~luigi/papers/20130206-qemu.pdf is 404.
>

sorry, fixed now.
I will resubmit a revised patch with the uninit handling and the braces
in the if() statement fixed.

I am happy to make this default to off, but it would be good if you could
actually give it a try. Note that Linux and FreeBSD (and presumably Windows)
use moderation by default, so enabling the emulation of the registers makes
the guest OS behave differently (and closer to bare metal).

To test that the patch itself does not cause a regression in the emulator,
one should also turn off moderation in the guest (one of the tests I have
run).

cheers
luigi


Re: [Qemu-devel] [PATCH] implement moderation registers for e1000

2013-02-08 Thread Luigi Rizzo
On Fri, Feb 08, 2013 at 11:46:34AM +0100, Paolo Bonzini wrote:
> On 08/02/2013 11:20, Luigi Rizzo wrote:
...
> > I am happy to make this default to off. But it would be good if you could
> > actually give it a try. Note that linux and FreeBSD (and presumably windows)
> > do use moderation by default so enabling the emulation of the
> > registers makes the guest OS behave differently (and closer to bare metal).
> > 
> > To test that the patch itself does not cause regression in the emulator
> > one should also turn off moderation in the guest (one of the tests i
> > have run).
> 
> I think making the default on is fine, but you need to add compatibility
> options to leave it off in older machine types (pc-1.4 and earlier).

I am unclear on what you mean by "older machine types"?
The hardware (E1000_DEV_ID_82540EM) does have these registers,
and it is entirely up to the guest OS to use them or not.

cheers
luigi



Re: [Qemu-devel] [PATCH] implement moderation registers for e1000

2013-02-08 Thread Luigi Rizzo
On Fri, Feb 08, 2013 at 11:59:12AM +0100, Stefan Hajnoczi wrote:
...
> >> http://info.iet.unipi.it/~luigi/papers/20130206-qemu.pdf is 404.
> >>
> >>
> >> sorry, fixed now.
> >> And, will resubmit a fixed patch with uninit and fixed braces in the
> >> if() statement.
> >>
> >> I am happy to make this default to off. But it would be good if you could
> >> actually give it a try. Note that linux and FreeBSD (and presumably 
> >> windows)
> >> do use moderation by default so enabling the emulation of the
> >> registers makes the guest OS behave differently (and closer to bare metal).
> >>
> >> To test that the patch itself does not cause regression in the emulator
> >> one should also turn off moderation in the guest (one of the tests i
> >> have run).
> >
> > I think making the default on is fine, but you need to add compatibility
> > options to leave it off in older machine types (pc-1.4 and earlier).
> 
> Latency regression.  Would need to see real results to understand how bad it 
> is.

most of the numbers in the paper are for FreeBSD,
not sure if they qualify as "real" here :)

For Linux, in a guest-host configuration, we have these numbers for TCP_RR:

    CONFIGURATION          TCP_RR (KTps)
    1 VCPU TX itr=0        17.7
    1 VCPU TX itr=100       7.3
    2 VCPU TX itr=0        13.9
    2 VCPU TX itr=100       7.4

These are computed by forcing the value in the itr register.
However, by default, Linux seems to dynamically adjust the itr value
depending on the load, so it is difficult to make repeatable experiments.
This is why I'd encourage you to run some tests with what you
think is an appropriate configuration.

cheers
luigi



Re: [Qemu-devel] [PATCH] implement moderation registers for e1000

2013-02-08 Thread Luigi Rizzo
On Fri, Feb 08, 2013 at 12:52:12PM +0100, Paolo Bonzini wrote:
> On 07/02/2106 07:28, Luigi Rizzo wrote:
> > On Fri, Feb 08, 2013 at 11:46:34AM +0100, Paolo Bonzini wrote:
> >> On 08/02/2013 11:20, Luigi Rizzo wrote:
> > ...
> >>> I am happy to make this default to off. But it would be good if you could
> >>> actually give it a try. Note that linux and FreeBSD (and presumably 
> >>> windows)
> >>> do use moderation by default so enabling the emulation of the
> >>> registers makes the guest OS behave differently (and closer to bare 
> >>> metal).
> >>>
> >>> To test that the patch itself does not cause regression in the emulator
> >>> one should also turn off moderation in the guest (one of the tests i
> >>> have run).
> >>
> >> I think making the default on is fine, but you need to add compatibility
> >> options to leave it off in older machine types (pc-1.4 and earlier).
> > 
> > I am unclear on what you mean by "older machine types" ?
> > The hardware (E1000_DEV_ID_82540EM) does have these registers,
> > and it is entirely up to the guest OS to use them or not.
> 
> Yes, but suppose a guest has a bug that was masked by the old
> non-implementation.  "-M pc-1.4" is supposed to bring back the old
> version's behavior as much as possible, including making the guest bug
> latent again.

ok, I see. Is there some example code that already handles
a similar '-M' option, so I can look at it and use it to set
the mit_on parameter?
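For reference, the mechanism being asked about looks roughly like this
in the qemu 1.x machine definitions (a sketch only; the "mitigation"
property name is an assumption):

    /* hypothetical compat entry: each older machine type ("-M pc-1.3")
     * carries GlobalProperty overrides that restore its old behavior
     * before command-line device properties are applied */
    static GlobalProperty compat_pc_1_3[] = {
        {
            .driver   = "e1000",
            .property = "mitigation",   /* assumed property name */
            .value    = "off",
        },
        { /* end of list */ }
    };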

cheers
luigi



[Qemu-devel] RFC: handling "backend too fast" in virtio-net

2013-02-14 Thread Luigi Rizzo
virtio-style network devices (where the producer and consumer chase
each other through a shared memory block) can enter into a
bad operating regime when the consumer is too fast.

I am hitting this case right now when virtio is used on top of the
netmap/VALE backend that I posted a few weeks ago: what happens is that
the backend is so fast that the io thread keeps re-enabling notifications
every few packets.  This results, on my test machine, in a throughput of
250-300Kpps (and extremely unstable, oscillating between 200 and 600Kpps).

I'd like to get some feedback on the following trivial patch to have
the thread keep spinning for a bounded amount of time when the producer
is slower than the consumer. This gives a relatively stable throughput
between 700 and 800 Kpps (we have something similar in our paravirtualized
e1000 driver, which is slightly faster at 900-1100 Kpps).

EXISTING LOGIC: reschedule the bh when at least a burst of packets has
   been received. Otherwise enable notifications and do another
   race-prevention check.

NEW LOGIC: when the bh is scheduled the first time, establish a
   budget (currently 20) of reschedule attempts.
   Every time virtio_net_flush_tx() returns 0 packets [this could 
   be changed to 'less than a full burst'] the counter is increased.
   The bh is rescheduled until the counter reaches the budget,
   at which point we re-enable notifications as above.

I am not 100% happy with this patch because there is a "magic"
constant (the maximum number of retries) which should be really
adaptive, or perhaps we should even bound the amount of time the
bh runs, rather than the number of retries.
In practice, having the thread spin (or sleep) for 10-20us
even without new packets is probably preferable to
re-enabling notifications and requesting a kick.

opinions ?

cheers
luigi

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 573c669..5389088 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -49,6 +51,7 @@ typedef struct VirtIONet
 NICState *nic;
 uint32_t tx_timeout;
 int32_t tx_burst;
+int32_t tx_retries; /* keep spinning a bit on tx */
 uint32_t has_vnet_hdr;
 size_t host_hdr_len;
 size_t guest_hdr_len;
@@ -1062,7 +1065,9 @@ static void virtio_net_tx_bh(void *opaque)

 /* If we flush a full burst of packets, assume there are
  * more coming and immediately reschedule */
-if (ret >= n->tx_burst) {
+if (ret == 0)
+   n->tx_retries++;
+if (n->tx_retries < 20) {
 qemu_bh_schedule(q->tx_bh);
 q->tx_waiting = 1;
 return;
@@ -1076,6 +1082,8 @@ static void virtio_net_tx_bh(void *opaque)
 virtio_queue_set_notification(q->tx_vq, 0);
 qemu_bh_schedule(q->tx_bh);
 q->tx_waiting = 1;
+} else {
+   n->tx_retries = 0;
 }
 }





Re: [Qemu-devel] RFC: handling "backend too fast" in virtio-net

2013-02-15 Thread Luigi Rizzo
On Fri, Feb 15, 2013 at 11:24:29AM +0100, Stefan Hajnoczi wrote:
> On Thu, Feb 14, 2013 at 07:21:57PM +0100, Luigi Rizzo wrote:
> 
> CCed Michael Tsirkin
> 
> > virtio-style network devices (where the producer and consumer chase
> > each other through a shared memory block) can enter into a
> > bad operating regime when the consumer is too fast.
> > 
> > I am hitting this case right now when virtio is used on top of the
> > netmap/VALE backend that I posted a few weeks ago: what happens is that
> > the backend is so fast that the io thread keeps re-enabling notifications
> > every few packets.  This results, on my test machine, in a throughput of
> > 250-300Kpps (and extremely unstable, oscillating between 200 and 600Kpps).
> > 
> > I'd like to get some feedback on the following trivial patch to have
> > the thread keep spinning for a bounded amount of time when the producer
> > is slower than the consumer. This gives a relatively stable throughput
> > between 700 and 800 Kpps (we have something similar in our paravirtualized
> > e1000 driver, which is slightly faster at 900-1100 Kpps).
> 
> Did you experiment with tx timer instead of bh?  It seems that
> hw/virtio-net.c has two tx mitigation strategies - the bh approach that
> you've tweaked and a true timer.
> 
> It seems you don't really want tx batching but you do want to avoid
> guest->host notifications?

I have just tried:

    bh (with my tweaks):       700-800 Kpps  (large variance)
    timer (150us, 256 slots):  ~490 Kpps     (very stable)

As expected, the throughput in the timer version seems proportional
to ring_size/timeout: e.g. 1 ms and 256 slots give approx 210 Kpps,
and 1 ms and 128 slots give 108 Kpps; but it saturates around 490 Kpps.
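A quick sanity check of that proportionality (my arithmetic, assuming
at most one full ring is drained per timer tick):

    256 slots / 1 ms    = 256 Kpps upper bound  (measured ~210 Kpps)
    128 slots / 1 ms    = 128 Kpps upper bound  (measured ~108 Kpps)
    256 slots / 150 us  ~ 1.7 Mpps upper bound  (measured ~490 Kpps,
                          so something else saturates first)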

cheers
luigi

> > EXISTING LOGIC: reschedule the bh when at least a burst of packets has
> >been received. Otherwise enable notifications and do another
> >race-prevention check.
> > 
> > NEW LOGIC: when the bh is scheduled the first time, establish a
> >budget (currently 20) of reschedule attempts.
> >Every time virtio_net_flush_tx() returns 0 packets [this could 
> >be changed to 'less than a full burst'] the counter is increased.
> >The bh is rescheduled until the counter reaches the budget,
> >at which point we re-enable notifications as above.
> > 
> > I am not 100% happy with this patch because there is a "magic"
> > constant (the maximum number of retries) which should be really
> > adaptive, or perhaps we should even bound the amount of time the
> > bh runs, rather than the number of retries.
> > In practice, having the thread spin (or sleep) for 10-20us 
> > even without new packets is probably preferable to 
> > re-enable notifications and request a kick.
> > 
> > opinions ?
> > 
> > cheers
> > luigi
> > 
> > diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> > index 573c669..5389088 100644
> > --- a/hw/virtio-net.c
> > +++ b/hw/virtio-net.c
> > @@ -49,6 +51,7 @@ typedef struct VirtIONet
> >  NICState *nic;
> >  uint32_t tx_timeout;
> >  int32_t tx_burst;
> > +int32_t tx_retries; /* keep spinning a bit on tx */
> >  uint32_t has_vnet_hdr;
> >  size_t host_hdr_len;
> >  size_t guest_hdr_len;
> > @@ -1062,7 +1065,9 @@ static void virtio_net_tx_bh(void *opaque)
> > 
> >  /* If we flush a full burst of packets, assume there are
> >   * more coming and immediately reschedule */
> > -if (ret >= n->tx_burst) {
> > +if (ret == 0)
> > +   n->tx_retries++;
> > +if (n->tx_retries < 20) {
> >  qemu_bh_schedule(q->tx_bh);
> >  q->tx_waiting = 1;
> >  return;
> > @@ -1076,6 +1082,8 @@ static void virtio_net_tx_bh(void *opaque)
> >  virtio_queue_set_notification(q->tx_vq, 0);
> >  qemu_bh_schedule(q->tx_bh);
> >  q->tx_waiting = 1;
> > +} else {
> > +   n->tx_retries = 0;
> >  }
> >  }
> > 
> > 
> > 



Re: [Qemu-devel] RFC: handling "backend too fast" in virtio-net

2013-02-18 Thread Luigi Rizzo
On Mon, Feb 18, 2013 at 1:50 AM, Stefan Hajnoczi  wrote:

> On Fri, Feb 15, 2013 at 11:24:29AM +0100, Stefan Hajnoczi wrote:
> > On Thu, Feb 14, 2013 at 07:21:57PM +0100, Luigi Rizzo wrote:
> >
> > CCed Michael Tsirkin
> >
> > > virtio-style network devices (where the producer and consumer chase
> > > each other through a shared memory block) can enter into a
> > > bad operating regime when the consumer is too fast.
> > >
> > > I am hitting this case right now when virtio is used on top of the
> > > netmap/VALE backend that I posted a few weeks ago: what happens is that
> > > the backend is so fast that the io thread keeps re-enabling
> notifications
> > > every few packets.  This results, on my test machine, in a throughput
> of
> > > 250-300Kpps (and extremely unstable, oscillating between 200 and
> 600Kpps).
> > >
> > > I'd like to get some feedback on the following trivial patch to have
> > > the thread keep spinning for a bounded amount of time when the producer
> > > is slower than the consumer. This gives a relatively stable throughput
> > > between 700 and 800 Kpps (we have something similar in our
> paravirtualized
> > > e1000 driver, which is slightly faster at 900-1100 Kpps).
> >
> > Did you experiment with tx timer instead of bh?  It seems that
> > hw/virtio-net.c has two tx mitigation strategies - the bh approach that
> > you've tweaked and a true timer.
> >
> > It seems you don't really want tx batching but you do want to avoid
> > guest->host notifications?
>
> One more thing I forgot: virtio-net does not use ioeventfd by default.
> ioeventfd changes the cost of guest->host notifications because the
> notification becomes an eventfd signal inside the kernel and kvm.ko then
> re-enters the guest.
>
> This means a guest->host notification becomes a light-weight exit and we
> don't return from ioctl(KVM_RUN).
>
> Perhaps -device virtio-blk-pci,ioeventfd=on will give similar results to
> your patch?
>

Is ioeventfd the mechanism used by vhost-net?
If so, Giuseppe Lettieri (in Cc) has tried that with
a modified netmap backend and experienced the same
problem -- making the io thread (user or kernel)
spin a bit more greatly benefits throughput.

cheers
luigi


[Qemu-devel] frame reordering in qemu_net_queue_send() ?

2012-05-30 Thread Luigi Rizzo
Hi,
while investigating rx performance for emulated network devices
(I am looking at the userspace version, relying on net=tap
or similar approaches) I noticed the code
in net/queue.c :: qemu_net_queue_send(),
which looks strange to me (the same goes for the iov version).

The whole function is below, just for reference.
My impression is that the call to qemu_net_queue_flush()
is misplaced: it should come before qemu_net_queue_deliver(),
otherwise the current packet is pushed out ahead of anything
already in the queue.

Does it make sense ?

cheers
luigi

ssize_t qemu_net_queue_send(NetQueue *queue,
VLANClientState *sender,
unsigned flags,
const uint8_t *data,
size_t size,
NetPacketSent *sent_cb)
{
ssize_t ret;

if (queue->delivering) {
        return qemu_net_queue_append(queue, sender, flags, data, size, NULL);
}

ret = qemu_net_queue_deliver(queue, sender, flags, data, size);
if (ret == 0) {
qemu_net_queue_append(queue, sender, flags, data, size, sent_cb);
return 0;
}

qemu_net_queue_flush(queue);

return ret;
}
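
For clarity, a sketch of the suggested reordering (assuming the flush
is safe at this point; error handling as in the original):

    ssize_t qemu_net_queue_send(NetQueue *queue,
                                VLANClientState *sender,
                                unsigned flags,
                                const uint8_t *data,
                                size_t size,
                                NetPacketSent *sent_cb)
    {
        ssize_t ret;

        if (queue->delivering) {
            return qemu_net_queue_append(queue, sender, flags, data, size, NULL);
        }

        /* drain anything already queued first, so the new packet
         * cannot overtake older ones */
        qemu_net_queue_flush(queue);

        ret = qemu_net_queue_deliver(queue, sender, flags, data, size);
        if (ret == 0) {
            qemu_net_queue_append(queue, sender, flags, data, size, sent_cb);
            return 0;
        }

        return ret;
    }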




[Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c

2012-05-30 Thread Luigi Rizzo
Hi,
while testing qemu with netmap (see [Note 1] for details) on e1000
emulation, I noticed that my sender program, using a custom backend
[Note 2], could reach 1 Mpps (million packets per second), but on the
receive side I was limited to 50 Kpps (with CPU always below 5%).

The problem was fixed by the following one-line addition to
hw/e1000.c :: e1000_mmio_write(), to wake up the qemu main loop and
check whether buffers might have become available.

--- hw/e1000.c.orig  2012-02-17 20:45:39.0 +0100
+++ hw/e1000.c  2012-05-30 20:01:52.0 +0200
@@ -919,6 +926,7 @@
 DBGOUT(UNKNOWN, "MMIO unknown write addr=0x%08x,val=0x%08"PRIx64"\n",
index<<2, val);
 }
+qemu_notify_event();
 }

 static uint64_t

With this fix, the read throughput reaches 1 Mpps matching the write
speed. Now the system becomes CPU-bound, but this opens the way to
more optimizations in the emulator.

The same problem seems to exist on other network drivers, e.g.
hw/rtl8139.c and others. The only one that seems to get it
right is virtio-net.c

I think it would be good if this change could make it into
the tree.

[Note 1] Netmap ( http://info.iet.unipi.it/~luigi/netmap )
is an efficient mechanism for packet I/O that bypasses
the network stack and provides protected access to the
network adapter from userspace.
It works especially well on top of qemu because the
kernel needs only to trap a single register access
for each batch of packets.

[Note 2] the custom backend is a virtual local ethernet
called VALE, implemented as a kernel module on the host,
that extends netmap to implement communication
between virtual machines.
VALE is extremely efficient, currently delivering about
10 Mpps with 60-byte frames and 5 Mpps with 1500-byte frames.
The 1 Mpps rates I mentioned are obtained between qemu instances
running in userspace on FreeBSD (no kernel acceleration whatsoever),
using VALE as the communication mechanism.


cheers
luigi
-----------------------------------------+-------------------------------
  Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
  http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
-----------------------------------------+-------------------------------



Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c

2012-05-30 Thread Luigi Rizzo
> On Wed, May 30, 2012 at 10:23:11PM +0200, Luigi Rizzo wrote:
...
> > The problem was fixed by the following one-line addition to
> > hw/e1000.c :: e1000_mmio_write() , to wakeup the qemu mainloop and
> > check that some buffers might be available.
> > 
> > --- hw/e1000.c.orig  2012-02-17 20:45:39.0 +0100
> > +++ hw/e1000.c  2012-05-30 20:01:52.0 +0200
> > @@ -919,6 +926,7 @@
> >  DBGOUT(UNKNOWN, "MMIO unknown write addr=0x%08x,val=0x%08"PRIx64"\n",
> > index<<2, val);
> >  }
> > +qemu_notify_event();
> >  }
> > 
> >  static uint64_t
> > 
> > With this fix, the read throughput reaches 1 Mpps matching the write
> > speed. Now the system becomes CPU-bound, but this opens the way to
> > more optimizations in the emulator.
> > 
> > The same problem seems to exist on other network drivers, e.g.
> > hw/rtl8139.c and others. The only one that seems to get it
> > right is virtio-net.c
> > 
> > I think it would be good if this change could make it into
> > the tree.
> > 
> > [Note 1] Netmap ( http://info.iet.unipi.it/~luigi/netmap )
> > is an efficient mechanism for packet I/O that bypasses
> > the network stack and provides protected access to the
> > network adapter from userspace.
> > It works especially well on top of qemu because the
> > kernel needs only to trap a single register access
> > for each batch of packets.
> > 
> > [Note 2] the custom backend is a virtual local ethernet
> > called VALE, implemented as a kernel module on the host,
> > that extends netmap to implement communication
> > between virtual machines.
> > VALE is extremely efficient, currently delivering about
> > 10~Mpps with 60-byte frames, and 5~Mpps with 1500-byte frames.
> > The 1 Mpps rates i mentioned are obtained between qemu instances
> > running in userspace on FreeBSD (no kernel acceleration whatsoever)
> > and using VALE as a communication mechanism.
> 
> "Custom backend" == you patched QEMU? Or what backend are you using?
> 
> This sounds a lot like [1] and suggests that you are either a) using
> slirp in a version that doesn't contain that fix yet (before 1.1-rcX) or
> b) wrote a backend that suffers from a similar bug.
> 
> Jan
> 
> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/144433

my custom backend is the one in [Note 2] above.
It replaces the -net pcap/user/tap/socket option, which defines
how qemu communicates with the host network device.

The problem is not in my module, but rather in the emulated
device exposed to the guest, and I presume this is the same thing
you fixed in the "slirp" patch.
I checked the git version http://git.qemu.org/qemu.git
and most guest-side devices have the same problem;
only virtio-net does the notification.

cheers
luigi



Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c

2012-05-30 Thread Luigi Rizzo
On Wed, May 30, 2012 at 11:39 PM, Jan Kiszka  wrote:

> Please keep CCs.
>
> ok


> On 2012-05-30 23:23, Luigi Rizzo wrote:
> >> On Wed, May 30, 2012 at 10:23:11PM +0200, Luigi Rizzo wrote:
> > ...
> >>> The problem was fixed by the following one-line addition to
> >>> hw/e1000.c :: e1000_mmio_write() , to wakeup the qemu mainloop and
> >>> check that some buffers might be available.
> >>>
> >>> --- hw/e1000.c.orig  2012-02-17 20:45:39.0 +0100
> >>> +++ hw/e1000.c  2012-05-30 20:01:52.0 +0200
> >>> @@ -919,6 +926,7 @@
> >>>  DBGOUT(UNKNOWN, "MMIO unknown write addr=0x%08x,val=0x%08"PRIx64"\n",
> >>> index<<2, val);
> >>>  }
> >>> +qemu_notify_event();
> >>>  }
> >>>
> >>>  static uint64_t
> >>>
> >>> With this fix, the read throughput reaches 1 Mpps matching the write
> >>> speed. Now the system becomes CPU-bound, but this opens the way to
> >>> more optimizations in the emulator.
> >>>
> >>> The same problem seems to exist on other network drivers, e.g.
> >>> hw/rtl8139.c and others. The only one that seems to get it
> >>> right is virtio-net.c
> >>>
> >>> I think it would be good if this change could make it into
> >>> the tree.
> >>>
> >>> [Note 1] Netmap ( http://info.iet.unipi.it/~luigi/netmap )
> >>> is an efficient mechanism for packet I/O that bypasses
> >>> the network stack and provides protected access to the
> >>> network adapter from userspace.
> >>> It works especially well on top of qemu because the
> >>> kernel needs only to trap a single register access
> >>> for each batch of packets.
> >>>
> >>> [Note 2] the custom backend is a virtual local ethernet
> >>> called VALE, implemented as a kernel module on the host,
> >>> that extends netmap to implement communication
> >>> between virtual machines.
> >>> VALE is extremely efficient, currently delivering about
> >>> 10~Mpps with 60-byte frames, and 5~Mpps with 1500-byte frames.
> >>> The 1 Mpps rates i mentioned are obtained between qemu instances
> >>> running in userspace on FreeBSD (no kernel acceleration whatsoever)
> >>> and using VALE as a communication mechanism.
> >>
> >> "Custom backend" == you patched QEMU? Or what backend are you using?
> >>
> >> This sounds a lot like [1] and suggests that you are either a) using
> >> slirp in a version that doesn't contain that fix yet (before 1.1-rcX) or
> >> b) wrote a backend that suffers from a similar bug.
> >>
> >> Jan
> >>
> >> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/144433
> >
> > my custom backend is the one in [Note 2] above.
> > It replaces the -net pcap/user/tap/socket option which defines
> > how qemu communicate with the host network device.
>
> Any code to share? It's hard to discuss just concepts.


you can take the FreeBSD image from the netmap page in my link, run it
in qemu, and then run the pkt-gen program in the image in either
send or receive mode. But this is overkill, as you have described the
problem exactly in your post: when the guest reads packets from
the emulated device (e1000 in my case, but most devices have the
problem), qemu fails to wake up the thread blocked in main_loop_wait().

I am unclear on the terminology (which is the frontend and which is
the backend?), but it is the guest side that has to wake up the qemu
process: the file descriptor talking to the host side (tap, socket,
bpf ...) has already fired its events, and the only thing it could do
is cause a busy wait if it keeps passing a readable file descriptor
to select().

I thought your slirp.c patch was also on the same "side" as e1000.c

cheers
luigi

>
> > The problem is not in my module, but rather in the emulation
> > device exposed to the guest, and i presume this is the same thing
> > you fixed in the "slirp" patch.
> > I checked the git version http://git.qemu.org/qemu.git
> > and most guest-side devices have the same problem,
> > only virtio-net does the notification.
>
> And that is most likely wrong. The bug I cited was not a front-end issue
> but clearly one of the backend. It lacked kicking of the io-thread once
> its queue state changed in a way that was not reported otherwise (via
> some file descriptor the io-thread is subscribed to). If your backend
> creates such states as well, it has to fix it similarly.
>
> Again, discussing this abstractly is not very efficient.
>
> Jan
>
>
>


Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c

2012-05-30 Thread Luigi Rizzo
On Thu, May 31, 2012 at 12:11 AM, Jan Kiszka  wrote:

> On 2012-05-30 23:55, Luigi Rizzo wrote:
> > you can take the freebsd image from the netmap page in my link and run it
> > in qemu, and then run the pkt-gen program in the image in either
> > send or receive mode. But this is overkill, as you have described the
> > problem exactly in your post: when the guest reads the packets from
> > the emulated device (e1000 in my case, but most of them have the
> > problem) it fails to wake up the thread blocked in main_loop_wait().
>
> OK, so there are no QEMU code changes involved? Then, what is your
> command line and what is the QEMU version?
>
> # x86_64-softmmu/qemu-system-x86_64 -version
QEMU emulator version 1.0,1, Copyright (c) 2003-2008 Fabrice Bellard

the command line for my test is (this is on FreeBSD 9, with a pure
userland qemu without any kqemu support):

x86_64-softmmu/qemu-system-x86_64 -m 512 -hda picobsd.bin -net
nic,model=e1000 -net netmap,ifname=valeXX

-net netmap is my fast bridge; code will be available in a couple of
days. In the meantime you can have a look at the netmap link in my
original post to see how qemu accesses the host adapter and how fast
it is.
The problem also exists if you use -net tap, ..., just at a lower
speed, and maybe too low to notice.

The image contains my fast packet generator "pkt-gen" (a stock traffic
generator such as netperf etc. is too slow to show the problem). pkt-gen
can send about 1 Mpps in this configuration, using -net netmap in the
backend; the qemu process in this case takes 100% CPU.
On the receive side, I cannot receive more than 50 Kpps, even if I
flood the bridge with a huge amount of traffic. The qemu process stays
at 5% CPU or less.

Then I read the docs in main-loop.h, which say that one case where
qemu_notify_event() is needed is when using qemu_set_fd_handler2(),
which is exactly what my backend uses (similar to tap.c).
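
For reference, the pattern in question, in the style of net/tap.c
(qemu 1.x API; the netmap_* names and state struct are made up here):

    /* fd_read_poll tells the main loop whether the fd should be watched
     * at all: when it returns 0 the fd is dropped from select(), and
     * qemu_notify_event() must be called once it could return non-zero
     * again, or the backend stays asleep. */
    static int netmap_can_read(void *opaque)
    {
        NetmapState *s = opaque;                /* hypothetical */
        return qemu_can_send_packet(&s->nc);    /* frontend has rx room? */
    }

    qemu_set_fd_handler2(s->fd, netmap_can_read, netmap_read, NULL, s);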

Once I add the notify, the receiver can do 1 Mpps and can use 100% CPU
(and it is not in a busy wait: if the offered traffic goes down, the
qemu process uses less and less CPU).



> >
> > I am unclear on the terminology (what is frontend and what is backend ?)
>
> Frontend is the emulated NIC here, backend the host-side interface, i.e.
> slirp ("user"), tap, socket.
>

thanks for the clarification.

>
> > but  it is the guest side that has to wake up the qemu process: the file
> > descriptor talking to the host side (tap, socket, bpf ...) has already
> > fired its events and the only thing it could do is cause a busy wait
> > if it keeps passing a readable file descriptor to select.
>
> Can't follow. How did you analyze your delays?
>
> see above.


> >
> > I thought your slirp.c patch was also on the same "side" as e1000.c
>
> It was only in the backend, not any frontend driver. It's generally no
> business of the frontend driver to kick (virtio has some exceptional path).
>
> BTW, your patch is rather untargeted as it kicks on every MMIO write of
> the e1000. Hard to asses what actually makes the difference for you.
>

in my case it is easily stated: the process that reads from the
interface in the guest does only one read and one write access to the
RDT register for each batch of packets. The RDT write is the one that
frees the rx buffers.

Surely the patch could be made more specific, but this requires a lot
of investigation into which register accesses may require attention;
there are so many of them (say, if the guest decides to reset the ring,
the card, etc.) that I don't think it is worth the trouble.

cheers
luigi


Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c

2012-05-31 Thread Luigi Rizzo
On Thu, May 31, 2012 at 9:38 AM, Paolo Bonzini  wrote:

> On 31/05/2012 00:53, Luigi Rizzo wrote:
> > The image contains my fast packet generator "pkt-gen" (a stock
> > traffic generator such as netperf etc. is too slow to show the
> > problem). pkt-gen can send about 1Mpps in this configuration using
> > -net netmap in the backend. The qemu process in this case takes 100%
> > CPU. On the receive side, i cannot receive more than 50Kpps, even if i
> > flood the bridge with a a huge amount of traffic. The qemu process stays
> > at 5% cpu or less.
> >
> > Then i read on the docs in main-loop.h which says that one case where
> > the qemu_notify_event() is needed is when using
> > qemu_set_fd_handler2(), which is exactly what my backend uses
> > (similar to tap.c)
>
> The path is a bit involved, but I think Luigi is right.  The docs say
> "Remember to call qemu_notify_event whenever the [return value of the
> fd_read_poll callback] may change from false to true."  Now net/tap.c has
>
...


> So as a conservative approximation, you need to fire qemu_notify_event
> whenever you write to RDH, RDT, RDLEN and RCTL, or when check_rxov becomes
> zero.  In practice, only RDT, RCTL and check_rxov matter.  Luigi, does this
> patch work for you?
>

It almost surely works (cannot check today), as my (guest) driver
modifies RDT to notify the hardware of read packets. I know/mentioned
that the notification can be optimized and sent only in specific cases,
as you describe above (I might be missing the check_rxov).

But I think it would be more robust to make fewer assumptions about
what the guest does, and send the notify on all register writes
(those are relatively rare in a driver, and the datapath touches
exactly the registers we ought to be interested in),
and possibly on those reads that may have side effects, such as interrupt
registers, together with a big comment in the code explaining
why we need to call qemu_notify_event().
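
A minimal sketch of that blanket approach (hypothetical; the real e1000
MMIO dispatch is more involved, and the _real suffix is a placeholder):

static void e1000_mmio_write(void *opaque, hwaddr addr,
                             uint64_t val, unsigned size)
{
    /* Kick the iothread on every register write, so any access that
     * could unblock reception re-runs the main loop's poll().
     * Coarse, but hard to get wrong; costs one wakeup per write. */
    e1000_mmio_write_real(opaque, addr, val, size);
    qemu_notify_event();
}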

Actually, what would pay off even more is to devise a way to avoid the
write() syscall in qemu_notify_event() if the other process (or is it a
thread?) has data queued.
This is not so important in my netmap driver, but a standard driver is
likely to update the RDT on every single packet, which would pretty much
kill performance.

cheers
luigi
-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c

2012-05-31 Thread Luigi Rizzo
On Thu, May 31, 2012 at 10:23 AM, Jan Kiszka  wrote:

> On 2012-05-31 09:38, Paolo Bonzini wrote
>
...

>
> This still looks like the wrong tool: Packets that can't be delivered
> are queued. So we need to flush the queue and clear the blocked delivery
> there. qemu_flush_queued_packets appears more appropriate for this.
>
> Conceptually, the backend should be responsible for kicking the iothread
> as needed.
>
>
As I understand the code, the backend _is_ the iothread, and it is
sleeping when the frontend becomes able to receive again.

cheers
luigi

-- 
-+-------
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


Re: [Qemu-devel] frame reordering in qemu_net_queue_send() ?

2012-05-31 Thread Luigi Rizzo
On Thu, May 31, 2012 at 11:18 AM, Stefan Hajnoczi wrote:

> On Wed, May 30, 2012 at 8:59 AM, Luigi Rizzo  wrote:
> > Hi,
> > while investigating rx performance for emulated network devices
> > (i am looking at the userspace version, relying on net=tap
> > or similar approaches) i noticed the code
> > in net/queue.c :: qemu_net_queue_send()
> > which look strange to me (same goes for the iov version).
> >
> > The whole function is below, just for reference.
> > My impression is that the call to qemu_net_queue_flush()
> > is misplaced, and it should be before qemu_net_queue_deliver()
> > otherwise the current packet is pushed out before anything
> > was already in the queue.
> >
> > Does it make sense ?
> >
> > cheers
> > luigi
> >
> >ssize_t qemu_net_queue_send(NetQueue *queue,
> >VLANClientState *sender,
> >unsigned flags,
> >const uint8_t *data,
> >size_t size,
> >NetPacketSent *sent_cb)
> >{
> >ssize_t ret;
> >
> >if (queue->delivering) {
> >return qemu_net_queue_append(queue, sender, flags, data,
> size, NULL);
> >}
> >
> >ret = qemu_net_queue_deliver(queue, sender, flags, data, size);
> >if (ret == 0) {
> >qemu_net_queue_append(queue, sender, flags, data, size,
> sent_cb);
> >return 0;
> >}
> >
> >qemu_net_queue_flush(queue);
>
> Think of the case where delivering a packet causes additional
> (re-entrant) qemu_net_queue_send() calls to this queue.  We'll be in
> ->delivering state and queue those packets.  After we've finished
> delivering the initial packet we flush the queue.
>
> This is a weird case but this is how I read the intention of the code.
>

I was under the impression that qemu_net_queue_deliver() may also return 0
if the other side (frontend network device) has no room for the packet.
This would cause the queue to contain data on entry to the next call, even
without reentrancy. Consider this:

t0. qemu_net_queue_send(pkt-A)
    qemu_net_queue_deliver() returns 0, pkt-A is queued and we return
t1. qemu_net_queue_send(pkt-B)
    qemu_net_queue_deliver() returns successfully, pkt-B is sent to the
    frontend; then we call qemu_net_queue_flush() and this sends pkt-A
    to the frontend, in the wrong order.
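
In code, the reordering suggested above would look roughly like this (a
hypothetical sketch, not the committed code; whether flushing here is
safe is exactly what is under discussion):

ssize_t qemu_net_queue_send(NetQueue *queue, VLANClientState *sender,
                            unsigned flags, const uint8_t *data,
                            size_t size, NetPacketSent *sent_cb)
{
    ssize_t ret;

    if (queue->delivering) {
        return qemu_net_queue_append(queue, sender, flags, data, size, NULL);
    }

    /* Drain pkt-A (queued at t0) before touching pkt-B, so FIFO order
     * holds even when an earlier delivery returned 0. */
    if (!QTAILQ_EMPTY(&queue->packets)) {
        qemu_net_queue_flush(queue);
    }
    if (!QTAILQ_EMPTY(&queue->packets)) {
        /* Still blocked: queue behind the older packets. */
        qemu_net_queue_append(queue, sender, flags, data, size, sent_cb);
        return 0;
    }

    ret = qemu_net_queue_deliver(queue, sender, flags, data, size);
    if (ret == 0) {
        qemu_net_queue_append(queue, sender, flags, data, size, sent_cb);
        return 0;
    }

    qemu_net_queue_flush(queue);
    return ret;
}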

cheers
luigi

-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


Re: [Qemu-devel] frame reordering in qemu_net_queue_send() ?

2012-05-31 Thread Luigi Rizzo
On Thu, May 31, 2012 at 03:23:12PM +0200, Jan Kiszka wrote:
> On 2012-05-31 15:19, Paolo Bonzini wrote:
> > Il 31/05/2012 15:07, Zhi Yong Wu ha scritto:
> >> Yeah, this case actually exists, but tcp/ip protocol stack in guest
> >> will make sure this ordering will finally be correct.
> > 
> > Nevertheless it's not good, and the latest Windows Logo tests will fail
> > if you reorder frames.
> 
> Can we really enter qemu_net_queue_send with delivering == 0 and a
> non-empty queue?

I have no idea.

It doesn't help that comments are used very, very
sparingly throughout the code...

cheers
luigi



Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c

2012-06-01 Thread Luigi Rizzo
Works for me. I can now receive at 1.15 Mpps, slightly
faster than with my previous patch, which generated unnecessary
writes to the signalling socket.

Tested-by: Luigi Rizzo 

On Thu, May 31, 2012 at 12:03 PM, Paolo Bonzini  wrote:

> Il 31/05/2012 10:23, Jan Kiszka ha scritto:
> >> > @@ -922,6 +923,7 @@ set_rdt(E1000State *s, int index, uint32_t val)
> >> >  {
> >> >  s->check_rxov = 0;
> >> >  s->mac_reg[index] = val & 0xffff;
> >> > +qemu_notify_event();
> > This still looks like the wrong tool: Packets that can't be delivered
> > are queued.
>
> Packets that are read from the tap but can't be delivered are queued;
> packets that are left on the tap need qemu_notify_event to be flushed.
>
> > So we need to flush the queue and clear the blocked delivery
> > there. qemu_flush_queued_packets appears more appropriate for this.
>
> Right, and qemu_flush_queued_packets needs to call qemu_notify_event
> which makes the call in virtio-net unnecessary.
>
> Paolo
>
> diff --git a/hw/e1000.c b/hw/e1000.c
> index 4573f13..43d933a 100644
> --- a/hw/e1000.c
> +++ b/hw/e1000.c
> @@ -295,6 +295,7 @@ set_rx_control(E1000State *s, int index, uint32_t val)
> s->rxbuf_min_shift = ((val / E1000_RCTL_RDMTS_QUAT) & 3) + 1;
> DBGOUT(RX, "RCTL: %d, mac_reg[RCTL] = 0x%x\n", s->mac_reg[RDT],
>s->mac_reg[RCTL]);
> +qemu_flush_queued_packets(&s->nic->nc);
>  }
>
>  static void
> @@ -926,6 +927,9 @@ set_rdt(E1000State *s, int index, uint32_t val)
>  {
> s->check_rxov = 0;
> s->mac_reg[index] = val & 0xffff;
> +if (e1000_has_rxbufs(s, 1)) {
> +qemu_flush_queued_packets(&s->nic->nc);
> +}
>  }
>
>  static void
> diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> index 3f190d4..0974945 100644
> --- a/hw/virtio-net.c
> +++ b/hw/virtio-net.c
> @@ -447,10 +447,6 @@ static void virtio_net_handle_rx(VirtIODevice *vdev,
> VirtQueue *vq)
> VirtIONet *n = to_virtio_net(vdev);
>
> qemu_flush_queued_packets(&n->nic->nc);
> -
> -/* We now have RX buffers, signal to the IO thread to break out of the
> - * select to re-poll the tap file descriptor */
> -qemu_notify_event();
>  }
>
>  static int virtio_net_can_receive(VLANClientState *nc)
> diff --git a/net.c b/net.c
> index 1922d8a..fa846ae 100644
> --- a/net.c
> +++ b/net.c
> @@ -491,7 +491,12 @@ void qemu_flush_queued_packets(VLANClientState *vc)
> queue = vc->send_queue;
> }
>
> -qemu_net_queue_flush(queue);
> +if (qemu_net_queue_flush(queue)) {
> +/* We emptied the queue successfully, signal to the IO thread to repoll
> + * the file descriptor (for tap, for example).
> + */
> +qemu_notify_event();
> +}
>  }
>
>  static ssize_t qemu_send_packet_async_with_flags(VLANClientState *sender,
> diff --git a/net/queue.c b/net/queue.c
> index 1ab5247..fd1c7e6 100644
> --- a/net/queue.c
> +++ b/net/queue.c
> @@ -232,7 +232,7 @@ void qemu_net_queue_purge(NetQueue *queue,
> VLANClientState *from)
> }
>  }
>
> -void qemu_net_queue_flush(NetQueue *queue)
> +bool qemu_net_queue_flush(NetQueue *queue)
>  {
> while (!QTAILQ_EMPTY(&queue->packets)) {
> NetPacket *packet;
> @@ -248,7 +248,7 @@ void qemu_net_queue_flush(NetQueue *queue)
>  packet->size);
> if (ret == 0) {
> QTAILQ_INSERT_HEAD(&queue->packets, packet, entry);
> -break;
> +return 0;
> }
>
> if (packet->sent_cb) {
> @@ -257,4 +257,5 @@ void qemu_net_queue_flush(NetQueue *queue)
>
> g_free(packet);
> }
> +return 1;
>  }
> diff --git a/net/queue.h b/net/queue.h
> index a31958e..4bf6d3c 100644
> --- a/net/queue.h
> +++ b/net/queue.h
> @@ -66,6 +66,6 @@ ssize_t qemu_net_queue_send_iov(NetQueue *queue,
> NetPacketSent *sent_cb);
>
>  void qemu_net_queue_purge(NetQueue *queue, VLANClientState *from);
> -void qemu_net_queue_flush(NetQueue *queue);
> +bool qemu_net_queue_flush(NetQueue *queue);
>
>  #endif /* QEMU_NET_QUEUE_H */
>



-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


[Qemu-devel] VALE, a Virtual Local Ethernet. http://info.iet.unipi.it/~luigi/vale/

2012-06-08 Thread Luigi Rizzo
We have just completed a netmap extension that lets you build a
local high speed switch called VALE, which I think can be very useful.

   http://info.iet.unipi.it/~luigi/vale/

VALE is a software Virtual Local Ethernet whose ports are accessible
using the netmap API. Designed to be used as the interconnect between
virtual machines (or as a fast local bus), it works as a learning
bridge and supports speeds of up to 20 Mpps with short frames, and
an aggregate 70 Gbit/s with 1514-byte packets. The VALE paper
contains more details and performance measurements.

VALE is implemented as a small extension of the netmap module, and
is available for FreeBSD and Linux. The source code includes a
backend for qemu and KVM, so you can use VALE to interconnect virtual
machines launching them with

   qemu -net nic -net netmap,ifname=vale0 ...
   qemu -net nic -net netmap,ifname=vale1 ...
   ...

Processes can talk to a VALE switch too, so you can use the pkt-gen
or bridge tools that are part of the netmap distribution, or even
the pcap.c module that maps libpcap calls into netmap equivalents.
This lets you use VALE for all sorts of pcap-based applications.
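
For the curious, attaching a process to a VALE port uses the plain netmap
open sequence, roughly as below (a minimal sketch with no error handling;
details may vary across netmap versions):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

int vale_attach(const char *port)       /* e.g. "vale0" */
{
    struct nmreq req;
    int fd = open("/dev/netmap", O_RDWR);

    memset(&req, 0, sizeof(req));
    req.nr_version = NETMAP_API;
    strncpy(req.nr_name, port, sizeof(req.nr_name) - 1);
    ioctl(fd, NIOCREGIF, &req);         /* create/attach a switch port */

    /* map the shared rings; NETMAP_IF() yields the interface view,
     * NETMAP_TXRING()/NETMAP_RXRING() the individual rings */
    void *mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    struct netmap_if *nifp = NETMAP_IF(mem, req.nr_offset);
    (void)nifp;
    return fd;                          /* poll()/NIOCTXSYNC drive I/O */
}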

More details, code, bootable images on the VALE page,

   http://info.iet.unipi.it/~luigi/vale/

feedback welcome, as usual.

cheers
luigi


Re: [Qemu-devel] net: Adding netmap network backend

2014-02-14 Thread Luigi Rizzo
On Fri, Feb 14, 2014 at 2:20 AM, Vincenzo Maffione wrote:

> Yes, for sure we need to do a check.
>
> However, this would involve - I think - some non-trivial modifications to
> net/netmap.c, because without NS_MOREFRAG you cannot split a packet over
> more "netmap slots/descriptors" (both tx and rx side)
>
> Therefore I would ask (mainly Luigi, since netmap is in-tree in FreeBSD):
> Wouldn't it be better to force --disable-netmap when we realize that
> NETMAP_API (version number for the netmap API) is less than the value
> required by QEMU? This can be done in ./configure. In this way we keep the
> QEMU code cleaner.
>

Yes, we should do exactly what Vincenzo suggests.

cheers
luigi


Re: [Qemu-devel] [PATCH] net: Disable netmap backend when not supported

2014-02-19 Thread Luigi Rizzo
On Wed, Feb 19, 2014 at 7:30 AM, Stefan Hajnoczi  wrote:

> On Fri, Feb 14, 2014 at 05:40:24PM +0100, Vincenzo Maffione wrote:
> > This patch fixes configure so that netmap is not compiled in if the
> > host doesn't support an API version >= 11.
> >
> > Moreover, some modifications have been done to net/netmap.c in
> > order to reflect the current netmap API (11).
> >
> > Signed-off-by: Vincenzo Maffione 
> > ---
> >  configure|  3 +++
> >  net/netmap.c | 57
> ++---
> >  2 files changed, 17 insertions(+), 43 deletions(-)
> >
> > diff --git a/configure b/configure
> > index 88133a1..61eb932 100755
> > --- a/configure
> > +++ b/configure
> > @@ -2118,6 +2118,9 @@ if test "$netmap" != "no" ; then
> >  #include 
> >  #include 
> >  #include 
> > +#if (NETMAP_API < 11) || (NETMAP_API > 15)
> > +#error
> > +#endif
>
> Why error when NETMAP_API > 15?
>

This is meant to simulate a minor/major version number.
We will mark minor new features with values up to 15,
and if something happens that requires a change in the
backend we will move above 15, at which point we
will submit backend fixes and also the necessary
update to ./configure.

  

> > -ring->cur = NETMAP_RING_NEXT(ring, i);
> > -ring->avail--;
> > +ring->cur = ring->head = nm_ring_next(ring, i);
> >  ioctl(s->me.fd, NIOCTXSYNC, NULL);
> >
> >  return size;
>
> Are these changes related to the NETMAP_WITH_LIBS macro?  Please do that
> in a separate patch so we keep the version checking change separate from
> the NETMAP_WITH_LIBS change.
>

netmap version 11 and above has NETMAP_WITH_LIBS,
while previous versions do not, so this ./configure
change has to go together with the change in the backend.

The netmap 11 code has already been committed to the FreeBSD
source repositories (for HEAD, 10 and 9) and to
code.google.com/p/netmap/ (for those who want it on linux).

So there is really no point, in my opinion, to make one
intermediate commit just to ./configure to disable
netmap detection on FreeBSD (it is already off on linux),
immediately followed by this one that uses the new feature.

Just to clarify: with one exception (fields in struct netmap_ring)
the netmap changes that we have are not at the kernel/user boundary
but in a library which avoids replicating long and boring code
(interface name parsing, parameter setting) in applications.

Avoiding the single incompatible struct change would of course
have been possible, but at the cost of
extra complexity in the kernel and in userspace
(to support two slightly different interfaces).
Since netmap is (so far) relatively little used, I thought it
was more important to fix the API and keep it simple.

cheers
luigi

-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


Re: [Qemu-devel] [PATCH] net: Disable netmap backend when not supported

2014-02-21 Thread Luigi Rizzo
On Thu, Feb 20, 2014 at 10:49:52AM +0100, Stefan Hajnoczi wrote:
> On Wed, Feb 19, 2014 at 04:57:28PM +0100, Vincenzo Maffione wrote:
...
> Please mention that in the commit description.
> 
> (I guess it's too late to give them a NETMAP_* prefix since defining D()
> and RD() in a system header has a fair chance of causing namespace
> conflicts.)

You are right that the naming is unfortunate and we will try
to address that at a later time, removing the macros.

As partial mitigation of the problem, they are intended only for
debugging (so they should only be used by lazy programmers;
applications should prefer application-specific logging functions),
and D(), RD() and ND() are only defined when 'NETMAP_WITH_LIBS' is
defined, which should happen only in the single file that implements
the netmap backend.
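
Concretely, the exposure is limited to code that opts in, along these
lines (sketch):

/* Only the netmap backend should do this; every other file includes
 * the header without the macro and never sees D()/RD()/ND(). */
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>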

cheers
luigi



Re: [Qemu-devel] [PATCH] net: QEMU_NET_PACKET_FLAG_MORE introduced

2013-12-09 Thread Luigi Rizzo
On Mon, Dec 9, 2013 at 3:02 PM, Michael S. Tsirkin  wrote:

> On Mon, Dec 09, 2013 at 01:36:54PM +0100, Stefan Hajnoczi wrote:
> > On Fri, Dec 06, 2013 at 03:44:33PM +0100, Vincenzo Maffione wrote:
> > > - This patch is against the net-next tree (
> https://github.com/stefanha/qemu.git)
> > >   because the first netmap patch is not in the qemu master (AFAIK).
> >
> > You are right.  I am sending a pull request now to get those patches
> > into qemu.git/master.
>
> This only arrived over the weekend and affects all
> net devices. Whats the rush?
> Why not give people a chance to review and discuss
> properly?
>

As I understand it, the pull request is for the netmap backend,
not for this batching patch.

cheers
luigi


>
> > >  hw/net/cadence_gem.c|  3 ++-
> > >  hw/net/dp8393x.c|  5 +++--
> > >  hw/net/e1000.c  | 21 -
> > >  hw/net/eepro100.c   |  5 +++--
> > >  hw/net/etraxfs_eth.c|  5 +++--
> > >  hw/net/lan9118.c|  2 +-
> > >  hw/net/mcf_fec.c|  5 +++--
> > >  hw/net/mipsnet.c|  6 --
> > >  hw/net/ne2000.c |  5 +++--
> > >  hw/net/ne2000.h |  3 ++-
> > >  hw/net/opencores_eth.c  |  2 +-
> > >  hw/net/pcnet.c  |  8 +---
> > >  hw/net/pcnet.h  |  3 ++-
> > >  hw/net/rtl8139.c|  7 ---
> > >  hw/net/smc91c111.c  |  5 +++--
> > >  hw/net/spapr_llan.c |  2 +-
> > >  hw/net/stellaris_enet.c |  3 ++-
> > >  hw/net/virtio-net.c | 10 --
> > >  hw/net/vmxnet3.c|  3 ++-
> > >  hw/net/vmxnet_tx_pkt.c  |  4 ++--
> > >  hw/net/xgmac.c  |  2 +-
> > >  hw/net/xilinx_axienet.c |  2 +-
> > >  hw/usb/dev-network.c|  8 +---
> > >  include/net/net.h   | 20 +---
> > >  include/net/queue.h |  1 +
> > >  net/dump.c  |  3 ++-
> > >  net/hub.c   | 10 ++
> > >  net/net.c   | 39 +++
> > >  net/netmap.c| 17 -
> > >  net/slirp.c |  5 +++--
> > >  net/socket.c| 10 ++
> > >  net/tap-win32.c |  2 +-
> > >  net/tap.c   | 12 +++-
> > >  net/vde.c   |  5 +++--
> > >  savevm.c|  2 +-
> > >  35 files changed, 155 insertions(+), 90 deletions(-)
> >
> > Please split this into multiple patches:
> >
> > 1. net subsystem API change that touches all files (if necessary)
> > 2. e1000 MORE support
> > 3. virtio-net MORE support
> > 4. netmap MORE support
> >
> > This makes it easier to review and bisect.
> >
> > Thanks,
> > Stefan
>



-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


Re: [Qemu-devel] [PATCH v3] net: Adding netmap network backend

2013-11-04 Thread Luigi Rizzo
On Mon, Nov 4, 2013 at 9:41 AM, Anthony Liguori wrote:

> On Tue, Oct 29, 2013 at 3:12 AM, Vincenzo Maffione 
> wrote:
> > This patch adds support for a network backend based on netmap.
> > netmap is a framework for high speed packet I/O. You can use it
> > to build extremely fast traffic generators, monitors, software
> > switches or network middleboxes. Its companion software switch
> > VALE lets you interconnect virtual machines.
> > netmap and VALE are implemented as a non intrusive kernel module,
> > support NICs from multiple vendors, are part of standard FreeBSD
> > distributions and available in source format for Linux too.
>
> I don't think it's a good idea to support this on Linux hosts.  This
> is an out of tree module that most likely will never go upstream.
>
> I don't want to live through another kqemu with this if it eventually
> starts to bit-rot.
>

I believe this is very different from kqemu.

First, it is just a one-file backend (the patches
to other files are just because there is not yet a way
to automatically generate them; but I am sure qemu
will get there). Getting rid of it, should the code
bit-rot, is completely trivial.

Second, there is nothing linux specific here. Unless configure
determines that the (possibly out of tree, as in Linux,
or in-tree, as in FreeBSD) netmap headers are
installed, it just won't build the backend.

I am not sure whether you do not want to support netmap _in general_
(despite it being already upstream in FreeBSD),
or you'd like to put extra checks in ./configure to actually
prevent its use on linux hosts.

In both cases it seems to me a loss of a useful feature with no
real return.

>  configure         |  31
>  hmp-commands.hx   |   4 +-
>  net/Makefile.objs |   1 +
>  net/clients.h     |   5 +
>  net/net.c         |   6 +
>  net/netmap.c      | 423 ++
>  qapi-schema.json  |  19 ++-
>  qemu-options.hx   |   8 ++
>  8 files changed, 494 insertions(+), 3 deletions(-)
>  create mode 100644 net/netmap.c

cheers
luigi


Re: [Qemu-devel] [PATCH v3] net: Adding netmap network backend

2013-11-04 Thread Luigi Rizzo
On Mon, Nov 04, 2013 at 10:20:12AM -0800, Anthony Liguori wrote:
> On Mon, Nov 4, 2013 at 10:08 AM, Luigi Rizzo  wrote:
...
> >> On Tue, Oct 29, 2013 at 3:12 AM, Vincenzo Maffione 
> >> wrote:
> >> > This patch adds support for a network backend based on netmap.
> >> > netmap is a framework for high speed packet I/O. You can use it
> >> > to build extremely fast traffic generators, monitors, software
> >> > switches or network middleboxes. Its companion software switch
> >> > VALE lets you interconnect virtual machines.
> >> > netmap and VALE are implemented as a non intrusive kernel module,
> >> > support NICs from multiple vendors, are part of standard FreeBSD
> >> > distributions and available in source format for Linux too.
> >>
> >> I don't think it's a good idea to support this on Linux hosts.  This
> >> is an out of tree module that most likely will never go upstream.
> >>
> >> I don't want to live through another kqemu with this if it eventually
> >> starts to bit-rot.
> >
> >
> > I believe this is very different from kqemu.
> >
> > For first, it is just a one-file backend (the patches
> > to other files are just because there is not yet a way
> > to automatically generate them; but i am sure qemu
> > will get there). Getting rid of it, should the code
> > bit-rot, is completely trivial.
> >
> > Second, there is nothing linux specific here. Unless configure
> > determines that the (possibly out of tree, as in Linux,
> > or in-tree, as in FreeBSD) netmap headers are
> > installed, it just won't build the backend.
> 
> Without being in upstream Linux, we have no guarantee that the API/ABI
> will be stable over time.  I suspect it's also very unlikely that any
> mainstream distro will include these patches, meaning that the number
> of users that will test this is very low.
> 
> I don't think just adding another backend because we can helps us out
> in the long term.  Either this is the Right Approach to networking and
> we should focus on getting proper kernel support or if that's not
> worth it, then there's no reason to include this in QEMU either.

Anthony,
I'd still like you to answer the question that I asked before:

are you opposed to netmap support just for linux, or do you
oppose it in general (despite netmap being already
upstream in FreeBSD)?

Your reasoning seems along the lines "if feature X is not upstream
in linux we do not want to support it".

While I can buy it (hey, it may save a lot of maintenance effort),
it contrasts with the presence in qemu of a ton of conditional code
to support other host platforms, as well as multiple backends for
non-linux features (mostly audio drivers; some GUI too, think of
Cocoa).

Also the agendas of Qemu, linux, FreeBSD and other hosts may be
different (and it does not mean that there is One Right Way or that
others are wrong), so having "inclusion in linux" as a prerequisite
for support seems a bit restrictive.

Specifically, the networking requirements of qemu (a fast virtual
switch, tunnels etc.) are different from those of more typical apps
or the OS itself.

Also regarding API stability: qemu uses a lot of user libraries
whose APIs are a moving target without apparent problems.
If you are worried about API stability, netmap is just two
small headers, and no library. There isn't really much
that can go wrong there...

cheers
luigi

> Regards,
> 
> Anthony Liguori
> 
> > I am not sure if you do not want to support netmap _in general_
> > (despite it is already upstream in FreeBSD),
> > or you'd like to put extra checks in ./configure to actually
> > prevent its use on linux hosts.
> >



Re: [Qemu-devel] [PATCH v3] net: Adding netmap network backend

2013-11-04 Thread Luigi Rizzo
On Mon, Nov 4, 2013 at 12:54 PM, Anthony Liguori wrote:

> On Mon, Nov 4, 2013 at 11:51 AM, Luigi Rizzo  wrote:
> > On Mon, Nov 04, 2013 at 10:20:12AM -0800, Anthony Liguori wrote:
> >> On Mon, Nov 4, 2013 at 10:08 AM, Luigi Rizzo 
> wrote:
> > ...
> >> >> On Tue, Oct 29, 2013 at 3:12 AM, Vincenzo Maffione <
> v.maffi...@gmail.com>
> >> >> wrote:
> >> >> > This patch adds support for a network backend based on netmap.
> >> >> > netmap is a framework for high speed packet I/O. You can use it
> >> >> > to build extremely fast traffic generators, monitors, software
> >> >> > switches or network middleboxes. Its companion software switch
> >> >> > VALE lets you interconnect virtual machines.
> >> >> > netmap and VALE are implemented as a non intrusive kernel module,
> >> >> > support NICs from multiple vendors, are part of standard FreeBSD
> >> >> > distributions and available in source format for Linux too.
> >> >>
> >> >> I don't think it's a good idea to support this on Linux hosts.  This
> >> >> is an out of tree module that most likely will never go upstream.
> >> >>
> >> >> I don't want to live through another kqemu with this if it eventually
> >> >> starts to bit-rot.
> >> >
> >> >
> >> > I believe this is very different from kqemu.
> >> >
> >> > For first, it is just a one-file backend (the patches
> >> > to other files are just because there is not yet a way
> >> > to automatically generate them; but i am sure qemu
> >> > will get there). Getting rid of it, should the code
> >> > bit-rot, is completely trivial.
> >> >
> >> > Second, there is nothing linux specific here. Unless configure
> >> > determines that the (possibly out of tree, as in Linux,
> >> > or in-tree, as in FreeBSD) netmap headers are
> >> > installed, it just won't build the backend.
> >>
> >> Without being in upstream Linux, we have no guarantee that the API/ABI
> >> will be stable over time.  I suspect it's also very unlikely that any
> >> many stream distro will include these patches meaning that the number
> >> of users that will test this is very low.
> >>
> >> I don't think just adding another backend because we can helps us out
> >> in the long term.  Either this is the Right Approach to networking and
> >> we should focus on getting proper kernel support or if that's not
> >> worth it, then there's no reason to include this in QEMU either.
> >
> > anthony,
> > i'd still like you to answer the question that i asked before:
> >
> > are you opposed to netmap support just for linux, or you
> > oppose to it in general (despite netmap being already
> > upstream in FreeBSD) ?
> >
> > Your reasoning seems along the lines "if feature X is not upstream
> > in linux we do not want to support it".
>
> Yes.  This is the historic policy we have taken for any feature.  I
> have no problem with netmap being used on FreeBSD hosts but I think it
> should not be enabled on Linux hosts.
>

Ok, thanks for the clarification. So I misunderstood: the policy is
"if not upstream in linux, we do not want to support it _on linux_".
A fundamental difference :)

So in ./configure we must change to 'netmap="no"' in the initial
section to disable it by default, and add a line 'netmap=""' in the
FreeBSD section to enable autodetect there.

cheers
luigi


[Qemu-devel] [PATCH] better interpreter specification for scripts

2013-11-14 Thread Luigi Rizzo
Some of the new scripts in scripts/ specify an interpreter path
which does not always point to the right place.
I see that an alternative format is also being used,
    #!/usr/bin/env <interpreter>
which seems to work better.

The patch below is merely to let master compile on FreeBSD,
but there are other offending files that can be found with
    grep '#!' scripts/*

Signed-off-by: Luigi Rizzo 
---
 scripts/acpi_extract.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/acpi_extract.py b/scripts/acpi_extract.py
index 22ea468..66c1b3e 100755
--- a/scripts/acpi_extract.py
+++ b/scripts/acpi_extract.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/env python
 # Copyright (C) 2011 Red Hat, Inc., Michael S. Tsirkin 
 #
 # This program is free software; you can redistribute it and/or modify
--
1.8.1.2


[Qemu-devel] commit b1bbfe72 causes huge slowdown with no kvm

2013-11-19 Thread Luigi Rizzo
I recently found out that without kvm enabled, and especially
with -smp 2 or greater, qemu becomes incredibly slow
(to the point that you can see kernel messages in the
guest print one character at a time).

This happens with a Linux host (even with -smp 1)
and with FreeBSD host (in this case -smp 2 or greater;
-smp 1 seems to work fine there).

Bisecting points to this commit as the culprit:


commit b1bbfe72ec1ebf302d97f886cc646466c0abd679
Author: Alex Bligh 
Date:   Wed Aug 21 16:02:55 2013 +0100

aio / timers: On timer modification, qemu_notify or aio_notify

On qemu_mod_timer_ns, ensure qemu_notify or aio_notify is called to
end the appropriate poll(), irrespective of use_icount value.

On qemu_clock_enable, ensure qemu_notify or aio_notify is called for
all QEMUTimerLists attached to the QEMUClock.

Signed-off-by: Alex Bligh 
Signed-off-by: Stefan Hajnoczi 

I could not revert this individual commit into master because
of other changes, but I notice that one key change of the commit
was to make a call to timerlist_notify() unconditional, whereas
before it was controlled by the "use_icount" variable.

If not, there is an open problem.

Any ideas ?

cheers
luigi

diff --git a/qemu-timer.c b/qemu-timer.c
index e15ce47..4180a86 100644
--- a/qemu-timer.c
+++ b/qemu-timer.c
@@ -380,6 +380,7 @@ static void timerlist_rearm(QEMUTimerList *timer_list)
 {
 /* Interrupt execution to force deadline recalculation.  */
 qemu_clock_warp(timer_list->clock->type);
+if (use_icount) // XXX
 timerlist_notify(timer_list);
 }


Re: [Qemu-devel] commit b1bbfe72 causes huge slowdown with no kvm

2013-11-20 Thread Luigi Rizzo
On Wed, Nov 20, 2013 at 10:41:22AM +0100, Paolo Bonzini wrote:
> Il 20/11/2013 00:00, Luigi Rizzo ha scritto:
> > I recently found out that without kvm enabled, and especially
> > with -smp 2 or greater, qemu becomes incredibly slow
> > (to the point that you can see kernel messages in the
> > guest print one character at a time).
> > 
> > This happens with a Linux host (even with -smp 1)
> > and with FreeBSD host (in this case -smp 2 or greater;
> > -smp 1 seems to work fine there).

Here is my test case that shows the problem:

HOW TO REPRODUCE:

boot a (small) FreeBSD image below:

http://info.iet.unipi.it/~luigi/20131119-picobsd.bin 

with this command (note no kvm; also I disconnect the nic
so I can generate traffic without causing trouble):

x86_64-softmmu/qemu-system-x86_64 -m 1024 -hda 20131119-picobsd.bin 
-curses -monitor tcp::,server,nowait -net nic,model=e1000 -smp 2

The image is readonly so you can kill the qemu without problems.

SYMPTOMS:
As soon as the kernel starts its timer services (there is an
active kernel thread woken up every millisecond), booting and
runtime performance in the guest become terribly slow.
See details on the test below.

HOST / OS
Tried on a linux host (ubuntu 13.10 I believe, with kernel 3.11
and an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz), but similar
things occur also on a FreeBSD host. With a FreeBSD host,
using -smp 1 seems to bypass the problem, whereas on Linux hosts
the -smp setting does not seem to make a difference.

QEMU VERSION and MODIFICATIONS:

> git branch -v
* master 5c5432e Merge remote-tracking branch 'luiz/queue/qmp' into 
staging

No changes other than the small diff i proposed to fix the problem:
> git diff

diff --git a/qemu-timer.c b/qemu-timer.c
index e15ce47..4180a86 100644
--- a/qemu-timer.c
+++ b/qemu-timer.c
@@ -380,6 +380,7 @@ static void timerlist_rearm(QEMUTimerList *timer_list)
 {
 /* Interrupt execution to force deadline recalculation.  */
 qemu_clock_warp(timer_list->clock->type);
+if (use_icount) // XXX
 timerlist_notify(timer_list);
 }

DETAILED SYMPTOMS:

WITH THE ABOVE PATCH, the kernel boots to a login prompt in 32
seconds (10 of which are the initial pause; you can skip it
by pressing enter). Then you can run the "pkt-gen" program which I
normally use to test netmap performance in the guest, and you
see between 2 and 3 million packets per second (the NIC is disconnected
from the host so this should be harmless). Example:

login: root
Password: setup
# pkt-gen -i em0 -f tx




WITHOUT THE PATCH, booting becomes slow as soon as the timer tick starts
and we load dummynet (which also starts a kernel thread every millisecond).
You should be able to see how the printing of kernel messages slows down
terribly around this point:

...
Timecounters tick every 1.000 msec
  ipfw2 initialized, divert loadable, nat loadable, default to accept, 
logging disabled
  DUMMYNET 0 with IPv6 initialized (100409)


and then it takes a long/variable time to reach the login prompt,
easily a couple of minutes or more.
If you start pkt-gen now, you should see a much lower rate,
around 300-500 Kpps.



Since the problem seems timer-related, it makes sense that you are
not seeing much difference on linux, which is probably tickless,
and that the trouble comes out on FreeBSD after the timers are
initialized.

But again I have no idea if my patch (which simply reverts part of
the offending commit) makes sense.

cheers
luigi


> 
> I could not reproduce it; the profile here seems normal, too:
> 
>  24,69%  perf-3281.map   [.] 0x7f4ab5ac9903
>  14,18%  qemu-system-x86_64  [.] cpu_x86_exec
>   2,67%  libglib-2.0.so.0.3800.1 [.] g_source_iter_next
>   2,63%  qemu-system-x86_64  [.] tcg_optimize
>   2,47%  qemu-system-x86_64  [.] helper_pcmpeqb_xmm
>   2,28%  qemu-system-x86_64  [.] phys_page_find
>   1,92%  qemu-system-x86_64  [.] 
> address_space_translate_internal
>   1,53%  qemu-system-x86_64  [.] tcg_liveness_analysis
>   1,40%  qemu-system-x86_64  [.] tcg_reg_alloc_op
>   1,17%  qemu-system-x86_64  [.] helper_psubb_xmm
>   0,97%  qemu-system-x86_64  [.] disas_insn
>   0,96%  qemu-system-x86_64  [.] cpu_x86_handle_mmu_fault
>   0,92%  qemu-system-x86_64  [.] tcg_out_opc
>   0,92%  qemu-system-x86_64  [.] helper_pmovmskb_xmm
>   0,91%  qemu-system-x86_64  [.] tlb_set_page
>   0,7

Re: [Qemu-devel] [PATCH for 1.7] target-i386: yield to another VCPU on PAUSE

2013-11-20 Thread Luigi Rizzo
On Wed, Nov 20, 2013 at 12:54:02PM +0100, Paolo Bonzini wrote:
> After commit b1bbfe7 (aio / timers: On timer modification, qemu_notify
> or aio_notify, 2013-08-21) FreeBSD guests report a huge slowdown.
> 
> The problem shows up as soon as FreeBSD turns on its periodic (~1 ms)
> tick, but the timers are only the trigger for a pre-existing problem.
> 
> Before the offending patch, setting a timer did a timer_settime system call.
> 
> After, setting the timer exits the event loop (which uses poll) and
> reenters it with a new deadline.  This does not cause any slowdown; the
> difference is between one system call (timer_settime) and a signal
> delivery (SIGALRM) before the patch, and two system calls afterwards
> (write to a pipe or eventfd + calling poll again when re-entering the
> event loop).
> 
> Unfortunately, the exit/enter causes the main loop to grab the iothread
> lock, which in turns kicks the VCPU thread out of execution.  This
> causes TCG to execute the next VCPU in its round-robin scheduling of
> VCPUS.  When the second VCPU is mostly unused, FreeBSD runs a "pause"
> instruction in its idle loop which only burns cycles without any
> progress.  As soon as the timer tick expires, the first VCPU runs
> the interrupt handler but very soon it sets it again---and QEMU
> then goes back doing nothing in the second VCPU.
> 
> The fix is to make the pause instruction do "cpu_loop_exit".

thank you.

This fixes the slow booting problem, but the runtime performance
with my test program in the other mail thread

"commit b1bbfe72 causes huge slowdown with no kvm"

is still about 50% of what it was before commit b1bbfe7.

So i am still wondering if there is a better way to deliver the
timerlist_notify() in timerlist_rearm() .

I have tried to suppress the entire timerlist_rearm() when the newly
inserted event is close to the previous one, but this does not
seem to help -- actually it is harmful, presumably because qemu_clock_warp()
is also skipped. Conversely, filtering out timerlist_notify() as in
my previous (incorrect) patch seems to speed up things considerably,
perhaps because there are other timerlist_notify() calls elsewhere
that keep the ball rolling.


Just as a side note:

I am not sure whether you were seeing use of the 'pause' instruction
in profiling, but by default the idle loop in FreeBSD uses the "acpi" method
(to enter C1 state).

Other options are:
- "mwait" (unavailable on qemu due to some missing VCPU feature);
- "hlt" (which i tried and gives the same behaviour as "acpi");
- "spin" (which indeed does use the "pause" instruction, but is not
  used unless you force it with "sysctl machdep.idle=spin").

"pause" instructions are however used within spinlocks, and when
invoking the scheduler, which happens right before and after the idle loop.

cheers
luigi

> Cc: Richard Henderson 
> Reported-by: Luigi Rizzo 
> Signed-off-by: Paolo Bonzini 
> ---
>  target-i386/helper.h  |  1 +
>  target-i386/misc_helper.c | 22 --
>  target-i386/translate.c   |  5 -
>  3 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/target-i386/helper.h b/target-i386/helper.h
> index d6974df..3775abe 100644
> --- a/target-i386/helper.h
> +++ b/target-i386/helper.h
> @@ -58,6 +58,7 @@ DEF_HELPER_2(sysret, void, env, int)
>  DEF_HELPER_2(hlt, void, env, int)
>  DEF_HELPER_2(monitor, void, env, tl)
>  DEF_HELPER_2(mwait, void, env, int)
> +DEF_HELPER_2(pause, void, env, int)
>  DEF_HELPER_1(debug, void, env)
>  DEF_HELPER_1(reset_rf, void, env)
>  DEF_HELPER_3(raise_interrupt, void, env, int, int)
> diff --git a/target-i386/misc_helper.c b/target-i386/misc_helper.c
> index 93933fd..b6307ca 100644
> --- a/target-i386/misc_helper.c
> +++ b/target-i386/misc_helper.c
> @@ -566,6 +566,15 @@ void helper_rdmsr(CPUX86State *env)
>  }
>  #endif
>  
> +static void do_pause(X86CPU *cpu)
> +{
> +CPUX86State *env = &cpu->env;
> +
> +/* Just let another CPU run.  */
> +env->exception_index = EXCP_INTERRUPT;
> +cpu_loop_exit(env);
> +}
> +
>  static void do_hlt(X86CPU *cpu)
>  {
>  CPUState *cs = CPU(cpu);
> @@ -611,13 +620,22 @@ void helper_mwait(CPUX86State *env, int next_eip_addend)
>  cs = CPU(cpu);
>  /* XXX: not complete but not completely erroneous */
>  if (cs->cpu_index != 0 || CPU_NEXT(cs) != NULL) {
> -/* more than one CPU: do not sleep because another CPU may
> -   wake this one */
> +do_pause(cpu);
>  } else {
>  do_hlt(cpu);
>  }
>  }
>  
> +void helper_pause(CPUX86State *env, int next_eip_addend)
> +{
> +X86CPU *cpu = x86_e

Re: [Qemu-devel] [PATCH V5 16/18] virtio-pci: increase the maximum number of virtqueues to 513

2015-04-07 Thread Luigi Rizzo
On Tue, Apr 7, 2015 at 6:54 PM, Alexander Graf  wrote:

> On 04/01/2015 10:15 AM, Jason Wang wrote:
>
>> This patch increases the maximum number of virtqueues for pci from 64
>> to 513. This will allow booting a virtio-net-pci device with 256 queue
>> pairs.
>> ...
>>
>>* configuration space */
>>   #define VIRTIO_PCI_CONFIG_SIZE(dev) VIRTIO_PCI_CONFIG_OFF(msix_
>> enabled(dev))
>>   -#define VIRTIO_PCI_QUEUE_MAX 64
>> +#define VIRTIO_PCI_QUEUE_MAX 513
>>
>
> 513 is an interesting number. Any particular reason for it? Maybe this was
> mentioned before and I just missed it ;)
>
>
Quite large, too. I thought multiple queue pairs were useful
to split the load for multicore machines, but targeting VMs with
up to 256 cores (and presumably an equal number in the host)
seems really forward looking.

On the other hand, the message is dated april first...

cheers
luigi


[Qemu-devel] unbounded qemu NetQue's ?

2013-01-06 Thread Luigi Rizzo
Hi,
while testing the tx path in qemu without a network backend connected,
I noticed that qemu_net_queue_send() builds up an unbounded
queue, because QTAILQ* have no notion of a queue length.

As a result, memory usage of a qemu instance can grow to extremely
large values.

I wonder if it makes sense to implement a hard limit on the size of
NetQueues. The patch below is only a partial implementation
but probably sufficient for these purposes.

cheers
luigi

diff -urp qemu-1.3.0-orig/net/queue.c qemu-1.3.0/net/queue.c
--- qemu-1.3.0-orig/net/queue.c 2012-12-03 20:37:05.0 +0100
+++ qemu-1.3.0/net/queue.c  2013-01-06 19:38:12.0 +0100
@@ -92,6 +92,9 @@ static void qemu_net_queue_append(NetQue
 {
 NetPacket *packet;
 
+if (queue->packets.tqh_count > 1)
+   return; // XXX drop
+
 packet = g_malloc(sizeof(NetPacket) + size);
 packet->sender = sender;
 packet->flags = flags;
diff -urp qemu-1.3.0-orig/qemu-queue.h qemu-1.3.0/qemu-queue.h
--- qemu-1.3.0-orig/qemu-queue.h2012-12-03 20:37:05.0 +0100
+++ qemu-1.3.0/qemu-queue.h 2013-01-06 19:34:01.0 +0100
@@ -320,11 +320,12 @@ struct {
 struct name {   \
 qual type *tqh_first;   /* first element */ \
 qual type *qual *tqh_last;  /* addr of last next element */ \
+   int32_t tqh_count;  \
 }
 #define QTAILQ_HEAD(name, type)  Q_TAILQ_HEAD(name, struct type,)
 
 #define QTAILQ_HEAD_INITIALIZER(head)   \
-{ NULL, &(head).tqh_first }
+{ NULL, &(head).tqh_first, 0 }
 
 #define Q_TAILQ_ENTRY(type, qual)   \
 struct {\
@@ -339,6 +340,7 @@ struct {
 #define QTAILQ_INIT(head) do {  \
 (head)->tqh_first = NULL;   \
 (head)->tqh_last = &(head)->tqh_first;  \
+(head)->tqh_count = 0; \
 } while (/*CONSTCOND*/0)
 
 #define QTAILQ_INSERT_HEAD(head, elm, field) do {   \
@@ -348,6 +350,7 @@ struct {
 else\
 (head)->tqh_last = &(elm)->field.tqe_next;  \
 (head)->tqh_first = (elm);  \
+(head)->tqh_count++;   \
 (elm)->field.tqe_prev = &(head)->tqh_first; \
 } while (/*CONSTCOND*/0)
 
@@ -356,6 +359,7 @@ struct {
 (elm)->field.tqe_prev = (head)->tqh_last;   \
 *(head)->tqh_last = (elm);  \
 (head)->tqh_last = &(elm)->field.tqe_next;  \
+(head)->tqh_count++;   \
 } while (/*CONSTCOND*/0)
 
 #define QTAILQ_INSERT_AFTER(head, listelm, elm, field) do { \
@@ -381,6 +385,7 @@ struct {
 (elm)->field.tqe_prev;  \
 else\
 (head)->tqh_last = (elm)->field.tqe_prev;   \
+(head)->tqh_count--;   \
 *(elm)->field.tqe_prev = (elm)->field.tqe_next; \
 } while (/*CONSTCOND*/0)
 




Re: [Qemu-devel] unbounded qemu NetQue's ?

2013-01-07 Thread Luigi Rizzo
On Mon, Jan 7, 2013 at 2:49 PM, Stefan Hajnoczi  wrote:

> On Sun, Jan 06, 2013 at 08:23:56PM +0100, Luigi Rizzo wrote:
> > Hi,
> > while testing the tx path in qemu without a network backend connected,
> > i noticed that qemu_net_queue_send() builds up an unbounded
> > queue, because QTAILQ* have no notion of a queue length.
> >
> > As a result, memory usage of a qemu instance can grow to extremely
> > large values.
> >
> > I wonder if it makes sense to implement a hard limit on size of
> > NetQue's. The patch below is only a partial implementation
> > but probably sufficient for these purposes.
> >
> >   cheers
> >   luigi
>
> Hi Luigi,
> Good idea, we should bound the queues to prevent rare situations or bugs
> from hogging memory.
>
> Ideally we would do away with queues completely and make net clients
> hand buffers to each other ahead of time to impose flow control.  I've
> thought about this a few times and always reached the conclusion that
> it's not quite possible.
>

Given that implementing flow control on the inbound (host->guest) path
is impossible, I tend to agree that removing queues is probably
not worth the effort, even if possible at all.
...

Your comments below also remind me of a more general issue with the code:

 > diff -urp qemu-1.3.0-orig/net/queue.c qemu-1.3.0/net/queue.c
> > --- qemu-1.3.0-orig/net/queue.c   2012-12-03 20:37:05.0+0100
> > +++ qemu-1.3.0/net/queue.c2013-01-06 19:38:12.0 +0100
> > @@ -92,6 +92,9 @@ static void qemu_net_queue_append(NetQue
>
> Please also do it for qemu_net_queue_append_iov().
>
>
The qemu code has many duplicate functions of the form foo() and foo_iov().
Not only here, but even in the backends (e.g. see net/tap.c).
I think it would be good to settle on a single version of the function
and remove or convert the non-iov instances.
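
The usual way to collapse such a pair is to keep only the _iov variant and
make the linear-buffer version a thin wrapper, roughly as below (a sketch
against the old VLANClientState API; the netmap_* names are placeholders):

static ssize_t netmap_receive(VLANClientState *nc,
                              const uint8_t *buf, size_t size)
{
    /* wrap the flat buffer in a one-element iovec and forward,
     * so only one code path needs maintaining */
    const struct iovec iov = {
        .iov_base = (void *)buf,    /* iovec API is not const-clean */
        .iov_len  = size,
    };
    return netmap_receive_iov(nc, &iov, 1);
}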

 >  {
> >  NetPacket *packet;
> >
> > +if (queue->packets.tqh_count > 1)
> > + return; // XXX drop
> > +
>
> sent_cb() must be invoked.  It's best to do this in a QEMUBH in case the
> caller is not prepared for reentrancy.
>
>
I forgot that, but the use of sent_cb() is interesting:
the only place that actually uses it seems to be net/tap.c,
and the way it uses it only makes sense if the queue has
room!

>  packet = g_malloc(sizeof(NetPacket) + size);
> >  packet->sender = sender;
> >  packet->flags = flags;
> > diff -urp qemu-1.3.0-orig/qemu-queue.h qemu-1.3.0/qemu-queue.h
> > --- qemu-1.3.0-orig/qemu-queue.h  2012-12-03 20:37:05.0+0100
> > +++ qemu-1.3.0/qemu-queue.h   2013-01-06 19:34:01.0 +0100
>
> Please don't modify qemu-queue.h.  It's a generic queue data structure
> used by all of QEMU.  Instead, keep a counter in the NetQueue.
>
>
Good suggestion; that also makes the change smaller.

cheers
luigi


> Stefan
>



-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


[Qemu-devel] nic-specific options ? (Re: [RFC] updated e1000 mitigation patch)

2013-01-11 Thread Luigi Rizzo
On Thu, Jan 10, 2013 at 01:25:48PM +0100, Stefan Hajnoczi wrote:
> On Thu, Dec 27, 2012 at 11:06:58AM +0100, Luigi Rizzo wrote:
> > diff -urp qemu-1.3.0-orig/hw/e1000.c qemu-1.3.0/hw/e1000.c
> > --- qemu-1.3.0-orig/hw/e1000.c  2012-12-03 20:37:05.0 +0100
> > +++ qemu-1.3.0/hw/e1000.c   2012-12-27 09:47:16.0 +0100
> > @@ -35,6 +35,8 @@
> >  
> >  #include "e1000_hw.h"
> >  
> > +static int mit_on = 1; /* interrupt mitigation enable */
> 
> If you want to make this optional then please put it inside E1000State
> so it can be toggled per NIC.

What is the simplest way to add NIC-specific options?
I have added one line to e1000_properties, as below:

 static Property e1000_properties[] = {
 DEFINE_NIC_PROPERTIES(E1000State, conf),
+DEFINE_PROP_UINT32("mit_on", E1000State, mit_on, 0),
 DEFINE_PROP_END_OF_LIST(),
 };

and this way it is recognised on the command line:
qemu ... -device e1000,mit_on=1 ...

but I do not know how to set the property for the NIC I am using.
Specifically, I normally run qemu with "-net nic,model=e1000"
(leaving the nic unconnected to the host network, so I can
test the tx path without being constrained by the backend's speed).

Any suggestion ?

thanks
luigi



Re: [Qemu-devel] unbounded qemu NetQue's ?

2013-01-16 Thread Luigi Rizzo
On Mon, Jan 7, 2013 at 5:49 AM, Stefan Hajnoczi  wrote:

> On Sun, Jan 06, 2013 at 08:23:56PM +0100, Luigi Rizzo wrote:
> > Hi,
> > while testing the tx path in qemu without a network backend connected,
> > i noticed that qemu_net_queue_send() builds up an unbounded
> > queue, because QTAILQ* have no notion of a queue length.
> >
> > As a result, memory usage of a qemu instance can grow to extremely
> > large values.
> >
> > I wonder if it makes sense to implement a hard limit on size of
> > NetQue's. The patch below is only a partial implementation
> > but probably sufficient for these purposes.
> >
> >   cheers
> >   luigi
>
> Hi Luigi,
> Good idea, we should bound the queues to prevent rare situations or bugs
> from hogging memory.
>

...


>  > diff -urp qemu-1.3.0-orig/net/queue.c qemu-1.3.0/net/queue.c
> > --- qemu-1.3.0-orig/net/queue.c   2012-12-03 20:37:05.0+0100
> > +++ qemu-1.3.0/net/queue.c2013-01-06 19:38:12.0 +0100
> > @@ -92,6 +92,9 @@ static void qemu_net_queue_append(NetQue
>
> Please also do it for qemu_net_queue_append_iov().
>
> >  {
> >  NetPacket *packet;
> >
> > +if (queue->packets.tqh_count > 1)
> > + return; // XXX drop
> > +
>
> sent_cb() must be invoked.  It's best to do this in a QEMUBH in case the
> caller is not prepared for reentrancy.
>

Stefan, the semantics of callbacks make it difficult to run them on drops:
they are supposed to run only when the queue has been drained,
and apparently only once per sender, according to the comment
in the header of net/queue.c:

/* The delivery handler may only return zero if it will call
 * qemu_net_queue_flush() when it determines that it is once again able
 * to deliver packets. It must also call qemu_net_queue_purge() in its
 * cleanup path.
 *
 * If a sent callback is provided to send(), the caller must handle a
 * zero return from the delivery handler by not sending any more packets
 * until we have invoked the callback. Only in that case will we queue
 * the packet.
 *
 * If a sent callback isn't provided, we just drop the packet to avoid
 * unbounded queueing.
 */

This seems to suggest that a packet passed to qemu_net_queue_send()
should never be queued unless it has a callback
(hence the existing code is buggy, because it never ever drops packets),
so the queue can only hold one packet per sender,
hence there is no real risk of overflow.

cheers
luigi


[Qemu-devel] [PATCH] fix unbounded qemu NetQueue

2013-01-16 Thread Luigi Rizzo
The comment at the beginning of net/queue.c says that packets that
cannot be sent by qemu_net_queue_send() should not be enqueued
unless a callback is set.

This patch implements this behaviour, which prevents a queue from growing
unbounded (e.g. when a network backend is not connected).

Also, for good measure, the patch implements bounded size queues
(though it should not be necessary now, because each source can only have
one packet queued). When a packet is dropped because of excessive
queue size the callback is not supposed to be called.

cheers
luigi

Signed-off-by: Luigi Rizzo 
--- ../orig/net/queue.c 2012-12-03 11:37:05.0 -0800
+++ ./net/queue.c   2013-01-16 21:37:20.109732443 -0800
@@ -50,6 +50,8 @@ struct NetPacket {
 
 struct NetQueue {
 void *opaque;
+uint32_t nq_maxlen;
+uint32_t nq_count;
 
 QTAILQ_HEAD(packets, NetPacket) packets;
 
@@ -63,6 +65,8 @@ NetQueue *qemu_new_net_queue(void *opaqu
 queue = g_malloc0(sizeof(NetQueue));
 
 queue->opaque = opaque;
+queue->nq_maxlen = 1; /* arbitrary limit */
+queue->nq_count = 0;
 
 QTAILQ_INIT(&queue->packets);
 
@@ -92,6 +96,8 @@ static void qemu_net_queue_append(NetQue
 {
 NetPacket *packet;
 
+if (!sent_cb || queue->nq_count >= queue->nq_maxlen)
+   return;
 packet = g_malloc(sizeof(NetPacket) + size);
 packet->sender = sender;
 packet->flags = flags;
@@ -99,6 +105,7 @@ static void qemu_net_queue_append(NetQue
 packet->sent_cb = sent_cb;
 memcpy(packet->data, buf, size);
 
+queue->nq_count++;
 QTAILQ_INSERT_TAIL(&queue->packets, packet, entry);
 }
 
@@ -113,6 +120,8 @@ static void qemu_net_queue_append_iov(Ne
 size_t max_len = 0;
 int i;
 
+if (!sent_cb || queue->nq_count >= queue->nq_maxlen)
+   return;
 for (i = 0; i < iovcnt; i++) {
 max_len += iov[i].iov_len;
 }
@@ -130,6 +139,7 @@ static void qemu_net_queue_append_iov(Ne
 packet->size += len;
 }
 
+queue->nq_count++;
 QTAILQ_INSERT_TAIL(&queue->packets, packet, entry);
 }
 
@@ -220,6 +230,7 @@ void qemu_net_queue_purge(NetQueue *queu
 QTAILQ_FOREACH_SAFE(packet, &queue->packets, entry, next) {
 if (packet->sender == from) {
 QTAILQ_REMOVE(&queue->packets, packet, entry);
+queue->nq_count--;
 g_free(packet);
 }
 }
@@ -233,6 +244,7 @@ bool qemu_net_queue_flush(NetQueue *queu
 
 packet = QTAILQ_FIRST(&queue->packets);
 QTAILQ_REMOVE(&queue->packets, packet, entry);
+queue->nq_count--;
 
 ret = qemu_net_queue_deliver(queue,
  packet->sender,
@@ -240,6 +252,7 @@ bool qemu_net_queue_flush(NetQueue *queu
  packet->data,
  packet->size);
 if (ret == 0) {
+queue->nq_count++;
 QTAILQ_INSERT_HEAD(&queue->packets, packet, entry);
 return false;
 }



Re: [Qemu-devel] [PATCH] fix unbounded qemu NetQueue

2013-01-17 Thread Luigi Rizzo
On Thu, Jan 17, 2013 at 2:21 AM, Stefan Hajnoczi  wrote:

> On Thu, Jan 17, 2013 at 07:07:11AM +0100, Luigi Rizzo wrote:
> > The comment at the beginning of net/queue.c says that packets that
> > cannot be sent by qemu_net_queue_send() should not be enqueued
> > unless a callback is set.
> >
> > This patch implements this behaviour, that prevents a queue to grow
> > unbounded (e.g. when a network backend is not connected).
> >
> > Also for good measure the patch implements bounded size queues
> > (though it should not be necessary now because each source can only have
> > one packet queued). When a packet is dropped because excessive
> > queue size the callback is not supposed to be called.
>
> Although I appreciate the semantics that the comment tries to establish,
> the code doesn't behave like this today and we cannot drop packets in
> cases where we relied on queuing them.
>
> More changes will be required to make the hub, USB, pcap scenario I
> described previously work.
>

I see. Then the other option would be to drop packets only
if the queue is oversize AND the callback is not set:

+if (queue->nq_count >= queue->nq_maxlen && !sent_cb)
+   return;

so we should be able to use the queue to store packets when the
USB guest is slow, and avoid dropping precious packets with the callback
set (there should be only one of them for each source).
The queue might grow slightly over the limit but still be kept under control.

I cannot think of another way to handle the callback. I think we cannot run
it on a drop as it would cause a premature restart of the sender.
(at a cursory inspection of the code, just tap.c and virtio-net.c use
callbacks)

cheers
luigi


Re: [Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?

2013-01-18 Thread Luigi Rizzo
On Fri, Jan 18, 2013 at 05:14:02PM +0100, Paolo Bonzini wrote:
...
> > This relies on the assumption that the ring (which is contiguous in the
> > guest's physical address space) is also contiguous in the host's virtual
> > address space.  In principle the property could be easily verified once
> > the ring is set up.
> 
> IIRC, the amount of contiguous memory is written by address_space_map in
> the plen parameter.
> 
> In your case:
> 
> > +   s->txring = address_space_map(pci_dma_context(&s->dev)->as,
> > +   base, &desclen, 0 /* is_write */);
> 
> that would be desclen on return from address_space_map.

ok thanks.

> > And of course, am i missing some important detail ?
> 
> Unfortunately yes.
> 
> First, host memory mappings could change (though they rarely do on PC).
>  The result of address_space_map is not guaranteed to be stable.  To
> avoid problems with this, however, you could use something like
> hw/dataplane/hostmem.c and even avoid address_space_map altogether.

I'll look into that. Hopefully there is something that I can
use as a notification that the mapping has changed...

> Second, that pci_dma_*() could have the addresses translated by an
> IOMMU.  virtio is documented to have "real" physical memory addresses,
> but this does not apply to other devices.

I see. I suppose the ability to have an IOMMU depends on the
specific NIC? I am only planning to use the above shortcut for
e1000.

thanks a lot for the quick feedback
luigi



[Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?

2013-01-18 Thread Luigi Rizzo
Hi,
with a bunch of e1000 improvements we are at a point where we are
doing over 1Mpps (short frames) and 7-8Gbit/s (1500 byte frames)
between two guests, and two things that are high in the "perf top"
stats are phys_page_find() and related memory copies.

Both are triggered by the pci_dma_read() and pci_dma_write(),
which on e1000 (and presumably other frontends) are called on
every single descriptor and every single buffer.

I have then tried to access the guest memory without going every
time through the page lookup.

For the tx and rx rings i have a partial workaround, which tracks changes
to the base address of the ring, converts it to a host virtual address

 
+#ifdef MAP_RING
+    base = tx_desc_base(s);
+    if (base != s->txring_phi) {
+        hwaddr desclen = s->mac_reg[TDLEN];
+        s->txring_phi = base;
+        s->txring = address_space_map(pci_dma_context(&s->dev)->as,
+                                      base, &desclen, 0 /* is_write */);
+    }
+#endif /* MAP_RING */
 ...

and then accesses the descriptor directly into guest memory

desc = s->txring[s->mac_reg[TDH]];

(sprinkle with memory barriers as needed).

This relies on the assumption that the ring (which is contiguous in the
guest's physical address space) is also contiguous in the host's virtual
address space.  In principle the property could be easily verified once
the ring is set up.

I have not done this for buffers because I am not sure how to verify
that the same mapping holds for all packet buffers.
One way could be the following:
on the first buffer access, make the address translation and try to
determine the boundaries of the contiguous (in virtual host memory)
region that holds the buffer. Then subsequent buffers can be easily
validated against this region.

So the problem is now the following: given a guest physical
address, is there a quick way to determine the contiguous
region of memory in the host that contains it ?

And of course, am I missing some important detail ?
Of course the above could be used conditionally when the required
conditions hold, reverting to the pci_dma_*() functions
in the other cases.

cheers
luigi



Re: [Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?

2013-01-18 Thread Luigi Rizzo
On Fri, Jan 18, 2013 at 05:14:02PM +0100, Paolo Bonzini wrote:
> Il 18/01/2013 17:04, Luigi Rizzo ha scritto:
> > Hi,
> > with a bunch of e1000 improvements we are at a point where we are
> > doing over 1Mpps (short frames) and 7-8Gbit/s (1500 byte frames)
> > between two guests, and two things that are high in the "perf top"
> > stats are phys_page_find() and related memory copies.
> > 
> > Both are triggered by the pci_dma_read() and pci_dma_write(),
> > which on e1000 (and presumably other frontends) are called on
> > every single descriptor and every single buffer.
> > 
> > I have then tried to access the guest memory without going every
> > time through the page lookup. [...]
> > 
> > This relies on the assumption that the ring (which is contiguous in the
> > guest's physical address space) is also contiguous in the host's virtual
> > address space.  In principle the property could be easily verified once
> > the ring is set up.
> 
> IIRC, the amount of contiguous memory is written by address_space_map in
> the plen parameter.

unfortunately the plen parameter is modified only if the area
is smaller than the request, and there is no method that I can
find that returns [base, len] of a RAMBlock.

What I came up with, also to check for invalid addresses,
IOMMU and the like, is something like this:

// addr is the guest physical address we want to map into host virtual memory
int mappable_addr(PCIDevice *dev, hwaddr addr, 
uint64_t *guest_ha_low, uint64_t *guest_ha_high,
uint64_t *gpa_to_hva)
{
AddressSpace *as = pci_dma_context(dev)->as;
AddressSpaceDispatch *d = as->dispatch;
MemoryRegionSection *section;
RAMBlock *block;

if (dma_has_iommu(pci_dma_context(dev)))
return 0;   // no direct access

section = phys_page_find(d, addr >> TARGET_PAGE_BITS);
if (!memory_region_is_ram(section->mr) || section->readonly)
return 0;   // no direct access

QLIST_FOREACH(block, &ram_list.blocks, next) {
if (addr - block->offset < block->length) {
/* set 3 variables indicating the valid range
 * and the offset between the two address spaces.
 */
*guest_ha_low =  block->offset;
*guest_ha_high = block->offset + block->length;
*gpa_to_hva = (uint64_t)block->host - block->offset;
return 1;
}
}
return 0;
}

(this probably needs to be put in exec.c or some other place
that can access phys_page_find() and RAMBlock)

The interested client (hw/e1000.c in my case) could then call
memory_listener_register() to be notified of changes,
invoke mappable_addr() on the first data buffer it has to
translate, cache the result, and use it to speed up subsequent
translations when they fall within the cached region (with
pci_dma_read/write as the fallback methods in case
of a miss).

The cache is invalidated on updates arriving from the
memory listener, and refreshed at the next access.

Is this more sound ?
The only missing piece then is the call to
invalidate_and_set_dirty() 
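
To illustrate, the fast path in the device could be a sketch like this
(the s->map_* fields are hypothetical and would cache the values
returned by mappable_addr(); pci_dma_read() remains the fallback):

static inline void
e1000_dma_read_fast(E1000State *s, dma_addr_t addr, void *buf, dma_addr_t len)
{
    if (s->map_valid &&
        addr >= s->map_low && addr + len <= s->map_high) {
        /* hit: direct copy from guest memory through the cached offset */
        memcpy(buf, (void *)(uintptr_t)(addr + s->map_gpa_to_hva), len);
    } else {
        /* miss or invalidated cache: slow but always correct */
        pci_dma_read(&s->dev, addr, buf, len);
    }
}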


cheers
luigi



[Qemu-devel] wrong argument to qemu_flush_queued_packets() in network frontends ?

2013-01-20 Thread Luigi Rizzo
While running qemu 1.3.0 with the following network-related flags:

-net nic -net tap,ifname=tap0,script=''

I encountered the same problem (should be common to several
frontends, e.g. e100, eepro100, virtio-net, xen_nic):

in net/tap.c :: tap_send(), if qemu_send_packet_async() returns 0
(e.g. because the NIC has no buffers available)
traffic stops, despite the fact that the frontend will try to pull
queued packets when the receive ring is updated.

Upon investigation, it turns out that the backend code does

size = qemu_send_packet_async(&s->nc, buf, size, tap_send_completed);
if (size == 0) {
tap_read_poll(s, 0);

and the arguments are

s->nc.name = "tap.0"
s->nc.peer->name = "hub0port1"
s->nc.send_queue = 0x7f40b2f61e20
s->nc.peer->send_queue = 0x7f40b2f63690 <--- enqueued here

whereas the frontend is trying to pull from a different queue

qemu_flush_queued_packets(&s->nic->nc);

with arguments

s->nic->nc.name = "e1000.0"
s->nic->nc.peer->name = "hub0port0" <--- try to flush this
s->nic->nc.send_queue = 0x7f40b3008ae0
s->nic->nc.peer->send_queue = 0x7f40b2f63660


Note, regular traffic flows correctly across the hub,
but qemu_flush_queued_packets() seems to try and pull
from the wrong place.

Any idea how to fix this (other than the inefficient solution
of leaving read_poll=1 in the frontend) ?

cheers
luigi



Re: [Qemu-devel] wrong argument to qemu_flush_queued_packets() in network frontends ?

2013-01-20 Thread Luigi Rizzo
small correction:

On Sun, Jan 20, 2013 at 6:50 PM, Luigi Rizzo  wrote:

> While running qemu 1.3.0 with the following network-related flags:
>
> -net nic -net tap,ifname=tap0,script=''
>
> I encountered the same problem (should be common to several
> frontends, e.g. e100, eepro100, virtio-net, xen_nic):
>
> in net/tap.c :: tap_send(), if qemu_send_packet_async() returns 0
> (e.g. because the NIC has no buffers available)
> traffic stops, despite the fact that the frontend will try to pull
> queued packets when the receive ring is updated.
>
> Upon investigation, it turns out that the backend code does
>
> size = qemu_send_packet_async(&s->nc, buf, size, tap_send_completed);
> if (size == 0) {
> tap_read_poll(s, 0);
>
> and the arguments are
>
> s->nc.name = "tap.0"
> s->nc.peer->name = "hub0port1"
> s->nc.send_queue = 0x7f40b2f61e20
> s->nc.peer->send_queue = 0x7f40b2f63690 <--- enqueued here
>
> whereas the frontend is trying to pull from a different queue
>
> qemu_flush_queued_packets(&s->nic->nc);
>
> with arguments
>
> s->nic->nc.name = "e1000.0"
> s->nic->nc.peer->name = "hub0port0" <--- try to flush this
> s->nic->nc.send_queue = 0x7f40b3008ae0
>

the queue that is actually flushed is s->nic->nc.send_queue, i.e.
0x7f40b3008ae0

s->nic->nc.peer->send_queue = 0x7f40b2f63660
>
>
> Note, regular traffic flows correctly across the hub,
> but qemu_flush_queued_packets() seems to try and pull
> from the wrong place.
>
> Any idea how to fix this (other than the inefficient solution
> of leaving read_poll=1 in the frontend)
>
> cheers
> luigi
>

cheers
luigi

-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


Re: [Qemu-devel] wrong argument to qemu_flush_queued_packets() in network frontends ?

2013-01-20 Thread Luigi Rizzo
... and upon closer inspection, the problem described below (frontend
blocks the backend, then tries to drain the wrong queue causing a stall)
occurs because the hub in the middle breaks the flow of events.
In the configuration below ( -net nic -net tap,ifname=tap0,... ) we have

e1000.0 <--> hub0port0 [hub] hub0port1 <--> tap.0

The hub0port1 reports as non-writable when all other ports
(just one in this case) are full, and the packet is queued
on hub0port1. However, when the e1000 frontend tries to drain
the queue, it directly accesses the queue attached to hub0port0,
which is empty.
So it appears that the only fix is the following:
when a node is attached to a hub, instead of draining the
queue on that node one should drain all queues attached to the hub.
A new function qemu_flush_hub() would be handy, something like

QLIST_FOREACH(port, &hub->ports, next) {
if (port != source_port)
   qemu_flush_queued_packets(&port->nc);
}

The other option (queueing on the output ports of the hub)
would require a bit more attention to make sure that
the callback is only executed once (and also, to avoid excessive
data replication). Not impossible, but it requires reference-counting
the packet, as in the sketch below.
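
A rough sketch of what that could look like (all names are made up;
NetPacketSent is the existing callback type):

typedef struct RefPacket {
    int refcnt;                 /* one reference per output port */
    NetClientState *sender;
    NetPacketSent *sent_cb;
    size_t size;
    uint8_t data[];
} RefPacket;

static void ref_packet_unref(RefPacket *p)
{
    if (--p->refcnt == 0) {
        if (p->sent_cb) {
            p->sent_cb(p->sender, p->size); /* restart the sender once */
        }
        g_free(p);
    }
}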

What do you think, which way do you prefer ?

cheers
luigi

On Sun, Jan 20, 2013 at 6:50 PM, Luigi Rizzo  wrote:

> While running qemu 1.3.0 with the following network-related flags:
>
> -net nic -net tap,ifname=tap0,script=''
>
> I encountered the same problem (should be common to several
> frontends, e.g. e100, eepro100, virtio-net, xen_nic):
>
> in net/tap.c :: tap_send(), if qemu_send_packet_async() returns 0
> (e.g. because the NIC has no buffers available)
> traffic stops, despite the fact that the frontend will try to pull
> queued packets when the receive ring is updated.
>
> Upon investigation, it turns out that the backend code does
>
> size = qemu_send_packet_async(&s->nc, buf, size, tap_send_completed);
> if (size == 0) {
> tap_read_poll(s, 0);
>
> and the arguments are
>
> s->nc.name = "tap.0"
> s->nc.peer->name = "hub0port1"
> s->nc.send_queue = 0x7f40b2f61e20
> s->nc.peer->send_queue = 0x7f40b2f63690 <--- enqueued here
>
> whereas the frontend is trying to pull from a different queue
>
> qemu_flush_queued_packets(&s->nic->nc);
>
> with arguments
>
> s->nic->nc.name = "e1000.0"
> s->nic->nc.peer->name = "hub0port0" <--- try to flush this
> s->nic->nc.send_queue = 0x7f40b3008ae0
> s->nic->nc.peer->send_queue = 0x7f40b2f63660
>
>
> Note, regular traffic flows correctly across the hub,
> but qemu_flush_queued_packets() seems to try and pull
> from the wrong place.
>
> Any idea how to fix this (other than the inefficient solution
> of leaving read_poll=1 in the frontend)
>
> cheers
> luigi
>



-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


[Qemu-devel] [PATCH] netmap backend

2013-01-21 Thread Luigi Rizzo
Hi,
the attached patch implements a qemu backend for the "netmap" API,
thus allowing virtual machines to attach to the VALE software switch as
well as to netmap-supported cards (links below).

http://info.iet.unipi.it/~luigi/netmap/
http://info.iet.unipi.it/~luigi/vale/

This is a cleaned up version of code written last summer.

guest-guest speed using an e1000 frontend (with some modifications
related to interrupt moderation; I will repost an updated version later):
up to 700 Kpps using sockets, and up to 5 Mpps using netmap within
the guests. I have not tried with virtio.

cheers
luigi



Signed-off-by: Luigi Rizzo 
--
 configure |   31 +
 net/Makefile.objs |1 +
 net/clients.h |4 +
 net/net.c |3 +
 net/qemu-netmap.c |  353 +
 net/queue.c   |   15 +++
 qapi-schema.json  |8 +-
 7 files changed, 414 insertions(+), 1 deletions(-)

diff --git a/configure b/configure
index c6172ef..cfdf8a6 100755
--- a/configure
+++ b/configure
@@ -146,6 +146,7 @@ curl=""
 curses=""
 docs=""
 fdt=""
+netmap=""
 nptl=""
 pixman=""
 sdl=""
@@ -739,6 +740,10 @@ for opt do
   ;;
   --enable-vde) vde="yes"
   ;;
+  --disable-netmap) netmap="no"
+  ;;
+  --enable-netmap) netmap="yes"
+  ;;
   --disable-xen) xen="no"
   ;;
   --enable-xen) xen="yes"
@@ -1112,6 +1117,8 @@ echo "  --disable-uuid   disable uuid support"
 echo "  --enable-uuidenable uuid support"
 echo "  --disable-vdedisable support for vde network"
 echo "  --enable-vde enable support for vde network"
+echo "  --disable-netmap disable support for netmap network"
+echo "  --enable-netmap  enable support for netmap network"
 echo "  --disable-linux-aio  disable Linux AIO support"
 echo "  --enable-linux-aio   enable Linux AIO support"
 echo "  --disable-cap-ng disable libcap-ng support"
@@ -1914,6 +1921,26 @@ EOF
 fi
 
 ##
+# netmap headers probe
+if test "$netmap" != "no" ; then
+  cat > $TMPC << EOF
+#include <inttypes.h>
+#include <net/if.h>
+#include <net/netmap.h>
+#include <net/netmap_user.h>
+int main(void) { return 0; }
+EOF
+  if compile_prog "" "" ; then
+netmap=yes
+  else
+if test "$netmap" = "yes" ; then
+  feature_not_found "netmap"
+fi
+netmap=no
+  fi
+fi
+
+##
 # libcap-ng library probe
 if test "$cap_ng" != "no" ; then
   cap_libs="-lcap-ng"
@@ -3314,6 +3341,7 @@ echo "NPTL support  $nptl"
 echo "GUEST_BASE$guest_base"
 echo "PIE   $pie"
 echo "vde support   $vde"
+echo "netmap support$netmap"
 echo "Linux AIO support $linux_aio"
 echo "ATTR/XATTR support $attr"
 echo "Install blobs $blobs"
@@ -3438,6 +3466,9 @@ fi
 if test "$vde" = "yes" ; then
   echo "CONFIG_VDE=y" >> $config_host_mak
 fi
+if test "$netmap" = "yes" ; then
+  echo "CONFIG_NETMAP=y" >> $config_host_mak
+fi
 if test "$cap_ng" = "yes" ; then
   echo "CONFIG_LIBCAP=y" >> $config_host_mak
 fi
diff --git a/net/Makefile.objs b/net/Makefile.objs
index a08cd14..068253f 100644
--- a/net/Makefile.objs
+++ b/net/Makefile.objs
@@ -10,3 +10,4 @@ common-obj-$(CONFIG_AIX) += tap-aix.o
 common-obj-$(CONFIG_HAIKU) += tap-haiku.o
 common-obj-$(CONFIG_SLIRP) += slirp.o
 common-obj-$(CONFIG_VDE) += vde.o
+common-obj-$(CONFIG_NETMAP) += qemu-netmap.o
diff --git a/net/clients.h b/net/clients.h
index 7793294..952d076 100644
--- a/net/clients.h
+++ b/net/clients.h
@@ -52,4 +52,8 @@ int net_init_vde(const NetClientOptions *opts, const char 
*name,
  NetClientState *peer);
 #endif
 
+#ifdef CONFIG_NETMAP
+int net_init_netmap(const NetClientOptions *opts, const char *name,
+NetClientState *peer);
+#endif
 #endif /* QEMU_NET_CLIENTS_H */
diff --git a/net/net.c b/net/net.c
index cdd9b04..816c987 100644
--- a/net/net.c
+++ b/net/net.c
@@ -618,6 +618,9 @@ static int (* const 
net_client_init_fun[NET_CLIENT_OPTIONS_KIND_MAX])(
 [NET_CLIENT_OPTIONS_KIND_BRIDGE]= net_init_bridge,
 #endif
 [NET_CLIENT_OPTIONS_KIND_HUBPORT]   = net_init_hubport,
+#ifdef CONFIG_NETMAP
+   [NET_CLIENT_OPTIONS_KIND_NETMAP]= net_init_netmap,
+#endif
 };
 
 
diff --git a/net/qemu-netmap.c b/net/qemu-netmap.c
new file mode 100644
index 000..79d7c09
--- /dev/null
+++ b/net/qemu-netmap.c
@@ -0,0 +1,353 @@
+/*
+ * netmap access for qemu
+ *
+ * Copyright (c) 2012-2013 Luigi Rizzo
+ *
+ * Permission is hereby granted, free of c

[Qemu-devel] [PATCH] fix qemu_flush_queued_packets() in presence of a hub

2013-01-21 Thread Luigi Rizzo
when the frontend and backend are connected through a hub as below
(showing only one direction), and the frontend (or, in general, all
output ports of the hub) cannot accept more traffic, the backend
queues packets in queue A.

When the frontend (or, in general, one output port) becomes ready again,
qemu tries to flush packets from queue B, which is unfortunately empty.

   e1000.0 <--[queue B]-- hub0port0(hub)hub0port1 <--[queue A]-- tap.0

To fix this I propose to introduce a new function net_hub_flush(),
which is called when trying to flush a queue connected to a hub.

cheers
luigi

Signed-off-by: Luigi Rizzo 
 net/hub.c |   13 +
 net/hub.h |1 +
 net/net.c |6 ++

diff --git a/net/hub.c b/net/hub.c
index a24c9d1..08f95d0 100644
--- a/net/hub.c
+++ b/net/hub.c
@@ -338,3 +338,16 @@ void net_hub_check_clients(void)
 }
 }
 }
+
+bool net_hub_flush(NetClientState *nc)
+{
+NetHubPort *port;
+NetHubPort *source_port = DO_UPCAST(NetHubPort, nc, nc);
+int ret = 0;
+
+QLIST_FOREACH(port, &source_port->hub->ports, next) {
+   if (port != source_port)
+   ret += qemu_net_queue_flush(port->nc.send_queue);
+}
+return ret ? true : false;
+}
diff --git a/net/hub.h b/net/hub.h
index 583ada8..a625eff 100644
--- a/net/hub.h
+++ b/net/hub.h
@@ -21,5 +21,6 @@ NetClientState *net_hub_add_port(int hub_id, const char 
*name);
 NetClientState *net_hub_find_client_by_name(int hub_id, const char *name);
 void net_hub_info(Monitor *mon);
 void net_hub_check_clients(void);
+bool net_hub_flush(NetClientState *nc);
 
 #endif /* NET_HUB_H */
diff --git a/net/net.c b/net/net.c
index 816c987..8caa368 100644
--- a/net/net.c
+++ b/net/net.c
@@ -355,6 +355,12 @@ void qemu_flush_queued_packets(NetClientState *nc)
 {
 nc->receive_disabled = 0;
 
+if (nc->peer && nc->peer->info->type == NET_CLIENT_OPTIONS_KIND_HUBPORT) {
+   if (net_hub_flush(nc->peer)) {
+   qemu_notify_event();
+   }
+   return;
+}
 if (qemu_net_queue_flush(nc->send_queue)) {
 /* We emptied the queue successfully, signal to the IO thread to repoll
  * the file descriptor (for tap, for example).



[Qemu-devel] [PATCH v2] netmap backend (revised)

2013-01-21 Thread Luigi Rizzo
reposting a version without the changes that implement bounded
queues in net/queue.c.

Hi,
the attached patch implements a qemu backend for the "netmap" API,
thus allowing virtual machines to attach to the VALE software switch as
well as to netmap-supported cards (links below).

http://info.iet.unipi.it/~luigi/netmap/
http://info.iet.unipi.it/~luigi/vale/

This is a cleaned up version of code written last summer.

guest-guest speed using an e1000 frontend (with some modifications
related to interrupt moderation; I will repost an updated version later):
up to 700 Kpps using sockets, and up to 5 Mpps using netmap within
the guests. I have not tried with virtio.

cheers
luigi



Signed-off-by: Luigi Rizzo 
--
 configure |   31 +
 net/Makefile.objs |1 +
 net/clients.h |4 +
 net/net.c |3 +
 net/qemu-netmap.c |  353 +
 qapi-schema.json  |8 +-

diff --git a/configure b/configure
index c6172ef..cfdf8a6 100755
--- a/configure
+++ b/configure
@@ -146,6 +146,7 @@ curl=""
 curses=""
 docs=""
 fdt=""
+netmap=""
 nptl=""
 pixman=""
 sdl=""
@@ -739,6 +740,10 @@ for opt do
   ;;
   --enable-vde) vde="yes"
   ;;
+  --disable-netmap) netmap="no"
+  ;;
+  --enable-netmap) netmap="yes"
+  ;;
   --disable-xen) xen="no"
   ;;
   --enable-xen) xen="yes"
@@ -1112,6 +1117,8 @@ echo "  --disable-uuid   disable uuid support"
 echo "  --enable-uuidenable uuid support"
 echo "  --disable-vdedisable support for vde network"
 echo "  --enable-vde enable support for vde network"
+echo "  --disable-netmap disable support for netmap network"
+echo "  --enable-netmap  enable support for netmap network"
 echo "  --disable-linux-aio  disable Linux AIO support"
 echo "  --enable-linux-aio   enable Linux AIO support"
 echo "  --disable-cap-ng disable libcap-ng support"
@@ -1914,6 +1921,26 @@ EOF
 fi
 
 ##
+# netmap headers probe
+if test "$netmap" != "no" ; then
+  cat > $TMPC << EOF
+#include <inttypes.h>
+#include <net/if.h>
+#include <net/netmap.h>
+#include <net/netmap_user.h>
+int main(void) { return 0; }
+EOF
+  if compile_prog "" "" ; then
+netmap=yes
+  else
+if test "$netmap" = "yes" ; then
+  feature_not_found "netmap"
+fi
+netmap=no
+  fi
+fi
+
+##
 # libcap-ng library probe
 if test "$cap_ng" != "no" ; then
   cap_libs="-lcap-ng"
@@ -3314,6 +3341,7 @@ echo "NPTL support  $nptl"
 echo "GUEST_BASE$guest_base"
 echo "PIE   $pie"
 echo "vde support   $vde"
+echo "netmap support$netmap"
 echo "Linux AIO support $linux_aio"
 echo "ATTR/XATTR support $attr"
 echo "Install blobs $blobs"
@@ -3438,6 +3466,9 @@ fi
 if test "$vde" = "yes" ; then
   echo "CONFIG_VDE=y" >> $config_host_mak
 fi
+if test "$netmap" = "yes" ; then
+  echo "CONFIG_NETMAP=y" >> $config_host_mak
+fi
 if test "$cap_ng" = "yes" ; then
   echo "CONFIG_LIBCAP=y" >> $config_host_mak
 fi
diff --git a/net/Makefile.objs b/net/Makefile.objs
index a08cd14..068253f 100644
--- a/net/Makefile.objs
+++ b/net/Makefile.objs
@@ -10,3 +10,4 @@ common-obj-$(CONFIG_AIX) += tap-aix.o
 common-obj-$(CONFIG_HAIKU) += tap-haiku.o
 common-obj-$(CONFIG_SLIRP) += slirp.o
 common-obj-$(CONFIG_VDE) += vde.o
+common-obj-$(CONFIG_NETMAP) += qemu-netmap.o
diff --git a/net/clients.h b/net/clients.h
index 7793294..952d076 100644
--- a/net/clients.h
+++ b/net/clients.h
@@ -52,4 +52,8 @@ int net_init_vde(const NetClientOptions *opts, const char 
*name,
  NetClientState *peer);
 #endif
 
+#ifdef CONFIG_NETMAP
+int net_init_netmap(const NetClientOptions *opts, const char *name,
+NetClientState *peer);
+#endif
 #endif /* QEMU_NET_CLIENTS_H */
diff --git a/net/net.c b/net/net.c
index cdd9b04..816c987 100644
--- a/net/net.c
+++ b/net/net.c
@@ -618,6 +618,9 @@ static int (* const 
net_client_init_fun[NET_CLIENT_OPTIONS_KIND_MAX])(
 [NET_CLIENT_OPTIONS_KIND_BRIDGE]= net_init_bridge,
 #endif
 [NET_CLIENT_OPTIONS_KIND_HUBPORT]   = net_init_hubport,
+#ifdef CONFIG_NETMAP
+   [NET_CLIENT_OPTIONS_KIND_NETMAP]= net_init_netmap,
+#endif
 };
 
 
diff --git a/net/qemu-netmap.c b/net/qemu-netmap.c
new file mode 100644
index 000..79d7c09
--- /dev/null
+++ b/net/qemu-netmap.c
@@ -0,0 +1,353 @@
+/*
+ * netmap access for qemu
+ *
+ * Copyright (c) 2012-2013 Luigi Rizzo
+ *
+ * Permission is hereby gra

Re: [Qemu-devel] [PATCH v2] netmap backend (revised)

2013-01-23 Thread Luigi Rizzo
On Tue, Jan 22, 2013 at 2:50 PM, Anthony Liguori wrote:

> Hi,
>
> Thank you for submitting your patch series.  checkpatch.pl has
> detected that one or more of the patches in this series violate
> the QEMU coding style.
>
> If you believe this message was sent in error, please ignore it
> or respond here with an explanation.
>
> Otherwise, please correct the coding style issues and resubmit a
> new version of the patch.
>
>
will do, thanks for the feedback
luigi


Re: [Qemu-devel] [PATCH v2] netmap backend (revised)

2013-01-23 Thread Luigi Rizzo
On Wed, Jan 23, 2013 at 12:10:55PM +0100, Stefan Hajnoczi wrote:
> On Tue, Jan 22, 2013 at 08:12:15AM +0100, Luigi Rizzo wrote:
> > reposting a version without changes that implement bounded
> > queues in net/queue.c
> > 
> > Hi,
> > the attached patch implements a qemu backend for the "netmap" API
> > thus allowing machines to attach to the VALE software switch as
> > well as netmap-supported cards (links below).
> > 
> > http://info.iet.unipi.it/~luigi/netmap/
> > http://info.iet.unipi.it/~luigi/vale/
> > 
> > This is a cleaned up version of code written last summer.
> 
> Looks like a clean software approach to low-level packet I/O.  I guess
> the biggest competitor would be a userspace library with NIC drivers
> using modern IOMMUs to avoid the security/reliability problem that
> previous userspace approaches suffered.  Pretty cool that netmap reuses
> kernel NIC driver implementations to avoid duplicating all that code.
> 
> I wonder how/if netmap handles advanced NIC features like multiqueue
> and offloads?  But I've only read the webpage, not the papers or source
> code :).

there are definitely more details in the papers :)

IOMMU is not strictly necessary because userspace only sees packet
buffers and a device independent replica of the descriptor ring.
NIC registers and rings are only manipulated by the kernel within
system calls.

multiqueue is completely straightforward -- netmap exposes as many
queues as the hardware implements; you can attach file descriptors to
individual queues, and bind threads to queues using pthread_setaffinity().
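
For example, attaching to a single hardware ring and pinning the thread
looks roughly like this (error handling omitted; on Linux the CPU_*
macros need _GNU_SOURCE, and FreeBSD spells the affinity call the same
way in pthread_np.h):

#include <fcntl.h>
#include <string.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

void bind_to_hw_ring(const char *ifname, int ring)
{
    struct nmreq nmr;
    int fd = open("/dev/netmap", O_RDWR);
    cpu_set_t cs;

    memset(&nmr, 0, sizeof(nmr));
    strncpy(nmr.nr_name, ifname, sizeof(nmr.nr_name));
    nmr.nr_ringid = NETMAP_HW_RING | ring;  /* this fd sees one hw ring */
    ioctl(fd, NIOCREGIF, &nmr);

    CPU_ZERO(&cs);
    CPU_SET(ring, &cs);                     /* one core per ring */
    pthread_setaffinity_np(pthread_self(), sizeof(cs), &cs);
}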

offloading so far is not supported, and that's part of the design:
checking the various possible combinations of offloads in various
places in the stack adds complexity at runtime, whereas a single
packet format makes the driver extremely efficient.

The thing that may make a difference is TCP segmentation and reassembly;
we will look at how to support it at some point.

I'll apply the other changes you suggest below, thanks.
Only some comments:

The debugging macros (D(), ND(), RD()) are taken from our existing code:
> 
> > +#define ND(fd, ... )   // debugging
> > +#define D(format, ...)  \
...
> > +/* rate limited, lps indicates how many per second */
> > +#define RD(lps, format, ...)\
...

> Have you seen docs/tracing.txt?  It can do fprintf(stderr) but also
> SystemTap/DTrace and a simple built-in binary tracer.

will look at it. These debugging macros are the same we use in other
netmap code, so I'd prefer to keep them.

> > +// a fast copy routine only for multiples of 64 bytes, non overlapped.
> > +static inline void
> > +pkt_copy(const void *_src, void *_dst, int l)
...
> > +*dst++ = *src++;
> > +}
> > +}
> 
> I wonder how different FreeBSD bcopy() is from glibc memcpy() and if the
> optimization is even a win.  The glibc code is probably hand-written
> assembly that CPU vendors have contributed for specific CPU model
> families.
> 
> Did you compare glibc memcpy() against pkt_copy()?

I haven't tried in detail on glibc but will run some tests.  In any
case not all systems have glibc, and on FreeBSD this pkt_copy was
a significant win for small packets (saving some 20ns each; of
course this counts only when you approach the 10 Mpps range, which
is what you get with netmap, and of course when data is in cache).

One reason pkt_copy gains something is that, if it can assume there
is extra space in the buffer, it can work on large chunks, avoiding the
extra jumps and instructions for the remaining 1-2-4 bytes.
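
For reference, the loop in question is essentially this (a sketch of
netmap's pkt_copy; it assumes buffers are padded to a multiple of 64
bytes and do not overlap):

#include <stdint.h>
#include <string.h>

static inline void
pkt_copy(const void *_src, void *_dst, int l)
{
    const uint64_t *src = _src;
    uint64_t *dst = _dst;

    if (l >= 1024) {            /* fall back for large copies */
        memcpy(dst, src, l);
        return;
    }
    for (; l > 0; l -= 64) {    /* one 64-byte chunk per iteration */
        *dst++ = *src++; *dst++ = *src++;
        *dst++ = *src++; *dst++ = *src++;
        *dst++ = *src++; *dst++ = *src++;
        *dst++ = *src++; *dst++ = *src++;
    }
}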

> > +   ring->slot[i].len = size;
> > +   pkt_copy(buf, dst, size);
> 
> How does netmap guarantee that the buffer is large enough for this
> packet?

the netmap buffers are 2 KB; I'll make sure there is a check that the
size is not exceeded.

> > +close(s->me.fd);
> 
> Missing munmap?

yes you are correct.

cheers
luigi



[Qemu-devel] memcpy speed (Re: [PATCH v2] netmap backend (revised))

2013-01-23 Thread Luigi Rizzo
On Wed, Jan 23, 2013 at 02:03:17PM +0100, Stefan Hajnoczi wrote:
> On Wed, Jan 23, 2013 at 12:50:26PM +0100, Luigi Rizzo wrote:
> > On Wed, Jan 23, 2013 at 12:10:55PM +0100, Stefan Hajnoczi wrote:
> > > On Tue, Jan 22, 2013 at 08:12:15AM +0100, Luigi Rizzo wrote:
...
> > > > +// a fast copy routine only for multiples of 64 bytes, non overlapped.
> > > > +static inline void
> > > > +pkt_copy(const void *_src, void *_dst, int l)
> > ...
> > > > +*dst++ = *src++;
> > > > +}
> > > > +}
> > > 
> > > I wonder how different FreeBSD bcopy() is from glibc memcpy() and if the
> > > optimization is even a win.  The glibc code is probably hand-written
> > > assembly that CPU vendors have contributed for specific CPU model
> > > families.
> > > 
> > > Did you compare glibc memcpy() against pkt_copy()?
> > 
> > I haven't tried in detail on glibc but will run some tests.  In any
> > case not all systems have glibc, and on FreeBSD this pkt_copy was
> > a significant win for small packets (saving some 20ns each; of
> > course this counts only when you approach the 10 Mpps range, which
> > is what you get with netmap, and of course when data is in cache).
> > 
> > One reason pkt_copy gains something is that if it can assume there
> > is extra space in the buffer, it can work on large chunks avoiding the extra
> > jumps and instructions for the remaining 1-2-4 bytes.
> 
> I'd like to drop this code or at least make it FreeBSD-specific since
> there's no guarantee that this is a good idea on any other libc.
> 
> I'm even doubtful that it's always a win on FreeBSD.  You have a
> threshold to fall back to bcopy() and who knows what the "best" value
> for various CPUs is.

indeed.
With the attached program (which, however, might be affected by the
fact that data is not used after copying) it seems that on a recent
Linux (using gcc 4.6.2) the fastest is __builtin_memcpy()

./testlock -m __builtin_memcpy -l 64

(by a factor of 2 or more) whereas all the other methods have
approximately the same speed.

On FreeBSD (with clang, gcc 4.2.1, gcc 4.6.4) the pkt_copy() above

./testlock -m fastcopy -l 64

is largely better than the other methods. I am a bit puzzled why
the builtin method on FreeBSD is not effective, but I will check
on some other forum...

cheers
luigi

/*
 * Copyright (C) 2012 Luigi Rizzo. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *   1. Redistributions of source code must retain the above copyright
 *  notice, this list of conditions and the following disclaimer.
 *   2. Redistributions in binary form must reproduce the above copyright
 *  notice, this list of conditions and the following disclaimer in the
 *documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

/*
 * $Id: testlock.c 12015 2013-01-23 15:51:17Z luigi $
 *
 * Test program to study various ops and concurrency issues.
 * Create multiple threads, possibly bind to cpus, and run a workload.
 *
 * cc -O2 -Werror -Wall testlock.c -o testlock -lpthread
 *  you might need -lrt
 */

#include <inttypes.h>
#include <stdlib.h>
#include <pthread.h>/* pthread_* */

#if defined(__APPLE__)

#include <libkern/OSAtomic.h>
#define atomic_add_int(p, n) OSAtomicAdd32(n, (int *)p)
#define atomic_cmpset_32(p, o, n)   OSAtomicCompareAndSwap32(o, n, (int *)p)

#elif defined(linux)

int atomic_cmpset_32(volatile uint32_t *p, uint32_t old, uint32_t new)
{
int ret = *p == old;
*p = new;
return ret;
}

#if defined(HAVE_GCC_ATOMICS)
int atomic_add_int(volatile int *p, int v)
{
return __sync_fetch_and_add(p, v);
}
#else
inline
uint32_t atomic_add_int(uint32_t *p, int v)
{
__asm __volatile (
"   lock   xaddl   %0, %1 ;"
: "+r" (v), /* 0 (result) */
  "=m" (*p) 

Re: [Qemu-devel] memcpy speed (Re: [PATCH v2] netmap backend (revised))

2013-01-23 Thread Luigi Rizzo
On Wed, Jan 23, 2013 at 8:03 AM, Luigi Rizzo  wrote:

> > I'm even doubtful that it's always a win on FreeBSD.  You have a
> > threshold to fall back to bcopy() and who knows what the "best" value
> > for various CPUs is.
>
> indeed.
> With the attached program (which however might be affected by the
> fact that data is not used after copying) it seems that on a recent
> linux (using gcc 4.6.2) the fastest is __builtin_memcpy()
>
> ./testlock -m __builtin_memcpy -l 64
>
> (by a factor of 2 or more) whereas all the other methods have
> approximately the same speed.
>

never mind, pilot error: in my test program I had swapped the
source and destination operands for __builtin_memcpy(), and
this substantially changed the memory access pattern.

With the correct operands, __builtin_memcpy == memcpy == bcopy
on both FreeBSD and Linux, and never faster than pkt_copy().
On FreeBSD the unrolled pkt_copy() still beats all the other methods
at small packet sizes, whereas on Linux they are equivalent.

If you are curious why swapping source and dst changed things
so dramatically:

the test was supposed to read from a large chunk of
memory (over 1GB) to avoid always hitting L1 or L2.
Swapping operands causes reads to hit always the same line,
thus saving a lot of misses. The difference between the two
machine then probably is due to how the cache is used on writes.

cheers
luigi


-- 
-----+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


Re: [Qemu-devel] [PATCH] e1000: cleanup process_tx_desc

2013-06-04 Thread Luigi Rizzo
On Tue, Jun 4, 2013 at 9:34 AM, Andrew Jones  wrote:

>
>
> - Original Message -
> > On 06/03/2013 10:20 AM, Andrew Jones wrote:
> > > Coverity complains about two overruns in process_tx_desc(). The
> > > complaints are false positives, but we might as well eliminate
> > > them. The problem is that "hdr" is defined as an unsigned int,
> > > but then used to offset an array of size 65536, and another of
> > > size 256 bytes. hdr will actually never be greater than 255
> > > though, as it's assigned only once and to the value of
> > > tp->hdr_len, which is an uint8_t. This patch simply gets rid of
> > > hdr, replacing it with tp->hdr_len, which makes it consistent
> > > with all other tp member use in the function.
> > >
> > > Signed-off-by: Andrew Jones 
> > > ---
> > >  hw/net/e1000.c | 16 
> > >  1 file changed, 8 insertions(+), 8 deletions(-)
> > >
> >
> > The logic looks sound, but checkpatch detects some style issues. See
> below.
>
> ...


> > Although the style issues were present to begin with, we may as well take
> > the opportunity to fix them.
>
> Running checkpatch on the file yields
>
> 142 errors, 41 warnings
>
> I could send a v2 that fixes the 1 error and 2 warnings found in the
> context
> of this patch, but why? It's out of the scope of the patch (although I did
> use "cleanup" in the summary...), and it would hardly make a dent in this
> file's problems.
> 
>
>
Indeed. I find it slightly annoying (and a waste of everyone's time)
that patches are bounced on issues they are not responsible for
(this happens in several other open-source projects I have been
involved with).

I think that a much more effective approach would be to take the patch
(on the grounds that it improves the codebase),
and then, if the committer feels like fixing the pre-existing
style issues, he can do it separately, or make a note in the
commit log if he has no time (and by the same reasoning, the original
submitter may be short of time).

cheers
luigi

>
> drew
>
> >
> > Sincerely,
> >
> > Jesse Larrew
> > Software Engineer, KVM Team
> > IBM Linux Technology Center
> > Phone: (512) 973-2052 (T/L: 363-2052)
> > jlar...@linux.vnet.ibm.com
> >
> >
>
>


-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


[Qemu-devel] difference between receive_raw() and receive() NetClientInfo methods ?

2013-06-06 Thread Luigi Rizzo
Can someone clarify what is the difference between the two methods
receive_raw() and receive() in NetClientInfo ?

tap seems to be the only backend actually implementing receive_raw(),
but apart from prepending a vnet_hdr I cannot tell what it is for,
and whether receive_raw() is a custom add-on for tap, or whether it can
be used by other backends to implement different features.

The reason I am asking is that we are working on a "fewer copies"
path between guests, and one of the things we need is an
asynchronous "receive" so that a backend can notify the frontend
when a buffer has been actually consumed.
Right now nc->info->receive() and friends are all synchronous,
and there is no mechanism to support asynchronous completion,
which forces the backend to make a copy if it cannot
complete the request inline.
Hence I probably need to add another method to NetClientInfo
so that the frontend can register the callback, or pass it to
the backend together with the buffer/iov.
thanks
luigi


[Qemu-devel] qemu 1.1.0 slow as hell booting ?

2012-07-19 Thread Luigi Rizzo
hi,
while playing with various qemu versions i noticed that
qemu-devel now in FreeBSD ports (1.1.0) is slow as hell when
booting FreeBSD (take for instance the netmap image at

http://info.iet.unipi.it/~luigi/netmap/20120608-netmap-picobsd-head-amd64.bin

while 0.11.x and 1.1.1 and 1.0.1 seem to work well.

I should mention that I had the same exact problem a few months
ago with a qemu version compiled on my Mac using macports,
which was fixed by manually building one myself from the sources
(the working Mac version reports version 1.0.91; unfortunately
I do not remember which version it was, but judging from the
macports history it could have been either 0.14.1 or 1.0),
so I do not think this is a specific problem with FreeBSD.

Are you able to reproduce this problem ? Any idea what it could be ?

cheers
luigi



Re: [Qemu-devel] qemu 1.1.0 slow as hell booting ?

2012-07-20 Thread Luigi Rizzo
On Thu, Jul 19, 2012 at 04:43:05PM +0200, Luigi Rizzo wrote:
> hi,
> while playing with various qemu versions i noticed that
> qemu-devel now in FreeBSD ports (1.1.0) is slow as hell when
> booting FreeBSD (take for instance the netmap image at
> 
> http://info.iet.unipi.it/~luigi/netmap/20120608-netmap-picobsd-head-amd64.bin
> 
> while 0.11.x and 1.1.1 and 1.0.1 seem to work well.
> 
> I should mention that i had the same exact problem a few months
> ago with a qemu version compiled on my Mac using macports,
> which was fixed by manually building one myself from the sources
> (the working Mac version reports version 1.0.91; unfortunately
> i do not remember which version it was, but judging from the
> macports history it could have been either 0.14.1 or 1.0)
> so i do not think this is a specific problem with FreeBSD.
> 
> Are you able to reproduce this problem ?  Any idea on what could it be ?

Just for the archives:

the problem was that the FreeBSD port, when compiled with CLANG,
used --enable-tcg-interpreter, which caused slow code to be
generated. I suppose the same problem existed in the macports version.

cheers
luigi

> cheers
> luigi



Re: [Qemu-devel] qemu 1.1.0 slow as hell booting ?

2012-07-20 Thread Luigi Rizzo
On Fri, Jul 20, 2012 at 09:04:51PM +1000, ronnie sahlberg wrote:
> Is it only very slow during boot   but then becomes normal speed when
> the system is fully booted and kernel and userspace are all up and
> running normally?

I did not have enough patience to go past the 2 minutes needed
to load text+data for the kernel.
See my follow-up: it was due to configure using --enable-tcg-interpreter
when compiling with CLANG.

cheers
luigi


> regards
> ronnie sahlberg
> 
> 
> On Fri, Jul 20, 2012 at 12:43 AM, Luigi Rizzo  wrote:
> > hi,
> > while playing with various qemu versions i noticed that
> > qemu-devel now in FreeBSD ports (1.1.0) is slow as hell when
> > booting FreeBSD (take for instance the netmap image at
> >
> > http://info.iet.unipi.it/~luigi/netmap/20120608-netmap-picobsd-head-amd64.bin
> >
> > while 0.11.x and 1.1.1 and 1.0.1 seem to work well.
> >
> > I should mention that i had the same exact problem a few months
> > ago with a qemu version compiled on my Mac using macports,
> > which was fixed by manually building one myself from the sources
> > (the working Mac version reports version 1.0.91; unfortunately
> > i do not remember which version it was, but judging from the
> > macports history it could have been either 0.14.1 or 1.0)
> > so i do not think this is a specific problem with FreeBSD.
> >
> > Are you able to reproduce this problem ?  Any idea on what could it be ?
> >
> > cheers
> > luigi
> >



[Qemu-devel] forgotten commit ? (Re: Proposed patch: huge RX speedup for hw/e1000.c)

2012-07-24 Thread Luigi Rizzo
Paolo,
a few weeks ago you posted the patch below, but apparently
it did not get in after my 'Tested-by' reply in early June.
I'd like to confirm that your patch works perfectly for me.

Tested-by: Luigi Rizzo 

cheers
luigi

On Thu, May 31, 2012 at 01:03:35PM +0200, Paolo Bonzini wrote:
> Il 31/05/2012 12:40, Jan Kiszka ha scritto:
> > On 2012-05-31 12:03, Paolo Bonzini wrote:
> >> Il 31/05/2012 10:23, Jan Kiszka ha scritto:
> >>>>> @@ -922,6 +923,7 @@ set_rdt(E1000State *s, int index, uint32_t val)
> >>>>>  {
> >>>>>  s->check_rxov = 0;
> >>>>>  s->mac_reg[index] = val & 0x;
> >>>>> +qemu_notify_event();
> >>> This still looks like the wrong tool: Packets that can't be delivered
> >>> are queued.
> >>
> >> Packets that are read from the tap but can't be delivered are queued;
> >> packets that are left on the tap need qemu_notify_event to be flushed.
> >>
> >>> So we need to flush the queue and clear the blocked delivery
> >>> there. qemu_flush_queued_packets appears more appropriate for this.
> >>
> >> Right, and qemu_flush_queued_packets needs to call qemu_notify_event
> >> which makes the call in virtio-net unnecessary.
> >>
> >> Paolo
> >>
> >> diff --git a/hw/e1000.c b/hw/e1000.c
> >> index 4573f13..43d933a 100644
> >> --- a/hw/e1000.c
> >> +++ b/hw/e1000.c
> >> @@ -295,6 +295,7 @@ set_rx_control(E1000State *s, int index, uint32_t val)
> >>  s->rxbuf_min_shift = ((val / E1000_RCTL_RDMTS_QUAT) & 3) + 1;
> >>  DBGOUT(RX, "RCTL: %d, mac_reg[RCTL] = 0x%x\n", s->mac_reg[RDT],
> >> s->mac_reg[RCTL]);
> >> +qemu_flush_queued_packets(&s->nic->nc);
> >>  }
> >>  
> >>  static void
> >> @@ -926,6 +927,9 @@ set_rdt(E1000State *s, int index, uint32_t val)
> >>  {
> >>  s->check_rxov = 0;
> >>  s->mac_reg[index] = val & 0x;
> >> +if (e1000_has_rxbufs(s, 1)) {
> >> +qemu_flush_queued_packets(&s->nic->nc);
> >> +}
> >>  }
> >>  
> >>  static void
> >> diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> >> index 3f190d4..0974945 100644
> >> --- a/hw/virtio-net.c
> >> +++ b/hw/virtio-net.c
> >> @@ -447,10 +447,6 @@ static void virtio_net_handle_rx(VirtIODevice *vdev, 
> >> VirtQueue *vq)
> >>  VirtIONet *n = to_virtio_net(vdev);
> >>  
> >>  qemu_flush_queued_packets(&n->nic->nc);
> >> -
> >> -/* We now have RX buffers, signal to the IO thread to break out of the
> >> - * select to re-poll the tap file descriptor */
> >> -qemu_notify_event();
> >>  }
> >>  
> >>  static int virtio_net_can_receive(VLANClientState *nc)
> >> diff --git a/net.c b/net.c
> >> index 1922d8a..fa846ae 100644
> >> --- a/net.c
> >> +++ b/net.c
> >> @@ -491,7 +491,12 @@ void qemu_flush_queued_packets(VLANClientState *vc)
> >>  queue = vc->send_queue;
> >>  }
> >>  
> >> -qemu_net_queue_flush(queue);
> >> +if (qemu_net_queue_flush(queue)) {
> >> +/* We emptied the queue successfully, signal to the IO thread to 
> >> repoll
> >> + * the file descriptor (for tap, for example).
> >> + */
> >> +qemu_notify_event();
> >> +}
> >>  }
> >>  
> >>  static ssize_t qemu_send_packet_async_with_flags(VLANClientState *sender,
> >> diff --git a/net/queue.c b/net/queue.c
> >> index 1ab5247..fd1c7e6 100644
> >> --- a/net/queue.c
> >> +++ b/net/queue.c
> >> @@ -232,7 +232,7 @@ void qemu_net_queue_purge(NetQueue *queue, 
> >> VLANClientState *from)
> >>  }
> >>  }
> >>  
> >> -void qemu_net_queue_flush(NetQueue *queue)
> >> +bool qemu_net_queue_flush(NetQueue *queue)
> >>  {
> >>  while (!QTAILQ_EMPTY(&queue->packets)) {
> >>  NetPacket *packet;
> >> @@ -248,7 +248,7 @@ void qemu_net_queue_flush(NetQueue *queue)
> >>   packet->size);
> >>  if (ret == 0) {
> >>  QTAILQ_INSERT_HEAD(&queue->packets, packet, entry);
> >> -break;
> >> +return 0;
> >>  }
> >>  
> >>  if (packet->sent_cb) {
> >> @@ -257,4 +257,5 @@ void qemu_net_queue_flush(NetQueue *queue)
> >>  
> >>  g_free(packet);
> >>  }
> >> +return 1;
> >>  }
> >> diff --git a/net/queue.h b/net/queue.h
> >> index a31958e..4bf6d3c 100644
> >> --- a/net/queue.h
> >> +++ b/net/queue.h
> >> @@ -66,6 +66,6 @@ ssize_t qemu_net_queue_send_iov(NetQueue *queue,
> >>  NetPacketSent *sent_cb);
> >>  
> >>  void qemu_net_queue_purge(NetQueue *queue, VLANClientState *from);
> >> -void qemu_net_queue_flush(NetQueue *queue);
> >> +bool qemu_net_queue_flush(NetQueue *queue);
> >>  
> >>  #endif /* QEMU_NET_QUEUE_H */
> > 
> > Looks good.
> 
> Luigi, please reply with a Tested-by when possible.
> 
> Paolo



[Qemu-devel] interrupt mitigation for e1000

2012-07-24 Thread Luigi Rizzo
I noticed that the various NIC modules in qemu/kvm do not implement
interrupt mitigation, which is very beneficial as it dramatically
reduces exits from the hypervisor.

As a proof of concept I tried to implement it for the e1000 driver
(patch below), and it brings tx performance from 9 to 56 Kpps on
qemu-softmmu, and from ~20 to 140 Kpps on qemu-kvm.

I am going to measure the rx interrupt mitigation in the next couple
of days.

Is there any interest in having this code in ?

cheers
luigi

diff -ubwrp --exclude '*.[do]' /tmp/qemu-61dc008/hw/e1000.c ./hw/e1000.c
--- /tmp/qemu-61dc008/hw/e1000.c2012-07-20 01:25:52.0 +0200
+++ ./hw/e1000.c2012-07-24 18:21:39.0 +0200
@@ -33,6 +33,8 @@
 #include "sysemu.h"
 #include "dma.h"
 
+#define MITIGATION
+
 #include "e1000_hw.h"
 
 #define E1000_DEBUG
@@ -127,6 +129,13 @@ typedef struct E1000State_st {
 } eecd_state;
 
 QEMUTimer *autoneg_timer;
+
+#ifdef MITIGATION
+QEMUBH *int_bh;// interrupt mitigation handler
+int tx_ics_count;  // pending tx int requests
+int rx_ics_count;  // pending rx int requests
+int int_cause; // int cause
+#endif // MITIGATION
 } E1000State;
 
 #definedefreg(x)   x = (E1000_##x>>2)
@@ -638,6 +648,26 @@ start_xmit(E1000State *s)
 return;
 }
 
+#ifdef MITIGATION
+/* we transmit the first few packets immediately; after that,
+ * we transmit only when we are approaching a full ring, in
+ * which case we also send an ics.
+ *
+ */
+{
+int len, pending;
+len = s->mac_reg[TDLEN] / sizeof(desc) ;
+pending = s->mac_reg[TDT] - s->mac_reg[TDH];
+if (pending < 0)
+   pending += len;
+/* ignore requests after the first few ones, as long as
+ * we are not approaching a full ring.
+ * Otherwise, deliver packets to the backend.
+ */
+if (s->tx_ics_count > 4 && s->tx_ics_count + pending < len - 5)
+   return;
+#endif // MITIGATION
+
 while (s->mac_reg[TDH] != s->mac_reg[TDT]) {
 base = tx_desc_base(s) +
sizeof(struct e1000_tx_desc) * s->mac_reg[TDH];
@@ -663,7 +693,21 @@ start_xmit(E1000State *s)
 break;
 }
 }
+#ifdef MITIGATION
+s->int_cause |= cause; // remember the interrupt cause.
+s->tx_ics_count += pending;
+if (s->tx_ics_count >= len - 5) {
+// if the ring is about to become full, generate an interrupt
+   set_ics(s, 0, s->int_cause);
+   s->tx_ics_count = 0;
+   s->int_cause = 0;
+} else {   // otherwise just schedule it for later.
+qemu_bh_schedule_idle(s->int_bh);
+}
+}
+#else /* !MITIGATION */
 set_ics(s, 0, cause);
+#endif
 }
 
 static int
@@ -875,7 +919,27 @@ e1000_receive(VLANClientState *nc, const
 s->rxbuf_min_shift)
 n |= E1000_ICS_RXDMT0;
 
+#ifdef MITIGATION
+#define MIT_RXDMT0_SENT10  // large
+s->int_cause |= n;
+if (s->rx_ics_count == 0) {
+   /* deliver the first interrupt */
+   set_ics(s, 0, s->int_cause);
+   s->int_cause = 0;
+   s->rx_ics_count++;
+} else if ( (n & E1000_ICS_RXDMT0) && s->rx_ics_count < MIT_RXDMT0_SENT) {
+   /* also deliver if we are approaching ring full */
+   set_ics(s, 0, s->int_cause);
+   s->int_cause = 0;
+   s->rx_ics_count = MIT_RXDMT0_SENT;
+} else {
+   /* otherwise schedule for later */
+   s->rx_ics_count++;
+   qemu_bh_schedule_idle(s->int_bh);
+}
+#else /* !MITIGATION */
 set_ics(s, 0, n);
+#endif /* !MITIGATION */
 
 return size;
 }
@@ -1214,6 +1281,20 @@ static NetClientInfo net_e1000_info = {
 .link_status_changed = e1000_set_link_status,
 };
 
+#ifdef MITIGATION
+static void e1000_int_bh(void *opaque)
+{
+E1000State *s = opaque;
+if (s->tx_ics_count < 1 && s->rx_ics_count < 1)
+   return;
+s->tx_ics_count = 0;
+s->rx_ics_count = 0;
+start_xmit(s);
+set_ics(s, 0, s->int_cause);
+s->int_cause = 0;
+}
+#endif /* MITIGATION */
+
 static int pci_e1000_init(PCIDevice *pci_dev)
 {
 E1000State *d = DO_UPCAST(E1000State, dev, pci_dev);
@@ -1231,6 +1312,9 @@ static int pci_e1000_init(PCIDevice *pci
 
 e1000_mmio_setup(d);
 
+#ifdef MITIGATION
+d->int_bh = qemu_bh_new(e1000_int_bh, d);
+#endif /* MITIGATION */
 pci_register_bar(&d->dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &d->mmio);
 
 pci_register_bar(&d->dev, 1, PCI_BASE_ADDRESS_SPACE_IO, &d->io);



Re: [Qemu-devel] interrupt mitigation for e1000

2012-07-25 Thread Luigi Rizzo
On Wed, Jul 25, 2012 at 11:53:29AM +0300, Avi Kivity wrote:
> On 07/24/2012 07:58 PM, Luigi Rizzo wrote:
> > I noticed that the various NIC modules in qemu/kvm do not implement
> > interrupt mitigation, which is very beneficial as it dramatically
> > reduces exits from the hypervisor.
> > 
> > As a proof of concept i tried to implement it for the e1000 driver
> > (patch below), and it brings tx performance from 9 to 56Kpps on
> > qemu-softmmu, and from ~20 to 140Kpps on qemu-kvm.
> > 
> > I am going to measure the rx interrupt mitigation in the next couple
> > of days.
> > 
> > Is there any interest in having this code in ?
> 
> Indeed.  But please drop the #ifdef MITIGATIONs.

Thanks for the comments. The #ifdef MITIGATION blocks were only temporary,
to point out the differences and run the performance comparisons.
Similarly, the magic thresholds below will be replaced with
appropriately commented #defines.

Note:
On the real hardware, interrupt mitigation is controlled by a total of four
registers (TIDV, TADV, RIDV, RADV), which control it with a granularity
of 1024 ns; see

http://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf

An exact emulation of the feature is hard, because the timer resolution we
have is much coarser (in the ms range). So I am inclined to use a different
approach, similar to the one I have implemented, namely:
- the first few packets (whether 1 or 4 or 5 will be decided on the host)
  report an interrupt immediately;
- subsequent interrupts are delayed through qemu_bh_schedule_idle()
  (which is unpredictable but efficient; I tried qemu_bh_schedule()
  but it completely defeats mitigation)
- when the TX or RX rings are close to getting full, then again
  an interrupt is delivered immediately.

This approach also has the advantage of not requiring specific support
in the OS drivers.

cheers
luigi

> > +
> > +#ifdef MITIGATION
> > +QEMUBH *int_bh;// interrupt mitigation handler
> > +int tx_ics_count;  // pending tx int requests
> > +int rx_ics_count;  // pending rx int requests
> > +int int_cause; // int cause
> 
> Use uint32_t for int_cause, also a correctly sized type for the packet
> counts.
> 
> >  
> > +#ifdef MITIGATION
> > +/* we transmit the first few packets, or we do if we are
> > + * approaching a full ring. in the latter case, also
> > + * send an ics.
> > + * 
> > + */
> > +{
> > +int len, pending;
> > +len = s->mac_reg[TDLEN] / sizeof(desc) ;
> > +pending = s->mac_reg[TDT] - s->mac_reg[TDH];
> > +if (pending < 0)
> > +   pending += len;
> > +/* ignore requests after the first few ones, as long as
> > + * we are not approaching a full ring.
> > + * Otherwise, deliver packets to the backend.
> > + */
> > +if (s->tx_ics_count > 4 && s->tx_ics_count + pending < len - 5)
> > +   return;
> 
> Where do the 4 and 5 come from?  Shouldn't they be controlled by the
> guest using a device register?
> 
> >  }
> > +#ifdef MITIGATION
> > +s->int_cause |= cause; // remember the interrupt cause.
> > +s->tx_ics_count += pending;
> > +if (s->tx_ics_count >= len - 5) {
> > +// if the ring is about to become full, generate an interrupt
> 
> Another magic number.
> 
> > +   set_ics(s, 0, s->int_cause);
> > +   s->tx_ics_count = 0;
> > +   s->int_cause = 0;
> > +} else {   // otherwise just schedule it for later.
> > +qemu_bh_schedule_idle(s->int_bh);
> > +}
> > +}
> > +#else /* !MITIGATION */
> >  set_ics(s, 0, cause);
> > +#endif
> >  }
> >  
> 
> -- 
> error compiling committee.c: too many arguments to function
> 
> 



Re: [Qemu-devel] interrupt mitigation for e1000

2012-07-25 Thread Luigi Rizzo
On Wed, Jul 25, 2012 at 12:12:55PM +0200, Paolo Bonzini wrote:
> Il 25/07/2012 11:56, Luigi Rizzo ha scritto:
> > On Wed, Jul 25, 2012 at 11:53:29AM +0300, Avi Kivity wrote:
> >> On 07/24/2012 07:58 PM, Luigi Rizzo wrote:
> >>> I noticed that the various NIC modules in qemu/kvm do not implement
> >>> interrupt mitigation, which is very beneficial as it dramatically
> >>> reduces exits from the hypervisor.
> >>>
> >>> As a proof of concept i tried to implement it for the e1000 driver
> >>> (patch below), and it brings tx performance from 9 to 56Kpps on
> >>> qemu-softmmu, and from ~20 to 140Kpps on qemu-kvm.
> >>>
> >>> I am going to measure the rx interrupt mitigation in the next couple
> >>> of days.
> >>>
> >>> Is there any interest in having this code in ?
> >>
> >> Indeed.  But please drop the #ifdef MITIGATIONs.
> > 
> > Thanks for the comments. The #ifdef block MITIGATION was only temporary to
> > point out the differences and run the performance comparisons.
> > Similarly, the magic thresholds below will be replaced with
> > appropriately commented #defines.
> > 
> > Note:
> > On the real hardware interrupt mitigation is controlled by a total of four
> > registers (TIDV, TADV, RIDV, RADV) which control it with a granularity
> > of 1024ns , see
> > 
> > http://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf
> > 
> > An exact emulation of the feature is hard, because the timer resolution we
> > have is much coarser (in the ms range). So i am inclined to use a different
> > approach, similar to the one i have implemented, namely:
> > - the first few packets (whether 1 or 4 or 5 will be decided on the host)
> >   report an interrupt immediately;
> > - subsequent interrupts are delayed through qemu_bh_schedule_idle()
> 
> qemu_bh_schedule_idle() is really a 10ms timer.

Yes, I figured that out; this is why I said that my code was more
a "proof of concept" than an actual patch.

If you have a suggestion on how to schedule a shorter (say 1 ms) timer I
am all ears. Perhaps qemu_new_timer_ns() and friends ?
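
Something like the following, perhaps (a sketch using the same timer API
as the existing autoneg_timer; the mit_timer field and the 1 ms value
are made up):

static void e1000_mit_timer_cb(void *opaque)
{
    E1000State *s = opaque;

    start_xmit(s);                  /* drain whatever was deferred */
    set_ics(s, 0, s->int_cause);    /* deliver the accumulated cause */
    s->int_cause = 0;
    s->tx_ics_count = s->rx_ics_count = 0;
}

/* in pci_e1000_init(): */
s->mit_timer = qemu_new_timer_ns(vm_clock, e1000_mit_timer_cb, s);

/* instead of qemu_bh_schedule_idle(s->int_bh): */
if (!qemu_timer_pending(s->mit_timer)) {
    qemu_mod_timer(s->mit_timer,
                   qemu_get_clock_ns(vm_clock) + 1000000 /* 1 ms */);
}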
 
This said, I do not plan to implement the full set of mitigation registers
controlled by the guest; just possibly use a parameter as in virtio-net,
where you can have
'tx=bh' or 'tx=timer' and 'x-txtimer=N', with N the mitigation delay
in nanoseconds (nominally; in practice it is rounded to whatever the host
granularity is).

cheers
luigi