Peter, thanks for taking a look and providing feedback!

On Wed, Aug 23, 2023 at 7:35 PM Peter Xu <pet...@redhat.com> wrote:
>
> On Wed, Aug 23, 2023 at 02:29:02AM -0700, Mattias Nissler wrote:
> > When DMA memory can't be directly accessed, as is the case when
> > running the device model in a separate process without shareable DMA
> > file descriptors, bounce buffering is used.
> >
> > It is not uncommon for device models to request mapping of several DMA
> > regions at the same time. Examples include:
> >  * net devices, e.g. when transmitting a packet that is split across
> >    several TX descriptors (observed with igb)
> >  * USB host controllers, when handling a packet with multiple data TRBs
> >    (observed with xhci)
> >
> > Previously, qemu only provided a single bounce buffer and would fail DMA
> > map requests while the buffer was already in use. In turn, this would
> > cause DMA failures that ultimately manifest as hardware errors from the
> > guest perspective.
> >
> > This change allocates DMA bounce buffers dynamically instead of
> > supporting only a single buffer. Thus, multiple DMA mappings now work
> > correctly even when RAM can't be mmap()-ed.
> >
> > The total bounce buffer allocation size is limited by a new command line
> > parameter. The default is 4096 bytes to match the previous maximum
> > buffer size. It is expected that suitable limits will vary quite a bit
> > in practice depending on device models and workloads.
> >
> > Signed-off-by: Mattias Nissler <mniss...@rivosinc.com>
> > ---
> >  include/sysemu/sysemu.h |  2 +
> >  qemu-options.hx         | 27 +++++++++++++
> >  softmmu/globals.c       |  1 +
> >  softmmu/physmem.c       | 84 +++++++++++++++++++++++------------------
> >  softmmu/vl.c            |  6 +++
> >  5 files changed, 83 insertions(+), 37 deletions(-)
> >
> > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > index 25be2a692e..c5dc93cb53 100644
> > --- a/include/sysemu/sysemu.h
> > +++ b/include/sysemu/sysemu.h
> > @@ -61,6 +61,8 @@ extern int nb_option_roms;
> >  extern const char *prom_envs[MAX_PROM_ENVS];
> >  extern unsigned int nb_prom_envs;
> >
> > +extern uint64_t max_bounce_buffer_size;
> > +
> >  /* serial ports */
> >
> >  /* Return the Chardev for serial port i, or NULL if none */
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 29b98c3d4c..6071794237 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -4959,6 +4959,33 @@ SRST
> >  ERST
> >  #endif
> >
> > +DEF("max-bounce-buffer-size", HAS_ARG,
> > +    QEMU_OPTION_max_bounce_buffer_size,
> > +    "-max-bounce-buffer-size size\n"
> > +    "                DMA bounce buffer size limit in bytes 
> > (default=4096)\n",
> > +    QEMU_ARCH_ALL)
> > +SRST
> > +``-max-bounce-buffer-size size``
> > +    Set the limit in bytes for DMA bounce buffer allocations.
> > +
> > +    DMA bounce buffers are used when device models request memory-mapped
> > +    access to memory regions that can't be directly mapped by the qemu
> > +    process, so the memory must be read or written via a temporary local
> > +    buffer for the device model to work with. This is the case e.g. for I/O
> > +    memory regions, and when running in multi-process mode without shared
> > +    access to memory.
> > +
> > +    Whether bounce buffering is necessary depends heavily on the device model
> > +    implementation. Some devices use explicit DMA read and write operations
> > +    which do not require bounce buffers. Some devices, notably storage, will
> > +    retry a failed DMA map request after bounce buffer space becomes available
> > +    again. Most other devices will bail when encountering map request failures,
> > +    which will typically appear to the guest as a hardware error.
> > +
> > +    Suitable bounce buffer size values depend on the workload and guest
> > +    configuration. A few kilobytes up to a few megabytes are common sizes
> > +    encountered in practice.
>
> Does it mean that the default 4K size can still easily fail with some
> device setup?

Yes. The thing is that the affected device setup is pretty exotic; the
only setup I'm aware of is multi-process with direct RAM access via
shared file descriptors from the device process disabled (which hurts
performance, so few people will run this setup). In theory, DMA to an
I/O region of some sort would hit the same issue even in single-process
mode, but I'm not aware of such a situation in practice. For historical
perspective, note that the single-bounce-buffer restriction has been
present for a decade, so the issue has existed for years (ever since
multi-process became a thing) without hurting anyone enough to get
fixed. Don't get me wrong - I don't want to downplay anything and would
very much like to see this fixed, but I want to be honest and put
things into the right perspective.
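
To make the failure mode concrete, here's a rough sketch (purely
illustrative, not from the patch) of how a device model trips over the
current single-buffer limit when mapping two indirect regions
concurrently:

    hwaddr len1 = 4096, len2 = 4096;
    /* The first map of an indirect (e.g. MMIO) region claims the
     * single bounce buffer. */
    void *p1 = address_space_map(as, addr1, &len1, false,
                                 MEMTXATTRS_UNSPECIFIED);
    /* A second concurrent map finds bounce.in_use already set and
     * fails: p2 == NULL and len2 == 0, which the device model
     * typically surfaces to the guest as a hardware error. */
    void *p2 = address_space_map(as, addr2, &len2, false,
                                 MEMTXATTRS_UNSPECIFIED);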

>
> IIUC the whole point of the limit here is to make sure the allocation is
> still bounded, while 4K itself is not a hard limit. Making it bigger
> would, IMHO, be nice if it makes the known configs work that used to be
> broken.

I'd be in favor of bumping the default. Networking should be fine with
64KB, but I've observed a default Linux + xhci + usb_storage setup use
up to 1MB of DMA buffers, so we'd probably need to raise it
considerably. Would 4MB still be acceptable? That wouldn't allow a
single nefarious VM to stage a memory denial-of-service attack, but
what if you're running many VMs?
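
(Back-of-envelope for the multi-VM concern: with a hypothetical 4MB cap,
a host running 256 VMs could in the worst case see 256 * 4MB = 1GB of
its memory tied up in bounce buffers.)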

>
> > +ERST
> > +
> >  DEFHEADING()
> >
> >  DEFHEADING(Generic object creation:)
> > diff --git a/softmmu/globals.c b/softmmu/globals.c
> > index e83b5428d1..d3cc010717 100644
> > --- a/softmmu/globals.c
> > +++ b/softmmu/globals.c
> > @@ -54,6 +54,7 @@ const char *prom_envs[MAX_PROM_ENVS];
> >  uint8_t *boot_splash_filedata;
> >  int only_migratable; /* turn it off unless user states otherwise */
> >  int icount_align_option;
> > +uint64_t max_bounce_buffer_size = 4096;
> >
> >  /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the
> >   * little-endian "wire format" described in the SMBIOS 2.6 specification.
> > diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> > index 3df73542e1..9f0fec0c8e 100644
> > --- a/softmmu/physmem.c
> > +++ b/softmmu/physmem.c
> > @@ -50,6 +50,7 @@
> >  #include "sysemu/dma.h"
> >  #include "sysemu/hostmem.h"
> >  #include "sysemu/hw_accel.h"
> > +#include "sysemu/sysemu.h"
> >  #include "sysemu/xen-mapcache.h"
> >  #include "trace/trace-root.h"
> >
> > @@ -2904,13 +2905,12 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
> >
> >  typedef struct {
> >      MemoryRegion *mr;
> > -    void *buffer;
> >      hwaddr addr;
> > -    hwaddr len;
> > -    bool in_use;
> > +    size_t len;
> > +    uint8_t buffer[];
> >  } BounceBuffer;
> >
> > -static BounceBuffer bounce;
> > +static size_t bounce_buffer_size;
> >
> >  typedef struct MapClient {
> >      QEMUBH *bh;
> > @@ -2945,9 +2945,9 @@ void cpu_register_map_client(QEMUBH *bh)
> >      qemu_mutex_lock(&map_client_list_lock);
> >      client->bh = bh;
> >      QLIST_INSERT_HEAD(&map_client_list, client, link);
> > -    /* Write map_client_list before reading in_use.  */
> > +    /* Write map_client_list before reading bounce_buffer_size.  */
> >      smp_mb();
> > -    if (!qatomic_read(&bounce.in_use)) {
> > +    if (qatomic_read(&bounce_buffer_size) < max_bounce_buffer_size) {
> >          cpu_notify_map_clients_locked();
> >      }
> >      qemu_mutex_unlock(&map_client_list_lock);
> > @@ -3076,31 +3076,35 @@ void *address_space_map(AddressSpace *as,
> >      RCU_READ_LOCK_GUARD();
> >      fv = address_space_to_flatview(as);
> >      mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> > +    memory_region_ref(mr);
> >
> >      if (!memory_access_is_direct(mr, is_write)) {
> > -        if (qatomic_xchg(&bounce.in_use, true)) {
> > +        size_t size = qatomic_add_fetch(&bounce_buffer_size, l);
> > +        if (size > max_bounce_buffer_size) {
> > +            size_t excess = size - max_bounce_buffer_size;
> > +            l -= excess;
> > +            qatomic_sub(&bounce_buffer_size, excess);
> > +        }
> > +
> > +        if (l == 0) {
> >              *plen = 0;
> >              return NULL;
> >          }
> > -        /* Avoid unbounded allocations */
> > -        l = MIN(l, TARGET_PAGE_SIZE);
> > -        bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
> > -        bounce.addr = addr;
> > -        bounce.len = l;
> >
> > -        memory_region_ref(mr);
> > -        bounce.mr = mr;
> > +        BounceBuffer *bounce = g_malloc(l + sizeof(BounceBuffer));
>
> Maybe g_malloc0() would be better?

Good point, will change.
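
Concretely, for the next revision the allocation would just become:

    BounceBuffer *bounce = g_malloc0(l + sizeof(BounceBuffer));

so the header and buffer start out zeroed rather than containing stale
heap data.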

>
> I just checked that we have had target-page-aligned allocations since day
> one (commit 6d16c2f88f2a).  I didn't find any clue showing why it was done
> like that, but I do worry that some existing caller may be implicitly
> relying on the address being target page aligned.  Maybe not a major
> issue, though; I didn't see anything rely on that yet.

I did go through the same exercise when noticing the page alignment
and arrived at the same conclusion as you. That makes it two people
thinking it's OK, so I feel like we should take the risk here, in
particular given that we know this code path is already broken as is.
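
For reference, a caller silently depending on the old
qemu_memalign(TARGET_PAGE_SIZE, ...) behavior would have to be doing
something like this (hypothetical - neither of us found such a caller),
which the embedded flexible-array buffer no longer guarantees:

    hwaddr len = 4096;
    void *buf = address_space_map(as, addr, &len, false,
                                  MEMTXATTRS_UNSPECIFIED);
    /* Always held for the old bounce buffer, may now fail: */
    assert(QEMU_PTR_IS_ALIGNED(buf, TARGET_PAGE_SIZE));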

>
> > +        bounce->mr = mr;
> > +        bounce->addr = addr;
> > +        bounce->len = l;
> > +
> >          if (!is_write) {
> >              flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
> > -                               bounce.buffer, l);
> > +                          bounce->buffer, l);
> >          }
> >
> >          *plen = l;
> > -        return bounce.buffer;
> > +        return bounce->buffer;
> >      }
> >
> > -
> > -    memory_region_ref(mr);
> >      *plen = flatview_extend_translation(fv, addr, len, mr, xlat,
> >                                          l, is_write, attrs);
> >      fuzz_dma_read_cb(addr, *plen, mr);
> > @@ -3114,31 +3118,37 @@ void *address_space_map(AddressSpace *as,
> >  void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
> >                           bool is_write, hwaddr access_len)
> >  {
> > -    if (buffer != bounce.buffer) {
> > -        MemoryRegion *mr;
> > -        ram_addr_t addr1;
> > +    MemoryRegion *mr;
> > +    ram_addr_t addr1;
> > +
> > +    mr = memory_region_from_host(buffer, &addr1);
> > +    if (mr == NULL) {
> > +        /*
> > +         * Must be a bounce buffer (unless the caller passed a pointer which
> > +         * wasn't returned by address_space_map, which is illegal).
>
> Is it possible to still have some kind of sanity check to make sure it's a
> bounce buffer passed in, just in case of a caller bug?  Or, the failure can
> be weird..

I was contemplating putting a magic number as the first struct member
as a best-effort way to detect invalid pointers and corruption, but
wasn't sure it was warranted. Since you ask, I'll make that change.
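
Roughly what I have in mind (field name and magic value are
placeholders):

    #define BOUNCE_BUFFER_MAGIC 0xb0b0b0b0

    typedef struct {
        uint32_t magic;   /* set on allocation, checked on unmap */
        MemoryRegion *mr;
        hwaddr addr;
        size_t len;
        uint8_t buffer[];
    } BounceBuffer;

with address_space_unmap() verifying the pointer before touching the
fields:

    BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
    assert(bounce->magic == BOUNCE_BUFFER_MAGIC);

Clearing the magic just before g_free() would additionally catch
double-unmap bugs.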

>
> > +         */
> > +        BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
> >
> > -        mr = memory_region_from_host(buffer, &addr1);
> > -        assert(mr != NULL);
> >          if (is_write) {
> > -            invalidate_and_set_dirty(mr, addr1, access_len);
> > -        }
> > -        if (xen_enabled()) {
> > -            xen_invalidate_map_cache_entry(buffer);
> > +            address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED,
> > +                                bounce->buffer, access_len);
> >          }
> > -        memory_region_unref(mr);
> > +
> > +        memory_region_unref(bounce->mr);
> > +        qatomic_sub(&bounce_buffer_size, bounce->len);
> > +        /* Write bounce_buffer_size before reading map_client_list. */
> > +        smp_mb();
> > +        cpu_notify_map_clients();
> > +        g_free(bounce);
> >          return;
> >      }
> > +
> > +    if (xen_enabled()) {
> > +        xen_invalidate_map_cache_entry(buffer);
> > +    }
> >      if (is_write) {
> > -        address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED,
> > -                            bounce.buffer, access_len);
> > -    }
> > -    qemu_vfree(bounce.buffer);
> > -    bounce.buffer = NULL;
> > -    memory_region_unref(bounce.mr);
> > -    /* Clear in_use before reading map_client_list.  */
> > -    qatomic_set_mb(&bounce.in_use, false);
> > -    cpu_notify_map_clients();
> > +        invalidate_and_set_dirty(mr, addr1, access_len);
> > +    }
> >  }
> >
> >  void *cpu_physical_memory_map(hwaddr addr,
> > diff --git a/softmmu/vl.c b/softmmu/vl.c
> > index b0b96f67fa..dbe52f5ea1 100644
> > --- a/softmmu/vl.c
> > +++ b/softmmu/vl.c
> > @@ -3469,6 +3469,12 @@ void qemu_init(int argc, char **argv)
> >                  exit(1);
> >  #endif
> >                  break;
> > +            case QEMU_OPTION_max_bounce_buffer_size:
> > +                if (qemu_strtosz(optarg, NULL, &max_bounce_buffer_size) < 0) {
> > +                    error_report("invalid -max-bounce-buffer-size value");
> > +                    exit(1);
> > +                }
> > +                break;
>
> PS: I had a vague memory that we don't recommend adding more qemu cmdline
> options, but I don't know enough about the plan to say anything definite.

I am aware of that, and I'm really not happy with the command line
option myself. Consider the command line flag a straw man I put in to
see whether any reviewers have better ideas :)

More seriously, I actually did look around to see whether I could add
the parameter to one of the existing option groupings somewhere, but
neither do I have a suitable QOM object to attach the parameter to, nor
did I find a global option group that fits: this is not really memory
configuration, and it's not really CPU configuration; it's more related
to shared device model infrastructure... If you have a good idea for a
home for this, I'm all ears.

>
> >              case QEMU_OPTION_object:
> >                  object_option_parse(optarg);
> >                  break;
> > --
> > 2.34.1
> >
>
> Thanks,
>
> --
> Peter Xu
>
