Re: [Qemu-devel] [PATCH v3 1/6] block/nvme: don't touch the completion entries

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 13:03 +0200, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
> > Completion entries are meant to be read only by the host and written only
> > by the device.
> > The driver is supposed to scan the completions from the last point where it
> > left off, until it sees a completion with a non-flipped phase bit.
> 
> (Disclaimer: This is the first time I read the nvme driver, or really
> something in the nvme spec.)
> 
> Well, no, completion entries are also meant to be initialized by the
> host.  To me it looks like this is the place where that happens:
> Everything that has been processed by the device is immediately being
> re-initialized.
> 
> Maybe we shouldn’t do that here but in nvme_submit_command().  But
> currently we don’t, and I don’t see any other place where we currently
> initialize the CQ entries.

Hi!
I couldn't find any place in the spec that says that completion entries should 
be initialized.
It is probably wise to initialize that area to 0 on driver initialization, but 
nothing beyond that.
In particular, that is what the kernel nvme driver does: other than allocating 
zeroed memory (and even that I am not sure it does), it doesn't write to the 
completion entries.
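
For what it's worth, the convention described here (scan from where you left off until the phase bit stops matching, and never write the entries back) can be sketched roughly like this. The types, names, and 4-entry queue are made up for illustration; this uses the kernel-style polarity, where the expected phase starts at 1 and flips on every wrap-around:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified completion-queue entry and host-side queue state. */
typedef struct { uint16_t cid; uint16_t status; } Cqe;

enum { CQ_SIZE = 4 };

typedef struct {
    Cqe entries[CQ_SIZE];
    unsigned head;   /* next entry the host will inspect */
    unsigned phase;  /* phase bit value that marks a fresh entry */
} CompletionQueue;

/* Scan from where we left off until the phase bit stops matching.
 * Note the host never writes the entries back: it only advances
 * `head` and flips its own expected phase when the queue wraps. */
static int poll_completions(CompletionQueue *q)
{
    int progress = 0;

    while ((q->entries[q->head].status & 0x1) == q->phase) {
        q->head = (q->head + 1) % CQ_SIZE;
        if (q->head == 0) {
            q->phase ^= 1;  /* second-pass entries carry the flipped bit */
        }
        progress++;
    }
    return progress;
}
```

Since the queue memory starts zeroed and the device writes phase 1 on its first pass, the host needs no per-entry re-initialization at all.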

Thanks for the very thorough review, by the way. I will go over all the 
patches now and fix things.

Best regards,
Maxim Levitsky

> 
> Max
> 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c | 5 +----
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 73ed5fa75f..6d4e7f3d83 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -315,7 +315,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
> > NVMeQueuePair *q)
> >  while (q->inflight) {
> >  int16_t cid;
> >  c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
> > -if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
> > +if ((le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
> >  break;
> >  }
> >  q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
> > @@ -339,10 +339,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
> > NVMeQueuePair *q)
> >  qemu_mutex_unlock(&q->lock);
> >  req.cb(req.opaque, nvme_translate_error(c));
> >  qemu_mutex_lock(&q->lock);
> > -c->cid = cpu_to_le16(0);
> >  q->inflight--;
> > -/* Flip Phase Tag bit. */
> > -c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
> >  progress = true;
> >  }
> >  if (progress) {
> > 
> 
> 





Re: [Qemu-devel] [PATCH v3 2/6] block/nvme: fix doorbell stride

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 13:10 +0200, Max Reitz wrote:
> On 05.07.19 13:09, Max Reitz wrote:
> > On 03.07.19 17:59, Maxim Levitsky wrote:
> > > Fix the math involving a non-standard doorbell stride
> > > 
> > > Signed-off-by: Maxim Levitsky 
> > > ---
> > >  block/nvme.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/block/nvme.c b/block/nvme.c
> > > index 6d4e7f3d83..52798081b2 100644
> > > --- a/block/nvme.c
> > > +++ b/block/nvme.c
> > > @@ -217,7 +217,7 @@ static NVMeQueuePair 
> > > *nvme_create_queue_pair(BlockDriverState *bs,
> > >  error_propagate(errp, local_err);
> > >  goto fail;
> > >  }
> > > -q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1];
> > > +q->cq.doorbell = &s->regs->doorbells[(idx * 2 + 1) * s->doorbell_scale];
> > >  
> > >  return q;
> > >  fail:
> > 
> > Hm.  How has this ever worked?
> 
> (Ah, because CAP.DSTRD has probably been 0 in most devices.)
> 
Exactly, and I used a cache-line stride in my nvme-mdev, which broke this, and 
I spent an evening figuring out what was going on. I was sure there was some 
memory ordering bug or something even weirder before (as usual) finding that 
this was a very simple bug.
I tested nvme-mdev pretty much with everything I could get my hands on, 
including this driver.
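
The stride math at issue can be illustrated with a small sketch (helper names are made up). With CAP.DSTRD = d, each doorbell register occupies (4 << d) bytes, so in an array of 32-bit dwords the stride is scale = 1 << d; queue pair `idx` then owns the SQ tail doorbell at dword (2*idx) * scale and the CQ head doorbell at dword (2*idx + 1) * scale:

```c
#include <assert.h>
#include <stdint.h>

/* Dword index of the submission-queue tail doorbell for queue `idx`. */
static unsigned sq_doorbell_index(unsigned idx, unsigned dstrd)
{
    return (idx * 2) * (1u << dstrd);
}

/* Dword index of the completion-queue head doorbell for queue `idx`. */
static unsigned cq_doorbell_index(unsigned idx, unsigned dstrd)
{
    /* The pre-fix expression was idx * 2 * scale + 1, which happens
     * to be correct only when scale == 1, i.e. DSTRD == 0. */
    return (idx * 2 + 1) * (1u << dstrd);
}
```

With a 64-byte (cache-line) stride, DSTRD is 4, so queue 1's CQ doorbell sits at dword 48, while the old expression computed dword 33 and rang the wrong register.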


Best regards,
Maxim Levitsky





Re: [Qemu-devel] [PATCH v3 3/6] block/nvme: support larger than 512 bytes sector devices

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 13:58 +0200, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
> > Currently the driver hardcodes the sector size to 512,
> > and doesn't check the underlying device. Fix that.
> > 
> > Also fail if underlying nvme device is formatted with metadata
> > as this needs special support.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c | 45 -
> >  1 file changed, 40 insertions(+), 5 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 52798081b2..1f0d09349f 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> 
> [...]
> 
> > @@ -463,7 +467,22 @@ static void nvme_identify(BlockDriverState *bs, int 
> > namespace, Error **errp)
> >  }
> >  
> >  s->nsze = le64_to_cpu(idns->nsze);
> > +lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
> > +
> > +if (lbaf->ms) {
> > +error_setg(errp, "Namespaces with metadata are not yet supported");
> > +goto out;
> > +}
> > +
> > +hwsect_size = 1 << lbaf->ds;
> > +
> > +if (hwsect_size < BDRV_SECTOR_BITS || hwsect_size > s->page_size) {
> 
> s/BDRV_SECTOR_BITS/BDRV_SECTOR_SIZE/
Oops.

> 
> > +error_setg(errp, "Namespace has unsupported block size (%d)",
> > +hwsect_size);
> > +goto out;
> > +}
> >  
> > +s->blkshift = lbaf->ds;
> >  out:
> >  qemu_vfio_dma_unmap(s->vfio, resp);
> >  qemu_vfree(resp);
> > @@ -782,8 +801,22 @@ fail:
> >  static int64_t nvme_getlength(BlockDriverState *bs)
> >  {
> >  BDRVNVMeState *s = bs->opaque;
> > +return s->nsze << s->blkshift;
> > +}
> >  
> > -return s->nsze << BDRV_SECTOR_BITS;
> > +static int64_t nvme_get_blocksize(BlockDriverState *bs)
> > +{
> > +BDRVNVMeState *s = bs->opaque;
> > +assert(s->blkshift >= 9);
> 
> I think BDRV_SECTOR_BITS is more correct here (this is about what the
> general block layer code expects).  Also, there’s no pain in doing so,
> as you did check against BDRV_SECTOR_SIZE in nvme_identify().
> Max
Of course, thanks!!

> 
> > +return 1 << s->blkshift;
> > +}
> > +
> > +static int nvme_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
> > +{
> > +int64_t blocksize = nvme_get_blocksize(bs);
> > +bsz->phys = blocksize;
> > +bsz->log = blocksize;
> > +return 0;
> >  }
> >  
> >  /* Called with s->dma_map_lock */
> 
> 

Thanks for the review,
Best regards,
Maxim Levitsky
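
As an aside, the sector-size handling reviewed in this patch can be sketched as follows (hypothetical helper names; BDRV_SECTOR_SIZE is 512, as in the QEMU block layer, and `ds` is the LBA-format data-size exponent from Identify Namespace):

```c
#include <assert.h>
#include <stdint.h>

#define BDRV_SECTOR_BITS 9
#define BDRV_SECTOR_SIZE (1u << BDRV_SECTOR_BITS)   /* 512 bytes */

/* Accept only block sizes between the block layer's 512-byte sector
 * and the driver's page size, mirroring the check in nvme_identify(). */
static int sector_size_ok(unsigned ds, unsigned page_size)
{
    uint64_t hwsect_size = 1ull << ds;

    return hwsect_size >= BDRV_SECTOR_SIZE && hwsect_size <= page_size;
}

/* Device length in bytes: nsze blocks of (1 << blkshift) bytes each,
 * as in the fixed nvme_getlength(). */
static uint64_t device_length_bytes(uint64_t nsze, unsigned blkshift)
{
    return nsze << blkshift;
}
```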




Re: [Qemu-devel] [PATCH v3 4/6] block/nvme: add support for image creation

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 14:09 +0200, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
> > Tested on an NVMe device like this:
> > 
> > # create preallocated qcow2 image
> > $ qemu-img create -f qcow2 nvme://:06:00.0/1 10G -o 
> > preallocation=metadata
> > Formatting 'nvme://:06:00.0/1', fmt=qcow2 size=10737418240 
> > cluster_size=65536 preallocation=metadata lazy_refcounts=off 
> > refcount_bits=16
> > 
> > # create an empty qcow2 image
> > $ qemu-img create -f qcow2 nvme://:06:00.0/1 10G -o preallocation=off
> > Formatting 'nvme://:06:00.0/1', fmt=qcow2 size=10737418240 
> > cluster_size=65536 preallocation=off lazy_refcounts=off refcount_bits=16
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c | 108 +++
> >  1 file changed, 108 insertions(+)
> 
> Hm.  I’m not quite sure I like this, because this is not image creation.

I fully agree with you, and the whole thing did feel kind of wrong.
I think that bdrv_co_create_opts is outdated for this purpose, especially
with the nvme driver. It would be better if bdrv_file_open just supported
something like 'O_CREAT'.

I did this mostly the same way as file-posix does it on block devices,
including that 'hack' of zeroing the first sector, for which I really don't
know if it is the right solution.



> 
> What we need is a general interface for formatting existing files.  I
> mean, we have that in QMP (blockdev-create), but the problem is that
> this doesn’t really translate to qemu-img create.
> 
> I wonder whether it’s best to hack something up that makes
> bdrv_create_file() a no-op, or whether we should expose blockdev-create
> over qemu-img.  I’ll see how difficult the latter is, it sounds fun
> (famous last words).
For existing images, 'bdrv_create_file' is already kind of a no-op, other than 
zeroing the first sector, which kind of makes sense, but is probably best done 
at a higher level than in each driver.

So these are my thoughts about this, thanks for the review!

Best regards,
Maxim Levitsky




Re: [Qemu-devel] [PATCH v3 5/6] block/nvme: add support for write zeros

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 15:33 +0200, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c | 69 +++-
> >  block/trace-events   |  1 +
> >  include/block/nvme.h | 19 +++-
> >  3 files changed, 87 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 152d27b07f..02e0846643 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> 
> [...]
> 
> > @@ -469,6 +473,11 @@ static void nvme_identify(BlockDriverState *bs, int 
> > namespace, Error **errp)
> >  s->nsze = le64_to_cpu(idns->nsze);
> >  lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
> >  
> > +if (NVME_ID_NS_DLFEAT_WRITE_ZEROS(idns->dlfeat) &&
> > +NVME_ID_NS_DLFEAT_READ_BEHAVIOR(idns->dlfeat) ==
> > +NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ZEROS)
> > +bs->supported_write_flags |= BDRV_REQ_MAY_UNMAP;
> > +
> 
> This violates the coding style, there should be curly brackets here.
100% agree, and I need to see if we can update checkpatch.pl to catch this.


> 
> >  if (lbaf->ms) {
> >  error_setg(errp, "Namespaces with metadata are not yet supported");
> >  goto out;
> > @@ -763,6 +772,8 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
> > *options, int flags,
> >  int ret;
> >  BDRVNVMeState *s = bs->opaque;
> >  
> > +bs->supported_write_flags = BDRV_REQ_FUA;
> > +
> >  opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> >  qemu_opts_absorb_qdict(opts, options, &error_abort);
> >  device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE);
> > @@ -791,7 +802,6 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
> > *options, int flags,
> >  goto fail;
> >  }
> >  }
> > -bs->supported_write_flags = BDRV_REQ_FUA;
> 
> Any reason for this movement?

This is because nvme_identify checks whether the underlying namespace
supports 'discarded data reads back as zeros', in which case it sets
BDRV_REQ_MAY_UNMAP in bs->supported_write_flags, which later allows me to set
the 'deallocate' bit in the write zeroes command, hinting the controller
to discard the area.

This was moved to avoid overwriting that value. I could have instead just
ORed the value in, but I think this way is a bit cleaner.



> 
> >  return 0;
> >  fail:
> >  nvme_close(bs);
> > @@ -1085,6 +1095,60 @@ static coroutine_fn int 
> > nvme_co_flush(BlockDriverState *bs)
> >  }
> >  
> >  
> > +static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
> > +  int64_t offset,
> > +  int bytes,
> > +  BdrvRequestFlags flags)
> > +{
> > +BDRVNVMeState *s = bs->opaque;
> > +NVMeQueuePair *ioq = s->queues[1];
> > +NVMeRequest *req;
> > +
> > +if (!s->supports_write_zeros) {
> > +return -ENOTSUP;
> > +}
> > +
> > +uint32_t cdw12 = ((bytes >> s->blkshift) - 1) & 0x;
> 
> Another coding style violation: Variable declarations and other code may
> not be mixed.
Another bug in checkpatch.pl :-)

> 
> > +
> > +NvmeCmd cmd = {
> > +.opcode = NVME_CMD_WRITE_ZEROS,
> > +.nsid = cpu_to_le32(s->nsid),
> > +.cdw10 = cpu_to_le32((offset >> s->blkshift) & 0x),
> > +.cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0x),
> > +};
> > +
> > +NVMeCoData data = {
> > +.ctx = bdrv_get_aio_context(bs),
> > +.ret = -EINPROGRESS,
> > +};
> 
> [...]
> 
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 3ec8efcc43..65eb65c740 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -653,12 +653,29 @@ typedef struct NvmeIdNs {
> >  uint8_t mc;
> >  uint8_t dpc;
> >  uint8_t dps;
> > -uint8_t res30[98];
> > +
> > +uint8_t nmic;
> > +uint8_t rescap;
> > +uint8_t fpi;
> > +uint8_t dlfeat;
> > +
> > +uint8_t res30[94];
> > >  NvmeLBAF    lbaf[16];
> >  uint8_t res192[192];
> >  uint8_t vs[3712];
> >  } NvmeIdNs;
> >  
> > +
> > +/*Deallocate Logical Block Features*/
> > +#define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)   ((dlfeat) & 0x10)
> > +#define NVME_ID_NS_DLFEAT_WRITE_ZEROS(dlfeat) ((dlfeat) & 0x04)
> 
> Isn’t it bit 3, i.e. 0x08?
Oops, I hadn't noticed that the 'read behavior' field is 3 bits and not 2!
Thank you very much. I didn't catch this since the device I tested on doesn't 
support this anyway (dlfeat == 0).

> 
> Max
> 
> > +
> > +#define NVME_ID_NS_DLFEAT_READ_BEHAVIOR(dlfeat) ((dlfeat) & 0x3)
> > +#define NVME_ID_NS_DLFEAT_READ_BEHAVIOR_UNDEFINED   0
> > +#define NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ZEROS   1
> > +#define NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ONES2
> > +
> > +
> >  #define NVME_ID_NS_NSFEAT_THIN(nsfeat)  ((n

Re: [Qemu-devel] [PATCH v4] block/nvme: add support for discard

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 15:50 +0200, Max Reitz wrote:
> On 03.07.19 18:07, Maxim Levitsky wrote:
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c   | 81 ++
> >  block/trace-events |  2 ++
> >  2 files changed, 83 insertions(+)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 02e0846643..96a715dcc1 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> 
> [...]
> 
> > @@ -460,6 +461,7 @@ static void nvme_identify(BlockDriverState *bs, int 
> > namespace, Error **errp)
> >s->page_size / sizeof(uint64_t) * s->page_size);
> >  
> >  s->supports_write_zeros = (idctrl->oncs & NVME_ONCS_WRITE_ZEROS) != 0;
> > +s->supports_discard = (idctrl->oncs & NVME_ONCS_DSM) != 0;
> 
> Shouldn’t this be le16_to_cpu(idctrl->oncs)?  Same in the previous
> patch, now that I think about it.

This reminds me of how I basically scrubbed through nvme-mdev looking for 
endianness bugs, manually searching for every reference to a 
hardware-controlled structure.
Thank you very much!!
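
The byte-order point can be shown with a tiny sketch: `my_le16_to_cpu` mirrors what QEMU's le16_to_cpu does, and the ONCS bit positions (bit 2 DSM, bit 3 Write Zeroes) are from the NVMe spec:

```c
#include <assert.h>
#include <stdint.h>

/* Identify-controller fields such as ONCS are little-endian on the
 * wire, so feature bits must be tested after byte-swapping; reading
 * the raw bytes directly would break on big-endian hosts. */
static uint16_t my_le16_to_cpu(const uint8_t raw[2])
{
    return (uint16_t)(raw[0] | (raw[1] << 8));
}

#define ONCS_DSM          (1u << 2)   /* Dataset Management supported */
#define ONCS_WRITE_ZEROES (1u << 3)   /* Write Zeroes supported */
```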


> 
> >  
> >  memset(resp, 0, 4096);
> >  
> > @@ -1149,6 +1151,84 @@ static coroutine_fn int 
> > nvme_co_pwrite_zeroes(BlockDriverState *bs,
> >  }
> >  
> >  
> > +static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
> > + int64_t offset,
> > + int bytes)
> > +{
> > +BDRVNVMeState *s = bs->opaque;
> > +NVMeQueuePair *ioq = s->queues[1];
> > +NVMeRequest *req;
> > +NvmeDsmRange *buf;
> > +QEMUIOVector local_qiov;
> > +int r;
> > +
> > +NvmeCmd cmd = {
> > +.opcode = NVME_CMD_DSM,
> > +.nsid = cpu_to_le32(s->nsid),
> > +.cdw10 = 0, /*number of ranges - 0 based*/
> 
> I’d make this cpu_to_le32(0).  Sure, there is no effect for 0, but in
> theory this is a variable value, so...
Let it be.

> 
> > +.cdw11 = cpu_to_le32(1 << 2), /*deallocate bit*/
> > +};
> > +
> > +NVMeCoData data = {
> > +.ctx = bdrv_get_aio_context(bs),
> > +.ret = -EINPROGRESS,
> > +};
> > +
> > +if (!s->supports_discard) {
> > +return -ENOTSUP;
> > +}
> > +
> > +assert(s->nr_queues > 1);
> > +
> > +buf = qemu_try_blockalign0(bs, 4096);
> 
> I’m not sure whether this needs to be 4096 or whether 16 would suffice,
>  but I suppose this gets us the least trouble.
Exactly. Even better, now that I think about it, would be to use 
's->page_size', the device page size. It is at least 4K (the spec minimum).

Speaking of which, there is a theoretical bug there: the device can in theory 
indicate that its minimal page size is larger than 4K. The kernel currently 
rejects such devices, but here the driver just forces a 4K page size in the CC 
register.

> 
> > +if (!buf) {
> > +return -ENOMEM;
> 
> Indentation is off.
True!

> 
> > +}
> > +
> > +buf->nlb = cpu_to_le32(bytes >> s->blkshift);
> > +buf->slba = cpu_to_le64(offset >> s->blkshift);
> > +buf->cattr = 0;
> > +
> > +qemu_iovec_init(&local_qiov, 1);
> > +qemu_iovec_add(&local_qiov, buf, 4096);
> > +
> > +req = nvme_get_free_req(ioq);
> > +assert(req);
> > +
> > +qemu_co_mutex_lock(&s->dma_map_lock);
> > +r = nvme_cmd_map_qiov(bs, &cmd, req, &local_qiov);
> > +qemu_co_mutex_unlock(&s->dma_map_lock);
> > +
> > +if (r) {
> > +req->busy = false;
> > +return r;
> 
> Leaking buf and local_qiov here.
True, fixed.

> 
> > +}
> > +
> > +trace_nvme_dsm(s, offset, bytes);
> > +
> > +nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
> > +
> > +data.co = qemu_coroutine_self();
> > +while (data.ret == -EINPROGRESS) {
> > +qemu_coroutine_yield();
> > +}
> > +
> > +qemu_co_mutex_lock(&s->dma_map_lock);
> > +r = nvme_cmd_unmap_qiov(bs, &local_qiov);
> > +qemu_co_mutex_unlock(&s->dma_map_lock);
> > +if (r) {
> > +return r;
> 
> Leaking buf and local_qiov here, too.
True, fixed. Next time I will check the error paths better.

> 
> Max
> 
> > +}
> > +
> > +trace_nvme_dsm_done(s, offset, bytes, data.ret);
> > +
> > +qemu_iovec_destroy(&local_qiov);
> > +qemu_vfree(buf);
> > +return data.ret;
> > +
> > +}
> > +
> > +
> >  static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
> > BlockReopenQueue *queue, Error **errp)
> >  {
> 
> 

Thanks for the review,
Best regards,
Maxim Levitsky





[Qemu-devel] [PATCH v6 00/14] Build ACPI Heterogeneous Memory Attribute Table (HMAT)

2019-07-07 Thread Tao Xu
This series of patches builds the Heterogeneous Memory Attribute Table (HMAT)
according to the command line. The ACPI HMAT describes memory attributes,
such as memory side cache attributes and bandwidth and latency details,
related to the Memory Proximity Domain.
The software is expected to use the HMAT information as a hint for
optimization.

The V5 patches link:
https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg03138.html

Changelog:
v6:
- Rebase to upstream, move numa globals in arm/sbsa-ref
- When used once or twice in the function, use
  ms->numa_state->num_nodes(numa_info) directly (Igor)
- Correct some mistakes
- Use once monitor_printf in hmp_info_numa (Igor)
- Add new patch to extend CLI of "-numa node" option to indicate the
  initiator numa node-id (Dan)
- Update to ACPI 6.3 (Jonathan)
- Add QMP interface to update HMAT at runtime (Igor)
- Add test cases for ACPI HMAT

v5:
- split the 1-6/11 and 8/11 from patch v4 to build Memory Subsystem
Address Range Structure(s) and System Locality Latency and Bandwidth
Information Structure(s) firstly.
- add 1/8 of patch v5 to simplify arm_load_dtb() (Igor)
- drop the helper machine_num_numa_nodes() and use
machine->numa_state->num_nodes (and numa_state->nodes) directly (Igor)
- Add more descriptions from ACPI spec (Igor)
- Add the reason of using stub (Igor)
- Use GArray for NUMA memory ranges data (Igor)
- Separate hmat_build_lb() (Igor)
- Drop all global variables and use local variables instead (Igor)
- Add error message when base unit < 10
- Update the hmat-lb option example by using '-numa cpu'
and '-numa memdev' (Igor)

v4:
- send the patch of "move numa global variables into MachineState"
together with HMAT patches.
https://lists.gnu.org/archive/html/qemu-devel/2019-04/msg03662.html
- split the 1/8 of v3 patch into two patches, 4/11 introduces
build_mem_ranges() and 5/11 builds HMAT (Igor)
- use build_append_int_noprefix() to build parts of ACPI table in
all patches (Igor)
- Split 8/8 of patch v3 into two parts, 10/11 introduces NFIT
generalizations (build_acpi_aml_common), and use it in 11/11 to
simplify hmat_build_aml (Igor)
- use MachineState instead of PCMachineState to build HMAT more
generically (Igor)
- move the 7/8 v3 patch into the former patches
- update the version tag from 4.0 to 4.1
v3:
- rebase the fixing patch into the jingqi's patches (Eric)
- update the version tag from 3.10 to 4.0 (Eric)
v2:
  Per Igor and Eric's comments, fix some coding style and small issues:
- update the version number in qapi/misc.json
- including the expansion of the acronym HMAT in qapi/misc.json
- correct spell mistakes in qapi/misc.json and qemu-options.hx
- fix the comment style in hw/i386/acpi-build.c
and hw/acpi/hmat.h
   - remove some unnecessary head files in hw/acpi/hmat.c 
   - use hardcoded numbers from spec to generate
   Memory Subsystem Address Range Structure in hw/acpi/hmat.c
   - drop the struct AcpiHmat and AcpiHmatSpaRange
in hw/acpi/hmat.h
   - rewrite NFIT code to build _HMA method

Liu Jingqi (6):
  hmat acpi: Build Memory Proximity Domain Attributes Structure(s)
  hmat acpi: Build System Locality Latency and Bandwidth Information
Structure(s)
  hmat acpi: Build Memory Side Cache Information Structure(s)
  numa: Extend the CLI to provide memory latency and bandwidth
information
  numa: Extend the CLI to provide memory side cache information
  hmat acpi: Implement _HMA method to update HMAT at runtime

Tao Xu (8):
  hw/arm: simplify arm_load_dtb
  numa: move numa global variable nb_numa_nodes into MachineState
  numa: move numa global variable have_numa_distance into MachineState
  numa: move numa global variable numa_info into MachineState
  numa: Extend CLI to provide initiator information for numa nodes
  acpi: introduce aml_build_runtime_buf for NFIT generalizations
  QMP: Add QMP interface to update HMAT at runtime
  tests/bios-tables-test: add test cases for ACPI HMAT

 exec.c  |   5 +-
 hw/acpi/Kconfig |   5 +
 hw/acpi/Makefile.objs   |   1 +
 hw/acpi/acpi-stub.c |   7 +
 hw/acpi/aml-build.c |   9 +-
 hw/acpi/hmat.c  | 552 
 hw/acpi/hmat.h  | 172 +
 hw/acpi/nvdimm.c|  49 ++-
 hw/arm/aspeed.c |   5 +-
 hw/arm/boot.c   |  20 +-
 hw/arm/collie.c |   8 +-
 hw/arm/cubieboard.c |   5 +-
 hw/arm/exynos4_boards.c |   7 +-
 hw/arm/highbank.c   |   8 +-
 hw/arm/imx25_pdk.c  |   5 +-
 hw/arm/integratorcp.c   |   8 +-
 hw/arm/kzm.c|   5 +-
 hw/arm/mainstone.c  |   5 +-
 hw/arm/mcimx6ul-evk.c 

[Qemu-devel] [PATCH v6 01/14] hw/arm: simplify arm_load_dtb

2019-07-07 Thread Tao Xu
In struct arm_boot_info, kernel_filename, initrd_filename and
kernel_cmdline are copied from MachineState. This patch adds
MachineState as a parameter to arm_load_dtb() and moves the copying of
kernel_filename, initrd_filename and kernel_cmdline into
arm_load_kernel().

Reviewed-by: Igor Mammedov 
Reviewed-by: Liu Jingqi 
Suggested-by: Igor Mammedov 
Signed-off-by: Tao Xu 
---

Changes in v6:
- rebase to upstream, move the copying of kernel_filename into
  arm_load_kernel() in arm/sbsa-ref
---
 hw/arm/aspeed.c   |  5 +
 hw/arm/boot.c | 14 --
 hw/arm/collie.c   |  8 +---
 hw/arm/cubieboard.c   |  5 +
 hw/arm/exynos4_boards.c   |  7 ++-
 hw/arm/highbank.c |  8 +---
 hw/arm/imx25_pdk.c|  5 +
 hw/arm/integratorcp.c |  8 +---
 hw/arm/kzm.c  |  5 +
 hw/arm/mainstone.c|  5 +
 hw/arm/mcimx6ul-evk.c |  5 +
 hw/arm/mcimx7d-sabre.c|  5 +
 hw/arm/musicpal.c |  8 +---
 hw/arm/nseries.c  |  5 +
 hw/arm/omap_sx1.c |  5 +
 hw/arm/palm.c | 10 ++
 hw/arm/raspi.c|  6 +-
 hw/arm/realview.c |  5 +
 hw/arm/sabrelite.c|  5 +
 hw/arm/sbsa-ref.c |  3 +--
 hw/arm/spitz.c|  5 +
 hw/arm/tosa.c |  8 +---
 hw/arm/versatilepb.c  |  5 +
 hw/arm/vexpress.c |  5 +
 hw/arm/virt.c |  8 +++-
 hw/arm/xilinx_zynq.c  |  8 +---
 hw/arm/xlnx-versal-virt.c |  7 ++-
 hw/arm/xlnx-zcu102.c  |  5 +
 hw/arm/z2.c   |  8 +---
 include/hw/arm/boot.h |  4 ++--
 30 files changed, 43 insertions(+), 147 deletions(-)

diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
index 8b6d304247..ce647bdb49 100644
--- a/hw/arm/aspeed.c
+++ b/hw/arm/aspeed.c
@@ -241,9 +241,6 @@ static void aspeed_board_init(MachineState *machine,
 write_boot_rom(drive0, FIRMWARE_ADDR, fl->size, &error_abort);
 }
 
-aspeed_board_binfo.kernel_filename = machine->kernel_filename;
-aspeed_board_binfo.initrd_filename = machine->initrd_filename;
-aspeed_board_binfo.kernel_cmdline = machine->kernel_cmdline;
 aspeed_board_binfo.ram_size = ram_size;
 aspeed_board_binfo.loader_start = sc->info->memmap[ASPEED_SDRAM];
 aspeed_board_binfo.nb_cpus = bmc->soc.num_cpus;
@@ -252,7 +249,7 @@ static void aspeed_board_init(MachineState *machine,
 cfg->i2c_init(bmc);
 }
 
-arm_load_kernel(ARM_CPU(first_cpu), &aspeed_board_binfo);
+arm_load_kernel(ARM_CPU(first_cpu), machine, &aspeed_board_binfo);
 }
 
 static void palmetto_bmc_i2c_init(AspeedBoardState *bmc)
diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index 1fb24fbef2..a90151f465 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -524,7 +524,7 @@ static void fdt_add_psci_node(void *fdt)
 }
 
 int arm_load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
- hwaddr addr_limit, AddressSpace *as)
+ hwaddr addr_limit, AddressSpace *as, MachineState *ms)
 {
 void *fdt = NULL;
 int size, rc, n = 0;
@@ -627,9 +627,9 @@ int arm_load_dtb(hwaddr addr, const struct arm_boot_info 
*binfo,
 qemu_fdt_add_subnode(fdt, "/chosen");
 }
 
-if (binfo->kernel_cmdline && *binfo->kernel_cmdline) {
+if (ms->kernel_cmdline && *ms->kernel_cmdline) {
 rc = qemu_fdt_setprop_string(fdt, "/chosen", "bootargs",
- binfo->kernel_cmdline);
+ ms->kernel_cmdline);
 if (rc < 0) {
 fprintf(stderr, "couldn't set /chosen/bootargs\n");
 goto fail;
@@ -1244,7 +1244,7 @@ static void arm_setup_firmware_boot(ARMCPU *cpu, struct 
arm_boot_info *info)
  */
 }
 
-void arm_load_kernel(ARMCPU *cpu, struct arm_boot_info *info)
+void arm_load_kernel(ARMCPU *cpu, MachineState *ms, struct arm_boot_info *info)
 {
 CPUState *cs;
 AddressSpace *as = arm_boot_address_space(cpu, info);
@@ -1265,7 +1265,9 @@ void arm_load_kernel(ARMCPU *cpu, struct arm_boot_info 
*info)
  * doesn't support secure.
  */
 assert(!(info->secure_board_setup && kvm_enabled()));
-
+info->kernel_filename = ms->kernel_filename;
+info->kernel_cmdline = ms->kernel_cmdline;
+info->initrd_filename = ms->initrd_filename;
 info->dtb_filename = qemu_opt_get(qemu_get_machine_opts(), "dtb");
 info->dtb_limit = 0;
 
@@ -1277,7 +1279,7 @@ void arm_load_kernel(ARMCPU *cpu, struct arm_boot_info 
*info)
 }
 
 if (!info->skip_dtb_autoload && have_dtb(info)) {
-if (arm_load_dtb(info->dtb_start, info, info->dtb_limit, as) < 0) {
+if (arm_load_dtb(info->dtb_start, info, info->dtb_limit, as, ms) < 0) {
 exit(1);
 }
 }
diff --git a/hw/arm/collie.c b/hw/arm/collie.c
index 3db3c56004..72bc8f26e5 100644
--- a/hw/arm/collie.c
+++ b/hw/arm/collie.c
@@ -26,9 +26,6 @@ static struct arm_boot_info collie_bin

[Qemu-devel] [PATCH v6 03/14] numa: move numa global variable have_numa_distance into MachineState

2019-07-07 Thread Tao Xu
Move existing numa global have_numa_distance into NumaState.

Reviewed-by: Igor Mammedov 
Reviewed-by: Liu Jingqi 
Suggested-by: Igor Mammedov 
Suggested-by: Eduardo Habkost 
Signed-off-by: Tao Xu 
---

Changes in v6:
- rebase to upstream, move globals in arm/sbsa-ref
---
 hw/arm/sbsa-ref.c| 2 +-
 hw/arm/virt-acpi-build.c | 2 +-
 hw/arm/virt.c| 2 +-
 hw/i386/acpi-build.c | 2 +-
 include/sysemu/numa.h| 4 ++--
 numa.c   | 5 ++---
 6 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/hw/arm/sbsa-ref.c b/hw/arm/sbsa-ref.c
index 9f0cd37d72..13a9bf938c 100644
--- a/hw/arm/sbsa-ref.c
+++ b/hw/arm/sbsa-ref.c
@@ -158,7 +158,7 @@ static void create_fdt(SBSAMachineState *sms)
 qemu_fdt_setprop_cell(fdt, "/", "#address-cells", 0x2);
 qemu_fdt_setprop_cell(fdt, "/", "#size-cells", 0x2);
 
-if (have_numa_distance) {
+if (ms->numa_state->have_numa_distance) {
 int size = nb_numa_nodes * nb_numa_nodes * 3 * sizeof(uint32_t);
 uint32_t *matrix = g_malloc0(size);
 int idx, i, j;
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index a2cc4b84fe..461a44b5b0 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -797,7 +797,7 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables 
*tables)
 if (ms->numa_state->num_nodes > 0) {
 acpi_add_table(table_offsets, tables_blob);
 build_srat(tables_blob, tables->linker, vms);
-if (have_numa_distance) {
+if (ms->numa_state->have_numa_distance) {
 acpi_add_table(table_offsets, tables_blob);
 build_slit(tables_blob, tables->linker, ms);
 }
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index e7d72096e5..ef051569ef 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -232,7 +232,7 @@ static void create_fdt(VirtMachineState *vms)
 "clk24mhz");
 qemu_fdt_setprop_cell(fdt, "/apb-pclk", "phandle", vms->clock_phandle);
 
-if (have_numa_distance) {
+if (nb_numa_nodes > 0 && ms->numa_state->have_numa_distance) {
 int size = nb_numa_nodes * nb_numa_nodes * 3 * sizeof(uint32_t);
 uint32_t *matrix = g_malloc0(size);
 int idx, i, j;
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index d26a7be7a6..0e451e06d1 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2680,7 +2680,7 @@ void acpi_build(AcpiBuildTables *tables, MachineState 
*machine)
 if (pcms->numa_nodes) {
 acpi_add_table(table_offsets, tables_blob);
 build_srat(tables_blob, tables->linker, machine);
-if (have_numa_distance) {
+if (machine->numa_state->have_numa_distance) {
 acpi_add_table(table_offsets, tables_blob);
 build_slit(tables_blob, tables->linker, machine);
 }
diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
index bea3b5b899..b9fa457948 100644
--- a/include/sysemu/numa.h
+++ b/include/sysemu/numa.h
@@ -6,8 +6,6 @@
 #include "sysemu/hostmem.h"
 #include "hw/boards.h"
 
-extern bool have_numa_distance;
-
 struct NodeInfo {
 uint64_t node_mem;
 struct HostMemoryBackend *node_memdev;
@@ -26,6 +24,8 @@ struct NumaState {
 /* Number of NUMA nodes */
 int num_nodes;
 
+/* Allow setting NUMA distance for different NUMA nodes */
+bool have_numa_distance;
 };
 typedef struct NumaState NumaState;
 
diff --git a/numa.c b/numa.c
index d35e5c2c2e..d49c53b3bf 100644
--- a/numa.c
+++ b/numa.c
@@ -52,7 +52,6 @@ static int have_memdevs = -1;
 static int max_numa_nodeid; /* Highest specified NUMA node ID, plus one.
  * For all nodes, nodeid < max_numa_nodeid
  */
-bool have_numa_distance;
 NodeInfo numa_info[MAX_NODES];
 
 
@@ -171,7 +170,7 @@ void parse_numa_distance(MachineState *ms, NumaDistOptions 
*dist, Error **errp)
 }
 
 numa_info[src].distance[dst] = val;
-have_numa_distance = true;
+ms->numa_state->have_numa_distance = true;
 }
 
 static
@@ -442,7 +441,7 @@ void numa_complete_configuration(MachineState *ms)
  * asymmetric. In this case, the distances for both directions
  * of all node pairs are required.
  */
-if (have_numa_distance) {
+if (ms->numa_state->have_numa_distance) {
 /* Validate enough NUMA distance information was provided. */
 validate_numa_distance(ms);
 
-- 
2.20.1




[Qemu-devel] [PATCH v6 07/14] hmat acpi: Build System Locality Latency and Bandwidth Information Structure(s)

2019-07-07 Thread Tao Xu
From: Liu Jingqi 

This structure describes the memory access latency and bandwidth
information from various memory access initiator proximity domains.
The latency and bandwidth numbers represented in this structure
correspond to rated latency and bandwidth for the platform.
The software could use this information as a hint for optimization.

Signed-off-by: Liu Jingqi 
Signed-off-by: Tao Xu 
---

Changes in v6:
- Update the descriptions to ACPI 6.3
- remove num_target and target_pxm, because every NUMA node can be a
  target (no matter whether it can be reached or not; the Entry Base Unit
  for latency 0x means the initiator and target domains are
  unreachable from each other)
---
 hw/acpi/hmat.c  | 94 -
 hw/acpi/hmat.h  | 39 +
 include/qemu/typedefs.h |  1 +
 include/sysemu/numa.h   |  3 ++
 include/sysemu/sysemu.h | 22 ++
 5 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
index abf99b1adc..6dd39b0c85 100644
--- a/hw/acpi/hmat.c
+++ b/hw/acpi/hmat.c
@@ -67,11 +67,80 @@ static void build_hmat_mpda(GArray *table_data, uint16_t 
flags, int initiator,
 build_append_int_noprefix(table_data, 0, 8);
 }
 
+/*
+ * ACPI 6.3: 5.2.27.4 System Locality Latency and Bandwidth Information
+ * Structure: Table 5-142
+ */
+static void build_hmat_lb(GArray *table_data, HMAT_LB_Info *numa_hmat_lb,
+  uint32_t num_initiator, uint32_t num_target,
+  uint32_t *initiator_pxm, int type)
+{
+uint32_t s = num_initiator;
+uint32_t t = num_target;
+uint8_t m, n;
+int i;
+
+/* Type */
+build_append_int_noprefix(table_data, 1, 2);
+/* Reserved */
+build_append_int_noprefix(table_data, 0, 2);
+/* Length */
+build_append_int_noprefix(table_data, 32 + 4 * s + 4 * t + 2 * s * t, 4);
+/* Flags */
+build_append_int_noprefix(table_data, numa_hmat_lb->hierarchy, 1);
+/* Data Type */
+build_append_int_noprefix(table_data, numa_hmat_lb->data_type, 1);
+/* Reserved */
+build_append_int_noprefix(table_data, 0, 2);
+/* Number of Initiator Proximity Domains (s) */
+build_append_int_noprefix(table_data, s, 4);
+/* Number of Target Proximity Domains (t) */
+build_append_int_noprefix(table_data, t, 4);
+/* Reserved */
+build_append_int_noprefix(table_data, 0, 4);
+
+/* Entry Base Unit */
+if (type <= HMAT_LB_DATA_WRITE_LATENCY) {
+build_append_int_noprefix(table_data, numa_hmat_lb->base_lat, 8);
+} else {
+build_append_int_noprefix(table_data, numa_hmat_lb->base_bw, 8);
+}
+
+/* Initiator Proximity Domain List */
+for (i = 0; i < s; i++) {
+build_append_int_noprefix(table_data, initiator_pxm[i], 4);
+}
+
+/* Target Proximity Domain List */
+for (i = 0; i < t; i++) {
+build_append_int_noprefix(table_data, i, 4);
+}
+
+/* Latency or Bandwidth Entries */
+for (i = 0; i < s; i++) {
+m = initiator_pxm[i];
+for (n = 0; n < t; n++) {
+uint16_t entry;
+
+if (type <= HMAT_LB_DATA_WRITE_LATENCY) {
+entry = numa_hmat_lb->latency[m][n];
+} else {
+entry = numa_hmat_lb->bandwidth[m][n];
+}
+
+build_append_int_noprefix(table_data, entry, 2);
+}
+}
+}
+
 /* Build HMAT sub table structures */
 static void hmat_build_table_structs(GArray *table_data, NumaState *nstat)
 {
 uint16_t flags;
-int i;
+uint32_t num_initiator = 0;
+uint32_t initiator_pxm[MAX_NODES];
+int i, hrchy, type;
+HMAT_LB_Info *numa_hmat_lb;
 
 for (i = 0; i < nstat->num_nodes; i++) {
 flags = 0;
@@ -82,6 +151,29 @@ static void hmat_build_table_structs(GArray *table_data, 
NumaState *nstat)
 
 build_hmat_mpda(table_data, flags, nstat->nodes[i].initiator, i);
 }
+
+for (i = 0; i < nstat->num_nodes; i++) {
+if (nstat->nodes[i].has_cpu) {
+initiator_pxm[num_initiator++] = i;
+}
+}
+
+/*
+ * ACPI 6.3: 5.2.27.4 System Locality Latency and Bandwidth Information
+ * Structure: Table 5-142
+ */
+for (hrchy = HMAT_LB_MEM_MEMORY;
+ hrchy <= HMAT_LB_MEM_CACHE_3RD_LEVEL; hrchy++) {
+for (type = HMAT_LB_DATA_ACCESS_LATENCY;
+ type <= HMAT_LB_DATA_WRITE_BANDWIDTH; type++) {
+numa_hmat_lb = nstat->hmat_lb[hrchy][type];
+
+if (numa_hmat_lb) {
+build_hmat_lb(table_data, numa_hmat_lb, num_initiator,
+  nstat->num_nodes, initiator_pxm, type);
+}
+}
+}
 }
 
 void build_hmat(GArray *table_data, BIOSLinker *linker, NumaState *nstat)
diff --git a/hw/acpi/hmat.h b/hw/acpi/hmat.h
index 574cfba60a..9d5f407b8a 100644
--- a/hw/acpi/hmat.h
+++ b/hw/acpi/hmat.h
@@ -40,6 +40,45 @@
  */
 #define HMAT_PROX_INIT_VALID 0x1
 
+struct HMAT_LB_Info {
+

[Qemu-devel] [PATCH v6 10/14] numa: Extend the CLI to provide memory side cache information

2019-07-07 Thread Tao Xu
From: Liu Jingqi 

Add -numa hmat-cache option to provide Memory Side Cache Information.
These memory attributes help to build Memory Side Cache Information
Structure(s) in ACPI Heterogeneous Memory Attribute Table (HMAT).
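The option parser added below enforces a few sanity rules on the cache hierarchy. The following is a self-contained sketch of those rules, mirroring `parse_numa_hmat_cache()`; the value of `MAX_HMAT_CACHE_LEVEL` is an assumption here, since its definition is not shown in this series excerpt.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_HMAT_CACHE_LEVEL 3  /* assumed limit; defined elsewhere in the series */

/*
 * Sanity rules from parse_numa_hmat_cache(): the total number of levels
 * is bounded, a level may not exceed the total, and for level > 1 the
 * size must be strictly smaller than the size of level - 1.
 * Returns 0 on success, -1 on an invalid configuration.
 */
static int hmat_cache_check(uint8_t total, uint8_t level,
                            uint64_t size, uint64_t prev_level_size)
{
    if (total > MAX_HMAT_CACHE_LEVEL || level > total) {
        return -1;
    }
    if (level > 1 && size >= prev_level_size) {
        return -1;
    }
    return 0;
}
```

So a two-level configuration where level 2 is smaller than level 1 passes, while a level 2 cache as large as level 1, or a level number beyond the declared total, is rejected.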

Signed-off-by: Liu Jingqi 
Signed-off-by: Tao Xu 
---

Changes in v6:
- Add a reference to the ACPI 6.3 spec (Igor)
---
 numa.c  | 68 +
 qapi/misc.json  | 81 +++--
 qemu-options.hx | 16 --
 3 files changed, 161 insertions(+), 4 deletions(-)

diff --git a/numa.c b/numa.c
index db5fe8c12f..1cbfb8ab4e 100644
--- a/numa.c
+++ b/numa.c
@@ -307,6 +307,68 @@ static void parse_numa_hmat_lb(MachineState *ms, 
NumaHmatLBOptions *node,
 }
 }
 
+static
+void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node,
+Error **errp)
+{
+int nb_numa_nodes = ms->numa_state->num_nodes;
+HMAT_Cache_Info *hmat_cache = NULL;
+
+if (node->node_id >= nb_numa_nodes) {
+error_setg(errp, "Invalid node-id=%" PRIu32
+   ", it should be less than %d.",
+   node->node_id, nb_numa_nodes);
+return;
+}
+
+if (node->total > MAX_HMAT_CACHE_LEVEL) {
+error_setg(errp, "Invalid total=%" PRIu8
+   ", it should be less than or equal to %d.",
+   node->total, MAX_HMAT_CACHE_LEVEL);
+return;
+}
+if (node->level > node->total) {
+error_setg(errp, "Invalid level=%" PRIu8
+   ", it should be less than or equal to"
+   " total=%" PRIu8 ".",
+   node->level, node->total);
+return;
+}
+if (ms->numa_state->hmat_cache[node->node_id][node->level]) {
+error_setg(errp, "Duplicate configuration of the side cache for "
+   "node-id=%" PRIu32 " and level=%" PRIu8 ".",
+   node->node_id, node->level);
+return;
+}
+
+if ((node->level > 1) &&
+ms->numa_state->hmat_cache[node->node_id][node->level - 1] &&
+(node->size >=
+ms->numa_state->hmat_cache[node->node_id][node->level - 1]->size)) 
{
+error_setg(errp, "Invalid size=0x%" PRIx64
+   ", the size of level=%" PRIu8
+   " should be less than the size(0x%" PRIx64
+   ") of level=%" PRIu8 ".",
+   node->size, node->level,
+   ms->numa_state->hmat_cache[node->node_id]
+ [node->level - 1]->size,
+   node->level - 1);
+return;
+}
+
+hmat_cache = g_malloc0(sizeof(*hmat_cache));
+
+hmat_cache->mem_proximity = node->node_id;
+hmat_cache->size = node->size;
+hmat_cache->total_levels = node->total;
+hmat_cache->level = node->level;
+hmat_cache->associativity = node->assoc;
+hmat_cache->write_policy = node->policy;
+hmat_cache->line_size = node->line;
+
+ms->numa_state->hmat_cache[node->node_id][node->level] = hmat_cache;
+}
+
 static
 void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
 {
@@ -352,6 +414,12 @@ void set_numa_options(MachineState *ms, NumaOptions 
*object, Error **errp)
 goto end;
 }
 break;
+case NUMA_OPTIONS_TYPE_HMAT_CACHE:
+parse_numa_hmat_cache(ms, &object->u.hmat_cache, &err);
+if (err) {
+goto end;
+}
+break;
 default:
 abort();
 }
diff --git a/qapi/misc.json b/qapi/misc.json
index 6c7a356fa8..00d5ffabe8 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -2537,10 +2537,12 @@
 #
 # @hmat-lb: memory latency and bandwidth information (Since: 4.1)
 #
+# @hmat-cache: memory side cache information (Since: 4.1)
+#
 # Since: 2.1
 ##
 { 'enum': 'NumaOptionsType',
-  'data': [ 'node', 'dist', 'cpu', 'hmat-lb' ] }
+  'data': [ 'node', 'dist', 'cpu', 'hmat-lb', 'hmat-cache' ] }
 
 ##
 # @NumaOptions:
@@ -2556,7 +2558,8 @@
 'node': 'NumaNodeOptions',
 'dist': 'NumaDistOptions',
 'cpu': 'NumaCpuOptions',
-'hmat-lb': 'NumaHmatLBOptions' }}
+'hmat-lb': 'NumaHmatLBOptions',
+'hmat-cache': 'NumaHmatCacheOptions' }}
 
 ##
 # @NumaNodeOptions:
@@ -2719,6 +2722,80 @@
 '*latency': 'uint16',
 '*bandwidth': 'uint16' }}
 
+##
+# @HmatCacheAssociativity:
+#
+# Cache associativity in the Memory Side Cache
+# Information Structure of HMAT
+#
+# For more information of @HmatCacheAssociativity see
+# the chapter 5.2.27.5: Table 5-143 of ACPI 6.3 spec.
+#
+# @none: None
+#
+# @direct: Direct Mapped
+#
+# @complex: Complex Cache Indexing (implementation specific)
+#
+# Since: 4.1
+##
+{ 'enum': 'HmatCacheAssociativity',
+  'data': [ 'none', 'direct', 'complex' ] }
+
+##
+# @HmatCacheWritePolicy:
+#
+# Cache write policy in the Memory Side Cache
+# Information Structure of HMAT
+#
+# For more information of @HmatCacheWritePolicy see
+# the chapter 5.2.27.5: Table

[Qemu-devel] [PATCH v6 02/14] numa: move numa global variable nb_numa_nodes into MachineState

2019-07-07 Thread Tao Xu
Add a struct NumaState to MachineState and move the existing numa global
nb_numa_nodes (renamed "num_nodes") into NumaState. Also add a variable
numa_supported to MachineClass to decide which machine types support NUMA.
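After this refactoring, `ms->numa_state` can be NULL on machines that do not support NUMA, which is why call sites such as the `qemu_minrampagesize()` hunk below check the pointer before dereferencing. A minimal mirror of that access pattern, with the struct members reduced to what this patch introduces:

```c
#include <assert.h>
#include <stddef.h>

/* Reduced mirror of the post-patch layout: numa_state is NULL unless
 * the machine class sets numa_supported. */
typedef struct NumaState {
    int num_nodes;
} NumaState;

typedef struct MachineState {
    NumaState *numa_state;
} MachineState;

/* Safe accessor: a machine without NUMA support reports zero nodes. */
static int machine_num_numa_nodes(const MachineState *ms)
{
    return ms->numa_state ? ms->numa_state->num_nodes : 0;
}
```

This is why the converted condition reads `ms->numa_state == NULL || ms->numa_state->num_nodes == 0` rather than just testing the node count.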

Suggested-by: Igor Mammedov 
Suggested-by: Eduardo Habkost 
Signed-off-by: Tao Xu 
---

Changes in v6:
- Rebase to upstream, move globals in arm/sbsa-ref
- When used once or twice in the function, use
  ms->numa_state->num_nodes directly
- Correct some mistakes
- Use a single monitor_printf in hmp_info_numa
---
 exec.c  |  5 ++-
 hw/acpi/aml-build.c |  3 +-
 hw/arm/boot.c   |  4 +-
 hw/arm/sbsa-ref.c   |  4 +-
 hw/arm/virt-acpi-build.c| 10 +++--
 hw/arm/virt.c   |  5 ++-
 hw/core/machine.c   | 14 +--
 hw/i386/acpi-build.c|  2 +-
 hw/i386/pc.c| 10 +++--
 hw/mem/pc-dimm.c|  2 +
 hw/pci-bridge/pci_expander_bridge.c |  3 +-
 hw/ppc/spapr.c  | 26 +++-
 include/hw/acpi/aml-build.h |  2 +-
 include/hw/boards.h |  2 +
 include/sysemu/numa.h   | 11 -
 monitor/misc.c  | 12 --
 numa.c  | 62 +
 17 files changed, 116 insertions(+), 61 deletions(-)

diff --git a/exec.c b/exec.c
index e7622d1956..6054034f9f 100644
--- a/exec.c
+++ b/exec.c
@@ -1736,6 +1736,7 @@ long qemu_minrampagesize(void)
 long hpsize = LONG_MAX;
 long mainrampagesize;
 Object *memdev_root;
+MachineState *ms = MACHINE(qdev_get_machine());
 
 mainrampagesize = qemu_mempath_getpagesize(mem_path);
 
@@ -1763,7 +1764,9 @@ long qemu_minrampagesize(void)
  * so if its page size is smaller we have got to report that size instead.
  */
 if (hpsize > mainrampagesize &&
-(nb_numa_nodes == 0 || numa_info[0].node_memdev == NULL)) {
+(ms->numa_state == NULL ||
+ ms->numa_state->num_nodes == 0 ||
+ numa_info[0].node_memdev == NULL)) {
 static bool warned;
 if (!warned) {
 error_report("Huge page support disabled (n/a for main memory).");
diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 555c24f21d..63c1cae8c9 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1726,10 +1726,11 @@ void build_srat_memory(AcpiSratMemoryAffinity *numamem, 
uint64_t base,
  * ACPI spec 5.2.17 System Locality Distance Information Table
  * (Revision 2.0 or later)
  */
-void build_slit(GArray *table_data, BIOSLinker *linker)
+void build_slit(GArray *table_data, BIOSLinker *linker, MachineState *ms)
 {
 int slit_start, i, j;
 slit_start = table_data->len;
+int nb_numa_nodes = ms->numa_state->num_nodes;
 
 acpi_data_push(table_data, sizeof(AcpiTableHeader));
 
diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index a90151f465..e28daa5278 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -598,9 +598,9 @@ int arm_load_dtb(hwaddr addr, const struct arm_boot_info 
*binfo,
 }
 g_strfreev(node_path);
 
-if (nb_numa_nodes > 0) {
+if (ms->numa_state != NULL && ms->numa_state->num_nodes > 0) {
 mem_base = binfo->loader_start;
-for (i = 0; i < nb_numa_nodes; i++) {
+for (i = 0; i < ms->numa_state->num_nodes; i++) {
 mem_len = numa_info[i].node_mem;
 rc = fdt_add_memory_node(fdt, acells, mem_base,
  scells, mem_len, i);
diff --git a/hw/arm/sbsa-ref.c b/hw/arm/sbsa-ref.c
index 33087b8d94..9f0cd37d72 100644
--- a/hw/arm/sbsa-ref.c
+++ b/hw/arm/sbsa-ref.c
@@ -144,6 +144,7 @@ static void create_fdt(SBSAMachineState *sms)
 {
 void *fdt = create_device_tree(&sms->fdt_size);
 const MachineState *ms = MACHINE(sms);
+int nb_numa_nodes = ms->numa_state->num_nodes;
 int cpu;
 
 if (!fdt) {
@@ -760,7 +761,7 @@ sbsa_ref_cpu_index_to_props(MachineState *ms, unsigned 
cpu_index)
 static int64_t
 sbsa_ref_get_default_cpu_node_id(const MachineState *ms, int idx)
 {
-return idx % nb_numa_nodes;
+return idx % ms->numa_state->num_nodes;
 }
 
 static void sbsa_ref_instance_init(Object *obj)
@@ -787,6 +788,7 @@ static void sbsa_ref_class_init(ObjectClass *oc, void *data)
 mc->possible_cpu_arch_ids = sbsa_ref_possible_cpu_arch_ids;
 mc->cpu_index_to_instance_props = sbsa_ref_cpu_index_to_props;
 mc->get_default_cpu_node_id = sbsa_ref_get_default_cpu_node_id;
+mc->numa_supported = true;
 }
 
 static const TypeInfo sbsa_ref_info = {
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 0afb372769..a2cc4b84fe 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -516,7 +516,8 @@ build_srat(GArray *table_data, BIOSLinker *linker, 
VirtMachineState *vms)
 int i, srat_start;
 uint64_t mem_base;
 MachineClass *mc = MACHINE_GET_CLASS(vms);
-const CPUArchIdList *c

[Qemu-devel] [PATCH v6 13/14] QMP: Add QMP interface to update HMAT at runtime

2019-07-07 Thread Tao Xu
Add a QMP interface to introduce new HMAT data (including the System
Locality Latency and Bandwidth Information Structure and the Memory Side
Cache Information Structure) at runtime. The interface can also replace
existing HMAT data.
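The mechanism that makes replacement possible is the new `runtime_flag` threaded through the parsers below: at CLI parse time a duplicate entry is an error, while the QMP path is allowed to overwrite. A small sketch of that policy, abstracted from the latency branch of `parse_numa_hmat_lb()`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Duplicate-entry policy introduced by this patch: with runtime == false
 * (CLI parsing) a second configuration of the same (initiator, target)
 * entry is rejected; with runtime == true (QMP) it overwrites the old
 * value. A nonzero 'existing' value stands for an already-set entry.
 */
static bool hmat_lb_entry_accepted(bool runtime, uint16_t existing)
{
    if (!runtime && existing) {
        return false;   /* duplicate configuration at boot: error */
    }
    return true;        /* first setting, or a runtime update */
}
```

This keeps the boot-time behavior unchanged while letting qmp_set_hmat_lb() update entries in place before rebuilding the HMA buffer.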

Suggested-by: Igor Mammedov 
Signed-off-by: Tao Xu 
---
 hw/acpi/acpi-stub.c |  7 ++
 numa.c  | 55 +++--
 qapi/misc.json  | 49 
 3 files changed, 104 insertions(+), 7 deletions(-)

diff --git a/hw/acpi/acpi-stub.c b/hw/acpi/acpi-stub.c
index 4c9d081ed4..757570ee7f 100644
--- a/hw/acpi/acpi-stub.c
+++ b/hw/acpi/acpi-stub.c
@@ -22,8 +22,15 @@
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
 #include "hw/acpi/acpi.h"
+#include "sysemu/numa.h"
+#include "hw/acpi/hmat.h"
 
 void acpi_table_add(const QemuOpts *opts, Error **errp)
 {
 error_setg(errp, QERR_UNSUPPORTED);
 }
+
+void hmat_update(NumaState *nstat)
+{
+/* Stub so that qmp_set_hmat_lb and qmp_set_hmat_cache in numa.c compile */
+}
diff --git a/numa.c b/numa.c
index e34a08d84b..738e1e5f89 100644
--- a/numa.c
+++ b/numa.c
@@ -188,7 +188,7 @@ void parse_numa_distance(MachineState *ms, NumaDistOptions 
*dist, Error **errp)
 }
 
 static void parse_numa_hmat_lb(MachineState *ms, NumaHmatLBOptions *node,
-   Error **errp)
+   bool runtime_flag, Error **errp)
 {
 int nb_numa_nodes = ms->numa_state->num_nodes;
 NodeInfo *numa_info = ms->numa_state->nodes;
@@ -265,7 +265,8 @@ static void parse_numa_hmat_lb(MachineState *ms, 
NumaHmatLBOptions *node,
 if (!hmat_lb) {
 hmat_lb = g_malloc0(sizeof(*hmat_lb));
 ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = 
hmat_lb;
-} else if (hmat_lb->latency[node->initiator][node->target]) {
+} else if (!runtime_flag &&
+   hmat_lb->latency[node->initiator][node->target]) {
 error_setg(errp, "Duplicate configuration of the latency for "
"initiator=%" PRIu16 " and target=%" PRIu16 ".",
node->initiator, node->target);
@@ -286,7 +287,8 @@ static void parse_numa_hmat_lb(MachineState *ms, 
NumaHmatLBOptions *node,
 if (!hmat_lb) {
 hmat_lb = g_malloc0(sizeof(*hmat_lb));
 ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = 
hmat_lb;
-} else if (hmat_lb->bandwidth[node->initiator][node->target]) {
+} else if (!runtime_flag &&
+   hmat_lb->bandwidth[node->initiator][node->target]) {
 error_setg(errp, "Duplicate configuration of the bandwidth for "
"initiator=%" PRIu16 " and target=%" PRIu16 ".",
node->initiator, node->target);
@@ -314,7 +316,7 @@ static void parse_numa_hmat_lb(MachineState *ms, 
NumaHmatLBOptions *node,
 
 static
 void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node,
-Error **errp)
+   bool runtime_flag, Error **errp)
 {
 int nb_numa_nodes = ms->numa_state->num_nodes;
 HMAT_Cache_Info *hmat_cache = NULL;
@@ -339,7 +341,8 @@ void parse_numa_hmat_cache(MachineState *ms, 
NumaHmatCacheOptions *node,
node->level, node->total);
 return;
 }
-if (ms->numa_state->hmat_cache[node->node_id][node->level]) {
+if (!runtime_flag &&
+ms->numa_state->hmat_cache[node->node_id][node->level]) {
 error_setg(errp, "Duplicate configuration of the side cache for "
"node-id=%" PRIu32 " and level=%" PRIu8 ".",
node->node_id, node->level);
@@ -419,13 +422,13 @@ void set_numa_options(MachineState *ms, NumaOptions 
*object, Error **errp)
   &err);
 break;
 case NUMA_OPTIONS_TYPE_HMAT_LB:
-parse_numa_hmat_lb(ms, &object->u.hmat_lb, &err);
+parse_numa_hmat_lb(ms, &object->u.hmat_lb, 0, &err);
 if (err) {
 goto end;
 }
 break;
 case NUMA_OPTIONS_TYPE_HMAT_CACHE:
-parse_numa_hmat_cache(ms, &object->u.hmat_cache, &err);
+parse_numa_hmat_cache(ms, &object->u.hmat_cache, 0, &err);
 if (err) {
 goto end;
 }
@@ -688,6 +691,44 @@ void qmp_set_numa_node(NumaOptions *cmd, Error **errp)
 set_numa_options(MACHINE(qdev_get_machine()), cmd, errp);
 }
 
+void qmp_set_hmat_lb(NumaHmatLBOptions *node, Error **errp)
+{
+MachineState *ms = MACHINE(qdev_get_machine());
+
+if (ms->numa_state == NULL || ms->numa_state->num_nodes <= 0) {
+error_setg(errp, "NUMA is not supported");
+return;
+}
+
+if (ms->numa_state->hma_enabled) {
+parse_numa_hmat_lb(ms, node, 1, errp);
+hmat_update(ms->numa_state);
+} else {
+error_setg(errp, "HMAT can't be changed at runtime when QEMU boot"
+   " without setting HMAT latency, bandwidth or memor

[Qemu-devel] [PATCH v6 04/14] numa: move numa global variable numa_info into MachineState

2019-07-07 Thread Tao Xu
Move existing numa global numa_info (renamed as "nodes") into NumaState.

Suggested-by: Igor Mammedov 
Suggested-by: Eduardo Habkost 
Signed-off-by: Tao Xu 
---

Changes in v6:
- Rebase to upstream, move globals in arm/sbsa-ref
- Correct some mistakes (Igor)
- Use ms->numa_state->nodes directly when it is used only once or twice
  (Igor)
---
 exec.c   |  2 +-
 hw/acpi/aml-build.c  |  6 --
 hw/arm/boot.c|  2 +-
 hw/arm/sbsa-ref.c|  3 ++-
 hw/arm/virt-acpi-build.c |  7 ---
 hw/arm/virt.c|  3 ++-
 hw/i386/pc.c |  4 ++--
 hw/ppc/spapr.c   | 10 +-
 hw/ppc/spapr_pci.c   |  4 +++-
 include/sysemu/numa.h|  3 +++
 numa.c   | 15 +--
 11 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/exec.c b/exec.c
index 6054034f9f..9517b2af41 100644
--- a/exec.c
+++ b/exec.c
@@ -1766,7 +1766,7 @@ long qemu_minrampagesize(void)
 if (hpsize > mainrampagesize &&
 (ms->numa_state == NULL ||
  ms->numa_state->num_nodes == 0 ||
- numa_info[0].node_memdev == NULL)) {
+ ms->numa_state->nodes[0].node_memdev == NULL)) {
 static bool warned;
 if (!warned) {
 error_report("Huge page support disabled (n/a for main memory).");
diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 63c1cae8c9..26ccc1a3e2 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1737,8 +1737,10 @@ void build_slit(GArray *table_data, BIOSLinker *linker, 
MachineState *ms)
 build_append_int_noprefix(table_data, nb_numa_nodes, 8);
 for (i = 0; i < nb_numa_nodes; i++) {
 for (j = 0; j < nb_numa_nodes; j++) {
-assert(numa_info[i].distance[j]);
-build_append_int_noprefix(table_data, numa_info[i].distance[j], 1);
+assert(ms->numa_state->nodes[i].distance[j]);
+build_append_int_noprefix(table_data,
+  ms->numa_state->nodes[i].distance[j],
+  1);
 }
 }
 
diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index e28daa5278..da228919dc 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -601,7 +601,7 @@ int arm_load_dtb(hwaddr addr, const struct arm_boot_info 
*binfo,
 if (ms->numa_state != NULL && ms->numa_state->num_nodes > 0) {
 mem_base = binfo->loader_start;
 for (i = 0; i < ms->numa_state->num_nodes; i++) {
-mem_len = numa_info[i].node_mem;
+mem_len = ms->numa_state->nodes[i].node_mem;
 rc = fdt_add_memory_node(fdt, acells, mem_base,
  scells, mem_len, i);
 if (rc < 0) {
diff --git a/hw/arm/sbsa-ref.c b/hw/arm/sbsa-ref.c
index 13a9bf938c..9364ab8a50 100644
--- a/hw/arm/sbsa-ref.c
+++ b/hw/arm/sbsa-ref.c
@@ -168,7 +168,8 @@ static void create_fdt(SBSAMachineState *sms)
 idx = (i * nb_numa_nodes + j) * 3;
 matrix[idx + 0] = cpu_to_be32(i);
 matrix[idx + 1] = cpu_to_be32(j);
-matrix[idx + 2] = cpu_to_be32(numa_info[i].distance[j]);
+matrix[idx + 2] =
+cpu_to_be32(ms->numa_state->nodes[i].distance[j]);
 }
 }
 
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 461a44b5b0..89899ec4c1 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -534,11 +534,12 @@ build_srat(GArray *table_data, BIOSLinker *linker, 
VirtMachineState *vms)
 
 mem_base = vms->memmap[VIRT_MEM].base;
 for (i = 0; i < ms->numa_state->num_nodes; ++i) {
-if (numa_info[i].node_mem > 0) {
+if (ms->numa_state->nodes[i].node_mem > 0) {
 numamem = acpi_data_push(table_data, sizeof(*numamem));
-build_srat_memory(numamem, mem_base, numa_info[i].node_mem, i,
+build_srat_memory(numamem, mem_base,
+  ms->numa_state->nodes[i].node_mem, i,
   MEM_AFFINITY_ENABLED);
-mem_base += numa_info[i].node_mem;
+mem_base += ms->numa_state->nodes[i].node_mem;
 }
 }
 
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ef051569ef..2fda2bd23f 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -242,7 +242,8 @@ static void create_fdt(VirtMachineState *vms)
 idx = (i * nb_numa_nodes + j) * 3;
 matrix[idx + 0] = cpu_to_be32(i);
 matrix[idx + 1] = cpu_to_be32(j);
-matrix[idx + 2] = cpu_to_be32(numa_info[i].distance[j]);
+matrix[idx + 2] =
+cpu_to_be32(ms->numa_state->nodes[i].distance[j]);
 }
 }
 
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index d0e5897c3f..de3260355f 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1039,7 +1039,7 @@ static FWCfgState *bochs_bios_init(AddressSpace *as, 
PCMachineState *pcms)
 }
 for (i = 0; i < nb_numa_nodes; i++) {
 numa_fw_c

[Qemu-devel] [PATCH v6 12/14] hmat acpi: Implement _HMA method to update HMAT at runtime

2019-07-07 Thread Tao Xu
From: Liu Jingqi 

OSPM evaluates the HMAT only during system initialization.
Any changes to the HMAT state at runtime, or HMAT information
for hot plug, are communicated using the _HMA method.

_HMA is an optional object that enables the platform to provide
the OS with updated Heterogeneous Memory Attributes information
at runtime. _HMA provides OSPM with the latest HMAT in its
entirety, overriding the existing HMAT.
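Because the updated HMAT can be larger than the shared guest/host buffer, the _HMA handler below returns it in chunks: each read carries an offset, and the reply is capped at the buffer size minus the two uint32_t header fields. A sketch of that chunking calculation, as done in `hmat_handle_hma_method()`; the `HMAM_MEMORY_SIZE` value used here is an assumption, since its definition is not shown in this excerpt.

```c
#include <assert.h>
#include <stdint.h>

#define HMAM_MEMORY_SIZE 4096   /* assumed buffer size; defined elsewhere
                                   in the series */

/*
 * Bytes returned by one _HMA read: what remains from 'offset' to the
 * end of the HMA data, capped by the shared buffer minus the len and
 * ret_status header fields. The caller has already validated that
 * offset <= hma_len.
 */
static uint32_t hma_read_len(uint32_t hma_len, uint32_t offset)
{
    uint32_t max_chunk = HMAM_MEMORY_SIZE - 2 * sizeof(uint32_t);
    uint32_t left = hma_len - offset;
    return left < max_chunk ? left : max_chunk;
}
```

The dirty flag in the handler then guards against the table changing between two chunked reads of the same HMA image.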

Signed-off-by: Liu Jingqi 
Signed-off-by: Tao Xu 
---

Changes in v6:
- Move AcpiHmaState from PCMachineState to MachineState
  to make HMAT more generic (Igor)
- Use build_acpi_aml_common() introduced in patch 10/11 to
  simplify hmat_build_aml (Igor)
- Add _HMA only when QEMU is started with hmat-lb or hmat-cache
---
 hw/acpi/hmat.c| 297 ++
 hw/acpi/hmat.h|  68 ++
 hw/core/machine.c |   4 +
 hw/i386/acpi-build.c  |   4 +
 hw/i386/pc_piix.c |   6 +
 hw/i386/pc_q35.c  |   6 +
 include/sysemu/numa.h |   5 +
 numa.c|  10 ++
 8 files changed, 400 insertions(+)

diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
index a207581f11..33b9dd2e04 100644
--- a/hw/acpi/hmat.c
+++ b/hw/acpi/hmat.c
@@ -27,6 +27,8 @@
 #include "qemu/osdep.h"
 #include "sysemu/numa.h"
 #include "hw/acpi/hmat.h"
+#include "hw/mem/nvdimm.h"
+#include "hw/nvram/fw_cfg.h"
 
 /*
  * ACPI 6.3:
@@ -238,6 +240,270 @@ static void hmat_build_table_structs(GArray *table_data, 
NumaState *nstat)
 }
 }
 
+static uint64_t
+hmat_hma_method_read(void *opaque, hwaddr addr, unsigned size)
+{
+printf("BUG: we never read _HMA IO Port.\n");
+return 0;
+}
+
+/* _HMA Method: read HMA data. */
+static void hmat_handle_hma_method(AcpiHmaState *state,
+   HmatHmamIn *in, hwaddr hmam_mem_addr)
+{
+HmatHmaBuffer *hma_buf = &state->hma_buf;
+HmatHmamOut *read_hma_out;
+GArray *hma;
+uint32_t read_len = 0, ret_status;
+int size;
+
+if (in != NULL) {
+le32_to_cpus(&in->offset);
+}
+
+hma = hma_buf->hma;
+if (in->offset > hma->len) {
+ret_status = HMAM_RET_STATUS_INVALID;
+goto exit;
+}
+
+   /* This is the first read of the HMA. */
+if (!in->offset) {
+hma_buf->dirty = false;
+} else if (hma_buf->dirty) {
+/* HMA has been changed during Reading HMA. */
+ret_status = HMAM_RET_STATUS_HMA_CHANGED;
+goto exit;
+}
+
+ret_status = HMAM_RET_STATUS_SUCCESS;
+read_len = MIN(hma->len - in->offset,
+   HMAM_MEMORY_SIZE - 2 * sizeof(uint32_t));
+exit:
+size = sizeof(HmatHmamOut) + read_len;
+read_hma_out = g_malloc(size);
+
+read_hma_out->len = cpu_to_le32(size);
+read_hma_out->ret_status = cpu_to_le32(ret_status);
+memcpy(read_hma_out->data, hma->data + in->offset, read_len);
+
+cpu_physical_memory_write(hmam_mem_addr, read_hma_out, size);
+
+g_free(read_hma_out);
+}
+
+static void
+hmat_hma_method_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
+{
+AcpiHmaState *state = opaque;
+hwaddr hmam_mem_addr = val;
+HmatHmamIn *in;
+
+in = g_new(HmatHmamIn, 1);
+cpu_physical_memory_read(hmam_mem_addr, in, sizeof(*in));
+
+hmat_handle_hma_method(state, in, hmam_mem_addr);
+}
+
+static const MemoryRegionOps hmat_hma_method_ops = {
+.read = hmat_hma_method_read,
+.write = hmat_hma_method_write,
+.endianness = DEVICE_LITTLE_ENDIAN,
+.valid = {
+.min_access_size = 4,
+.max_access_size = 4,
+},
+};
+
+static void hmat_init_hma_buffer(HmatHmaBuffer *hma_buf)
+{
+hma_buf->hma = g_array_new(false, true /* clear */, 1);
+}
+
+static uint8_t hmat_acpi_table_checksum(uint8_t *buffer, uint32_t length)
+{
+uint8_t sum = 0;
+uint8_t *end = buffer + length;
+
+while (buffer < end) {
+sum = (uint8_t) (sum + *(buffer++));
+}
+return (uint8_t)(0 - sum);
+}
+
+static void hmat_build_header(AcpiTableHeader *h,
+ const char *sig, int len, uint8_t rev,
+ const char *oem_id, const char *oem_table_id)
+{
+memcpy(&h->signature, sig, 4);
+h->length = cpu_to_le32(len);
+h->revision = rev;
+
+if (oem_id) {
+strncpy((char *)h->oem_id, oem_id, sizeof h->oem_id);
+} else {
+memcpy(h->oem_id, ACPI_BUILD_APPNAME6, 6);
+}
+
+if (oem_table_id) {
+strncpy((char *)h->oem_table_id, oem_table_id, 
sizeof(h->oem_table_id));
+} else {
+memcpy(h->oem_table_id, ACPI_BUILD_APPNAME4, 4);
+memcpy(h->oem_table_id + 4, sig, 4);
+}
+
+h->oem_revision = cpu_to_le32(1);
+memcpy(h->asl_compiler_id, ACPI_BUILD_APPNAME4, 4);
+h->asl_compiler_revision = cpu_to_le32(1);
+
+/* Calculate the checksum of the ACPI table. */
+h->checksum = 0;
+h->checksum = hmat_acpi_table_checksum((uint8_t *)h, len);
+}
+
+static void hmat_build_hma_buffer(NumaState *nstat)
+{
+HmatHmaBuffer *hma_buf = &(nstat->acpi_hma_state->hma_buf);
+
+/* Fr

[Qemu-devel] [PATCH v6 06/14] hmat acpi: Build Memory Proximity Domain Attributes Structure(s)

2019-07-07 Thread Tao Xu
From: Liu Jingqi 

HMAT is defined in ACPI 6.3: 5.2.27 Heterogeneous Memory Attribute Table
(HMAT). The specification references below link:
http://www.uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf

It describes the memory attributes, such as memory side cache
attributes and bandwidth and latency details, related to the
Memory Proximity Domain. The software is
expected to use this information as a hint for optimization.

This structure describes the Memory Proximity Domain Attributes of a
memory subsystem and its associativity with a processor proximity
domain, as well as hints for memory usage.
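The structure built by this patch has a fixed 40-byte layout. As a cross-check of the Length field `build_hmat_mpda()` hard-codes, here is the field list in emission order with its sizes summed; this merely restates ACPI 6.3 Table 5-141, not patch code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Field sizes of the Memory Proximity Domain Attributes Structure
 * (ACPI 6.3, Table 5-141), in the order build_hmat_mpda() emits them.
 * Their sum must equal the hard-coded Length of 40.
 */
static uint32_t hmat_mpda_len(void)
{
    static const uint32_t fields[] = {
        2, /* Type */
        2, /* Reserved */
        4, /* Length */
        2, /* Flags */
        2, /* Reserved */
        4, /* Proximity Domain for the Attached Initiator */
        4, /* Proximity Domain for the Memory */
        4, /* Reserved */
        8, /* Reserved (was Start Address; deprecated since 6.3) */
        8, /* Reserved (was Range Length; deprecated since 6.3) */
    };
    uint32_t len = 0;
    for (unsigned i = 0; i < sizeof(fields) / sizeof(fields[0]); i++) {
        len += fields[i];
    }
    return len;
}
```

The two trailing 8-byte fields are exactly the address-range pair that ACPI 6.3 deprecated, which is why the v6 changelog drops the mem_ranges bookkeeping.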

Signed-off-by: Liu Jingqi 
Signed-off-by: Tao Xu 
---

Changes in v6:
Update to ACPI 6.3; the main changes are:
- Renamed as Memory Proximity Domain Attributes; use NUMA nodes
  instead of memory ranges
- Use "-numa initiator" to describe the Attached Initiator, making
  the memory topology clearer
- Because HMAT no longer uses memory ranges, remove the code that
  builds mem_ranges for the dimm device
---
 hw/acpi/Kconfig   |   5 +++
 hw/acpi/Makefile.objs |   1 +
 hw/acpi/hmat.c| 101 ++
 hw/acpi/hmat.h|  45 +++
 hw/i386/acpi-build.c  |   3 ++
 5 files changed, 155 insertions(+)
 create mode 100644 hw/acpi/hmat.c
 create mode 100644 hw/acpi/hmat.h

diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
index 7c59cf900b..039bb99efa 100644
--- a/hw/acpi/Kconfig
+++ b/hw/acpi/Kconfig
@@ -7,6 +7,7 @@ config ACPI_X86
 select ACPI_NVDIMM
 select ACPI_CPU_HOTPLUG
 select ACPI_MEMORY_HOTPLUG
+select ACPI_HMAT
 
 config ACPI_X86_ICH
 bool
@@ -31,3 +32,7 @@ config ACPI_VMGENID
 bool
 default y
 depends on PC
+
+config ACPI_HMAT
+bool
+depends on ACPI
diff --git a/hw/acpi/Makefile.objs b/hw/acpi/Makefile.objs
index 9bb2101e3b..c05019b059 100644
--- a/hw/acpi/Makefile.objs
+++ b/hw/acpi/Makefile.objs
@@ -6,6 +6,7 @@ common-obj-$(CONFIG_ACPI_MEMORY_HOTPLUG) += memory_hotplug.o
 common-obj-$(CONFIG_ACPI_CPU_HOTPLUG) += cpu.o
 common-obj-$(CONFIG_ACPI_NVDIMM) += nvdimm.o
 common-obj-$(CONFIG_ACPI_VMGENID) += vmgenid.o
+common-obj-$(CONFIG_ACPI_HMAT) += hmat.o
 common-obj-$(call lnot,$(CONFIG_ACPI_X86)) += acpi-stub.o
 
 common-obj-y += acpi_interface.o
diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
new file mode 100644
index 00..abf99b1adc
--- /dev/null
+++ b/hw/acpi/hmat.c
@@ -0,0 +1,101 @@
+/*
+ * HMAT ACPI Implementation
+ *
+ * Copyright(C) 2019 Intel Corporation.
+ *
+ * Author:
+ *  Liu jingqi 
+ *  Tao Xu 
+ *
+ * HMAT is defined in ACPI 6.3: 5.2.27 Heterogeneous Memory Attribute Table
+ * (HMAT)
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see 
+ */
+
+#include "qemu/osdep.h"
+#include "sysemu/numa.h"
+#include "hw/acpi/hmat.h"
+
+/*
+ * ACPI 6.3:
+ * 5.2.27.3 Memory Proximity Domain Attributes Structure: Table 5-141
+ */
+static void build_hmat_mpda(GArray *table_data, uint16_t flags, int initiator,
+   int mem_node)
+{
+
+/* Memory Proximity Domain Attributes Structure */
+/* Type */
+build_append_int_noprefix(table_data, 0, 2);
+/* Reserved */
+build_append_int_noprefix(table_data, 0, 2);
+/* Length */
+build_append_int_noprefix(table_data, 40, 4);
+/* Flags */
+build_append_int_noprefix(table_data, flags, 2);
+/* Reserved */
+build_append_int_noprefix(table_data, 0, 2);
+/* Proximity Domain for the Attached Initiator */
+build_append_int_noprefix(table_data, initiator, 4);
+/* Proximity Domain for the Memory */
+build_append_int_noprefix(table_data, mem_node, 4);
+/* Reserved */
+build_append_int_noprefix(table_data, 0, 4);
+/*
+ * Reserved:
+ * Previously defined as the Start Address of the System Physical
+ * Address Range. Deprecated since ACPI Spec 6.3.
+ */
+build_append_int_noprefix(table_data, 0, 8);
+/*
+ * Reserved:
+ * Previously defined as the Range Length of the region in bytes.
+ * Deprecated since ACPI Spec 6.3.
+ */
+build_append_int_noprefix(table_data, 0, 8);
+}
+
+/* Build HMAT sub table structures */
+static void hmat_build_table_structs(GArray *table_data, NumaState *nstat)
+{
+uint16_t flags;
+int i;
+
+for (i = 0; i < nstat->num_nodes; i++) {
+flags = 0;
+
+if (nstat->nodes[i].initia

[Qemu-devel] [PATCH v6 05/14] numa: Extend CLI to provide initiator information for numa nodes

2019-07-07 Thread Tao Xu
In ACPI 6.3 chapter 5.2.27, Heterogeneous Memory Attribute Table (HMAT),
an initiator represents a processor that accesses memory. In 5.2.27.3,
Memory Proximity Domain Attributes Structure, the attached initiator is
defined as the proximity domain of the memory controller responsible for
a memory proximity domain. With the attached initiator information, the
topology of heterogeneous memory can be described.

Extend the CLI of the "-numa node" option to indicate the initiator
NUMA node-id.
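The diff below enforces two rules at machine init: a node that contains CPUs must be its own initiator, and every configured initiator must itself be a CPU-bearing node. A condensed sketch of both checks (the error-reporting plumbing of machine.c is omitted):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Initiator rules from this patch: a CPU node is implicitly its own
 * initiator (machine_set_cpu_numa_node), and a memory-only node may
 * only name an initiator that actually has CPUs
 * (machine_numa_finish_cpu_init).
 */
static bool initiator_config_valid(uint16_t node, bool node_has_cpu,
                                   uint16_t initiator,
                                   bool initiator_has_cpu)
{
    if (node_has_cpu && initiator != node) {
        return false;   /* a CPU node must be its own initiator */
    }
    if (!initiator_has_cpu) {
        return false;   /* the initiator must be a CPU-bearing node */
    }
    return true;
}
```

For example, a memory-only node 1 with initiator 0 is valid as long as node 0 has CPUs, while pointing a CPU node at a different initiator is rejected.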

Suggested-by: Dan Williams 
Signed-off-by: Tao Xu 
---
 hw/core/machine.c | 24 
 include/sysemu/numa.h |  3 +++
 numa.c| 13 +
 qapi/misc.json|  6 +-
 qemu-options.hx   | 27 +++
 5 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 77b5967a68..c48e5b8078 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -649,6 +649,7 @@ void machine_set_cpu_numa_node(MachineState *machine,
const CpuInstanceProperties *props, Error 
**errp)
 {
 MachineClass *mc = MACHINE_GET_CLASS(machine);
+NodeInfo *numa_info = machine->numa_state->nodes;
 bool match = false;
 int i;
 
@@ -709,6 +710,16 @@ void machine_set_cpu_numa_node(MachineState *machine,
 match = true;
 slot->props.node_id = props->node_id;
 slot->props.has_node_id = props->has_node_id;
+
+if (numa_info[props->node_id].initiator_valid &&
+(props->node_id != numa_info[props->node_id].initiator)) {
+error_setg(errp, "The initiator of CPU NUMA node %" PRId64
+   " should be itself.", props->node_id);
+return;
+}
+numa_info[props->node_id].initiator_valid = true;
+numa_info[props->node_id].has_cpu = true;
+numa_info[props->node_id].initiator = props->node_id;
 }
 
 if (!match) {
@@ -974,6 +985,7 @@ static void machine_numa_finish_cpu_init(MachineState 
*machine)
 GString *s = g_string_new(NULL);
 MachineClass *mc = MACHINE_GET_CLASS(machine);
 const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(machine);
+NodeInfo *numa_info = machine->numa_state->nodes;
 
 assert(machine->numa_state->num_nodes);
 for (i = 0; i < possible_cpus->len; i++) {
@@ -1007,6 +1019,18 @@ static void machine_numa_finish_cpu_init(MachineState 
*machine)
 machine_set_cpu_numa_node(machine, &props, &error_fatal);
 }
 }
+
+for (i = 0; i < machine->numa_state->num_nodes; i++) {
+if (numa_info[i].initiator_valid &&
+!numa_info[numa_info[i].initiator].has_cpu) {
+error_report("The initiator-id %"PRIu16 " of NUMA node %d"
+ " does not exist.", numa_info[i].initiator, i);
+error_printf("\n");
+
+exit(1);
+}
+}
+
 if (s->len && !qtest_enabled()) {
 warn_report("CPU(s) not present in any NUMA nodes: %s",
 s->str);
diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
index 957ad60560..357aaeda80 100644
--- a/include/sysemu/numa.h
+++ b/include/sysemu/numa.h
@@ -10,6 +10,9 @@ struct NodeInfo {
 uint64_t node_mem;
 struct HostMemoryBackend *node_memdev;
 bool present;
+bool has_cpu;
+bool initiator_valid;
+uint16_t initiator;
 uint8_t distance[MAX_NODES];
 };
 
diff --git a/numa.c b/numa.c
index 850c7f4573..4a3a1726be 100644
--- a/numa.c
+++ b/numa.c
@@ -131,6 +131,19 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
 numa_info[nodenr].node_mem = object_property_get_uint(o, "size", NULL);
 numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
 }
+
+if (node->has_initiator) {
+if (numa_info[nodenr].initiator_valid &&
+(node->initiator != numa_info[nodenr].initiator)) {
+error_setg(errp, "The initiator of NUMA node %" PRIu16 " has been "
+   "set to node %" PRIu16, nodenr,
+   numa_info[nodenr].initiator);
+return;
+}
+
+numa_info[nodenr].initiator_valid = true;
+numa_info[nodenr].initiator = node->initiator;
+}
 numa_info[nodenr].present = true;
 max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
 ms->numa_state->num_nodes++;
diff --git a/qapi/misc.json b/qapi/misc.json
index dc4cf9da20..3059bb3119 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -2572,6 +2572,9 @@
 # @memdev: memory backend object.  If specified for one node,
 #  it must be specified for all nodes.
 #
+# @initiator: the initiator numa nodeid that is closest (as in directly
+# attached) to this numa node.
+#
 # Since: 2.1
 ##
 { 'struct': 'NumaNodeOptions',
@@ -2579,7 +2582,8 @@
'*nodeid': 'uint16',
'*cpus':   ['uint16'],
'*mem':'size',
-   '*memdev': 'str' }}
+   '*memdev': 'str',
+   '*initiator': 'uint16' }}
 
 ##
 # @NumaDistOptions:
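The rules the hunks above enforce (a NUMA node that contains CPUs must be its
own initiator, and any configured initiator must ultimately refer to a node
that has CPUs) can be sketched in Python. The dict-based node model below is
an illustrative stand-in for QEMU's NodeInfo, not real QEMU API:

```python
# Illustrative model of the initiator checks added in machine.c / numa.c.
# Each node dict mirrors the NodeInfo fields from the patch:
# has_cpu, initiator_valid, initiator.

def set_cpu_node(nodes, node_id):
    """A NUMA node that contains CPUs must be its own initiator."""
    info = nodes[node_id]
    if info.get("initiator_valid") and info.get("initiator") != node_id:
        raise ValueError(
            f"The initiator of CPU NUMA node {node_id} should be itself.")
    info.update(initiator_valid=True, has_cpu=True, initiator=node_id)

def validate_initiators(nodes):
    """Every configured initiator must point at a node that has CPUs."""
    for node_id, info in nodes.items():
        if info.get("initiator_valid") and \
                not nodes[info["initiator"]].get("has_cpu"):
            raise ValueError(
                f"The initiator-id {info['initiator']} of NUMA node "
                f"{node_id} does not exist.")

nodes = {0: {}, 1: {"initiator_valid": True, "initiator": 0}}
set_cpu_node(nodes, 0)        # node 0 holds the CPUs
validate_initiators(nodes)    # node 1 -> node 0 is valid
```

A configuration where an initiator points at a memory-only node fails the
second check, matching the error path in machine_numa_finish_cpu_init().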

[Qemu-devel] [PATCH v6 08/14] hmat acpi: Build Memory Side Cache Information Structure(s)

2019-07-07 Thread Tao Xu
From: Liu Jingqi 

This structure describes memory side cache information for memory
proximity domains if the memory side cache is present and the
physical device forms the memory side cache.
Software can use this information to place data in memory
effectively and maximize the performance of system memory that
uses the memory side cache.

Signed-off-by: Liu Jingqi 
Signed-off-by: Tao Xu 
---

Changes in v6:
- Add descriptions from ACPI 6.3 spec
---
 hw/acpi/hmat.c  | 64 -
 hw/acpi/hmat.h  | 20 +
 include/qemu/typedefs.h |  1 +
 include/sysemu/numa.h   |  3 ++
 include/sysemu/sysemu.h |  2 ++
 5 files changed, 89 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
index 6dd39b0c85..a207581f11 100644
--- a/hw/acpi/hmat.c
+++ b/hw/acpi/hmat.c
@@ -133,14 +133,63 @@ static void build_hmat_lb(GArray *table_data, HMAT_LB_Info *numa_hmat_lb,
 }
 }
 
+/* ACPI 6.3: 5.2.27.5 Memory Side Cache Information Structure: Table 5-143 */
+static void build_hmat_cache(GArray *table_data, HMAT_Cache_Info *hmat_cache)
+{
+/*
+ * Cache Attributes: Bits [3:0] – Total Cache Levels
+ * for this Memory Proximity Domain
+ */
+uint32_t cache_attr = hmat_cache->total_levels & 0xF;
+
+/* Bits [7:4] : Cache Level described in this structure */
+cache_attr |= (hmat_cache->level & 0xF) << 4;
+
+/* Bits [11:8] - Cache Associativity */
+cache_attr |= (hmat_cache->associativity & 0xF) << 8;
+
+/* Bits [15:12] - Write Policy */
+cache_attr |= (hmat_cache->write_policy & 0xF) << 12;
+
+/* Bits [31:16] - Cache Line size in bytes */
+cache_attr |= (hmat_cache->line_size & 0xFFFF) << 16;
+
+cache_attr = cpu_to_le32(cache_attr);
+
+/* Type */
+build_append_int_noprefix(table_data, 2, 2);
+/* Reserved */
+build_append_int_noprefix(table_data, 0, 2);
+/* Length */
+build_append_int_noprefix(table_data, 32, 4);
+/* Proximity Domain for the Memory */
+build_append_int_noprefix(table_data, hmat_cache->mem_proximity, 4);
+/* Reserved */
+build_append_int_noprefix(table_data, 0, 4);
+/* Memory Side Cache Size */
+build_append_int_noprefix(table_data, hmat_cache->size, 8);
+/* Cache Attributes */
+build_append_int_noprefix(table_data, cache_attr, 4);
+/* Reserved */
+build_append_int_noprefix(table_data, 0, 2);
+/*
+ * Number of SMBIOS handles (n)
+ * Linux kernel uses Memory Side Cache Information Structure
+ * without SMBIOS entries for now, so set Number of SMBIOS handles
+ * as 0.
+ */
+build_append_int_noprefix(table_data, 0, 2);
+}
+
 /* Build HMAT sub table structures */
 static void hmat_build_table_structs(GArray *table_data, NumaState *nstat)
 {
 uint16_t flags;
 uint32_t num_initiator = 0;
 uint32_t initiator_pxm[MAX_NODES];
-int i, hrchy, type;
+int i, hrchy, type, level;
 HMAT_LB_Info *numa_hmat_lb;
+HMAT_Cache_Info *numa_hmat_cache;
 
 for (i = 0; i < nstat->num_nodes; i++) {
 flags = 0;
@@ -174,6 +223,19 @@ static void hmat_build_table_structs(GArray *table_data, NumaState *nstat)
 }
 }
 }
+
+/*
+ * ACPI 6.3: 5.2.27.5 Memory Side Cache Information Structure:
+ * Table 5-143
+ */
+for (i = 0; i < nstat->num_nodes; i++) {
+for (level = 0; level <= MAX_HMAT_CACHE_LEVEL; level++) {
+numa_hmat_cache = nstat->hmat_cache[i][level];
+if (numa_hmat_cache) {
+build_hmat_cache(table_data, numa_hmat_cache);
+}
+}
+}
 }
 
 void build_hmat(GArray *table_data, BIOSLinker *linker, NumaState *nstat)
diff --git a/hw/acpi/hmat.h b/hw/acpi/hmat.h
index 9d5f407b8a..ba655281cc 100644
--- a/hw/acpi/hmat.h
+++ b/hw/acpi/hmat.h
@@ -79,6 +79,26 @@ struct HMAT_LB_Info {
 uint16_tbandwidth[MAX_NODES][MAX_NODES];
 };
 
+struct HMAT_Cache_Info {
+/* The memory proximity domain to which the memory belongs. */
+uint32_tmem_proximity;
+/* Size of memory side cache in bytes. */
+uint64_tsize;
+/*
+ * Total cache levels for this memory
+ * proximity domain.
+ */
+uint8_t total_levels;
+/* Cache level described in this structure. */
+uint8_t level;
+/* Cache Associativity: None/Direct Mapped/Complex Cache Indexing */
+uint8_t associativity;
+/* Write Policy: None/Write Back(WB)/Write Through(WT) */
+uint8_t write_policy;
+/* Cache Line size in bytes. */
+uint16_tline_size;
+};
+
 void build_hmat(GArray *table_data, BIOSLinker *linker, NumaState *nstat);
 
 #endif
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index c0257e936b..d971f5109e 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -33,6 +33,7 @@ typedef struct FWCfgEntry FWCfgEntry;
 typedef struct FWCfgIoState FWCfgIoState;
 typedef struct FWCf
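The Cache Attributes bit layout assembled in build_hmat_cache() above can be
sketched as a small pack/unpack pair. This is a Python illustration of the
masks and shifts taken from the hunk (ACPI 6.3 Table 5-143), not QEMU code:

```python
def pack_cache_attr(total_levels, level, associativity, write_policy,
                    line_size):
    """Pack the HMAT Cache Attributes dword per ACPI 6.3 Table 5-143."""
    attr = total_levels & 0xF            # bits [3:0]   total cache levels
    attr |= (level & 0xF) << 4           # bits [7:4]   cache level
    attr |= (associativity & 0xF) << 8   # bits [11:8]  associativity
    attr |= (write_policy & 0xF) << 12   # bits [15:12] write policy
    attr |= (line_size & 0xFFFF) << 16   # bits [31:16] line size in bytes
    return attr

def unpack_cache_attr(attr):
    """Split the packed dword back into its fields."""
    return {
        "total_levels": attr & 0xF,
        "level": (attr >> 4) & 0xF,
        "associativity": (attr >> 8) & 0xF,
        "write_policy": (attr >> 12) & 0xF,
        "line_size": (attr >> 16) & 0xFFFF,
    }

# Values matching the hmat-cache example later in this series:
# total=1, level=1, direct-mapped (1), write-back (1), line=8.
attr = pack_cache_attr(total_levels=1, level=1, associativity=1,
                       write_policy=1, line_size=8)
```

Round-tripping through unpack_cache_attr() is an easy way to sanity-check
the shifts against the spec table.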

[Qemu-devel] [PATCH v6 14/14] tests/bios-tables-test: add test cases for ACPI HMAT

2019-07-07 Thread Tao Xu
ACPI table HMAT has been introduced, QEMU now builds HMAT tables for
Heterogeneous Memory with boot option '-numa node'.

Add test cases on PC and Q35 machines with 2 numa nodes.
Because HMAT is generated when the system enables NUMA, the
following tables need to be added for this test:
  tests/acpi-test-data/pc/*.acpihmat
  tests/acpi-test-data/pc/HMAT.*
  tests/acpi-test-data/q35/*.acpihmat
  tests/acpi-test-data/q35/HMAT.*

Suggested-by: Igor Mammedov 
Signed-off-by: Tao Xu 
---
 tests/bios-tables-test.c | 43 
 1 file changed, 43 insertions(+)

diff --git a/tests/bios-tables-test.c b/tests/bios-tables-test.c
index 0ce55182f2..a4f0dbb751 100644
--- a/tests/bios-tables-test.c
+++ b/tests/bios-tables-test.c
@@ -844,6 +844,47 @@ static void test_acpi_piix4_tcg_dimm_pxm(void)
 test_acpi_tcg_dimm_pxm(MACHINE_PC);
 }
 
+static void test_acpi_tcg_acpi_hmat(const char *machine)
+{
+test_data data;
+
+memset(&data, 0, sizeof(data));
+data.machine = machine;
+data.variant = ".acpihmat";
+test_acpi_one(" -smp 2,sockets=2"
+  " -m 128M,slots=2,maxmem=1G"
+  " -object memory-backend-ram,size=64M,id=m0"
+  " -object memory-backend-ram,size=64M,id=m1"
+  " -numa node,nodeid=0,memdev=m0"
+  " -numa node,nodeid=1,memdev=m1,initiator=0"
+  " -numa cpu,node-id=0,socket-id=0"
+  " -numa cpu,node-id=0,socket-id=1"
+  " -numa hmat-lb,initiator=0,target=0,hierarchy=memory,"
+  "data-type=access-latency,base-lat=10,latency=5"
+  " -numa hmat-lb,initiator=0,target=0,hierarchy=memory,"
+  "data-type=access-bandwidth,base-bw=20,bandwidth=5"
+  " -numa hmat-lb,initiator=0,target=1,hierarchy=memory,"
+  "data-type=access-latency,base-lat=10,latency=10"
+  " -numa hmat-lb,initiator=0,target=1,hierarchy=memory,"
+  "data-type=access-bandwidth,base-bw=20,bandwidth=10"
+  " -numa hmat-cache,node-id=0,size=0x2,total=1,level=1"
+  ",assoc=direct,policy=write-back,line=8"
+  " -numa hmat-cache,node-id=1,size=0x2,total=1,level=1"
+  ",assoc=direct,policy=write-back,line=8",
+  &data);
+free_test_data(&data);
+}
+
+static void test_acpi_q35_tcg_acpi_hmat(void)
+{
+test_acpi_tcg_acpi_hmat(MACHINE_Q35);
+}
+
+static void test_acpi_piix4_tcg_acpi_hmat(void)
+{
+test_acpi_tcg_acpi_hmat(MACHINE_PC);
+}
+
 static void test_acpi_virt_tcg(void)
 {
 test_data data = {
@@ -888,6 +929,8 @@ int main(int argc, char *argv[])
 qtest_add_func("acpi/q35/numamem", test_acpi_q35_tcg_numamem);
 qtest_add_func("acpi/piix4/dimmpxm", test_acpi_piix4_tcg_dimm_pxm);
 qtest_add_func("acpi/q35/dimmpxm", test_acpi_q35_tcg_dimm_pxm);
+qtest_add_func("acpi/piix4/acpihmat", test_acpi_piix4_tcg_acpi_hmat);
+qtest_add_func("acpi/q35/acpihmat", test_acpi_q35_tcg_acpi_hmat);
 } else if (strcmp(arch, "aarch64") == 0) {
 qtest_add_func("acpi/virt", test_acpi_virt_tcg);
 }
-- 
2.20.1




[Qemu-devel] [PATCH v6 09/14] numa: Extend the CLI to provide memory latency and bandwidth information

2019-07-07 Thread Tao Xu
From: Liu Jingqi 

Add -numa hmat-lb option to provide System Locality Latency and
Bandwidth Information. These memory attributes help to build
System Locality Latency and Bandwidth Information Structure(s)
in ACPI Heterogeneous Memory Attribute Table (HMAT).

Signed-off-by: Liu Jingqi 
Signed-off-by: Tao Xu 
---

Changes in v6:
- Update the descriptions per ACPI 6.3 (the entry base unit for latency
  is in picoseconds)
---
 numa.c  | 127 
 qapi/misc.json  | 103 ++-
 qemu-options.hx |  45 -
 3 files changed, 272 insertions(+), 3 deletions(-)

diff --git a/numa.c b/numa.c
index 4a3a1726be..db5fe8c12f 100644
--- a/numa.c
+++ b/numa.c
@@ -40,6 +40,7 @@
 #include "qemu/option.h"
 #include "qemu/config-file.h"
 #include "qemu/cutils.h"
+#include "hw/acpi/hmat.h"
 
 QemuOptsList qemu_numa_opts = {
 .name = "numa",
@@ -186,6 +187,126 @@ void parse_numa_distance(MachineState *ms, NumaDistOptions *dist, Error **errp)
 ms->numa_state->have_numa_distance = true;
 }
 
+static void parse_numa_hmat_lb(MachineState *ms, NumaHmatLBOptions *node,
+   Error **errp)
+{
+int nb_numa_nodes = ms->numa_state->num_nodes;
+NodeInfo *numa_info = ms->numa_state->nodes;
+HMAT_LB_Info *hmat_lb = NULL;
+
+if (node->data_type <= HMATLB_DATA_TYPE_WRITE_LATENCY) {
+if (!node->has_latency) {
+error_setg(errp, "Missing 'latency' option.");
+return;
+}
+if (node->has_bandwidth) {
+error_setg(errp, "Invalid option 'bandwidth' since "
+   "the data type is latency.");
+return;
+}
+if (node->has_base_bw) {
+error_setg(errp, "Invalid option 'base_bw' since "
+   "the data type is latency.");
+return;
+}
+}
+
+if (node->data_type >= HMATLB_DATA_TYPE_ACCESS_BANDWIDTH) {
+if (!node->has_bandwidth) {
+error_setg(errp, "Missing 'bandwidth' option.");
+return;
+}
+if (node->has_latency) {
+error_setg(errp, "Invalid option 'latency' since "
+   "the data type is bandwidth.");
+return;
+}
+if (node->has_base_lat) {
+error_setg(errp, "Invalid option 'base_lat' since "
+   "the data type is bandwidth.");
+return;
+}
+}
+
+if (node->initiator >= nb_numa_nodes) {
+error_setg(errp, "Invalid initiator=%"
+   PRIu16 ", it should be less than %d.",
+   node->initiator, nb_numa_nodes);
+return;
+}
+if (!numa_info[node->initiator].has_cpu) {
+error_setg(errp, "Invalid initiator=%"
+   PRIu16 ", it isn't an initiator proximity domain.",
+   node->initiator);
+return;
+}
+
+if (node->target >= nb_numa_nodes) {
+error_setg(errp, "Invalid target=%"
+   PRIu16 ", it should be less than %d.",
+   node->target, nb_numa_nodes);
+return;
+}
+if (!numa_info[node->target].initiator_valid) {
+error_setg(errp, "Invalid target=%"
+   PRIu16 ", it hasn't a valid initiator proximity domain.",
+   node->target);
+return;
+}
+
+if (node->has_latency) {
+hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
+
+if (!hmat_lb) {
+hmat_lb = g_malloc0(sizeof(*hmat_lb));
+ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
+} else if (hmat_lb->latency[node->initiator][node->target]) {
+error_setg(errp, "Duplicate configuration of the latency for "
+   "initiator=%" PRIu16 " and target=%" PRIu16 ".",
+   node->initiator, node->target);
+return;
+}
+
+/* Only the first time of setting the base unit is valid. */
+if ((hmat_lb->base_lat == 0) && (node->has_base_lat)) {
+hmat_lb->base_lat = node->base_lat;
+}
+
+hmat_lb->latency[node->initiator][node->target] = node->latency;
+}
+
+if (node->has_bandwidth) {
+hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
+
+if (!hmat_lb) {
+hmat_lb = g_malloc0(sizeof(*hmat_lb));
+ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
+} else if (hmat_lb->bandwidth[node->initiator][node->target]) {
+error_setg(errp, "Duplicate configuration of the bandwidth for "
+   "initiator=%" PRIu16 " and target=%" PRIu16 ".",
+   node->initiator, node->target);
+return;
+}
+
+/* Only the first time of setting the base unit is valid. */
+if (hmat_lb->base_bw == 0) {
+if (!node->
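The mutual-exclusion checks in parse_numa_hmat_lb() above (latency data types
must carry 'latency' and reject bandwidth options, and vice versa) can be
sketched in Python. The data-type names below are illustrative stand-ins for
the QAPI-generated enum values:

```python
# Data types mirroring the patch: the first group selects latency,
# the second selects bandwidth.
LATENCY_TYPES = {"access-latency", "read-latency", "write-latency"}
BANDWIDTH_TYPES = {"access-bandwidth", "read-bandwidth", "write-bandwidth"}

def check_hmat_lb_options(data_type, **opts):
    """Reject option combinations that parse_numa_hmat_lb() errors out on."""
    if data_type in LATENCY_TYPES:
        if "latency" not in opts:
            raise ValueError("Missing 'latency' option.")
        for bad in ("bandwidth", "base_bw"):
            if bad in opts:
                raise ValueError(
                    f"Invalid option '{bad}' since the data type is latency.")
    elif data_type in BANDWIDTH_TYPES:
        if "bandwidth" not in opts:
            raise ValueError("Missing 'bandwidth' option.")
        for bad in ("latency", "base_lat"):
            if bad in opts:
                raise ValueError(
                    f"Invalid option '{bad}' since the data type is "
                    "bandwidth.")
    return True
```

The same shape of check runs before the initiator/target range validation in
the patch, so malformed -numa hmat-lb options fail fast.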

[Qemu-devel] [PATCH v6 11/14] acpi: introduce aml_build_runtime_buf for NFIT generalizations

2019-07-07 Thread Tao Xu
Move the _FIT method's buffer AML-build code into
aml_build_runtime_buf() so that both NFIT and HMAT can use it.

Suggested-by: Igor Mammedov 
Signed-off-by: Tao Xu 
---

Changes in v6:
- Add more commit message and change the function name
---
 hw/acpi/nvdimm.c| 49 +++--
 include/hw/mem/nvdimm.h |  6 +
 2 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 9fdad6dc3f..0eb57245d3 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -1140,12 +1140,11 @@ static void nvdimm_build_device_dsm(Aml *dev, uint32_t handle)
 
 static void nvdimm_build_fit(Aml *dev)
 {
-Aml *method, *pkg, *buf, *buf_size, *offset, *call_result;
-Aml *whilectx, *ifcond, *ifctx, *elsectx, *fit;
+Aml *method, *pkg, *buf, *buf_name, *buf_size, *call_result;
 
 buf = aml_local(0);
 buf_size = aml_local(1);
-fit = aml_local(2);
+buf_name = aml_local(2);
 
 aml_append(dev, aml_name_decl(NVDIMM_DSM_RFIT_STATUS, aml_int(0)));
 
@@ -1164,6 +1163,22 @@ static void nvdimm_build_fit(Aml *dev)
 aml_int(1) /* Revision 1 */,
 aml_int(0x1) /* Read FIT */,
 pkg, aml_int(NVDIMM_QEMU_RSVD_HANDLE_ROOT));
+
+aml_build_runtime_buf(method, buf, buf_size,
+  call_result, buf_name, dev,
+  "RFIT", "_FIT",
+  NVDIMM_DSM_RET_STATUS_SUCCESS,
+  NVDIMM_DSM_RET_STATUS_FIT_CHANGED);
+}
+
+void aml_build_runtime_buf(Aml *method, Aml *buf, Aml *buf_size,
+   Aml *call_result, Aml *buf_name, Aml *dev,
+   const char *help_function, const char *method_name,
+   int ret_status_success,
+   int ret_status_changed)
+{
+Aml *offset, *whilectx, *ifcond, *ifctx, *elsectx;
+
 aml_append(method, aml_store(call_result, buf));
 
 /* handle _DSM result. */
@@ -1174,7 +1189,7 @@ static void nvdimm_build_fit(Aml *dev)
  aml_name(NVDIMM_DSM_RFIT_STATUS)));
 
  /* if something is wrong during _DSM. */
-ifcond = aml_equal(aml_int(NVDIMM_DSM_RET_STATUS_SUCCESS),
+ifcond = aml_equal(aml_int(ret_status_success),
aml_name("STAU"));
 ifctx = aml_if(aml_lnot(ifcond));
 aml_append(ifctx, aml_return(aml_buffer(0, NULL)));
@@ -1185,7 +1200,7 @@ static void nvdimm_build_fit(Aml *dev)
 aml_int(4) /* the size of "STAU" */,
 buf_size));
 
-/* if we read the end of fit. */
+/* if we read the end of buff method. */
 ifctx = aml_if(aml_equal(buf_size, aml_int(0)));
 aml_append(ifctx, aml_return(aml_buffer(0, NULL)));
 aml_append(method, ifctx);
@@ -1196,38 +1211,38 @@ static void nvdimm_build_fit(Aml *dev)
 aml_append(method, aml_return(aml_name("BUFF")));
 aml_append(dev, method);
 
-/* build _FIT. */
-method = aml_method("_FIT", 0, AML_SERIALIZED);
+/* build buff method. */
+method = aml_method(method_name, 0, AML_SERIALIZED);
 offset = aml_local(3);
 
-aml_append(method, aml_store(aml_buffer(0, NULL), fit));
+aml_append(method, aml_store(aml_buffer(0, NULL), buf_name));
 aml_append(method, aml_store(aml_int(0), offset));
 
 whilectx = aml_while(aml_int(1));
-aml_append(whilectx, aml_store(aml_call1("RFIT", offset), buf));
+aml_append(whilectx, aml_store(aml_call1(help_function, offset), buf));
 aml_append(whilectx, aml_store(aml_sizeof(buf), buf_size));
 
 /*
- * if fit buffer was changed during RFIT, read from the beginning
- * again.
+ * if buffer was changed during runtime,
+ * read from the beginning again.
  */
 ifctx = aml_if(aml_equal(aml_name(NVDIMM_DSM_RFIT_STATUS),
- aml_int(NVDIMM_DSM_RET_STATUS_FIT_CHANGED)));
-aml_append(ifctx, aml_store(aml_buffer(0, NULL), fit));
+ aml_int(ret_status_changed)));
+aml_append(ifctx, aml_store(aml_buffer(0, NULL), buf_name));
 aml_append(ifctx, aml_store(aml_int(0), offset));
 aml_append(whilectx, ifctx);
 
 elsectx = aml_else();
 
-/* finish fit read if no data is read out. */
+/* finish buff read if no data is read out. */
 ifctx = aml_if(aml_equal(buf_size, aml_int(0)));
-aml_append(ifctx, aml_return(fit));
+aml_append(ifctx, aml_return(buf_name));
 aml_append(elsectx, ifctx);
 
 /* update the offset. */
 aml_append(elsectx, aml_add(offset, buf_size, offset));
-/* append the data we read out to the fit buffer. */
-aml_append(elsectx, aml_concatenate(fit, buf, fit));
+/* append the data we read out to the buffer. */
+aml_append(elsectx, aml_concatenate(buf_name, buf, buf_name));
 aml_append(whilectx, elsectx);
 aml_append(method, whilectx);
 
diff --
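The AML loop that aml_build_runtime_buf() generates (read chunks via a helper
method until a zero-length buffer comes back, restarting from offset 0 when
the status reports the buffer changed mid-read) can be simulated in Python to
show the control flow. The chunked reader here is purely illustrative:

```python
def read_runtime_buf(read_chunk):
    """Mimic the generated method: concatenate chunks returned by
    read_chunk(offset); restart from scratch when it signals 'changed'."""
    buf_name = b""
    offset = 0
    while True:
        chunk, changed = read_chunk(offset)
        if changed:            # buffer changed during read: start over
            buf_name = b""
            offset = 0
            continue
        if not chunk:          # zero-length read: we are done
            return buf_name
        offset += len(chunk)
        buf_name += chunk

data = b"0123456789abcdef"

def stable_reader(offset):
    """Hand out 4-byte chunks of a fixed buffer, never changing."""
    return data[offset:offset + 4], False

result = read_runtime_buf(stable_reader)
```

The restart branch corresponds to the NVDIMM_DSM_RET_STATUS_FIT_CHANGED
check in the AML, which the patch parameterizes as ret_status_changed.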

Re: [Qemu-devel] [PATCH v15 6/7] ext4: disable map_sync for async flush

2019-07-07 Thread Theodore Ts'o
On Fri, Jul 05, 2019 at 07:33:27PM +0530, Pankaj Gupta wrote:
> Dont support 'MAP_SYNC' with non-DAX files and DAX files
> with asynchronous dax_device. Virtio pmem provides
> asynchronous host page cache flush mechanism. We don't
> support 'MAP_SYNC' with virtio pmem and ext4.
> 
> Signed-off-by: Pankaj Gupta 
> Reviewed-by: Jan Kara 

Acked-by: Theodore Ts'o 




[Qemu-devel] Handling of fall through code (was: [PATCH v8 04/87] target/mips: Mark switch fallthroughs with interpretable comments)

2019-07-07 Thread Stefan Weil

Am 13.08.18 um 19:52 schrieb Aleksandar Markovic:


From: Aleksandar Markovic 

Mark switch fallthroughs with comments, in cases fallthroughs
are intentional.



This is a general problem all over the QEMU code. I usually compile with 
nearly all warnings enabled and now get lots of errors with the latest 
code after updating to gcc-8.3.0 (Debian buster). It should be 
reproducible by enabling -Werror=implicit-fallthrough.


The current situation is like this:

- Some code has fallthrough comments which are accepted by the compiler.

- Other code has fallthrough comments which are not accepted (resulting 
in a compiler error).


- Some code is correct, but has no indication that the fallthrough is 
intentional.


- There is also fallthrough code which is obviously not correct (even in 
target/mips/translate.c).



I suggest to enable -Werror=implicit-fallthrough by default and add a 
new macro to mark all fallthrough locations which are correct, but not 
accepted by the compiler.


This can be done with a definition for GCC compatible compilers in 
include/qemu/compiler.h:


#define QEMU_FALLTHROUGH __attribute__ ((fallthrough))

Then fallthrough code would look like this:

    case 1:
        do_something();
        QEMU_FALLTHROUGH;

    case 2:


VIXL_FALLTHROUGH also needs a similar definition to work with gcc-8.3.0.

Please comment. Would you prefer another macro name or a macro with 
parentheses like this:


#define QEMU_FALLTHROUGH() __attribute__ ((fallthrough))


As soon as there is consensus on the macro name and form, I can send a 
patch which adds it (but would not mind if someone else adds it).


Then I suggest that the maintainers build with the fallthrough warning 
enabled and fix all warnings, either by using the new macro or by adding 
the missing break.


Finally we can enforce the warning by default.


Another macro which is currently missing is a scanf variant of GCC_FMT_ATTR.

I suggest to add and use a GCC_SCANF_ATTR macro:

#define GCC_SCANF_ATTR(n, m) __attribute__((format(gnu_scanf, n, m)))

A more regular solution would require renaming GCC_FMT_ATTR to 
GCC_FMT_PRINTF and use GCC_FMT_SCANF for the new macro.



Regards
Stefan Weil





Re: [Qemu-devel] [PATCH v3 5/6] monitor: adding info tb and tbs to monitor

2019-07-07 Thread Vanderson Martins do Rosario
Markus,

Thank you for your comments! Based on your questions and your suggestion
to write more complete explanations in my commits, I decided to start
describing our whole work on the wiki:
https://wiki.qemu.org/Internships/ProjectIdeas/TCGCodeQuality
I will update and expand it weekly, so I can link and cite it in future
patches to give a better overall picture for anyone interested.

Btw, thank you for your QMP tips, I will discuss with Alex how to address
it in our approach.

[]'s
Vanderson M. Rosario


On Sat, Jul 6, 2019 at 3:09 AM Markus Armbruster  wrote:

> Cc: Marc-André, who has patches that might be useful here.
>
> Alex Bennée  writes:
>
> > Markus Armbruster  writes:
> >
> >> vandersonmr  writes:
> >>
> > 
> >
> > I'll leave Vanderson to address your other comments.
> >
> >>
> >> Debugging commands are kind of borderline.  Debugging is commonly a
> >> human activity, where HMP is just fine.  However, humans create tools to
> >> assist with their activities, and then QMP is useful.  While I wouldn't
> >> encourage HMP-only for the debugging use case, I wouldn't veto it.
> >>
> >> Your (overly terse!) commit message and help texts make me guess the
> >> commands are for gathering statistics.  Statistics can have debugging
> >> uses.  But they often have non-debugging uses as well.  What use cases
> >> can you imagine for these commands?
> >
> > So this is all really aimed at making TCG go faster - but before we can
> > make it go faster we need better tools for seeing where the time is
> > being spent and examining the code that we generate. So I expect the
> > main users of this functionality will be QEMU developers.
> >
> > That said I can see a good rationale for supporting QMP because it is
> > more amenable to automation. However this is early days so I would
> > caution about exposing this stuff too early lest we bake in a woolly
> > API.
>
> Development tools should exempt themselves from QMP's interface
> stability promise: prefix the command names with 'x-'.
>
> > The other wrinkle is we do have to take control of the emulator to
> > safely calculate some of the numbers we output. This essentially means
> > the HMP commands are asynchronous - we kick of safe work which waits
> > until all vCPU threads are stopped before we go through the records and
> > add up numbers. This is fine for the HMP because we just output to the
> > monitor FD when we are ready. I assume for QMP commands there is more
> > housekeeping to do? Can QMP commands wait for a response to be
> > calculated by another thread? Are there any existing commands that have
> > to support this sort of pattern?
>
> Let me clarify "synchronous" to avoid confusion.
>
> Both QMP and HMP commands are synchronous protocols in the sense that
> commands are executed one after the other, without overlap.  When a
> client sends multiple commands, it can assume that each one starts only
> after the previous one completed.
>
> Both HMP and QMP commands execute synchronously in the sense that the
> command runs to completion without ever yielding the thread.  Any
> blocking operations put the thread to sleep (but see below).
>
> HMP runs in the main thread.  Putting the main thread to sleep is
> generally undesirable.
>
> QMP used to run in the main thread, too.  Nowadays, the QMP core runs in
> an I/O thread shared by all monitors, and dispatches commands to the
> main thread.  Moving command execution out of the main thread as well
> requires careful review of the command's code for hidden assumptions.
> Major project.
>
> Fine print: OOB commands are a special case, but I doubt you want to
> know more.
>
> Fine print: certain character devices can't support use of an I/O
> thread; QMP runs in the main thread then.  The ones you want to use with
> QMP all support I/O threads.
>
> You wrote "we kick of safe work which waits until all vCPU threads are
> stopped before we go through the records and add up numbers [...] we
> just output to the monitor FD".  Does this mean the HMP command kicks
> off the work, terminates, and some time later something else prints
> results to the monitor?  How much later?
>
> If "later" is actually "soon", for a suitable value of "soon",
> Marc-André's work on "asynchronous" QMP might be pertinent.  I put
> "asynchronous" in scare quotes, because of the confusion it has caused.
> My current understanding (Marc-André, please correct me if wrong): it
> lets QMP commands to block without putting their thread to sleep.  It
> does not make QMP an asynchronous protocol.
>
> If "later" need not be "soon", read on.
>
> In QMP, there are two established ways to do potentially long-running
> work.  Both ways use a command that kicks off the work, then terminates
> without waiting for it to complete.
>
> The first way is traditional: pair the kick off command with a query
> command and optionally an event.
>
> When the work completes, it fires off the event.  The event is broadcast
> to all QMP monitors (we could imple
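The kick-off/query/event pattern Markus describes can be sketched as a toy
state machine in Python. The command and event names are illustrative only,
not real QMP commands:

```python
import threading

class AsyncJob:
    """Toy model of the QMP pattern: a command kicks off work and returns
    immediately; a query polls status; an event callback fires on
    completion."""

    def __init__(self, on_event):
        self.status = "idle"
        self.result = None
        self._on_event = on_event

    def kick_off(self, work):
        """The kick-off command: start the work, do not wait for results."""
        self.status = "running"
        def runner():
            self.result = work()
            self.status = "done"
            self._on_event({"event": "JOB_COMPLETED",
                            "result": self.result})
        t = threading.Thread(target=runner)
        t.start()
        t.join()   # joined here only so the example is deterministic

    def query(self):
        """The paired query command: report current status and result."""
        return {"status": self.status, "result": self.result}

events = []
job = AsyncJob(events.append)
job.kick_off(lambda: sum(range(10)))
```

A client that misses the broadcast event can still recover the result by
polling query(), which is why the traditional pattern pairs both.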

[Qemu-devel] [Bug 1835693] [NEW] s390x binaries segfault

2019-07-07 Thread mcandre
Public bug reported:

Hello World appears to segfault with qemu s390x, on a Debian 10.0.0
Buster amd64 host.

$ cat hello.cpp 
#include <iostream>
using std::cout;

int main() {
cout << "Hello World!\n";
return 0;
}

$ s390x-linux-gnu-g++ -o hello hello.cpp

$ qemu-s390x-static hello
Segmentation fault

** Affects: qemu
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1835693

Title:
  s390x binaries segfault

Status in QEMU:
  New

Bug description:
  Hello World appears to segfault with qemu s390x, on a Debian 10.0.0
  Buster amd64 host.

  $ cat hello.cpp 
  #include <iostream>
  using std::cout;

  int main() {
  cout << "Hello World!\n";
  return 0;
  }

  $ s390x-linux-gnu-g++ -o hello hello.cpp

  $ qemu-s390x-static hello
  Segmentation fault

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1835693/+subscriptions



[Qemu-devel] [Bug 1835694] [NEW] hardware-based time keeping

2019-07-07 Thread Abdulla
Public bug reported:

Hi all,

I hope you're all doing well.

I am looking for a solution to a particular problem in Qemu/KVM
virtualization.

My issue is that I have a virtual machine that runs well in VMware, but when
I migrated it to a Qemu/KVM-enabled environment, it didn't work! I figured
out that under the VMware hypervisor, VMware supplies CPU TSC and Performance
Counter values to the guest VM when the option
"monitor_control.pseudo_perfctr = TRUE" is set in the vmx configuration file.
Ref.: https://www.vmware.com/pdf/vmware_timekeeping.pdf

My question is, is there any similar option in Qemu/KVM-enabled environment
that I can use to get my VM working the same way as in the VMware
environment?

I have tried almost all CPU-related options in Qemu, but to no avail.

To elaborate more, the VM I'm trying to port to the Qemu/KVM environment is
an old version of the Cisco virtual ASA Firewall. The VM image is actually
meant to be run under VMware ESXi, and with the
"monitor_control.pseudo_perfctr = TRUE" option it can also run in VMware
Workstation. It is this option that makes it run under VMware: if it's
removed from the vmx configuration file, the VM boots half way and crashes
the same way it crashes under Qemu. That suggests an equivalent option
needs to be found in Qemu/KVM. I have a copy of this VM in the below link
in case you would like to try its behavior under VMware. I downloaded it
previously from a YouTube video to test it out:

https://drive.google.com/open?id=1SEXws18hoj2sWGk8iFqqH8RpBZsBNpRH

Once you power on the VM you can telnet to 127.0.0.1 on port 3000 to see
the boot process. If you remove the option I mentioned and boot the
VM again, you'll see the crash in progress.


I've converted those vmdk disk images into Qemu's "qcow2" format and
ran them using the below command line on Ubuntu:

/opt/qemu/bin/qemu-system-x86_64 -L -nographic -device
e1000-82545em,netdev=net0,mac=50:00:00:6a:00:00 -netdev
tap,id=net0,ifname=vunl0_33_0,script=no -device
e1000-82545em,netdev=net1,mac=50:00:00:6a:00:01 -netdev
tap,id=net1,ifname=vunl0_33_1,script=no -device
e1000-82545em,netdev=net2,mac=50:00:00:6a:00:02 -netdev
tap,id=net2,ifname=vunl0_33_2,script=no -device
e1000-82545em,netdev=net3,mac=50:00:00:6a:00:03 -netdev
tap,id=net3,ifname=vunl0_33_3,script=no -machine type=pc-1.0  *-cpu
host,migratable=off,invtsc=on,pmu=on* -m 4096 -hda hda.qcow2 -hdb hdb.qcow2
-serial telnet:0.0.0.0:3000,server,nowait -monitor
tcp:127.0.0.1:42379,server,nowait
-nographic -display none -enable-kvm


Once you power on the VM you can telnet to xx.xx.xx.xx 3000 (where the xx
IP is the Ubuntu machine IP) to see the crash in progress. You may need
to wait for a while for the status messages to appear in the terminal
window.

I assume it's a CPU issue because page 9 of the VMware PDF reference
says certain machine instructions become available when the
"monitor_control.pseudo_perfctr = TRUE" option is enabled:

The following machine instructions then become available:

Instruction       Time Value Returned
rdpmc   0x10000   Physical host TSC
rdpmc   0x10001   Elapsed real time in ns
rdpmc   0x10002   Elapsed apparent time in ns

Therefore, I used many of the Qemu cpu options such as these:

-cpu host,migratable=no,+invtsc (ref: https://wiki.qemu.org/ChangeLog/2.1)
-cpu host, tsc-frequency= (ref: https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03555.html)
 -cpu host,migratable=off,invtsc=true,pmu=true

I'm not sure whether I'm hitting the wrong options!

The log I'm getting when the VM boots up is shown below; the crash
happens at the end of the log:

Loading...

Starting image verification
Hash Computation:100% Done!
Computed Hash   SHA2: 63c1e8aa9de3d0c6e738dc91be8e1784
  5caf64af4cf06cf6a3c5da7200d478dd
  938d380d2b1064f6a349401c7860f50e
  cc4eeb98a0ae16c097dbc9447d4d6626

Get key records from key storage: Primary, key_store_type: 2
Embedded Hash   SHA2: 63c1e8aa9de3d0c6e738dc91be8e1784
  5caf64af4cf06cf6a3c5da7200d478dd
  938d380d2b1064f6a349401c7860f50e
  cc4eeb98a0ae16c097dbc9447d4d6626

The digital signature of the running image verified successfully
Processor memory 3183296512, Reserved memory: 0

Total NICs found: 4
i82545EM rev03 Gigabit Ethernet @ irq10 dev 6 index 03 MAC: 5000.006a.0003
i82545EM rev03 Gigabit Ethernet @ irq10 dev 5 index 02 MAC: 5000.006a.0002
i82545EM rev03 Gigabit Ethernet @ irq11 dev 4 index 01 MAC: 5000.006a.0001
i82545EM rev03 Gigabit Ethernet @ irq11 dev 3 index 00 MAC: 5000.006a.

Thread Name: lina_flash_init_thread
Page fault: Unknown
r8 0x0790
r9 0x7fff3fa8b000
   r10 0x0001
   r11 0x02

Re: [Qemu-devel] Handling of fall through code (was: [PATCH v8 04/87] target/mips: Mark switch fallthroughs with interpretable comments)

2019-07-07 Thread Markus Armbruster
Stefan Weil  writes:

> Am 13.08.18 um 19:52 schrieb Aleksandar Markovic:
>
>> From: Aleksandar Markovic 
>>
>> Mark switch fallthroughs with comments, in cases where fallthroughs
>> are intentional.
>
>
> This is a general problem all over the QEMU code. I usually compile
> with nearly all warnings enabled and get now lots of errors with the
> latest code and after updating to gcc-8.3.0 (Debian buster). It should
> be reproducible by enabling -Werror=implicit-fallthrough.
>
> The current situation is like this:
>
> - Some code has fallthrough comments which are accepted by the compiler.
>
> - Other code has fallthrough comments which are not accepted
> (resulting in a compiler error).
>
> - Some code is correct, but has no indication that the fallthrough is
> intentional.

I'd treat that as a bug.

> - There is also fallthrough code which is obviously not correct (even
> in target/mips/translate.c).

Bug.

> I suggest to enable -Werror=implicit-fallthrough by default and add a
> new macro to mark all fallthrough locations which are correct, but not
> accepted by the compiler.
>
> This can be done with a definition for GCC compatible compilers in
> include/qemu/compiler.h:
>
> #define QEMU_FALLTHROUGH __attribute__ ((fallthrough))
>
> Then fallthrough code would look like this:
>
>     case 1:
>         do_something();
>         QEMU_FALLTHROUGH;
>
>     case 2:
>
>
> VIXL_FALLTHROUGH also needs a similar definition to work with gcc-8.3.0.
>
> Please comment. Would you prefer another macro name or a macro with
> parentheses like this:
>
> #define QEMU_FALLTHROUGH() __attribute__ ((fallthrough))

In my opinion, the macro is no clearer than proper comments.

I'd prefer -Wimplicit-fallthrough=1 or 2.  The former makes gcc accept
any comment.  The latter makes it accept '.*falls?[ \t-]*thr(ough|u).*',
which should still match the majority of our comments.  Less churn than
the macro.

> As soon as there is consensus on the macro name and form, I can send a
> patch which adds it (but would not mind if someone else adds it).
>
> Then I suggest that the maintainers build with the fallthrough warning
> enabled and fix all warnings, either by using the new macro or by
> adding the missing break.
>
> Finally we can enforce the warning by default.
>
>
> Another macro which is currently missing is a scanf variant of GCC_FMT_ATTR.
>
> I suggest to add and use a GCC_SCANF_ATTR macro:
>
> #define GCC_SCANF_ATTR(n, m) __attribute__((format(gnu_scanf, n, m)))

Do we define our own scanf()-like functions?  If yes, decorating them
with the attribute is a good idea.

However, the gnu_ in gnu_scanf tells the compiler we're linking with the
GNU C Library, which seems unwise.  Hmm, we already use gnu_printf.
Commit 9c9e7d51bf0:

Newer gcc versions support format gnu_printf which is
better suited for use in QEMU than format printf
(QEMU always uses standard format strings (even with mingw32)).

Should we limit the use of gnu_printf to #ifdef _WIN32?

> A more regular solution would require renaming GCC_FMT_ATTR to
> GCC_FMT_PRINTF and use GCC_FMT_SCANF for the new macro.

Quite some churn, but regularity matters.



Re: [Qemu-devel] [PATCH v3 5/6] monitor: adding info tb and tbs to monitor

2019-07-07 Thread Marc-André Lureau
Hi

On Sat, Jul 6, 2019 at 10:09 AM Markus Armbruster  wrote:
>
> Cc: Marc-André, who has patches that might be useful here.
>
> Alex Bennée  writes:
>
> > Markus Armbruster  writes:
> >
> >> vandersonmr  writes:
> >>
> > 
> >
> > I'll leave Vanderson to address your other comments.
> >
> >>
> >> Debugging commands are kind of borderline.  Debugging is commonly a
> >> human activity, where HMP is just fine.  However, humans create tools to
> >> assist with their activities, and then QMP is useful.  While I wouldn't
> >> encourage HMP-only for the debugging use case, I wouldn't veto it.
> >>
> >> Your (overly terse!) commit message and help texts make me guess the
> >> commands are for gathering statistics.  Statistics can have debugging
> >> uses.  But they often have non-debugging uses as well.  What use cases
> >> can you imagine for these commands?
> >
> > So this is all really aimed at making TCG go faster - but before we can
> > make it go faster we need better tools for seeing where the time is
> > being spent and examining the code that we generate. So I expect the
> > main users of this functionality will be QEMU developers.
> >
> > That said I can see a good rationale for supporting QMP because it is
> > more amenable to automation. However this is early days so I would
> > caution about exposing this stuff too early lest we bake in a woolly
> > API.
>
> Development tools should exempt themselves from QMP's interface
> stability promise: prefix the command names with 'x-'.
>
> > The other wrinkle is we do have to take control of the emulator to
> > safely calculate some of the numbers we output. This essentially means
> > the HMP commands are asynchronous - we kick off safe work which waits
> > until all vCPU threads are stopped before we go through the records and
> > add up numbers. This is fine for the HMP because we just output to the
> > monitor FD when we are ready. I assume for QMP commands there is more
> > housekeeping to do? Can QMP commands wait for a response to be
> > calculated by another thread? Are there any existing commands that have
> > to support this sort of pattern?
>
> Let me clarify "synchronous" to avoid confusion.
>
> Both QMP and HMP commands are synchronous protocols in the sense that
> commands are executed one after the other, without overlap.  When a
> client sends multiple commands, it can assume that each one starts only
> after the previous one completed.
>
> Both HMP and QMP commands execute synchronously in the sense that the
> command runs to completion without ever yielding the thread.  Any
> blocking operations put the thread to sleep (but see below).
>
> HMP runs in the main thread.  Putting the main thread to sleep is
> generally undesirable.
>
> QMP used to run in the main thread, too.  Nowadays, the QMP core runs in
> an I/O thread shared by all monitors, and dispatches commands to the
> main thread.  Moving command execution out of the main thread as well
> requires careful review of the command's code for hidden assumptions.
> Major project.
>
> Fine print: OOB commands are a special case, but I doubt you want to
> know more.
>
> Fine print: certain character devices can't support use of an I/O
> thread; QMP runs in the main thread then.  The ones you want to use with
> QMP all support I/O threads.
>
> You wrote "we kick off safe work which waits until all vCPU threads are
> stopped before we go through the records and add up numbers [...] we
> just output to the monitor FD".  Does this mean the HMP command kicks
> off the work, terminates, and some time later something else prints
> results to the monitor?  How much later?
>
> If "later" is actually "soon", for a suitable value of "soon",
> Marc-André's work on "asynchronous" QMP might be pertinent.  I put
> "asynchronous" in scare quotes, because of the confusion it has caused.
> My current understanding (Marc-André, please correct me if wrong): it
> lets QMP commands to block without putting their thread to sleep.  It
> does not make QMP an asynchronous protocol.

In "[PATCH v4 00/20] monitor: add asynchronous command type", I
propose a way to defer monitor commands (both QMP and HMP). This
allows the main loop to reenter, and when the work is completed,
return the command. During that time, no other commands are accepted
(except OOB), effectively "blocking" the monitor.

The series probably doesn't apply cleanly anymore, I'll rebase and resubmit it.

>
> If "later" need not be "soon", read on.
>
> In QMP, there are two established ways to do potentially long-running
> work.  Both ways use a command that kicks off the work, then terminates
> without waiting for it to complete.
>
> The first way is traditional: pair the kick off command with a query
> command and optionally an event.
>
> When the work completes, it fires off the event.  The event is broadcast
> to all QMP monitors (we could implement unicast if we have a compelling
> use case).
>
> The query command reports whether the work has completed.

Re: [Qemu-devel] Handling of fall through code

2019-07-07 Thread Stefan Weil

On 08.07.19 at 06:40, Markus Armbruster wrote:

> Stefan Weil  writes:
>
>> - Some code is correct, but has no indication that the fallthrough is
>> intentional.
>
> I'd treat that as a bug.

Sure.

>> - There is also fallthrough code which is obviously not correct (even
>> in target/mips/translate.c).
>
> Bug.

Yes, of course.

>> I suggest to enable -Werror=implicit-fallthrough by default and add a
>> new macro to mark all fallthrough locations which are correct, but not
>> accepted by the compiler.
>>
>> This can be done with a definition for GCC compatible compilers in
>> include/qemu/compiler.h:
>>
>> #define QEMU_FALLTHROUGH __attribute__ ((fallthrough))
>>
>> Then fallthrough code would look like this:
>>
>>     case 1:
>>         do_something();
>>         QEMU_FALLTHROUGH;
>>
>>     case 2:
>>
>> VIXL_FALLTHROUGH also needs a similar definition to work with gcc-8.3.0.
>>
>> Please comment. Would you prefer another macro name or a macro with
>> parentheses like this:
>>
>> #define QEMU_FALLTHROUGH() __attribute__ ((fallthrough))
>
> In my opinion, the macro is no clearer than proper comments.
>
> I'd prefer -Wimplicit-fallthrough=1 or 2.  The former makes gcc accept
> any comment.  The latter makes it accept '.*falls?[ \t-]*thr(ough|u).*',
> which should still match the majority of our comments.  Less churn than
> the macro.

[...]

>> Another macro which is currently missing is a scanf variant of GCC_FMT_ATTR.
>>
>> I suggest to add and use a GCC_SCANF_ATTR macro:
>>
>> #define GCC_SCANF_ATTR(n, m) __attribute__((format(gnu_scanf, n, m)))
>
> Do we define our own scanf()-like functions?  If yes, decorating them
> with the attribute is a good idea.

xen_device_backend_scanf, xs_node_vscanf, xs_node_scanf,
xen_device_frontend_scanf

Maybe more. The compiler can tell you missing attributes.

> However, the gnu_ in gnu_scanf tells the compiler we're linking with the
> GNU C Library, which seems unwise.  Hmm, we already use gnu_printf.
> Commit 9c9e7d51bf0:
>
>     Newer gcc versions support format gnu_printf which is
>     better suited for use in QEMU than format printf
>     (QEMU always uses standard format strings (even with mingw32)).
>
> Should we limit the use of gnu_printf to #ifdef _WIN32?

No, because we don't want lots of conditional code with different format
strings for POSIX and Windows (I made that commit 9 years ago).

>> A more regular solution would require renaming GCC_FMT_ATTR to
>> GCC_FMT_PRINTF and use GCC_FMT_SCANF for the new macro.
>
> Quite some churn, but regularity matters.

I could do that when adding the new macro, but would like to hear more
opinions on that.

Thank you,

Stefan




Re: [Qemu-devel] [RFC v2 2/2] hw/pvrdma: add live migration support

2019-07-07 Thread Yuval Shaia
On Sat, Jul 06, 2019 at 09:39:40AM +0530, Sukrit Bhatnagar wrote:
> Use VMStateDescription for migrating device state. Currently,

What do you mean by 'Currently'?

> 'vmstate_pvrdma' describes the PCI and MSIX state for pvrdma and
> 'vmstate_pvrdma_dsr_dma' describes a temporary state containing
> some values obtained only after mapping to dsr in the source.
> Since the dsr will not be available on dest until we map to the
> dma address we had on source, these values cannot be migrated
> directly.
> 
> Add PVRDMAMigTmp to store this temporary state which consists of
> dma addresses and ring page information. The 'parent' member is
> used to refer to the device state (PVRDMADev) so that parent PCI
> device object is accessible, which is needed to remap to DSR.
> 
> pvrdma_dsr_dma_pre_save() saves the dsr state into this temporary
> representation and pvrdma_dsr_dma_post_load() loads it back.
> This load function also remaps to the dsr and calls
> load_dsr() for further map and ring init operations.
> 
> Please note that this call to load_dsr() can be removed from the
> migration flow and included in pvrdma_regs_write() to perform a
> lazy load.

The lazy load was suggested to overcome a potential problem with mapping to
addresses while still in the migration process. With that solved, I would
suggest dropping the idea of lazy load.

> As of now, migration will fail if there is an error in load_dsr().
> Also, there might be a considerable amount of pages in the rings,
> which will have dma map operations when the init functions are
> called.
> If this takes noticeable time, it might be better to have lazy
> load instead.

Yeah, makes sense, but I hope we will not get to this.

> 
> Cc: Marcel Apfelbaum 
> Cc: Yuval Shaia 
> Signed-off-by: Sukrit Bhatnagar 
> ---
>  hw/rdma/vmw/pvrdma_main.c | 87 +++
>  1 file changed, 87 insertions(+)
> 
> diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
> index 6c90db96f9..4a10bd2fc7 100644
> --- a/hw/rdma/vmw/pvrdma_main.c
> +++ b/hw/rdma/vmw/pvrdma_main.c
> @@ -28,6 +28,7 @@
>  #include "sysemu/sysemu.h"
>  #include "monitor/monitor.h"
>  #include "hw/rdma/rdma.h"
> +#include "migration/register.h"
>  
>  #include "../rdma_rm.h"
>  #include "../rdma_backend.h"
> @@ -593,6 +594,91 @@ static void pvrdma_shutdown_notifier(Notifier *n, void 
> *opaque)
>  pvrdma_fini(pci_dev);
>  }
>  
> +struct PVRDMAMigTmp {
> +PVRDMADev *parent;
> +uint64_t dma;
> +uint64_t cmd_slot_dma;
> +uint64_t resp_slot_dma;
> +uint32_t cq_ring_pages_num_pages;
> +uint64_t cq_ring_pages_pdir_dma;
> +uint32_t async_ring_pages_num_pages;
> +uint64_t async_ring_pages_pdir_dma;
> +};
> +
> +static int pvrdma_dsr_dma_pre_save(void *opaque)
> +{
> +struct PVRDMAMigTmp *tmp = opaque;

To me, tmp sounds like a very bad name; if it is the convention then I can
live with that. Anyway, I'd suggest something like mig_tmp_info.

> +DSRInfo *dsr_info = &tmp->parent->dsr_info;

Can you shed some light on how the parent field is initialized with the
pointer to the device object?

> +struct pvrdma_device_shared_region *dsr = dsr_info->dsr;
> +
> +tmp->dma = dsr_info->dma;
> +tmp->cmd_slot_dma = dsr->cmd_slot_dma;
> +tmp->resp_slot_dma = dsr->resp_slot_dma;
> +tmp->cq_ring_pages_num_pages = dsr->cq_ring_pages.num_pages;
> +tmp->cq_ring_pages_pdir_dma = dsr->cq_ring_pages.pdir_dma;
> +tmp->async_ring_pages_num_pages = dsr->async_ring_pages.num_pages;
> +tmp->async_ring_pages_pdir_dma = dsr->async_ring_pages.pdir_dma;
> +
> +return 0;
> +}
> +
> +static int pvrdma_dsr_dma_post_load(void *opaque, int version_id)
> +{
> +struct PVRDMAMigTmp *tmp = opaque;
> +PVRDMADev *dev = tmp->parent;
> +PCIDevice *pci_dev = PCI_DEVICE(dev);
> +DSRInfo *dsr_info = &dev->dsr_info;
> +struct pvrdma_device_shared_region *dsr;
> +
> +dsr_info->dma = tmp->dma;
> +dsr_info->dsr = rdma_pci_dma_map(pci_dev, dsr_info->dma,
> +sizeof(struct pvrdma_device_shared_region));
> +if (!dsr_info->dsr) {
> +rdma_error_report("Failed to map to DSR");
> +return -ENOMEM;

Will this cause the VM on source host to continue functioning besides
failing the migration?

> +}
> +
> +dsr = dsr_info->dsr;
> +dsr->cmd_slot_dma = tmp->cmd_slot_dma;
> +dsr->resp_slot_dma = tmp->resp_slot_dma;
> +dsr->cq_ring_pages.num_pages = tmp->cq_ring_pages_num_pages;
> +dsr->cq_ring_pages.pdir_dma = tmp->cq_ring_pages_pdir_dma;
> +dsr->async_ring_pages.num_pages = tmp->async_ring_pages_num_pages;
> +dsr->async_ring_pages.pdir_dma = tmp->async_ring_pages_pdir_dma;

I expect the above 6 fields to already be populated with the correct values,
as we just map to the driver's DSR, which should be migrated as part of the
memory copy from the source to the dest host.
Can you verify it, and if my assumptions are correct, remove these
assignments?