Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-09 Thread Zheng Chuan via
Hi Peter, Lei, Jinpu,

On 2024/5/8 0:28, Peter Xu wrote:
> On Tue, May 07, 2024 at 01:50:43AM +, Gonglei (Arei) wrote:
>> Hello,
>>
>>>
>>> On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:
 Hi, Peter
>>>
>>> Hey, Lei,
>>>
>>> Happy to see you around again after years.
>>>
>> Haha, me too.
>>
 RDMA features high bandwidth, low latency (in a non-blocking lossless
 network), and direct remote memory access that bypasses the CPU (as you
 know, CPU resources are expensive for cloud vendors, which is one of the
 reasons we introduced offload cards), which TCP does not have.
>>>
>>> It's another cost to use offload cards vs. preparing more CPU resources?
>>>
>> A converged software and hardware offload architecture is the way to go for
>> all cloud vendors (considering the combined benefits in performance, cost,
>> security, and innovation speed); it's not just a matter of adding the
>> resources of a DPU card.
>>
 RDMA is very useful in scenarios where fast live migration is needed
 (extremely short interruption duration and migration duration). To this
 end, we have also developed RDMA support for multifd.
>>>
>>> Will any of you upstream that work?  I'm curious how intrusive it would be
>>> when adding it to multifd; if it can keep only 5 exported functions like what
>>> rdma.h does right now, it'll be pretty nice.  We also want to make sure it
>>> works with arbitrary sized loads and buffers, e.g. vfio is considering adding
>>> IO loads to multifd channels too.
>>>
>>
>> In fact, we sent the patchset to the community in 2021. Pls see:
>> https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/
> 

Yes, I sent the patchset adding multifd support for RDMA migration, taking
over the work from my colleague, and I'm sorry for not keeping up with that
work at the time for various reasons.
I also strongly agree with Lei that the RDMA protocol has some special
advantages over TCP in certain scenarios, and we do indeed use it in our
product.

> I wasn't aware of that for sure in the past..
> 
> Multifd has changed quite a bit in the last 9.0 release, so that may not apply
> anymore.  One thing to mention is: please look at Dan's comment on the possible
> use of rsocket.h:
> 
> https://lore.kernel.org/all/zjjm6rcqs5eho...@redhat.com/
> 
> And Jinpu did help provide an initial test result over the library:
> 
> https://lore.kernel.org/qemu-devel/camgffek8wiknqmouyxcathgtiem2dwocf_w7t0vmcd-i30t...@mail.gmail.com/
> 
> It looks like we have a chance to apply that in QEMU.
> 
>>
>>
>>> One thing to note is that the question here is not about a pure performance
>>> comparison between rdma and nics only.  It's about helping us make a decision
>>> on whether to drop rdma; iow, even if rdma performs well, the community still
>>> has the right to drop it if nobody can actively work on and maintain it.
>>> It's just that if nics can perform as well, that's more of a reason to drop it,
>>> unless companies can help to provide good support and work together.
>>>
>>
>> We are happy to provide the necessary review and maintenance work for RDMA
>> if the community needs it.
>>
>> CC'ing Chuan Zheng.
> 
> I'm not sure whether you and Jinpu's team would like to work together and
> provide a final solution for rdma over multifd.  It could be much simpler
> than the original 2021 proposal if the rsocket API will work out.
> 
> Thanks,
> 
That's good news, to see a socket abstraction for RDMA!
When I developed the series above, the biggest pain point was that RDMA
migration had no QIOChannel abstraction and I needed to use a 'fake channel'
for it, which was awkward in the code.
So, as far as I can see, we can do this by:
i. first evaluating whether rsocket is good enough to satisfy our fundamental
QIOChannel abstraction (a minimal sketch of the rsocket calls follows below);
ii. if it works, seeing whether it gives us the opportunity to hide the details
of the RDMA protocol inside rsocket by removing most of the code in rdma.c
and also some hacks in the migration main process;
iii. implementing the advanced features like multifd and multi-URI for RDMA
migration.
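
For reference, here is a minimal, standalone sketch (not QEMU code) of how the
librdmacm rsocket API mirrors the BSD socket calls; the IP address, port, and
payload are made-up placeholders. The point is that, if step i holds, a
QIOChannel backend could map its callbacks almost one-to-one onto these calls
instead of the hand-rolled logic in rdma.c.

/*
 * Minimal standalone illustration of the librdmacm rsocket API (this is
 * not QEMU code).  Each call mirrors its BSD-socket counterpart, which is
 * why a rsocket-based QIOChannel could look much like the existing socket
 * channel.  Build with: gcc rsocket-demo.c -lrdmacm
 * The IP address and port below are placeholders.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <rdma/rsocket.h>

int main(void)
{
    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_port   = htons(4444),                      /* placeholder port */
    };
    inet_pton(AF_INET, "192.168.0.2", &dst.sin_addr);   /* placeholder IP */

    int fd = rsocket(AF_INET, SOCK_STREAM, 0);          /* cf. socket()  */
    if (fd < 0) {
        perror("rsocket");
        return 1;
    }
    if (rconnect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("rconnect");                             /* cf. connect() */
        rclose(fd);
        return 1;
    }

    const char msg[] = "hello over RDMA";
    ssize_t n = rsend(fd, msg, sizeof(msg), 0);         /* cf. send()    */
    printf("sent %zd bytes\n", n);

    rclose(fd);                                         /* cf. close()   */
    return 0;
}

If that mapping holds up, the multifd code would not need to care whether the
underlying channel is TCP or RDMA.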

Since I am not familiar wi

Re: [PATCH v3 03/23] multifd: Rename used field to num

2021-12-13 Thread Zheng Chuan via
Hi, Juan,

Sorry, I forgot to send this to qemu-devel; resending it.

On 2021/11/24 18:05, Juan Quintela wrote:
> We will need to split it later in zero_num (number of zero pages) and
> normal_num (number of normal pages).  This name is better.
> 
> Signed-off-by: Juan Quintela 
> ---
>  migration/multifd.h |  2 +-
>  migration/multifd.c | 38 +++---
>  2 files changed, 20 insertions(+), 20 deletions(-)
> 
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 15c50ca0b2..86820dd028 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -55,7 +55,7 @@ typedef struct {
>  
>  typedef struct {
>  /* number of used pages */
> -uint32_t used;
> +uint32_t num;
>  /* number of allocated pages */
>  uint32_t allocated;
>  /* global number of generated multifd packets */
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 8125d0015c..8ea86d81dc 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -252,7 +252,7 @@ static MultiFDPages_t *multifd_pages_init(size_t size)
>  
>  static void multifd_pages_clear(MultiFDPages_t *pages)
>  {
> -pages->used = 0;
> +pages->num = 0;
>  pages->allocated = 0;
>  pages->packet_num = 0;
>  pages->block = NULL;
> @@ -270,7 +270,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
>  
>  packet->flags = cpu_to_be32(p->flags);
>  packet->pages_alloc = cpu_to_be32(p->pages->allocated);
> -packet->pages_used = cpu_to_be32(p->pages->used);
> +packet->pages_used = cpu_to_be32(p->pages->num);
>  packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>  packet->packet_num = cpu_to_be64(p->packet_num);
>  
> @@ -278,7 +278,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
>  strncpy(packet->ramblock, p->pages->block->idstr, 256);
>  }
>  
> -for (i = 0; i < p->pages->used; i++) {
> +for (i = 0; i < p->pages->num; i++) {
>  /* there are architectures where ram_addr_t is 32 bit */
>  uint64_t temp = p->pages->offset[i];
>  
> @@ -332,18 +332,18 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>  p->pages = multifd_pages_init(packet->pages_alloc);
>  }
>  
> -p->pages->used = be32_to_cpu(packet->pages_used);
> -if (p->pages->used > packet->pages_alloc) {
> +p->pages->num = be32_to_cpu(packet->pages_used);
> +if (p->pages->num > packet->pages_alloc) {
>  error_setg(errp, "multifd: received packet "
> "with %d pages and expected maximum pages are %d",
> -   p->pages->used, packet->pages_alloc) ;
> +   p->pages->num, packet->pages_alloc) ;
>  return -1;
>  }
>  
>  p->next_packet_size = be32_to_cpu(packet->next_packet_size);
>  p->packet_num = be64_to_cpu(packet->packet_num);
>  
> -if (p->pages->used == 0) {
> +if (p->pages->num == 0) {
>  return 0;
>  }
>  
> @@ -356,7 +356,7 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>  return -1;
>  }
>  
> -for (i = 0; i < p->pages->used; i++) {
> +for (i = 0; i < p->pages->num; i++) {
>  uint64_t offset = be64_to_cpu(packet->offset[i]);
>  
>  if (offset > (block->used_length - page_size)) {
> @@ -443,13 +443,13 @@ static int multifd_send_pages(QEMUFile *f)
>  }
>  qemu_mutex_unlock(&p->mutex);
>  }
> -assert(!p->pages->used);
> +assert(!p->pages->num);
>  assert(!p->pages->block);
>  
>  p->packet_num = multifd_send_state->packet_num++;
>  multifd_send_state->pages = p->pages;
>  p->pages = pages;
> -transferred = ((uint64_t) pages->used) * qemu_target_page_size()
> +transferred = ((uint64_t) pages->num) * qemu_target_page_size()
>  + p->packet_len;
The size of a zero page should not be accounted as the whole page size.
I think 'transferred' should be updated after you introduce zero_num in the
following patches, such as:
+transferred = ((uint64_t) p->normal_num) * qemu_target_page_size()
+   + ((uint64_t) p->zero_num) * sizeof(uint64_t);
Otherwise, migration time will get worse if we have a low bandwidth-limit
parameter.

I tested it with a bandwidth limit of 100 MB/s and it works fine :)
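
To make the effect concrete (the numbers here are made up for illustration):
suppose a packet carries 128 pages of 4 KiB, of which 120 are zero pages.
Charging every page as a full page makes the rate limiter account for

  128 * 4096           = 524288 bytes

while the accounting suggested above gives

    8 * 4096 + 120 * 8 =  33728 bytes

i.e. roughly 15x less. With a 100 MB/s bandwidth limit, the first variant
throttles the stream far harder than what actually goes on the wire, which
is where the extra migration time comes from.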

>  qemu_file_update_transfer(f, transferred);
>  ram_counters.multifd_bytes += transferred;
> @@ -469,12 +469,12 @@ int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset)
>  }
>  
>  if (pages->block == block) {
> -pages->offset[pages->used] = offset;
> -pages->iov[pages->used].iov_base = block->host + offset;
> -pages->iov[pages->used].iov_len = qemu_target_page_size();
> -pages->used++;
> +pages->offset[pages->num] = offset;
> +pages->iov[pages->num].iov_base = block->host + offset;
> +pages->iov[pages->num].iov_len = qemu_target_page_size();
> +pages->num++;
>  
> -