[PATCH] migration, docs: mark RDMA migration as deprecated

2024-03-31 Thread Li Zhijian via
Except for RDMA migration, other parts of the RDMA subsystem have been removed since 9.1. Due to the lack of unit tests and CI tests for RDMA migration, in the past development cycles, a few fatal errors were introduced and broke the RDMA migration, and these issues[1][2] were not fixed until some

[PATCH 1/2] CXL/cxl_type3: add first_dvsec_offset() helper

2024-04-01 Thread Li Zhijian via
It helps to figure out where the first dvsec register is located. In addition, replace hard-coded offsets and sizes with existing macros. Signed-off-by: Li Zhijian --- hw/mem/cxl_type3.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/hw/mem/cxl_type3.c b/hw/

[PATCH 2/2] CXL/cxl_type3: reset DVSEC CXL Control in ct3d_reset

2024-04-01 Thread Li Zhijian via
After the kernel commit 0cab68720598 ("cxl/pci: Fix disabling memory if DVSEC CXL Range does not match a CFMWS window") CXL type3 devices cannot be enabled again after the reboot because this flag was not reset. This flag could be changed by the firmware or OS, let it have a reset(default) value

[PATCH v2] hw/mem/cxl_type3: reset dvsecs in ct3d_reset()

2024-04-09 Thread Li Zhijian via
After the kernel commit 0cab68720598 ("cxl/pci: Fix disabling memory if DVSEC CXL Range does not match a CFMWS window") CXL type3 devices cannot be enabled again after the reboot because the control register (see 8.1.3.2 in CXL specification 2.0 for more details) was not reset. These registers coul

[PATCH] migration/colo: Fix bdrv_graph_rdlock_main_loop: Assertion `!qemu_in_coroutine()' failed.

2024-04-16 Thread Li Zhijian via
bdrv_activate_all() should not be called from the coroutine context, move it to the QEMU thread colo_process_incoming_thread() with the bql_lock protected. The backtrace is as follows: #4 0x561af7948362 in bdrv_graph_rdlock_main_loop () at ../block/graph-lock.c:260 #5 0x561af7907a68 i

[PATCH v2] migration/colo: Fix bdrv_graph_rdlock_main_loop: Assertion `!qemu_in_coroutine()' failed.

2024-04-16 Thread Li Zhijian via
bdrv_activate_all() should not be called from the coroutine context, move it to the QEMU thread colo_process_incoming_thread() with the bql_lock protected. The backtrace is as follows: #4 0x561af7948362 in bdrv_graph_rdlock_main_loop () at ../block/graph-lock.c:260 #5 0x561af7907a68 i

[PATCH] backends/cryptodev-builtin: Fix local_error leaks

2024-04-22 Thread Li Zhijian via
It seems that this error does not need to be propagated to the upper layer; output the error directly to avoid the leaks. Closes: https://gitlab.com/qemu-project/qemu/-/issues/2283 Signed-off-by: Li Zhijian --- backends/cryptodev-builtin.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-)

[PATCH 1/3] migration/colo: Minor fix for colo error message

2024-05-08 Thread Li Zhijian via
- Explicitly show the missing module name: replication - Fix capability name to x-colo Signed-off-by: Li Zhijian --- migration/migration.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index 6502e169a3..b4a09c561c 100644 -

[PATCH 3/3] migration/colo: Tidy up bql_unlock() around bdrv_activate_all()

2024-05-08 Thread Li Zhijian via
Make the code tighter. Cc: Michael Tokarev Signed-off-by: Li Zhijian --- This change/comment suggested by "Michael Tokarev " came a bit late at that time, let's update it together in this minor set this time. --- migration/colo.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff

[PATCH 2/3] migration/colo: make colo_incoming_co() return void

2024-05-08 Thread Li Zhijian via
Currently, it always returns 0, no need to check the return value at all. In addition, enter colo coroutine only if migration_incoming_colo_enabled() is true. Once the destination side enters the COLO* state, the COLO process will take over the remaining processes until COLO exits. Signed-off-by:

[PATCH] cxl: Get rid of unused cfmw_list

2024-05-30 Thread Li Zhijian via
There is no user for this member. All '-M cxl-fmw.N' options have been parsed and saved to CXLState.fixed_windows. Signed-off-by: Li Zhijian --- hw/cxl/cxl-host.c| 1 - include/hw/cxl/cxl.h | 1 - 2 files changed, 2 deletions(-) diff --git a/hw/cxl/cxl-host.c b/hw/cxl/cxl-host.c index c5f5f

[PATCH v2 2/3] migration/colo: make colo_incoming_co() return void

2024-05-15 Thread Li Zhijian via
Currently, it always returns 0, no need to check the return value at all. In addition, enter colo coroutine only if migration_incoming_colo_enabled() is true. Once the destination side enters the COLO* state, the COLO process will take over the remaining processes until COLO exits. Cc: Fabiano Ros

[PATCH v2 1/3] migration/colo: Minor fix for colo error message

2024-05-15 Thread Li Zhijian via
- Explicitly show the missing module name: replication - Fix capability name to x-colo Reviewed-by: Peter Xu Reviewed-by: Zhang Chen Signed-off-by: Li Zhijian --- V2: Collected reviewed-by tags --- migration/migration.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/

[PATCH v2 3/3] migration/colo: Tidy up bql_unlock() around bdrv_activate_all()

2024-05-15 Thread Li Zhijian via
Make the code tighter. Suggested-by: Michael Tokarev Reviewed-by: Peter Xu Reviewed-by: Zhang Chen Signed-off-by: Li Zhijian --- V2: Collected reviewed-by tags This change/comment suggested by "Michael Tokarev " came a bit late at that time, let's update it together in this minor set this

[PATCH] hw/cxl: Fix msix_notify: Assertion `vector < dev->msix_entries_nr`

2024-12-12 Thread Li Zhijian via
This assertion always happens when we sanitize the CXL memory device. $ echo 1 > /sys/bus/cxl/devices/mem0/security/sanitize It is incorrect to register an MSIX number beyond the device's capability. Expand the device's MSIX to 10 and introduce the `request_msix_number()` helper function to dynam

[PATCH v2] hw/cxl: Fix msix_notify: Assertion `vector < dev->msix_entries_nr`

2024-12-13 Thread Li Zhijian via
This assertion always happens when we sanitize the CXL memory device. $ echo 1 > /sys/bus/cxl/devices/mem0/security/sanitize It is incorrect to register an MSIX number beyond the device's capability. Expand the device's MSIX number and use the enum to maintain the *USED* and MAX MSIX number Fixe

[PATCH 1/3] hw/mem/cxl_type3: Add paired msix_uninit_exclusive_bar() call

2025-01-19 Thread Li Zhijian via
msix_uninit_exclusive_bar() should be paired with msix_init_exclusive_bar() Ensure proper resource cleanup by adding the missing `msix_uninit_exclusive_bar()` call for the Type3 CXL device. Signed-off-by: Li Zhijian --- hw/mem/cxl_type3.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/hw

[PATCH 2/3] hw/mem/cxl_type3: Fix special_ops memory leak on msix_init_exclusive_bar() failure

2025-01-19 Thread Li Zhijian via
Address a memory leak issue by ensuring `regs->special_ops` is freed when `msix_init_exclusive_bar()` encounters an error during CXL Type3 device initialization. Additionally, this patch renames err_address_space_free to err_msix_uninit for better clarity and logical flow Signed-off-by: Li Zhijia
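The leak fix described in this entry follows the common C pattern of unwinding through `goto` labels, so that whatever was acquired before a failing step is released again. The following is a minimal, self-contained sketch of that pattern only; `demo_regs`, `demo_msix_init()`, `demo_realize()` and `demo_unrealize()` are hypothetical stand-ins and not the actual QEMU code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the state that owns special_ops. */
struct demo_regs {
    void *special_ops;
};

/* Hypothetical init step; fails when 'fail' is true. */
static int demo_msix_init(bool fail)
{
    return fail ? -1 : 0;
}

/*
 * Allocate special_ops, then run an init step that may fail.
 * On failure, unwind through a label so the allocation is not leaked.
 */
static int demo_realize(struct demo_regs *regs, bool fail_msix)
{
    int rc;

    regs->special_ops = malloc(64);
    if (!regs->special_ops) {
        return -1;
    }

    rc = demo_msix_init(fail_msix);
    if (rc) {
        goto err_free_special_ops;
    }
    return 0;

err_free_special_ops:
    free(regs->special_ops);
    regs->special_ops = NULL;
    return rc;
}

/* Release what demo_realize() acquired on success. */
static void demo_unrealize(struct demo_regs *regs)
{
    assert(regs != NULL);
    free(regs->special_ops);
    regs->special_ops = NULL;
}
```

Naming each label after the cleanup it performs keeps the unwind order readable when more init steps are added later.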

[PATCH 3/3] hw/mem/cxl_type3: Ensure errp is set on realization failure

2025-01-19 Thread Li Zhijian via
Simply pass the errp to its callee which will set errp if needed, to enhance error reporting for CXL Type 3 device initialization by setting the errp when realization functions fail. Previously, failing to set `errp` could result in errors being overlooked, causing the system to mistakenly treat f
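The idea of forwarding `errp` to a callee can be illustrated with a small stand-alone sketch. It does not use QEMU's real `Error` type; `DemoError`, `demo_setup()` and `demo_realize_errp()` are hypothetical names chosen for illustration, and the sketch only assumes the convention that a callee sets `*errp` on failure:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for an error object; this is not QEMU's Error type. */
typedef struct DemoError {
    const char *msg;
} DemoError;

static DemoError demo_err_setup = { "demo: setup failed" };

/* Hypothetical callee: on failure it sets *errp and returns false. */
static bool demo_setup(bool fail, DemoError **errp)
{
    if (fail) {
        if (errp) {
            *errp = &demo_err_setup;
        }
        return false;
    }
    return true;
}

/* The caller forwards errp, so a failure always carries an error object. */
static bool demo_realize_errp(bool fail, DemoError **errp)
{
    if (!demo_setup(fail, errp)) {
        return false; /* *errp was already set by the callee */
    }
    return true;
}
```

With this shape, a caller that checks `*errp` after the call cannot mistake a failed realization for a successful one.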

[PATCH v3] hw/cxl: Fix msix_notify: Assertion `vector < dev->msix_entries_nr`

2025-01-14 Thread Li Zhijian via
This assertion always happens when we sanitize the CXL memory device. $ echo 1 > /sys/bus/cxl/devices/mem0/security/sanitize It is incorrect to register an MSIX number beyond the device's capability. Increase the device's MSIX number to cover the mailbox msix number(9). Fixes: 43efb0bfad2b ("hw/

[PATCH] hw/cxl: Introduce CXL_T3_MSIX_VECTOR enumeration

2025-01-14 Thread Li Zhijian via
Introduce the `CXL_T3_MSIX_VECTOR` enumeration to specify MSIX vector assignments specific to the Type 3 (T3) CXL device. The primary goal of this change is to encapsulate the MSIX vector uses that are unique to the T3 device within an enumeration, improving code readability and maintenance by avo
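As a rough illustration of keeping vector assignments in one enumeration, the sketch below uses hypothetical member names. Only the mailbox vector number 9 and the resulting count of 10 are taken from the related msix_notify fix threads in this listing; everything else is an assumption:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical assignment; only the mailbox value 9 comes from the thread. */
enum demo_t3_msix_vector {
    DEMO_T3_MSIX_DOE = 0,      /* assumed example user of vector 0 */
    DEMO_T3_MSIX_MBOX = 9,     /* mailbox vector */
    DEMO_T3_MSIX_VECTOR_NR     /* number of vectors to allocate: 10 */
};

/* Mirrors the failing assertion: a vector must be below the allocated count. */
static bool demo_vector_in_range(unsigned vector, unsigned entries_nr)
{
    return vector < entries_nr;
}
```

Because the trailing member is derived from the highest assigned vector, adding a new user updates the count that would be passed to the MSI-X setup call without a separate magic number.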

Re: [PATCH 2/2] [NOT-FOR-MERGE] Add qtest for migration over RDMA

2025-02-20 Thread Li Zhijian via
On 19/02/2025 22:11, Peter Xu wrote: > then > in the test it tries to detect rdma link and fetch the ip only It should work without root permission if we just *detect* and *fetch ip*. Do you also mean we can split new-rdma-link.sh to 2 separate scripts - add-rdma-link.sh

[PATCH v2 4/8] migration: Integrate control_save_page() logic into ram_save_target_page()

2025-02-20 Thread Li Zhijian via
Refactor the page saving logic by integrating the control_save_page() function directly into ram_save_target_page(). This change consolidates the RDMA migration decision-making process into a single function, enhancing clarity and maintainability. Signed-off-by: Li Zhijian --- migration/ram.c |

[PATCH v2 8/8] migration: Add qtest for migration over RDMA

2025-02-20 Thread Li Zhijian via
This qtest requires an RDMA (RoCE) link on the host. To make the test work smoothly, introduce scripts/rdma-migration-helper.sh to - set up a new Soft-RoCE (aka RXE) link if run as root - detect an existing RoCE link. The test is skipped if no RoCE link is available. # Start of rdma t

[PATCH v2 2/8] migration/rdma: Remove redundant RAM_SAVE_CONTROL_NOT_SUPP check

2025-02-20 Thread Li Zhijian via
qemu_rdma_save_page() no longer returns RAM_SAVE_CONTROL_NOT_SUPP since commit a4832d299dd ("migration/rdma: Check sooner if we are in postcopy for save_page()") Signed-off-by: Li Zhijian --- migration/rdma.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/migration/rdma.c

[PATCH v2 5/8] migration: Add migration_capabilities_and_transport_compatible() helper

2025-02-20 Thread Li Zhijian via
Similar to migration_channels_and_transport_compatible(), introduce a new helper migration_capabilities_and_transport_compatible() to check if the capabilities are compatible with the transport. Currently, only the capabilities vs RDMA transport check is moved to this function. Signed-off-by: Li Zhijian ---

[PATCH v2 1/8] migration: Prioritize RDMA in ram_save_target_page()

2025-02-20 Thread Li Zhijian via
Address an error in RDMA-based migration by ensuring RDMA is prioritized when saving pages in `ram_save_target_page()`. Previously, the RDMA protocol's page-saving step was placed after other protocols due to a refactoring in commit bc38dc2f5f3. This led to migration failures characterized by unkn
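The ordering issue can be pictured as a chain of branches in which the RDMA check has to come first. The sketch below is only an illustration of that ordering under assumed names (`demo_cfg`, `demo_pick_save_path()`, and the `other` flag standing in for any non-RDMA path); it is not the code of `ram_save_target_page()`:

```c
#include <assert.h>
#include <stdbool.h>

enum demo_path { DEMO_PATH_RDMA, DEMO_PATH_OTHER, DEMO_PATH_NORMAL };

/* Hypothetical configuration flags for the illustration. */
struct demo_cfg {
    bool rdma;   /* RDMA transport is in use */
    bool other;  /* some other page-saving path is enabled */
};

/* Check the RDMA transport first, so no later branch can claim the page. */
static enum demo_path demo_pick_save_path(const struct demo_cfg *cfg)
{
    if (cfg->rdma) {
        return DEMO_PATH_RDMA;
    }
    if (cfg->other) {
        return DEMO_PATH_OTHER;
    }
    return DEMO_PATH_NORMAL;
}
```

If the two `if` blocks were swapped, a configuration with both flags set would never reach the RDMA branch, which is the kind of regression the patch description refers to.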

[PATCH v2 6/8] migration: disable RDMA + postcopy-ram

2025-02-20 Thread Li Zhijian via
It's believed that RDMA + postcopy-ram has been broken for a while. Rather than spending time re-enabling it, let's simply disable it as a trade-off. Signed-off-by: Li Zhijian --- migration/migration.c | 4 1 file changed, 4 insertions(+) diff --git a/migration/migration.c b/migration/migr

[PATCH v2 3/8] migration: Kill RAM_SAVE_CONTROL_NOT_SUPP

2025-02-20 Thread Li Zhijian via
Refactor the migration control logic by eliminating the `RAM_SAVE_CONTROL_NOT_SUPP` return value within the migration codebase. This involves moving the checks for RDMA migration status and postcopy state from rdma_control_save_page() to control_save_page() With this change, control_save_page() n

[PATCH v2 7/8] migration/rdma: Remove redundant migration_in_postcopy checks

2025-02-20 Thread Li Zhijian via
Since we have disabled RDMA + postcopy, it's safe to remove the migration_in_postcopy() that follows the migration_rdma(). Signed-off-by: Li Zhijian --- migration/ram.c | 2 +- migration/rdma.c | 5 +++-- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/migration/ram.c b/migratio

[PATCH v2 0/8] migration/rdma: fixes, refactor and cleanup

2025-02-20 Thread Li Zhijian via
- Fix the broken RDMA migration - disable RDMA + postcopy - some cleanups - Add a qtest for RDMA at last Changes since V1[0]: Add some separate patches to refactor and clean up based on V1 [0] https://lore.kernel.org/qemu-devel/20250218074345.638203-1-lizhij...@fujitsu.com/ Li Zhijian (8

[PATCH v4 6/6] migration: Add qtest for migration over RDMA

2025-02-25 Thread Li Zhijian via
This qtest requires an RDMA (RoCE) link on the host. To make the test work smoothly, introduce scripts/rdma-migration-helper.sh to - set up a new Soft-RoCE (aka RXE) link if run as root - detect an existing RoCE link. The test is skipped if no RoCE link is available. # Start of rdma t

[PATCH v4 2/6] migration: check RDMA and capabilities are compatible on both sides

2025-02-25 Thread Li Zhijian via
Depending on the order of starting RDMA and setting capability, they can be categorized into the following scenarios: Source: S1: [set capabilities] -> [Start RDMA outgoing] Destination: D1: [set capabilities] -> [Start RDMA incoming] D2: [Start RDMA incoming] -> [set capabili

[PATCH v4 0/6] migration/rdma: fixes, refactor and cleanup

2025-02-25 Thread Li Zhijian via
- Fix the broken RDMA migration - disable RDMA + postcopy - some cleanups - Add a qtest for RDMA at last Changes since V3: - check RDMA and capabilities are compatible on both sides # renamed from previous V3's "migration: Add migration_capabilities_and_transport_compatible()" Changes

[PATCH v4 4/6] migration/rdma: Remove redundant migration_in_postcopy checks

2025-02-25 Thread Li Zhijian via
Since we have disabled RDMA + postcopy, it's safe to remove the migration_in_postcopy() that follows the migrate_rdma(). Signed-off-by: Li Zhijian --- V3: reorder: 7th->4th --- migration/rdma.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/migration/rdma.c b/migrati

[PATCH v4 5/6] migration: Unfold control_save_page()

2025-02-25 Thread Li Zhijian via
control_save_page() is for RDMA only; unfold it to make the code clearer. In addition: - Following the style of the other branches in ram_save_target_page(), involve RDMA only if the condition 'migrate_rdma()' is true. - Further simplify the code by removing RAM_SAVE_CONTROL_NOT_SUPP. Signed-off

[PATCH v4 1/6] migration: Prioritize RDMA in ram_save_target_page()

2025-02-25 Thread Li Zhijian via
Address an error in RDMA-based migration by ensuring RDMA is prioritized when saving pages in `ram_save_target_page()`. Previously, the RDMA protocol's page-saving step was placed after other protocols due to a refactoring in commit bc38dc2f5f3. This led to migration failures characterized by unkn

[PATCH v4 3/6] migration: disable RDMA + postcopy-ram

2025-02-25 Thread Li Zhijian via
It's believed that RDMA + postcopy-ram has been broken for a while. Rather than spending time re-enabling it, let's simply disable it as a trade-off. Reviewed-by: Peter Xu Signed-off-by: Li Zhijian --- V3: - collect Reviewed tag - reorder: 6th -> 3rd --- migration/options.c | 4 1 file

[PATCH v3 2/6] migration: Add migration_capabilities_and_transport_compatible() helper

2025-02-25 Thread Li Zhijian via
Similar to migration_channels_and_transport_compatible(), introduce a new helper migration_capabilities_and_transport_compatible() to check if the capabilities are compatible with the transport. Currently, only the capabilities vs RDMA transport check is moved to this function. Reviewed-by: Peter Xu Signed-

[PATCH v3 6/6] migration: Add qtest for migration over RDMA

2025-02-25 Thread Li Zhijian via
This qtest requires an RDMA (RoCE) link on the host. To make the test work smoothly, introduce scripts/rdma-migration-helper.sh to - set up a new Soft-RoCE (aka RXE) link if run as root - detect an existing RoCE link. The test is skipped if no RoCE link is available. # Start of rdma t

[PATCH v3 5/6] migration: Unfold control_save_page()

2025-02-25 Thread Li Zhijian via
control_save_page() is for RDMA only; unfold it to make the code clearer. In addition: - Following the style of the other branches in ram_save_target_page(), involve RDMA only if the condition 'migrate_rdma()' is true. - Further simplify the code by removing RAM_SAVE_CONTROL_NOT_SUPP. Signed-off

[PATCH v3 1/6] migration: Prioritize RDMA in ram_save_target_page()

2025-02-25 Thread Li Zhijian via
Address an error in RDMA-based migration by ensuring RDMA is prioritized when saving pages in `ram_save_target_page()`. Previously, the RDMA protocol's page-saving step was placed after other protocols due to a refactoring in commit bc38dc2f5f3. This led to migration failures characterized by unkn

[PATCH v3 3/6] migration: disable RDMA + postcopy-ram

2025-02-25 Thread Li Zhijian via
It's believed that RDMA + postcopy-ram has been broken for a while. Rather than spending time re-enabling it, let's simply disable it as a trade-off. Reviewed-by: Peter Xu Signed-off-by: Li Zhijian --- V3: - collect Reviewed tag - reorder: 6th -> 3rd --- migration/migration.c | 4 1 fil

[PATCH v3 0/6] migration/rdma: fixes, refactor and cleanup

2025-02-25 Thread Li Zhijian via
- Fix the broken RDMA migration - disable RDMA + postcopy - some cleanups - Add a qtest for RDMA at last Changes since V2: - squash previous 2/3/4 to '[PATCH v3 5/6] migration: Unfold control_save_page()' - reorder the patch layout to prevent recently added code from being deleted agai

[PATCH v3 4/6] migration/rdma: Remove redundant migration_in_postcopy checks

2025-02-25 Thread Li Zhijian via
Since we have disabled RDMA + postcopy, it's safe to remove the migration_in_postcopy() that follows the migrate_rdma(). Signed-off-by: Li Zhijian --- V3: reorder: 7th->4th --- migration/rdma.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/migration/rdma.c b/migrati

[PATCH 2/2] [NOT-FOR-MERGE] Add qtest for migration over RDMA

2025-02-17 Thread Li Zhijian via
This qtest requires an RXE link on the host. Here is an example to show how to add this RXE link: $ ./new-rdma-link.sh 192.168.22.93 Signed-off-by: Li Zhijian --- The RDMA migration was broken again...due to lack of sufficient test/qtest. It's ugly to add and execute a script to establish

[PATCH 1/2] migration: Prioritize RDMA in ram_save_target_page()

2025-02-17 Thread Li Zhijian via
Address an error in RDMA-based migration by ensuring RDMA is prioritized when saving pages in `ram_save_target_page()`. Previously, the RDMA protocol's page-saving step was placed after other protocols due to a refactoring in commit bc38dc2f5f3. This led to migration failures characterized by unkn

[PATCH v5 2/6] migration: check RDMA and capabilities are compatible on both sides

2025-03-04 Thread Li Zhijian via
Depending on the order of starting RDMA and setting capability, they can be categorized into the following scenarios: Source: S1: [set capabilities] -> [Start RDMA outgoing] Destination: D1: [set capabilities] -> [Start RDMA incoming] D2: [Start RDMA incoming] -> [set capabilities] Previously,

[PATCH v5 3/6] migration: disable RDMA + postcopy-ram

2025-03-04 Thread Li Zhijian via
It's believed that RDMA + postcopy-ram has been broken for a while. Rather than spending time re-enabling it, let's simply disable it as a trade-off. Reviewed-by: Peter Xu Signed-off-by: Li Zhijian --- V3: - collect Reviewed tag - reorder: 6th -> 3rd --- migration/options.c | 4 1 file

[PATCH v5 6/6] migration: Add qtest for migration over RDMA

2025-03-04 Thread Li Zhijian via
This qtest requires an RDMA (RoCE) link on the host. To make the test work smoothly, introduce scripts/rdma-migration-helper.sh to - set up a new Soft-RoCE (aka RXE) link if run as root - detect an existing RoCE link. The test is skipped if no RoCE link is available. # Start of rdma t

[PATCH v5 0/6] migration/rdma: fixes, refactor and cleanup

2025-03-04 Thread Li Zhijian via
- Fix the broken RDMA migration - disable RDMA + postcopy - some cleanups - Add a qtest for RDMA at last Changes since V4: - collect Reviewed-tags - Address comments in patch "migration: Add qtest for migration over RDMA" from Fabiano Rosas Changes since V3: - check RDMA and capabili

[PATCH v5 1/6] migration: Prioritize RDMA in ram_save_target_page()

2025-03-04 Thread Li Zhijian via
Address an error in RDMA-based migration by ensuring RDMA is prioritized when saving pages in `ram_save_target_page()`. Previously, the RDMA protocol's page-saving step was placed after other protocols due to a refactoring in commit bc38dc2f5f3. This led to migration failures characterized by unkn

[PATCH v5 4/6] migration/rdma: Remove redundant migration_in_postcopy checks

2025-03-04 Thread Li Zhijian via
Since we have disabled RDMA + postcopy, it's safe to remove the migration_in_postcopy() that follows the migrate_rdma(). Reviewed-by: Peter Xu Signed-off-by: Li Zhijian --- V3: reorder: 7th->4th --- migration/rdma.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mig

[PATCH v5 5/6] migration: Unfold control_save_page()

2025-03-04 Thread Li Zhijian via
control_save_page() is for RDMA only; unfold it to make the code clearer. In addition: - Following the style of the other branches in ram_save_target_page(), involve RDMA only if the condition 'migrate_rdma()' is true. - Further simplify the code by removing RAM_SAVE_CONTROL_NOT_SUPP. Reviewed-b

[PATCH v6] migration: Add qtest for migration over RDMA

2025-03-10 Thread Li Zhijian via
This qtest requires an RDMA (RoCE) link on the host. To make the test work smoothly, introduce scripts/rdma-migration-helper.sh to detect an existing RoCE link before running the test. The test is skipped if no RoCE link is available. # Start of rdma tests # Running /x86_64

[PATCH] hw/pci-bridge/pci_expander_bridge: Fix HDM passthrough condition

2025-03-23 Thread Li Zhijian via
Reverse the logical condition for HDM passthrough support in pci_expander_bridge. This patch ensures the HDM passthrough condition is evaluated only when hdm_for_passthrough is set to true, aligning behavior with intended semantics and comments. Signed-off-by: Li Zhijian --- This change corrects

[PATCH] qtest/migration/rdma: Add test for rdma migration with ipv6

2025-03-26 Thread Li Zhijian via
Recently, we removed the IPv6 restriction[0] from RDMA migration; add a test for it. [0] https://lore.kernel.org/qemu-devel/20250326095224.9918-1-jinpu.w...@ionos.com/ Cc: Jack Wang Cc: Michael R. Galaxy Cc: Peter Xu Cc: Yu Zhang Signed-off-by: Li Zhijian --- This test is added based on [1] S