from:"Dexuan Cui"

RE: [PATCH v1] hyperv: reduce size of ms_hyperv_info

2023-09-18 Thread Dexuan Cui

> From: Olaf Hering 
> Sent: Monday, September 18, 2023 9:02 AM
> [...]
> Use the hole prior shared_gpa_boundary to store the result of get_vtl.
> This reduces the size by 8 bytes.
>  [...]
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -53,8 +53,8 @@ struct ms_hyperv_info {
>   u32 reserved_b2 : 20;
>   };
>   };
> - u64 shared_gpa_boundary;
>   u8 vtl;
> + u64 shared_gpa_boundary;
>  };
>  extern struct ms_hyperv_info ms_hyperv;
>  extern bool hv_nested;

How about moving the 'vtl' field to an even earlier place:

--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -36,6 +36,9 @@ struct ms_hyperv_info {
u32 nested_features;
u32 max_vp_index;
u32 max_lp_index;
+
+   u8 vtl;
+
union {
u32 isolation_config_a;
struct {
@@ -54,7 +57,6 @@ struct ms_hyperv_info {
};
};
u64 shared_gpa_boundary;
-   u8 vtl;
 };
 extern struct ms_hyperv_info ms_hyperv;
 extern bool hv_nested;

This also reduces the size by 8 bytes, and keeps
the isolation_config_a/ isolation_config_b/ shared_gpa_boundary
together, which are related.

RE: [PATCH v4 13/15] uapi: hyperv: Add mshv driver headers defining hypervisor ABIs

2023-10-04 Thread Dexuan Cui

> From: Greg KH 
> Sent: Tuesday, October 3, 2023 11:10 PM
> [...]
> On Tue, Oct 03, 2023 at 04:37:01PM -0700, Nuno Das Neves wrote:
> > On 9/30/2023 11:19 PM, Greg KH wrote:
> > > On Sat, Sep 30, 2023 at 10:01:58PM +, Wei Liu wrote:
> > > > On Sat, Sep 30, 2023 at 08:09:19AM +0200, Greg KH wrote:
> > > > > On Fri, Sep 29, 2023 at 11:01:39AM -0700, Nuno Das Neves wrote:
> > > > > > +/* Define connection identifier type. */
> > > > > > +union hv_connection_id {
> > > > > > +   __u32 asu32;
> > > > > > +   struct {
> > > > > > +   __u32 id:24;
> > > > > > +   __u32 reserved:8;
> > > > > > +   } __packed u;

IMO the "__packed" is unnecessary.

> > > > > bitfields will not work properly in uapi .h files, please never do 
> > > > > that.
> > > >
> > > > Can you clarify a bit more why it wouldn't work? Endianess? Alignment?
> > >
> > > Yes to both.
> > >
> > > Did you all read the documentation for how to write a kernel api?  If
> > > not, please do so.  I think it mentions bitfields, but it not, it really
> > > should as of course, this will not work properly with different endian
> > > systems or many compilers.
> >
> > Yes, in
> https://docs.k/
> ernel.org%2Fdriver-
> api%2Fioctl.html&data=05%7C01%7Cdecui%40microsoft.com%7Ce404769e0f
> 85493f0aa108dbc4a08a27%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C
> 0%7C638319966071263290%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLj
> AwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%
> 7C%7C&sdata=RiLNA5DRviWBQK6XXhxC4m77raSDBb%2F0BB6BDpFPUJY%3D
> &reserved=0 it says that it is
> > "better to avoid" bitfields.
> >
> > Unfortunately bitfields are used in the definition of the hypervisor
> > ABI. We import these definitions directly from the hypervisor code.
>
> So why do you feel you have to use this specific format for your
> user/kernel api?  That is not what is going to the hypervisor.

If it's hard to avoid bitfield here, maybe we can refer to the definition of
struct iphdr in include/uapi/linux/ip.h

Thanks,
Dexuan

[PATCH] x86/mm: Print the encryption features correctly when a paravisor is present

2023-10-18 Thread Dexuan Cui

Hyper-V provides two modes for running a TDX/SNP VM:

1) In TD Partitioning mode (TDX) or vTOM mode (SNP) with a paravisor;
2) In "fully enlightened" mode with the normal TDX shared bit or SNP C-bit
   control over page encryption, and no paravisor.

In the first mode (i.e. paravisor mode), the native TDX/SNP CPUID
capability is hidden from the VM, but cc_platform_has(CC_ATTR_MEM_ENCRYPT)
and cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) are true; as a result,
mem_encrypt_init() incorrectly prints the below message when an Intel TDX
VM with a paravisor runs on Hyper-V:
"Memory Encryption Features active: AMD SEV".

Introduce x86_platform.print_mem_enc_feature_info and allow hv_vtom_init()
to override the function pointer so that the correct message is printed.

BTW, when a VBS (Virtualization-based Security) VM running on Hyper-V
(the physical CPU can be an AMD CPU or an Intel CPU), the VM's memory is
not encrypted, but mem_encrypt_init() also prints the same incorrect
message. The introduction of x86_platform.print_mem_enc_feature_info can
also fix the issue.

Signed-off-by: Dexuan Cui 
---

Some open questions:

1. Should we refactor the existing print_memory_encrypt_feature_info()
into a TDX-specific function and an SEV-specific function?  The
function pointer in x86_platform_ops would be initialized to a no-op
function, and then early_tdx_init(), sme_enable() and hv_vtom_init()
would fill it in accordingly.

2. Should we rename "print_mem_enc_feature_info()" to
"print_coco_feature_info()"?  CC_ATTR_GUEST_STATE_ENCRYPT (and
CC_ATTR_GUEST_SEV_SNP?) may not look like *memory* encryption to me?

 arch/x86/hyperv/ivm.c  | 11 +++
 arch/x86/include/asm/mem_encrypt.h |  1 +
 arch/x86/include/asm/x86_init.h|  2 ++
 arch/x86/kernel/x86_init.c |  1 +
 arch/x86/mm/mem_encrypt.c  |  4 ++--
 5 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index e68051eba25a..fdc2fab0415e 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -450,6 +450,16 @@ static bool hv_is_private_mmio(u64 addr)
return false;
 }
 
+static void hv_print_mem_enc_feature_info(void)
+{
+   enum hv_isolation_type type = hv_get_isolation_type();
+
+   if (type == HV_ISOLATION_TYPE_SNP)
+   pr_info("Memory Encryption Features active: AMD SEV\n");
+   else if (type == HV_ISOLATION_TYPE_TDX)
+   pr_info("Memory Encryption Features active: Intel TDX\n");
+}
+
 void __init hv_vtom_init(void)
 {
enum hv_isolation_type type = hv_get_isolation_type();
@@ -479,6 +489,7 @@ void __init hv_vtom_init(void)
cc_set_mask(ms_hyperv.shared_gpa_boundary);
physical_mask &= ms_hyperv.shared_gpa_boundary - 1;
 
+   x86_platform.print_mem_enc_feature_info = hv_print_mem_enc_feature_info;
x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
x86_platform.guest.enc_cache_flush_required = 
hv_vtom_cache_flush_required;
x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
diff --git a/arch/x86/include/asm/mem_encrypt.h 
b/arch/x86/include/asm/mem_encrypt.h
index 7f97a8a97e24..6e8050a9138e 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -19,6 +19,7 @@
 
 #ifdef CONFIG_X86_MEM_ENCRYPT
 void __init mem_encrypt_init(void);
+void print_mem_encrypt_feature_info(void);
 #else
 static inline void mem_encrypt_init(void) { }
 #endif
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 14b0562c1d8b..7798174d4b8d 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -291,6 +291,7 @@ struct x86_hyper_runtime {
  * semantics.
  * @realmode_reserve:  reserve memory for realmode trampoline
  * @realmode_init: initialize realmode trampoline
+ * @print_mem_enc_feature_info:print the supported memory encryption 
features
  * @hyper: x86 hypervisor specific runtime callbacks
  */
 struct x86_platform_ops {
@@ -309,6 +310,7 @@ struct x86_platform_ops {
void (*set_legacy_features)(void);
void (*realmode_reserve)(void);
void (*realmode_init)(void);
+   void (*print_mem_enc_feature_info)(void);
struct x86_hyper_runtime hyper;
struct x86_guest guest;
 };
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 6117662ae4e6..ccb53db1b51e 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -149,6 +149,7 @@ struct x86_platform_ops x86_platform __ro_after_init = {
.restore_sched_clock_state  = tsc_restore_sched_clock_state,
.realmode_reserve   = reserve_real_mode,
.realmode_init  = init_real_mode,
+   .print_mem_enc_feature_info = print_mem_encrypt_feature_info,
.hyper.pin_v

RE: [PATCH] x86/mm: Print the encryption features correctly when a paravisor is present

2023-10-19 Thread Dexuan Cui

> From: Dave Hansen 
> Sent: Thursday, October 19, 2023 8:54 AM
> To: Dexuan Cui ; KY Srinivasan
> [...]
> > --- a/arch/x86/hyperv/ivm.c
> > +++ b/arch/x86/hyperv/ivm.c
> > @@ -450,6 +450,16 @@ static bool hv_is_private_mmio(u64 addr)
> > return false;
> >  }
> >
> > +static void hv_print_mem_enc_feature_info(void)
> > +{
> > +   enum hv_isolation_type type = hv_get_isolation_type();
> > +
> > +   if (type == HV_ISOLATION_TYPE_SNP)
> > +   pr_info("Memory Encryption Features active: AMD
> SEV\n");
> > +   else if (type == HV_ISOLATION_TYPE_TDX)
> > +   pr_info("Memory Encryption Features active: Intel
> > TDX\n");
> > +}
> 
> If we draw this to its logical conclusion, every paravisor will need a
> pr_info() for every hardware CoCo implementation.  That M*N pr_info()s.
> That seems nuts.

This patch only modifies x86 related files. I think it's unlikely to see
a third hardware Coco implementation for x86 in the foreseeable feature (?)
When we have a third implementation, I suppose more code, e.g., the existing
print_mem_encrypt_feature_info(),  will have to be changed as well.

Currently it looks like there is only 1 paravisor implementation. 
I think we'll know if some code can be shared only when a second paravisor
implementation appears.

I can use the below version if you think it's better:

static const char *hv_mem_enc_features[] = {
[ HV_ISOLATION_TYPE_SNP ] = "AMD SEV",
[ HV_ISOLATION_TYPE_TDX ] = "Intel TDX",
};

static void hv_print_mem_enc_feature_info(void)
{
enum hv_isolation_type type = hv_get_isolation_type();

if (type < HV_ISOLATION_TYPE_SNP || type > HV_ISOLATION_TYPE_TDX)
return;

pr_info("Memory Encryption Features active:: %s\n",
hv_mem_enc_features[type]);
}

Thanks,
Dexuan

RE: [PATCH] x86/mm: Print the encryption features correctly when a paravisor is present

2023-10-20 Thread Dexuan Cui

> From: Dave Hansen 
> Sent: Friday, October 20, 2023 11:40 AM
> To: Dexuan Cui ; KY Srinivasan
> [...]
> On 10/19/23 23:01, Dexuan Cui wrote:
> > This patch only modifies x86 related files. I think it's unlikely to see
> > a third hardware Coco implementation for x86 in the foreseeable feature
> (?)
> 
> OK, then what good is this patch in the first place?  If you are right,
> then this would give equivalent information:
> 
> cat /proc/cpuinfo | grep -q Intel && echo 'TDX'
> cat /proc/cpuinfo | grep -q AMD   && echo 'SEV'
> 
> No kernel patching needed, right?

Currently arch/x86/mm/mem_encrypt.c: print_mem_encrypt_feature_info()
prints an incorrect and confusing message 
"Memory Encryption Features active: AMD SEV".
when an Intel TDX VM with a paravisor runs on Hyper-V.

So I think a kernel patch is needed.

RE: [PATCH] x86/mm: Print the encryption features correctly when a paravisor is present

2023-10-20 Thread Dexuan Cui

> From: Dave Hansen 
> Sent: Friday, October 20, 2023 1:14 PM
> To: Dexuan Cui ; KY Srinivasan
> [...]
> On 10/20/23 13:00, Dexuan Cui wrote:
> >> OK, then what good is this patch in the first place?  If you are right,
> >> then this would give equivalent information:
> >>
> >> cat /proc/cpuinfo | grep -q Intel && echo 'TDX'
> >> cat /proc/cpuinfo | grep -q AMD   && echo 'SEV'
> >>
> >> No kernel patching needed, right?
> > Currently arch/x86/mm/mem_encrypt.c:
> print_mem_encrypt_feature_info()
> > prints an incorrect and confusing message
> > "Memory Encryption Features active: AMD SEV".
> > when an Intel TDX VM with a paravisor runs on Hyper-V.
> >
> > So I think a kernel patch is needed.
> 
> How about either removing the message entirely or removing the ": AMD
> SEV" part?

It looks good to me if we can update/remove the message in
print_mem_encrypt_feature_info(), but I guess AMD folks might want to
keep the ": AMD SEV" part?

RE: [PATCH net,v4, 1/3] hv_netvsc: fix race of netvsc and VF register_netdevice

2023-11-12 Thread Dexuan Cui

> From: LKML haiyangz  On Behalf Of Haiyang Zhang
> Sent: Friday, November 10, 2023 9:39 AM
> [...]
> 
> The rtnl lock also needs to be held before rndis_filter_device_add()
> which advertises nvsp_2_vsc_capability / sriov bit, and triggers
> VF NIC offering and registering. If VF NIC finished register_netdev()
> earlier it may cause name based config failure.
> 
> To fix this issue, move the call to rtnl_lock() before
> rndis_filter_device_add(), so VF will be registered later than netvsc
> / synthetic NIC, and gets a name numbered (ethX) after netvsc.
> 
> Cc: sta...@vger.kernel.org
> Fixes: e04e7a7bbd4b ("hv_netvsc: Fix a deadlock by getting rtnl lock earlier 
> in
> netvsc_probe()")
> Reported-by: Dexuan Cui 
> Signed-off-by: Haiyang Zhang 
> Reviewed-by: Wojciech Drewek 

Reviewed-by: Dexuan Cui

RE: [PATCH net,v4, 2/3] hv_netvsc: Fix race of register_netdevice_notifier and VF register

2023-11-12 Thread Dexuan Cui

> From: LKML haiyangz  On Behalf Of Haiyang Zhang
> Sent: Friday, November 10, 2023 9:39 AM
> [...]
> If VF NIC is registered earlier, NETDEV_REGISTER event is replayed,
> but NETDEV_POST_INIT is not.
> 
> Move register_netdevice_notifier() earlier, so the call back
> function is set before probing.
> 
> Cc: sta...@vger.kernel.org
> Fixes: e04e7a7bbd4b ("hv_netvsc: Fix a deadlock by getting rtnl lock earlier 
> in
> netvsc_probe()")
> Reported-by: Dexuan Cui 
> Signed-off-by: Haiyang Zhang 
> Reviewed-by: Wojciech Drewek 

Reviewed-by: Dexuan Cui 

It's better to post a new version that follows Simon Horman's suggestion,
i.e., use a more idiomatic form for the error path.

RE: [PATCH net,v5, 3/3] hv_netvsc: Mark VF as slave before exposing it to user-mode

2023-11-20 Thread Dexuan Cui

> From: LKML haiyangz  On Behalf Of Haiyang
> Zhang
> [...]
> From: Long Li 
> 
> When a VF is being exposed form the kernel, it should be marked as
> "slave"
> before exposing to the user-mode. The VF is not usable without netvsc
> running as master. The user-mode should never see a VF without the
> "slave"
> flag.
> 
> This commit moves the code of setting the slave flag to the time before
> VF is exposed to user-mode.
> 
> Cc: sta...@vger.kernel.org
> Fixes: 0c195567a8f6 ("netvsc: transparent VF management")
> Signed-off-by: Long Li 
> Signed-off-by: Haiyang Zhang 
> ---
> v5:
> Change function name netvsc_prepare_slave() to
> netvsc_prepare_bonding().
> v4:
> Add comments in get_netvsc_byslot() explaining the need to check
> dev_addr
> v2:
> Use a new function to handle NETDEV_POST_INIT.

Acked-by: Dexuan Cui

RE: [PATCH net] hv_netvsc: Fix race condition between netvsc_probe and netvsc_remove

2024-01-30 Thread Dexuan Cui

> From: Souradeep Chakrabarti 
> Sent: Tuesday, January 30, 2024 2:16 AM
> [...]
> In commit ac5047671758 ("hv_netvsc: Disable NAPI before closing the
> VMBus channel"), napi_disable was getting called for all channels,
> including all subchannels without confirming if they are enabled or not.

s/enabled/created/

> Which caused hv_netvsc getting hung at napi_disable, when
> netvsc_probe()
> and netvsc_remove() are happening simultaneously and netvsc_remove()

Technically, they are not happening simultaneously: netvsc_probe() itself has
finished, but the work item scheduled by it has not started yet.

> calls cancel_work_sync(&nvdev->subchan_work) before netvsc_sc_open()
> calls napi_enable for the sub channels. Which causes NAPIF_STATE_SCHED

Technically, nvdev->subchan_work has not started to run yet, i.e.
netvsc_subchan_work() -> rndis_set_subchannel() has not created the
sub-channels yet, so netvsc_sc_open() can't run.

It would be great if you could briefly explain how the NAPIF_STATE_SCHED bit
is set and cleared, e.g. it's pre-set in rndis_filter_device_add() -> 
netif_napi_add()
so if the sub-channels are not created, netvsc_sc_open() -> napi_enable() won't
clear the flag and the flag remains set for ever for the sub-channels. 

> bit not getting cleared for the subchannels.
> 
> Now during netvsc_device_remove(), when napi_disable is called for those
> subchannels, napi_disable gets stuck on infinite msleep.

The patch body looks good to me. Please post v2 with an updated changelog.

Reviewed-by: Dexuan Cui

RE: [PATCH] hv_netvsc:Register VF in netvsc_probe if NET_DEVICE_REGISTER missed

2024-01-30 Thread Dexuan Cui

> From: Shradha Gupta 
> Sent: Monday, January 29, 2024 11:19 PM
>  [...]
> If hv_netvsc driver is removed and reloaded, the NET_DEVICE_REGISTER

s/removed/unloaded/
unloaded looks more accurate to me :-)

> [...]
> Tested-on: Ubuntu22
> Testcases: LISA testsuites
>  verify_reload_hyperv_modules, perf_tcp_ntttcp_sriov
IMO the 3 lines can be removed: this bug is not specific to Ubuntu, and the
test case names don't provide extra value to help understand the issue
here and they might cause more questions unnecessarily, e.g. what's LISA,
what exactly do the test cases do.

> +/* Macros to define the context of vf registration */
> +#define VF_REG_IN_PROBE  1
> +#define VF_REG_IN_RECV_CBACK 2

I think VF_REG_IN_NOTIFIER is a better name?
RECV_CBALL looks inaccurate to me.

> @@ -2205,8 +2209,11 @@ static int netvsc_vf_join(struct net_device
> *vf_netdev,
>  ndev->name, ret);
>   goto upper_link_failed;
>   }
> -
> - schedule_delayed_work(&ndev_ctx->vf_takeover,
> VF_TAKEOVER_INT);
> + /* If this registration is called from probe context vf_takeover
> +  * is taken care of later in probe itself.
I suspect "later in probe itself" is not accurate.
If 'context' is VF_REG_IN_PROBE, I suppose what happens here is:
after netvsc_probe() finishes, the netvsc interface becomes available,
so the user space will ifup it, and netvsc_open() will UP the VF interface,
and netvsc_netdev_event() is called for the VF with event ==
NETDEV_POST_INIT (?) and NETDEV_CHANGE, and the data path is
switched to the VF.

If my understanding is correct, I think in the case of 'context' ==
VF_REG_IN_PROBE, I suspect the "Align MTU of VF with master"
and the "sync address list from ndev to VF" in __netvsc_vf_setup() are
omitted? If so, should this be fixed? e.g. Not sure if the below is an issue or 
not:
1) a VF is bound to a NetVSC interface, and a user sets the MTUs to 1024.
2) rmmod hv_netvsc
3) modprobe hv_netvsc
4) the netvsc interface uses MTU=1500 (the default), and the VF still uses 1024.

> @@ -2597,6 +2604,34 @@ static int netvsc_probe(struct hv_device *dev,
>   }
> 
>   list_add(&net_device_ctx->list, &netvsc_dev_list);
> +
> + /* When the hv_netvsc driver is removed and readded, the

s/removed and readded/unloaded and reloaded/

> +  * NET_DEVICE_REGISTER for the vf device is replayed before
> probe
> +  * is complete. This is because register_netdevice_notifier() gets
> +  * registered before vmbus_driver_register() so that callback func
> +  * is set before probe and we don't miss events like
> NETDEV_POST_INIT
> +  * So, in this section we try to register each matching

Looks like there are spaces at the end of the line. We can move up a few words
from the next line :-)

s/each matching/the matching/
A NetVSC interface has only 1 matching VF device.

> +  * vf device that is present as a netdevice, knowing that it's register

s/it's/its/

> +  * call is not processed in the netvsc_netdev_notifier(as probing is
> +  * progress and get_netvsc_byslot fails).
> +  */
> + for_each_netdev(dev_net(net), vf_netdev) {
> + if (vf_netdev->netdev_ops == &device_ops)
> + continue;
> +
> + if (vf_netdev->type != ARPHRD_ETHER)
> + continue;
> +
> + if (is_vlan_dev(vf_netdev))
> + continue;
> +
> + if (netif_is_bond_master(vf_netdev))
> + continue;

The code above is duplicated from netvsc_netdev_event().
Can we add a new helper function is_matching_vf() to avoid the duplication?

> + netvsc_prepare_bonding(vf_netdev);
> + netvsc_register_vf(vf_netdev, VF_REG_IN_PROBE);
> + __netvsc_vf_setup(net, vf_netdev);

add a "break;' ?

> + }
>   rtnl_unlock();

RE: [PATCH] hv_netvsc:Register VF in netvsc_probe if NET_DEVICE_REGISTER missed

2024-01-30 Thread Dexuan Cui

> From: Haiyang Zhang 
> Sent: Tuesday, January 30, 2024 2:05 PM
> [...]
> > > @@ -2205,8 +2209,11 @@ static int netvsc_vf_join(struct net_device
> > > *vf_netdev,
> > >  ndev->name, ret);
> > >   goto upper_link_failed;
> > >   }
> > > -
> > > - schedule_delayed_work(&ndev_ctx->vf_takeover,
> > > VF_TAKEOVER_INT);
> > > + /* If this registration is called from probe context vf_takeover
> > > +  * is taken care of later in probe itself.
> > I suspect "later in probe itself" is not accurate.
> > If 'context' is VF_REG_IN_PROBE, I suppose what happens here is:
> > after netvsc_probe() finishes, the netvsc interface becomes available,
> > so the user space will ifup it, and netvsc_open() will UP the VF
> > interface,
> > and netvsc_netdev_event() is called for the VF with event ==
> > NETDEV_POST_INIT (?) and NETDEV_CHANGE, and the data path is
> > switched to the VF.
> 
> In register_netdevice(), NETDEV_POST_INIT is earlier than
> NETDEV_REGISTER.
> This case: netvsc_open >> dev_open(vf) >> NETDEV_UP >>
>  netvsc_vf_changed(event_dev, event);

I see. So there should be no issue here. Thanks for the clarification!

> > If my understanding is correct, I think in the case of 'context' ==
> > VF_REG_IN_PROBE, I suspect the "Align MTU of VF with master"
> > and the "sync address list from ndev to VF" in __netvsc_vf_setup() are
> > omitted? If so, should this be fixed? e.g. Not sure if the below is an
> > issue or not:
> > 1) a VF is bound to a NetVSC interface, and a user sets the MTUs to 1024.
> > 2) rmmod hv_netvsc
> > 3) modprobe hv_netvsc
> > 4) the netvsc interface uses MTU=1500 (the default), and the VF still
> > uses 1024.
> 
> __netvsc_vf_setup() is skipped from the netvsc_register_vf >>
> netvsc_vf_join(),
> but called from netvsc_probe(), so the VF mtu is sync-ed to 1500.
> I verified mtu sync in test.
You're correct. Sorry, I didn't notice that in the patch __netvsc_vf_setup()
now is also called from netvsc_probe().

RE: [PATCH] hv_netvsc:Register VF in netvsc_probe if NET_DEVICE_REGISTER missed

2024-01-31 Thread Dexuan Cui

> From: Haiyang Zhang 
> Sent: Wednesday, January 31, 2024 8:46 AM
>  [...]
> > From: Shradha Gupta 
> > Sent: Wednesday, January 31, 2024 2:54 AM
> > > [...]
> > > > +   netvsc_prepare_bonding(vf_netdev);
> > > > +   netvsc_register_vf(vf_netdev, VF_REG_IN_PROBE);
> > > > +   __netvsc_vf_setup(net, vf_netdev);
> > >
> > > add a "break;' ?
> > With MANA devices and multiport support there, the individual ports are
> > also net_devices.
> > Wouldn't this be needed for such scenario(where we have multiple mana
> > port net devices) to
> > register them all?
> 
> Each device has separate probe() call, so only one VF will match in one
> netvsc_probe().
> 
> netvsc_prepare_bonding() &  netvsc_register_vf() have
> get_netvsc_byslot(vf_netdev), but __netvsc_vf_setup() doesn't have. So,
> in case of multi-Vfs, this code will run "this" netvsc NIC with multiple VFs 
> by
> __netvsc_vf_setup() which isn't correct.
> 
> You need to add the following lines before
> netvsc_prepare_bonding(vf_netdev)
> in netvsc_probe() to skip non-matching VFs:
> 
> if (net != get_netvsc_byslot(vf_netdev))
>   continue;

Haiyang is correct.
I think it's still good to add a "break;", e.g. my understanding is something
like the below (this is untested):

+static struct net_device *get_matching_netvsc_dev(net_device *event_ndev)
+{
+   /* Skip NetVSC interfaces */
+   if (event_ndev->netdev_ops == &device_ops)
+   return NULL;
+
+   /* Avoid non-Ethernet type devices */
+   if (event_ndev->type != ARPHRD_ETHER)
+   return NULL;
+
+   /* Avoid Vlan dev with same MAC registering as VF */
+   if (is_vlan_dev(event_ndev))
+   return NULL;
+
+   /* Avoid Bonding master dev with same MAC registering as VF */
+   if (netif_is_bond_master(event_ndev))
+   return NULL;
+
+   return get_netvsc_byslot(event_ndev);
+}

+   for_each_netdev(dev_net(net), vf_netdev) {
+   if (get_matching_netvsc_dev(event_dev) != net)
+   continue;
+
+   netvsc_prepare_bonding(vf_netdev);
+   netvsc_register_vf(vf_netdev, VF_REG_IN_PROBE);
+   __netvsc_vf_setup(net, vf_netdev);
+
+   break;
+   }

We can also use get_matching_netvsc_dev() in netvsc_netdev_event().

BTW, please add a space between "hv_netvsc:" and "Register" in the Subject.

RE: [PATCH net,v2] hv_netvsc: Register VF in netvsc_probe if NET_DEVICE_REGISTER missed

2024-02-02 Thread Dexuan Cui

> From: Haiyang Zhang 
> Sent: Friday, February 2, 2024 9:10 AM
>  [...]
> Reviewed-by: Haiyang Zhang 

Reviewed-by: Dexuan Cui

RE: [PATCH 1/1] Drivers: hv: vmbus: Calculate ring buffer size for more efficient use of memory

2024-02-26 Thread Dexuan Cui

> From: mhkelle...@gmail.com 
> Sent: Monday, February 12, 2024 10:20 PM
>  [...]
> Fixes: c1135c7fd0e9 ("Drivers: hv: vmbus: Introduce types of GPADL")
> Signed-off-by: Michael Kelley 
> ---

Reviewed-by: Dexuan Cui 

The info below would need to be added:

Fixes: 6941f67ad37d ("hv_netvsc: Calculate correct ring size when PAGE_SIZE is 
not 4 Kbytes")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218502

RE: [PATCH net] net: mana: Fix Rx DMA datasize and skb_over_panic

2024-03-29 Thread Dexuan Cui

> From: LKML haiyangz  On Behalf Of Haiyang Zhang
> Sent: Friday, March 29, 2024 2:37 PM
> [...]
> mana_get_rxbuf_cfg() aligns the RX buffer's DMA datasize to be
> multiple of 64. So a packet slightly bigger than mtu+14, say 1536,
> can be received and cause skb_over_panic.
> 
> Sample dmesg:
> [ 5325.237162] skbuff: skb_over_panic: text:c043277a len:1536
> put:1536 head:ff1100018b517000 data:ff1100018b517100 tail:0x700
> end:0x6ea dev:
> [ 5325.243689] [ cut here ]
> [ 5325.245748] kernel BUG at net/core/skbuff.c:192!
> [ 5325.247838] invalid opcode:  [#1] PREEMPT SMP NOPTI
> [ 5325.258374] RIP: 0010:skb_panic+0x4f/0x60
> [ 5325.302941] Call Trace:
> [ 5325.304389]  
> [ 5325.315794]  ? skb_panic+0x4f/0x60
> [ 5325.317457]  ? asm_exc_invalid_op+0x1f/0x30
> [ 5325.319490]  ? skb_panic+0x4f/0x60
> [ 5325.321161]  skb_put+0x4e/0x50
> [ 5325.322670]  mana_poll+0x6fa/0xb50 [mana]
> [ 5325.324578]  __napi_poll+0x33/0x1e0
> [ 5325.326328]  net_rx_action+0x12e/0x280
> 
> As discussed internally, this alignment is not necessary. To fix
> this bug, remove it from the code. So oversized packets will be
> marked as CQE_RX_TRUNCATED by NIC, and dropped.
> 
> Cc: sta...@vger.kernel.org
> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network
> Adapter (MANA)")

While the unnecessary alignment indeed first appeared in ca9c54d2d6a5
(Apr 2021), ca9c54d2d6a5 didn't cause any crash because the only
supported MTU size was 1500, and the RX buffer was a page (which is
big enough to hold an incoming packet of a size up to 1536 bytes), and
mana_rx_skb() created a big skb of 1 page (which is big enough to
hold the incoming packet); the only issue with ca9c54d2d6a5 was that
an incoming packet of 1515~1536 bytes (if such a packet was ever
received by the NIC) was still properly delivered to the upper layer
network stack, while ideally such a packet should be dropped by the
NIC driver as we would have received CQE_RX_TRUNCATED if
ca9c54d2d6a5 hadn't done the unnecessary alignment.

So,  my understanding is that ca9c54d2d6a5 is imperfect
but it doesn't really cause any real issue.

It looks like the real issue started in the commit (Apr 12 14:16:02 2023)
2fbbd712baf1 ("net: mana: Enable RX path to handle various MTU sizes")
which introduces "rxq->alloc_size", which I think can be
smaller than rxq->datasize, and  is used in  mana_poll() --> ... ->
mana_build_skb(), which I think causes the crash because
the size of the allocated skb may not be able to hold a big
incoming packet.

Later, the commit (Apr 12 14:16:03 2023)
80f6215b450e ("net: mana: Add support for jumbo frame")
does code refactoring and introduces mana_get_rxbuf_cfg().

I suggest the Fixes tag should be updated to:
Fixes: 2fbbd712baf1 ("net: mana: Enable RX path to handle various MTU sizes")
, because if we used "Fixes: ca9c54d2d6a5", the distro vendors
would ask us: does this fix need to be backported to old kernels
that don't have 2fbbd712baf1? The fix can't apply cleanly there
and is not really needed there. The stable tree maintainers would
ask the same question.

>  drivers/net/ethernet/microsoft/mana/mana_en.c | 2 +-
>  include/net/mana/mana.h   | 1 -
>  2 files changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 59287c6e6cee..d8af5e7e15b4 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -601,7 +601,7 @@ static void mana_get_rxbuf_cfg(int mtu, u32
> *datasize, u32 *alloc_size,
> 
>   *alloc_size = mtu + MANA_RXBUF_PAD + *headroom;
> 
> - *datasize = ALIGN(mtu + ETH_HLEN, MANA_RX_DATA_ALIGN);
> + *datasize = mtu + ETH_HLEN;
>  }
> 

I suggest the Fixes tag should be updated. Otherwise the fix
looks good to me.

Reviewed-by: Dexuan Cui

RE: [PATCH net] net: mana: Fix Rx DMA datasize and skb_over_panic

2024-04-01 Thread Dexuan Cui

> From: Haiyang Zhang 
> Sent: Monday, April 1, 2024 4:21 PM
> > [...]
> > I suggest the Fixes tag should be updated. Otherwise the fix
> > looks good to me.
> 
> Thanks for the suggestion. I actually thought about this before
> submission.
> I was worried about someone back ports the jumbo frame feature,
> they may not automatically know this patch should be backported
> too. 

The jumbo frame commit (2fbbd712baf1) depends on the MTU
commit (2fbbd712baf1), so adding "Fixes: 2fbbd712baf1" (
instead of "Fixes: ca9c54d2d6a5") might make it easier for people
to notice and pick up this fix.

I'm OK if the patch remains as is. Just wanted to make  sure I
understand the issue here.

> Also, I suspect that a bigger than MTU packet may cause
> unexpected problem at NVA application.

Good point. + Wei Hu, who can check the FreeBSD MANA driver and
the DPDK MANA PMD. The code there may also need to be fixed.

> If anyone have questions on back porting, I can provide a back
> ported patch, which is just one line change.
> 
> - Haiyang

If the patch remains as is, gregkg will send us "failed to apply"
emails for v6.1.y and v5.15.y and we'll need to make a backport
for the 2 stable kernels.

With "Fixes: 2fbbd712baf1", we won't receive the "failed to apply"
emails.

Thanks,
Dexuan

RE: [PATCH v2] Add a header in ifcfg and nm keyfiles describing the owner of the files

2024-04-18 Thread Dexuan Cui

> From: Easwar Hariharan 
> Sent: Thursday, April 18, 2024 9:16 AM
> 
> On 4/18/2024 5:05 AM, Ani Sinha wrote:
> > A comment describing the source of writing the contents of the ifcfg and
> > network manager keyfiles (hyperv kvp daemon) is useful. It is valuable

s/hyperv/Hyper-V/

> > +#define CFG_HEADER "# Generated by hyperv key-value pair daemon.
> Please do not modify.\n"

s/hyperv/Hyper-V/

> Looks good to me, I'll defer to other folks on the recipient list on whether
> "hyperv" should be capitalized as HyperV or other such feedback.

It's recommended to use "Hyper-V". Wei can help fix this so I guess
there is no need to resend the patch :-)

[PATCH] PCI: Add a mutex to protect the global list pci_domain_busn_res_list

2024-04-18 Thread Dexuan Cui

There has been an effort to make the pci-hyperv driver support
async-probing to reduce the boot time. With async-probing, multiple
kernel threads can be running hv_pci_probe() -> create_root_hv_pci_bus() ->
pci_scan_root_bus_bridge() -> pci_bus_insert_busn_res() at the same time to
update the global list, causing list corruption.

Add a mutex to protect the list.

Signed-off-by: Dexuan Cui 
---
 drivers/pci/probe.c | 25 ++---
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index e19b79821dd6..1327fd820b24 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -37,6 +37,7 @@ LIST_HEAD(pci_root_buses);
 EXPORT_SYMBOL(pci_root_buses);
 
 static LIST_HEAD(pci_domain_busn_res_list);
+static DEFINE_MUTEX(pci_domain_busn_res_list_lock);
 
 struct pci_domain_busn_res {
struct list_head list;
@@ -47,14 +48,22 @@ struct pci_domain_busn_res {
 static struct resource *get_pci_domain_busn_res(int domain_nr)
 {
struct pci_domain_busn_res *r;
+   struct resource *ret;
 
-   list_for_each_entry(r, &pci_domain_busn_res_list, list)
-   if (r->domain_nr == domain_nr)
-   return &r->res;
+   mutex_lock(&pci_domain_busn_res_list_lock);
+
+   list_for_each_entry(r, &pci_domain_busn_res_list, list) {
+   if (r->domain_nr == domain_nr) {
+   ret = &r->res;
+   goto out;
+   }
+   }
 
r = kzalloc(sizeof(*r), GFP_KERNEL);
-   if (!r)
-   return NULL;
+   if (!r) {
+   ret = NULL;
+   goto out;
+   }
 
r->domain_nr = domain_nr;
r->res.start = 0;
@@ -62,8 +71,10 @@ static struct resource *get_pci_domain_busn_res(int 
domain_nr)
r->res.flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED;
 
list_add_tail(&r->list, &pci_domain_busn_res_list);
-
-   return &r->res;
+   ret = &r->res;
+out:
+   mutex_unlock(&pci_domain_busn_res_list_lock);
+   return ret;
 }
 
 /*
-- 
2.25.1

RE: [PATCH] net: mana: Fix the extra HZ in mana_hwc_send_request

2024-05-16 Thread Dexuan Cui

> From: Souradeep Chakrabarti 
> Sent: Thursday, May 16, 2024 9:06 AM
> ...
> Commit 62c1bff593b7 added an extra HZ along with msecs_to_jiffies.
> This patch fixes that.
> 
> Cc: sta...@vger.kernel.org
> Fixes: 62c1bff593b7 ("net: mana: Configure hwc timeout from hardware")
> Signed-off-by: Souradeep Chakrabarti 
> ---
Looks good to me.
Reviewed-by: Dexuan Cui

[PATCH] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

2024-05-20 Thread Dexuan Cui

When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
allocates buffers using vzalloc(), and needs to share the buffers with the
host OS by calling set_memory_decrypted(), which is not working for
vmalloc() yet. Add the support by handling the pages one by one.

Co-developed-by: Kirill A. Shutemov 
Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Michael Kelley 
Reviewed-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Rick Edgecombe 
Reviewed-by: Dave Hansen 
Acked-by: Kai Huang 
Cc: sta...@vger.kernel.org # 6.6+
Signed-off-by: Dexuan Cui 
---

This is basically a repost of the second patch of the 2023 patchset:
https://lwn.net/ml/linux-kernel/20230811214826.9609-3-de...@microsoft.com/

The first patch of the patchset got merged into mainline, but unluckily the
second patch didn't, and I kind of lost track of it. Sorry.

Changes since the previous patchset (please refer to the link above):
  Added Rick's and Dave's Reviewed-by.
  Added Kai's Acked-by.
  Removeda the test "if (offset_in_page(start) != 0)" since we know the
  'start' is page-aligned: see __set_memory_enc_pgtable().

Please review. Thanks!
Dexuan

 arch/x86/coco/tdx/tdx.c | 35 ---
 1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index c1cb90369915b..abf3cd591afd3 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -778,6 +779,19 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t 
end, bool enc)
return false;
 }
 
+static bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end,
+   bool enc)
+{
+   if (!tdx_map_gpa(start, end, enc))
+   return false;
+
+   /* shared->private conversion requires memory to be accepted before use 
*/
+   if (enc)
+   return tdx_accept_memory(start, end);
+
+   return true;
+}
+
 /*
  * Inform the VMM of the guest's intent for this physical page: shared with
  * the VMM or private to the guest.  The VMM is expected to change its mapping
@@ -785,15 +799,22 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t 
end, bool enc)
  */
 static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 {
-   phys_addr_t start = __pa(vaddr);
-   phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
+   unsigned long start = vaddr;
+   unsigned long end = start + numpages * PAGE_SIZE;
+   unsigned long step = end - start;
+   unsigned long addr;
 
-   if (!tdx_map_gpa(start, end, enc))
-   return false;
+   /* Step through page-by-page for vmalloc() mappings */
+   if (is_vmalloc_addr((void *)vaddr))
+   step = PAGE_SIZE;
 
-   /* shared->private conversion requires memory to be accepted before use 
*/
-   if (enc)
-   return tdx_accept_memory(start, end);
+   for (addr = start; addr < end; addr += step) {
+   phys_addr_t start_pa = slow_virt_to_phys((void *)addr);
+   phys_addr_t end_pa   = start_pa + step;
+
+   if (!tdx_enc_status_changed_phys(start_pa, end_pa, enc))
+   return false;
+   }
 
return true;
 }
-- 
2.25.1

[RFC PATCH] clocksource: hyper-v: Enable the tsc_page for a TDX VM in TD mode

2024-05-22 Thread Dexuan Cui

A TDX VM on Hyper-V may run in TD mode or Partitioned TD mode (L2). For
the former, the VM has not enabled the Hyper-V TSC page (which is defined
in drivers/clocksource/hyperv_timer.c: "... tsc_pg __bss_decrypted ...")
because, for such a VM, the hypervisor requires that the page should be
shared, but currently the __bss_decrypted is not working for such a VM
yet.

Hyper-V TSC page can work as a clocksource device similar to KVM pv
clock, and it's also used by the Hyper-V timer code to get the current
time: see hv_init_tsc_clocksource(), which sets the global function
pointer hv_read_reference_counter to read_hv_clock_tsc(); when
Hyper-V TSC page is not enabled, hv_read_reference_counter defaults to
be drivers/hv/hv_common.c: __hv_read_ref_counter(), which is suboptimal
as it uses the slow MSR interface to get the time info.

The attribute __bss_decrypted was added for a native SNP VM by the
commit 45f46b1ac95e ("clocksource: hyper-v: Mark hyperv tsc page unencrypted in 
sev-snp enlightened guest")

The attribute works for SNP because of the commit below
commit b3f0907c71e0 ("x86/mm: Add .bss..decrypted section to hold shared 
variables")

The attribute is not working for TDX because __startup_64() ->
sme_postprocess_startup() is not for TDX; we can't just call
set_memory_decrypted() in sme_postprocess_startup() because
sme_postprocess_startup() runs too early and set_memory_decrypted() is not
working there yet.

This RFC patch calls set_memory_decrypted() in a later place, i.e., in
start_kernel() -> setup_arch() -> init_hypervisor_platform() ->
ms_hyperv_init_platform(), so set_memory_decrypted() works there;
accordingly, mem_encrypt_free_decrypted_mem() -> set_memory_encrypted()
must be called for TDX now.

When a TDX VM runs in Partitioned TD mode (L2), the Hyper-V TSC page should
be a private page, so set_memory_decrypted() should not be called for the
page in such a VM. Introduce a global variable "tdx_partitioned_td_l2" to
handle this type of VM differently.

BTW, the 4KB Hyper-V TSC page is enabled very early in
hv_init_tsc_clocksource(), where set_memory_decrypted() is not working yet,
so we can't simply call set_memory_decrypted() in hv_init_tsc_clocksource()
for a TDX VM in TD mode, and we need to get the attribute __bss_decrypted
to work for such a VM.

The changes to arch/x86/kernel/cpu/mshyperv.c and
arch/x86/mm/mem_encrypt_amd.c are not satisfying to me. I wish there
could be a better way to support __bss_decrypted for a native TDX VM
so that a TDX VM on KVM could also benefit from __bss_decrypted, if
some one wants to use it in future. BTW, kvm_init_platform() has
similar code for SNP.

This is just a RFC patch. I apprecite your insight. Thanks!

Signed-off-by: Dexuan Cui 
---
 arch/x86/coco/core.c   | 15 +++
 arch/x86/coco/tdx/tdx.c|  2 ++
 arch/x86/hyperv/ivm.c  |  3 ++-
 arch/x86/include/asm/tdx.h |  1 +
 arch/x86/kernel/cpu/mshyperv.c |  8 ++--
 arch/x86/mm/mem_encrypt_amd.c  |  3 ++-
 6 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index b31ef2424d194..61cec405f1084 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -15,6 +15,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 enum cc_vendor cc_vendor __ro_after_init = CC_VENDOR_NONE;
@@ -25,8 +26,22 @@ static struct cc_attr_flags {
  __resv: 63;
 } cc_flags;
 
+static bool noinstr intel_cc_platform_td_l2(enum cc_attr attr)
+{
+   switch (attr) {
+   case CC_ATTR_GUEST_MEM_ENCRYPT:
+   case CC_ATTR_MEM_ENCRYPT:
+   return true;
+   default:
+   return false;
+   }
+}
+
 static bool noinstr intel_cc_platform_has(enum cc_attr attr)
 {
+   if (tdx_partitioned_td_l2)
+   return intel_cc_platform_td_l2(attr);
+
switch (attr) {
case CC_ATTR_GUEST_UNROLL_STRING_IO:
case CC_ATTR_HOTPLUG_DISABLED:
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index abf3cd591afd3..8e6ab42add7c0 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -39,6 +39,8 @@
 
 #define TDREPORT_SUBTYPE_0 0
 
+bool tdx_partitioned_td_l2 __ro_after_init;
+
 /* Called from __tdx_hypercall() for unrecoverable failure */
 noinstr void __noreturn __tdx_hypercall_failed(void)
 {
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 768d73de0d098..52cd44e507846 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -626,7 +626,7 @@ static bool hv_is_private_mmio(u64 addr)
return false;
 }
 
-void __init hv_vtom_init(void)
+void __init hv_vtom_init(void) /* TODO: rename the function for TDX */
 {
enum hv_isolation_type type = hv_get_isolation_type();
 
@@ -650,6 +650,7 @@ void __init hv_vtom_init(void)
 
case HV_ISOLATION_TYPE_TDX:
cc_vendor = CC_VENDOR_INTEL;
+   tdx_partitioned_td_

RE: [RFC PATCH] clocksource: hyper-v: Enable the tsc_page for a TDX VM in TD mode

2024-05-24 Thread Dexuan Cui

> From: Kirill A. Shutemov 
> Sent: Thursday, May 23, 2024 5:06 AM
> To: Dexuan Cui 
> [...]
> On Wed, May 22, 2024 at 07:24:41PM -0700, Dexuan Cui wrote:
> > A TDX VM on Hyper-V may run in TD mode or Partitioned TD mode (L2). For
> > the former, the VM has not enabled the Hyper-V TSC page (which is defined
> > in drivers/clocksource/hyperv_timer.c: "... tsc_pg __bss_decrypted ...")
> > because, for such a VM, the hypervisor requires that the page should be
> > shared, but currently the __bss_decrypted is not working for such a VM
> > yet.
> 
> I don't see how it is safe. It opens guest clock for manipulations form
> VMM. Could you elaborate on security implications?

The intention of the patch is not to enable Hyper-V TSC page as a clocksource
in such a VM. The default clocksource is still TSC.

The intention of the patch is to enable Hyper-V TSC page only for Hyper-V
timer. My understanding is that: "Hyper-V timer + Hyper-V TSC page" should
be as safe as "local APIC timer + TSC" because the VM needs the hypervisor
to emulate the timers, anyway. The guest may get a bad value of Hyper-V
TSC page from a malicious hypervisor, and consequently the Hyper-V timer
may fire too early or too late, but the similar thing can also happen to a local
APIC timer, if a malicious hypervisor decides to deliver the timer interrupt
too early or too late.

> > Hyper-V TSC page can work as a clocksource device similar to KVM pv
> > clock, and it's also used by the Hyper-V timer code to get the current
> > time: see hv_init_tsc_clocksource(), which sets the global function
> > pointer hv_read_reference_counter to read_hv_clock_tsc(); when
> > Hyper-V TSC page is not enabled, hv_read_reference_counter defaults to
> > be drivers/hv/hv_common.c: __hv_read_ref_counter(), which is suboptimal
> > as it uses the slow MSR interface to get the time info.
> 
> Why can't the guest just read the TSC directly? Why do we need the page?
> I am confused.
> 
>   Kiryl Shutsemau / Kirill A. Shutemov

Both Hyper-V TSC page and Hyper-V Reference Counter MSR return the
absolute time in the unit of 0.1us since the VM starts to run, i.e. the 
"frequency"
is 10M, which has a higher resolution than the emulated local APIC timer. 

Hyper-V timer depends on Hyper-V TSC page or Hyper-V Reference Counter MSR
as it also uses the absolute time in the unit of 0.1us. We could potentially 
read the
TSC and convert it to the absolute time in the unit of 0.1us, but the required 
math
calculation is not very easy and can be unreliable (e.g. I suppose an AP's TSC 
is 0
when it's just being brought up online? But the Hyper-V TSC page value on the AP
should be much bigger than 0) And, there will be an inaccuracy with the Hyper-V
side conversion that may use a slightly different math calculation (e.g. 
slightly
different TSC frequency or 'multi' or 'shift' values).

It turns out the local APIC timer in such a VM also works! So probably the VM
can just use local APIC timer and doesn't use Hyper-V timer. However, I noticed 
a
strange thing: in my 128-VP VM, each local APIC timer constantly generates
100 interrupts per second, even if the VM is idle. Trying to find out why. I do
enable tickles in my .config:

CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ=y
CONFIG_HZ_100=y
CONFIG_HZ=100

Thanks,
-- Dexuan

RE: [RFC PATCH] clocksource: hyper-v: Enable the tsc_page for a TDX VM in TD mode

2024-05-24 Thread Dexuan Cui

> From: Dave Hansen 
> Sent: Thursday, May 23, 2024 7:26 AM
> [...]
> On 5/22/24 19:24, Dexuan Cui wrote:
> ...
> > +static bool noinstr intel_cc_platform_td_l2(enum cc_attr attr)
> > +{
> > +   switch (attr) {
> > +   case CC_ATTR_GUEST_MEM_ENCRYPT:
> > +   case CC_ATTR_MEM_ENCRYPT:
> > +   return true;
> > +   default:
> > +   return false;
> > +   }
> > +}
> > +
> >  static bool noinstr intel_cc_platform_has(enum cc_attr attr)
> >  {
> > +   if (tdx_partitioned_td_l2)
> > +   return intel_cc_platform_td_l2(attr);
> > +
> > switch (attr) {
> > case CC_ATTR_GUEST_UNROLL_STRING_IO:
> > case CC_ATTR_HOTPLUG_DISABLED:
> 
> On its face, this _looks_ rather troubling.  It just hijacks all of the
> attributes.  It totally bifurcates the code.  Anything that gets added
> to intel_cc_platform_has() now needs to be considered for addition to
> intel_cc_platform_td_l2().

Maybe the bifurcation is necessary? TD mode is different from
Partitioned TD mode (L2), after all. Another reason for the bifurcation
is:  currently online/offline'ing is disallowed for a TD VM, but actually
Hyper-V is able to support CPU online/offline'ing for a TD VM in
Partitioned TD mode (L2) -- how can we allow online/offline'ing for such
a VM?

BTW, the bifurcation code is copied from amd_cc_platform_has(), where
an AMD SNP VM may run in the vTOM mode.

> > --- a/arch/x86/mm/mem_encrypt_amd.c
> > +++ b/arch/x86/mm/mem_encrypt_amd.c
> ...
> > @@ -529,7 +530,7 @@ void __init
> mem_encrypt_free_decrypted_mem(void)
> >  * CC_ATTR_MEM_ENCRYPT, aren't necessarily equivalent in a
> Hyper-V VM
> >  * using vTOM, where sme_me_mask is always zero.
> >  */
> > -   if (sme_me_mask) {
> > +   if (sme_me_mask || (cc_vendor == CC_VENDOR_INTEL
> && !tdx_partitioned_td_l2)) {
> > r = set_memory_encrypted(vaddr, npages);
> > if (r) {
> > pr_warn("failed to free unused decrypted
> pages\n");
> 
> If _ever_ there were a place for a new CC_ attribute, this would be it.
Not sure how to add a new CC attribute for the __bss_decrypted support.

For the cpu online/offline'ing support, I'm not sure how to add a new
CC attribute and not introduce the bifurcation.

> It's also a bit concerning that now we've got a (cc_vendor ==
> CC_VENDOR_INTEL) check in an amd.c file.
I agree my change here is ugly...
Currently the __bss_decrypted support is only used for SNP.
Not sure if we should get it to work for TDX as well.

> So all of that on top of Kirill's "why do we need this in the first
> place" questions leave me really scratching my head on this one.
Probably I'll just use local APIC timer in such a VM or delay enabling
Hyper-V TSC page to a later place where set_memory_decrypted()
works for me. However, I still would like to find out how to allow
CPU online/offline'ing for a TDX VM in Partitioned TD mode (L2).

Thanks,
Dexuan

[PATCH] clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor

2024-06-18 Thread Dexuan Cui

In a TDX VM without paravisor, currently the default timer is the Hyper-V
timer, which depends on the slow VM Reference Counter MSR: the Hyper-V TSC
page is not enabled in such a VM because the VM uses Invariant TSC as a
better clocksource and it's challenging to mark the Hyper-V TSC page shared
in very early boot.

Lower the rating of the Hyper-V timer so the local APIC timer becomes the
the default timer in such a VM. This change should cause no perceivable
performance difference.

Cc: sta...@vger.kernel.org # 6.6+
Signed-off-by: Dexuan Cui 
---
 arch/x86/kernel/cpu/mshyperv.c |  6 +-
 drivers/clocksource/hyperv_timer.c | 16 +++-
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index e0fd57a8ba840..745af47ca0459 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -449,9 +449,13 @@ static void __init ms_hyperv_init_platform(void)
ms_hyperv.hints &= ~HV_X64_APIC_ACCESS_RECOMMENDED;
 
if (!ms_hyperv.paravisor_present) {
-   /* To be supported: more work is required.  */
+   /* Use Invariant TSC as a better clocksource. */
ms_hyperv.features &= 
~HV_MSR_REFERENCE_TSC_AVAILABLE;
 
+   /* Use the Ref Counter in case Invariant TSC is 
unavailable. */
+   if (!(ms_hyperv.features & 
HV_ACCESS_TSC_INVARIANT))
+   pr_warn("Hyper-V: Invariant TSC is 
unavailable\n");
+
/* HV_MSR_CRASH_CTL is unsupported. */
ms_hyperv.misc_features &= 
~HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
 
diff --git a/drivers/clocksource/hyperv_timer.c 
b/drivers/clocksource/hyperv_timer.c
index b2a080647e413..99177835cadec 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -137,7 +137,21 @@ static int hv_stimer_init(unsigned int cpu)
ce->name = "Hyper-V clockevent";
ce->features = CLOCK_EVT_FEAT_ONESHOT;
ce->cpumask = cpumask_of(cpu);
-   ce->rating = 1000;
+
+   /*
+* Lower the rating of the Hyper-V timer in a TDX VM without paravisor,
+* so the local APIC timer (lapic_clockevent) is the default timer in
+* such a VM. The Hyper-V timer is not preferred in such a VM because
+* it depends on the slow VM Reference Counter MSR (the Hyper-V TSC
+* page is not enbled in such a VM because the VM uses Invariant TSC
+* as a better clocksource and it's challenging to mark the Hyper-V
+* TSC page shared in very early boot).
+*/
+   if (!ms_hyperv.paravisor_present && hv_isolation_type_tdx())
+   ce->rating = 90;
+   else
+   ce->rating = 1000;
+
ce->set_state_shutdown = hv_ce_shutdown;
ce->set_state_oneshot = hv_ce_set_oneshot;
ce->set_next_event = hv_ce_set_next_event;
-- 
2.25.1

RE: [PATCH] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

2024-06-18 Thread Dexuan Cui

> From: Dexuan Cui 
> Sent: Monday, May 20, 2024 7:13 PM
> []
> When a TDX guest runs on Hyper-V, the hv_netvsc driver's
> netvsc_init_buf()
> allocates buffers using vzalloc(), and needs to share the buffers with the
> host OS by calling set_memory_decrypted(), which is not working for
> vmalloc() yet. Add the support by handling the pages one by one.
> 
> Co-developed-by: Kirill A. Shutemov 
> Signed-off-by: Kirill A. Shutemov 
> Reviewed-by: Michael Kelley 
> Reviewed-by: Kuppuswamy Sathyanarayanan
> 
> Reviewed-by: Rick Edgecombe 
> Reviewed-by: Dave Hansen 
> Acked-by: Kai Huang 
> Cc: sta...@vger.kernel.org # 6.6+
> Signed-off-by: Dexuan Cui 
> ---
> 
> This is basically a repost of the second patch of the 2023 patchset:
> https://lwn.net/ml/linux-kernel/20230811214826.9609-3-de...@microsoft.com/
> 
> The first patch of the patchset got merged into mainline, but unluckily the
> second patch didn't, and I kind of lost track of it. Sorry.
> 
> Changes since the previous patchset (please refer to the link above):
>   Added Rick's and Dave's Reviewed-by.
>   Added Kai's Acked-by.
>   Removeda the test "if (offset_in_page(start) != 0)" since we know the
>   'start' is page-aligned: see __set_memory_enc_pgtable().
> 
> Please review. Thanks!
> Dexuan

The patch still applies cleanly to 6.10-rc4.

A gentle ping.

RE: [PATCH] clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor

2024-06-20 Thread Dexuan Cui

> From: Michael Kelley 
> Sent: Thursday, June 20, 2024 2:19 PM
> To: Dexuan Cui ; KY Srinivasan
> [...]
> > --- a/arch/x86/kernel/cpu/mshyperv.c
> > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > @@ -449,9 +449,13 @@ static void __init ms_hyperv_init_platform(void)
> > ms_hyperv.hints &=
> ~HV_X64_APIC_ACCESS_RECOMMENDED;
> >
> > if (!ms_hyperv.paravisor_present) {
> > -   /* To be supported: more work is required.
> */
> > +   /* Use Invariant TSC as a better
> clocksource. */
> 
> I got confused by this comment, partly because I've forgotten the
> meaning of the ms_hyperv.feature flags. :-( Perhaps you could be
> more explicit in the comment and say "Mark the Hyper-V TSC page
> feature as disabled in a TDX VM so that the Invariant TSC, which is
> a better clocksource anyway, is used instead."
> 
> > ms_hyperv.features &=
> ~HV_MSR_REFERENCE_TSC_AVAILABLE;
> >
> > +   /* Use the Ref Counter in case Invariant
> TSC is unavailable. */
> > +   if (!(ms_hyperv.features &
> HV_ACCESS_TSC_INVARIANT))
> > +   pr_warn("Hyper-V: Invariant TSC is
> unavailable\n");
> 
> The above comment was even more confusing, because the code block
> doesn't do anything except print a message. The code doesn't force
> the use of the Ref Counter. I'd suggest something like: "The Invariant
> TSC is expected to be available, but if not, print a warning message.
> The slower Hyper-V MSR-based Ref Counter should end up being
> the clocksource."
> 
> Michael

Thanks for the good "comments"! :-)

I'm going to post v2 with the change below.

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index e0fd57a8ba840..954b7cbfa2f02 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -449,9 +449,23 @@ static void __init ms_hyperv_init_platform(void)
ms_hyperv.hints &= ~HV_X64_APIC_ACCESS_RECOMMENDED;

if (!ms_hyperv.paravisor_present) {
-   /* To be supported: more work is required.  */
+   /*
+* Mark the Hyper-V TSC page feature as disabled
+* in a TDX VM without paravisor so that the
+* Invariant TSC, which is a better clocksource
+* anyway, is used instead.
+*/
ms_hyperv.features &= 
~HV_MSR_REFERENCE_TSC_AVAILABLE;

+   /*
+* The Invariant TSC is expected to be available
+* in a TDX VM without paravisor, but if not,
+* print a warning message. The slower Hyper-V 
MSR-based
+* Ref Counter should end up being the 
clocksource.
+*/
+   if (!(ms_hyperv.features & 
HV_ACCESS_TSC_INVARIANT))
+   pr_warn("Hyper-V: Invariant TSC is 
unavailable\n");
+
/* HV_MSR_CRASH_CTL is unsupported. */
ms_hyperv.misc_features &= 
~HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;

RE: [PATCH] Drivers: hv: Remove deprecated hv_fcopy declarations

2024-06-20 Thread Dexuan Cui

> From: Rachel Menge 
> Sent: Thursday, June 20, 2024 3:51 PM
> [...]
LGTM
Reviewed-by: Dexuan Cui

[PATCH v2] clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor

2024-06-20 Thread Dexuan Cui

In a TDX VM without paravisor, currently the default timer is the Hyper-V
timer, which depends on the slow VM Reference Counter MSR: the Hyper-V TSC
page is not enabled in such a VM because the VM uses Invariant TSC as a
better clocksource and it's challenging to mark the Hyper-V TSC page shared
in very early boot.

Lower the rating of the Hyper-V timer so the local APIC timer becomes the
the default timer in such a VM, and print a warning in case Invariant TSC
is unavailable in such a VM. This change should cause no perceivable
performance difference.

Cc: sta...@vger.kernel.org # 6.6+
Reviewed-by: Roman Kisel 
Signed-off-by: Dexuan Cui 
---

Changes in v2:
Improved the comments in ms_hyperv_init_platform() [Michael Kelley]
Added "print a warning in case Invariant TSC  unavailable" in the changelog.
Added Roman's Reviewed-by.

 arch/x86/kernel/cpu/mshyperv.c | 16 +++-
 drivers/clocksource/hyperv_timer.c | 16 +++-
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index e0fd57a8ba840..954b7cbfa2f02 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -449,9 +449,23 @@ static void __init ms_hyperv_init_platform(void)
ms_hyperv.hints &= ~HV_X64_APIC_ACCESS_RECOMMENDED;
 
if (!ms_hyperv.paravisor_present) {
-   /* To be supported: more work is required.  */
+   /*
+* Mark the Hyper-V TSC page feature as disabled
+* in a TDX VM without paravisor so that the
+* Invariant TSC, which is a better clocksource
+* anyway, is used instead.
+*/
ms_hyperv.features &= 
~HV_MSR_REFERENCE_TSC_AVAILABLE;
 
+   /*
+* The Invariant TSC is expected to be available
+* in a TDX VM without paravisor, but if not,
+* print a warning message. The slower Hyper-V 
MSR-based
+* Ref Counter should end up being the 
clocksource.
+*/
+   if (!(ms_hyperv.features & 
HV_ACCESS_TSC_INVARIANT))
+   pr_warn("Hyper-V: Invariant TSC is 
unavailable\n");
+
/* HV_MSR_CRASH_CTL is unsupported. */
ms_hyperv.misc_features &= 
~HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
 
diff --git a/drivers/clocksource/hyperv_timer.c 
b/drivers/clocksource/hyperv_timer.c
index b2a080647e413..99177835cadec 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -137,7 +137,21 @@ static int hv_stimer_init(unsigned int cpu)
ce->name = "Hyper-V clockevent";
ce->features = CLOCK_EVT_FEAT_ONESHOT;
ce->cpumask = cpumask_of(cpu);
-   ce->rating = 1000;
+
+   /*
+* Lower the rating of the Hyper-V timer in a TDX VM without paravisor,
+* so the local APIC timer (lapic_clockevent) is the default timer in
+* such a VM. The Hyper-V timer is not preferred in such a VM because
+* it depends on the slow VM Reference Counter MSR (the Hyper-V TSC
+* page is not enbled in such a VM because the VM uses Invariant TSC
+* as a better clocksource and it's challenging to mark the Hyper-V
+* TSC page shared in very early boot).
+*/
+   if (!ms_hyperv.paravisor_present && hv_isolation_type_tdx())
+   ce->rating = 90;
+   else
+   ce->rating = 1000;
+
ce->set_state_shutdown = hv_ce_shutdown;
ce->set_state_oneshot = hv_ce_set_oneshot;
ce->set_next_event = hv_ce_set_next_event;
-- 
2.25.1

RE: [PATCH] PCI: hv: fix reading of PCI_INTERRUPT_LINE and PCI_INTERRUPT_PIN

2024-06-21 Thread Dexuan Cui

From: Jake Oshins  
Sent: Friday, June 21, 2024 9:51 AM
> [...]
>On Fri, Jun 21, 2024 at 06:19:05AM +, Wei Liu wrote:
> On Fri, Jun 21, 2024 at 03:15:19AM +, Michael Kelley wrote:
> > From: Wei Liu  Sent: Thursday, June 20, 2024 
> > 6:48 PM
> > >
> > > The intent of the code snippet is to always return 0 for both fields.
> > > The check is wrong though. Fix that.
> > >
> > > This is discovered by this call in VFIO:
> > >
> > > pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
> > >
> > > The old code does not set *val to 0 because the second half of the check 
> > > is
> > > incorrect.

Hi Wei, so you got a non-zero 'pin' value returned by the host when the guest 
reads
from the MMIO config page. What's the consequence? Will VFIO try to use the 
legacy INTx 
rather than MSI/MSI-X? I'm curious how you noticed the bug. I'm also curious 
why the
host doesn't return 0 for the 'PIN' register when the guest reads it from the 
config page.

>  I believe that this fix is correct.  (And I'm frankly surprised that this 
> bug didn't
> cause a problem before this.  It's been there since I first wrote the code.)
> -- Jake Oshins

I suppose it didn't cause any issue because the PCI device drivers use 
MSI/MSI-X,
so they don't care about the values of the 'PIN' and 'LINE' registers.

Thanks,
Dexuan

RE: [PATCH] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

2024-06-28 Thread Dexuan Cui

> From: Kirill A. Shutemov 
> Sent: Friday, June 28, 2024 3:05 AM
> To: Dexuan Cui 
> >   [...]
> >  static bool tdx_enc_status_changed(unsigned long vaddr, int numpages,
>  [...]
> This patch collied with kexec changes. tdx_kexec_finish() calls
> tdx_enc_status_changed() after clearing pte, so slow_virt_to_phys()
> crashes on in.
> 
> Daxuan, could you check if the fixup below works for you on vmalloc
> addresses?
> 
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index ef8ec2425998..5e455c883bcc 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -813,8 +813,15 @@ static bool tdx_enc_status_changed(unsigned
> long vaddr, int numpages, bool enc)
>   step = PAGE_SIZE;
> 
>   for (addr = start; addr < end; addr += step) {
> - phys_addr_t start_pa = slow_virt_to_phys((void *)addr);
> - phys_addr_t end_pa   = start_pa + step;
> + phys_addr_t start_pa;
> + phys_addr_t end_pa;
> +
> + if (virt_addr_valid(addr))
> + start_pa = __pa(addr);
> + else
> + start_pa = slow_virt_to_phys((void *)addr);
> +
> + end_pa = start_pa + step;
> 
>   if (!tdx_enc_status_changed_phys(start_pa, end_pa, enc))
>   return false;
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Hi Kirill, your fixup works for me.

BTW, I just realized that virt_addr_valid() returns false for a vmalloc'd 
address.

Thanks,
Dexuan

RE: [PATCH] Drivers: hv: vmbus: Deferring per cpu tasks

2024-07-25 Thread Dexuan Cui

> From: Saurabh Singh Sengar 
> Sent: Thursday, July 25, 2024 8:35 AM
> Subject: Re: [PATCH] Drivers: hv: vmbus: Deferring per cpu tasks

Without the patch, I think the current CPU uses IPIs to let the other
CPUs, one by one,  run the function calls, and synchronously waits
for the function calls to finish.

IMO the patch is not "Deferring per cpu tasks". "Defer" means "let it
happen later". Here it schedules work items to different CPUs, and
the work items immediately start to run on these CPUs.

I would suggest a more accurate subject:
Drivers: hv: vmbus: Run hv_synic_init() concurrently

> - ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> "hyperv/vmbus:online",
> - hv_synic_init, hv_synic_cleanup);
> + cpus_read_lock();
> + for_each_online_cpu(cpu) {
> + struct work_struct *work = per_cpu_ptr(works, cpu);
> +
> + INIT_WORK(work, vmbus_percpu_work);
> + schedule_work_on(cpu, work);
> + }
> +
> + for_each_online_cpu(cpu)
> + flush_work(per_cpu_ptr(works, cpu));
> +

Can you please add a comment to explain we need this for CPU online/offline'ing:
> + ret = __cpuhp_setup_state_cpuslocked(CPUHP_AP_ONLINE_DYN,
> "hyperv/vmbus:online", false,
> +  hv_synic_init, hv_synic_cleanup,
> false);
> + cpus_read_unlock();

Add an empty line here to make it slightly more readable? :-)
> + free_percpu(works);
>   if (ret < 0)
>   goto err_alloc;

Thanks,
Dexuan

RE: [PATCH] Drivers: hv: vmbus: Deferring per cpu tasks

2024-07-25 Thread Dexuan Cui

> From: Saurabh Singh Sengar 
> Sent: Thursday, July 25, 2024 10:27 PM
> [...]
> > > + ret = __cpuhp_setup_state_cpuslocked(CPUHP_AP_ONLINE_DYN,
> > > "hyperv/vmbus:online", false,
> > > +  hv_synic_init, hv_synic_cleanup,
> > > false);
> > > + cpus_read_unlock();
> >
> > Add an empty line here to make it slightly more readable? :-)
> 
> My personal preference was to have empty line as well here, but then I
> looked the
> other places in this file where we used cpus_read_unlock, hence I
> maintained that style consistent.
> 
> Please let me know if you have strong opinion about this empty line, I can
> add in V2.
> 
> - Saurabh

I have no strong opinion here :-)

RE: [PATCH v2] Drivers: hv: vmbus: Optimize boot time by concurrent execution of hv_synic_init()

2024-07-29 Thread Dexuan Cui

> From: Saurabh Sengar 
> Sent: Monday, July 29, 2024 12:57 AM
>  [...]
> + /* register the callback for hotplug CPUs */
> + ret =
> cpuhp_setup_state_nocalls_cpuslocked(CPUHP_AP_ONLINE_DYN,
> "hyperv/vmbus:online",

AFAIK, Hyper-V doesn't support vCPU "hotplug" for VMs; it does
support vCPU online/offline'ing.

To be more accurate, I suggested the comment below instead.
/* Register the callbacks for possible CPU online/offline'ing */

Otherwise, LGTM
Reviewed-by: Dexuan Cui

RE: [PATCH net] net: mana: Fix doorbell out of order violation and avoid unnecessary doorbell rings

2024-08-06 Thread Dexuan Cui

> From: lon...@linuxonhyperv.com 
> Sent: Monday, August 5, 2024 4:38 PM
> [...]
> After napi_complete_done() is called, another NAPI may be running on
> another CPU and ring the doorbell before the current CPU does. When

Can you please share more details about "another NAPI"? Is it about busy_poll?

> combined with unnecessary rings when there is no need to ARM the CQ, this
> triggers error paths in the hardware.
>
> Fix this by always ring the doorbell in sequence and avoid unnecessary
> rings.

I'm not sure what "error paths in the hardware" means. It's better to describe
the user-visible consequence.

Maybe this is clearer:

When there is no need to arm the CQ from NAPI's perspective, the driver must
not combine "too many" arming operations due to a MANA hardware requirement:
the driver must ring the doorbell at least once within every 8 wraparounds of 
the CQ,
otherwise "XXX" would happen. //Dexuan: I don't know what the "XXX" is

Add a per-CQ counter cq->work_done_since_doorbell, and make sure the CQ is
armed within 4 wraparounds of the CQ. //Dexuan: why not 8 or 7?

 
> + if (w < cq->budget) {
> + mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
> + cq->work_done_since_doorbell = 0;
> + napi_complete_done(&cq->napi, w);
> + } else if (cq->work_done_since_doorbell >
> +cq->gdma_cq->queue_size / COMP_ENTRY_SIZE * 4) {
> + /* MANA hardware requires at least one doorbell ring every 8
s/ring every 8/arming within every 8/ ?

> +  * wraparounds of CQ even there is no need to ARM. This
> driver

s/ARM/arming/ ?
s/even/even if/ ?

Thanks,
Dexuan

RE: [PATCH v2] tools/hv: Add memory allocation check in hv_fcopy_start

2024-08-30 Thread Dexuan Cui

> From: Zhu Jun 
> Sent: Wednesday, August 28, 2024 7:45 PM
> @@ -296,6 +296,18 @@ static int hv_fcopy_start(struct hv_start_fcopy
> *smsg_in)
> file_name = (char *)malloc(file_size * sizeof(char));
> path_name = (char *)malloc(path_size * sizeof(char));
> 
> +if (!file_name) {
> +free(file_name);
> +syslog(LOG_ERR, "Can't allocate file_name memory!");
> +exit(EXIT_FAILURE);
> +}
> +
> +if (!path_name) {
> +free(path_name);
> +syslog(LOG_ERR, "Can't allocate path_name memory!");
> +exit(EXIT_FAILURE);
> +}

If we're calling exit() just 2 lines later, it doesn't make a lot of sense
to call free().

How about this:

@@ -296,6 +296,12 @@ static int hv_fcopy_start(struct hv_start_fcopy *smsg_in)
file_name = (char *)malloc(file_size * sizeof(char));
path_name = (char *)malloc(path_size * sizeof(char));

+   if (!file_name || !path_name) {
+   free(file_name);
+   free(path_name);
+  syslog(LOG_ERR, "Can't allocate memory for file name and/or path 
name");
+   return HV_E_FAIL;
+   }

Note: free(NULL) is valid (refer to "man 3 free").

RE: [PATCH v2] tools/hv: Add memory allocation check in hv_fcopy_start

2024-09-06 Thread Dexuan Cui

> From: Saurabh Singh Sengar 
> Sent: Wednesday, September 4, 2024 10:21 PM
>  [...]
> hv_fcopy_send_data is the parent function which calls hv_fcopy_start.
> Possibly a good solution is to check the return value from
> hv_fcopy_send_data

The return value of hv_fcopy_send_data() is saved into icmsghdr->status, which
is sent to the host, so the PowerShell command on the host will report an error
immediately.

> in fcopy_pkt_process function. Otherwise I prefer exit over returning error.
> 
> And as you rightly pointed out if we use exit, there is no sense of using 
> free.
> 
> - Saurabh

exit() here is not ideal in that the host doesn't know what's going on inside
the VM, and I guess the host PowerShell command will time out after quite
a while. Without exit(), the daemon can continue to run to accept the next
requests from the host; with exit(), we rely on systemd's Restart=on-failure.
I prefer not to call exit() here.

Thanks,
Dexuan

RE: [PATCH v3] tools/hv: Add memory allocation check in hv_fcopy_start

2024-09-06 Thread Dexuan Cui

> From: Zhu Jun 
> Sent: Friday, September 6, 2024 2:14 AM

Reviewed-by: Dexuan Cui

[PATCH] Drivers: hv: kvp/vss: Avoid accessing a ringbuffer not initialized yet

2024-09-09 Thread Dexuan Cui

If the KVP (or VSS) daemon starts before the VMBus channel's ringbuffer is
fully initialized, we can hit the panic below:

hv_utils: Registering HyperV Utility Driver
hv_vmbus: registering driver hv_utils
...
BUG: kernel NULL pointer dereference, address: 
CPU: 44 UID: 0 PID: 2552 Comm: hv_kvp_daemon Tainted: G E 6.11.0-rc3+ #1
RIP: 0010:hv_pkt_iter_first+0x12/0xd0
Call Trace:
...
 vmbus_recvpacket
 hv_kvp_onchannelcallback
 vmbus_on_event
 tasklet_action_common
 tasklet_action
 handle_softirqs
 irq_exit_rcu
 sysvec_hyperv_stimer0
 
 
 asm_sysvec_hyperv_stimer0
...
 kvp_register_done
 hvt_op_read
 vfs_read
 ksys_read
 __x64_sys_read

This can happen because the KVP/VSS channel callback can be invoked
even before the channel is fully opened:
1) as soon as hv_kvp_init() -> hvutil_transport_init() creates
/dev/vmbus/hv_kvp, the kvp daemon can open the device file immediately and
register itself to the driver by writing a message KVP_OP_REGISTER1 to the
file (which is handled by kvp_on_msg() ->kvp_handle_handshake()) and
reading the file for the driver's response, which is handled by
hvt_op_read(), which calls hvt->on_read(), i.e. kvp_register_done().

2) the problem with kvp_register_done() is that it can cause the
channel callback to be called even before the channel is fully opened,
and when the channel callback is starting to run, util_probe()->
vmbus_open() may have not initialized the ringbuffer yet, so the
callback can hit the panic of NULL pointer dereference.

To reproduce the panic consistently, we can add a "ssleep(10)" for KVP in
__vmbus_open(), just before the first hv_ringbuffer_init(), and then we
unload and reload the driver hv_utils, and run the daemon manually within
the 10 seconds.

Fix the panic by checking the channel state in the channel callback.
To avoid the race condition with __vmbus_open(), we disable and enable
the channel callback temporarily in __vmbus_open().

The channel callbacks of the other VMBus devices don't need to check
the channel state since they can't run before the channels are fully
initialized.

Note: we would also need to fix the fcopy driver code, but that has
been removed in commit ec314f61e4fc ("Drivers: hv: Remove fcopy driver") in
the mainline kernel since v6.10. For old 6.x LTS kernels, and the 5.x
and 4.x LTS kernels, the fcopy driver needs to be fixed when the
fix is backported to the stable kernel branches.

Fixes: e0fa3e5e7df6 ("Drivers: hv: utils: fix a race on userspace daemons 
registration")
Cc: sta...@vger.kernel.org
Signed-off-by: Dexuan Cui 
---
 drivers/hv/channel.c | 11 +++
 drivers/hv/hv_kvp.c  |  3 +++
 drivers/hv/hv_snapshot.c |  3 +++
 3 files changed, 17 insertions(+)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index fb8cd8469328..685e407a3fdf 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -657,6 +657,14 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
return -ENOMEM;
}
 
+   /*
+* The channel callbacks of KVP/VSS may run before __vmbus_open()
+* finishes (see kvp_register_done() -> ... -> kvp_poll_wrapper()), so
+* they check newchannel->state to tell the ringbuffer has been fully
+* initialized or not. Disable and enable the tasklet to avoid the race.
+*/
+   tasklet_disable(&newchannel->callback_event);
+
newchannel->state = CHANNEL_OPENING_STATE;
newchannel->onchannel_callback = onchannelcallback;
newchannel->channel_callback_context = context;
@@ -750,6 +758,8 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
}
 
newchannel->state = CHANNEL_OPENED_STATE;
+   tasklet_enable(&newchannel->callback_event);
+
kfree(open_info);
return 0;
 
@@ -766,6 +776,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
hv_ringbuffer_cleanup(&newchannel->inbound);
vmbus_free_requestor(&newchannel->requestor);
newchannel->state = CHANNEL_OPEN_STATE;
+   tasklet_enable(&newchannel->callback_event);
return err;
 }
 
diff --git a/drivers/hv/hv_kvp.c b/drivers/hv/hv_kvp.c
index d35b60c06114..ec098067e579 100644
--- a/drivers/hv/hv_kvp.c
+++ b/drivers/hv/hv_kvp.c
@@ -662,6 +662,9 @@ void hv_kvp_onchannelcallback(void *context)
if (kvp_transaction.state > HVUTIL_READY)
return;
 
+   if (channel->state != CHANNEL_OPENED_STATE)
+   return;
+
if (vmbus_recvpacket(channel, recv_buffer, HV_HYP_PAGE_SIZE * 4, 
&recvlen, &requestid)) {
pr_err_ratelimited("KVP request received. Could not read into 
recv buf\n");
return;
diff --git a/drivers/hv/hv_snapshot.c b/drivers/hv/hv_snapshot.c
index 0d2184be1691..f7924c2fc62e 100644
--- a/drivers/hv/hv_snapshot.c
+++ b/drivers/hv/hv_sna

[PATCH] tools: hv: Fix a complier warning in the fcopy uio daemon

2024-09-09 Thread Dexuan Cui

hv_fcopy_uio_daemon.c:436:53: warning: '%s' directive output may be truncated
writing up to 14 bytes into a region of size 10 [-Wformat-truncation=]
  436 |  snprintf(uio_dev_path, sizeof(uio_dev_path), "/dev/%s", uio_name);

Also added 'static' for the array 'desc[]'.

Fixes: 82b0945ce2c2 ("tools: hv: Add new fcopy application based on uio driver")
Cc: sta...@vger.kernel.org # 6.10+
Signed-off-by: Dexuan Cui 
---
 tools/hv/hv_fcopy_uio_daemon.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/tools/hv/hv_fcopy_uio_daemon.c b/tools/hv/hv_fcopy_uio_daemon.c
index 3ce316cc9f97..f7741af08a79 100644
--- a/tools/hv/hv_fcopy_uio_daemon.c
+++ b/tools/hv/hv_fcopy_uio_daemon.c
@@ -35,8 +35,6 @@
 #define WIN8_SRV_MINOR 1
 #define WIN8_SRV_VERSION   (WIN8_SRV_MAJOR << 16 | WIN8_SRV_MINOR)
 
-#define MAX_FOLDER_NAME15
-#define MAX_PATH_LEN   15
 #define FCOPY_UIO  
"/sys/bus/vmbus/devices/eb765408-105f-49b6-b4aa-c123b64d17d4/uio"
 
 #define FCOPY_VER_COUNT1
@@ -51,7 +49,7 @@ static const int fw_versions[] = {
 
 #define HV_RING_SIZE   0x4000 /* 16KB ring buffer size */
 
-unsigned char desc[HV_RING_SIZE];
+static unsigned char desc[HV_RING_SIZE];
 
 static int target_fd;
 static char target_fname[PATH_MAX];
@@ -402,8 +400,8 @@ int main(int argc, char *argv[])
struct vmbus_br txbr, rxbr;
void *ring;
uint32_t len = HV_RING_SIZE;
-   char uio_name[MAX_FOLDER_NAME] = {0};
-   char uio_dev_path[MAX_PATH_LEN] = {0};
+   char uio_name[NAME_MAX] = {0};
+   char uio_dev_path[PATH_MAX] = {0};
 
static struct option long_options[] = {
{"help",no_argument,   0,  'h' },
-- 
2.25.1

RE: [PATCH] Drivers: hv: kvp/vss: Avoid accessing a ringbuffer not initialized yet

2024-10-30 Thread Dexuan Cui

> From: Michael Kelley 
> Sent: Tuesday, October 29, 2024 4:45 PM
> [...]
> An alternate approach occurs to me. util_probe() does these three
> things in order:
> 
> 1) Allocates the receive buffer
> 2) Calls the util_init() function, which for KVP and VSS creates the char dev
> 3) Sets up the VMBus channel, including calling vmbus_open()
> 
> What if the order of #2 and #3 were swapped in util_probe()? I
> don't immediately see any interdependency between #2 and #3
> for KVP and VSS, nor for Shutdown and Timesync. With the swap,
> the VMBus channel would be fully open by the time the /dev entry
> appears and the user space daemon can do anything.
> 
> I haven't though too deeply about this, so maybe there's a problem
> somewhere. But if not, it seems a lot cleaner.
> 
> Michael

I think #3 depends on #2, e.g. hv_kvp_init() sets the channel's
preferred max_pkt_size, which is tested later in __vmbus_open().

Another example of dependency is: hv_timesync_init() initializes
host_ts.lock and adj_time_work, which are used by
timesync_onchannelcallback() -> adj_guesttime().
Note: the channel callback can be already running before
vmbus_open() returns.

Thanks,
Dexuan

RE: [PATCH] Drivers: hv: kvp/vss: Avoid accessing a ringbuffer not initialized yet

2024-11-01 Thread Dexuan Cui

> From: Michael Kelley 
> Sent: Thursday, October 31, 2024 6:39 PM
> > > From: Michael Kelley 
> > > Sent: Wednesday, October 30, 2024 5:12 PM
> > > [...]
> > > What do you think about this (compile tested only), which splits the
> > > "init" function into two parts for devices that have char devs? I'm
> > > trying to avoid adding yet another synchronization point by just
> > > doing the init operations in the right order -- i.e., don't create the
> > > user space /dev entry until the VMBus channel is ready.
> > >
> > > Michael
> >
> > Thanks, I think this works! This is a better fix.

Michael, will you post a formal patch or want me to do it?
Either works for me.

> > > + if (srv->util_init_transport) {
> > > + ret = srv->util_init_transport();
> > > + if (ret) {
> > > + ret = -ENODEV;
> > IMO we don't need the line above, since the 'ret' from
> > srv->util_init_transport()  is already a standard error code.
> 
> I was just now looking at call_driver_probe(), and it behaves
> slightly differently for ENODEV and ENXIO vs. other error
> codes. ENODEV and ENXIO don't output a message to the
> console unless debugging is enabled, while other error codes
> always output a message to the console. Forcing the error to
> ENODEV has been there since the util_probe() code came out
> of staging in year 2011. But I don't really have a preference
> either way.

util_probe() is called by vmbus_probe(), which uses pr_err() to print
the 'ret'. If the 'ret' is forced to ENODEV, the message in vmbus_probe()
may be a little misleading since the real error code is hidden,
especially when srv->util_init_transport() doesn't print any error
message.

vmbus_probe() is called by call_driver_probe. I guess originally
KY wanted to use ENODEV to avoid the extra message for the util
devices in call_driver_probe() in non-debugging mode, but the other
VSC drivers don't follow this usage.

util_probe() can return a non-ENODEV error code anyway, e.g.
ENOMEM and whatever error code from vmbus_open(). IMO,
srv->util_init and srv->util_init_transport should not be treated
specially.

IMO it's better to not add new code to force the 'ret' to
ENODEV, and we'd want to clean up the existing use of ENODEV
in util_probe().
 
Thanks,
Dexuan

RE: [PATCH] Drivers: hv: kvp/vss: Avoid accessing a ringbuffer not initialized yet

2024-11-01 Thread Dexuan Cui

> From: Michael Kelley 
> Sent: Friday, November 1, 2024 2:11 PM
> > Michael, will you post a formal patch or want me to do it?
> > Either works for me.
> 
> I can do it. You probably have more pressing issues to keep
> you busy  :-)
Thanks a lot! :-)

> Fair enough. I'll do it that way.
> 
> Michael

Thanks!

RE: [PATCH] Drivers: hv: kvp/vss: Avoid accessing a ringbuffer not initialized yet

2024-10-31 Thread Dexuan Cui

> From: Michael Kelley 
> Sent: Wednesday, October 30, 2024 5:12 PM
> [...]
> What do you think about this (compile tested only), which splits the
> "init" function into two parts for devices that have char devs? I'm
> trying to avoid adding yet another synchronization point by just
> doing the init operations in the right order -- i.e., don't create the
> user space /dev entry until the VMBus channel is ready.
> 
> Michael

Thanks, I think this works! This is a better fix.

> + if (srv->util_init_transport) {
> + ret = srv->util_init_transport();
> + if (ret) {
> + ret = -ENODEV;
IMO we don't need the line above, since the 'ret' from 
srv->util_init_transport()  is already a standard error code.

BTW, I noticed that the line "ret = -ENODEV;"
if (srv->util_init) {
ret = srv->util_init(srv);
if (ret) {
ret = -ENODEV;
goto error1;
}
}
I think we don't really need that line, either. 
The existing 4 .util_init callbacks also already return a
standard error code. We can make a separate patch to clean
that up.

Thanks,
Dexuan

RE: [PATCH] tools: hv: Fix a complier warning in the fcopy uio daemon

2024-09-20 Thread Dexuan Cui

> From: Saurabh Singh Sengar 
> Sent: Friday, September 13, 2024 12:31 AM
> To: Dexuan Cui 
> Cc: KY Srinivasan ; Haiyang Zhang
> ; Wei Liu ; Long Li
> ; Greg Kroah-Hartman
> ; open list:Hyper-V/Azure CORE AND DRIVERS
> ; open list ;
> sta...@vger.kernel.org
> Subject: Re: [PATCH] tools: hv: Fix a complier warning in the fcopy uio
> daemon
> 
> On Tue, Sep 10, 2024 at 12:44:32AM +, Dexuan Cui wrote:
> > hv_fcopy_uio_daemon.c:436:53: warning: '%s' directive output may be
> truncated
> > writing up to 14 bytes into a region of size 10 [-Wformat-truncation=]
> >   436 |  snprintf(uio_dev_path, sizeof(uio_dev_path), "/dev/%s",
> uio_name);
> 
> Makefile today doesn't have -Wformat-truncation flag enabled, I tried to add
> -Wformat-truncation=2 but I don't see any error in this file.
> 
> Do you mind sharing more details how you get this error ?
> 
> - Saurabh

This repros in a Ubuntu 20.04 VM:

root@decui-u2004-2024-0920:~/linux/tools/hv# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
...

root@decui-u2004-2024-0920:~/linux/tools/hv# gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@decui-u2004-2024-0920:~/linux/tools/hv# make clean; make
...
make -f /root/linux/tools/build/Makefile.build dir=. obj=hv_fcopy_uio_daemon
make[1]: Entering directory '/root/linux/tools/hv'
  CC  hv_fcopy_uio_daemon.o
hv_fcopy_uio_daemon.c: In function 'main':
hv_fcopy_uio_daemon.c:443:53: warning: '%s' directive output may be truncated 
writing up to 14 bytes into a region of size 10 [-Wformat-truncation=]
  443 |  snprintf(uio_dev_path, sizeof(uio_dev_path), "/dev/%s", uio_name);
  | ^~   
In file included from /usr/include/stdio.h:867,
 from hv_fcopy_uio_daemon.c:20:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:67:10: note: 
'__builtin___snprintf_chk' output between 6 and 20 bytes into a destination of 
size 15
   67 |   return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1,
  |  ^~~~
   68 |__bos (__s), __fmt, __va_arg_pack ());
  |~
  CC  vmbus_bufring.o
  LD  hv_fcopy_uio_daemon-in.o
make[1]: Leaving directory '/root/linux/tools/hv'
  LINKhv_fcopy_uio_daemon

RE: [PATCH 1/2] Drivers: hv: vmbus: Wait for offers during boot

2024-10-25 Thread Dexuan Cui

> From: Michael Kelley 
> Sent: Tuesday, October 22, 2024 11:04 AM
> [...]
> I wasn't aware of the VF handling. Where does the guest notify the
> host that it is planning to hibernate? I went looking for such code, but
> couldn't immediately find it.  Is it in the netvsc driver? Is this the
> sequence?
> 
> 1) The guest notifies the host of the hibernate
> 2) The host sends a RESCIND_CHANNELOFFER message for each VF
> in the VM
> 3) The guest waits for all VF rescind processing to complete, and
> also must ensure that no new VFs get added in the meantime
> 4) Then the guest proceeds with the hibernation, knowing that there
> are no open channels for VF devices

When a hibernated VM resumes on a different host, it looks like the host team
thinks that it's difficult to remember the VMBus device Instance GUID for the
VF, and use the same GUID on the new host. When the new host uses a new
Instance GUID for the VF, a Windows VM panics, and a Linux VM prints a
warning and IIRC loses the ability to hibernate again due to a check in the
VMBus driver.

So, as a workaround, the host team decides to remove the VF(s) before
asking the VM to hibernate. The sequence of a "host-initiated VM hibernation"
is:
1) a user clicks the "Hibernation" button on the portal (or uses the equivalent
cmd line or API).

2) Internally, the host temporarily disables AccelNet for the vNICs, i.e. 
sending
PCI_EJECT and RESCIND_CHANNELOFFER for each VF.

3) The guest responds accordingly, including sending PCI_EJECTION_COMPLETE
and CHANNELMSG_RELID_RELEASED.

4) Once the host knows that AccelNet has been disabled for the VM, the host
Sends a "please hibernate" message to the guest via the Shutdown IC.

5) The guest proceeds with the hibernation, knowing that there are no open
channels for VF devices and assuming no new VF will be offered during the
hibernation operation.

6) When the VM finishes hibernation and powers off, the host internally enables
AccelNet for the VM so that when the VM resumes on a new host, the new host
can offer a VF with a different VMBus device instance GUID.

The above is for a "host-initiated VM hibernation".

Currently, the Azure team doesn't support a "VM-initiated hibernation", where
the host has no opportunity to temporarily disable AccelNet. Maybe 
"VM-initiated hibernation" can be supported when MANA-Direct is used (i.e.
no more NetVSC NICs and there are only MANA VF NICs): in this scenario, I
suppose the host must remember the MANA VF's VMBus device Instance GUID
and use the same GUID on the new host.

> > The behavior we want is for the guest to hot remove the MLX device
> > driver on resume, even if the MLX device was still present at suspend,
> > so that the host does not need this special pre-hibernate behavior. This
> > patch series may not be sufficient to ensure this, though. It just moves
> > things in the right direction, by handling the all-offers-delivered
> > message.

I'm not sure if it's a good idea to add new code to try to remove an 
stale MLX VF since the scenario should not exist on Azure nowadays 
(currently the host temporarily disables AccelNet during hibernation so there
should be no stale MLX VF upon resume.)

On a local Hyper-V host, after a VM hibernates, we can manually disable
AccelNet (i.e. NIC SR-IOV) for the VM, and the VM will see a stale unresponsive
MLX VF upon resume. It would be tricky to clean up the VF gracefully:
we would have to wait for the resume callback in the Mellanox VF driver
to time out on the unresponsive VF (this can take 1 minute) and clean up the
related VMBus pass-through device backing the VF; what happens if a
host-initiated or VM-initiated hibernation is triggered during the 1 minute?
I suspect there may be some tricky race condition issues, e.g. we may 
need to figure out how to synchronize the .resume with the .remove callbacks
of the MLX driver.

I think the general assumption is that the VM's configuration should not
change at all across hibernation, but it looks like this assumption is found
to be false under some conditions from time to time... I wish the assumption
can be always true with OpenHCL.

Thanks,
Dexuan

RE: [PATCH] uio_hv_generic: Fix sysfs creation path for ring buffer

2025-03-11 Thread Dexuan Cui

> From: Naman Jain 
> Sent: Monday, March 10, 2025 9:45 PM
> > [...]
> > Hi Greg,
> > I understand this is deviating from the discussions that we have had
> > till now, but I wanted to kindly request your opinion on the following
> > approach, which came up in one of our internal discussions.
> >
> > By moving the sysfs creation logic from hv_uio_probe to hv_uio_open, I
> > believe we can address this problem. Here are some of the benefits of
> > this approach:
> >
> > * This approach involves minimal changes and provides a localized
> > solution.

I prefer the "Approach 1" below, which requires only 1/10 of the changes
of "Approach 2".

> > * Since the use-case of ring sysfs is specific to uio_hv_generic and
> > DPDK, this will give us the flexibility to modify it based on
> > requirements. For example, ring_buffer_bin_attr.size should depend on
> > the ring buffer's allocated size, which is easier to manage if the
> > current code resides in uio_hv_generic.

The 'ring' sysfs file is used by DPDK, the user-mode driver for fcopy
(tools/hv/hv_fcopy_uio_daemon.c) and other out-of-tree user-mode drivers.

Before the hv_uio_generic driver and the user-mode driver load, the
hv_vmbus driver doesn't know the correct size of the 'ring'; currently the
patch of "Approach 2" hardcodes ring_buffer_bin_attr.size to 4MB, which
is incorrect for the Fcopy VMBus device and other VMBus devices.

 "Approach 1" also uses a hardcoded 4MB, but it can't be easily fixed by
adding one line in hv_uio_probe(). With "Approach 2", the fix would be
difficult as we would need to pass one more param 'size' from the driver
hv_uio_generic to hv_vmbus.

The patch of "Approach 2" already passes two params from hv_uio_generic
to hv_vmbus: 'ring_sysfs_visible'  and 'mmap_ring_buffer', and adds and
exports 2 APIs to hv_uio_generic.

With Approach 1, we can avoid all the trouble.

> > * The use-case of DPDK is such that at any given time, either the
> > hv_netvsc kernel driver or the userspace (DPDK) will be controlling this
> > HV_NIC device. We do not want to expose the ring buffer to userspace
> > when hv_netvsc is using the device. This is where the "awareness" of the
> > current user comes into play, and we need a way to control the
> > visibility of the "ring" sysfs from uio_hv_generic.
> >
> >
> > Alternatively, I could focus on resolving the race condition you
> > mentioned and proceed with refining the patch. This approach addresses
> > most of the general practice concerns you highlighted.
> >
> > Regards,
> > Naman
> 
> Hello Greg,
> Here are the two approaches that we discussed.
> Can you please suggest the approach which looks better to you
> for next version.
> 
> **Approach 1: move sysfs creation to hv_uio_open**
>  [...]
> **Approach 2: Move sysfs creation to hyperv drivers**

In general, indeed we would want to avoid manipulating sysfs and kobj directly,
but here IMO calling sysfs_create_bin_file() in hv_uio_generic is reasonable as
hv_uio_generic is the only user of the 'ring' sysfs file, and it has more info 
about
the 'ring' (i.e. the correct size; when the 'ring' is used); I prefer "Approach 
1"
 since the patch is much smaller and cleaner.

Greg, can we have your opinions?

Thanks,
Dexuan

47 matches

Mail list logo