Re: GPL vs non-GPL device drivers
I am still a kernel newbie, and I am not very familiar with the GPL vs. non-GPL drivers debate. I personally think it'd be better if all drivers were GPL'd, but if that were the case, what would be the legal position of vendors such as ATI or NVIDIA who supply closed-source drivers? Would it be illegal to use them? I know this question might sound silly, but as I said before, I have only a little awareness of the issue.
[PATCH] xen-netfront, xen-netback: Use correct minimum MTU values
RFC 791 specifies the minimum MTU to be 68, while the xen-net{front|back} drivers use a minimum value of 0. Setting the MTU to a value between 0 and 67 with the xen-net{front|back} drivers makes the network unreachable immediately, and the guest can no longer be pinged. xen-net{front|back} should not allow the user to set an MTU value that causes network problems. Reported-by: Chen Shi Signed-off-by: Mohammed Gamal --- drivers/net/xen-netback/interface.c | 2 +- drivers/net/xen-netfront.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c index ee8ed9da..4491ca5 100644 --- a/drivers/net/xen-netback/interface.c +++ b/drivers/net/xen-netback/interface.c @@ -486,7 +486,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid, dev->tx_queue_len = XENVIF_QUEUE_LENGTH; - dev->min_mtu = 0; + dev->min_mtu = ETH_MIN_MTU; dev->max_mtu = ETH_MAX_MTU - VLAN_ETH_HLEN; /* diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c index 523387e..8b8689c 100644 --- a/drivers/net/xen-netfront.c +++ b/drivers/net/xen-netfront.c @@ -1316,7 +1316,7 @@ static struct net_device *xennet_create_dev(struct xenbus_device *dev) netdev->features |= netdev->hw_features; netdev->ethtool_ops = &xennet_ethtool_ops; - netdev->min_mtu = 0; + netdev->min_mtu = ETH_MIN_MTU; netdev->max_mtu = XEN_NETIF_MAX_TX_SIZE; SET_NETDEV_DEV(netdev, &dev->dev); -- 1.8.3.1
[PATCH] vmxnet3: Use correct minimum MTU value
Currently the vmxnet3 driver has a minimum MTU value of 60, which goes against RFC 791, which specifies the minimum MTU as 68. Setting the MTU to values in the range 60 <= MTU <= 67 causes the network interface to lose its IP address, and it fails to restart. This patch sets the minimum value to ETH_MIN_MTU (68), which conforms to the spec. Reported-by: Bo Yang Signed-off-by: Mohammed Gamal --- drivers/net/vmxnet3/vmxnet3_defs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/vmxnet3/vmxnet3_defs.h b/drivers/net/vmxnet3/vmxnet3_defs.h index c3a3164..4ad905a 100644 --- a/drivers/net/vmxnet3/vmxnet3_defs.h +++ b/drivers/net/vmxnet3/vmxnet3_defs.h @@ -749,7 +749,7 @@ struct Vmxnet3_DriverShared { ((vfTable[vid >> 5] & (1 << (vid & 31))) != 0) #define VMXNET3_MAX_MTU 9000 -#define VMXNET3_MIN_MTU 60 +#define VMXNET3_MIN_MTU ETH_MIN_MTU #define VMXNET3_LINK_UP (1 << 16 | 1)/* 10 Gbps, up */ #define VMXNET3_LINK_DOWN 0 -- 1.8.3.1
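For readers unfamiliar with why 68 is the floor in the two MTU patches above: RFC 791 requires every host to accept datagrams of 68 octets (a maximum-length IP header of 60 bytes plus a minimum fragment of 8 bytes). Below is a minimal user-space sketch of the range check the networking core applies around a driver's min_mtu/max_mtu; `fake_dev` and `fake_set_mtu` are illustrative names, not the kernel's API.

```c
#include <assert.h>

/* Constants mirror the kernel's ETH_MIN_MTU / ETH_MAX_MTU values */
#define FAKE_ETH_MIN_MTU 68       /* RFC 791: 60-byte max header + 8-byte fragment */
#define FAKE_ETH_MAX_MTU 0xFFFFU

struct fake_dev {
	unsigned int min_mtu;
	unsigned int max_mtu;
	unsigned int mtu;
};

/* Sketch of the validation dev_set_mtu() performs: values outside
 * [min_mtu, max_mtu] are rejected and the device MTU is left untouched. */
static int fake_set_mtu(struct fake_dev *dev, unsigned int new_mtu)
{
	if (new_mtu < dev->min_mtu || new_mtu > dev->max_mtu)
		return -1;	/* the kernel returns -EINVAL / -ERANGE here */
	dev->mtu = new_mtu;
	return 0;
}
```

With `min_mtu = 0` (the pre-patch xen-net{front|back} value) the core check lets 0..67 through, so the fix is simply to publish 68 as the lower bound.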
[RFC PATCH 2/2] hv_netvsc: Change GPADL teardown order according to Hyper-V version
Commit 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split") introduced a regression causing VMs not to shut down on pre-Win2016 hosts after netvsc_device_remove() is called. This was caused by the change in the GPADL teardown sequence. This patch restores the old behavior for pre-Win2016 hosts, while keeping the changes from 0cf7378 for Win2016 and later hosts. Signed-off-by: Mohammed Gamal --- drivers/net/hyperv/netvsc.c | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 3982f76..d09bb3b 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -575,8 +575,17 @@ void netvsc_device_remove(struct hv_device *device) cancel_work_sync(&net_device->subchan_work); + /* +* Revoke receive buffer. If host is pre-Win2016 then tear down +* receive buffer GPADL. Do the same for send buffer. +*/ netvsc_revoke_recv_buf(device, net_device); + if (vmbus_proto_version < VERSION_WIN10) + netvsc_teardown_recv_buf_gpadl(device, net_device); + netvsc_revoke_send_buf(device, net_device); + if (vmbus_proto_version < VERSION_WIN10) + netvsc_teardown_send_buf_gpadl(device, net_device); RCU_INIT_POINTER(net_device_ctx->nvdev, NULL); @@ -589,8 +598,14 @@ void netvsc_device_remove(struct hv_device *device) /* Now, we can close the channel safely */ vmbus_close(device->channel); - netvsc_teardown_recv_buf_gpadl(device, net_device); - netvsc_teardown_send_buf_gpadl(device, net_device); + /* +* If host is Win2016 or higher then we do the GPADL tear down +* here after VMBus is closed, instead of doing it earlier. +*/ + if (vmbus_proto_version >= VERSION_WIN10) { + netvsc_teardown_recv_buf_gpadl(device, net_device); + netvsc_teardown_send_buf_gpadl(device, net_device); + } /* And dissassociate NAPI context from device */ for (i = 0; i < net_device->num_chn; i++) -- 1.8.3.1
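The version-gated ordering this patch implements can be modeled in a few lines. The sketch below only records the call order; the `VERSION_*` values and function names are illustrative stand-ins, not the real Hyper-V constants or driver functions.

```c
#include <assert.h>
#include <string.h>

/* Illustrative stand-ins for the negotiated VMBus protocol versions */
#define FAKE_VERSION_WIN8  0x0002
#define FAKE_VERSION_WIN10 0x0004

static char call_log[128];

static void step(const char *s)
{
	strcat(call_log, s);
	strcat(call_log, ";");
}

/* Models the teardown ordering in netvsc_device_remove(): pre-Win2016
 * hosts get each GPADL torn down right after its revoke message and
 * before the channel closes; Win2016+ hosts get teardown after close. */
static void model_device_remove(unsigned int proto)
{
	call_log[0] = '\0';
	step("revoke_recv");
	if (proto < FAKE_VERSION_WIN10)
		step("teardown_recv");
	step("revoke_send");
	if (proto < FAKE_VERSION_WIN10)
		step("teardown_send");
	step("close_channel");
	if (proto >= FAKE_VERSION_WIN10) {
		step("teardown_recv");
		step("teardown_send");
	}
}
```

Running the model for both host generations shows the two sequences described in the cover letter.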
[RFC PATCH 1/2] hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()
Split each of the functions into two for each of send/recv buffers Signed-off-by: Mohammed Gamal --- drivers/net/hyperv/netvsc.c | 35 +++ 1 file changed, 27 insertions(+), 8 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index bfc7969..3982f76 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -100,8 +100,8 @@ static void free_netvsc_device_rcu(struct netvsc_device *nvdev) call_rcu(&nvdev->rcu, free_netvsc_device); } -static void netvsc_revoke_buf(struct hv_device *device, - struct netvsc_device *net_device) +static void netvsc_revoke_recv_buf(struct hv_device *device, + struct netvsc_device *net_device) { struct nvsp_message *revoke_packet; struct net_device *ndev = hv_get_drvdata(device); @@ -146,6 +146,14 @@ static void netvsc_revoke_buf(struct hv_device *device, } net_device->recv_section_cnt = 0; } +} + +static void netvsc_revoke_send_buf(struct hv_device *device, + struct netvsc_device *net_device) +{ + struct nvsp_message *revoke_packet; + struct net_device *ndev = hv_get_drvdata(device); + int ret; /* Deal with the send buffer we may have setup. 
* If we got a send section size, it means we received a @@ -189,8 +197,8 @@ static void netvsc_revoke_buf(struct hv_device *device, } } -static void netvsc_teardown_gpadl(struct hv_device *device, - struct netvsc_device *net_device) +static void netvsc_teardown_recv_buf_gpadl(struct hv_device *device, + struct netvsc_device *net_device) { struct net_device *ndev = hv_get_drvdata(device); int ret; @@ -215,6 +223,13 @@ static void netvsc_teardown_gpadl(struct hv_device *device, vfree(net_device->recv_buf); net_device->recv_buf = NULL; } +} + +static void netvsc_teardown_send_buf_gpadl(struct hv_device *device, + struct netvsc_device *net_device) +{ + struct net_device *ndev = hv_get_drvdata(device); + int ret; if (net_device->send_buf_gpadl_handle) { ret = vmbus_teardown_gpadl(device->channel, @@ -425,8 +440,10 @@ static int netvsc_init_buf(struct hv_device *device, goto exit; cleanup: - netvsc_revoke_buf(device, net_device); - netvsc_teardown_gpadl(device, net_device); + netvsc_revoke_recv_buf(device, net_device); + netvsc_revoke_send_buf(device, net_device); + netvsc_teardown_recv_buf_gpadl(device, net_device); + netvsc_teardown_send_buf_gpadl(device, net_device); exit: return ret; @@ -558,7 +575,8 @@ void netvsc_device_remove(struct hv_device *device) cancel_work_sync(&net_device->subchan_work); - netvsc_revoke_buf(device, net_device); + netvsc_revoke_recv_buf(device, net_device); + netvsc_revoke_send_buf(device, net_device); RCU_INIT_POINTER(net_device_ctx->nvdev, NULL); @@ -571,7 +589,8 @@ void netvsc_device_remove(struct hv_device *device) /* Now, we can close the channel safely */ vmbus_close(device->channel); - netvsc_teardown_gpadl(device, net_device); + netvsc_teardown_recv_buf_gpadl(device, net_device); + netvsc_teardown_send_buf_gpadl(device, net_device); /* And dissassociate NAPI context from device */ for (i = 0; i < net_device->num_chn; i++) -- 1.8.3.1
[RFC PATCH 0/2] hv_netvsc: Fix shutdown regression on Win2012 hosts
Commit 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split") introduced a regression that caused VMs not to shut down after netvsc_device_remove() is called. This is caused by the GPADL teardown sequence change; while that change was necessary to fix issues with Win2016 hosts, it introduced a regression for earlier versions.

Prior to commit 0cf737808 the call sequence in netvsc_device_remove() was as follows (as implemented in netvsc_destroy_buf()):
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Teardown receive buffer GPADL
3- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
4- Teardown send buffer GPADL
5- Close vmbus

This didn't work for WS2016 hosts. Commit 0cf737808 split netvsc_destroy_buf() into two functions and rearranged the order as follows:
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
3- Close vmbus
4- Teardown receive buffer GPADL
5- Teardown send buffer GPADL

That worked well for WS2016 hosts, but for WS2012 hosts it prevented VMs from shutting down. This patch series works around this problem. The first patch splits netvsc_revoke_buf() and netvsc_teardown_gpadl() into finer-grained functions that tear down the send and receive buffers individually. The second patch uses the finer-grained functions to implement the teardown sequence according to the host's version. We keep the behavior introduced in 0cf737808ae7 for Windows 2016 hosts, while re-introducing the old sequence for earlier versions.

Mohammed Gamal (2):
  hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()
  hv_netvsc: Change GPADL teardown order according to Hyper-V version

 drivers/net/hyperv/netvsc.c | 50 + 1 file changed, 42 insertions(+), 8 deletions(-) -- 1.8.3.1
Re: [RFC PATCH 1/2] hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()
On Tue, 2018-01-30 at 11:29 -0800, Stephen Hemminger wrote:
> On Tue, 23 Jan 2018 10:34:04 +0100 Mohammed Gamal wrote:
> > Split each of the functions into two for each of send/recv buffers
> >
> > Signed-off-by: Mohammed Gamal
>
> Splitting these functions is not necessary

How so? We need to send each message independently, and hence the split (see cover letter). Is there another way?
Re: [RFC PATCH 1/2] hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()
On Wed, 2018-01-31 at 15:01 -0800, Stephen Hemminger wrote: > On Wed, 31 Jan 2018 12:16:49 +0100 > Mohammed Gamal wrote: > > > On Tue, 2018-01-30 at 11:29 -0800, Stephen Hemminger wrote: > > > On Tue, 23 Jan 2018 10:34:04 +0100 > > > Mohammed Gamal wrote: > > > > > > > Split each of the functions into two for each of send/recv > > > > buffers > > > > > > > > Signed-off-by: Mohammed Gamal > > > > > > Splitting these functions is not necessary > > > > How so? We need to send each message independently, and hence the > > split > > (see cover letter). Is there another way? > > This is all that is needed. > > > Subject: [PATCH] hv_netvsc: work around for gpadl teardown on older > windows > server > > On WS2012 the host ignores messages after vmbus channel is closed. > Workaround this by doing what Windows does and send the teardown > before close on older versions of NVSP protocol. > > Reported-by: Mohammed Gamal > Fixes: 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split") > Signed-off-by: Stephen Hemminger > --- > drivers/net/hyperv/netvsc.c | 9 - > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/hyperv/netvsc.c > b/drivers/net/hyperv/netvsc.c > index 17e529af79dc..1a3df0eff42f 100644 > --- a/drivers/net/hyperv/netvsc.c > +++ b/drivers/net/hyperv/netvsc.c > @@ -574,10 +574,17 @@ void netvsc_device_remove(struct hv_device > *device) > */ > netdev_dbg(ndev, "net device safe to remove\n"); > > + /* Workaround for older versions of Windows require that > + * buffer be revoked before channel is disabled > + */ > + if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_4) > + netvsc_teardown_gpadl(device, net_device); > + > /* Now, we can close the channel safely */ > vmbus_close(device->channel); > > - netvsc_teardown_gpadl(device, net_device); > + if (net_device->nvsp_version >= NVSP_PROTOCOL_VERSION_4) > + netvsc_teardown_gpadl(device, net_device); > > /* And dissassociate NAPI context from device */ > for (i = 0; i < net_device->num_chn; 
i++)

I've tried a similar workaround before by calling netvsc_teardown_gpadl() after netvsc_revoke_buf(), but before setting net_device_ctx->nvdev to NULL, and it caused the guest to hang when trying to change MTU.

Let me try that change and see if it behaves differently.
Re: [RFC PATCH 1/2] hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()
On Thu, 2018-02-01 at 09:37 +0100, Mohammed Gamal wrote: > On Wed, 2018-01-31 at 15:01 -0800, Stephen Hemminger wrote: > > On Wed, 31 Jan 2018 12:16:49 +0100 > > Mohammed Gamal wrote: > > > > > On Tue, 2018-01-30 at 11:29 -0800, Stephen Hemminger wrote: > > > > On Tue, 23 Jan 2018 10:34:04 +0100 > > > > Mohammed Gamal wrote: > > > > > > > > > Split each of the functions into two for each of send/recv > > > > > buffers > > > > > > > > > > Signed-off-by: Mohammed Gamal > > > > > > > > Splitting these functions is not necessary > > > > > > How so? We need to send each message independently, and hence the > > > split > > > (see cover letter). Is there another way? > > > > This is all that is needed. > > > > > > Subject: [PATCH] hv_netvsc: work around for gpadl teardown on older > > windows > > server > > > > On WS2012 the host ignores messages after vmbus channel is closed. > > Workaround this by doing what Windows does and send the teardown > > before close on older versions of NVSP protocol. 
> > > > Reported-by: Mohammed Gamal > > Fixes: 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split") > > Signed-off-by: Stephen Hemminger > > --- > > drivers/net/hyperv/netvsc.c | 9 - > > 1 file changed, 8 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/net/hyperv/netvsc.c > > b/drivers/net/hyperv/netvsc.c > > index 17e529af79dc..1a3df0eff42f 100644 > > --- a/drivers/net/hyperv/netvsc.c > > +++ b/drivers/net/hyperv/netvsc.c > > @@ -574,10 +574,17 @@ void netvsc_device_remove(struct hv_device > > *device) > > */ > > netdev_dbg(ndev, "net device safe to remove\n"); > > > > + /* Workaround for older versions of Windows require that > > + * buffer be revoked before channel is disabled > > + */ > > + if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_4) > > + netvsc_teardown_gpadl(device, net_device); > > + > > /* Now, we can close the channel safely */ > > vmbus_close(device->channel); > > > > - netvsc_teardown_gpadl(device, net_device); > > + if (net_device->nvsp_version >= NVSP_PROTOCOL_VERSION_4) > > + netvsc_teardown_gpadl(device, net_device); > > > > /* And dissassociate NAPI context from device */ > > for (i = 0; i < net_device->num_chn; i++) > > I've tried a similar workaround before by calling > netvsc_teardown_gpadl() after netvsc_revoke_buf(), but before setting > net_device_ctx->nvdev to NULL and it caused the guest to hang when > trying to change MTU. > > Let me try that change and see if it behaves differently. I tested the patch, but I've actually seen some unexpected behavior. First, net_device->nvsp_version is actually NVSP_PROTOCOL_VERSION_5 on both my Win2012 and Win2016 hosts that I tested on, so the condition is never executed. Second, when doing the check instead as if (vmbus_proto_version < VERSION_WIN10), I get the same behavior I described above where the guest hangs as the kernel waits indefinitely in vmbus_teardown_gpadl() for a completion to be signaled. 
This is actually what led me to propose splitting netvsc_revoke_buf() and netvsc_teardown_gpadl() in my initial patchset, so that we keep the same order of messages and avoid that indefinite wait.
[PATCH net-next] hv_netvsc: Correct filter setting for multicast/broadcast
Commit 009f766 ("hv_netvsc: filter multicast/broadcast") intended to filter multicast/broadcast; however, the NDIS filter wasn't set properly in non-promiscuous modes, as the flags were OR'ed into the interface `flags` variable instead of `filter`. This resulted in issues like DHCP timeouts. This patch sets the filter flags correctly. Fixes: 009f766 ("hv_netvsc: filter multicast/broadcast") Signed-off-by: Mohammed Gamal --- drivers/net/hyperv/rndis_filter.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 8927c48..411a3ae 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -861,9 +861,9 @@ static void rndis_set_multicast(struct work_struct *w) filter = NDIS_PACKET_TYPE_PROMISCUOUS; } else { if (flags & IFF_ALLMULTI) - flags |= NDIS_PACKET_TYPE_ALL_MULTICAST; + filter |= NDIS_PACKET_TYPE_ALL_MULTICAST; if (flags & IFF_BROADCAST) - flags |= NDIS_PACKET_TYPE_BROADCAST; + filter |= NDIS_PACKET_TYPE_BROADCAST; } rndis_filter_set_packet_filter(rdev, filter); -- 1.8.3.1
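The bug is easy to miss in review: the NDIS bits were OR'ed into `flags` (the interface flags that were just read) instead of `filter` (the value actually sent to the host), so in non-promiscuous modes the filter never gained the multicast/broadcast bits. A stand-alone sketch of the corrected logic follows; the constant values are illustrative, not the real IFF_* / NDIS_PACKET_TYPE_* values.

```c
#include <assert.h>

/* Illustrative stand-ins for the interface flags */
#define FAKE_IFF_PROMISC   0x1
#define FAKE_IFF_ALLMULTI  0x2
#define FAKE_IFF_BROADCAST 0x4

/* Illustrative stand-ins for the NDIS packet-type filter bits */
#define FAKE_NDIS_PROMISCUOUS   0x10
#define FAKE_NDIS_ALL_MULTICAST 0x20
#define FAKE_NDIS_BROADCAST     0x40
#define FAKE_NDIS_DIRECTED      0x80

/* Models the fixed rndis_set_multicast() logic: the NDIS bits are
 * accumulated in `filter` (the value handed to the host), never OR'ed
 * back into the interface `flags` we are reading from. */
static unsigned int build_filter(unsigned int flags)
{
	unsigned int filter = FAKE_NDIS_DIRECTED;

	if (flags & FAKE_IFF_PROMISC) {
		filter = FAKE_NDIS_PROMISCUOUS;
	} else {
		if (flags & FAKE_IFF_ALLMULTI)
			filter |= FAKE_NDIS_ALL_MULTICAST;	/* was: flags |= ... */
		if (flags & FAKE_IFF_BROADCAST)
			filter |= FAKE_NDIS_BROADCAST;		/* was: flags |= ... */
	}
	return filter;
}
```

With the buggy version, `build_filter(FAKE_IFF_BROADCAST)` would return only the directed bit, which is exactly how broadcast DHCP replies ended up dropped.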
[PATCH] hv_netvsc: Make sure out channel is fully opened on send
During high network traffic, changes to network interface parameters such as the number of channels or MTU can cause a kernel panic with a NULL pointer dereference. This is due to netvsc_device_remove() being called and deallocating the channel ring buffers, which can then be accessed by netvsc_send_pkt() before they are reallocated by netvsc_device_add(). The patch fixes this problem by checking the channel state and returning -ENODEV if the channel is not yet opened. We also move the call to hv_ringbuf_avail_percent(), which may otherwise access an uninitialized ring buffer. Signed-off-by: Mohammed Gamal --- drivers/net/hyperv/netvsc.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 0265d70..44a8358 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -757,7 +757,7 @@ static inline int netvsc_send_pkt( struct netdev_queue *txq = netdev_get_tx_queue(ndev, packet->q_idx); u64 req_id; int ret; - u32 ring_avail = hv_ringbuf_avail_percent(&out_channel->outbound); + u32 ring_avail; nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT; if (skb) @@ -773,7 +773,7 @@ static inline int netvsc_send_pkt( req_id = (ulong)skb; - if (out_channel->rescind) + if (out_channel->rescind || out_channel->state != CHANNEL_OPENED_STATE) return -ENODEV; if (packet->page_buf_cnt) { @@ -791,6 +791,7 @@ static inline int netvsc_send_pkt( VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED); } + ring_avail = hv_ringbuf_avail_percent(&out_channel->outbound); if (ret == 0) { atomic_inc_return(&nvchan->queue_sends); -- 1.8.3.1
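The essence of the fix is a check-before-use guard on the channel state. A minimal model of that guard is below; the enum values and struct are illustrative (in the kernel the state lives in struct vmbus_channel and the error is -ENODEV).

```c
#include <assert.h>

enum fake_channel_state {
	FAKE_CHANNEL_OPENING_STATE,
	FAKE_CHANNEL_OPENED_STATE,
	FAKE_CHANNEL_CLOSING_STATE,
};

struct fake_channel {
	int rescind;			/* host rescinded the channel offer */
	enum fake_channel_state state;	/* set to OPENED only after ring setup */
};

#define FAKE_ENODEV 19

/* Models the patched netvsc_send_pkt() entry check: refuse to touch the
 * ring buffer unless the channel has fully reached the opened state. */
static int fake_send_pkt(const struct fake_channel *ch)
{
	if (ch->rescind || ch->state != FAKE_CHANNEL_OPENED_STATE)
		return -FAKE_ENODEV;
	return 0;	/* safe to access the ring buffer and send */
}
```

The point raised later in the thread applies here: rescind alone is not a reliable signal that the ring is gone, because the host only rescinds in specific situations, whereas the opened state is only set after the channel's ring buffers are initialized.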
Re: [PATCH] hv_netvsc: Make sure out channel is fully opened on send
On Tue, 2018-03-13 at 12:35 -0700, Stephen Hemminger wrote: > On Tue, 13 Mar 2018 20:06:50 +0100 > Mohammed Gamal wrote: > > > Dring high network traffic changes to network interface parameters > > such as number of channels or MTU can cause a kernel panic with a > > NULL > > pointer dereference. This is due to netvsc_device_remove() being > > called and deallocating the channel ring buffers, which can then be > > accessed by netvsc_send_pkt() before they're allocated on calling > > netvsc_device_add() > > > > The patch fixes this problem by checking the channel state and > > returning > > ENODEV if not yet opened. We also move the call to > > hv_ringbuf_avail_percent() > > which may access the uninitialized ring buffer. > > > > Signed-off-by: Mohammed Gamal > > --- > > drivers/net/hyperv/netvsc.c | 5 +++-- > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/net/hyperv/netvsc.c > > b/drivers/net/hyperv/netvsc.c > > index 0265d70..44a8358 100644 > > --- a/drivers/net/hyperv/netvsc.c > > +++ b/drivers/net/hyperv/netvsc.c > > @@ -757,7 +757,7 @@ static inline int netvsc_send_pkt( > > struct netdev_queue *txq = netdev_get_tx_queue(ndev, > > packet->q_idx); > > u64 req_id; > > int ret; > > - u32 ring_avail = hv_ringbuf_avail_percent(&out_channel- > > >outbound); > > + u32 ring_avail; > > > > nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT; > > if (skb) > > @@ -773,7 +773,7 @@ static inline int netvsc_send_pkt( > > > > req_id = (ulong)skb; > > > > - if (out_channel->rescind) > > + if (out_channel->rescind || out_channel->state != > > CHANNEL_OPENED_STATE) > > return -ENODEV; > > > > if (packet->page_buf_cnt) { > > @@ -791,6 +791,7 @@ static inline int netvsc_send_pkt( > > VMBUS_DATA_PACKET_FLAG_COMP > > LETION_REQUESTED); > > } > > > > + ring_avail = hv_ringbuf_avail_percent(&out_channel- > > >outbound); > > if (ret == 0) { > > atomic_inc_return(&nvchan->queue_sends); > > > > Thanks for your patch. 
> Yes there are races with the current update logic. The root cause goes
> higher up in the flow; the send queues should be stopped before
> netvsc_device_remove is called. Solving it where you tried to is racy
> and not going to work reliably.
>
> Network patches should go to net...@vger.kernel.org
>
> You can't move the ring_avail check until after the vmbus_sendpacket
> because that will break the flow control logic.

Why? I don't see ring_avail being used before that point.

> Instead, you should just move the avail_read check until just after
> the existing rescind check.
>
> Also, you shouldn't need to check for OPENED_STATE, just rescind is
> enough.

That rarely mitigated the race. The channel->rescind flag is set on vmbus exit - called on module unload - and when a rescind offer is received from the host, which AFAICT doesn't happen on every call to netvsc_device_remove, so it's quite possible that the ring buffer is accessed before it's allocated again on channel open - hence the check for OPENED_STATE, which is only set after all vmbus data is initialized.
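For context on the flow-control point being debated: the driver samples how much of the outbound ring buffer is free, as a percentage, and uses that to decide when to stop and wake the transmit queue. Below is a simplified, wraparound-aware model of such a calculation; it is illustrative only, since the kernel's hv_ringbuf_avail_percent() operates on the real ring buffer structure and its cached read/write indices.

```c
#include <assert.h>

/* Toy model of a free-space percentage for a circular ring buffer.
 * read_idx == write_idx is treated as empty (fully available), which is
 * the convention this sketch assumes; the real implementation also
 * accounts for reserved space and cached indices. */
static unsigned int ring_avail_percent(unsigned int ring_size,
				       unsigned int read_idx,
				       unsigned int write_idx)
{
	unsigned int avail;

	if (read_idx > write_idx)
		avail = read_idx - write_idx;		/* free gap before reader */
	else
		avail = ring_size - (write_idx - read_idx); /* wrapped free space */

	return avail * 100 / ring_size;
}
```

Sampling this before vs. after the send changes which state the flow-control decision sees, which is why the reply insists the check stay near the top of the send path.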
[PATCH 4/4] hv_netvsc: Pass net_device parameter to revoke and teardown functions
The callers to netvsc_revoke_*_buf() and netvsc_teardown_*_gpadl() already have their net_device instances. Pass them as a parameter to the functions instead of obtaining them from the netvsc_device struct every time. Signed-off-by: Mohammed Gamal --- drivers/net/hyperv/netvsc.c | 37 ++--- 1 file changed, 18 insertions(+), 19 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index df92c2f..04f611e 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -110,9 +110,9 @@ static void free_netvsc_device_rcu(struct netvsc_device *nvdev) } static void netvsc_revoke_recv_buf(struct hv_device *device, - struct netvsc_device *net_device) + struct netvsc_device *net_device, + struct net_device *ndev) { - struct net_device *ndev = hv_get_drvdata(device); struct nvsp_message *revoke_packet; int ret; @@ -160,9 +160,9 @@ static void netvsc_revoke_recv_buf(struct hv_device *device, } static void netvsc_revoke_send_buf(struct hv_device *device, - struct netvsc_device *net_device) + struct netvsc_device *net_device, + struct net_device *ndev) { - struct net_device *ndev = hv_get_drvdata(device); struct nvsp_message *revoke_packet; int ret; @@ -211,9 +211,9 @@ static void netvsc_revoke_send_buf(struct hv_device *device, } static void netvsc_teardown_recv_gpadl(struct hv_device *device, - struct netvsc_device *net_device) + struct netvsc_device *net_device, + struct net_device *ndev) { - struct net_device *ndev = hv_get_drvdata(device); int ret; if (net_device->recv_buf_gpadl_handle) { @@ -233,9 +233,9 @@ static void netvsc_teardown_recv_gpadl(struct hv_device *device, } static void netvsc_teardown_send_gpadl(struct hv_device *device, - struct netvsc_device *net_device) + struct netvsc_device *net_device, + struct net_device *ndev) { - struct net_device *ndev = hv_get_drvdata(device); int ret; if (net_device->send_buf_gpadl_handle) { @@ -452,10 +452,10 @@ static int netvsc_init_buf(struct hv_device *device, goto exit; cleanup: -
netvsc_revoke_recv_buf(device, net_device); - netvsc_revoke_send_buf(device, net_device); - netvsc_teardown_recv_gpadl(device, net_device); - netvsc_teardown_send_gpadl(device, net_device); + netvsc_revoke_recv_buf(device, net_device, ndev); + netvsc_revoke_send_buf(device, net_device, ndev); + netvsc_teardown_recv_gpadl(device, net_device, ndev); + netvsc_teardown_send_gpadl(device, net_device, ndev); exit: return ret; @@ -474,7 +474,6 @@ static int negotiate_nvsp_ver(struct hv_device *device, init_packet->hdr.msg_type = NVSP_MSG_TYPE_INIT; init_packet->msg.init_msg.init.min_protocol_ver = nvsp_ver; init_packet->msg.init_msg.init.max_protocol_ver = nvsp_ver; - trace_nvsp_send(ndev, init_packet); /* Send the init request */ @@ -596,13 +595,13 @@ void netvsc_device_remove(struct hv_device *device) * Revoke receive buffer. If host is pre-Win2016 then tear down * receive buffer GPADL. Do the same for send buffer. */ - netvsc_revoke_recv_buf(device, net_device); + netvsc_revoke_recv_buf(device, net_device, ndev); if (vmbus_proto_version < VERSION_WIN10) - netvsc_teardown_recv_gpadl(device, net_device); + netvsc_teardown_recv_gpadl(device, net_device, ndev); - netvsc_revoke_send_buf(device, net_device); + netvsc_revoke_send_buf(device, net_device, ndev); if (vmbus_proto_version < VERSION_WIN10) - netvsc_teardown_send_gpadl(device, net_device); + netvsc_teardown_send_gpadl(device, net_device, ndev); RCU_INIT_POINTER(net_device_ctx->nvdev, NULL); @@ -624,8 +623,8 @@ void netvsc_device_remove(struct hv_device *device) * here after VMBus is closed. */ if (vmbus_proto_version >= VERSION_WIN10) { - netvsc_teardown_recv_gpadl(device, net_device); - netvsc_teardown_send_gpadl(device, net_device); + netvsc_teardown_recv_gpadl(device, net_device, ndev); + netvsc_teardown_send_gpadl(device, net_device, ndev); } /* Release all resources */ -- 1.8.3.1
[PATCH 2/4] hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()
Split each of the functions into two, one for each of the send/recv buffers. This will be needed in order to implement a fine-grained messaging sequence to the host so that we can accommodate the requirements of different Windows versions. Fixes: 0ef58b0a05c12 ("hv_netvsc: change GPAD teardown order on older versions") Signed-off-by: Mohammed Gamal --- drivers/net/hyperv/netvsc.c | 46 + 1 file changed, 34 insertions(+), 12 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index d65b7fc..f4df5de 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -109,11 +109,11 @@ static void free_netvsc_device_rcu(struct netvsc_device *nvdev) call_rcu(&nvdev->rcu, free_netvsc_device); } -static void netvsc_revoke_buf(struct hv_device *device, - struct netvsc_device *net_device) +static void netvsc_revoke_recv_buf(struct hv_device *device, + struct netvsc_device *net_device) { - struct nvsp_message *revoke_packet; struct net_device *ndev = hv_get_drvdata(device); + struct nvsp_message *revoke_packet; int ret; /* @@ -157,6 +157,14 @@ static void netvsc_revoke_buf(struct hv_device *device, } net_device->recv_section_cnt = 0; } +} + +static void netvsc_revoke_send_buf(struct hv_device *device, + struct netvsc_device *net_device) +{ + struct net_device *ndev = hv_get_drvdata(device); + struct nvsp_message *revoke_packet; + int ret; /* Deal with the send buffer we may have setup.
* If we got a send section size, it means we received a @@ -202,8 +210,8 @@ static void netvsc_revoke_buf(struct hv_device *device, } } -static void netvsc_teardown_gpadl(struct hv_device *device, - struct netvsc_device *net_device) +static void netvsc_teardown_recv_gpadl(struct hv_device *device, + struct netvsc_device *net_device) { struct net_device *ndev = hv_get_drvdata(device); int ret; @@ -222,6 +230,13 @@ static void netvsc_teardown_gpadl(struct hv_device *device, } net_device->recv_buf_gpadl_handle = 0; } +} + +static void netvsc_teardown_send_gpadl(struct hv_device *device, + struct netvsc_device *net_device) +{ + struct net_device *ndev = hv_get_drvdata(device); + int ret; if (net_device->send_buf_gpadl_handle) { ret = vmbus_teardown_gpadl(device->channel, @@ -437,8 +452,10 @@ static int netvsc_init_buf(struct hv_device *device, goto exit; cleanup: - netvsc_revoke_buf(device, net_device); - netvsc_teardown_gpadl(device, net_device); + netvsc_revoke_recv_buf(device, net_device); + netvsc_revoke_send_buf(device, net_device); + netvsc_teardown_recv_gpadl(device, net_device); + netvsc_teardown_send_gpadl(device, net_device); exit: return ret; @@ -575,7 +592,8 @@ void netvsc_device_remove(struct hv_device *device) = rtnl_dereference(net_device_ctx->nvdev); int i; - netvsc_revoke_buf(device, net_device); + netvsc_revoke_recv_buf(device, net_device); + netvsc_revoke_send_buf(device, net_device); RCU_INIT_POINTER(net_device_ctx->nvdev, NULL); @@ -590,14 +608,18 @@ void netvsc_device_remove(struct hv_device *device) netdev_dbg(ndev, "net device safe to remove\n"); /* older versions require that buffer be revoked before close */ - if (vmbus_proto_version < VERSION_WIN10) - netvsc_teardown_gpadl(device, net_device); + if (vmbus_proto_version < VERSION_WIN10) { + netvsc_teardown_recv_gpadl(device, net_device); + netvsc_teardown_send_gpadl(device, net_device); + } /* Now, we can close the channel safely */ vmbus_close(device->channel); - if (vmbus_proto_version >= 
VERSION_WIN10) - netvsc_teardown_gpadl(device, net_device); + if (vmbus_proto_version >= VERSION_WIN10) { + netvsc_teardown_recv_gpadl(device, net_device); + netvsc_teardown_send_gpadl(device, net_device); + } /* Release all resources */ free_netvsc_device_rcu(net_device); -- 1.8.3.1
[PATCH 1/4] hv_netvsc: Use Windows version instead of NVSP version on GPAD teardown
When changing network interface settings, guests on Windows hosts older than WS2016 can no longer shut down. This was addressed by commit 0ef58b0a05c12 ("hv_netvsc: change GPAD teardown order on older versions"); however, the issue still occurs on WS2012 hosts, since they share NVSP protocol versions with WS2016 hosts and the NVSP-based check cannot distinguish them. Hence we use the Windows version directly to differentiate them. Fixes: 0ef58b0a05c12 ("hv_netvsc: change GPAD teardown order on older versions") Signed-off-by: Mohammed Gamal --- drivers/net/hyperv/netvsc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index c9910c3..d65b7fc 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -590,13 +590,13 @@ void netvsc_device_remove(struct hv_device *device) netdev_dbg(ndev, "net device safe to remove\n"); /* older versions require that buffer be revoked before close */ - if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_4) + if (vmbus_proto_version < VERSION_WIN10) netvsc_teardown_gpadl(device, net_device); /* Now, we can close the channel safely */ vmbus_close(device->channel); - if (net_device->nvsp_version >= NVSP_PROTOCOL_VERSION_4) + if (vmbus_proto_version >= VERSION_WIN10) netvsc_teardown_gpadl(device, net_device); /* Release all resources */ -- 1.8.3.1
[PATCH 3/4] hv_netvsc: Ensure correct teardown message sequence order
Prior to commit 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split") the call sequence in netvsc_device_remove() was as follows (as implemented in netvsc_destroy_buf()):
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Teardown receive buffer GPADL
3- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
4- Teardown send buffer GPADL
5- Close vmbus

This didn't work for WS2016 hosts. Commit 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split") rearranged the teardown sequence as follows:
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
3- Close vmbus
4- Teardown receive buffer GPADL
5- Teardown send buffer GPADL

That worked well for WS2016 hosts, but it prevented guests on older hosts from shutting down after changing network settings. Commit 0ef58b0a05c1 ("hv_netvsc: change GPAD teardown order on older versions") ensured the following message sequence for older hosts:
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
3- Teardown receive buffer GPADL
4- Teardown send buffer GPADL
5- Close vmbus

However, with this sequence calling `ip link set eth0 mtu 1000` hangs and the process becomes uninterruptible. On further analysis it turns out that on tearing down the receive buffer GPADL the kernel waits indefinitely in vmbus_teardown_gpadl() for a completion to be signaled.
Here is a snippet of where this occurs:

int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
{
	struct vmbus_channel_gpadl_teardown *msg;
	struct vmbus_channel_msginfo *info;
	unsigned long flags;
	int ret;

	info = kmalloc(sizeof(*info) +
		       sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL);
	if (!info)
		return -ENOMEM;

	init_completion(&info->waitevent);
	info->waiting_channel = channel;
[...]
	ret = vmbus_post_msg(msg, sizeof(struct vmbus_channel_gpadl_teardown),
			     true);

	if (ret)
		goto post_msg_err;

	wait_for_completion(&info->waitevent);
[...]
}

The completion is signaled from vmbus_ongpadl_torndown(), which gets called when the corresponding message is received from the host, which apparently never happens in this case.

This patch works around the issue by restoring the first mentioned message sequence for older hosts.

Fixes: 0ef58b0a05c1 ("hv_netvsc: change GPAD teardown order on older versions")
Signed-off-by: Mohammed Gamal
---
 drivers/net/hyperv/netvsc.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index f4df5de..df92c2f 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -592,8 +592,17 @@ void netvsc_device_remove(struct hv_device *device)
 		= rtnl_dereference(net_device_ctx->nvdev);
 	int i;
 
+	/*
+	 * Revoke receive buffer. If host is pre-Win2016 then tear down
+	 * receive buffer GPADL. Do the same for send buffer.
+	 */
 	netvsc_revoke_recv_buf(device, net_device);
+	if (vmbus_proto_version < VERSION_WIN10)
+		netvsc_teardown_recv_gpadl(device, net_device);
+
 	netvsc_revoke_send_buf(device, net_device);
+	if (vmbus_proto_version < VERSION_WIN10)
+		netvsc_teardown_send_gpadl(device, net_device);
 
 	RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
 
@@ -607,15 +616,13 @@ void netvsc_device_remove(struct hv_device *device)
 	 */
 	netdev_dbg(ndev, "net device safe to remove\n");
 
-	/* older versions require that buffer be revoked before close */
-	if (vmbus_proto_version < VERSION_WIN10) {
-		netvsc_teardown_recv_gpadl(device, net_device);
-		netvsc_teardown_send_gpadl(device, net_device);
-	}
-
 	/* Now, we can close the channel safely */
 	vmbus_close(device->channel);
 
+	/*
+	 * If host is Win2016 or higher then we do the GPADL tear down
+	 * here after VMBus is closed.
+	 */
 	if (vmbus_proto_version >= VERSION_WIN10) {
 		netvsc_teardown_recv_gpadl(device, net_device);
 		netvsc_teardown_send_gpadl(device, net_device);
-- 
1.8.3.1
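[Editor's illustration] The version-dependent ordering in the patch above can be summarized in a small standalone sketch. This is not kernel code: `teardown_sequence` and the `HOST_*` constants are invented stand-ins for the `vmbus_proto_version < VERSION_WIN10` check; the strings name the netvsc revoke/teardown/close steps in the order the patch establishes.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the vmbus_proto_version < VERSION_WIN10 test */
enum host_generation { HOST_PRE_WIN2016, HOST_WIN2016_OR_LATER };

/*
 * Returns the order of teardown steps for a given host generation,
 * mirroring the final sequence in the patch: on pre-Win2016 hosts each
 * buffer's GPADL is torn down right after it is revoked and before the
 * channel is closed; on Win2016+ hosts teardown happens after close.
 */
static const char *teardown_sequence(enum host_generation host)
{
	if (host == HOST_PRE_WIN2016)
		return "revoke_recv, teardown_recv, revoke_send, teardown_send, close";
	return "revoke_recv, revoke_send, close, teardown_recv, teardown_send";
}
```

The point of the sketch is that only the relative position of the GPADL teardown with respect to vmbus_close() changes between host generations; the revoke messages are always sent first.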
[PATCH 0/4] hv_netvsc: Fix shutdown issues on older Windows hosts
Guests running on WS2012 hosts would not shut down when changing network interface settings (e.g. number of channels, MTU, etc.). This patch series addresses these shutdown issues we encountered with WS2012 hosts. It is essentially a rework of the series sent in https://lkml.org/lkml/2018/1/23/111 on top of the latest upstream.

Fixes: 0ef58b0a05c1 ("hv_netvsc: change GPAD teardown order on older versions")

Mohammed Gamal (4):
  hv_netvsc: Use Windows version instead of NVSP version on GPAD teardown
  hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()
  hv_netvsc: Ensure correct teardown message sequence order
  hv_netvsc: Pass net_device parameter to revoke and teardown functions

 drivers/net/hyperv/netvsc.c | 60 +
 1 file changed, 44 insertions(+), 16 deletions(-)

-- 
1.8.3.1
Re: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write
On Tue, 2019-03-12 at 18:02 +, Haiyang Zhang wrote:
> > -----Original Message-----
> > From: Mohammed Gamal
> > Sent: Thursday, March 7, 2019 1:32 PM
> > To: Michael Kelley; linux-hyp...@vger.kernel.org; kimbrownkd
> > Cc: Sasha Levin; Dexuan Cui; Stephen Hemminger; Long Li;
> > KY Srinivasan; Haiyang Zhang; vkuznets; linux-ker...@vger.kernel.org
> > Subject: Re: [PATCH] hyper-v: Check for ring buffer in
> > hv_get_bytes_to_read/write
> >
> > On Thu, 2019-03-07 at 17:33 +, Michael Kelley wrote:
> > > From: Mohammed Gamal Sent: Thursday, March 7, 2019 8:36 AM
> > > >
> > > > This patch adds a check for the presence of the ring buffer in
> > > > hv_get_bytes_to_read/write() to avoid possible NULL pointer
> > > > dereferences.
> > > > If the ring buffer is not yet allocated, return 0 bytes to be
> > > > read/written.
> > > >
> > > > The root cause is that code that accesses the ring buffer,
> > > > including hv_get_bytes_to_read/write(), could be vulnerable to
> > > > the race condition discussed in
> > > > https://lkml.org/lkml/2018/10/18/779
> > > >
> > > > This race is being addressed by the patch series by Kimberly
> > > > Brown in https://lkml.org/lkml/2019/2/21/1236 which is not
> > > > final yet
> > > >
> > > > Signed-off-by: Mohammed Gamal
> > > Could you elaborate on the code paths where
> > > hv_get_bytes_to_read/write() could be called when the ring buffer
> > > isn't yet allocated? My sense is that Kim Brown's patch will
> > > address all of the code paths that involved sysfs access from
> > > outside the driver. And within a driver, the ring buffer should
> > > never be accessed unless it is already allocated. Is there another
> > > code path we're not aware of? I'm wondering if these changes are
> > > really needed once Kim Brown's patch is finished.
> > >
> > > Michael
> >
> > I've seen one instance of the race in the netvsc driver when running
> > traffic through it with iperf3 while continuously changing the
> > channel settings.
> >
> > The following code path deallocates the ring buffer:
> > netvsc_set_channels() -> netvsc_detach() ->
> > rndis_filter_device_remove() -> netvsc_device_remove() ->
> > vmbus_close() -> vmbus_free_ring() -> hv_ringbuffer_cleanup().
> >
> > netvsc_send_pkt() -> hv_get_bytes_to_write() might get called
> > concurrently after vmbus_close() and before vmbus_open() returns
> > and sets up the new ring buffer.
> >
> > The race is fairly hard to reproduce on recent upstream kernels,
> > but I still managed to reproduce it.
>
> Looking at the code from netvsc_detach() – netif_tx_disable(ndev) is
> called before rndis_filter_device_remove(hdev, nvdev).
> So there should be no call to netvsc_send_pkt() after detaching.
> What's the crash stack trace?
>
> static int netvsc_detach(struct net_device *ndev,
> 			 struct netvsc_device *nvdev)
> {
> 	struct net_device_context *ndev_ctx = netdev_priv(ndev);
> 	struct hv_device *hdev = ndev_ctx->device_ctx;
> 	int ret;
>
> 	/* Don't try continuing to try and setup sub channels */
> 	if (cancel_work_sync(&nvdev->subchan_work))
> 		nvdev->num_chn = 1;
>
> 	/* If device was up (receiving) then shutdown */
> 	if (netif_running(ndev)) {
> 		netif_tx_disable(ndev);
> 		&
Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
On Fri, 2021-01-15 at 16:08 -0800, Jim Mattson wrote:
> On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal wrote:
> >
> > This patch exposes allow_smaller_maxphyaddr to the user as a module
> > parameter.
> >
> > Since smaller physical address spaces are only supported on VMX, the
> > parameter is only exposed in the kvm_intel module.
> > Modifications to VMX page fault and EPT violation handling will
> > depend on whether that parameter is enabled.
> >
> > Also disable support by default, and let the user decide if they
> > want to enable it.
> >
> > Signed-off-by: Mohammed Gamal
> > ---
> >  arch/x86/kvm/vmx/vmx.c | 15 ++-
> >  arch/x86/kvm/vmx/vmx.h |  3 +++
> >  arch/x86/kvm/x86.c     |  2 +-
> >  3 files changed, 10 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 819c185adf09..dc778c7b5a06 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
> >  module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
> >  #endif
> >
> > +extern bool __read_mostly allow_smaller_maxphyaddr;
>
> Since this variable is in the kvm module rather than the kvm_intel
> module, its current setting is preserved across "rmmod kvm_intel;
> modprobe kvm_intel." That is, if set to true, it doesn't revert to
> false after "rmmod kvm_intel." Is that the intended behavior?
>

IIRC, this is because this setting was indeed not intended to be just VMX-specific, but since AMD has an issue with PTE accessed-bits being set by hardware and thus we can't yet enable this feature on it, it might make sense to move the variable to the kvm_intel module for now.

Paolo, what do you think?
[PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write
This patch adds a check for the presence of the ring buffer in hv_get_bytes_to_read/write() to avoid possible NULL pointer dereferences. If the ring buffer is not yet allocated, return 0 bytes to be read/written.

The root cause is that code that accesses the ring buffer, including hv_get_bytes_to_read/write(), could be vulnerable to the race condition discussed in https://lkml.org/lkml/2018/10/18/779

This race is being addressed by the patch series by Kimberly Brown in https://lkml.org/lkml/2019/2/21/1236 which is not final yet.

Signed-off-by: Mohammed Gamal
---
 include/linux/hyperv.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 64698ec8f2ac..7b2f566250b2 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -148,6 +148,9 @@ static inline u32 hv_get_bytes_to_read(const struct hv_ring_buffer_info *rbi)
 {
 	u32 read_loc, write_loc, dsize, read;
 
+	if (!rbi->ring_buffer)
+		return 0;
+
 	dsize = rbi->ring_datasize;
 	read_loc = rbi->ring_buffer->read_index;
 	write_loc = READ_ONCE(rbi->ring_buffer->write_index);
@@ -162,6 +165,9 @@ static inline u32 hv_get_bytes_to_write(const struct hv_ring_buffer_info *rbi)
 {
 	u32 read_loc, write_loc, dsize, write;
 
+	if (!rbi->ring_buffer)
+		return 0;
+
 	dsize = rbi->ring_datasize;
 	read_loc = READ_ONCE(rbi->ring_buffer->read_index);
 	write_loc = rbi->ring_buffer->write_index;
-- 
2.18.1
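[Editor's illustration] The circular-buffer arithmetic guarded by the NULL check above can be demonstrated in a standalone sketch. This is not the kernel code: `struct ring_buffer`, `struct ring_buffer_info` and `bytes_to_read` are simplified stand-ins for the Hyper-V structures and for hv_get_bytes_to_read(), with the memory-ordering annotations (READ_ONCE) omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the Hyper-V ring buffer structures */
struct ring_buffer {
	uint32_t read_index;
	uint32_t write_index;
};

struct ring_buffer_info {
	struct ring_buffer *ring_buffer; /* NULL until the channel is opened */
	uint32_t ring_datasize;
};

/*
 * Mirrors the shape of hv_get_bytes_to_read(): return 0 when the ring
 * buffer is absent, otherwise the circular distance from read_index to
 * write_index within a buffer of ring_datasize bytes.
 */
static uint32_t bytes_to_read(const struct ring_buffer_info *rbi)
{
	uint32_t read_loc, write_loc, dsize;

	if (!rbi->ring_buffer)
		return 0; /* the guard the patch adds */

	dsize = rbi->ring_datasize;
	read_loc = rbi->ring_buffer->read_index;
	write_loc = rbi->ring_buffer->write_index;

	return write_loc >= read_loc ? (write_loc - read_loc)
				     : (dsize - read_loc) + write_loc;
}
```

The early return is what turns a dereference of a torn-down ring buffer into a harmless "nothing to read" answer during the close/reopen window described in the thread.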
Re: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write
On Thu, 2019-03-07 at 17:33 +, Michael Kelley wrote:
> From: Mohammed Gamal Sent: Thursday, March 7, 2019 8:36 AM
> >
> > This patch adds a check for the presence of the ring buffer in
> > hv_get_bytes_to_read/write() to avoid possible NULL pointer
> > dereferences.
> > If the ring buffer is not yet allocated, return 0 bytes to be
> > read/written.
> >
> > The root cause is that code that accesses the ring buffer including
> > hv_get_bytes_to_read/write() could be vulnerable to the race
> > condition discussed in https://lkml.org/lkml/2018/10/18/779
> >
> > This race is being addressed by the patch series by Kimberly Brown
> > in https://lkml.org/lkml/2019/2/21/1236 which is not final yet
> >
> > Signed-off-by: Mohammed Gamal
>
> Could you elaborate on the code paths where
> hv_get_bytes_to_read/write() could be called when the ring buffer
> isn't yet allocated? My sense is that Kim Brown's patch will address
> all of the code paths that involved sysfs access from outside the
> driver. And within a driver, the ring buffer should never be accessed
> unless it is already allocated. Is there another code path we're not
> aware of? I'm wondering if these changes are really needed once
> Kim Brown's patch is finished.
>
> Michael

I've seen one instance of the race in the netvsc driver when running traffic through it with iperf3 while continuously changing the channel settings.

The following code path deallocates the ring buffer:
netvsc_set_channels() -> netvsc_detach() -> rndis_filter_device_remove() -> netvsc_device_remove() -> vmbus_close() -> vmbus_free_ring() -> hv_ringbuffer_cleanup().

netvsc_send_pkt() -> hv_get_bytes_to_write() might get called concurrently after vmbus_close() and before vmbus_open() returns and sets up the new ring buffer.

The race is fairly hard to reproduce on recent upstream kernels, but I still managed to reproduce it.
[PATCH v2] hv_netvsc: Fix net device attach on older Windows hosts
On older Windows hosts the net_device instance is returned to the caller of rndis_filter_device_add() without having the presence bit set first. This would cause any subsequent calls to network device operations (e.g. MTU change, channel change) to fail after the device is detached once, returning -ENODEV.

Instead of returning the device instance, we take the exit path where we call netif_device_attach().

Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
Signed-off-by: Mohammed Gamal
---
 drivers/net/hyperv/rndis_filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 6b127be..e7ca5b5 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1288,7 +1288,7 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
 		    rndis_device->link_state ? "down" : "up");
 
 	if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_5)
-		return net_device;
+		goto out;
 
 	rndis_filter_query_link_speed(rndis_device, net_device);
-- 
1.8.3.1
Re: [PATCH] hv_netvsc: Make sure out channel is fully opened on send
On Wed, 2018-03-14 at 10:22 +0100, Mohammed Gamal wrote:
> On Tue, 2018-03-13 at 12:35 -0700, Stephen Hemminger wrote:
> > On Tue, 13 Mar 2018 20:06:50 +0100 Mohammed Gamal wrote:
> >
> > > During high network traffic, changes to network interface
> > > parameters such as number of channels or MTU can cause a kernel
> > > panic with a NULL pointer dereference. This is due to
> > > netvsc_device_remove() being called and deallocating the channel
> > > ring buffers, which can then be accessed by netvsc_send_pkt()
> > > before they're allocated on calling netvsc_device_add()
> > >
> > > The patch fixes this problem by checking the channel state and
> > > returning ENODEV if not yet opened. We also move the call to
> > > hv_ringbuf_avail_percent() which may access the uninitialized
> > > ring buffer.
> > >
> > > Signed-off-by: Mohammed Gamal
> > > ---
> > >  drivers/net/hyperv/netvsc.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> > > index 0265d70..44a8358 100644
> > > --- a/drivers/net/hyperv/netvsc.c
> > > +++ b/drivers/net/hyperv/netvsc.c
> > > @@ -757,7 +757,7 @@ static inline int netvsc_send_pkt(
> > >  	struct netdev_queue *txq = netdev_get_tx_queue(ndev, packet->q_idx);
> > >  	u64 req_id;
> > >  	int ret;
> > > -	u32 ring_avail = hv_ringbuf_avail_percent(&out_channel->outbound);
> > > +	u32 ring_avail;
> > >
> > >  	nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT;
> > >  	if (skb)
> > > @@ -773,7 +773,7 @@ static inline int netvsc_send_pkt(
> > >
> > >  	req_id = (ulong)skb;
> > >
> > > -	if (out_channel->rescind)
> > > +	if (out_channel->rescind || out_channel->state != CHANNEL_OPENED_STATE)
> > >  		return -ENODEV;
> > >
> > >  	if (packet->page_buf_cnt) {
> > > @@ -791,6 +791,7 @@ static inline int netvsc_send_pkt(
> > >  					  VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
> > >  	}
> > >
> > > +	ring_avail = hv_ringbuf_avail_percent(&out_channel->outbound);
> > >  	if (ret == 0) {
> > >  		atomic_inc_return(&nvchan->queue_sends);
> > >
> >
> > Thanks for your patch. Yes there are races with the current update
> > logic. The root cause goes higher up in the flow; the send queues
> > should be stopped before netvsc_device_remove is called. Solving it
> > where you tried to is racy and not going to work reliably.
> >
> > Network patches should go to net...@vger.kernel.org
> >
> > You can't move the ring_avail check until after the
> > vmbus_sendpacket because that will break the flow control logic.
>
> Why? I don't see ring_avail being used before that point.

Ah, stupid me. vmbus_sendpacket() will write to the ring buffer and that means that the ring_avail value will be different than expected.

> > Instead, you should just move the avail_read check until just after
> > the existing rescind check.
> >
> > Also, you shouldn't need to check for OPENED_STATE, just rescind is
> > enough.
>
> That rarely mitigated the race. The channel->rescind flag is set on
> vmbus exit - called on module unload - and when a rescind offer is
> received from the host, which AFAICT doesn't happen on every call to
> netvsc_device_remove, so it's quite possible that the ring buffer is
> accessed before it's allocated again on channel open, and hence the
> check for OPENED_STATE - which is only set after all vmbus data is
> initialized.

Perhaps I haven't been clear enough. The NULL pointer dereference happens in the call to hv_ringbuf_avail_percent(), which is used to calculate ring_avail. So we need to stop the queues before calling it if the channel's ring buffers haven't been allocated yet, but OTOH we should only stop the queues based upon the value of ring_avail, so this leads into a chicken-and-egg situation. Is my observation here correct? Please correct me if I am wrong, Stephen.
[PATCH] hv_netvsc: Fix net device attach on older Windows hosts
On older Windows hosts the net_device instance is returned to the caller of rndis_filter_device_add() without having the presence bit set first. This would cause any subsequent calls to network device operations (e.g. MTU change, channel change) to fail after the device is detached once, returning -ENODEV.

Make sure we explicitly call netif_device_attach() before returning the net_device instance to make sure the presence bit is set.

Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
Signed-off-by: Mohammed Gamal
---
 drivers/net/hyperv/rndis_filter.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 6b127be..09a3c1d 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1287,8 +1287,10 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
 		    rndis_device->hw_mac_adr,
 		    rndis_device->link_state ? "down" : "up");
 
-	if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_5)
+	if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_5) {
+		netif_device_attach(net);
 		return net_device;
+	}
 
 	rndis_filter_query_link_speed(rndis_device, net_device);
-- 
1.8.3.1
Re: [PATCH] hv_netvsc: Fix net device attach on older Windows hosts
On Tue, 2018-05-08 at 11:13 -0700, Stephen Hemminger wrote:
> On Tue, 8 May 2018 19:40:47 +0200 Mohammed Gamal wrote:
>
> > On older windows hosts the net_device instance is returned to
> > the caller of rndis_filter_device_add() without having the presence
> > bit set first. This would cause any subsequent calls to network
> > device operations (e.g. MTU change, channel change) to fail after
> > the device is detached once, returning -ENODEV.
> >
> > Make sure we explicitly call netif_device_attach() before returning
> > the net_device instance to make sure the presence bit is set
> >
> > Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> >
> > Signed-off-by: Mohammed Gamal
> > ---
> >  drivers/net/hyperv/rndis_filter.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
> > index 6b127be..09a3c1d 100644
> > --- a/drivers/net/hyperv/rndis_filter.c
> > +++ b/drivers/net/hyperv/rndis_filter.c
> > @@ -1287,8 +1287,10 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
> >  		    rndis_device->hw_mac_adr,
> >  		    rndis_device->link_state ? "down" : "up");
> >
> > -	if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_5)
> > +	if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_5) {
> > +		netif_device_attach(net);
> >  		return net_device;
> > +	}
>
> Yes, this looks right, but it might be easier to use goto existing
> exit path.

I was just not sure if we should set max_chn and num_chn here. I will modify the patch and resend.

> diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
> index 3b6dbacaf77d..ed941c5a0be9 100644
> --- a/drivers/net/hyperv/rndis_filter.c
> +++ b/drivers/net/hyperv/rndis_filter.c
> @@ -1316,7 +1316,7 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
> 		    rndis_device->link_state ? "down" : "up");
>
> 	if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_5)
> -		return net_device;
> +		goto out;
>
> 	rndis_filter_query_link_speed(rndis_device, net_device);
>
[PATCH v2 05/11] KVM: x86: update exception bitmap on CPUID changes
From: Paolo Bonzini

Allow vendor code to observe changes to MAXPHYADDR and start/stop intercepting page faults.

Signed-off-by: Paolo Bonzini
---
 arch/x86/kvm/cpuid.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 8a294f9747aa..ea5bbf2153bb 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -128,6 +128,8 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
 	kvm_mmu_reset_context(vcpu);
 	kvm_pmu_refresh(vcpu);
 
+	kvm_x86_ops.update_exception_bitmap(vcpu);
+
 	return 0;
 }
-- 
2.26.2
[PATCH v2 08/11] KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
From: Paolo Bonzini

Ignore non-present page faults, since those cannot have reserved bits set. When running access.flat with "-cpu Haswell,phys-bits=36", the number of trapped page faults goes down from 8872644 to 3978948.

Signed-off-by: Paolo Bonzini
---
 arch/x86/kvm/vmx/vmx.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f38cbadcb3a5..8daf78b2d4cb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4358,6 +4358,16 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 		vmx->pt_desc.guest.output_mask = 0x7F;
 		vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
 	}
+
+	/*
+	 * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
+	 * between guest and host. In that case we only care about present
+	 * faults.
+	 */
+	if (enable_ept) {
+		vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, PFERR_PRESENT_MASK);
+		vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, PFERR_PRESENT_MASK);
+	}
 }
 
 static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
-- 
2.26.2
[PATCH v2 07/11] KVM: VMX: Add guest physical address check in EPT violation and misconfig
Check the guest physical address against its maximum physical memory. If the guest's physical address exceeds the maximum (i.e. has reserved bits set), inject a guest page fault with PFERR_RSVD_MASK set.

This has to be done both in the EPT violation and page fault paths, as there are complications in both cases with respect to the computation of the correct error code.

For EPT violations, unfortunately the only possibility is to emulate, because the access type in the exit qualification might refer to an access to a paging structure, rather than to the access performed by the program.

Trapping page faults instead is needed in order to correct the error code, but the access type can be obtained from the original error code and passed to gva_to_gpa. The corrections required in the error code are subtle. For example, imagine that a PTE for a supervisor page has a reserved bit set. On a supervisor-mode access, the EPT violation path would trigger. However, on a user-mode access, the processor will not notice the reserved bit and not include PFERR_RSVD_MASK in the error code.

Co-developed-by: Mohammed Gamal
Signed-off-by: Paolo Bonzini
---
 arch/x86/kvm/vmx/vmx.c | 24 +---
 arch/x86/kvm/vmx/vmx.h |  3 ++-
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 46d522ee5cb1..f38cbadcb3a5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4793,9 +4793,15 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
 	if (is_page_fault(intr_info)) {
 		cr2 = vmx_get_exit_qual(vcpu);
-		/* EPT won't cause page fault directly */
-		WARN_ON_ONCE(!vcpu->arch.apf.host_apf_flags && enable_ept);
-		return kvm_handle_page_fault(vcpu, error_code, cr2, NULL, 0);
+		if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
+			/*
+			 * EPT will cause page fault only if we need to
+			 * detect illegal GPAs.
+			 */
+			kvm_fixup_and_inject_pf_error(vcpu, cr2, error_code);
+			return 1;
+		} else
+			return kvm_handle_page_fault(vcpu, error_code, cr2, NULL, 0);
 	}
 
 	ex_no = intr_info & INTR_INFO_VECTOR_MASK;
@@ -5311,6 +5317,18 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
 
 	vcpu->arch.exit_qualification = exit_qualification;
+
+	/*
+	 * Check that the GPA doesn't exceed physical memory limits, as that is
+	 * a guest page fault. We have to emulate the instruction here, because
+	 * if the illegal address is that of a paging structure, then
+	 * EPT_VIOLATION_ACC_WRITE bit is set. Alternatively, if supported we
+	 * would also use advanced VM-exit information for EPT violations to
+	 * reconstruct the page fault error code.
+	 */
+	if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
+		return kvm_emulate_instruction(vcpu, 0);
+
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 5e2da15fe94f..950ecf237558 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -11,6 +11,7 @@
 #include "kvm_cache_regs.h"
 #include "ops.h"
 #include "vmcs.h"
+#include "cpuid.h"
 
 extern const u32 vmx_msr_index[];
 
@@ -554,7 +555,7 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
 
 static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
 {
-	return !enable_ept;
+	return !enable_ept || cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
 }
 
 void dump_vmcs(void);
-- 
2.26.2
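[Editor's illustration] The "has reserved bits set" test used by the patch above (kvm_mmu_is_illegal_gpa) boils down to checking whether any address bit at or above the guest's MAXPHYADDR is set. A minimal standalone sketch, with `gpa_is_illegal` as an invented name rather than the kernel function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A guest physical address is illegal when any bit at or above the
 * guest's MAXPHYADDR is set, i.e. the address uses reserved bits.
 * Equivalent to testing gpa against the mask ~((1ULL << maxphyaddr) - 1).
 */
static bool gpa_is_illegal(uint64_t gpa, unsigned int guest_maxphyaddr)
{
	if (guest_maxphyaddr >= 64)
		return false; /* no reserved address bits to check */
	return (gpa >> guest_maxphyaddr) != 0;
}
```

For a guest configured with phys-bits=36, any GPA at or above 2^36 trips the check and is turned into an injected #PF with PFERR_RSVD_MASK set.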
[PATCH v2 10/11] KVM: SVM: Add guest physical address check in NPF/PF interception
Check the guest physical address against its maximum physical memory. If the guest's physical address exceeds the maximum (i.e. has reserved bits set), inject a guest page fault with PFERR_RSVD_MASK set.

Similar to VMX, this has to be done both in the NPF and page fault interceptions, as there are complications in both cases with respect to the computation of the correct error code.

For NPF interceptions, unfortunately the only possibility is to emulate, because the access type in the exit qualification might refer to an access to a paging structure, rather than to the access performed by the program.

Trapping page faults instead is needed in order to correct the error code, but the access type can be obtained from the original error code and passed to gva_to_gpa. The corrections required in the error code are subtle. For example, imagine that a PTE for a supervisor page has a reserved bit set. On a supervisor-mode access, the EPT violation path would trigger. However, on a user-mode access, the processor will not notice the reserved bit and not include PFERR_RSVD_MASK in the error code.

CC: Tom Lendacky
CC: Babu Moger
Signed-off-by: Mohammed Gamal
---
 arch/x86/kvm/svm/svm.c | 11 +++
 arch/x86/kvm/svm/svm.h |  2 +-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 05412818027d..ec3224a2e7c2 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1702,6 +1702,12 @@ static int pf_interception(struct vcpu_svm *svm)
 	u64 fault_address = __sme_clr(svm->vmcb->control.exit_info_2);
 	u64 error_code = svm->vmcb->control.exit_info_1;
 
+	if (npt_enabled && !svm->vcpu.arch.apf.host_apf_flags) {
+		kvm_fixup_and_inject_pf_error(&svm->vcpu,
+					      fault_address, error_code);
+		return 1;
+	}
+
 	return kvm_handle_page_fault(&svm->vcpu, error_code, fault_address,
 			static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
 			svm->vmcb->control.insn_bytes : NULL,
@@ -1714,6 +1720,11 @@ static int npf_interception(struct vcpu_svm *svm)
 	u64 error_code = svm->vmcb->control.exit_info_1;
 
 	trace_kvm_page_fault(fault_address, error_code);
+
+	/* Check if guest gpa doesn't exceed physical memory limits */
+	if (unlikely(kvm_mmu_is_illegal_gpa(&svm->vcpu, fault_address)))
+		return kvm_emulate_instruction(&svm->vcpu, 0);
+
 	return kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code,
 			static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
 			svm->vmcb->control.insn_bytes : NULL,
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 2b7469f3db0e..12b502e36dbd 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -348,7 +348,7 @@ static inline bool gif_set(struct vcpu_svm *svm)
 
 static inline bool svm_need_pf_intercept(struct vcpu_svm *svm)
 {
-	return !npt_enabled;
+	return !npt_enabled || cpuid_maxphyaddr(&svm->vcpu) < boot_cpu_data.x86_phys_bits;
 }
 
 /* svm.c */
-- 
2.26.2
[PATCH v2 09/11] KVM: SVM: introduce svm_need_pf_intercept
CC: Tom Lendacky
CC: Babu Moger
Signed-off-by: Mohammed Gamal
---
 arch/x86/kvm/svm/svm.c | 8 ++++++++
 arch/x86/kvm/svm/svm.h | 6 ++++++
 2 files changed, 14 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 94108e6cc6da..05412818027d 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1087,6 +1087,9 @@ static void init_vmcb(struct vcpu_svm *svm)
 	}
 	svm->asid_generation = 0;
 
+	if (svm_need_pf_intercept(svm))
+		set_exception_intercept(svm, PF_VECTOR);
+
 	svm->nested.vmcb = 0;
 	svm->vcpu.arch.hflags = 0;
 
@@ -1633,6 +1636,11 @@ static void update_exception_bitmap(struct kvm_vcpu *vcpu)
 
 	clr_exception_intercept(svm, BP_VECTOR);
 
+	if (svm_need_pf_intercept(svm))
+		set_exception_intercept(svm, PF_VECTOR);
+	else
+		clr_exception_intercept(svm, PF_VECTOR);
+
 	if (vcpu->guest_debug & KVM_GUESTDBG_ENABLE) {
 		if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP)
 			set_exception_intercept(svm, BP_VECTOR);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 6ac4c00a5d82..2b7469f3db0e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -19,6 +19,7 @@
 #include
 #include
+#include "cpuid.h"
 
 static const u32 host_save_user_msrs[] = {
 #ifdef CONFIG_X86_64
@@ -345,6 +346,11 @@ static inline bool gif_set(struct vcpu_svm *svm)
 	return !!(svm->vcpu.arch.hflags & HF_GIF_MASK);
 }
 
+static inline bool svm_need_pf_intercept(struct vcpu_svm *svm)
+{
+	return !npt_enabled;
+}
+
 /* svm.c */
 
 #define MSR_INVALID	0xffffffffU
-- 
2.26.2
[PATCH v2 11/11] KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR support configurable
The reason behind including this patch is unexpected behaviour we see with NPT vmexit handling in AMD processor. With previous patch ("KVM: SVM: Add guest physical address check in NPF/PF interception") we see the followning error multiple times in the 'access' test in kvm-unit-tests: test pte.p pte.36 pde.p: FAIL: pte 221 expected 201 Dump mapping: address: 0x1234 --L4: 24c3027 --L3: 24c4027 --L2: 24c5021 --L1: 100221 This shows that the PTE's accessed bit is apparently being set by the CPU hardware before the NPF vmexit. This completely handled by hardware and can not be fixed in software. This patch introduces a workaround. We add a boolean variable: 'allow_smaller_maxphyaddr' Which is set individually by VMX and SVM init routines. On VMX it's always set to true, on SVM it's only set to true when NPT is not enabled. We also add a new capability KVM_CAP_SMALLER_MAXPHYADDR which allows userspace to query if the underlying architecture would support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly (e.g. 
qemu can decide if it would ignore the -cpu ..,phys-bits=X) CC: Tom Lendacky CC: Babu Moger Signed-off-by: Mohammed Gamal --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/svm/svm.c | 15 +++ arch/x86/kvm/vmx/vmx.c | 7 +++ arch/x86/kvm/x86.c | 6 ++ include/uapi/linux/kvm.h| 1 + 5 files changed, 30 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 7ebdb43632e0..b25f7497307d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1304,7 +1304,7 @@ struct kvm_arch_async_pf { }; extern u64 __read_mostly host_efer; - +extern bool __read_mostly allow_smaller_maxphyaddr; extern struct kvm_x86_ops kvm_x86_ops; #define __KVM_HAVE_ARCH_VM_ALLOC diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index ec3224a2e7c2..1b8880b89e9f 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -924,6 +924,21 @@ static __init int svm_hardware_setup(void) svm_set_cpu_caps(); + /* +* It seems that on AMD processors PTE's accessed bit is +* being set by the CPU hardware before the NPF vmexit. +* This is not expected behaviour and our tests fail because +* of it. +* A workaround here is to disable support for +* GUEST_MAXPHYADDR < HOST_MAXPHYADDR if NPT is enabled. 
+* In this case userspace can know if there is support using +* KVM_CAP_SMALLER_MAXPHYADDR extension and decide how to handle +* it +* If future AMD CPU models change the behaviour described above, +* this variable can be changed accordingly +*/ + allow_smaller_maxphyaddr = !npt_enabled; + return 0; err: diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 8daf78b2d4cb..fe0ca39c0887 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -8316,6 +8316,13 @@ static int __init vmx_init(void) #endif vmx_check_vmcs12_offsets(); + /* +* Intel processors don't have problems with +* GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable +* it for VMX by default +*/ + allow_smaller_maxphyaddr = true; + return 0; } module_init(vmx_init); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 84f1f0084d2e..5bca6d6d24e9 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -187,6 +187,9 @@ static struct kvm_shared_msrs __percpu *shared_msrs; u64 __read_mostly host_efer; EXPORT_SYMBOL_GPL(host_efer); +bool __read_mostly allow_smaller_maxphyaddr; +EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr); + static u64 __read_mostly host_xss; u64 __read_mostly supported_xss; EXPORT_SYMBOL_GPL(supported_xss); @@ -3533,6 +3536,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_HYPERV_ENLIGHTENED_VMCS: r = kvm_x86_ops.nested_ops->enable_evmcs != NULL; break; + case KVM_CAP_SMALLER_MAXPHYADDR: + r = (int) allow_smaller_maxphyaddr; + break; default: break; } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 4fdf30316582..68cd3a0af9bb 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1031,6 +1031,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_PPC_SECURE_GUEST 181 #define KVM_CAP_HALT_POLL 182 #define KVM_CAP_ASYNC_PF_INT 183 +#define KVM_CAP_SMALLER_MAXPHYADDR 184 #ifdef KVM_CAP_IRQ_ROUTING -- 2.26.2
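As a sketch of how userspace might consume the new capability, the usual KVM_CHECK_EXTENSION pattern applies. The constants below are defined locally only as a fallback for building without current kernel headers (KVMIO is 0xAE and KVM_CHECK_EXTENSION is _IO(KVMIO, 0x03) in <linux/kvm.h>; KVM_CAP_SMALLER_MAXPHYADDR, 184, comes from this patch), and the function name is hypothetical:

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Fallback definitions so this sketch builds against older headers. */
#ifndef KVM_CHECK_EXTENSION
#define KVM_CHECK_EXTENSION _IO(0xAE, 0x03)
#endif
#ifndef KVM_CAP_SMALLER_MAXPHYADDR
#define KVM_CAP_SMALLER_MAXPHYADDR 184
#endif

/* Returns 1 if GUEST_MAXPHYADDR < HOST_MAXPHYADDR is supported,
 * 0 if not, and -1 if /dev/kvm cannot be opened. */
static int query_smaller_maxphyaddr(void)
{
	int r, kvm = open("/dev/kvm", O_RDONLY);

	if (kvm < 0)
		return -1;
	r = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_SMALLER_MAXPHYADDR);
	close(kvm);
	return r;
}
```

With a check like this, qemu (or any VMM) could decide whether to honour or reject a -cpu ...,phys-bits=X request on a given host.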
[PATCH v2 02/11] KVM: x86: mmu: Move translate_gpa() to mmu.c
There is also no point in it being inline, since it's always called through function pointers; so remove the inline qualifier. Signed-off-by: Mohammed Gamal Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 6 -- arch/x86/kvm/mmu/mmu.c | 6 ++ 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index f8998e97457f..bc0fb116cc5c 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1551,12 +1551,6 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd, bool skip_tlb_flush, void kvm_configure_mmu(bool enable_tdp, int tdp_page_level); -static inline gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access, - struct x86_exception *exception) -{ - return gpa; -} - static inline struct kvm_mmu_page *page_header(hpa_t shadow_page) { struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index fdd05c233308..ee113fc1f1bf 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -515,6 +515,12 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte) return likely(kvm_gen == spte_gen); } +static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access, + struct x86_exception *exception) +{ +return gpa; +} + /* * Sets the shadow PTE masks used by the MMU. * -- 2.26.2
[PATCH v2 06/11] KVM: VMX: introduce vmx_need_pf_intercept
From: Paolo Bonzini Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 28 +--- arch/x86/kvm/vmx/vmx.c| 2 +- arch/x86/kvm/vmx/vmx.h| 5 + 3 files changed, 23 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index d1af20b050a8..328411919518 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -2433,22 +2433,28 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) /* * Whether page-faults are trapped is determined by a combination of -* 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF. -* If enable_ept, L0 doesn't care about page faults and we should -* set all of these to L1's desires. However, if !enable_ept, L0 does -* care about (at least some) page faults, and because it is not easy -* (if at all possible?) to merge L0 and L1's desires, we simply ask -* to exit on each and every L2 page fault. This is done by setting -* MASK=MATCH=0 and (see below) EB.PF=1. +* 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF. If L0 +* doesn't care about page faults then we should set all of these to +* L1's desires. However, if L0 does care about (some) page faults, it +* is not easy (if at all possible?) to merge L0 and L1's desires, we +* simply ask to exit on each and every L2 page fault. This is done by +* setting MASK=MATCH=0 and (see below) EB.PF=1. * Note that below we don't need special code to set EB.PF beyond the * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept, * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when * !enable_ept, EB.PF is 1, so the "or" will always be 1. */ - vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, - enable_ept ? vmcs12->page_fault_error_code_mask : 0); - vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, - enable_ept ? vmcs12->page_fault_error_code_match : 0); + if (vmx_need_pf_intercept(&vmx->vcpu)) { + /* +* TODO: if both L0 and L1 need the same MASK and MATCH, +* go ahead and use it? 
+*/ + vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0); + vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0); + } else { + vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, vmcs12->page_fault_error_code_mask); + vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, vmcs12->page_fault_error_code_match); + } if (cpu_has_vmx_apicv()) { vmcs_write64(EOI_EXIT_BITMAP0, vmcs12->eoi_exit_bitmap0); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index f82c42ac87f9..46d522ee5cb1 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -783,7 +783,7 @@ void update_exception_bitmap(struct kvm_vcpu *vcpu) eb |= 1u << BP_VECTOR; if (to_vmx(vcpu)->rmode.vm86_active) eb = ~0; - if (enable_ept) + if (!vmx_need_pf_intercept(vcpu)) eb &= ~(1u << PF_VECTOR); /* When we are running a nested L2 guest and L1 specified for it a diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 8a83b5edc820..5e2da15fe94f 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -552,6 +552,11 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx) SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE; } +static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu) +{ + return !enable_ept; +} + void dump_vmcs(void); #endif /* __KVM_X86_VMX_H */ -- 2.26.2
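The effect of the update_exception_bitmap() change can be illustrated with plain bit arithmetic on an exception bitmap. This is a hypothetical standalone mirror of the logic, not the kernel code itself; PF_VECTOR is 14 on x86:

```c
#include <stdbool.h>
#include <stdint.h>

#define PF_VECTOR 14

/* Mirror of the vmcs01 exception-bitmap update in this patch: the #PF
 * intercept bit is dropped only when vmx_need_pf_intercept() says it
 * is unnecessary; otherwise the bitmap is left untouched. */
static uint32_t update_eb(uint32_t eb, bool need_pf_intercept)
{
	if (!need_pf_intercept)
		eb &= ~(1u << PF_VECTOR);
	return eb;
}
```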
[PATCH v2 03/11] KVM: x86: mmu: Add guest physical address check in translate_gpa()
When running a guest with 4-level page tables on a 5-level page table host, the guest may have a physical address with reserved bits set, but the host won't see that and trap it. Hence, we need to check page faults' physical addresses against the guest's maximum physical address, and if it's exceeded, add the PFERR_RSVD_MASK bits to the PF's error code. Also make sure the error code isn't overwritten by the page table walker. Signed-off-by: Mohammed Gamal Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index ee113fc1f1bf..10409b76b2d8 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -518,6 +518,12 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte) static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access, struct x86_exception *exception) { + /* Check if guest physical address doesn't exceed guest maximum */ + if (kvm_mmu_is_illegal_gpa(vcpu, gpa)) { + exception->error_code |= PFERR_RSVD_MASK; + return UNMAPPED_GVA; + } + return gpa; } -- 2.26.2
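The check itself is a single comparison against BIT_ULL(maxphyaddr). The following is a hypothetical userspace mirror of kvm_mmu_is_illegal_gpa() and of the error-code fixup in translate_gpa(); PFERR_RSVD_MASK is bit 3 of the x86 page-fault error code:

```c
#include <stdbool.h>
#include <stdint.h>

#define PFERR_RSVD_MASK (1ULL << 3)

/* Mirror of kvm_mmu_is_illegal_gpa(): any GPA at or above
 * 2^maxphyaddr has bits set that are reserved from the guest's
 * point of view. */
static bool gpa_is_illegal(uint64_t gpa, unsigned int maxphyaddr)
{
	return gpa >= (1ULL << maxphyaddr);
}

/* Mirror of the translate_gpa() fixup: flag the fault as a
 * reserved-bit fault when the GPA exceeds the guest maximum. */
static uint64_t fixup_error_code(uint64_t gpa, unsigned int maxphyaddr,
				 uint64_t error_code)
{
	if (gpa_is_illegal(gpa, maxphyaddr))
		error_code |= PFERR_RSVD_MASK;
	return error_code;
}
```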
[PATCH v2 04/11] KVM: x86: rename update_bp_intercept to update_exception_bitmap
From: Paolo Bonzini We would like to introduce a callback to update the #PF intercept when CPUID changes. Just reuse update_bp_intercept since VMX is already using update_exception_bitmap instead of a bespoke function. While at it, remove an unnecessary assignment in the SVM version, which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug) and has nothing to do with the exception bitmap. Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/svm/svm.c | 7 +++ arch/x86/kvm/vmx/vmx.c | 2 +- arch/x86/kvm/x86.c | 2 +- 4 files changed, 6 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index bc0fb116cc5c..7ebdb43632e0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1098,7 +1098,7 @@ struct kvm_x86_ops { void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu); void (*vcpu_put)(struct kvm_vcpu *vcpu); - void (*update_bp_intercept)(struct kvm_vcpu *vcpu); + void (*update_exception_bitmap)(struct kvm_vcpu *vcpu); int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr); int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr); u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 8ccfa4197d9c..94108e6cc6da 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1627,7 +1627,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu, mark_dirty(svm->vmcb, VMCB_SEG); } -static void update_bp_intercept(struct kvm_vcpu *vcpu) +static void update_exception_bitmap(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -1636,8 +1636,7 @@ static void update_bp_intercept(struct kvm_vcpu *vcpu) if (vcpu->guest_debug & KVM_GUESTDBG_ENABLE) { if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP) set_exception_intercept(svm, BP_VECTOR); - } else - vcpu->guest_debug = 0; + } } static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd) @@ -3989,7 +3988,7 
@@ static struct kvm_x86_ops svm_x86_ops __initdata = { .vcpu_blocking = svm_vcpu_blocking, .vcpu_unblocking = svm_vcpu_unblocking, - .update_bp_intercept = update_bp_intercept, + .update_exception_bitmap = update_exception_bitmap, .get_msr_feature = svm_get_msr_feature, .get_msr = svm_get_msr, .set_msr = svm_set_msr, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 36c771728c8c..f82c42ac87f9 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7881,7 +7881,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = { .vcpu_load = vmx_vcpu_load, .vcpu_put = vmx_vcpu_put, - .update_bp_intercept = update_exception_bitmap, + .update_exception_bitmap = update_exception_bitmap, .get_msr_feature = vmx_get_msr_feature, .get_msr = vmx_get_msr, .set_msr = vmx_set_msr, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ac8642e890b1..84f1f0084d2e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9275,7 +9275,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, */ kvm_set_rflags(vcpu, rflags); - kvm_x86_ops.update_bp_intercept(vcpu); + kvm_x86_ops.update_exception_bitmap(vcpu); r = 0; -- 2.26.2
[PATCH v2 01/11] KVM: x86: Add helper functions for illegal GPA checking and page fault injection
This patch adds two helper functions that will be used to support virtualizing MAXPHYADDR in both kvm-intel.ko and kvm.ko. kvm_fixup_and_inject_pf_error() injects a page fault for a user-specified GVA, while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU address limits. Signed-off-by: Mohammed Gamal Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu.h | 6 ++ arch/x86/kvm/x86.c | 21 + arch/x86/kvm/x86.h | 1 + 3 files changed, 28 insertions(+) diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 0ad06bfe2c2c..555237dfb91c 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -4,6 +4,7 @@ #include #include "kvm_cache_regs.h" +#include "cpuid.h" #define PT64_PT_BITS 9 #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS) @@ -158,6 +159,11 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu) return kvm_read_cr0_bits(vcpu, X86_CR0_WP); } +static inline bool kvm_mmu_is_illegal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa) +{ +return (gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu))); +} + /* * Check if a given access (described through the I/D, W/R and U/S bits of a * page fault error code pfec) causes a permission fault with the given PTE diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 00c88c2f34e4..ac8642e890b1 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10693,6 +10693,27 @@ u64 kvm_spec_ctrl_valid_bits(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_spec_ctrl_valid_bits); +void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code) +{ + struct x86_exception fault; + + if (!(error_code & PFERR_PRESENT_MASK) || + vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, error_code, &fault) != UNMAPPED_GVA) { + /* +* If vcpu->arch.walk_mmu->gva_to_gpa succeeded, the page +* tables probably do not match the TLB. Just proceed +* with the error code that the processor gave. 
+*/ + fault.vector = PF_VECTOR; + fault.error_code_valid = true; + fault.error_code = error_code; + fault.nested_page_fault = false; + fault.address = gva; + } + vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault); +} +EXPORT_SYMBOL_GPL(kvm_fixup_and_inject_pf_error); + EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq); diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 6eb62e97e59f..239ae0f3e40b 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -272,6 +272,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata); bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn, int page_num); bool kvm_vector_hashing_enabled(void); +void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code); int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int emulation_type, void *insn, int insn_len); fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu); -- 2.26.2
[PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR
When EPT/NPT is enabled, KVM does not really look at guest physical address size. Address bits above maximum physical memory size are reserved. Because KVM does not look at these guest physical addresses, it currently effectively supports guest physical address sizes equal to the host. This can be a problem when having a mixed setup of machines with 5-level page tables and machines with 4-level page tables, as live migration can change MAXPHYADDR while the guest runs, which can theoretically introduce bugs. In this patch series we add checks on guest physical addresses in EPT violation/misconfig and NPF vmexits and if needed inject the proper page faults in the guest. A more subtle issue is when the host MAXPHYADDR is larger than that of the guest. Page faults caused by reserved bits on the guest won't cause an EPT violation/NPF and hence we also check guest MAXPHYADDR and add PFERR_RSVD_MASK error code to the page fault if needed. The last 3 patches (i.e. SVM bits and patch 11) are not intended for immediate inclusion and probably need more discussion. We've been noticing some unexpected behavior in handling NPF vmexits on AMD CPUs (see individual patches for details), and thus we are proposing a workaround (see last patch) that adds a capability that userspace can use to decide how to deal with hosts that might have issues supporting guest MAXPHYADDR < host MAXPHYADDR. 
Mohammed Gamal (7): KVM: x86: Add helper functions for illegal GPA checking and page fault injection KVM: x86: mmu: Move translate_gpa() to mmu.c KVM: x86: mmu: Add guest physical address check in translate_gpa() KVM: VMX: Add guest physical address check in EPT violation and misconfig KVM: SVM: introduce svm_need_pf_intercept KVM: SVM: Add guest physical address check in NPF/PF interception KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR support configurable Paolo Bonzini (4): KVM: x86: rename update_bp_intercept to update_exception_bitmap KVM: x86: update exception bitmap on CPUID changes KVM: VMX: introduce vmx_need_pf_intercept KVM: VMX: optimize #PF injection when MAXPHYADDR does not match arch/x86/include/asm/kvm_host.h | 10 ++-- arch/x86/kvm/cpuid.c| 2 ++ arch/x86/kvm/mmu.h | 6 + arch/x86/kvm/mmu/mmu.c | 12 + arch/x86/kvm/svm/svm.c | 41 +++--- arch/x86/kvm/svm/svm.h | 6 + arch/x86/kvm/vmx/nested.c | 28 arch/x86/kvm/vmx/vmx.c | 45 + arch/x86/kvm/vmx/vmx.h | 6 + arch/x86/kvm/x86.c | 29 - arch/x86/kvm/x86.h | 1 + include/uapi/linux/kvm.h| 1 + 12 files changed, 158 insertions(+), 29 deletions(-) -- 2.26.2
Re: [PATCH v2 01/11] KVM: x86: Add helper functions for illegal GPA checking and page fault injection
On Mon, 2020-06-22 at 12:44 +0800, Yuan Yao wrote: > On Fri, Jun 19, 2020 at 05:39:15PM +0200, Mohammed Gamal wrote: > > This patch adds two helper functions that will be used to support > > virtualizing > > MAXPHYADDR in both kvm-intel.ko and kvm.ko. > > > > kvm_fixup_and_inject_pf_error() injects a page fault for a user- > > specified GVA, > > while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU > > address limits. > > > > Signed-off-by: Mohammed Gamal > > Signed-off-by: Paolo Bonzini > > --- > > arch/x86/kvm/mmu.h | 6 ++ > > arch/x86/kvm/x86.c | 21 + > > arch/x86/kvm/x86.h | 1 + > > 3 files changed, 28 insertions(+) > > > > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > > index 0ad06bfe2c2c..555237dfb91c 100644 > > --- a/arch/x86/kvm/mmu.h > > +++ b/arch/x86/kvm/mmu.h > > @@ -4,6 +4,7 @@ > > > > #include > > #include "kvm_cache_regs.h" > > +#include "cpuid.h" > > > > #define PT64_PT_BITS 9 > > #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS) > > @@ -158,6 +159,11 @@ static inline bool is_write_protection(struct > > kvm_vcpu *vcpu) > > return kvm_read_cr0_bits(vcpu, X86_CR0_WP); > > } > > > > +static inline bool kvm_mmu_is_illegal_gpa(struct kvm_vcpu *vcpu, > > gpa_t gpa) > > +{ > > +return (gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu))); > > +} > > + > > /* > > * Check if a given access (described through the I/D, W/R and U/S > > bits of a > > * page fault error code pfec) causes a permission fault with the > > given PTE > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 00c88c2f34e4..ac8642e890b1 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -10693,6 +10693,27 @@ u64 kvm_spec_ctrl_valid_bits(struct > > kvm_vcpu *vcpu) > > } > > EXPORT_SYMBOL_GPL(kvm_spec_ctrl_valid_bits); > > > > +void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t > > gva, u16 error_code) > > +{ > > + struct x86_exception fault; > > + > > + if (!(error_code & PFERR_PRESENT_MASK) || > > + vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, 
error_code, > > &fault) != UNMAPPED_GVA) { > > + /* > > +* If vcpu->arch.walk_mmu->gva_to_gpa succeeded, the > > page > > +* tables probably do not match the TLB. Just proceed > > +* with the error code that the processor gave. > > +*/ > > + fault.vector = PF_VECTOR; > > + fault.error_code_valid = true; > > + fault.error_code = error_code; > > + fault.nested_page_fault = false; > > + fault.address = gva; > > + } > > + vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault); > > Should this "vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault)" > inside the last brace? > Otherwise an uninitialized fault variable will be passed to the > walk_mmu->inject_page_fault. Good catch. You're right. Will fix it in v3 > > > +} > > +EXPORT_SYMBOL_GPL(kvm_fixup_and_inject_pf_error); > > + > > EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit); > > EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio); > > EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq); > > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h > > index 6eb62e97e59f..239ae0f3e40b 100644 > > --- a/arch/x86/kvm/x86.h > > +++ b/arch/x86/kvm/x86.h > > @@ -272,6 +272,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 > > msr, u64 *pdata); > > bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, > > gfn_t gfn, > > int page_num); > > bool kvm_vector_hashing_enabled(void); > > +void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t > > gva, u16 error_code); > > int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t > > cr2_or_gpa, > > int emulation_type, void *insn, int > > insn_len); > > fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu); > > -- > > 2.26.2 > >
Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR
On Fri, 2020-06-19 at 16:52 -0500, Tom Lendacky wrote: > On 6/19/20 10:39 AM, Mohammed Gamal wrote: > > When EPT/NPT is enabled, KVM does not really look at guest physical > > address size. Address bits above maximum physical memory size are > > reserved. > > Because KVM does not look at these guest physical addresses, it > > currently > > effectively supports guest physical address sizes equal to the > > host. > > > > This can be problem when having a mixed setup of machines with 5- > > level page > > tables and machines with 4-level page tables, as live migration can > > change > > MAXPHYADDR while the guest runs, which can theoretically introduce > > bugs. > > > > In this patch series we add checks on guest physical addresses in > > EPT > > violation/misconfig and NPF vmexits and if needed inject the proper > > page faults in the guest. > > > > A more subtle issue is when the host MAXPHYADDR is larger than that > > of the > > guest. Page faults caused by reserved bits on the guest won't cause > > an EPT > > violation/NPF and hence we also check guest MAXPHYADDR and add > > PFERR_RSVD_MASK > > error code to the page fault if needed. > > I'm probably missing something here, but I'm confused by this > statement. > Is this for a case where a page has been marked not present and the > guest > has also set what it believes are reserved bits? Then when the page > is > accessed, the guest sees a page fault without the error code for > reserved > bits? If so, my understanding is that is architecturally correct. P=0 > is > considered higher priority than other page faults, at least on AMD. > So if > you have a P=0 and other issues exist within the PTE, AMD will report > the > P=0 fault and that's it. > > The priority of other page fault conditions when P=1 is not defined > and I > don't think we guarantee that you would get all error codes on > fault. 
> Software is always expected to address the page fault and retry, and > it > may get another page fault when it does, with a different error > code. > Assuming the other errors are addressed, eventually the reserved > bits > would cause an NPF and that could be detected by the HV and handled > appropriately. > > > The last 3 patches (i.e. SVM bits and patch 11) are not intended > > for > > immediate inclusion and probably need more discussion. > > We've been noticing some unexpected behavior in handling NPF > > vmexits > > on AMD CPUs (see individual patches for details), and thus we are > > proposing a workaround (see last patch) that adds a capability that > > userspace can use to decide who to deal with hosts that might have > > issues supprting guest MAXPHYADDR < host MAXPHYADDR. > > Also, something to consider. On AMD, when memory encryption is > enabled > (via the SYS_CFG MSR), a guest can actually have a larger MAXPHYADDR > than > the host. How do these patches all play into that? Well the patches definitely don't address that case. It's assumed a guest VM's MAXPHYADDR <= host MAXPHYADDR, and hence we handle the case where a guests's physical address space is smaller and try to trap faults that may go unnoticed by the host. My question is in the case of guest MAXPHYADDR > host MAXPHYADDR, do we expect somehow that there might be guest physical addresses that contain what the host could see as reserved bits? And how'd the host handle that? 
Thanks, Mohammed > > Thanks, > Tom > > > > > Mohammed Gamal (7): > >KVM: x86: Add helper functions for illegal GPA checking and page > > fault > > injection > >KVM: x86: mmu: Move translate_gpa() to mmu.c > >KVM: x86: mmu: Add guest physical address check in > > translate_gpa() > >KVM: VMX: Add guest physical address check in EPT violation and > > misconfig > >KVM: SVM: introduce svm_need_pf_intercept > >KVM: SVM: Add guest physical address check in NPF/PF > > interception > >KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR > > support > > configurable > > > > Paolo Bonzini (4): > >KVM: x86: rename update_bp_intercept to update_exception_bitmap > >KVM: x86: update exception bitmap on CPUID changes > >KVM: VMX: introduce vmx_need_pf_intercept > >KVM: VMX: optimize #PF injection when MAXPHYADDR does not match > > > > arch/x86/include/asm/kvm_host.h | 10 ++-- > > arch/x86/kvm/cpuid.c| 2 ++ > > arch/x86/kvm/mmu.h | 6 + > > a
[PATCH RESEND] hv: clocksource: Add notrace attribute to read_hv_sched_clock_*() functions
When selecting function_graph tracer with the command: # echo function_graph > /sys/kernel/debug/tracing/current_tracer The kernel crashes with the following stack trace: [69703.122389] BUG: stack guard page was hit at 1056545c (stack is fa3f8fed..05d39503) [69703.122403] kernel stack overflow (double-fault): [#1] SMP PTI [69703.122413] CPU: 0 PID: 16982 Comm: bash Kdump: loaded Not tainted 4.18.0-236.el8.x86_64 #1 [69703.122420] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019 [69703.122433] RIP: 0010repare_ftrace_return+0xa/0x110 [69703.122458] Code: 05 00 0f 0b 48 c7 c7 10 ca 69 ae 0f b6 f0 e8 4b 52 0c 00 31 c0 eb ca 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 41 54 <53> 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 48 85 [69703.122467] RSP: 0018:bd6d01118000 EFLAGS: 00010086 [69703.122476] RAX: RBX: RCX: 0003 [69703.122484] RDX: RSI: bd6d011180d8 RDI: adce7550 [69703.122491] RBP: bd6d01118018 R08: R09: 9d4b09266000 [69703.122498] R10: 9d4b0fc04540 R11: 9d4b0fc20a00 R12: 9d4b6e42aa90 [69703.122506] R13: 9d4b0fc20ab8 R14: 03e8 R15: bd6d0111837c [69703.122514] FS: 7fd5f2588740() GS:9d4b6e40() knlGS: [69703.122521] CS: 0010 DS: ES: CR0: 80050033 [69703.122528] CR2: bd6d01117ff8 CR3: 565d8001 CR4: 003606f0 [69703.122538] DR0: DR1: DR2: [69703.122545] DR3: DR6: fffe0ff0 DR7: 0400 [69703.122552] Call Trace: [69703.122568] ftrace_graph_caller+0x6b/0xa0 [69703.122589] ? read_hv_sched_clock_tsc+0x5/0x20 [69703.122599] read_hv_sched_clock_tsc+0x5/0x20 [69703.122611] sched_clock+0x5/0x10 [69703.122621] sched_clock_local+0x12/0x80 [69703.122631] sched_clock_cpu+0x8c/0xb0 [69703.122644] trace_clock_global+0x21/0x90 [69703.122655] ring_buffer_lock_reserve+0x100/0x3c0 [69703.122671] trace_buffer_lock_reserve+0x16/0x50 [69703.122683] __trace_graph_entry+0x28/0x90 [69703.122695] trace_graph_entry+0xfd/0x1a0 [69703.122705] ? read_hv_clock_tsc_cs+0x10/0x10 [69703.122714] ? 
sched_clock+0x5/0x10 [69703.122723] prepare_ftrace_return+0x99/0x110 [69703.122734] ? read_hv_clock_tsc_cs+0x10/0x10 [69703.122743] ? sched_clock+0x5/0x10 [...] Setting the notrace attribute for read_hv_sched_clock_msr() and read_hv_sched_clock_tsc() fixes it. Fixes: bd00cd52d5be ("clocksource/drivers/hyperv: Add Hyper-V specific sched clock function") Suggested-by: Vitaly Kuznetsov Signed-off-by: Mohammed Gamal --- drivers/clocksource/hyperv_timer.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c index 09aa44cb8a91d..ba04cb381cd3f 100644 --- a/drivers/clocksource/hyperv_timer.c +++ b/drivers/clocksource/hyperv_timer.c @@ -341,7 +341,7 @@ static u64 notrace read_hv_clock_tsc_cs(struct clocksource *arg) return read_hv_clock_tsc(); } -static u64 read_hv_sched_clock_tsc(void) +static u64 notrace read_hv_sched_clock_tsc(void) { return (read_hv_clock_tsc() - hv_sched_clock_offset) * (NSEC_PER_SEC / HV_CLOCK_HZ); @@ -404,7 +404,7 @@ static u64 notrace read_hv_clock_msr_cs(struct clocksource *arg) return read_hv_clock_msr(); } -static u64 read_hv_sched_clock_msr(void) +static u64 notrace read_hv_sched_clock_msr(void) { return (read_hv_clock_msr() - hv_sched_clock_offset) * (NSEC_PER_SEC / HV_CLOCK_HZ); -- 2.26.2
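Independent of the notrace fix, the conversion these functions perform is simple fixed-rate scaling: the Hyper-V reference counter runs at 10 MHz (HV_CLOCK_HZ in the driver), so each tick is 100 ns. A hypothetical standalone version of the arithmetic:

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define HV_CLOCK_HZ  (NSEC_PER_SEC / 100)	/* 10 MHz reference counter */

/* Mirror of read_hv_sched_clock_*(): subtract the boot-time offset,
 * then scale the 100ns ticks up to nanoseconds. */
static uint64_t hv_ticks_to_ns(uint64_t ticks, uint64_t offset)
{
	return (ticks - offset) * (NSEC_PER_SEC / HV_CLOCK_HZ);
}
```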
Re: [PATCH RESEND] hv: clocksource: Add notrace attribute to read_hv_sched_clock_*() functions
On Mon, 2020-10-05 at 13:47 +0200, Mohammed Gamal wrote: > When selecting function_graph tracer with the command: > # echo function_graph > /sys/kernel/debug/tracing/current_tracer > > The kernel crashes with the following stack trace: > > [69703.122389] BUG: stack guard page was hit at 1056545c > (stack is fa3f8fed..05d39503) > [69703.122403] kernel stack overflow (double-fault): [#1] SMP > PTI > [69703.122413] CPU: 0 PID: 16982 Comm: bash Kdump: loaded Not tainted > 4.18.0-236.el8.x86_64 #1 > [69703.122420] Hardware name: Microsoft Corporation Virtual > Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019 > [69703.122433] RIP: 0010repare_ftrace_return+0xa/0x110 > [69703.122458] Code: 05 00 0f 0b 48 c7 c7 10 ca 69 ae 0f b6 f0 e8 4b > 52 0c 00 31 c0 eb ca 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 > 55 41 54 <53> 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 > c0 48 85 > [69703.122467] RSP: 0018:bd6d01118000 EFLAGS: 00010086 > [69703.122476] RAX: RBX: RCX: > 0003 > [69703.122484] RDX: RSI: bd6d011180d8 RDI: > adce7550 > [69703.122491] RBP: bd6d01118018 R08: R09: > 9d4b09266000 > [69703.122498] R10: 9d4b0fc04540 R11: 9d4b0fc20a00 R12: > 9d4b6e42aa90 > [69703.122506] R13: 9d4b0fc20ab8 R14: 03e8 R15: > bd6d0111837c > [69703.122514] FS: 7fd5f2588740() GS:9d4b6e40() > knlGS: > [69703.122521] CS: 0010 DS: ES: CR0: 80050033 > [69703.122528] CR2: bd6d01117ff8 CR3: 565d8001 CR4: > 003606f0 > [69703.122538] DR0: DR1: DR2: > > [69703.122545] DR3: DR6: fffe0ff0 DR7: > 0400 > [69703.122552] Call Trace: > [69703.122568] ftrace_graph_caller+0x6b/0xa0 > [69703.122589] ? 
read_hv_sched_clock_tsc+0x5/0x20 > [69703.122599] read_hv_sched_clock_tsc+0x5/0x20 > [69703.122611] sched_clock+0x5/0x10 > [69703.122621] sched_clock_local+0x12/0x80 > [69703.122631] sched_clock_cpu+0x8c/0xb0 > [69703.122644] trace_clock_global+0x21/0x90 > [69703.122655] ring_buffer_lock_reserve+0x100/0x3c0 > [69703.122671] trace_buffer_lock_reserve+0x16/0x50 > [69703.122683] __trace_graph_entry+0x28/0x90 > [69703.122695] trace_graph_entry+0xfd/0x1a0 > [69703.122705] ? read_hv_clock_tsc_cs+0x10/0x10 > [69703.122714] ? sched_clock+0x5/0x10 > [69703.122723] prepare_ftrace_return+0x99/0x110 > [69703.122734] ? read_hv_clock_tsc_cs+0x10/0x10 > [69703.122743] ? sched_clock+0x5/0x10 > [...] > > Setting the notrace attribute for read_hv_sched_clock_msr() and > read_hv_sched_clock_tsc() fixes it > > Fixes: bd00cd52d5be ("clocksource/drivers/hyperv: Add Hyper-V > specific > sched clock function") > Suggested-by: Vitaly Kuznetsov > Signed-off-by: Mohammed Gamal > --- > drivers/clocksource/hyperv_timer.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/clocksource/hyperv_timer.c > b/drivers/clocksource/hyperv_timer.c > index 09aa44cb8a91d..ba04cb381cd3f 100644 > --- a/drivers/clocksource/hyperv_timer.c > +++ b/drivers/clocksource/hyperv_timer.c > @@ -341,7 +341,7 @@ static u64 notrace read_hv_clock_tsc_cs(struct > clocksource *arg) > return read_hv_clock_tsc(); > } > > -static u64 read_hv_sched_clock_tsc(void) > +static u64 notrace read_hv_sched_clock_tsc(void) > { > return (read_hv_clock_tsc() - hv_sched_clock_offset) * > (NSEC_PER_SEC / HV_CLOCK_HZ); > @@ -404,7 +404,7 @@ static u64 notrace read_hv_clock_msr_cs(struct > clocksource *arg) > return read_hv_clock_msr(); > } > > -static u64 read_hv_sched_clock_msr(void) > +static u64 notrace read_hv_sched_clock_msr(void) > { > return (read_hv_clock_msr() - hv_sched_clock_offset) * > (NSEC_PER_SEC / HV_CLOCK_HZ); Please ignore the patch. Somehow I missed Wei's reply on it. 
It's already applied to hyperv-next. Thanks
[PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
This patch exposes allow_smaller_maxphyaddr to the user as a module parameter. Since smaller physical address spaces are only supported on VMX, the parameter is only exposed in the kvm_intel module. Modifications to VMX page fault and EPT violation handling will depend on whether that parameter is enabled. Also disable support by default, and let the user decide if they want to enable it. Signed-off-by: Mohammed Gamal --- arch/x86/kvm/vmx/vmx.c | 15 ++- arch/x86/kvm/vmx/vmx.h | 3 +++ arch/x86/kvm/x86.c | 2 +- 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 819c185adf09..dc778c7b5a06 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1; module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO); #endif +extern bool __read_mostly allow_smaller_maxphyaddr; +module_param(allow_smaller_maxphyaddr, bool, S_IRUGO | S_IWUSR); + #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD) #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE #define KVM_VM_CR0_ALWAYS_ON \ @@ -4798,7 +4801,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu) if (is_page_fault(intr_info)) { cr2 = vmx_get_exit_qual(vcpu); - if (enable_ept && !vcpu->arch.apf.host_apf_flags) { + if (enable_ept && !vcpu->arch.apf.host_apf_flags + && allow_smaller_maxphyaddr) { /* * EPT will cause page fault only if we need to * detect illegal GPAs. @@ -5331,7 +5335,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu) * would also use advanced VM-exit information for EPT violations to * reconstruct the page fault error code. 
*/ - if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa))) + if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)) && allow_smaller_maxphyaddr) return kvm_emulate_instruction(vcpu, 0); return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0); @@ -8303,13 +8307,6 @@ static int __init vmx_init(void) #endif vmx_check_vmcs12_offsets(); - /* -* Intel processors don't have problems with -* GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable -* it for VMX by default -*/ - allow_smaller_maxphyaddr = true; - return 0; } module_init(vmx_init); diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 26175a4759fa..b859435efa2e 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -551,6 +551,9 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx) static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu) { + if (!allow_smaller_maxphyaddr) + return false; + return !enable_ept || cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits; } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d39d6cf1d473..982f1d73a884 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -188,7 +188,7 @@ static struct kvm_shared_msrs __percpu *shared_msrs; u64 __read_mostly host_efer; EXPORT_SYMBOL_GPL(host_efer); -bool __read_mostly allow_smaller_maxphyaddr; +bool __read_mostly allow_smaller_maxphyaddr = 0; EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr); static u64 __read_mostly host_xss; -- 2.26.2
[PATCH v3 5/9] KVM: x86: update exception bitmap on CPUID changes
From: Paolo Bonzini Allow vendor code to observe changes to MAXPHYADDR and start/stop intercepting page faults. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/cpuid.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 8a294f9747aa..ea5bbf2153bb 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -128,6 +128,8 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) kvm_mmu_reset_context(vcpu); kvm_pmu_refresh(vcpu); + kvm_x86_ops.update_exception_bitmap(vcpu); + return 0; } -- 2.26.2
[PATCH v3 2/9] KVM: x86: mmu: Move translate_gpa() to mmu.c
There's also no point in it being inline, since it's always called through function pointers, so remove the inline qualifier. Signed-off-by: Mohammed Gamal Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 6 -- arch/x86/kvm/mmu/mmu.c | 6 ++ 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index be5363b21540..62373cc06c72 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1551,12 +1551,6 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd, bool skip_tlb_flush, void kvm_configure_mmu(bool enable_tdp, int tdp_page_level); -static inline gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access, - struct x86_exception *exception) -{ - return gpa; -} - static inline struct kvm_mmu_page *page_header(hpa_t shadow_page) { struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 6d6a0ae7800c..f8b3c5181466 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -515,6 +515,12 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte) return likely(kvm_gen == spte_gen); } +static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access, + struct x86_exception *exception) +{ +return gpa; +} + /* * Sets the shadow PTE masks used by the MMU. * -- 2.26.2
[PATCH v3 1/9] KVM: x86: Add helper functions for illegal GPA checking and page fault injection
This patch adds two helper functions that will be used to support virtualizing MAXPHYADDR in both kvm-intel.ko and kvm.ko. kvm_fixup_and_inject_pf_error() injects a page fault for a user-specified GVA, while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU address limits. Signed-off-by: Mohammed Gamal Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu.h | 6 ++ arch/x86/kvm/x86.c | 21 + arch/x86/kvm/x86.h | 1 + 3 files changed, 28 insertions(+) diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 444bb9c54548..59930231d5d5 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -4,6 +4,7 @@ #include #include "kvm_cache_regs.h" +#include "cpuid.h" #define PT64_PT_BITS 9 #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS) @@ -158,6 +159,11 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu) return kvm_read_cr0_bits(vcpu, X86_CR0_WP); } +static inline bool kvm_mmu_is_illegal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa) +{ +return (gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu))); +} + /* * Check if a given access (described through the I/D, W/R and U/S bits of a * page fault error code pfec) causes a permission fault with the given PTE diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 88c593f83b28..1f5f4074fc59 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10699,6 +10699,27 @@ u64 kvm_spec_ctrl_valid_bits(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_spec_ctrl_valid_bits); +void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code) +{ + struct x86_exception fault; + + if (!(error_code & PFERR_PRESENT_MASK) || + vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, error_code, &fault) != UNMAPPED_GVA) { + /* +* If vcpu->arch.walk_mmu->gva_to_gpa succeeded, the page +* tables probably do not match the TLB. Just proceed +* with the error code that the processor gave. 
+*/ + fault.vector = PF_VECTOR; + fault.error_code_valid = true; + fault.error_code = error_code; + fault.nested_page_fault = false; + fault.address = gva; + } + vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault); +} +EXPORT_SYMBOL_GPL(kvm_fixup_and_inject_pf_error); + EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq); diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 6eb62e97e59f..239ae0f3e40b 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -272,6 +272,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata); bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn, int page_num); bool kvm_vector_hashing_enabled(void); +void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code); int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int emulation_type, void *insn, int insn_len); fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu); -- 2.26.2
[PATCH v3 0/9] KVM: Support guest MAXPHYADDR < host MAXPHYADDR
When EPT is enabled, KVM does not really look at guest physical address size. Address bits above maximum physical memory size are reserved. Because KVM does not look at these guest physical addresses, it currently effectively supports guest physical address sizes equal to the host. This can be a problem when having a mixed setup of machines with 5-level page tables and machines with 4-level page tables, as live migration can change MAXPHYADDR while the guest runs, which can theoretically introduce bugs. In this patch series we add checks on guest physical addresses in EPT violation/misconfig and NPF vmexits and, if needed, inject the proper page faults in the guest. A more subtle issue is when the host MAXPHYADDR is larger than that of the guest. Page faults caused by reserved bits on the guest won't cause an EPT violation/NPF and hence we also check guest MAXPHYADDR and add PFERR_RSVD_MASK error code to the page fault if needed. Changes from v2: - Drop support for this feature on AMD processors after discussion with AMD Mohammed Gamal (5): KVM: x86: Add helper functions for illegal GPA checking and page fault injection KVM: x86: mmu: Move translate_gpa() to mmu.c KVM: x86: mmu: Add guest physical address check in translate_gpa() KVM: VMX: Add guest physical address check in EPT violation and misconfig KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR support configurable Paolo Bonzini (4): KVM: x86: rename update_bp_intercept to update_exception_bitmap KVM: x86: update exception bitmap on CPUID changes KVM: VMX: introduce vmx_need_pf_intercept KVM: VMX: optimize #PF injection when MAXPHYADDR does not match arch/x86/include/asm/kvm_host.h | 10 ++-- arch/x86/kvm/cpuid.c| 2 ++ arch/x86/kvm/mmu.h | 6 + arch/x86/kvm/mmu/mmu.c | 12 + arch/x86/kvm/svm/svm.c | 22 +--- arch/x86/kvm/vmx/nested.c | 28 arch/x86/kvm/vmx/vmx.c | 45 + arch/x86/kvm/vmx/vmx.h | 6 + arch/x86/kvm/x86.c | 29 - arch/x86/kvm/x86.h | 1 + include/uapi/linux/kvm.h| 1 + 11 files changed, 133
insertions(+), 29 deletions(-) -- 2.26.2
[PATCH v3 7/9] KVM: VMX: Add guest physical address check in EPT violation and misconfig
Check guest physical address against its maximum physical address. If the guest's physical address exceeds the maximum (i.e. has reserved bits set), inject a guest page fault with PFERR_RSVD_MASK set. This has to be done both in the EPT violation and page fault paths, as there are complications in both cases with respect to the computation of the correct error code. For EPT violations, unfortunately the only possibility is to emulate, because the access type in the exit qualification might refer to an access to a paging structure, rather than to the access performed by the program. Trapping page faults instead is needed in order to correct the error code, but the access type can be obtained from the original error code and passed to gva_to_gpa. The corrections required in the error code are subtle. For example, imagine that a PTE for a supervisor page has a reserved bit set. On a supervisor-mode access, the EPT violation path would trigger. However, on a user-mode access, the processor will not notice the reserved bit and not include PFERR_RSVD_MASK in the error code. Co-developed-by: Mohammed Gamal Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 24 +--- arch/x86/kvm/vmx/vmx.h | 3 ++- 2 files changed, 23 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 770b090969fb..de3f436b2d32 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -4790,9 +4790,15 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu) if (is_page_fault(intr_info)) { cr2 = vmx_get_exit_qual(vcpu); - /* EPT won't cause page fault directly */ - WARN_ON_ONCE(!vcpu->arch.apf.host_apf_flags && enable_ept); - return kvm_handle_page_fault(vcpu, error_code, cr2, NULL, 0); + if (enable_ept && !vcpu->arch.apf.host_apf_flags) { + /* +* EPT will cause page fault only if we need to +* detect illegal GPAs.
+*/ + kvm_fixup_and_inject_pf_error(vcpu, cr2, error_code); + return 1; + } else + return kvm_handle_page_fault(vcpu, error_code, cr2, NULL, 0); } ex_no = intr_info & INTR_INFO_VECTOR_MASK; @@ -5308,6 +5314,18 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu) PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK; vcpu->arch.exit_qualification = exit_qualification; + + /* +* Check that the GPA doesn't exceed physical memory limits, as that is +* a guest page fault. We have to emulate the instruction here, because +* if the illegal address is that of a paging structure, then +* EPT_VIOLATION_ACC_WRITE bit is set. Alternatively, if supported we +* would also use advanced VM-exit information for EPT violations to +* reconstruct the page fault error code. +*/ + if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa))) + return kvm_emulate_instruction(vcpu, 0); + return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0); } diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index b0e5e210f1c1..0d06951e607c 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -11,6 +11,7 @@ #include "kvm_cache_regs.h" #include "ops.h" #include "vmcs.h" +#include "cpuid.h" extern const u32 vmx_msr_index[]; @@ -552,7 +553,7 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx) static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu) { - return !enable_ept; + return !enable_ept || cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits; } void dump_vmcs(void); -- 2.26.2
[PATCH v3 6/9] KVM: VMX: introduce vmx_need_pf_intercept
From: Paolo Bonzini Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 28 +--- arch/x86/kvm/vmx/vmx.c| 2 +- arch/x86/kvm/vmx/vmx.h| 5 + 3 files changed, 23 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index b26655104d4a..1aea9e3b8c43 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -2433,22 +2433,28 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) /* * Whether page-faults are trapped is determined by a combination of -* 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF. -* If enable_ept, L0 doesn't care about page faults and we should -* set all of these to L1's desires. However, if !enable_ept, L0 does -* care about (at least some) page faults, and because it is not easy -* (if at all possible?) to merge L0 and L1's desires, we simply ask -* to exit on each and every L2 page fault. This is done by setting -* MASK=MATCH=0 and (see below) EB.PF=1. +* 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF. If L0 +* doesn't care about page faults then we should set all of these to +* L1's desires. However, if L0 does care about (some) page faults, it +* is not easy (if at all possible?) to merge L0 and L1's desires, we +* simply ask to exit on each and every L2 page fault. This is done by +* setting MASK=MATCH=0 and (see below) EB.PF=1. * Note that below we don't need special code to set EB.PF beyond the * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept, * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when * !enable_ept, EB.PF is 1, so the "or" will always be 1. */ - vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, - enable_ept ? vmcs12->page_fault_error_code_mask : 0); - vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, - enable_ept ? vmcs12->page_fault_error_code_match : 0); + if (vmx_need_pf_intercept(&vmx->vcpu)) { + /* +* TODO: if both L0 and L1 need the same MASK and MATCH, +* go ahead and use it? 
+*/ + vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0); + vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0); + } else { + vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, vmcs12->page_fault_error_code_mask); + vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, vmcs12->page_fault_error_code_match); + } if (cpu_has_vmx_apicv()) { vmcs_write64(EOI_EXIT_BITMAP0, vmcs12->eoi_exit_bitmap0); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 178ee92551a9..770b090969fb 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -780,7 +780,7 @@ void update_exception_bitmap(struct kvm_vcpu *vcpu) eb |= 1u << BP_VECTOR; if (to_vmx(vcpu)->rmode.vm86_active) eb = ~0; - if (enable_ept) + if (!vmx_need_pf_intercept(vcpu)) eb &= ~(1u << PF_VECTOR); /* When we are running a nested L2 guest and L1 specified for it a diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 639798e4a6ca..b0e5e210f1c1 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -550,6 +550,11 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx) SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE; } +static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu) +{ + return !enable_ept; +} + void dump_vmcs(void); #endif /* __KVM_X86_VMX_H */ -- 2.26.2
[PATCH v3 9/9] KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR support configurable
The reason behind including this patch is unexpected behaviour we see with NPT vmexit handling on AMD processors. With the previous patch ("KVM: SVM: Add guest physical address check in NPF/PF interception") we see the following error multiple times in the 'access' test in kvm-unit-tests: test pte.p pte.36 pde.p: FAIL: pte 221 expected 201 Dump mapping: address: 0x1234 --L4: 24c3027 --L3: 24c4027 --L2: 24c5021 --L1: 100221 This shows that the PTE's accessed bit is apparently being set by the CPU hardware before the NPF vmexit. This is handled entirely by hardware and cannot be fixed in software. This patch introduces a workaround. We add a boolean variable, 'allow_smaller_maxphyaddr', which is set individually by the VMX and SVM init routines. On VMX it's always set to true, on SVM it's only set to true when NPT is not enabled. We also add a new capability KVM_CAP_SMALLER_MAXPHYADDR which allows userspace to query if the underlying architecture would support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly (e.g.
qemu can decide if it would ignore the -cpu ..,phys-bits=X) CC: Tom Lendacky CC: Babu Moger Signed-off-by: Mohammed Gamal --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/svm/svm.c | 15 +++ arch/x86/kvm/vmx/vmx.c | 7 +++ arch/x86/kvm/x86.c | 6 ++ include/uapi/linux/kvm.h| 1 + 5 files changed, 30 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index bb4044ffb7b7..26002e1b47f7 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1304,7 +1304,7 @@ struct kvm_arch_async_pf { }; extern u64 __read_mostly host_efer; - +extern bool __read_mostly allow_smaller_maxphyaddr; extern struct kvm_x86_ops kvm_x86_ops; #define __KVM_HAVE_ARCH_VM_ALLOC diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 79c33b3539f0..f3d7ae26875c 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -924,6 +924,21 @@ static __init int svm_hardware_setup(void) svm_set_cpu_caps(); + /* +* It seems that on AMD processors PTE's accessed bit is +* being set by the CPU hardware before the NPF vmexit. +* This is not expected behaviour and our tests fail because +* of it. +* A workaround here is to disable support for +* GUEST_MAXPHYADDR < HOST_MAXPHYADDR if NPT is enabled. 
+* In this case userspace can know if there is support using +* KVM_CAP_SMALLER_MAXPHYADDR extension and decide how to handle +* it +* If future AMD CPU models change the behaviour described above, +* this variable can be changed accordingly +*/ + allow_smaller_maxphyaddr = !npt_enabled; + return 0; err: diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 0cebc4832805..8a8e85e6c529 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -8294,6 +8294,13 @@ static int __init vmx_init(void) #endif vmx_check_vmcs12_offsets(); + /* +* Intel processors don't have problems with +* GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable +* it for VMX by default +*/ + allow_smaller_maxphyaddr = true; + return 0; } module_init(vmx_init); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 03c401963062..167becd6a634 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -187,6 +187,9 @@ static struct kvm_shared_msrs __percpu *shared_msrs; u64 __read_mostly host_efer; EXPORT_SYMBOL_GPL(host_efer); +bool __read_mostly allow_smaller_maxphyaddr; +EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr); + static u64 __read_mostly host_xss; u64 __read_mostly supported_xss; EXPORT_SYMBOL_GPL(supported_xss); @@ -3538,6 +3541,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_HYPERV_ENLIGHTENED_VMCS: r = kvm_x86_ops.nested_ops->enable_evmcs != NULL; break; + case KVM_CAP_SMALLER_MAXPHYADDR: + r = (int) allow_smaller_maxphyaddr; + break; default: break; } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 4fdf30316582..68cd3a0af9bb 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1031,6 +1031,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_PPC_SECURE_GUEST 181 #define KVM_CAP_HALT_POLL 182 #define KVM_CAP_ASYNC_PF_INT 183 +#define KVM_CAP_SMALLER_MAXPHYADDR 184 #ifdef KVM_CAP_IRQ_ROUTING -- 2.26.2
[PATCH v3 4/9] KVM: x86: rename update_bp_intercept to update_exception_bitmap
From: Paolo Bonzini We would like to introduce a callback to update the #PF intercept when CPUID changes. Just reuse update_bp_intercept since VMX is already using update_exception_bitmap instead of a bespoke function. While at it, remove an unnecessary assignment in the SVM version, which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug) and has nothing to do with the exception bitmap. Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/svm/svm.c | 7 +++ arch/x86/kvm/vmx/vmx.c | 2 +- arch/x86/kvm/x86.c | 2 +- 4 files changed, 6 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 62373cc06c72..bb4044ffb7b7 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1098,7 +1098,7 @@ struct kvm_x86_ops { void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu); void (*vcpu_put)(struct kvm_vcpu *vcpu); - void (*update_bp_intercept)(struct kvm_vcpu *vcpu); + void (*update_exception_bitmap)(struct kvm_vcpu *vcpu); int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr); int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr); u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index c0da4dd78ac5..79c33b3539f0 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1627,7 +1627,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu, mark_dirty(svm->vmcb, VMCB_SEG); } -static void update_bp_intercept(struct kvm_vcpu *vcpu) +static void update_exception_bitmap(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -1636,8 +1636,7 @@ static void update_bp_intercept(struct kvm_vcpu *vcpu) if (vcpu->guest_debug & KVM_GUESTDBG_ENABLE) { if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP) set_exception_intercept(svm, BP_VECTOR); - } else - vcpu->guest_debug = 0; + } } static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd) @@ -3989,7 +3988,7 
@@ static struct kvm_x86_ops svm_x86_ops __initdata = { .vcpu_blocking = svm_vcpu_blocking, .vcpu_unblocking = svm_vcpu_unblocking, - .update_bp_intercept = update_bp_intercept, + .update_exception_bitmap = update_exception_bitmap, .get_msr_feature = svm_get_msr_feature, .get_msr = svm_get_msr, .set_msr = svm_set_msr, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 13745f2a5ecd..178ee92551a9 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7859,7 +7859,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = { .vcpu_load = vmx_vcpu_load, .vcpu_put = vmx_vcpu_put, - .update_bp_intercept = update_exception_bitmap, + .update_exception_bitmap = update_exception_bitmap, .get_msr_feature = vmx_get_msr_feature, .get_msr = vmx_get_msr, .set_msr = vmx_set_msr, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1f5f4074fc59..03c401963062 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9281,7 +9281,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, */ kvm_set_rflags(vcpu, rflags); - kvm_x86_ops.update_bp_intercept(vcpu); + kvm_x86_ops.update_exception_bitmap(vcpu); r = 0; -- 2.26.2
[PATCH v3 8/9] KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
From: Paolo Bonzini Ignore non-present page faults, since those cannot have reserved bits set. When running access.flat with "-cpu Haswell,phys-bits=36", the number of trapped page faults goes down from 8872644 to 3978948. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index de3f436b2d32..0cebc4832805 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -4355,6 +4355,16 @@ static void init_vmcs(struct vcpu_vmx *vmx) vmx->pt_desc.guest.output_mask = 0x7F; vmcs_write64(GUEST_IA32_RTIT_CTL, 0); } + + /* +* If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched +* between guest and host. In that case we only care about present +* faults. +*/ + if (enable_ept) { + vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, PFERR_PRESENT_MASK); + vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, PFERR_PRESENT_MASK); + } } static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) -- 2.26.2
[PATCH v3 3/9] KVM: x86: mmu: Add guest physical address check in translate_gpa()
When running a guest with 4-level page tables on a 5-level page table host, the guest might have a physical address with reserved bits set, but the host won't see that and trap it. Hence, we need to check page faults' physical addresses against the guest's maximum physical address and, if it's exceeded, add the PFERR_RSVD_MASK bits to the PF's error code. Also make sure the error code isn't overwritten by the page table walker. Signed-off-by: Mohammed Gamal Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index f8b3c5181466..e03e85b21cda 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -518,6 +518,12 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte) static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access, struct x86_exception *exception) { + /* Check if guest physical address doesn't exceed guest maximum */ + if (kvm_mmu_is_illegal_gpa(vcpu, gpa)) { + exception->error_code |= PFERR_RSVD_MASK; + return UNMAPPED_GVA; + } + return gpa; } -- 2.26.2
[PATCH] hv: clocksource: Add notrace attribute to read_hv_sched_clock_*() functions
When selecting function_graph tracer with the command: # echo function_graph > /sys/kernel/debug/tracing/current_tracer The kernel crashes with the following stack trace: [69703.122389] BUG: stack guard page was hit at 1056545c (stack is fa3f8fed..05d39503) [69703.122403] kernel stack overflow (double-fault): [#1] SMP PTI [69703.122413] CPU: 0 PID: 16982 Comm: bash Kdump: loaded Not tainted 4.18.0-236.el8.x86_64 #1 [69703.122420] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019 [69703.122433] RIP: 0010repare_ftrace_return+0xa/0x110 [69703.122458] Code: 05 00 0f 0b 48 c7 c7 10 ca 69 ae 0f b6 f0 e8 4b 52 0c 00 31 c0 eb ca 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 41 54 <53> 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 48 85 [69703.122467] RSP: 0018:bd6d01118000 EFLAGS: 00010086 [69703.122476] RAX: RBX: RCX: 0003 [69703.122484] RDX: RSI: bd6d011180d8 RDI: adce7550 [69703.122491] RBP: bd6d01118018 R08: R09: 9d4b09266000 [69703.122498] R10: 9d4b0fc04540 R11: 9d4b0fc20a00 R12: 9d4b6e42aa90 [69703.122506] R13: 9d4b0fc20ab8 R14: 03e8 R15: bd6d0111837c [69703.122514] FS: 7fd5f2588740() GS:9d4b6e40() knlGS: [69703.122521] CS: 0010 DS: ES: CR0: 80050033 [69703.122528] CR2: bd6d01117ff8 CR3: 565d8001 CR4: 003606f0 [69703.122538] DR0: DR1: DR2: [69703.122545] DR3: DR6: fffe0ff0 DR7: 0400 [69703.122552] Call Trace: [69703.122568] ftrace_graph_caller+0x6b/0xa0 [69703.122589] ? read_hv_sched_clock_tsc+0x5/0x20 [69703.122599] read_hv_sched_clock_tsc+0x5/0x20 [69703.122611] sched_clock+0x5/0x10 [69703.122621] sched_clock_local+0x12/0x80 [69703.122631] sched_clock_cpu+0x8c/0xb0 [69703.122644] trace_clock_global+0x21/0x90 [69703.122655] ring_buffer_lock_reserve+0x100/0x3c0 [69703.122671] trace_buffer_lock_reserve+0x16/0x50 [69703.122683] __trace_graph_entry+0x28/0x90 [69703.122695] trace_graph_entry+0xfd/0x1a0 [69703.122705] ? read_hv_clock_tsc_cs+0x10/0x10 [69703.122714] ? 
sched_clock+0x5/0x10 [69703.122723] prepare_ftrace_return+0x99/0x110 [69703.122734] ? read_hv_clock_tsc_cs+0x10/0x10 [69703.122743] ? sched_clock+0x5/0x10 [69703.122752] ftrace_graph_caller+0x6b/0xa0 [69703.122768] ? read_hv_clock_tsc_cs+0x10/0x10 [69703.122777] ? sched_clock+0x5/0x10 [69703.122786] ? read_hv_sched_clock_tsc+0x5/0x20 [69703.122796] ? ring_buffer_unlock_commit+0x1d/0xa0 [69703.122805] read_hv_sched_clock_tsc+0x5/0x20 [69703.122814] ftrace_graph_caller+0xa0/0xa0 [69703.122823] ? trace_clock_local+0x5/0x10 [69703.122831] ? ftrace_push_return_trace+0x5d/0x120 [69703.122842] ? read_hv_clock_tsc_cs+0x10/0x10 [69703.122850] ? sched_clock+0x5/0x10 [69703.122860] ? prepare_ftrace_return+0xd5/0x110 [69703.122871] ? read_hv_clock_tsc_cs+0x10/0x10 [69703.122879] ? sched_clock+0x5/0x10 [69703.122889] ? ftrace_graph_caller+0x6b/0xa0 [69703.122904] ? read_hv_clock_tsc_cs+0x10/0x10 [69703.122912] ? sched_clock+0x5/0x10 [69703.122922] ? read_hv_sched_clock_tsc+0x5/0x20 [69703.122931] ? ring_buffer_unlock_commit+0x1d/0xa0 [69703.122940] ? read_hv_sched_clock_tsc+0x5/0x20 [69703.122966] ? ftrace_graph_caller+0xa0/0xa0 [69703.122975] ? trace_clock_local+0x5/0x10 [69703.122984] ? ftrace_push_return_trace+0x5d/0x120 [69703.122995] ? read_hv_clock_tsc_cs+0x10/0x10 [69703.123006] ? sched_clock+0x5/0x10 [69703.123016] ? prepare_ftrace_return+0xd5/0x110 [69703.123026] ? read_hv_clock_tsc_cs+0x10/0x10 [69703.123035] ? sched_clock+0x5/0x10 [69703.123044] ? ftrace_graph_caller+0x6b/0xa0 [69703.123059] ? read_hv_clock_tsc_cs+0x10/0x10 [69703.123068] ? 
sched_clock+0x5/0x10 Setting the notrace attribute for read_hv_sched_clock_msr() and read_hv_sched_clock_tsc() fixes it Fixes: bd00cd52d5be ("clocksource/drivers/hyperv: Add Hyper-V specific sched clock function") Signed-off-by: Vitaly Kuznetsov Signed-off-by: Mohammed Gamal --- drivers/clocksource/hyperv_timer.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c index 09aa44cb8a91d..ba04cb381cd3f 100644 --- a/drivers/clocksource/hyperv_timer.c +++ b/drivers/clocksource/hyperv_timer.c @@ -341,7 +341,7 @@ static u64 notrace read_hv_clock_tsc_cs(struct clocksource *arg) return read_hv_clock_tsc(); } -static u64 read_hv_sched_clock_tsc(void) +static u64 notrace read_hv_sched_clock_tsc(void) { return (read_hv_clock_tsc() - hv_sched_clock_offset) * (NSEC_PER_SEC / HV_CLOCK_HZ); @@ -404,7 +404,7 @@ static u64 notrace read_hv_clock_msr_cs(struct clocksource *arg) return read_hv_clock_msr(
[PATCH] netvsc: Remove redundant use of ipv6_hdr()
This condition already uses an object of type ipv6hdr in the line above. Use the object directly instead of calling ipv6_hdr() again. Signed-off-by: Mohammed Gamal --- drivers/net/hyperv/netvsc_drv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 63c98bb..06d591c 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -339,7 +339,7 @@ static u32 net_checksum_info(struct sk_buff *skb) if (ip6->nexthdr == IPPROTO_TCP) return TRANSPORT_INFO_IPV6_TCP; - else if (ipv6_hdr(skb)->nexthdr == IPPROTO_UDP) + else if (ip6->nexthdr == IPPROTO_UDP) return TRANSPORT_INFO_IPV6_UDP; } -- 2.9.4
Re: [PATCH] netvsc: Remove redundant use of ipv6_hdr()
- Original Message - > On Wed, 19 Jul 2017 15:19:28 +0200 > Mohammed Gamal wrote: > > > This condition already uses an object of type ipv6hdr in the line above. > > Use the object directly instead of calling ipv6_hdr > > > > Signed-off-by: Mohammed Gamal > > --- > > drivers/net/hyperv/netvsc_drv.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/net/hyperv/netvsc_drv.c > > b/drivers/net/hyperv/netvsc_drv.c > > index 63c98bb..06d591c 100644 > > --- a/drivers/net/hyperv/netvsc_drv.c > > +++ b/drivers/net/hyperv/netvsc_drv.c > > @@ -339,7 +339,7 @@ static u32 net_checksum_info(struct sk_buff *skb) > > > > if (ip6->nexthdr == IPPROTO_TCP) > > return TRANSPORT_INFO_IPV6_TCP; > > - else if (ipv6_hdr(skb)->nexthdr == IPPROTO_UDP) > > + else if (ip6->nexthdr == IPPROTO_UDP) > > return TRANSPORT_INFO_IPV6_UDP; > > } > > > > Patch looks fine. > Network patches go through net...@vger.kernel.org not the linux driver mailing > list. > I will add it to my next batch of patches that are going to netdev for > net-next. > Thanks for the heads up. Will take that into consideration next time. It's worth pointing out that MAINTAINERS states that files under drivers/net/hyperv are to be sent to de...@linuxdriverproject.org. Perhaps that should be updated.