Re: GPL vs non-GPL device drivers

2007-02-15 Thread Mohammed Gamal

I am still a kernel newbie, and I am not very familiar with the GPL
vs. non-GPL drivers debate. I personally think it would be better if
all drivers were GPL'd, but if that were the case, what would be the
legal position of vendors such as ATI or NVIDIA who supply closed-
source drivers? Would it be illegal to use them?

I know this question might sound silly, but as I said, I have only a
little awareness of the issue.


[PATCH] xen-netfront, xen-netback: Use correct minimum MTU values

2017-10-16 Thread Mohammed Gamal
RFC791 specifies the minimum MTU to be 68, while the xen-net{front|back}
drivers use a minimum value of 0.

Setting the MTU to a value between 0 and 67 with the xen-net{front|back}
drivers makes the network unreachable immediately; the guest can no
longer be pinged.

xen-net{front|back} should not allow the user to set an MTU value that
causes network problems.
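
For context, the networking core enforces these bounds when an MTU
change is requested; roughly (a sketch abridged from dev_set_mtu() in
net/core/dev.c of that era, not part of this patch):

int dev_set_mtu(struct net_device *dev, int new_mtu)
{
	...
	if (new_mtu < 0 || new_mtu < dev->min_mtu)
		return -EINVAL;	/* with min_mtu == 0, MTUs 0..67 slip through */

	if (dev->max_mtu > 0 && new_mtu > dev->max_mtu)
		return -EINVAL;
	...
}

With min_mtu left at 0, the core accepts the out-of-spec values, so the
fix is to advertise ETH_MIN_MTU instead.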

Reported-by: Chen Shi 
Signed-off-by: Mohammed Gamal 
---
 drivers/net/xen-netback/interface.c | 2 +-
 drivers/net/xen-netfront.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index ee8ed9da..4491ca5 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -486,7 +486,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 
dev->tx_queue_len = XENVIF_QUEUE_LENGTH;
 
-   dev->min_mtu = 0;
+   dev->min_mtu = ETH_MIN_MTU;
dev->max_mtu = ETH_MAX_MTU - VLAN_ETH_HLEN;
 
/*
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 523387e..8b8689c 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1316,7 +1316,7 @@ static struct net_device *xennet_create_dev(struct xenbus_device *dev)
netdev->features |= netdev->hw_features;
 
netdev->ethtool_ops = &xennet_ethtool_ops;
-   netdev->min_mtu = 0;
+   netdev->min_mtu = ETH_MIN_MTU;
netdev->max_mtu = XEN_NETIF_MAX_TX_SIZE;
SET_NETDEV_DEV(netdev, &dev->dev);
 
-- 
1.8.3.1



[PATCH] vmxnet3: Use correct minimum MTU value

2017-10-17 Thread Mohammed Gamal
Currently the vmxnet3 driver has a minimum MTU value of 60, which
goes against RFC791, which specifies a minimum of 68.

Setting the MTU to a value in the range 60 <= MTU <= 67 causes the
network interface to lose its IP address, and it fails to restart.

This sets the minimum value to ETH_MIN_MTU (68), in accordance with
the spec.
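
For reference, the generic Ethernet bounds are defined in
include/uapi/linux/if_ether.h (values as in kernels of that era):

#define ETH_MIN_MTU	68		/* Min IPv4 MTU per RFC791	*/
#define ETH_MAX_MTU	0xFFFFU		/* 65535, same as IP_MAX_MTU	*/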

Reported-by: Bo Yang 
Signed-off-by: Mohammed Gamal 
---
 drivers/net/vmxnet3/vmxnet3_defs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_defs.h b/drivers/net/vmxnet3/vmxnet3_defs.h
index c3a3164..4ad905a 100644
--- a/drivers/net/vmxnet3/vmxnet3_defs.h
+++ b/drivers/net/vmxnet3/vmxnet3_defs.h
@@ -749,7 +749,7 @@ struct Vmxnet3_DriverShared {
((vfTable[vid >> 5] & (1 << (vid & 31))) != 0)
 
 #define VMXNET3_MAX_MTU 9000
-#define VMXNET3_MIN_MTU 60
+#define VMXNET3_MIN_MTU ETH_MIN_MTU
 
 #define VMXNET3_LINK_UP (1 << 16 | 1)/* 10 Gbps, up */
 #define VMXNET3_LINK_DOWN   0
-- 
1.8.3.1



[RFC PATCH 2/2] hv_netvsc: Change GPADL teardown order according to Hyper-V version

2018-01-23 Thread Mohammed Gamal
Commit 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split")
introduced a regression causing VMs not to shut down on pre-Win2016
hosts after netvsc_device_remove() is called. This was caused by the
change in the GPADL teardown sequence.

This patch restores the old behavior for pre-Win2016 hosts, while
keeping the changes from 0cf7378 for Win2016 and higher hosts.

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/netvsc.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 3982f76..d09bb3b 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -575,8 +575,17 @@ void netvsc_device_remove(struct hv_device *device)
 
cancel_work_sync(&net_device->subchan_work);
 
+   /*
+* Revoke receive buffer. If host is pre-Win2016 then tear down
+* receive buffer GPADL. Do the same for send buffer.
+*/
netvsc_revoke_recv_buf(device, net_device);
+   if (vmbus_proto_version < VERSION_WIN10)
+   netvsc_teardown_recv_buf_gpadl(device, net_device);
+
netvsc_revoke_send_buf(device, net_device);
+   if (vmbus_proto_version < VERSION_WIN10)
+   netvsc_teardown_send_buf_gpadl(device, net_device);
 
RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
 
@@ -589,8 +598,14 @@ void netvsc_device_remove(struct hv_device *device)
/* Now, we can close the channel safely */
vmbus_close(device->channel);
 
-   netvsc_teardown_recv_buf_gpadl(device, net_device);
-   netvsc_teardown_send_buf_gpadl(device, net_device);
+   /*
+* If host is Win2016 or higher then we do the GPADL tear down
+* here after VMBus is closed, instead of doing it earlier.
+*/
+   if (vmbus_proto_version >= VERSION_WIN10) {
+   netvsc_teardown_recv_buf_gpadl(device, net_device);
+   netvsc_teardown_send_buf_gpadl(device, net_device);
+   }
 
/* And dissassociate NAPI context from device */
for (i = 0; i < net_device->num_chn; i++)
-- 
1.8.3.1



[RFC PATCH 1/2] hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()

2018-01-23 Thread Mohammed Gamal
Split each of the functions into two for each of send/recv buffers

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/netvsc.c | 35 +++
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index bfc7969..3982f76 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -100,8 +100,8 @@ static void free_netvsc_device_rcu(struct netvsc_device *nvdev)
call_rcu(&nvdev->rcu, free_netvsc_device);
 }
 
-static void netvsc_revoke_buf(struct hv_device *device,
- struct netvsc_device *net_device)
+static void netvsc_revoke_recv_buf(struct hv_device *device,
+  struct netvsc_device *net_device)
 {
struct nvsp_message *revoke_packet;
struct net_device *ndev = hv_get_drvdata(device);
@@ -146,6 +146,14 @@ static void netvsc_revoke_buf(struct hv_device *device,
}
net_device->recv_section_cnt = 0;
}
+}
+
+static void netvsc_revoke_send_buf(struct hv_device *device,
+  struct netvsc_device *net_device)
+{
+   struct nvsp_message *revoke_packet;
+   struct net_device *ndev = hv_get_drvdata(device);
+   int ret;
 
/* Deal with the send buffer we may have setup.
 * If we got a  send section size, it means we received a
@@ -189,8 +197,8 @@ static void netvsc_revoke_buf(struct hv_device *device,
}
 }
 
-static void netvsc_teardown_gpadl(struct hv_device *device,
- struct netvsc_device *net_device)
+static void netvsc_teardown_recv_buf_gpadl(struct hv_device *device,
+  struct netvsc_device *net_device)
 {
struct net_device *ndev = hv_get_drvdata(device);
int ret;
@@ -215,6 +223,13 @@ static void netvsc_teardown_gpadl(struct hv_device *device,
vfree(net_device->recv_buf);
net_device->recv_buf = NULL;
}
+}
+
+static void netvsc_teardown_send_buf_gpadl(struct hv_device *device,
+  struct netvsc_device *net_device)
+{
+   struct net_device *ndev = hv_get_drvdata(device);
+   int ret;
 
if (net_device->send_buf_gpadl_handle) {
ret = vmbus_teardown_gpadl(device->channel,
@@ -425,8 +440,10 @@ static int netvsc_init_buf(struct hv_device *device,
goto exit;
 
 cleanup:
-   netvsc_revoke_buf(device, net_device);
-   netvsc_teardown_gpadl(device, net_device);
+   netvsc_revoke_recv_buf(device, net_device);
+   netvsc_revoke_send_buf(device, net_device);
+   netvsc_teardown_recv_buf_gpadl(device, net_device);
+   netvsc_teardown_send_buf_gpadl(device, net_device);
 
 exit:
return ret;
@@ -558,7 +575,8 @@ void netvsc_device_remove(struct hv_device *device)
 
cancel_work_sync(&net_device->subchan_work);
 
-   netvsc_revoke_buf(device, net_device);
+   netvsc_revoke_recv_buf(device, net_device);
+   netvsc_revoke_send_buf(device, net_device);
 
RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
 
@@ -571,7 +589,8 @@ void netvsc_device_remove(struct hv_device *device)
/* Now, we can close the channel safely */
vmbus_close(device->channel);
 
-   netvsc_teardown_gpadl(device, net_device);
+   netvsc_teardown_recv_buf_gpadl(device, net_device);
+   netvsc_teardown_send_buf_gpadl(device, net_device);
 
/* And dissassociate NAPI context from device */
for (i = 0; i < net_device->num_chn; i++)
-- 
1.8.3.1



[RFC PATCH 0/2] hv_netvsc: Fix shutdown regression on Win2012 hosts

2018-01-23 Thread Mohammed Gamal
Commit 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split") introduced
a regression that caused VMs not to shut down after netvsc_device_remove() is
called. This is caused by the GPADL teardown sequence change; while that was
necessary to fix issues with Win2016 hosts, it introduced a regression for
earlier versions.

Prior to commit 0cf737808 the call sequence in netvsc_device_remove() was as 
follows (as implemented in netvsc_destroy_buf()):
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Teardown receive buffer GPADL
3- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
4- Teardown send buffer GPADL
5- Close vmbus

This didn't work for WS2016 hosts. Commit 0cf737808 split netvsc_destroy_buf()
into two functions and rearranged the order as follows
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
3- Close vmbus
4- Teardown receive buffer GPADL
5- Teardown send buffer GPADL

That worked well for WS2016 hosts, but for WS2012 hosts it prevented VMs from
shutting down. 

This patch series works around the problem. The first patch splits
netvsc_revoke_buf() and netvsc_teardown_gpadl() into finer-grained
functions that tear down the send and receive buffers individually.
The second patch uses these finer-grained functions to implement the
teardown sequence according to the host's version: we keep the behavior
introduced in 0cf737808ae7 for Windows 2016 hosts, while re-introducing
the old sequence for earlier versions.
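
In essence, the second patch makes netvsc_device_remove() do the
following (a condensed sketch using the helpers introduced in the
first patch):

	netvsc_revoke_recv_buf(device, net_device);
	if (vmbus_proto_version < VERSION_WIN10)	/* pre-Win2016 host */
		netvsc_teardown_recv_buf_gpadl(device, net_device);

	netvsc_revoke_send_buf(device, net_device);
	if (vmbus_proto_version < VERSION_WIN10)
		netvsc_teardown_send_buf_gpadl(device, net_device);

	vmbus_close(device->channel);

	if (vmbus_proto_version >= VERSION_WIN10) {	/* Win2016+ host */
		netvsc_teardown_recv_buf_gpadl(device, net_device);
		netvsc_teardown_send_buf_gpadl(device, net_device);
	}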

Mohammed Gamal (2):
  hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()
  hv_netvsc: Change GPADL teardown order according to Hyper-V version

 drivers/net/hyperv/netvsc.c | 50 +
 1 file changed, 42 insertions(+), 8 deletions(-)

-- 
1.8.3.1



Re: [RFC PATCH 1/2] hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()

2018-01-31 Thread Mohammed Gamal
On Tue, 2018-01-30 at 11:29 -0800, Stephen Hemminger wrote:
> On Tue, 23 Jan 2018 10:34:04 +0100
> Mohammed Gamal  wrote:
> 
> > Split each of the functions into two for each of send/recv buffers
> > 
> > Signed-off-by: Mohammed Gamal 
> 
> Splitting these functions is not necessary

How so? We need to send each message independently, and hence the split
(see cover letter). Is there another way?


Re: [RFC PATCH 1/2] hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()

2018-02-01 Thread Mohammed Gamal
On Wed, 2018-01-31 at 15:01 -0800, Stephen Hemminger wrote:
> On Wed, 31 Jan 2018 12:16:49 +0100
> Mohammed Gamal  wrote:
> 
> > On Tue, 2018-01-30 at 11:29 -0800, Stephen Hemminger wrote:
> > > On Tue, 23 Jan 2018 10:34:04 +0100
> > > Mohammed Gamal  wrote:
> > >   
> > > > Split each of the functions into two for each of send/recv
> > > > buffers
> > > > 
> > > > Signed-off-by: Mohammed Gamal   
> > > 
> > > Splitting these functions is not necessary  
> > 
> > How so? We need to send each message independently, and hence the
> > split
> > (see cover letter). Is there another way?
> 
> This is all that is needed.
> 
> 
> Subject: [PATCH] hv_netvsc: work around for gpadl teardown on older
> windows
>  server
> 
> On WS2012 the host ignores messages after vmbus channel is closed.
> Workaround this by doing what Windows does and send the teardown
> before close on older versions of NVSP protocol.
> 
> Reported-by: Mohammed Gamal 
> Fixes: 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split")
> Signed-off-by: Stephen Hemminger 
> ---
>  drivers/net/hyperv/netvsc.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/hyperv/netvsc.c
> b/drivers/net/hyperv/netvsc.c
> index 17e529af79dc..1a3df0eff42f 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -574,10 +574,17 @@ void netvsc_device_remove(struct hv_device
> *device)
>    */
>   netdev_dbg(ndev, "net device safe to remove\n");
>  
> + /* Workaround for older versions of Windows require that
> +  * buffer be revoked before channel is disabled
> +  */
> + if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_4)
> + netvsc_teardown_gpadl(device, net_device);
> +
>   /* Now, we can close the channel safely */
>   vmbus_close(device->channel);
>  
> - netvsc_teardown_gpadl(device, net_device);
> + if (net_device->nvsp_version >= NVSP_PROTOCOL_VERSION_4)
> + netvsc_teardown_gpadl(device, net_device);
>  
>   /* And dissassociate NAPI context from device */
>   for (i = 0; i < net_device->num_chn; i++)

I've tried a similar workaround before by calling
netvsc_teardown_gpadl() after netvsc_revoke_buf(), but before setting
net_device_ctx->nvdev to NULL and it caused the guest to hang when
trying to change MTU. 

Let me try that change and see if it behaves differently.


Re: [RFC PATCH 1/2] hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()

2018-02-01 Thread Mohammed Gamal
On Thu, 2018-02-01 at 09:37 +0100, Mohammed Gamal wrote:
> On Wed, 2018-01-31 at 15:01 -0800, Stephen Hemminger wrote:
> > On Wed, 31 Jan 2018 12:16:49 +0100
> > Mohammed Gamal  wrote:
> > 
> > > On Tue, 2018-01-30 at 11:29 -0800, Stephen Hemminger wrote:
> > > > On Tue, 23 Jan 2018 10:34:04 +0100
> > > > Mohammed Gamal  wrote:
> > > >   
> > > > > Split each of the functions into two for each of send/recv
> > > > > buffers
> > > > > 
> > > > > Signed-off-by: Mohammed Gamal   
> > > > 
> > > > Splitting these functions is not necessary  
> > > 
> > > How so? We need to send each message independently, and hence the
> > > split
> > > (see cover letter). Is there another way?
> > 
> > This is all that is needed.
> > 
> > 
> > Subject: [PATCH] hv_netvsc: work around for gpadl teardown on older
> > windows
> >  server
> > 
> > On WS2012 the host ignores messages after vmbus channel is closed.
> > Workaround this by doing what Windows does and send the teardown
> > before close on older versions of NVSP protocol.
> > 
> > Reported-by: Mohammed Gamal 
> > Fixes: 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split")
> > Signed-off-by: Stephen Hemminger 
> > ---
> >  drivers/net/hyperv/netvsc.c | 9 -
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/hyperv/netvsc.c
> > b/drivers/net/hyperv/netvsc.c
> > index 17e529af79dc..1a3df0eff42f 100644
> > --- a/drivers/net/hyperv/netvsc.c
> > +++ b/drivers/net/hyperv/netvsc.c
> > @@ -574,10 +574,17 @@ void netvsc_device_remove(struct hv_device
> > *device)
> >      */
> >     netdev_dbg(ndev, "net device safe to remove\n");
> >  
> > +   /* Workaround for older versions of Windows require that
> > +    * buffer be revoked before channel is disabled
> > +    */
> > +   if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_4)
> > +   netvsc_teardown_gpadl(device, net_device);
> > +
> >     /* Now, we can close the channel safely */
> >     vmbus_close(device->channel);
> >  
> > -   netvsc_teardown_gpadl(device, net_device);
> > +   if (net_device->nvsp_version >= NVSP_PROTOCOL_VERSION_4)
> > +   netvsc_teardown_gpadl(device, net_device);
> >  
> >     /* And dissassociate NAPI context from device */
> >     for (i = 0; i < net_device->num_chn; i++)
> 
> I've tried a similar workaround before by calling
> netvsc_teardown_gpadl() after netvsc_revoke_buf(), but before setting
> net_device_ctx->nvdev to NULL and it caused the guest to hang when
> trying to change MTU. 
> 
> Let me try that change and see if it behaves differently.

I tested the patch, but I've actually seen some unexpected behavior.

First, net_device->nvsp_version is actually NVSP_PROTOCOL_VERSION_5 on
both the Win2012 and Win2016 hosts I tested on, so the condition is
never executed.

Second, when doing the check instead as if (vmbus_proto_version <
VERSION_WIN10), I get the same behavior I described above, where the
guest hangs as the kernel waits indefinitely in vmbus_teardown_gpadl()
for a completion to be signaled. This is actually what led me to
propose splitting netvsc_revoke_buf() and netvsc_teardown_gpadl() in my
initial patchset, so that we keep the same order of messages and avoid
that indefinite wait.
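
For reference, the indefinite wait happens in vmbus_teardown_gpadl()
(abridged from drivers/hv/channel.c):

	init_completion(&info->waitevent);
	info->waiting_channel = channel;
	[...]
	ret = vmbus_post_msg(msg, sizeof(struct vmbus_channel_gpadl_teardown),
			     true);
	if (ret)
		goto post_msg_err;

	/* Blocks forever if the host never answers with GPADL_TORNDOWN,
	 * which is what we observe on pre-Win2016 hosts once the message
	 * order is wrong. */
	wait_for_completion(&info->waitevent);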


[PATCH net-next] hv_netvsc: Correct filter setting for multicast/broadcast

2018-03-09 Thread Mohammed Gamal
Commit 009f766 intended to filter multicast/broadcast; however, the
NDIS filter wasn't set properly in non-promiscuous modes, which
resulted in issues like DHCP timeouts.
This patch sets the filter flags correctly.

Fixes: 009f766 ("hv_netvsc: filter multicast/broadcast")

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/rndis_filter.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 8927c48..411a3ae 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -861,9 +861,9 @@ static void rndis_set_multicast(struct work_struct *w)
filter = NDIS_PACKET_TYPE_PROMISCUOUS;
} else {
if (flags & IFF_ALLMULTI)
-   flags |= NDIS_PACKET_TYPE_ALL_MULTICAST;
+   filter |= NDIS_PACKET_TYPE_ALL_MULTICAST;
if (flags & IFF_BROADCAST)
-   flags |= NDIS_PACKET_TYPE_BROADCAST;
+   filter |= NDIS_PACKET_TYPE_BROADCAST;
}
 
rndis_filter_set_packet_filter(rdev, filter);
-- 
1.8.3.1



[PATCH] hv_netvsc: Make sure out channel is fully opened on send

2018-03-13 Thread Mohammed Gamal
During high network traffic, changes to network interface parameters
such as the number of channels or the MTU can cause a kernel panic with
a NULL pointer dereference. This is due to netvsc_device_remove() being
called and deallocating the channel ring buffers, which can then be
accessed by netvsc_send_pkt() before they're reallocated by
netvsc_device_add().

The patch fixes this problem by checking the channel state and returning
-ENODEV if the channel is not yet opened. We also move the call to
hv_ringbuf_avail_percent(), which may access the uninitialized ring buffer.

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/netvsc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 0265d70..44a8358 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -757,7 +757,7 @@ static inline int netvsc_send_pkt(
struct netdev_queue *txq = netdev_get_tx_queue(ndev, packet->q_idx);
u64 req_id;
int ret;
-   u32 ring_avail = hv_ringbuf_avail_percent(&out_channel->outbound);
+   u32 ring_avail;
 
nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT;
if (skb)
@@ -773,7 +773,7 @@ static inline int netvsc_send_pkt(
 
req_id = (ulong)skb;
 
-   if (out_channel->rescind)
+   if (out_channel->rescind || out_channel->state != CHANNEL_OPENED_STATE)
return -ENODEV;
 
if (packet->page_buf_cnt) {
@@ -791,6 +791,7 @@ static inline int netvsc_send_pkt(
   
VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
}
 
+   ring_avail = hv_ringbuf_avail_percent(&out_channel->outbound);
if (ret == 0) {
atomic_inc_return(&nvchan->queue_sends);
 
-- 
1.8.3.1



Re: [PATCH] hv_netvsc: Make sure out channel is fully opened on send

2018-03-14 Thread Mohammed Gamal
On Tue, 2018-03-13 at 12:35 -0700, Stephen Hemminger wrote:
> On Tue, 13 Mar 2018 20:06:50 +0100
> Mohammed Gamal  wrote:
> 
> > Dring high network traffic changes to network interface parameters
> > such as number of channels or MTU can cause a kernel panic with a
> > NULL
> > pointer dereference. This is due to netvsc_device_remove() being
> > called and deallocating the channel ring buffers, which can then be
> > accessed by netvsc_send_pkt() before they're allocated on calling
> > netvsc_device_add()
> > 
> > The patch fixes this problem by checking the channel state and
> > returning
> > ENODEV if not yet opened. We also move the call to
> > hv_ringbuf_avail_percent()
> > which may access the uninitialized ring buffer.
> > 
> > Signed-off-by: Mohammed Gamal 
> > ---
> >  drivers/net/hyperv/netvsc.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/hyperv/netvsc.c
> > b/drivers/net/hyperv/netvsc.c
> > index 0265d70..44a8358 100644
> > --- a/drivers/net/hyperv/netvsc.c
> > +++ b/drivers/net/hyperv/netvsc.c
> > @@ -757,7 +757,7 @@ static inline int netvsc_send_pkt(
> >     struct netdev_queue *txq = netdev_get_tx_queue(ndev,
> > packet->q_idx);
> >     u64 req_id;
> >     int ret;
> > -   u32 ring_avail = hv_ringbuf_avail_percent(&out_channel-
> > >outbound);
> > +   u32 ring_avail;
> >  
> >     nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT;
> >     if (skb)
> > @@ -773,7 +773,7 @@ static inline int netvsc_send_pkt(
> >  
> >     req_id = (ulong)skb;
> >  
> > -   if (out_channel->rescind)
> > +   if (out_channel->rescind || out_channel->state !=
> > CHANNEL_OPENED_STATE)
> >     return -ENODEV;
> >  
> >     if (packet->page_buf_cnt) {
> > @@ -791,6 +791,7 @@ static inline int netvsc_send_pkt(
> >        VMBUS_DATA_PACKET_FLAG_COMP
> > LETION_REQUESTED);
> >     }
> >  
> > +   ring_avail = hv_ringbuf_avail_percent(&out_channel-
> > >outbound);
> >     if (ret == 0) {
> >     atomic_inc_return(&nvchan->queue_sends);
> >  
> 
> Thanks for your patch. Yes there are races with the current update
> logic. The root cause goes higher up in the flow; the send queues
> should
> be stopped before netvsc_device_remove is called. Solving it where
> you tried
> to is racy and not going to work reliably.
> 
> Network patches should go to net...@vger.kernel.org
> 
> You can't move the ring_avail check until after the vmbus_sendpacket
> because
> that will break the flow control logic.
> 
Why? I don't see ring_avail being used before that point.

> Instead, you should just move the avail_read check until just after
> the existing rescind
> check.
> 
> Also, you shouldn't need to check for OPENED_STATE, just rescind is
> enough.

That rarely mitigated the race. The channel->rescind flag is set on
vmbus exit (called on module unload) and when a rescind offer is
received from the host, which AFAICT doesn't happen on every call to
netvsc_device_remove(). So it's quite possible that the ring buffer is
accessed before it's allocated again on channel open, hence the check
for CHANNEL_OPENED_STATE, which is only set after all vmbus data is
initialized.
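
For illustration, the state transition referenced above happens only
at the very end of channel setup (a sketch of the tail of vmbus_open()
in drivers/hv/channel.c, abridged):

	/* ...ring buffers allocated, GPADL established, and the open
	 * request acknowledged by the host... */
	newchannel->state = CHANNEL_OPENED_STATE;
	return 0;

so checking for CHANNEL_OPENED_STATE closes the window between
vmbus_close() and a fully reinitialized channel.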

> 
> 
> 
> 


[PATCH 4/4] hv_netvsc: Pass net_device parameter to revoke and teardown functions

2018-04-05 Thread Mohammed Gamal
The callers of netvsc_revoke_*_buf() and netvsc_teardown_*_gpadl()
already have their net_device instances. Pass them as a parameter to
these functions instead of obtaining them from the netvsc_device struct
every time.

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/netvsc.c | 37 ++---
 1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index df92c2f..04f611e 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -110,9 +110,9 @@ static void free_netvsc_device_rcu(struct netvsc_device *nvdev)
 }
 
 static void netvsc_revoke_recv_buf(struct hv_device *device,
-  struct netvsc_device *net_device)
+  struct netvsc_device *net_device,
+  struct net_device *ndev)
 {
-   struct net_device *ndev = hv_get_drvdata(device);
struct nvsp_message *revoke_packet;
int ret;
 
@@ -160,9 +160,9 @@ static void netvsc_revoke_recv_buf(struct hv_device *device,
 }
 
 static void netvsc_revoke_send_buf(struct hv_device *device,
-  struct netvsc_device *net_device)
+  struct netvsc_device *net_device,
+  struct net_device *ndev)
 {
-   struct net_device *ndev = hv_get_drvdata(device);
struct nvsp_message *revoke_packet;
int ret;
 
@@ -211,9 +211,9 @@ static void netvsc_revoke_send_buf(struct hv_device *device,
 }
 
 static void netvsc_teardown_recv_gpadl(struct hv_device *device,
-  struct netvsc_device *net_device)
+  struct netvsc_device *net_device,
+  struct net_device *ndev)
 {
-   struct net_device *ndev = hv_get_drvdata(device);
int ret;
 
if (net_device->recv_buf_gpadl_handle) {
@@ -233,9 +233,9 @@ static void netvsc_teardown_recv_gpadl(struct hv_device *device,
 }
 
 static void netvsc_teardown_send_gpadl(struct hv_device *device,
-  struct netvsc_device *net_device)
+  struct netvsc_device *net_device,
+  struct net_device *ndev)
 {
-   struct net_device *ndev = hv_get_drvdata(device);
int ret;
 
if (net_device->send_buf_gpadl_handle) {
@@ -452,10 +452,10 @@ static int netvsc_init_buf(struct hv_device *device,
goto exit;
 
 cleanup:
-   netvsc_revoke_recv_buf(device, net_device);
-   netvsc_revoke_send_buf(device, net_device);
-   netvsc_teardown_recv_gpadl(device, net_device);
-   netvsc_teardown_send_gpadl(device, net_device);
+   netvsc_revoke_recv_buf(device, net_device, ndev);
+   netvsc_revoke_send_buf(device, net_device, ndev);
+   netvsc_teardown_recv_gpadl(device, net_device, ndev);
+   netvsc_teardown_send_gpadl(device, net_device, ndev);
 
 exit:
return ret;
@@ -474,7 +474,6 @@ static int negotiate_nvsp_ver(struct hv_device *device,
init_packet->hdr.msg_type = NVSP_MSG_TYPE_INIT;
init_packet->msg.init_msg.init.min_protocol_ver = nvsp_ver;
init_packet->msg.init_msg.init.max_protocol_ver = nvsp_ver;
-
trace_nvsp_send(ndev, init_packet);
 
/* Send the init request */
@@ -596,13 +595,13 @@ void netvsc_device_remove(struct hv_device *device)
 * Revoke receive buffer. If host is pre-Win2016 then tear down
 * receive buffer GPADL. Do the same for send buffer.
 */
-   netvsc_revoke_recv_buf(device, net_device);
+   netvsc_revoke_recv_buf(device, net_device, ndev);
if (vmbus_proto_version < VERSION_WIN10)
-   netvsc_teardown_recv_gpadl(device, net_device);
+   netvsc_teardown_recv_gpadl(device, net_device, ndev);
 
-   netvsc_revoke_send_buf(device, net_device);
+   netvsc_revoke_send_buf(device, net_device, ndev);
if (vmbus_proto_version < VERSION_WIN10)
-   netvsc_teardown_send_gpadl(device, net_device);
+   netvsc_teardown_send_gpadl(device, net_device, ndev);
 
RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
 
@@ -624,8 +623,8 @@ void netvsc_device_remove(struct hv_device *device)
 * here after VMBus is closed.
*/
if (vmbus_proto_version >= VERSION_WIN10) {
-   netvsc_teardown_recv_gpadl(device, net_device);
-   netvsc_teardown_send_gpadl(device, net_device);
+   netvsc_teardown_recv_gpadl(device, net_device, ndev);
+   netvsc_teardown_send_gpadl(device, net_device, ndev);
}
 
/* Release all resources */
-- 
1.8.3.1



[PATCH 2/4] hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()

2018-04-05 Thread Mohammed Gamal
Split each of the functions into two, one for the send buffer and one
for the receive buffer. This will be needed in order to implement a
fine-grained messaging sequence to the host, so that we accommodate the
requirements of different Windows versions.

Fixes: 0ef58b0a05c12 ("hv_netvsc: change GPAD teardown order on older versions")

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/netvsc.c | 46 +
 1 file changed, 34 insertions(+), 12 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index d65b7fc..f4df5de 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -109,11 +109,11 @@ static void free_netvsc_device_rcu(struct netvsc_device *nvdev)
call_rcu(&nvdev->rcu, free_netvsc_device);
 }
 
-static void netvsc_revoke_buf(struct hv_device *device,
- struct netvsc_device *net_device)
+static void netvsc_revoke_recv_buf(struct hv_device *device,
+  struct netvsc_device *net_device)
 {
-   struct nvsp_message *revoke_packet;
struct net_device *ndev = hv_get_drvdata(device);
+   struct nvsp_message *revoke_packet;
int ret;
 
/*
@@ -157,6 +157,14 @@ static void netvsc_revoke_buf(struct hv_device *device,
}
net_device->recv_section_cnt = 0;
}
+}
+
+static void netvsc_revoke_send_buf(struct hv_device *device,
+  struct netvsc_device *net_device)
+{
+   struct net_device *ndev = hv_get_drvdata(device);
+   struct nvsp_message *revoke_packet;
+   int ret;
 
/* Deal with the send buffer we may have setup.
 * If we got a  send section size, it means we received a
@@ -202,8 +210,8 @@ static void netvsc_revoke_buf(struct hv_device *device,
}
 }
 
-static void netvsc_teardown_gpadl(struct hv_device *device,
- struct netvsc_device *net_device)
+static void netvsc_teardown_recv_gpadl(struct hv_device *device,
+  struct netvsc_device *net_device)
 {
struct net_device *ndev = hv_get_drvdata(device);
int ret;
@@ -222,6 +230,13 @@ static void netvsc_teardown_gpadl(struct hv_device *device,
}
net_device->recv_buf_gpadl_handle = 0;
}
+}
+
+static void netvsc_teardown_send_gpadl(struct hv_device *device,
+  struct netvsc_device *net_device)
+{
+   struct net_device *ndev = hv_get_drvdata(device);
+   int ret;
 
if (net_device->send_buf_gpadl_handle) {
ret = vmbus_teardown_gpadl(device->channel,
@@ -437,8 +452,10 @@ static int netvsc_init_buf(struct hv_device *device,
goto exit;
 
 cleanup:
-   netvsc_revoke_buf(device, net_device);
-   netvsc_teardown_gpadl(device, net_device);
+   netvsc_revoke_recv_buf(device, net_device);
+   netvsc_revoke_send_buf(device, net_device);
+   netvsc_teardown_recv_gpadl(device, net_device);
+   netvsc_teardown_send_gpadl(device, net_device);
 
 exit:
return ret;
@@ -575,7 +592,8 @@ void netvsc_device_remove(struct hv_device *device)
= rtnl_dereference(net_device_ctx->nvdev);
int i;
 
-   netvsc_revoke_buf(device, net_device);
+   netvsc_revoke_recv_buf(device, net_device);
+   netvsc_revoke_send_buf(device, net_device);
 
RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
 
@@ -590,14 +608,18 @@ void netvsc_device_remove(struct hv_device *device)
netdev_dbg(ndev, "net device safe to remove\n");
 
/* older versions require that buffer be revoked before close */
-   if (vmbus_proto_version < VERSION_WIN10)
-   netvsc_teardown_gpadl(device, net_device);
+   if (vmbus_proto_version < VERSION_WIN10) {
+   netvsc_teardown_recv_gpadl(device, net_device);
+   netvsc_teardown_send_gpadl(device, net_device);
+   }
 
/* Now, we can close the channel safely */
vmbus_close(device->channel);
 
-   if (vmbus_proto_version >= VERSION_WIN10)
-   netvsc_teardown_gpadl(device, net_device);
+   if (vmbus_proto_version >= VERSION_WIN10) {
+   netvsc_teardown_recv_gpadl(device, net_device);
+   netvsc_teardown_send_gpadl(device, net_device);
+   }
 
/* Release all resources */
free_netvsc_device_rcu(net_device);
-- 
1.8.3.1



[PATCH 1/4] hv_netvsc: Use Windows version instead of NVSP version on GPAD teardown

2018-04-05 Thread Mohammed Gamal
When changing network interface settings, Windows guests
older than WS2016 can no longer shut down. This was addressed
by commit 0ef58b0a05c12 ("hv_netvsc: change GPAD teardown order
on older versions"), however the issue also occurs on WS2012
guests that share NVSP protocol versions with WS2016 guests.
Hence we use the Windows version directly to differentiate them.

Fixes: 0ef58b0a05c12 ("hv_netvsc: change GPAD teardown order on older versions")

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/netvsc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index c9910c3..d65b7fc 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -590,13 +590,13 @@ void netvsc_device_remove(struct hv_device *device)
netdev_dbg(ndev, "net device safe to remove\n");
 
/* older versions require that buffer be revoked before close */
-   if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_4)
+   if (vmbus_proto_version < VERSION_WIN10)
netvsc_teardown_gpadl(device, net_device);
 
/* Now, we can close the channel safely */
vmbus_close(device->channel);
 
-   if (net_device->nvsp_version >= NVSP_PROTOCOL_VERSION_4)
+   if (vmbus_proto_version >= VERSION_WIN10)
netvsc_teardown_gpadl(device, net_device);
 
/* Release all resources */
-- 
1.8.3.1



[PATCH 3/4] hv_netvsc: Ensure correct teardown message sequence order

2018-04-05 Thread Mohammed Gamal
Prior to commit 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split")
the call sequence in netvsc_device_remove() was as follows (as
implemented in netvsc_destroy_buf()):
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Teardown receive buffer GPADL
3- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
4- Teardown send buffer GPADL
5- Close vmbus

This didn't work for WS2016 hosts. Commit 0cf737808ae7
("hv_netvsc: netvsc_teardown_gpadl() split") rearranged the
teardown sequence as follows:
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
3- Close vmbus
4- Teardown receive buffer GPADL
5- Teardown send buffer GPADL

That worked well for WS2016 hosts, but it prevented guests on older hosts from
shutting down after changing network settings. Commit 0ef58b0a05c1
("hv_netvsc: change GPAD teardown order on older versions") ensured the
following message sequence for older hosts
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
3- Teardown receive buffer GPADL
4- Teardown send buffer GPADL
5- Close vmbus

However, with this sequence, calling `ip link set eth0 mtu 1000` hangs and the
process becomes uninterruptible. On further analysis, it turns out that, on
tearing down the receive buffer GPADL, the kernel waits indefinitely
in vmbus_teardown_gpadl() for a completion to be signaled.

Here is a snippet of where this occurs:
int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
{
struct vmbus_channel_gpadl_teardown *msg;
struct vmbus_channel_msginfo *info;
unsigned long flags;
int ret;

info = kmalloc(sizeof(*info) +
   sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL);
if (!info)
return -ENOMEM;

init_completion(&info->waitevent);
info->waiting_channel = channel;
[]
ret = vmbus_post_msg(msg, sizeof(struct vmbus_channel_gpadl_teardown),
 true);

if (ret)
goto post_msg_err;

wait_for_completion(&info->waitevent);
[]
}

The completion is signaled from vmbus_ongpadl_torndown(), which gets called
when the corresponding message is received from the host; apparently that
never happens in this case.
This patch works around the issue by restoring the first message sequence
mentioned above for older hosts.

Fixes: 0ef58b0a05c1 ("hv_netvsc: change GPAD teardown order on older versions")

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/netvsc.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index f4df5de..df92c2f 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -592,8 +592,17 @@ void netvsc_device_remove(struct hv_device *device)
= rtnl_dereference(net_device_ctx->nvdev);
int i;
 
+   /*
+* Revoke receive buffer. If host is pre-Win2016 then tear down
+* receive buffer GPADL. Do the same for send buffer.
+*/
netvsc_revoke_recv_buf(device, net_device);
+   if (vmbus_proto_version < VERSION_WIN10)
+   netvsc_teardown_recv_gpadl(device, net_device);
+
netvsc_revoke_send_buf(device, net_device);
+   if (vmbus_proto_version < VERSION_WIN10)
+   netvsc_teardown_send_gpadl(device, net_device);
 
RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
 
@@ -607,15 +616,13 @@ void netvsc_device_remove(struct hv_device *device)
 */
netdev_dbg(ndev, "net device safe to remove\n");
 
-   /* older versions require that buffer be revoked before close */
-   if (vmbus_proto_version < VERSION_WIN10) {
-   netvsc_teardown_recv_gpadl(device, net_device);
-   netvsc_teardown_send_gpadl(device, net_device);
-   }
-
/* Now, we can close the channel safely */
vmbus_close(device->channel);
 
+   /*
+* If host is Win2016 or higher then we do the GPADL tear down
+* here after VMBus is closed.
+   */
if (vmbus_proto_version >= VERSION_WIN10) {
netvsc_teardown_recv_gpadl(device, net_device);
netvsc_teardown_send_gpadl(device, net_device);
-- 
1.8.3.1



[PATCH 0/4] hv_netvsc: Fix shutdown issues on older Windows hosts

2018-04-05 Thread Mohammed Gamal
Guests running on WS2012 hosts would not shut down when changing network
interface settings (e.g. number of channels, MTU, etc.).

This patch series addresses the shutdown issues we encountered with WS2012
hosts. It's essentially a rework of the series sent in
https://lkml.org/lkml/2018/1/23/111 on top of the latest upstream.

Fixes: 0ef58b0a05c1 ("hv_netvsc: change GPAD teardown order on older versions")

Mohammed Gamal (4):
  hv_netvsc: Use Windows version instead of NVSP version on GPAD
teardown
  hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()
  hv_netvsc: Ensure correct teardown message sequence order
  hv_netvsc: Pass net_device parameter to revoke and teardown functions

 drivers/net/hyperv/netvsc.c | 60 +
 1 file changed, 44 insertions(+), 16 deletions(-)

-- 
1.8.3.1



Re: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write

2019-03-13 Thread Mohammed Gamal
On Tue, 2019-03-12 at 18:02 +, Haiyang Zhang wrote:
>  
>  
> > -Original Message-
> > From: Mohammed Gamal 
> > Sent: Thursday, March 7, 2019 1:32 PM
> > To: Michael Kelley ; linux-hyp...@vger.kernel.org;
> > kimbrownkd 
> > Cc: Sasha Levin ; Dexuan Cui ;
> > Stephen Hemminger ; Long Li ;
> > KY Srinivasan ; Haiyang Zhang ; vkuznets ;
> > linux-ker...@vger.kernel.org
> > Subject: Re: [PATCH] hyper-v: Check for ring buffer in
> > hv_get_bytes_to_read/write
> >
> > On Thu, 2019-03-07 at 17:33 +, Michael Kelley wrote:
> > > From: Mohammed Gamal  Sent: Thursday, March 7,
> > > 2019 8:36 AM
> > > >
> > > > This patch adds a check for the presence of the ring buffer in
> > > > hv_get_bytes_to_read/write() to avoid possible NULL pointer
> > > > dereferences.
> > > > If the ring buffer is not yet allocated, return 0 bytes to be
> > > > read/written.
> > > >
> > > > The root cause is that code that accesses the ring buffer
> > > > including hv_get_bytes_to_read/write() could be vulnerable
> > > > to the race condition discussed in
> > > > https://lkml.org/lkml/2018/10/18/779
> > > >
> > > > This race is being addressed by the patch series by Kimberly
> > > > Brown in https://lkml.org/lkml/2019/2/21/1236 which is not
> > > > final yet
> > > >
> > > > Signed-off-by: Mohammed Gamal 
> > >
> > > Could you elaborate on the code paths where
> > > hv_get_bytes_to_read/write() could be called when the ring buffer
> > > isn't yet allocated?  My sense is that Kim Brown's patch will
> address
> > > all of the code paths that involved sysfs access from outside the
> > > driver.  And within a driver, the ring buffer should never be
> accessed
> > > unless it is already allocated.  Is there another code path we're
> not
> > > aware of?  I'm wondering if these changes are really needed once
> Kim
> > > Brown's patch is finished.
> > >
> > > Michael
> >
> > I've seen one instance of the race in the netvsc driver when
> > running traffic through it with iperf3 while continuously changing
> > the channel settings.
> >
> > The following code path deallocates the ring buffer:
> > netvsc_set_channels() -> netvsc_detach() ->
> > rndis_filter_device_remove() -> netvsc_device_remove() ->
> > vmbus_close() -> vmbus_free_ring() -> hv_ringbuffer_cleanup().
> >
> > netvsc_send_pkt() -> hv_get_bytes_to_write() might get called
> > concurrently after vmbus_close() and before vmbus_open() returns
> > and sets up the new ring buffer.
> >
> > The race is fairly hard to reproduce on recent upstream kernels,
> > but I still managed to reproduce it.
>  
> Looking at the code from netvsc_detach() –
>  netif_tx_disable(ndev) is called before
> rndis_filter_device_remove(hdev, nvdev).
> So there should be no call to netvsc_send_pkt() after detaching.
> What’s the crash stack trace?
>  
> static int netvsc_detach(struct net_device *ndev,
>  struct netvsc_device *nvdev)
> {
>     struct net_device_context *ndev_ctx = netdev_priv(ndev);
>     struct hv_device *hdev = ndev_ctx->device_ctx;
>     int ret;
>  
>     /* Don't try continuing to try and setup sub channels */
>     if (cancel_work_sync(&nvdev->subchan_work))
>     nvdev->num_chn = 1;
>  
>     /* If device was up (receiving) then shutdown */
>     if (netif_running(ndev)) {
>     netif_tx_disable(ndev);
>  
> [...]

Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

2021-01-18 Thread Mohammed Gamal
On Fri, 2021-01-15 at 16:08 -0800, Jim Mattson wrote:
> On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal 
> wrote:
> > 
> > This patch exposes allow_smaller_maxphyaddr to the user as a module
> > parameter.
> > 
> > Since smaller physical address spaces are only supported on VMX,
> > the parameter
> > is only exposed in the kvm_intel module.
> > Modifications to VMX page fault and EPT violation handling will
> > depend on whether
> > that parameter is enabled.
> > 
> > Also disable support by default, and let the user decide if they
> > want to enable
> > it.
> > 
> > Signed-off-by: Mohammed Gamal 
> > ---
> >  arch/x86/kvm/vmx/vmx.c | 15 ++-
> >  arch/x86/kvm/vmx/vmx.h |  3 +++
> >  arch/x86/kvm/x86.c |  2 +-
> >  3 files changed, 10 insertions(+), 10 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 819c185adf09..dc778c7b5a06 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -129,6 +129,9 @@ static bool __read_mostly
> > enable_preemption_timer = 1;
> >  module_param_named(preemption_timer, enable_preemption_timer,
> > bool, S_IRUGO);
> >  #endif
> > 
> > +extern bool __read_mostly allow_smaller_maxphyaddr;
> 
> Since this variable is in the kvm module rather than the kvm_intel
> module, its current setting is preserved across "rmmod kvm_intel;
> modprobe kvm_intel." That is, if set to true, it doesn't revert to
> false after "rmmod kvm_intel." Is that the intended behavior?
> 

IIRC, this is because this setting was indeed not intended to be
VMX-specific. But since AMD has an issue with PTE accessed bits being
set by hardware, and thus we can't yet enable this feature there, it
might make sense to move the variable to the kvm_intel module for now.

Paolo, what do you think?
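
For illustration only, moving the knob into kvm_intel would amount to
something like the following hypothetical sketch (not the merged code):

/* In arch/x86/kvm/vmx/vmx.c (sketch): own the variable in kvm_intel
 * instead of importing it from kvm.ko, so its value resets when
 * kvm_intel is reloaded. */
bool __read_mostly allow_smaller_maxphyaddr;
module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);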




[PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write

2019-03-07 Thread Mohammed Gamal
This patch adds a check for the presence of the ring buffer in
hv_get_bytes_to_read/write() to avoid possible NULL pointer dereferences.
If the ring buffer is not yet allocated, return 0 bytes to be read/written.

The root cause is that code that accesses the ring buffer including
hv_get_bytes_to_read/write() could be vulnerable to the race condition
discussed in https://lkml.org/lkml/2018/10/18/779

This race is being addressed by the patch series by Kimberly Brown in
https://lkml.org/lkml/2019/2/21/1236 which is not final yet

Signed-off-by: Mohammed Gamal 
---
 include/linux/hyperv.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 64698ec8f2ac..7b2f566250b2 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -148,6 +148,9 @@ static inline u32 hv_get_bytes_to_read(const struct hv_ring_buffer_info *rbi)
 {
u32 read_loc, write_loc, dsize, read;
 
+   if (!rbi->ring_buffer)
+   return 0;
+
dsize = rbi->ring_datasize;
read_loc = rbi->ring_buffer->read_index;
write_loc = READ_ONCE(rbi->ring_buffer->write_index);
@@ -162,6 +165,9 @@ static inline u32 hv_get_bytes_to_write(const struct hv_ring_buffer_info *rbi)
 {
u32 read_loc, write_loc, dsize, write;
 
+   if (!rbi->ring_buffer)
+   return 0;
+
dsize = rbi->ring_datasize;
read_loc = READ_ONCE(rbi->ring_buffer->read_index);
write_loc = rbi->ring_buffer->write_index;
-- 
2.18.1



Re: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write

2019-03-07 Thread Mohammed Gamal
On Thu, 2019-03-07 at 17:33 +, Michael Kelley wrote:
> From: Mohammed Gamal  Sent: Thursday, March 7,
> 2019 8:36 AM
> > 
> > This patch adds a check for the presence of the ring buffer in
> > hv_get_bytes_to_read/write() to avoid possible NULL pointer
> > dereferences.
> > If the ring buffer is not yet allocated, return 0 bytes to be
> > read/written.
> > 
> > The root cause is that code that accesses the ring buffer including
> > hv_get_bytes_to_read/write() could be vulnerable to the race
> > condition
> > discussed in https://lkml.org/lkml/2018/10/18/779>;
> > 
> > This race is being addressed by the patch series by Kimberly Brown
> > in
> > https://lkml.org/lkml/2019/2/21/1236 which is not final yet
> > 
> > Signed-off-by: Mohammed Gamal 
> 
> Could you elaborate on the code paths where
> hv_get_bytes_to_read/write() could be called when the ring buffer
> isn't yet allocated?  My sense is that Kim Brown's patch will address
> all of the code paths that involved sysfs access from outside the
> driver.  And within a driver, the ring buffer should never be
> accessed
> unless it is already allocated.  Is there another code path we're not
> aware of?  I'm wondering if these changes are really needed once
> Kim Brown's patch is finished.
> 
> Michael

I've seen one instance of the race in the netvsc driver when running
traffic through it with iperf3 while continuously changing the channel
settings.

The following code path deallocates the ring buffer:
netvsc_set_channels() -> netvsc_detach() ->
rndis_filter_device_remove() -> netvsc_device_remove() -> vmbus_close()
-> vmbus_free_ring() -> hv_ringbuffer_cleanup().

netvsc_send_pkt() -> hv_get_bytes_to_write() might get called
concurrently after vmbus_close() and before vmbus_open() returns and
sets up the new ring buffer. 

The race is fairly hard to reproduce on recent upstream kernels, but I
still managed to reproduce it.


[PATCH v2] hv_netvsc: Fix net device attach on older Windows hosts

2018-05-09 Thread Mohammed Gamal
On older Windows hosts the net_device instance is returned to the
caller of rndis_filter_device_add() without having the presence bit
set first. This would cause any subsequent network device operations
(e.g. MTU change, channel change) to fail with -ENODEV after the
device is detached once.

Instead of returning the device instance directly, we take the exit
path, where we call netif_device_attach().

Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/rndis_filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 6b127be..e7ca5b5 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1288,7 +1288,7 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
   rndis_device->link_state ? "down" : "up");
 
if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_5)
-   return net_device;
+   goto out;
 
rndis_filter_query_link_speed(rndis_device, net_device);
 
-- 
1.8.3.1



Re: [PATCH] hv_netvsc: Make sure out channel is fully opened on send

2018-03-15 Thread Mohammed Gamal
On Wed, 2018-03-14 at 10:22 +0100, Mohammed Gamal wrote:
> On Tue, 2018-03-13 at 12:35 -0700, Stephen Hemminger wrote:
> > On Tue, 13 Mar 2018 20:06:50 +0100
> > Mohammed Gamal  wrote:
> > 
> > > Dring high network traffic changes to network interface
> > > parameters
> > > such as number of channels or MTU can cause a kernel panic with a
> > > NULL
> > > pointer dereference. This is due to netvsc_device_remove() being
> > > called and deallocating the channel ring buffers, which can then
> > > be
> > > accessed by netvsc_send_pkt() before they're allocated on calling
> > > netvsc_device_add()
> > > 
> > > The patch fixes this problem by checking the channel state and
> > > returning
> > > ENODEV if not yet opened. We also move the call to
> > > hv_ringbuf_avail_percent()
> > > which may access the uninitialized ring buffer.
> > > 
> > > Signed-off-by: Mohammed Gamal 
> > > ---
> > >  drivers/net/hyperv/netvsc.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/net/hyperv/netvsc.c
> > > b/drivers/net/hyperv/netvsc.c
> > > index 0265d70..44a8358 100644
> > > --- a/drivers/net/hyperv/netvsc.c
> > > +++ b/drivers/net/hyperv/netvsc.c
> > > @@ -757,7 +757,7 @@ static inline int netvsc_send_pkt(
> > >   struct netdev_queue *txq = netdev_get_tx_queue(ndev,
> > > packet->q_idx);
> > >   u64 req_id;
> > >   int ret;
> > > - u32 ring_avail = hv_ringbuf_avail_percent(&out_channel-
> > > > outbound);
> > > 
> > > + u32 ring_avail;
> > >  
> > >   nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT;
> > >   if (skb)
> > > @@ -773,7 +773,7 @@ static inline int netvsc_send_pkt(
> > >  
> > >   req_id = (ulong)skb;
> > >  
> > > - if (out_channel->rescind)
> > > + if (out_channel->rescind || out_channel->state !=
> > > CHANNEL_OPENED_STATE)
> > >   return -ENODEV;
> > >  
> > >   if (packet->page_buf_cnt) {
> > > @@ -791,6 +791,7 @@ static inline int netvsc_send_pkt(
> > >      VMBUS_DATA_PACKET_FLAG_CO
> > > MP
> > > LETION_REQUESTED);
> > >   }
> > >  
> > > + ring_avail = hv_ringbuf_avail_percent(&out_channel-
> > > > outbound);
> > > 
> > >   if (ret == 0) {
> > >   atomic_inc_return(&nvchan->queue_sends);
> > >  
> > 
> > Thanks for your patch. Yes there are races with the current update
> > logic. The root cause goes higher up in the flow; the send queues
> > should
> > be stopped before netvsc_device_remove is called. Solving it where
> > you tried
> > to is racy and not going to work reliably.
> > 
> > Network patches should go to net...@vger.kernel.org
> > 
> > You can't move the ring_avail check until after the
> > vmbus_sendpacket
> > because
> > that will break the flow control logic.
> > 
> 
> Why? I don't see ring_avail being used before that point.

Ah, stupid me. vmbus_sendpacket() will write to the ring buffer, which
means the ring_avail value will be different from the expected one.

> 
> > Instead, you should just move the avail_read check until just after
> > the existing rescind
> > check.
> > 
> > Also, you shouldn't need to check for OPENED_STATE, just rescind is
> > enough.
> 
> That rarely mitigated the race. channel->rescind flag is set on vmbus
> exit - called on module unload - and when a rescind offer is received
> from the host, which AFAICT doesn't happen on every call to
> netvsc_device_remove, so it's quite possible that the ringbuffer is
> accessed before it's allocated again on channel open and hence the
> check for OPENED_STAT - which is only set after all vmbus data is
> initialized.
> 

Perhaps I haven't been clear enough. The NULL pointer dereference
happens in the call to hv_ringbuf_avail_percent(), which is used to
calculate ring_avail.

So we need to stop the queues before calling it if the channel's ring
buffers haven't been allocated yet, but OTOH we should only stop the
queues based upon the value of ring_avail, so this leads to a
chicken-and-egg situation.

Is my observation here correct? Please correct me if I am wrong,
Stephen.
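
For reference, Stephen's suggestion upthread amounts to something like
this in netvsc_send_pkt() (a sketch, assuming the reordering he
described):

	if (out_channel->rescind)
		return -ENODEV;

	/* Read the ring state before sending, so the flow-control
	 * decision further down still reflects the pre-send fill
	 * level. */
	ring_avail = hv_ringbuf_avail_percent(&out_channel->outbound);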


[PATCH] hv_netvsc: Fix net device attach on older Windows hosts

2018-05-08 Thread Mohammed Gamal
On older Windows hosts the net_device instance is returned to the
caller of rndis_filter_device_add() without having the presence bit
set first. This would cause any subsequent network device operations
(e.g. MTU change, channel change) to fail with -ENODEV after the
device is detached once.

Explicitly call netif_device_attach() before returning the net_device
instance, so that the presence bit is set.

Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/rndis_filter.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 6b127be..09a3c1d 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1287,8 +1287,10 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
   rndis_device->hw_mac_adr,
   rndis_device->link_state ? "down" : "up");
 
-   if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_5)
+   if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_5) {
+   netif_device_attach(net);
return net_device;
+   }
 
rndis_filter_query_link_speed(rndis_device, net_device);
 
-- 
1.8.3.1



Re: [PATCH] hv_netvsc: Fix net device attach on older Windows hosts

2018-05-08 Thread Mohammed Gamal
On Tue, 2018-05-08 at 11:13 -0700, Stephen Hemminger wrote:
> On Tue,  8 May 2018 19:40:47 +0200
> Mohammed Gamal  wrote:
> 
> > On older windows hosts the net_device instance is returned to
> > the caller of rndis_filter_device_add() without having the presence
> > bit set first. This would cause any subsequent calls to network
> > device
> > operations (e.g. MTU change, channel change) to fail after the
> > device
> > is detached once, returning -ENODEV.
> > 
> > Make sure we explicitly call netif_device_attach() before returning
> > the net_device instance to make sure the presence bit is set
> > 
> > Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> > 
> > Signed-off-by: Mohammed Gamal 
> > ---
> >  drivers/net/hyperv/rndis_filter.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/hyperv/rndis_filter.c
> > b/drivers/net/hyperv/rndis_filter.c
> > index 6b127be..09a3c1d 100644
> > --- a/drivers/net/hyperv/rndis_filter.c
> > +++ b/drivers/net/hyperv/rndis_filter.c
> > @@ -1287,8 +1287,10 @@ struct netvsc_device
> > *rndis_filter_device_add(struct hv_device *dev,
> >        rndis_device->hw_mac_adr,
> >        rndis_device->link_state ? "down" : "up");
> >  
> > -   if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_5)
> > +   if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_5) {
> > +   netif_device_attach(net);
> >     return net_device;
> > +   }
> 
> Yes, this looks right, but it might be easier to use goto existing
> exit
> path.
> 

I was just not sure if we should set max_chn and num_chn here. I will
modify the patch and resend.

> diff --git a/drivers/net/hyperv/rndis_filter.c
> b/drivers/net/hyperv/rndis_filter.c
> index 3b6dbacaf77d..ed941c5a0be9 100644
> --- a/drivers/net/hyperv/rndis_filter.c
> +++ b/drivers/net/hyperv/rndis_filter.c
> @@ -1316,7 +1316,7 @@ struct netvsc_device
> *rndis_filter_device_add(struct hv_device *dev,
>    rndis_device->link_state ? "down" : "up");
>  
> if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_5)
> -   return net_device;
> +   goto out;
>  
> rndis_filter_query_link_speed(rndis_device, net_device);
> 


[PATCH v2 05/11] KVM: x86: update exception bitmap on CPUID changes

2020-06-19 Thread Mohammed Gamal
From: Paolo Bonzini 

Allow vendor code to observe changes to MAXPHYADDR and start/stop
intercepting page faults.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/cpuid.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 8a294f9747aa..ea5bbf2153bb 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -128,6 +128,8 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
kvm_mmu_reset_context(vcpu);
 
kvm_pmu_refresh(vcpu);
+   kvm_x86_ops.update_exception_bitmap(vcpu);
+
return 0;
 }
 
-- 
2.26.2



[PATCH v2 08/11] KVM: VMX: optimize #PF injection when MAXPHYADDR does not match

2020-06-19 Thread Mohammed Gamal
From: Paolo Bonzini 

Ignore non-present page faults, since those cannot have reserved
bits set.

When running access.flat with "-cpu Haswell,phys-bits=36", the
number of trapped page faults goes down from 8872644 to 3978948.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/vmx/vmx.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f38cbadcb3a5..8daf78b2d4cb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4358,6 +4358,16 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmx->pt_desc.guest.output_mask = 0x7F;
vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
}
+
+   /*
+* If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
+* between guest and host.  In that case we only care about present
+* faults.
+*/
+   if (enable_ept) {
+   vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, PFERR_PRESENT_MASK);
+   vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, PFERR_PRESENT_MASK);
+   }
 }
 
 static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
-- 
2.26.2
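
The PAGE_FAULT_ERROR_CODE_MASK/MATCH mechanics this patch relies on can be
summarized as follows (a sketch paraphrasing the architectural rule;
pf_causes_vmexit() is an illustrative name, not a KVM function):

#include <stdbool.h>
#include <stdint.h>

#define PFERR_PRESENT_MASK (1U << 0)

/* A guest #PF causes a VM exit iff ((PFEC & MASK) == MATCH) has the
 * same value as the PF bit in the exception bitmap. */
static bool pf_causes_vmexit(uint32_t pfec, uint32_t mask,
			     uint32_t match, bool eb_pf)
{
	return ((pfec & mask) == match) == eb_pf;
}

With EB.PF = 1 and MASK = MATCH = PFERR_PRESENT_MASK (as set above), only
faults with the present bit set cause an exit; non-present faults, which
cannot have reserved bits set, stay in the guest.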



[PATCH v2 07/11] KVM: VMX: Add guest physical address check in EPT violation and misconfig

2020-06-19 Thread Mohammed Gamal
Check the guest physical address against the guest's maximum physical
address. If the address exceeds the maximum (i.e. has reserved bits set),
inject a guest page fault with PFERR_RSVD_MASK set.

This has to be done both in the EPT violation and page fault paths, as
there are complications in both cases with respect to the computation
of the correct error code.

For EPT violations, unfortunately the only possibility is to emulate,
because the access type in the exit qualification might refer to an
access to a paging structure, rather than to the access performed by
the program.

Trapping page faults instead is needed in order to correct the error code,
but the access type can be obtained from the original error code and
passed to gva_to_gpa.  The corrections required in the error code are
subtle. For example, imagine that a PTE for a supervisor page has a reserved
bit set.  On a supervisor-mode access, the EPT violation path would trigger.
However, on a user-mode access, the processor will not notice the reserved
bit and not include PFERR_RSVD_MASK in the error code.

Co-developed-by: Mohammed Gamal 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/vmx/vmx.c | 24 +---
 arch/x86/kvm/vmx/vmx.h |  3 ++-
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 46d522ee5cb1..f38cbadcb3a5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4793,9 +4793,15 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
 
if (is_page_fault(intr_info)) {
cr2 = vmx_get_exit_qual(vcpu);
-   /* EPT won't cause page fault directly */
-   WARN_ON_ONCE(!vcpu->arch.apf.host_apf_flags && enable_ept);
-   return kvm_handle_page_fault(vcpu, error_code, cr2, NULL, 0);
+   if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
+   /*
+* EPT will cause page fault only if we need to
+* detect illegal GPAs.
+*/
+   kvm_fixup_and_inject_pf_error(vcpu, cr2, error_code);
+   return 1;
+   } else
+   return kvm_handle_page_fault(vcpu, error_code, cr2, NULL, 0);
}
 
ex_no = intr_info & INTR_INFO_VECTOR_MASK;
@@ -5311,6 +5317,18 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
   PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
 
vcpu->arch.exit_qualification = exit_qualification;
+
+   /*
+* Check that the GPA doesn't exceed physical memory limits, as that is
+* a guest page fault.  We have to emulate the instruction here, because
+* if the illegal address is that of a paging structure, then
+* EPT_VIOLATION_ACC_WRITE bit is set.  Alternatively, if supported we
+* would also use advanced VM-exit information for EPT violations to
+* reconstruct the page fault error code.
+*/
+   if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
+   return kvm_emulate_instruction(vcpu, 0);
+
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 5e2da15fe94f..950ecf237558 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -11,6 +11,7 @@
 #include "kvm_cache_regs.h"
 #include "ops.h"
 #include "vmcs.h"
+#include "cpuid.h"
 
 extern const u32 vmx_msr_index[];
 
@@ -554,7 +555,7 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
 
 static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
 {
-   return !enable_ept;
+   return !enable_ept || cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
 }
 
 void dump_vmcs(void);
-- 
2.26.2



[PATCH v2 10/11] KVM: SVM: Add guest physical address check in NPF/PF interception

2020-06-19 Thread Mohammed Gamal
Check the guest physical address against the guest's maximum physical
address. If the address exceeds the maximum (i.e. has reserved bits set),
inject a guest page fault with PFERR_RSVD_MASK set.

Similar to VMX, this has to be done both in the NPF and page fault
interceptions, as there are complications in both cases with respect
to the computation of the correct error code.

For NPF interceptions, unfortunately the only possibility is to emulate,
because the access type in the exit qualification might refer to an
access to a paging structure, rather than to the access performed by
the program.

Trapping page faults instead is needed in order to correct the error code,
but the access type can be obtained from the original error code and
passed to gva_to_gpa.  The corrections required in the error code are
subtle. For example, imagine that a PTE for a supervisor page has a reserved
bit set.  On a supervisor-mode access, the EPT violation path would trigger.
However, on a user-mode access, the processor will not notice the reserved
bit and not include PFERR_RSVD_MASK in the error code.

CC: Tom Lendacky 
CC: Babu Moger 
Signed-off-by: Mohammed Gamal 
---
 arch/x86/kvm/svm/svm.c | 11 +++
 arch/x86/kvm/svm/svm.h |  2 +-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 05412818027d..ec3224a2e7c2 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1702,6 +1702,12 @@ static int pf_interception(struct vcpu_svm *svm)
u64 fault_address = __sme_clr(svm->vmcb->control.exit_info_2);
u64 error_code = svm->vmcb->control.exit_info_1;
 
+   if (npt_enabled && !svm->vcpu.arch.apf.host_apf_flags) {
+   kvm_fixup_and_inject_pf_error(&svm->vcpu,
+   fault_address, error_code);
+   return 1;
+   }
+
return kvm_handle_page_fault(&svm->vcpu, error_code, fault_address,
static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
svm->vmcb->control.insn_bytes : NULL,
@@ -1714,6 +1720,11 @@ static int npf_interception(struct vcpu_svm *svm)
u64 error_code = svm->vmcb->control.exit_info_1;
 
trace_kvm_page_fault(fault_address, error_code);
+
+   /* Check if guest gpa doesn't exceed physical memory limits */
+   if (unlikely(kvm_mmu_is_illegal_gpa(&svm->vcpu, fault_address)))
+   return kvm_emulate_instruction(&svm->vcpu, 0);
+
return kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code,
static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
svm->vmcb->control.insn_bytes : NULL,
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 2b7469f3db0e..12b502e36dbd 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -348,7 +348,7 @@ static inline bool gif_set(struct vcpu_svm *svm)
 
 static inline bool svm_need_pf_intercept(struct vcpu_svm *svm)
 {
-return !npt_enabled;
+return !npt_enabled || cpuid_maxphyaddr(&svm->vcpu) < boot_cpu_data.x86_phys_bits;
 }
 
 /* svm.c */
-- 
2.26.2



[PATCH v2 09/11] KVM: SVM: introduce svm_need_pf_intercept

2020-06-19 Thread Mohammed Gamal
CC: Tom Lendacky 
CC: Babu Moger 
Signed-off-by: Mohammed Gamal 
---
 arch/x86/kvm/svm/svm.c | 8 
 arch/x86/kvm/svm/svm.h | 6 ++
 2 files changed, 14 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 94108e6cc6da..05412818027d 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1087,6 +1087,9 @@ static void init_vmcb(struct vcpu_svm *svm)
}
svm->asid_generation = 0;
 
+   if (svm_need_pf_intercept(svm))
+   set_exception_intercept(svm, PF_VECTOR);
+
svm->nested.vmcb = 0;
svm->vcpu.arch.hflags = 0;
 
@@ -1633,6 +1636,11 @@ static void update_exception_bitmap(struct kvm_vcpu *vcpu)
 
clr_exception_intercept(svm, BP_VECTOR);
 
+   if (svm_need_pf_intercept(svm))
+   set_exception_intercept(svm, PF_VECTOR);
+   else
+   clr_exception_intercept(svm, PF_VECTOR);
+
if (vcpu->guest_debug & KVM_GUESTDBG_ENABLE) {
if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP)
set_exception_intercept(svm, BP_VECTOR);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 6ac4c00a5d82..2b7469f3db0e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -19,6 +19,7 @@
 #include 
 
 #include 
+#include "cpuid.h"
 
 static const u32 host_save_user_msrs[] = {
 #ifdef CONFIG_X86_64
@@ -345,6 +346,11 @@ static inline bool gif_set(struct vcpu_svm *svm)
return !!(svm->vcpu.arch.hflags & HF_GIF_MASK);
 }
 
+static inline bool svm_need_pf_intercept(struct vcpu_svm *svm)
+{
+return !npt_enabled;
+}
+
 /* svm.c */
 #define MSR_INVALID0xU
 
-- 
2.26.2



[PATCH v2 11/11] KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR support configurable

2020-06-19 Thread Mohammed Gamal
The reason behind including this patch is unexpected behaviour we see
with NPT vmexit handling on AMD processors.

With the previous patch ("KVM: SVM: Add guest physical address check in
NPF/PF interception") we see the following error multiple times in
the 'access' test in kvm-unit-tests:

test pte.p pte.36 pde.p: FAIL: pte 221 expected 201
Dump mapping: address: 0x1234
--L4: 24c3027
--L3: 24c4027
--L2: 24c5021
--L1: 100221

This shows that the PTE's accessed bit is apparently being set by
the CPU hardware before the NPF vmexit. This is handled entirely by
hardware and cannot be fixed in software.

This patch introduces a workaround: a boolean variable
'allow_smaller_maxphyaddr', set individually by the VMX and SVM init
routines. On VMX it is always set to true; on SVM it is only set to
true when NPT is not enabled.
We also add a new capability KVM_CAP_SMALLER_MAXPHYADDR which
allows userspace to query if the underlying architecture would
support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly
(e.g. qemu can decide if it would ignore the -cpu ..,phys-bits=X)

CC: Tom Lendacky 
CC: Babu Moger 
Signed-off-by: Mohammed Gamal 
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/svm/svm.c  | 15 +++
 arch/x86/kvm/vmx/vmx.c  |  7 +++
 arch/x86/kvm/x86.c  |  6 ++
 include/uapi/linux/kvm.h|  1 +
 5 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7ebdb43632e0..b25f7497307d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1304,7 +1304,7 @@ struct kvm_arch_async_pf {
 };
 
 extern u64 __read_mostly host_efer;
-
+extern bool __read_mostly allow_smaller_maxphyaddr;
 extern struct kvm_x86_ops kvm_x86_ops;
 
 #define __KVM_HAVE_ARCH_VM_ALLOC
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index ec3224a2e7c2..1b8880b89e9f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -924,6 +924,21 @@ static __init int svm_hardware_setup(void)
 
svm_set_cpu_caps();
 
+   /*
+* It seems that on AMD processors PTE's accessed bit is
+* being set by the CPU hardware before the NPF vmexit.
+* This is not expected behaviour and our tests fail because
+* of it.
+* A workaround here is to disable support for
+* GUEST_MAXPHYADDR < HOST_MAXPHYADDR if NPT is enabled.
+* In this case userspace can know if there is support using
+* KVM_CAP_SMALLER_MAXPHYADDR extension and decide how to handle
+* it
+* If future AMD CPU models change the behaviour described above,
+* this variable can be changed accordingly
+*/
+   allow_smaller_maxphyaddr = !npt_enabled;
+
return 0;
 
 err:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8daf78b2d4cb..fe0ca39c0887 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8316,6 +8316,13 @@ static int __init vmx_init(void)
 #endif
vmx_check_vmcs12_offsets();
 
+   /*
+* Intel processors don't have problems with
+* GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
+* it for VMX by default
+*/
+   allow_smaller_maxphyaddr = true;
+
return 0;
 }
 module_init(vmx_init);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 84f1f0084d2e..5bca6d6d24e9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -187,6 +187,9 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
 u64 __read_mostly host_efer;
 EXPORT_SYMBOL_GPL(host_efer);
 
+bool __read_mostly allow_smaller_maxphyaddr;
+EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
+
 static u64 __read_mostly host_xss;
 u64 __read_mostly supported_xss;
 EXPORT_SYMBOL_GPL(supported_xss);
@@ -3533,6 +3536,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_HYPERV_ENLIGHTENED_VMCS:
r = kvm_x86_ops.nested_ops->enable_evmcs != NULL;
break;
+   case KVM_CAP_SMALLER_MAXPHYADDR:
+   r = (int) allow_smaller_maxphyaddr;
+   break;
default:
break;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4fdf30316582..68cd3a0af9bb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1031,6 +1031,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_SECURE_GUEST 181
 #define KVM_CAP_HALT_POLL 182
 #define KVM_CAP_ASYNC_PF_INT 183
+#define KVM_CAP_SMALLER_MAXPHYADDR 184
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.26.2
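
As an illustration, userspace would probe the new capability roughly as
follows (a sketch that assumes a uapi header already carrying the
KVM_CAP_SMALLER_MAXPHYADDR definition from this patch; error handling
elided):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	int r;

	if (kvm < 0)
		return 1;

	/* Returns the allow_smaller_maxphyaddr value set above. */
	r = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_SMALLER_MAXPHYADDR);
	printf("guest MAXPHYADDR < host MAXPHYADDR: %ssupported\n",
	       r > 0 ? "" : "not ");
	return 0;
}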



[PATCH v2 02/11] KVM: x86: mmu: Move translate_gpa() to mmu.c

2020-06-19 Thread Mohammed Gamal
There is also no point in it being inline, since it is always called
through function pointers, so remove that as well.

Signed-off-by: Mohammed Gamal 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_host.h | 6 --
 arch/x86/kvm/mmu/mmu.c  | 6 ++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8998e97457f..bc0fb116cc5c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1551,12 +1551,6 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd, bool skip_tlb_flush,
 
 void kvm_configure_mmu(bool enable_tdp, int tdp_page_level);
 
-static inline gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
- struct x86_exception *exception)
-{
-   return gpa;
-}
-
 static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
 {
struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fdd05c233308..ee113fc1f1bf 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -515,6 +515,12 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte)
return likely(kvm_gen == spte_gen);
 }
 
+static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
+  struct x86_exception *exception)
+{
+return gpa;
+}
+
 /*
  * Sets the shadow PTE masks used by the MMU.
  *
-- 
2.26.2



[PATCH v2 06/11] KVM: VMX: introduce vmx_need_pf_intercept

2020-06-19 Thread Mohammed Gamal
From: Paolo Bonzini 

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/vmx/nested.c | 28 +---
 arch/x86/kvm/vmx/vmx.c|  2 +-
 arch/x86/kvm/vmx/vmx.h|  5 +
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d1af20b050a8..328411919518 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2433,22 +2433,28 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 
/*
 * Whether page-faults are trapped is determined by a combination of
-* 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.
-* If enable_ept, L0 doesn't care about page faults and we should
-* set all of these to L1's desires. However, if !enable_ept, L0 does
-* care about (at least some) page faults, and because it is not easy
-* (if at all possible?) to merge L0 and L1's desires, we simply ask
-* to exit on each and every L2 page fault. This is done by setting
-* MASK=MATCH=0 and (see below) EB.PF=1.
+* 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.  If L0
+* doesn't care about page faults then we should set all of these to
+* L1's desires. However, if L0 does care about (some) page faults, it
+* is not easy (if at all possible?) to merge L0 and L1's desires, we
+* simply ask to exit on each and every L2 page fault. This is done by
+* setting MASK=MATCH=0 and (see below) EB.PF=1.
 * Note that below we don't need special code to set EB.PF beyond the
 * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
 * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
 * !enable_ept, EB.PF is 1, so the "or" will always be 1.
 */
-   vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
-   enable_ept ? vmcs12->page_fault_error_code_mask : 0);
-   vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
-   enable_ept ? vmcs12->page_fault_error_code_match : 0);
+   if (vmx_need_pf_intercept(&vmx->vcpu)) {
+   /*
+* TODO: if both L0 and L1 need the same MASK and MATCH,
+* go ahead and use it?
+*/
+   vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
+   vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);
+   } else {
+   vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, vmcs12->page_fault_error_code_mask);
+   vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, vmcs12->page_fault_error_code_match);
+   }
 
if (cpu_has_vmx_apicv()) {
vmcs_write64(EOI_EXIT_BITMAP0, vmcs12->eoi_exit_bitmap0);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f82c42ac87f9..46d522ee5cb1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -783,7 +783,7 @@ void update_exception_bitmap(struct kvm_vcpu *vcpu)
eb |= 1u << BP_VECTOR;
if (to_vmx(vcpu)->rmode.vm86_active)
eb = ~0;
-   if (enable_ept)
+   if (!vmx_need_pf_intercept(vcpu))
eb &= ~(1u << PF_VECTOR);
 
/* When we are running a nested L2 guest and L1 specified for it a
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 8a83b5edc820..5e2da15fe94f 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -552,6 +552,11 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
 }
 
+static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
+{
+   return !enable_ept;
+}
+
 void dump_vmcs(void);
 
 #endif /* __KVM_X86_VMX_H */
-- 
2.26.2



[PATCH v2 03/11] KVM: x86: mmu: Add guest physical address check in translate_gpa()

2020-06-19 Thread Mohammed Gamal
In case of running a guest with 4-level page tables on a 5-level page
table host, it can happen that the guest uses a physical address with
reserved bits set, but the host won't see that and trap it.

Hence, we need to check page faults' physical addresses against the
guest's maximum physical memory and, if that is exceeded, add the
PFERR_RSVD_MASK bit to the page fault's error code.

Also make sure the error code isn't overwritten by the page table walker.

Signed-off-by: Mohammed Gamal 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ee113fc1f1bf..10409b76b2d8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -518,6 +518,12 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte)
 static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
   struct x86_exception *exception)
 {
+   /* Check if guest physical address doesn't exceed guest maximum */
+   if (kvm_mmu_is_illegal_gpa(vcpu, gpa)) {
+   exception->error_code |= PFERR_RSVD_MASK;
+   return UNMAPPED_GVA;
+   }
+
 return gpa;
 }
 
-- 
2.26.2
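
As a concrete illustration of the 4-level-guest-on-5-level-host case (a
standalone sketch; the numbers are made up for the example):

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	/* Guest MAXPHYADDR = 48 (4-level paging), host MAXPHYADDR = 52
	 * (5-level paging): bit 51 is reserved from the guest's point
	 * of view but a perfectly valid address bit for the host walk. */
	unsigned long long gpa = 1ULL << 51;
	int guest_maxphyaddr = 48;

	/* The same check translate_gpa() now performs: */
	bool illegal = gpa >= (1ULL << guest_maxphyaddr);

	/* Prints 1: the guest must see PFERR_RSVD_MASK for this GPA. */
	printf("gpa %#llx illegal for the guest: %d\n", gpa, illegal);
	return 0;
}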



[PATCH v2 04/11] KVM: x86: rename update_bp_intercept to update_exception_bitmap

2020-06-19 Thread Mohammed Gamal
From: Paolo Bonzini 

We would like to introduce a callback to update the #PF intercept
when CPUID changes.  Just reuse update_bp_intercept since VMX is
already using update_exception_bitmap instead of a bespoke function.

While at it, remove an unnecessary assignment in the SVM version,
which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug)
and has nothing to do with the exception bitmap.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/svm/svm.c  | 7 +++
 arch/x86/kvm/vmx/vmx.c  | 2 +-
 arch/x86/kvm/x86.c  | 2 +-
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bc0fb116cc5c..7ebdb43632e0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1098,7 +1098,7 @@ struct kvm_x86_ops {
void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu);
void (*vcpu_put)(struct kvm_vcpu *vcpu);
 
-   void (*update_bp_intercept)(struct kvm_vcpu *vcpu);
+   void (*update_exception_bitmap)(struct kvm_vcpu *vcpu);
int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 8ccfa4197d9c..94108e6cc6da 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1627,7 +1627,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
mark_dirty(svm->vmcb, VMCB_SEG);
 }
 
-static void update_bp_intercept(struct kvm_vcpu *vcpu)
+static void update_exception_bitmap(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -1636,8 +1636,7 @@ static void update_bp_intercept(struct kvm_vcpu *vcpu)
if (vcpu->guest_debug & KVM_GUESTDBG_ENABLE) {
if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP)
set_exception_intercept(svm, BP_VECTOR);
-   } else
-   vcpu->guest_debug = 0;
+   }
 }
 
 static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd)
@@ -3989,7 +3988,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_blocking = svm_vcpu_blocking,
.vcpu_unblocking = svm_vcpu_unblocking,
 
-   .update_bp_intercept = update_bp_intercept,
+   .update_exception_bitmap = update_exception_bitmap,
.get_msr_feature = svm_get_msr_feature,
.get_msr = svm_get_msr,
.set_msr = svm_set_msr,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 36c771728c8c..f82c42ac87f9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7881,7 +7881,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.vcpu_load = vmx_vcpu_load,
.vcpu_put = vmx_vcpu_put,
 
-   .update_bp_intercept = update_exception_bitmap,
+   .update_exception_bitmap = update_exception_bitmap,
.get_msr_feature = vmx_get_msr_feature,
.get_msr = vmx_get_msr,
.set_msr = vmx_set_msr,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ac8642e890b1..84f1f0084d2e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9275,7 +9275,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 */
kvm_set_rflags(vcpu, rflags);
 
-   kvm_x86_ops.update_bp_intercept(vcpu);
+   kvm_x86_ops.update_exception_bitmap(vcpu);
 
r = 0;
 
-- 
2.26.2



[PATCH v2 01/11] KVM: x86: Add helper functions for illegal GPA checking and page fault injection

2020-06-19 Thread Mohammed Gamal
This patch adds two helper functions that will be used to support virtualizing
MAXPHYADDR in both kvm-intel.ko and kvm.ko.

kvm_fixup_and_inject_pf_error() injects a page fault for a user-specified GVA,
while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU address limits.

Signed-off-by: Mohammed Gamal 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu.h |  6 ++
 arch/x86/kvm/x86.c | 21 +
 arch/x86/kvm/x86.h |  1 +
 3 files changed, 28 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 0ad06bfe2c2c..555237dfb91c 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -4,6 +4,7 @@
 
 #include 
 #include "kvm_cache_regs.h"
+#include "cpuid.h"
 
 #define PT64_PT_BITS 9
 #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
@@ -158,6 +159,11 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu)
return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
 }
 
+static inline bool kvm_mmu_is_illegal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+return (gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu)));
+}
+
 /*
  * Check if a given access (described through the I/D, W/R and U/S bits of a
  * page fault error code pfec) causes a permission fault with the given PTE
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 00c88c2f34e4..ac8642e890b1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10693,6 +10693,27 @@ u64 kvm_spec_ctrl_valid_bits(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_spec_ctrl_valid_bits);
 
+void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code)
+{
+   struct x86_exception fault;
+
+   if (!(error_code & PFERR_PRESENT_MASK) ||
+   vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, error_code, &fault) != UNMAPPED_GVA) {
+   /*
+* If vcpu->arch.walk_mmu->gva_to_gpa succeeded, the page
+* tables probably do not match the TLB.  Just proceed
+* with the error code that the processor gave.
+*/
+   fault.vector = PF_VECTOR;
+   fault.error_code_valid = true;
+   fault.error_code = error_code;
+   fault.nested_page_fault = false;
+   fault.address = gva;
+   }
+   vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault);
+}
+EXPORT_SYMBOL_GPL(kvm_fixup_and_inject_pf_error);
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 6eb62e97e59f..239ae0f3e40b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -272,6 +272,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
 bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
  int page_num);
 bool kvm_vector_hashing_enabled(void);
+void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
 int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int emulation_type, void *insn, int insn_len);
 fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
-- 
2.26.2
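
For reference, the page-fault error-code bits these helpers manipulate are
the following (equivalent to the definitions in
arch/x86/include/asm/kvm_host.h, reproduced here for readability):

#define PFERR_PRESENT_MASK	(1U << 0)	/* fault on a present page */
#define PFERR_WRITE_MASK	(1U << 1)	/* access was a write */
#define PFERR_USER_MASK		(1U << 2)	/* user-mode access */
#define PFERR_RSVD_MASK		(1U << 3)	/* reserved PTE bit was set */
#define PFERR_FETCH_MASK	(1U << 4)	/* access was an instruction fetch */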



[PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

2020-06-19 Thread Mohammed Gamal
When EPT/NPT is enabled, KVM does not really look at guest physical
address size. Address bits above maximum physical memory size are reserved.
Because KVM does not look at these guest physical addresses, it currently
effectively supports guest physical address sizes equal to the host.

This can be a problem when having a mixed setup of machines with 5-level page
tables and machines with 4-level page tables, as live migration can change
MAXPHYADDR while the guest runs, which can theoretically introduce bugs.

In this patch series we add checks on guest physical addresses in EPT
violation/misconfig and NPF vmexits and if needed inject the proper
page faults in the guest.

A more subtle issue is when the host MAXPHYADDR is larger than that of the
guest. Page faults caused by reserved bits on the guest won't cause an EPT
violation/NPF and hence we also check guest MAXPHYADDR and add PFERR_RSVD_MASK
error code to the page fault if needed.

The last 3 patches (i.e. SVM bits and patch 11) are not intended for
immediate inclusion and probably need more discussion.
We've been noticing some unexpected behavior in handling NPF vmexits
on AMD CPUs (see individual patches for details), and thus we are
proposing a workaround (see last patch) that adds a capability that
userspace can use to decide how to deal with hosts that might have
issues supporting guest MAXPHYADDR < host MAXPHYADDR.


Mohammed Gamal (7):
  KVM: x86: Add helper functions for illegal GPA checking and page fault
injection
  KVM: x86: mmu: Move translate_gpa() to mmu.c
  KVM: x86: mmu: Add guest physical address check in translate_gpa()
  KVM: VMX: Add guest physical address check in EPT violation and
misconfig
  KVM: SVM: introduce svm_need_pf_intercept
  KVM: SVM: Add guest physical address check in NPF/PF interception
  KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR support
configurable

Paolo Bonzini (4):
  KVM: x86: rename update_bp_intercept to update_exception_bitmap
  KVM: x86: update exception bitmap on CPUID changes
  KVM: VMX: introduce vmx_need_pf_intercept
  KVM: VMX: optimize #PF injection when MAXPHYADDR does not match

 arch/x86/include/asm/kvm_host.h | 10 ++--
 arch/x86/kvm/cpuid.c|  2 ++
 arch/x86/kvm/mmu.h  |  6 +
 arch/x86/kvm/mmu/mmu.c  | 12 +
 arch/x86/kvm/svm/svm.c  | 41 +++---
 arch/x86/kvm/svm/svm.h  |  6 +
 arch/x86/kvm/vmx/nested.c   | 28 
 arch/x86/kvm/vmx/vmx.c  | 45 +
 arch/x86/kvm/vmx/vmx.h  |  6 +
 arch/x86/kvm/x86.c  | 29 -
 arch/x86/kvm/x86.h  |  1 +
 include/uapi/linux/kvm.h|  1 +
 12 files changed, 158 insertions(+), 29 deletions(-)

-- 
2.26.2



Re: [PATCH v2 01/11] KVM: x86: Add helper functions for illegal GPA checking and page fault injection

2020-06-22 Thread Mohammed Gamal
On Mon, 2020-06-22 at 12:44 +0800, Yuan Yao wrote:
> On Fri, Jun 19, 2020 at 05:39:15PM +0200, Mohammed Gamal wrote:
> > This patch adds two helper functions that will be used to support
> > virtualizing
> > MAXPHYADDR in both kvm-intel.ko and kvm.ko.
> > 
> > kvm_fixup_and_inject_pf_error() injects a page fault for a user-
> > specified GVA,
> > while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU
> > address limits.
> > 
> > Signed-off-by: Mohammed Gamal 
> > Signed-off-by: Paolo Bonzini 
> > ---
> >  arch/x86/kvm/mmu.h |  6 ++
> >  arch/x86/kvm/x86.c | 21 +
> >  arch/x86/kvm/x86.h |  1 +
> >  3 files changed, 28 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 0ad06bfe2c2c..555237dfb91c 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -4,6 +4,7 @@
> >  
> >  #include 
> >  #include "kvm_cache_regs.h"
> > +#include "cpuid.h"
> >  
> >  #define PT64_PT_BITS 9
> >  #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
> > @@ -158,6 +159,11 @@ static inline bool is_write_protection(struct
> > kvm_vcpu *vcpu)
> > return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
> >  }
> >  
> > +static inline bool kvm_mmu_is_illegal_gpa(struct kvm_vcpu *vcpu,
> > gpa_t gpa)
> > +{
> > +return (gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu)));
> > +}
> > +
> >  /*
> >   * Check if a given access (described through the I/D, W/R and U/S
> > bits of a
> >   * page fault error code pfec) causes a permission fault with the
> > given PTE
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 00c88c2f34e4..ac8642e890b1 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10693,6 +10693,27 @@ u64 kvm_spec_ctrl_valid_bits(struct
> > kvm_vcpu *vcpu)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_spec_ctrl_valid_bits);
> >  
> > +void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t
> > gva, u16 error_code)
> > +{
> > +   struct x86_exception fault;
> > +
> > +   if (!(error_code & PFERR_PRESENT_MASK) ||
> > +   vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, error_code,
> > &fault) != UNMAPPED_GVA) {
> > +   /*
> > +* If vcpu->arch.walk_mmu->gva_to_gpa succeeded, the
> > page
> > +* tables probably do not match the TLB.  Just proceed
> > +* with the error code that the processor gave.
> > +*/
> > +   fault.vector = PF_VECTOR;
> > +   fault.error_code_valid = true;
> > +   fault.error_code = error_code;
> > +   fault.nested_page_fault = false;
> > +   fault.address = gva;
> > +   }
> > +   vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault);
> 
> Should this "vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault)"
> inside the last brace?
> Otherwise an uninitialized fault variable will be passed to the
> walk_mmu->inject_page_fault.

Good catch. You're right. Will fix it in v3

> 
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_fixup_and_inject_pf_error);
> > +
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > index 6eb62e97e59f..239ae0f3e40b 100644
> > --- a/arch/x86/kvm/x86.h
> > +++ b/arch/x86/kvm/x86.h
> > @@ -272,6 +272,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32
> > msr, u64 *pdata);
> >  bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu,
> > gfn_t gfn,
> >   int page_num);
> >  bool kvm_vector_hashing_enabled(void);
> > +void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t
> > gva, u16 error_code);
> >  int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t
> > cr2_or_gpa,
> > int emulation_type, void *insn, int
> > insn_len);
> >  fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
> > -- 
> > 2.26.2
> > 



Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

2020-06-22 Thread Mohammed Gamal
On Fri, 2020-06-19 at 16:52 -0500, Tom Lendacky wrote:
> On 6/19/20 10:39 AM, Mohammed Gamal wrote:
> > When EPT/NPT is enabled, KVM does not really look at guest physical
> > address size. Address bits above maximum physical memory size are
> > reserved.
> > Because KVM does not look at these guest physical addresses, it
> > currently
> > effectively supports guest physical address sizes equal to the
> > host.
> > 
> > This can be problem when having a mixed setup of machines with 5-
> > level page
> > tables and machines with 4-level page tables, as live migration can
> > change
> > MAXPHYADDR while the guest runs, which can theoretically introduce
> > bugs.
> > 
> > In this patch series we add checks on guest physical addresses in
> > EPT
> > violation/misconfig and NPF vmexits and if needed inject the proper
> > page faults in the guest.
> > 
> > A more subtle issue is when the host MAXPHYADDR is larger than that
> > of the
> > guest. Page faults caused by reserved bits on the guest won't cause
> > an EPT
> > violation/NPF and hence we also check guest MAXPHYADDR and add
> > PFERR_RSVD_MASK
> > error code to the page fault if needed.
> 
> I'm probably missing something here, but I'm confused by this
> statement. 
> Is this for a case where a page has been marked not present and the
> guest 
> has also set what it believes are reserved bits? Then when the page
> is 
> accessed, the guest sees a page fault without the error code for
> reserved 
> bits? If so, my understanding is that is architecturally correct. P=0
> is 
> considered higher priority than other page faults, at least on AMD.
> So if 
> you have a P=0 and other issues exist within the PTE, AMD will report
> the 
> P=0 fault and that's it.
> 
> The priority of other page fault conditions when P=1 is not defined
> and I 
> don't think we guarantee that you would get all error codes on
> fault. 
> Software is always expected to address the page fault and retry, and
> it 
> may get another page fault when it does, with a different error
> code. 
> Assuming the other errors are addressed, eventually the reserved
> bits 
> would cause an NPF and that could be detected by the HV and handled 
> appropriately.
> 
> > The last 3 patches (i.e. SVM bits and patch 11) are not intended
> > for
> > immediate inclusion and probably need more discussion.
> > We've been noticing some unexpected behavior in handling NPF
> > vmexits
> > on AMD CPUs (see individual patches for details), and thus we are
> > proposing a workaround (see last patch) that adds a capability that
> > userspace can use to decide how to deal with hosts that might have
> > issues supporting guest MAXPHYADDR < host MAXPHYADDR.
> 
> Also, something to consider. On AMD, when memory encryption is
> enabled 
> (via the SYS_CFG MSR), a guest can actually have a larger MAXPHYADDR
> than 
> the host. How do these patches all play into that?

Well, the patches definitely don't address that case. It's assumed that a
guest VM's MAXPHYADDR <= host MAXPHYADDR, and hence we handle the case
where a guest's physical address space is smaller and try to trap
faults that may go unnoticed by the host.

My question is in the case of guest MAXPHYADDR > host MAXPHYADDR, do we
expect somehow that there might be guest physical addresses that
contain what the host could see as reserved bits? And how'd the host
handle that?

Thanks,
Mohammed

> 
> Thanks,
> Tom
> 
> > 
> > Mohammed Gamal (7):
> >KVM: x86: Add helper functions for illegal GPA checking and page
> > fault
> >  injection
> >KVM: x86: mmu: Move translate_gpa() to mmu.c
> >KVM: x86: mmu: Add guest physical address check in
> > translate_gpa()
> >KVM: VMX: Add guest physical address check in EPT violation and
> >  misconfig
> >KVM: SVM: introduce svm_need_pf_intercept
> >KVM: SVM: Add guest physical address check in NPF/PF
> > interception
> >KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR
> > support
> >  configurable
> > 
> > Paolo Bonzini (4):
> >KVM: x86: rename update_bp_intercept to update_exception_bitmap
> >KVM: x86: update exception bitmap on CPUID changes
> >KVM: VMX: introduce vmx_need_pf_intercept
> >KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
> > 
> >   arch/x86/include/asm/kvm_host.h | 10 ++--
> >   arch/x86/kvm/cpuid.c|  2 ++
> >   arch/x86/kvm/mmu.h  |  6 +
> >   a

[PATCH RESEND] hv: clocksource: Add notrace attribute to read_hv_sched_clock_*() functions

2020-10-05 Thread Mohammed Gamal
When selecting function_graph tracer with the command:
 # echo function_graph > /sys/kernel/debug/tracing/current_tracer

The kernel crashes with the following stack trace:

[69703.122389] BUG: stack guard page was hit at 1056545c (stack is fa3f8fed..05d39503)
[69703.122403] kernel stack overflow (double-fault):  [#1] SMP PTI
[69703.122413] CPU: 0 PID: 16982 Comm: bash Kdump: loaded Not tainted 4.18.0-236.el8.x86_64 #1
[69703.122420] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
[69703.122433] RIP: 0010:prepare_ftrace_return+0xa/0x110
[69703.122458] Code: 05 00 0f 0b 48 c7 c7 10 ca 69 ae 0f b6 f0 e8 4b 52 0c 00 31 c0 eb ca 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 41 54 <53> 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 48 85
[69703.122467] RSP: 0018:bd6d01118000 EFLAGS: 00010086
[69703.122476] RAX:  RBX:  RCX: 0003
[69703.122484] RDX:  RSI: bd6d011180d8 RDI: adce7550
[69703.122491] RBP: bd6d01118018 R08:  R09: 9d4b09266000
[69703.122498] R10: 9d4b0fc04540 R11: 9d4b0fc20a00 R12: 9d4b6e42aa90
[69703.122506] R13: 9d4b0fc20ab8 R14: 03e8 R15: bd6d0111837c
[69703.122514] FS:  7fd5f2588740() GS:9d4b6e40() knlGS:
[69703.122521] CS:  0010 DS:  ES:  CR0: 80050033
[69703.122528] CR2: bd6d01117ff8 CR3: 565d8001 CR4: 003606f0
[69703.122538] DR0:  DR1:  DR2: 
[69703.122545] DR3:  DR6: fffe0ff0 DR7: 0400
[69703.122552] Call Trace:
[69703.122568]  ftrace_graph_caller+0x6b/0xa0
[69703.122589]  ? read_hv_sched_clock_tsc+0x5/0x20
[69703.122599]  read_hv_sched_clock_tsc+0x5/0x20
[69703.122611]  sched_clock+0x5/0x10
[69703.122621]  sched_clock_local+0x12/0x80
[69703.122631]  sched_clock_cpu+0x8c/0xb0
[69703.122644]  trace_clock_global+0x21/0x90
[69703.122655]  ring_buffer_lock_reserve+0x100/0x3c0
[69703.122671]  trace_buffer_lock_reserve+0x16/0x50
[69703.122683]  __trace_graph_entry+0x28/0x90
[69703.122695]  trace_graph_entry+0xfd/0x1a0
[69703.122705]  ? read_hv_clock_tsc_cs+0x10/0x10
[69703.122714]  ? sched_clock+0x5/0x10
[69703.122723]  prepare_ftrace_return+0x99/0x110
[69703.122734]  ? read_hv_clock_tsc_cs+0x10/0x10
[69703.122743]  ? sched_clock+0x5/0x10
[...]

Setting the notrace attribute for read_hv_sched_clock_msr() and
read_hv_sched_clock_tsc() fixes it.

Fixes: bd00cd52d5be ("clocksource/drivers/hyperv: Add Hyper-V specific sched clock function")
Suggested-by: Vitaly Kuznetsov 
Signed-off-by: Mohammed Gamal 
---
 drivers/clocksource/hyperv_timer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index 09aa44cb8a91d..ba04cb381cd3f 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -341,7 +341,7 @@ static u64 notrace read_hv_clock_tsc_cs(struct clocksource *arg)
return read_hv_clock_tsc();
 }
 
-static u64 read_hv_sched_clock_tsc(void)
+static u64 notrace read_hv_sched_clock_tsc(void)
 {
return (read_hv_clock_tsc() - hv_sched_clock_offset) *
(NSEC_PER_SEC / HV_CLOCK_HZ);
@@ -404,7 +404,7 @@ static u64 notrace read_hv_clock_msr_cs(struct clocksource *arg)
return read_hv_clock_msr();
 }
 
-static u64 read_hv_sched_clock_msr(void)
+static u64 notrace read_hv_sched_clock_msr(void)
 {
return (read_hv_clock_msr() - hv_sched_clock_offset) *
(NSEC_PER_SEC / HV_CLOCK_HZ);
-- 
2.26.2
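
The crash is a recursion: sched_clock() calls the traced
read_hv_sched_clock_tsc(), whose graph-tracer entry hook calls
trace_clock_global(), which calls sched_clock() again, and so on until the
stack overflows. For reference, notrace expands (modulo configuration
options; simplified from include/linux/compiler_types.h) to the attribute
that keeps the compiler from instrumenting the function for ftrace:

/* Simplified; the exact definition depends on compiler and config. */
#define notrace __attribute__((__no_instrument_function__))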



Re: [PATCH RESEND] hv: clocksource: Add notrace attribute to read_hv_sched_clock_*() functions

2020-10-05 Thread Mohammed Gamal
On Mon, 2020-10-05 at 13:47 +0200, Mohammed Gamal wrote:
> When selecting function_graph tracer with the command:
>  # echo function_graph > /sys/kernel/debug/tracing/current_tracer
> 
> The kernel crashes with the following stack trace:
> 
> [69703.122389] BUG: stack guard page was hit at 1056545c
> (stack is fa3f8fed..05d39503)
> [69703.122403] kernel stack overflow (double-fault):  [#1] SMP
> PTI
> [69703.122413] CPU: 0 PID: 16982 Comm: bash Kdump: loaded Not tainted
> 4.18.0-236.el8.x86_64 #1
> [69703.122420] Hardware name: Microsoft Corporation Virtual
> Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
> [69703.122433] RIP: 0010repare_ftrace_return+0xa/0x110
> [69703.122458] Code: 05 00 0f 0b 48 c7 c7 10 ca 69 ae 0f b6 f0 e8 4b
> 52 0c 00 31 c0 eb ca 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41
> 55 41 54 <53> 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31
> c0 48 85
> [69703.122467] RSP: 0018:bd6d01118000 EFLAGS: 00010086
> [69703.122476] RAX:  RBX:  RCX:
> 0003
> [69703.122484] RDX:  RSI: bd6d011180d8 RDI:
> adce7550
> [69703.122491] RBP: bd6d01118018 R08:  R09:
> 9d4b09266000
> [69703.122498] R10: 9d4b0fc04540 R11: 9d4b0fc20a00 R12:
> 9d4b6e42aa90
> [69703.122506] R13: 9d4b0fc20ab8 R14: 03e8 R15:
> bd6d0111837c
> [69703.122514] FS:  7fd5f2588740() GS:9d4b6e40()
> knlGS:
> [69703.122521] CS:  0010 DS:  ES:  CR0: 80050033
> [69703.122528] CR2: bd6d01117ff8 CR3: 565d8001 CR4:
> 003606f0
> [69703.122538] DR0:  DR1:  DR2:
> 
> [69703.122545] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [69703.122552] Call Trace:
> [69703.122568]  ftrace_graph_caller+0x6b/0xa0
> [69703.122589]  ? read_hv_sched_clock_tsc+0x5/0x20
> [69703.122599]  read_hv_sched_clock_tsc+0x5/0x20
> [69703.122611]  sched_clock+0x5/0x10
> [69703.122621]  sched_clock_local+0x12/0x80
> [69703.122631]  sched_clock_cpu+0x8c/0xb0
> [69703.122644]  trace_clock_global+0x21/0x90
> [69703.122655]  ring_buffer_lock_reserve+0x100/0x3c0
> [69703.122671]  trace_buffer_lock_reserve+0x16/0x50
> [69703.122683]  __trace_graph_entry+0x28/0x90
> [69703.122695]  trace_graph_entry+0xfd/0x1a0
> [69703.122705]  ? read_hv_clock_tsc_cs+0x10/0x10
> [69703.122714]  ? sched_clock+0x5/0x10
> [69703.122723]  prepare_ftrace_return+0x99/0x110
> [69703.122734]  ? read_hv_clock_tsc_cs+0x10/0x10
> [69703.122743]  ? sched_clock+0x5/0x10
> [...]
> 
> Setting the notrace attribute for read_hv_sched_clock_msr() and
> read_hv_sched_clock_tsc() fixes it
> 
> Fixes: bd00cd52d5be ("clocksource/drivers/hyperv: Add Hyper-V
> specific
> sched clock function")
> Suggested-by: Vitaly Kuznetsov 
> Signed-off-by: Mohammed Gamal 
> ---
>  drivers/clocksource/hyperv_timer.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/clocksource/hyperv_timer.c
> b/drivers/clocksource/hyperv_timer.c
> index 09aa44cb8a91d..ba04cb381cd3f 100644
> --- a/drivers/clocksource/hyperv_timer.c
> +++ b/drivers/clocksource/hyperv_timer.c
> @@ -341,7 +341,7 @@ static u64 notrace read_hv_clock_tsc_cs(struct
> clocksource *arg)
>   return read_hv_clock_tsc();
>  }
>  
> -static u64 read_hv_sched_clock_tsc(void)
> +static u64 notrace read_hv_sched_clock_tsc(void)
>  {
>   return (read_hv_clock_tsc() - hv_sched_clock_offset) *
>   (NSEC_PER_SEC / HV_CLOCK_HZ);
> @@ -404,7 +404,7 @@ static u64 notrace read_hv_clock_msr_cs(struct
> clocksource *arg)
>   return read_hv_clock_msr();
>  }
>  
> -static u64 read_hv_sched_clock_msr(void)
> +static u64 notrace read_hv_sched_clock_msr(void)
>  {
>   return (read_hv_clock_msr() - hv_sched_clock_offset) *
>   (NSEC_PER_SEC / HV_CLOCK_HZ);

Please ignore the patch. Somehow I missed Wei's reply on it. It's
already applied to hyperv-next.

Thanks



[PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

2020-09-03 Thread Mohammed Gamal
This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.

Since smaller physical address spaces are only supported on VMX, the parameter
is only exposed in the kvm_intel module.
Modifications to VMX page fault and EPT violation handling will depend
on whether that parameter is enabled.

Also disable support by default, and let the user decide if they want to enable
it.

Signed-off-by: Mohammed Gamal 
---
 arch/x86/kvm/vmx/vmx.c | 15 ++-
 arch/x86/kvm/vmx/vmx.h |  3 +++
 arch/x86/kvm/x86.c |  2 +-
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 819c185adf09..dc778c7b5a06 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
 module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
 #endif
 
+extern bool __read_mostly allow_smaller_maxphyaddr;
+module_param(allow_smaller_maxphyaddr, bool, S_IRUGO | S_IWUSR);
+
 #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
 #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
 #define KVM_VM_CR0_ALWAYS_ON   \
@@ -4798,7 +4801,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
 
if (is_page_fault(intr_info)) {
cr2 = vmx_get_exit_qual(vcpu);
-   if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
+   if (enable_ept && !vcpu->arch.apf.host_apf_flags
+   && allow_smaller_maxphyaddr) {
/*
 * EPT will cause page fault only if we need to
 * detect illegal GPAs.
@@ -5331,7 +5335,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 * would also use advanced VM-exit information for EPT violations to
 * reconstruct the page fault error code.
 */
-   if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
+   if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)) && allow_smaller_maxphyaddr)
return kvm_emulate_instruction(vcpu, 0);
 
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
@@ -8303,13 +8307,6 @@ static int __init vmx_init(void)
 #endif
vmx_check_vmcs12_offsets();
 
-   /*
-* Intel processors don't have problems with
-* GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
-* it for VMX by default
-*/
-   allow_smaller_maxphyaddr = true;
-
return 0;
 }
 module_init(vmx_init);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 26175a4759fa..b859435efa2e 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -551,6 +551,9 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
 
 static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
 {
+   if (!allow_smaller_maxphyaddr)
+   return false;
+
   return !enable_ept || cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d39d6cf1d473..982f1d73a884 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -188,7 +188,7 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
 u64 __read_mostly host_efer;
 EXPORT_SYMBOL_GPL(host_efer);
 
-bool __read_mostly allow_smaller_maxphyaddr;
+bool __read_mostly allow_smaller_maxphyaddr = 0;
 EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
 
 static u64 __read_mostly host_xss;
-- 
2.26.2
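
Note that with S_IRUGO | S_IWUSR the knob also shows up as
/sys/module/kvm_intel/parameters/allow_smaller_maxphyaddr, so besides being
set at module load time it can in principle be flipped at runtime by root.
Since vmx_need_pf_intercept() reads the variable on each call, the EPT
violation check follows it immediately, while the #PF intercept itself is
only re-evaluated the next time the exception bitmap is updated.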



[PATCH v3 5/9] KVM: x86: update exception bitmap on CPUID changes

2020-07-10 Thread Mohammed Gamal
From: Paolo Bonzini 

Allow vendor code to observe changes to MAXPHYADDR and start/stop
intercepting page faults.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/cpuid.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 8a294f9747aa..ea5bbf2153bb 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -128,6 +128,8 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
kvm_mmu_reset_context(vcpu);
 
kvm_pmu_refresh(vcpu);
+   kvm_x86_ops.update_exception_bitmap(vcpu);
+
return 0;
 }
 
-- 
2.26.2



[PATCH v3 2/9] KVM: x86: mmu: Move translate_gpa() to mmu.c

2020-07-10 Thread Mohammed Gamal
There is also no point in it being inline, since it is always called
through function pointers, so remove that as well.

Signed-off-by: Mohammed Gamal 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_host.h | 6 --
 arch/x86/kvm/mmu/mmu.c  | 6 ++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be5363b21540..62373cc06c72 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1551,12 +1551,6 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd, bool skip_tlb_flush,
 
 void kvm_configure_mmu(bool enable_tdp, int tdp_page_level);
 
-static inline gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
- struct x86_exception *exception)
-{
-   return gpa;
-}
-
 static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
 {
struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d6a0ae7800c..f8b3c5181466 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -515,6 +515,12 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte)
return likely(kvm_gen == spte_gen);
 }
 
+static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
+  struct x86_exception *exception)
+{
+return gpa;
+}
+
 /*
  * Sets the shadow PTE masks used by the MMU.
  *
-- 
2.26.2



[PATCH v3 1/9] KVM: x86: Add helper functions for illegal GPA checking and page fault injection

2020-07-10 Thread Mohammed Gamal
This patch adds two helper functions that will be used to support virtualizing
MAXPHYADDR in both kvm-intel.ko and kvm.ko.

kvm_fixup_and_inject_pf_error() injects a page fault for a user-specified GVA,
while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU address limits.

Signed-off-by: Mohammed Gamal 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu.h |  6 ++
 arch/x86/kvm/x86.c | 21 +
 arch/x86/kvm/x86.h |  1 +
 3 files changed, 28 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 444bb9c54548..59930231d5d5 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -4,6 +4,7 @@
 
 #include 
 #include "kvm_cache_regs.h"
+#include "cpuid.h"
 
 #define PT64_PT_BITS 9
 #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
@@ -158,6 +159,11 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu)
return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
 }
 
+static inline bool kvm_mmu_is_illegal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+return (gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu)));
+}
+
 /*
  * Check if a given access (described through the I/D, W/R and U/S bits of a
  * page fault error code pfec) causes a permission fault with the given PTE
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 88c593f83b28..1f5f4074fc59 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10699,6 +10699,27 @@ u64 kvm_spec_ctrl_valid_bits(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_spec_ctrl_valid_bits);
 
+void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code)
+{
+   struct x86_exception fault;
+
+   if (!(error_code & PFERR_PRESENT_MASK) ||
+   vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, error_code, &fault) != UNMAPPED_GVA) {
+   /*
+* If vcpu->arch.walk_mmu->gva_to_gpa succeeded, the page
+* tables probably do not match the TLB.  Just proceed
+* with the error code that the processor gave.
+*/
+   fault.vector = PF_VECTOR;
+   fault.error_code_valid = true;
+   fault.error_code = error_code;
+   fault.nested_page_fault = false;
+   fault.address = gva;
+   }
+   vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault);
+}
+EXPORT_SYMBOL_GPL(kvm_fixup_and_inject_pf_error);
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 6eb62e97e59f..239ae0f3e40b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -272,6 +272,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
 bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
  int page_num);
 bool kvm_vector_hashing_enabled(void);
+void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
 int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int emulation_type, void *insn, int insn_len);
 fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
-- 
2.26.2



[PATCH v3 0/9] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

2020-07-10 Thread Mohammed Gamal
When EPT is enabled, KVM does not really look at guest physical
address size. Address bits above maximum physical memory size are reserved.
Because KVM does not look at these guest physical addresses, it currently
effectively supports guest physical address sizes equal to the host.

This can be a problem when having a mixed setup of machines with 5-level page
tables and machines with 4-level page tables, as live migration can change
MAXPHYADDR while the guest runs, which can theoretically introduce bugs.

In this patch series we add checks on guest physical addresses in EPT
violation/misconfig and NPF vmexits and if needed inject the proper
page faults in the guest.

A more subtle issue is when the host MAXPHYADDR is larger than that of the
guest. Page faults caused by reserved bits on the guest won't cause an EPT
violation/NPF and hence we also check guest MAXPHYADDR and add PFERR_RSVD_MASK
error code to the page fault if needed.



Changes from v2:
- Drop support for this feature on AMD processors after discussion with AMD


Mohammed Gamal (5):
  KVM: x86: Add helper functions for illegal GPA checking and page fault
injection
  KVM: x86: mmu: Move translate_gpa() to mmu.c
  KVM: x86: mmu: Add guest physical address check in translate_gpa()
  KVM: VMX: Add guest physical address check in EPT violation and
misconfig
  KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR support
configurable

Paolo Bonzini (4):
  KVM: x86: rename update_bp_intercept to update_exception_bitmap
  KVM: x86: update exception bitmap on CPUID changes
  KVM: VMX: introduce vmx_need_pf_intercept
  KVM: VMX: optimize #PF injection when MAXPHYADDR does not match

 arch/x86/include/asm/kvm_host.h | 10 ++--
 arch/x86/kvm/cpuid.c|  2 ++
 arch/x86/kvm/mmu.h  |  6 +
 arch/x86/kvm/mmu/mmu.c  | 12 +
 arch/x86/kvm/svm/svm.c  | 22 +---
 arch/x86/kvm/vmx/nested.c   | 28 
 arch/x86/kvm/vmx/vmx.c  | 45 +
 arch/x86/kvm/vmx/vmx.h  |  6 +
 arch/x86/kvm/x86.c  | 29 -
 arch/x86/kvm/x86.h  |  1 +
 include/uapi/linux/kvm.h|  1 +
 11 files changed, 133 insertions(+), 29 deletions(-)

-- 
2.26.2



[PATCH v3 7/9] KVM: VMX: Add guest physical address check in EPT violation and misconfig

2020-07-10 Thread Mohammed Gamal
Check the guest physical address against the guest's maximum physical
address. If the address exceeds the maximum (i.e. has reserved bits set),
inject a guest page fault with PFERR_RSVD_MASK set.

This has to be done both in the EPT violation and page fault paths, as
there are complications in both cases with respect to the computation
of the correct error code.

For EPT violations, unfortunately the only possibility is to emulate,
because the access type in the exit qualification might refer to an
access to a paging structure, rather than to the access performed by
the program.

Trapping page faults instead is needed in order to correct the error code,
but the access type can be obtained from the original error code and
passed to gva_to_gpa.  The corrections required in the error code are
subtle. For example, imagine that a PTE for a supervisor page has a reserved
bit set.  On a supervisor-mode access, the EPT violation path would trigger.
However, on a user-mode access, the processor will not notice the reserved
bit and not include PFERR_RSVD_MASK in the error code.

Co-developed-by: Mohammed Gamal 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/vmx/vmx.c | 24 +---
 arch/x86/kvm/vmx/vmx.h |  3 ++-
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 770b090969fb..de3f436b2d32 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4790,9 +4790,15 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
 
if (is_page_fault(intr_info)) {
cr2 = vmx_get_exit_qual(vcpu);
-   /* EPT won't cause page fault directly */
-   WARN_ON_ONCE(!vcpu->arch.apf.host_apf_flags && enable_ept);
-   return kvm_handle_page_fault(vcpu, error_code, cr2, NULL, 0);
+   if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
+   /*
+* EPT will cause page fault only if we need to
+* detect illegal GPAs.
+*/
+   kvm_fixup_and_inject_pf_error(vcpu, cr2, error_code);
+   return 1;
+   } else
+   return kvm_handle_page_fault(vcpu, error_code, cr2, NULL, 0);
}
 
ex_no = intr_info & INTR_INFO_VECTOR_MASK;
@@ -5308,6 +5314,18 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
   PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
 
vcpu->arch.exit_qualification = exit_qualification;
+
+   /*
+* Check that the GPA doesn't exceed physical memory limits, as that is
+* a guest page fault.  We have to emulate the instruction here, because
+* if the illegal address is that of a paging structure, then
+* EPT_VIOLATION_ACC_WRITE bit is set.  Alternatively, if supported we
+* would also use advanced VM-exit information for EPT violations to
+* reconstruct the page fault error code.
+*/
+   if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
+   return kvm_emulate_instruction(vcpu, 0);
+
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index b0e5e210f1c1..0d06951e607c 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -11,6 +11,7 @@
 #include "kvm_cache_regs.h"
 #include "ops.h"
 #include "vmcs.h"
+#include "cpuid.h"
 
 extern const u32 vmx_msr_index[];
 
@@ -552,7 +553,7 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
 
 static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
 {
-   return !enable_ept;
return !enable_ept || cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
 }
 
 void dump_vmcs(void);
-- 
2.26.2



[PATCH v3 6/9] KVM: VMX: introduce vmx_need_pf_intercept

2020-07-10 Thread Mohammed Gamal
From: Paolo Bonzini 

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/vmx/nested.c | 28 +---
 arch/x86/kvm/vmx/vmx.c|  2 +-
 arch/x86/kvm/vmx/vmx.h|  5 +
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b26655104d4a..1aea9e3b8c43 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2433,22 +2433,28 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 
/*
 * Whether page-faults are trapped is determined by a combination of
-* 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.
-* If enable_ept, L0 doesn't care about page faults and we should
-* set all of these to L1's desires. However, if !enable_ept, L0 does
-* care about (at least some) page faults, and because it is not easy
-* (if at all possible?) to merge L0 and L1's desires, we simply ask
-* to exit on each and every L2 page fault. This is done by setting
-* MASK=MATCH=0 and (see below) EB.PF=1.
+* 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.  If L0
+* doesn't care about page faults then we should set all of these to
+* L1's desires. However, if L0 does care about (some) page faults, it
+* is not easy (if at all possible?) to merge L0 and L1's desires, we
+* simply ask to exit on each and every L2 page fault. This is done by
+* setting MASK=MATCH=0 and (see below) EB.PF=1.
 * Note that below we don't need special code to set EB.PF beyond the
 * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
 * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
 * !enable_ept, EB.PF is 1, so the "or" will always be 1.
 */
-   vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
-   enable_ept ? vmcs12->page_fault_error_code_mask : 0);
-   vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
-   enable_ept ? vmcs12->page_fault_error_code_match : 0);
+   if (vmx_need_pf_intercept(&vmx->vcpu)) {
+   /*
+* TODO: if both L0 and L1 need the same MASK and MATCH,
+* go ahead and use it?
+*/
+   vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
+   vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);
+   } else {
+   vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, vmcs12->page_fault_error_code_mask);
+   vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, vmcs12->page_fault_error_code_match);
+   }
 
if (cpu_has_vmx_apicv()) {
vmcs_write64(EOI_EXIT_BITMAP0, vmcs12->eoi_exit_bitmap0);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 178ee92551a9..770b090969fb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -780,7 +780,7 @@ void update_exception_bitmap(struct kvm_vcpu *vcpu)
eb |= 1u << BP_VECTOR;
if (to_vmx(vcpu)->rmode.vm86_active)
eb = ~0;
-   if (enable_ept)
+   if (!vmx_need_pf_intercept(vcpu))
eb &= ~(1u << PF_VECTOR);
 
/* When we are running a nested L2 guest and L1 specified for it a
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 639798e4a6ca..b0e5e210f1c1 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -550,6 +550,11 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
 }
 
+static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
+{
+   return !enable_ept;
+}
+
 void dump_vmcs(void);
 
 #endif /* __KVM_X86_VMX_H */
-- 
2.26.2



[PATCH v3 9/9] KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR support configurable

2020-07-10 Thread Mohammed Gamal
The reason for including this patch is unexpected behaviour we see
with NPT vmexit handling on AMD processors.

With the previous patch ("KVM: SVM: Add guest physical address check in
NPF/PF interception") we see the following error multiple times in
the 'access' test in kvm-unit-tests:

test pte.p pte.36 pde.p: FAIL: pte 221 expected 201
Dump mapping: address: 0x1234
--L4: 24c3027
--L3: 24c4027
--L2: 24c5021
--L1: 100221

This shows that the PTE's accessed bit is apparently being set by
the CPU hardware before the NPF vmexit. This is handled entirely by
hardware and cannot be fixed in software.

This patch introduces a workaround: a boolean variable
'allow_smaller_maxphyaddr', set individually by the VMX and SVM init
routines. On VMX it is always set to true; on SVM it is set to true
only when NPT is not enabled.

We also add a new capability, KVM_CAP_SMALLER_MAXPHYADDR, which
allows userspace to query whether the underlying architecture supports
GUEST_MAXPHYADDR < HOST_MAXPHYADDR and act accordingly
(e.g. QEMU can decide whether to ignore the -cpu ..,phys-bits=X option).
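
A minimal userspace probe for the new capability, using the standard
KVM_CHECK_EXTENSION ioctl (illustration only, not part of the patch;
it needs uapi headers from a kernel carrying this series):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
        int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
        if (kvm < 0)
                return 1;
        /* 1 if GUEST_MAXPHYADDR < HOST_MAXPHYADDR is supported, 0 if not. */
        int r = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_SMALLER_MAXPHYADDR);
        printf("KVM_CAP_SMALLER_MAXPHYADDR: %d\n", r);
        return 0;
}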

CC: Tom Lendacky 
CC: Babu Moger 
Signed-off-by: Mohammed Gamal 
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/svm/svm.c  | 15 +++
 arch/x86/kvm/vmx/vmx.c  |  7 +++
 arch/x86/kvm/x86.c  |  6 ++
 include/uapi/linux/kvm.h|  1 +
 5 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bb4044ffb7b7..26002e1b47f7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1304,7 +1304,7 @@ struct kvm_arch_async_pf {
 };
 
 extern u64 __read_mostly host_efer;
-
+extern bool __read_mostly allow_smaller_maxphyaddr;
 extern struct kvm_x86_ops kvm_x86_ops;
 
 #define __KVM_HAVE_ARCH_VM_ALLOC
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 79c33b3539f0..f3d7ae26875c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -924,6 +924,21 @@ static __init int svm_hardware_setup(void)
 
svm_set_cpu_caps();
 
+   /*
+* It seems that on AMD processors PTE's accessed bit is
+* being set by the CPU hardware before the NPF vmexit.
+* This is not expected behaviour and our tests fail because
+* of it.
+* A workaround here is to disable support for
+* GUEST_MAXPHYADDR < HOST_MAXPHYADDR if NPT is enabled.
+* In this case userspace can know if there is support using
+* KVM_CAP_SMALLER_MAXPHYADDR extension and decide how to handle
+* it
+* If future AMD CPU models change the behaviour described above,
+* this variable can be changed accordingly
+*/
+   allow_smaller_maxphyaddr = !npt_enabled;
+
return 0;
 
 err:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0cebc4832805..8a8e85e6c529 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8294,6 +8294,13 @@ static int __init vmx_init(void)
 #endif
vmx_check_vmcs12_offsets();
 
+   /*
+* Intel processors don't have problems with
+* GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
+* it for VMX by default
+*/
+   allow_smaller_maxphyaddr = true;
+
return 0;
 }
 module_init(vmx_init);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03c401963062..167becd6a634 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -187,6 +187,9 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
 u64 __read_mostly host_efer;
 EXPORT_SYMBOL_GPL(host_efer);
 
+bool __read_mostly allow_smaller_maxphyaddr;
+EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
+
 static u64 __read_mostly host_xss;
 u64 __read_mostly supported_xss;
 EXPORT_SYMBOL_GPL(supported_xss);
@@ -3538,6 +3541,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_HYPERV_ENLIGHTENED_VMCS:
r = kvm_x86_ops.nested_ops->enable_evmcs != NULL;
break;
+   case KVM_CAP_SMALLER_MAXPHYADDR:
+   r = (int) allow_smaller_maxphyaddr;
+   break;
default:
break;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4fdf30316582..68cd3a0af9bb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1031,6 +1031,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_SECURE_GUEST 181
 #define KVM_CAP_HALT_POLL 182
 #define KVM_CAP_ASYNC_PF_INT 183
+#define KVM_CAP_SMALLER_MAXPHYADDR 184
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.26.2



[PATCH v3 4/9] KVM: x86: rename update_bp_intercept to update_exception_bitmap

2020-07-10 Thread Mohammed Gamal
From: Paolo Bonzini 

We would like to introduce a callback to update the #PF intercept
when CPUID changes.  Just rename and reuse update_bp_intercept, since
VMX already implements it with update_exception_bitmap rather than a
bespoke function.

While at it, remove an unnecessary assignment in the SVM version,
which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug)
and has nothing to do with the exception bitmap.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/svm/svm.c  | 7 +++
 arch/x86/kvm/vmx/vmx.c  | 2 +-
 arch/x86/kvm/x86.c  | 2 +-
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 62373cc06c72..bb4044ffb7b7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1098,7 +1098,7 @@ struct kvm_x86_ops {
void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu);
void (*vcpu_put)(struct kvm_vcpu *vcpu);
 
-   void (*update_bp_intercept)(struct kvm_vcpu *vcpu);
+   void (*update_exception_bitmap)(struct kvm_vcpu *vcpu);
int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index c0da4dd78ac5..79c33b3539f0 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1627,7 +1627,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
mark_dirty(svm->vmcb, VMCB_SEG);
 }
 
-static void update_bp_intercept(struct kvm_vcpu *vcpu)
+static void update_exception_bitmap(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -1636,8 +1636,7 @@ static void update_bp_intercept(struct kvm_vcpu *vcpu)
if (vcpu->guest_debug & KVM_GUESTDBG_ENABLE) {
if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP)
set_exception_intercept(svm, BP_VECTOR);
-   } else
-   vcpu->guest_debug = 0;
+   }
 }
 
 static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd)
@@ -3989,7 +3988,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_blocking = svm_vcpu_blocking,
.vcpu_unblocking = svm_vcpu_unblocking,
 
-   .update_bp_intercept = update_bp_intercept,
+   .update_exception_bitmap = update_exception_bitmap,
.get_msr_feature = svm_get_msr_feature,
.get_msr = svm_get_msr,
.set_msr = svm_set_msr,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 13745f2a5ecd..178ee92551a9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7859,7 +7859,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.vcpu_load = vmx_vcpu_load,
.vcpu_put = vmx_vcpu_put,
 
-   .update_bp_intercept = update_exception_bitmap,
+   .update_exception_bitmap = update_exception_bitmap,
.get_msr_feature = vmx_get_msr_feature,
.get_msr = vmx_get_msr,
.set_msr = vmx_set_msr,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1f5f4074fc59..03c401963062 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9281,7 +9281,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 */
kvm_set_rflags(vcpu, rflags);
 
-   kvm_x86_ops.update_bp_intercept(vcpu);
+   kvm_x86_ops.update_exception_bitmap(vcpu);
 
r = 0;
 
-- 
2.26.2



[PATCH v3 8/9] KVM: VMX: optimize #PF injection when MAXPHYADDR does not match

2020-07-10 Thread Mohammed Gamal
From: Paolo Bonzini 

Ignore non-present page faults, since those cannot have reserved
bits set.

When running access.flat with "-cpu Haswell,phys-bits=36", the
number of trapped page faults goes down from 8872644 to 3978948.
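
This works because, with the #PF bit set in the exception bitmap, VMX
exits on a page fault iff the masked error code equals the match value.
A simplified model of that hardware rule (illustration only):

/*
 * Assumes EXCEPTION_BITMAP.PF = 1.  With mask == match ==
 * PFERR_PRESENT_MASK, only faults on present pages are trapped --
 * and only those can have reserved bits set.
 */
static bool pf_causes_vmexit(u32 pfec, u32 pfec_mask, u32 pfec_match)
{
        return (pfec & pfec_mask) == pfec_match;
}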

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/vmx/vmx.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index de3f436b2d32..0cebc4832805 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4355,6 +4355,16 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmx->pt_desc.guest.output_mask = 0x7F;
vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
}
+
+   /*
+* If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
+* between guest and host.  In that case we only care about present
+* faults.
+*/
+   if (enable_ept) {
+   vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, PFERR_PRESENT_MASK);
+   vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, PFERR_PRESENT_MASK);
+   }
 }
 
 static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
-- 
2.26.2



[PATCH v3 3/9] KVM: x86: mmu: Add guest physical address check in translate_gpa()

2020-07-10 Thread Mohammed Gamal
When running a guest with 4-level page tables on a 5-level page table
host, the guest may use a physical address that has reserved bits set
from its own point of view, yet looks perfectly legal to the host, so
the host won't trap it. For example, with guest MAXPHYADDR = 48 and
host MAXPHYADDR = 52, a GPA with bit 48 set is reserved for the guest
but valid for the host.

Hence, we need to check page fault physical addresses against the
guest's maximum physical address and, if it is exceeded, add
PFERR_RSVD_MASK to the page fault error code.

Also make sure the error code isn't overwritten by the page table walker.

Signed-off-by: Mohammed Gamal 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f8b3c5181466..e03e85b21cda 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -518,6 +518,12 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 
spte)
 static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
   struct x86_exception *exception)
 {
+   /* Check if guest physical address doesn't exceed guest maximum */
+   if (kvm_mmu_is_illegal_gpa(vcpu, gpa)) {
+   exception->error_code |= PFERR_RSVD_MASK;
+   return UNMAPPED_GVA;
+   }
+
 return gpa;
 }
 
-- 
2.26.2



[PATCH] hv: clocksource: Add notrace attribute to read_hv_sched_clock_*() functions

2020-09-24 Thread Mohammed Gamal
When selecting the function_graph tracer with the command:
 # echo function_graph > /sys/kernel/debug/tracing/current_tracer

The kernel crashes with the following stack trace:

[69703.122389] BUG: stack guard page was hit at 1056545c (stack is fa3f8fed..05d39503)
[69703.122403] kernel stack overflow (double-fault):  [#1] SMP PTI
[69703.122413] CPU: 0 PID: 16982 Comm: bash Kdump: loaded Not tainted 4.18.0-236.el8.x86_64 #1
[69703.122420] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
[69703.122433] RIP: 0010:prepare_ftrace_return+0xa/0x110
[69703.122458] Code: 05 00 0f 0b 48 c7 c7 10 ca 69 ae 0f b6 f0 e8 4b 52 0c 00 31 c0 eb ca 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 41 54 <53> 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 48 85
[69703.122467] RSP: 0018:bd6d01118000 EFLAGS: 00010086
[69703.122476] RAX:  RBX:  RCX: 0003
[69703.122484] RDX:  RSI: bd6d011180d8 RDI: adce7550
[69703.122491] RBP: bd6d01118018 R08:  R09: 9d4b09266000
[69703.122498] R10: 9d4b0fc04540 R11: 9d4b0fc20a00 R12: 9d4b6e42aa90
[69703.122506] R13: 9d4b0fc20ab8 R14: 03e8 R15: bd6d0111837c
[69703.122514] FS:  7fd5f2588740() GS:9d4b6e40() knlGS:
[69703.122521] CS:  0010 DS:  ES:  CR0: 80050033
[69703.122528] CR2: bd6d01117ff8 CR3: 565d8001 CR4: 003606f0
[69703.122538] DR0:  DR1:  DR2: 
[69703.122545] DR3:  DR6: fffe0ff0 DR7: 0400
[69703.122552] Call Trace:
[69703.122568]  ftrace_graph_caller+0x6b/0xa0
[69703.122589]  ? read_hv_sched_clock_tsc+0x5/0x20
[69703.122599]  read_hv_sched_clock_tsc+0x5/0x20
[69703.122611]  sched_clock+0x5/0x10
[69703.122621]  sched_clock_local+0x12/0x80
[69703.122631]  sched_clock_cpu+0x8c/0xb0
[69703.122644]  trace_clock_global+0x21/0x90
[69703.122655]  ring_buffer_lock_reserve+0x100/0x3c0
[69703.122671]  trace_buffer_lock_reserve+0x16/0x50
[69703.122683]  __trace_graph_entry+0x28/0x90
[69703.122695]  trace_graph_entry+0xfd/0x1a0
[69703.122705]  ? read_hv_clock_tsc_cs+0x10/0x10
[69703.122714]  ? sched_clock+0x5/0x10
[69703.122723]  prepare_ftrace_return+0x99/0x110
[69703.122734]  ? read_hv_clock_tsc_cs+0x10/0x10
[69703.122743]  ? sched_clock+0x5/0x10
[69703.122752]  ftrace_graph_caller+0x6b/0xa0
[69703.122768]  ? read_hv_clock_tsc_cs+0x10/0x10
[69703.122777]  ? sched_clock+0x5/0x10
[69703.122786]  ? read_hv_sched_clock_tsc+0x5/0x20
[69703.122796]  ? ring_buffer_unlock_commit+0x1d/0xa0
[69703.122805]  read_hv_sched_clock_tsc+0x5/0x20
[69703.122814]  ftrace_graph_caller+0xa0/0xa0
[69703.122823]  ? trace_clock_local+0x5/0x10
[69703.122831]  ? ftrace_push_return_trace+0x5d/0x120
[69703.122842]  ? read_hv_clock_tsc_cs+0x10/0x10
[69703.122850]  ? sched_clock+0x5/0x10
[69703.122860]  ? prepare_ftrace_return+0xd5/0x110
[69703.122871]  ? read_hv_clock_tsc_cs+0x10/0x10
[69703.122879]  ? sched_clock+0x5/0x10
[69703.122889]  ? ftrace_graph_caller+0x6b/0xa0
[69703.122904]  ? read_hv_clock_tsc_cs+0x10/0x10
[69703.122912]  ? sched_clock+0x5/0x10
[69703.122922]  ? read_hv_sched_clock_tsc+0x5/0x20
[69703.122931]  ? ring_buffer_unlock_commit+0x1d/0xa0
[69703.122940]  ? read_hv_sched_clock_tsc+0x5/0x20
[69703.122966]  ? ftrace_graph_caller+0xa0/0xa0
[69703.122975]  ? trace_clock_local+0x5/0x10
[69703.122984]  ? ftrace_push_return_trace+0x5d/0x120
[69703.122995]  ? read_hv_clock_tsc_cs+0x10/0x10
[69703.123006]  ? sched_clock+0x5/0x10
[69703.123016]  ? prepare_ftrace_return+0xd5/0x110
[69703.123026]  ? read_hv_clock_tsc_cs+0x10/0x10
[69703.123035]  ? sched_clock+0x5/0x10
[69703.123044]  ? ftrace_graph_caller+0x6b/0xa0
[69703.123059]  ? read_hv_clock_tsc_cs+0x10/0x10
[69703.123068]  ? sched_clock+0x5/0x10

The function-graph tracer needs a timestamp for every traced function
entry, which it obtains via sched_clock(); on Hyper-V this ends up in
read_hv_sched_clock_tsc(). Since that function was itself instrumented,
every timestamp read re-entered the tracer, recursing until the stack
guard page was hit. Setting the notrace attribute on
read_hv_sched_clock_msr() and read_hv_sched_clock_tsc() breaks the
recursion and fixes the crash.

Fixes: bd00cd52d5be ("clocksource/drivers/hyperv: Add Hyper-V specific
sched clock function")
Signed-off-by: Vitaly Kuznetsov 
Signed-off-by: Mohammed Gamal 
---
 drivers/clocksource/hyperv_timer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index 09aa44cb8a91d..ba04cb381cd3f 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -341,7 +341,7 @@ static u64 notrace read_hv_clock_tsc_cs(struct clocksource *arg)
return read_hv_clock_tsc();
 }
 
-static u64 read_hv_sched_clock_tsc(void)
+static u64 notrace read_hv_sched_clock_tsc(void)
 {
return (read_hv_clock_tsc() - hv_sched_clock_offset) *
(NSEC_PER_SEC / HV_CLOCK_HZ);
@@ -404,7 +404,7 @@ static u64 notrace read_hv_clock_msr_cs(struct clocksource *arg)
 	return read_hv_clock_msr();
 }
 
-static u64 read_hv_sched_clock_msr(void)
+static u64 notrace read_hv_sched_clock_msr(void)
 {
 	return (read_hv_clock_msr() - hv_sched_clock_offset) *
 		(NSEC_PER_SEC / HV_CLOCK_HZ);


[PATCH] netvsc: Remove redundant use of ipv6_hdr()

2017-07-19 Thread Mohammed Gamal
This condition already uses an object of type ipv6hdr obtained in the
line above. Use that object directly instead of calling ipv6_hdr() again.

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/netvsc_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 63c98bb..06d591c 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -339,7 +339,7 @@ static u32 net_checksum_info(struct sk_buff *skb)
 
if (ip6->nexthdr == IPPROTO_TCP)
return TRANSPORT_INFO_IPV6_TCP;
-   else if (ipv6_hdr(skb)->nexthdr == IPPROTO_UDP)
+   else if (ip6->nexthdr == IPPROTO_UDP)
return TRANSPORT_INFO_IPV6_UDP;
}
 
-- 
2.9.4



Re: [PATCH] netvsc: Remove redundant use of ipv6_hdr()

2017-07-19 Thread Mohammed Gamal


- Original Message -
> On Wed, 19 Jul 2017 15:19:28 +0200
> Mohammed Gamal  wrote:
> 
> > This condition already uses an object of type ipv6hdr in the line above.
> > Use the object directly instead of calling ipv6_hdr
> > 
> > Signed-off-by: Mohammed Gamal 
> > ---
> >  drivers/net/hyperv/netvsc_drv.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> > index 63c98bb..06d591c 100644
> > --- a/drivers/net/hyperv/netvsc_drv.c
> > +++ b/drivers/net/hyperv/netvsc_drv.c
> > @@ -339,7 +339,7 @@ static u32 net_checksum_info(struct sk_buff *skb)
> >  
> > if (ip6->nexthdr == IPPROTO_TCP)
> > return TRANSPORT_INFO_IPV6_TCP;
> > -   else if (ipv6_hdr(skb)->nexthdr == IPPROTO_UDP)
> > +   else if (ip6->nexthdr == IPPROTO_UDP)
> > return TRANSPORT_INFO_IPV6_UDP;
> > }
> >  
> 
> Patch looks fine.
> Network patches go through net...@vger.kernel.org not linux driver mailing
> list.
> I will add it to my next patch of patches that are going to netdev for
> net-next.
> 
Thanks for the heads up. Will take that into consideration next time.

It's worth pointing out that MAINTAINERS states that patches for files
under drivers/net/hyperv are to be sent to de...@linuxdriverproject.org.
Perhaps that should be updated.