[PATCH] mmc: Reducing cache operations in the host driver

2013-09-26 Thread Vishal Annapurve

Hi,

I am attaching a patch which attempts to reduce the cache operations 
while doing
MMC transactions. I have tested it only on arm and the tests performed 
with benchmarks
like iozone/bonnie showed that the data integrity is maintained while 
I/O bandwidth is increased.

I have tested it with K3.1 and I believe it can apply to 3.12 also.

My understanding from the important API's dealing with DMA memory is as 
follows:
1) dma_map_sg/ dma_sync_sg_for_device -> make sure that cache is flushed 
after CPU is done
updating the memory allocated for DMA and is called before giving 
control of DMA memory to

the device.
2) dma_unmap_sg/dma_sync_sg_for_cpu -> Make sure that cache is 
invalidated before

reading from the DMA area which was used by the device to write the data.

About the patch:
Changes in sdhci_adma_table_pre make sure that we only flush if we have 
updated DMA

area after the call to dma_map_sg.

Changes in sdhci_adma_table_post take care of following:
1) Remove invalidation of cache for memory locations which are going to 
be updated by CPU,

as they are not being read.
2) Perform the unmap of sg before CPU accesses DMA area as the changes 
we did for unaligned
cases might get lost due to invalidation afterwards. I was not able to 
induce unaligned buffer
accesses using normal filesystem/raw device operations. Maybe that's why 
this issue was not

discovered so far.
3) Only drawback is sg->dma_address gets used after the call to 
dma_unmap_sg.


I would like to understand if this patch can cause any regressions for 
any of the architectures

or with the MMC functionality.

Thanks & Regards,
Vishal Annapurve


Author: Vishal Annapurve 

MMC: Remove unnecessary cache operations

1) This change removes unnecessary cache operations happening
after and before DMA setup in MMC host driver.
    
    Signed-off-by: Vishal Annapurve 

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index 6e43c84..d5cd0ae 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -434,6 +434,7 @@ static int sdhci_adma_table_pre(struct sdhci_host *host,
 	dma_addr_t addr;
 	dma_addr_t align_addr;
 	int len, offset;
+	int align_buf_modified_len = 0;
 
 	struct scatterlist *sg;
 	int i;
@@ -498,6 +499,7 @@ static int sdhci_adma_table_pre(struct sdhci_host *host,
 
 			align += 4;
 			align_addr += 4;
+			align_buf_modified_len += 4;
 
 			desc += 8;
 
@@ -538,9 +540,10 @@ static int sdhci_adma_table_pre(struct sdhci_host *host,
 	/*
 	 * Resync align buffer as we might have changed it.
 	 */
-	if (data->flags & MMC_DATA_WRITE) {
+	if ((data->flags & MMC_DATA_WRITE) &&
+			align_buf_modified_len) {
 		dma_sync_single_for_device(mmc_dev(host->mmc),
-			host->align_addr, 128 * 4, direction);
+			host->align_addr, align_buf_modified_len, direction);
 	}
 
 	host->adma_addr = dma_map_single(mmc_dev(host->mmc),
@@ -583,9 +586,10 @@ static void sdhci_adma_table_post(struct sdhci_host *host,
 	dma_unmap_single(mmc_dev(host->mmc), host->align_addr,
 		128 * 4, direction);
 
+	dma_unmap_sg(mmc_dev(host->mmc), data->sg,
+		data->sg_len, direction);
+
 	if (data->flags & MMC_DATA_READ) {
-		dma_sync_sg_for_cpu(mmc_dev(host->mmc), data->sg,
-			data->sg_len, direction);
 
 		align = host->align_buffer;
 
@@ -602,9 +606,6 @@ static void sdhci_adma_table_post(struct sdhci_host *host,
 			}
 		}
 	}
-
-	dma_unmap_sg(mmc_dev(host->mmc), data->sg,
-		data->sg_len, direction);
 }
 
 static u8 sdhci_calc_timeout(struct sdhci_host *host, struct mmc_command *cmd)


[PATCH] usb-storage: scsiglue: Changing the command result

2013-09-26 Thread Vishal Annapurve

Hi,

There was a recent commit in mainline for the scsi devices which do not
respond properly to medium access command:

commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8

[SCSI] Handle disk devices which can not process medium access commands
We have experienced several devices which fail in a fashion we do not
currently handle gracefully in SCSI. After a failure these devices will
respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.)
but any command accessing the storage medium will time out.

I came across a USB drive which showed similar problem and what I see is
usb storage is still not able to cope with such devices properly.

The control flow downwards is like:
scsi_times_out --> Setting cmd->result as DID_TIME_OUT
scsi_error_handler
scsi_unjam_host
scsi_eh_abort_cmds
command_abort(sets US_FLIDX_TIMED_OUT for us->dflags
  calls stop_transport,
  and waits for)usb_stor_control_thread 
(which is waiting for
   transport 
call to return inside
   
usb_stor_invoke_transport)
   both 
usb_stor_control_thread and
   
usb_stor_invoke_transport
check for us->dflags 
timed_out bit and
 set the result as DID_ABORT
 and signal completion for 
command_abort
 to complete
..
sd_eh_action
checks for cmd->result and finds out that it's DID_ABORT rather than
DID_TIME_OUT.

This patch updates the command result to be TIME_OUT explicitly before
returning from command_abort in scsiglue.c.

I would like to know if this patch can work out for such USB Storage
devices? What would be the better way to do the same?


Regards,
Vishal Annapurve



From: Vishal Annapurve 
Subject: [PATCH] usb-storage: Changing the command result

This change updates the returned result to scsi layer to use
DID_TIMEOUT flag as a result status rather than DID_ABORT for
commands aborted due to timeout.

Signed-off-by: Vishal Annapurve 
---
 drivers/usb/storage/scsiglue.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/usb/storage/scsiglue.c b/drivers/usb/storage/scsiglue.c
index a74e9d8..e7b532e 100644
--- a/drivers/usb/storage/scsiglue.c
+++ b/drivers/usb/storage/scsiglue.c
@@ -353,6 +353,15 @@ static int command_abort(struct scsi_cmnd *srb)
 
 	/* Wait for the aborted command to finish */
 	wait_for_completion(&us->notify);
+	/* Given that this bit would be only set in case of timedout
+	 * command, we can return with DID_TIME_OUT, as scsi layer needs
+	 * to use this result rather than DID_ABORT.
+	 * This is a WR to avoid removing DID_ABORT altogether
+	 * and before the final control goes to scsi layer change the
+	 * result to be timedout.
+	 */
+	us->srb->result = DID_TIME_OUT << 16;
+
 	return SUCCESS;
 }
 
-- 
1.8.4




Re: [RFC PATCH 00/39] 1G page support for guest_memfd

2024-09-14 Thread Vishal Annapurve
On Fri, Sep 13, 2024 at 6:08 PM Du, Fan  wrote:
>
> ...
> >
> > Hello,
> >
> > This patchset is our exploration of how to support 1G pages in guest_memfd,
> > and
> > how the pages will be used in Confidential VMs.
> >
> > The patchset covers:
> >
> > + How to get 1G pages
> > + Allowing mmap() of guest_memfd to userspace so that both private and
> > shared
>
> Hi Ackerley
>
> Thanks for posting new version :)
>
> W.r.t above description and below patch snippet from Patch 26-29,
> Does this new design aim to backup shared and private GPA with a single
> Hugetlb spool which equal VM instance total memory?

Yes.
>
> By my understanding, before this new changes, shared memfd and gmem fd
> has dedicate hugetlb pool, that's two copy/reservation of hugetlb spool.

Selftests attached to this series use single gmem fd to back guest memory.

>
> Does Qemu require new changes as well? I'd like to have a test of this series
> if you can share Qemu branch?
>

We are going to discuss this RFC series and related issues at LPC.
Once the next steps are finalized, the plan will be to send out an
improved version. You can use/modify the selftests that are part of
this series to test this feature with software protected VMs for now.

Qemu will require changes for this feature on top of already floated
gmem integration series [1] that adds software protected VM support to
Qemu. If you are interested in testing this feature with TDX VMs then
it needs multiple series to set up the right test environment
(including [2]). We haven't considered posting Qemu patches and it
will be a while before we can get to it.

[1] https://patchew.org/QEMU/20230914035117.3285885-1-xiaoyao...@intel.com/
[2] 
https://patchwork.kernel.org/project/kvm/cover/20231115071519.2864957-1-xiaoyao...@intel.com/



RE: [PATCH] usb-storage: scsiglue: Changing the command result

2013-10-15 Thread Vishal Annapurve
Hi Alan,

USB storage maybe just has to say that the abort occurred. By setting the
US_FLIDX_TIMED_OUT bit USB storage is getting signaled that the reason was
time out and the command is being aborted.
 
Now, it's arguable whether to change the implication of US_FLIDX_TIMED_OUT
bit for scsi - USB storage bridge or for entire usb storage Or maybe scsi has
decided to abort so it should override the result.

Regards,
Vishal

-Original Message-
From: Alan Stern [mailto:st...@rowland.harvard.edu] 
Sent: Tuesday, October 15, 2013 7:03 PM
To: Vishal Annapurve
Cc: Ming Lei; Linux Kernel Mailing List; linux-usb
Subject: RE: [PATCH] usb-storage: scsiglue: Changing the command result

On Tue, 15 Oct 2013, Vishal Annapurve wrote:

> Hi Alan,
> 
> This issue seems more related to the devices using SCSI protocol and 
> the changes otherwise will be at more places giving the same end result.
> 
> I think as the comment says over the command_abort function, 
> intentional result change should only happen in case of timeout.

usb-storage doesn't know _why_ a command was aborted; it knows only that the 
abort occurred.

If you look carefully at the code, you'll see that the result is set to 
DID_ABORT only when the US_FLIDX_TIMED_OUT bit is set, and this bit gets set 
only when a SCSI abort occurs.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] usb-storage: scsiglue: Changing the command result

2013-10-17 Thread Vishal Annapurve
Hi Alan,

What I wanted to say was If the bit US_FLIDX_TIMED_OUT can have more
meanings than timed out then maybe it would be best to override the
results after usb-storage is done with the command maybe in scsi layer
itself who aborted it in the first place.

My concern was that overriding the result in usb storage or scsi layers
will have more side effects than doing it in scsiglue.c.
And by scsi-usb storage bridge what I meant was specifically the code in 
scsiglue.

Question about your last mail, do you want to change all the occurrences of
DID_ABORT from usb-storage to DID_TIMEOUT?

Regards,
Vishal

-Original Message-
From: Alan Stern [mailto:st...@rowland.harvard.edu] 
Sent: Tuesday, October 15, 2013 10:22 PM
To: Vishal Annapurve
Cc: Ming Lei; Linux Kernel Mailing List; linux-usb
Subject: RE: [PATCH] usb-storage: scsiglue: Changing the command result

On Tue, 15 Oct 2013, Vishal Annapurve wrote:

> Hi Alan,
> 
> USB storage maybe just has to say that the abort occurred. By setting 
> the US_FLIDX_TIMED_OUT bit USB storage is getting signaled that the 
> reason was time out and the command is being aborted.

No.  By setting the US_FLIDX_TIMED_OUT bit, usb-storage indicates that the 
command was aborted.  This doesn't indicate anything about the reason for the 
abort.  (Maybe this bit's name wasn't chosen very well.)

> Now, it's arguable whether to change the implication of 
> US_FLIDX_TIMED_OUT bit for scsi - USB storage bridge or for entire usb 
> storage

I don't understand this.  What's the difference between "scsi - USB storage 
bridge" and "entire usb storage"?  Aren't they the same thing?

>  Or maybe scsi has
> decided to abort so it should override the result.

Of course the SCSI midlayer has decided to abort.  That's the only way this bit 
can get set.  But usb-storage doesn't know why SCSI decided to abort.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] usb-storage: scsiglue: Changing the command result

2013-10-19 Thread Vishal Annapurve
Hi,

Attaching the new patch which will replace all the occurrences of DID_ABORT
with DID_TIMOUT in USB Storage whenever it sees US_FLIDX_TIMED_OUT bit
is set.
 
Regards,
Vishal

-Original Message-
From: Alan Stern [mailto:st...@rowland.harvard.edu] 
Sent: Friday, October 18, 2013 4:10 PM
To: Vishal Annapurve
Cc: Ming Lei; Linux Kernel Mailing List; linux-usb
Subject: RE: [PATCH] usb-storage: scsiglue: Changing the command result

On Fri, 18 Oct 2013, Vishal Annapurve wrote:

> Hi Alan,
> 
> What I wanted to say was If the bit US_FLIDX_TIMED_OUT can have more 
> meanings than timed out then maybe it would be best to override the 
> results after usb-storage is done with the command maybe in scsi layer 
> itself who aborted it in the first place.

US_FLIDX_TIMED_OUT has one very specific meaning: command_abort() was called.

Since command_abort() is the .eh_abort_handler routine, US_FLIDX_TIMED_OUT 
means that the SCSI layer decided to abort the command.

Does the SCSI layer ever abort a command for any reason other than a timeout?  
If not, you may conclude that US_FLIDX_TIMED_OUT indicates a timeout.  But if 
it does, you should not make this conclusion.

> My concern was that overriding the result in usb storage or scsi 
> layers will have more side effects than doing it in scsiglue.c.
> And by scsi-usb storage bridge what I meant was specifically the code 
> in scsiglue.
> 
> Question about your last mail, do you want to change all the 
> occurrences of DID_ABORT from usb-storage to DID_TIMEOUT?

Put it this way: There's no good reason for changing some of them but not all 
of them.

And if you're going to change them at all, it makes no sense to first set the 
result to DID_ABORT and then change it to DID_TIMEOUT.  You should simply set 
it to DID_TIMEOUT in the first place.

Alan Stern



usb_storage_change_abort_to_timeout.patch
Description: usb_storage_change_abort_to_timeout.patch


RE: [PATCH] usb-storage: scsiglue: Changing the command result

2013-10-15 Thread Vishal Annapurve
Hi Alan,

This issue seems more related to the devices using SCSI protocol and the
changes otherwise will be at more places giving the same end result.

I think as the comment says over the command_abort function,
intentional result change should only happen in case of timeout.

Regards,
Vishal

-Original Message-
From: Alan Stern [mailto:st...@rowland.harvard.edu] 
Sent: Tuesday, October 15, 2013 5:55 PM
To: Ming Lei
Cc: Vishal Annapurve; Linux Kernel Mailing List; linux-usb
Subject: Re: [PATCH] usb-storage: scsiglue: Changing the command result

On Tue, 15 Oct 2013, Ming Lei wrote:

> On Thu, Sep 26, 2013 at 7:35 PM, Vishal Annapurve  
> wrote:
> > Hi,
> >
> > There was a recent commit in mainline for the scsi devices which do 
> > not respond properly to medium access command:
> >
> > commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8
> >
> > [SCSI] Handle disk devices which can not process medium access 
> > commands We have experienced several devices which fail in a fashion 
> > we do not currently handle gracefully in SCSI. After a failure these 
> > devices will respond to the SCSI primary command set (INQUIRY, TEST 
> > UNIT READY, etc.) but any command accessing the storage medium will time 
> > out.
> >
> > I came across a USB drive which showed similar problem and what I 
> > see is usb storage is still not able to cope with such devices properly.
> >
> > The control flow downwards is like:
> > scsi_times_out --> Setting cmd->result as DID_TIME_OUT 
> > scsi_error_handler scsi_unjam_host scsi_eh_abort_cmds
> > command_abort(sets US_FLIDX_TIMED_OUT for us->dflags
> >   calls stop_transport,
> >   and waits for)usb_stor_control_thread
> > (which is waiting for
> >
> > transport call to return inside
> >
> > usb_stor_invoke_transport)
> >
> > both usb_stor_control_thread and
> >
> > usb_stor_invoke_transport
> > check for 
> > us->dflags timed_out bit and
> >  set the result 
> > as DID_ABORT
> >  and signal 
> > completion for command_abort
> >  to complete 
> > ..
> > sd_eh_action
> > checks for cmd->result and finds out that it's DID_ABORT rather than 
> > DID_TIME_OUT.
> >
> > This patch updates the command result to be TIME_OUT explicitly 
> > before returning from command_abort in scsiglue.c.
> >
> > I would like to know if this patch can work out for such USB Storage 
> > devices? What would be the better way to do the same?
> 
> Looks your diagnose is correct, and patch should be doable, the only 
> side effect is that previous returned DID_ABORT in srb->result is 
> replaced with DID_TIME_OUT now.
> 
> Another way is to implement .eh_timed_out callback to return different 
> error for the two failure, but looks not a big deal.

Instead of overriding the result code, a better way to do this would simply be 
to change the places where srb->result is currently set to DID_ABORT << 16.  
Just make it DID_TIME_OUT << 16 instead.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] usb-storage: scsiglue: Changing the command result

2013-12-09 Thread Vishal Annapurve

Hi Greg,

Attached are the patches.

Regards,
Vishal

On Monday 09 December 2013 07:29 AM, Greg KH wrote:

On Sat, Nov 16, 2013 at 12:23:50PM +0530, Vishal Annapurve wrote:

Hi,

Here are the updated patches:   

Can you please resend these in a format which I can apply them in
without having to hand-edit all three of them?

thanks,

greg k-h




01-PATCH-usb-storage-Proper-cmd-result-assignment.patch.gz
Description: GNU Zip compressed data


02-keucr-Proper-cmd-result-assignment.patch.gz
Description: GNU Zip compressed data


03-usb-storage-uas-Proper-cmd-result-assignment.patch.gz
Description: GNU Zip compressed data


Re: [PATCH] usb-storage: scsiglue: Changing the command result

2013-12-09 Thread Vishal Annapurve

Hi Greg,

Does this look fine? I will send over other patches
once you confirm.

Patch set 1:
-
[PATCH 1/3] usb: storage: Proper cmd result assignment

This change replaces DID_ABORT with DID_TIMEOUT as a command result
whenever US_FLIDX_TIMED_OUT bit is set.

This change is made to bring USB storage inline with a recent change:

commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8

[SCSI] Handle disk devices which can not process medium access commands
We have experienced several devices which fail in a fashion we do not
currently handle gracefully in SCSI. After a failure these devices will
respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.)
but any command accessing the storage medium will time out.

As the USB storage was setting command result as aborted rather than
timed out, SCSI layer was not recognizing the above mentioned failure
pattern.

Signed-off-by: Vishal Annapurve 
---
 drivers/usb/storage/cypress_atacb.c |  1 +
 drivers/usb/storage/isd200.c|  2 +-
 drivers/usb/storage/transport.c |  8 
 drivers/usb/storage/usb.c   | 10 ++
 4 files changed, 12 insertions(+), 9 deletions(-)
---
diff --git a/drivers/usb/storage/cypress_atacb.c 
b/drivers/usb/storage/cypress_atacb.c

index 8514a2d..3477ca19 100644
--- a/drivers/usb/storage/cypress_atacb.c
+++ b/drivers/usb/storage/cypress_atacb.c
@@ -168,6 +168,7 @@ static void cypress_atacb_passthrough(struct 
scsi_cmnd *srb, struct us_data *us)

  */
 if ((srb->result != (DID_ERROR << 16) &&
 srb->result != (DID_ABORT << 16)) &&
+srb->result != (DID_TIME_OUT << 16) &&
 save_cmnd[2] & 0x20) {
 struct scsi_eh_save ses;
 unsigned char regs[8];
diff --git a/drivers/usb/storage/isd200.c b/drivers/usb/storage/isd200.c
index 599d8bf..ffd5d58 100644
--- a/drivers/usb/storage/isd200.c
+++ b/drivers/usb/storage/isd200.c
@@ -703,7 +703,7 @@ static void isd200_invoke_transport( struct us_data *us,
 /* abort processing: the bulk-only transport requires a reset
  * following an abort */
 Handle_Abort:
-srb->result = DID_ABORT << 16;
+srb->result = DID_TIME_OUT << 16;

 /* permit the reset transfer to take place */
 clear_bit(US_FLIDX_ABORTING, &us->dflags);
diff --git a/drivers/usb/storage/transport.c 
b/drivers/usb/storage/transport.c

index 22c7d43..6a90161 100644
--- a/drivers/usb/storage/transport.c
+++ b/drivers/usb/storage/transport.c
@@ -607,8 +607,8 @@ void usb_stor_invoke_transport(struct scsi_cmnd 
*srb, struct us_data *us)

  * short-circuit all other processing
  */
 if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
-usb_stor_dbg(us, "-- command was aborted\n");
-srb->result = DID_ABORT << 16;
+usb_stor_dbg("-- command was aborted because of timeout\n");
+srb->result = DID_TIME_OUT << 16;
 goto Handle_Errors;
 }

@@ -717,8 +717,8 @@ Retry_Sense:
 scsi_eh_restore_cmnd(srb, &ses);

 if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
-usb_stor_dbg(us, "-- auto-sense aborted\n");
-srb->result = DID_ABORT << 16;
+usb_stor_dbg("-- auto-sense aborted due to timeout\n");
+srb->result = DID_TIME_OUT << 16;

 /* If SANE_SENSE caused this problem, disable it */
 if (sense_size != US_SENSE_SIZE) {
diff --git a/drivers/usb/storage/usb.c b/drivers/usb/storage/usb.c
index 5c4fe07..04a68eb 100644
--- a/drivers/usb/storage/usb.c
+++ b/drivers/usb/storage/usb.c
@@ -325,7 +325,7 @@ static int usb_stor_control_thread(void * __us)

 /* has the command timed out *already* ? */
 if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
-us->srb->result = DID_ABORT << 16;
+us->srb->result = DID_TIME_OUT << 16;
 goto SkipForAbort;
 }

@@ -379,7 +379,8 @@ static int usb_stor_control_thread(void * __us)
 scsi_lock(host);

 /* indicate that the command is done */
-if (us->srb->result != DID_ABORT << 16) {
+if ((us->srb->result != DID_ABORT << 16) &&
+(us->srb->result != DID_TIME_OUT << 16)) {
 usb_stor_dbg(us, "scsi cmd done, result=0x%x\n",
  us->srb->result);
 us->srb->scsi_done(us->srb);
@@ -390,8 +391,9 @@ SkipForAbort:

 /* If an abort request was received we need to signal that
  * the abort has finished.  The proper test for this is
- * the TIMED_OUT flag, not srb->result == DID_ABORT, because
- * the timeout might have occurred after the command had
+ * the TIMED_OUT flag, not srb->result == DID_ABORT or
+ * srb->result =

Re: [PATCH] usb-storage: scsiglue: Changing the command result

2013-12-10 Thread Vishal Annapurve
Hi Greg,

There was problem with my email client.
Hope now it is fine.

---
[PATCH 1/3] usb: storage: Proper cmd result assignment

This change replaces DID_ABORT with DID_TIMEOUT as a command result
whenever US_FLIDX_TIMED_OUT bit is set.

This change is made to bring USB storage inline with a recent change:

commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8

[SCSI] Handle disk devices which can not process medium access commands
We have experienced several devices which fail in a fashion we do not
currently handle gracefully in SCSI. After a failure these devices will
respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.)
but any command accessing the storage medium will time out.

As the USB storage was setting command result as aborted rather than
timed out, SCSI layer was not recognizing the above mentioned failure
pattern.

Signed-off-by: Vishal Annapurve 
---
 drivers/usb/storage/cypress_atacb.c |  1 +
 drivers/usb/storage/isd200.c|  2 +-
 drivers/usb/storage/transport.c |  8 
 drivers/usb/storage/usb.c   | 10 ++
 4 files changed, 12 insertions(+), 9 deletions(-)
---
diff --git a/drivers/usb/storage/cypress_atacb.c 
b/drivers/usb/storage/cypress_atacb.c
index 8514a2d..3477ca19 100644
--- a/drivers/usb/storage/cypress_atacb.c
+++ b/drivers/usb/storage/cypress_atacb.c
@@ -168,6 +168,7 @@ static void cypress_atacb_passthrough(struct scsi_cmnd 
*srb, struct us_data *us)
 */
if ((srb->result != (DID_ERROR << 16) &&
srb->result != (DID_ABORT << 16)) &&
+   srb->result != (DID_TIME_OUT << 16) &&
save_cmnd[2] & 0x20) {
struct scsi_eh_save ses;
unsigned char regs[8];
diff --git a/drivers/usb/storage/isd200.c b/drivers/usb/storage/isd200.c
index 599d8bf..ffd5d58 100644
--- a/drivers/usb/storage/isd200.c
+++ b/drivers/usb/storage/isd200.c
@@ -703,7 +703,7 @@ static void isd200_invoke_transport( struct us_data *us,
/* abort processing: the bulk-only transport requires a reset
 * following an abort */
Handle_Abort:
-   srb->result = DID_ABORT << 16;
+   srb->result = DID_TIME_OUT << 16;

/* permit the reset transfer to take place */
clear_bit(US_FLIDX_ABORTING, &us->dflags);
diff --git a/drivers/usb/storage/transport.c b/drivers/usb/storage/transport.c
index 22c7d43..6a90161 100644
--- a/drivers/usb/storage/transport.c
+++ b/drivers/usb/storage/transport.c
@@ -607,8 +607,8 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, 
struct us_data *us)
 * short-circuit all other processing
 */
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
-   usb_stor_dbg(us, "-- command was aborted\n");
-   srb->result = DID_ABORT << 16;
+   usb_stor_dbg("-- command was aborted because of timeout\n");
+   srb->result = DID_TIME_OUT << 16;
goto Handle_Errors;
}

@@ -717,8 +717,8 @@ Retry_Sense:
scsi_eh_restore_cmnd(srb, &ses);

if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
-   usb_stor_dbg(us, "-- auto-sense aborted\n");
-   srb->result = DID_ABORT << 16;
+   usb_stor_dbg("-- auto-sense aborted due to timeout\n");
+   srb->result = DID_TIME_OUT << 16;

/* If SANE_SENSE caused this problem, disable it */
if (sense_size != US_SENSE_SIZE) {
diff --git a/drivers/usb/storage/usb.c b/drivers/usb/storage/usb.c
index 5c4fe07..04a68eb 100644
--- a/drivers/usb/storage/usb.c
+++ b/drivers/usb/storage/usb.c
@@ -325,7 +325,7 @@ static int usb_stor_control_thread(void * __us)

/* has the command timed out *already* ? */
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
-   us->srb->result = DID_ABORT << 16;
+   us->srb->result = DID_TIME_OUT << 16;
goto SkipForAbort;
}

@@ -379,7 +379,8 @@ static int usb_stor_control_thread(void * __us)
scsi_lock(host);

/* indicate that the command is done */
-   if (us->srb->result != DID_ABORT << 16) {
+   if ((us->srb->result != DID_ABORT << 16) &&
+   (us->srb->result != DID_TIME_OUT << 16)) {
usb_stor_dbg(us, "scsi cmd done, result=0x%x\n",
 us->srb->result);
us->srb->scsi_done(us->srb);
@@ -390,8 +391,9 @@ SkipForAbort:

/* If an abort request was received we need 

RE: [PATCH] usb-storage: scsiglue: Changing the command result

2013-10-26 Thread Vishal Annapurve
Hi Alan,

Here is the new patch:

From: Vishal Annapurve 
Date: Sat, 26 Oct 2013 21:10:11 +0530
Subject: [PATCH] usb: storage: Proper cmd result assignment

This change replaces DID_ABORT with DID_TIMEOUT as a command result
whenever US_FLIDX_TIMED_OUT bit is set.

This change is made to bring USB storage inline with a recent change:

commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8

[SCSI] Handle disk devices which can not process medium access commands
We have experienced several devices which fail in a fashion we do not
currently handle gracefully in SCSI. After a failure these devices will
respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.)
but any command accessing the storage medium will time out.

As the USB storage was setting command result as aborted rather than
timed out, SCSI layer was not recognizing the above mentioned failure
pattern.

Change-Id: Ic58e2247fed11649f4dbea56382354ba2fe0be1b
---
 drivers/staging/keucr/transport.c   | 6 +++---
 drivers/staging/keucr/usb.c | 5 +++--
 drivers/usb/storage/cypress_atacb.c | 1 +
 drivers/usb/storage/isd200.c| 2 +-
 drivers/usb/storage/transport.c | 8 
 drivers/usb/storage/usb.c   | 9 +
 6 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/drivers/staging/keucr/transport.c 
b/drivers/staging/keucr/transport.c
index 1a8837d..ac0e3c2 100644
--- a/drivers/staging/keucr/transport.c
+++ b/drivers/staging/keucr/transport.c
@@ -312,7 +312,7 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, 
struct us_data *us)
we need to short-circuit all other processing */
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
/* pr_info("-- command was aborted\n"); */
-   srb->result = DID_ABORT << 16;
+   srb->result = DID_TIME_OUT << 16;
goto Handle_Errors;
}
 
@@ -371,7 +371,7 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, 
struct us_data *us)
 
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
/* pr_info("-- auto-sense aborted\n"); */
-   srb->result = DID_ABORT << 16;
+   srb->result = DID_TIME_OUT << 16;
goto Handle_Errors;
}
if (temp_result != USB_STOR_TRANSPORT_GOOD) {
@@ -447,7 +447,7 @@ void ENE_stor_invoke_transport(struct scsi_cmnd *srb, 
struct us_data *us)
we need to short-circuit all other processing */
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
/* pr_info("-- command was aborted\n"); */
-   srb->result = DID_ABORT << 16;
+   srb->result = DID_TIME_OUT << 16;
goto Handle_Errors;
}
 
diff --git a/drivers/staging/keucr/usb.c b/drivers/staging/keucr/usb.c
index 66aad3a..79405be 100644
--- a/drivers/staging/keucr/usb.c
+++ b/drivers/staging/keucr/usb.c
@@ -200,7 +200,7 @@ static int usb_stor_control_thread(void * __us)
/* has the command timed out *already* ? */
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags))
{
-   us->srb->result = DID_ABORT << 16;
+   us->srb->result = DID_TIME_OUT << 16;
goto SkipForAbort;
}
 
@@ -234,7 +234,8 @@ static int usb_stor_control_thread(void * __us)
scsi_lock(host);
 
/* indicate that the command is done */
-   if (us->srb->result != DID_ABORT << 16)
+   if ((us->srb->result != DID_ABORT << 16) &&
+   (us->srb->result != DID_TIME_OUT << 16))
{
us->srb->scsi_done(us->srb);
}
diff --git a/drivers/usb/storage/cypress_atacb.c 
b/drivers/usb/storage/cypress_atacb.c
index c844718..af24894 100644
--- a/drivers/usb/storage/cypress_atacb.c
+++ b/drivers/usb/storage/cypress_atacb.c
@@ -168,6 +168,7 @@ static void cypress_atacb_passthrough(struct scsi_cmnd 
*srb, struct us_data *us)
 */
if ((srb->result != (DID_ERROR << 16) &&
srb->result != (DID_ABORT << 16)) &&
+   srb->result != (DID_TIME_OUT << 16) &&
save_cmnd[2] & 0x20) {
struct scsi_eh_save ses;
unsigned char regs[8];
diff --git a/drivers/usb/storage/isd200.c b/drivers/usb/storage/isd200.c
index ffc4193..28b688b 100644
--- a/drivers/usb/storage/isd200.c
+++ b/drivers/usb/storage/isd200.c
@@ -705,7 +705,7 @@ static void isd200_invoke_transport( struct us_data *us,
/* abort processing: the bulk-only transport requires a reset
 * following an abort *

RE: [PATCH] usb-storage: scsiglue: Changing the command result

2013-11-15 Thread Vishal Annapurve
Hi,

Here are the updated patches:   

[PATCH 1/3] usb: storage: Proper cmd result assignment

This change replaces DID_ABORT with DID_TIMEOUT as a command result
whenever US_FLIDX_TIMED_OUT bit is set.

This change is made to bring USB storage inline with a recent change:

commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8

[SCSI] Handle disk devices which can not process medium access commands
We have experienced several devices which fail in a fashion we do not
currently handle gracefully in SCSI. After a failure these devices will
respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.)
but any command accessing the storage medium will time out.

As the USB storage was setting command result as aborted rather than
timed out, SCSI layer was not recognizing the above mentioned failure
pattern.

Signed-off-by: Vishal Annapurve 
---
 drivers/usb/storage/cypress_atacb.c |  1 +
 drivers/usb/storage/isd200.c|  2 +-
 drivers/usb/storage/transport.c |  8 
 drivers/usb/storage/usb.c   | 10 ++
 4 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/usb/storage/cypress_atacb.c 
b/drivers/usb/storage/cypress_atacb.c
index 8514a2d..3477ca19 100644
--- a/drivers/usb/storage/cypress_atacb.c
+++ b/drivers/usb/storage/cypress_atacb.c
@@ -168,6 +168,7 @@ static void cypress_atacb_passthrough(struct scsi_cmnd 
*srb, struct us_data *us)
 */
if ((srb->result != (DID_ERROR << 16) &&
srb->result != (DID_ABORT << 16)) &&
+   srb->result != (DID_TIME_OUT << 16) &&
save_cmnd[2] & 0x20) {
struct scsi_eh_save ses;
unsigned char regs[8];
diff --git a/drivers/usb/storage/isd200.c b/drivers/usb/storage/isd200.c
index 599d8bf..ffd5d58 100644
--- a/drivers/usb/storage/isd200.c
+++ b/drivers/usb/storage/isd200.c
@@ -703,7 +703,7 @@ static void isd200_invoke_transport( struct us_data *us,
/* abort processing: the bulk-only transport requires a reset
 * following an abort */
Handle_Abort:
-   srb->result = DID_ABORT << 16;
+   srb->result = DID_TIME_OUT << 16;
 
/* permit the reset transfer to take place */
clear_bit(US_FLIDX_ABORTING, &us->dflags);
diff --git a/drivers/usb/storage/transport.c b/drivers/usb/storage/transport.c
index 22c7d43..6a90161 100644
--- a/drivers/usb/storage/transport.c
+++ b/drivers/usb/storage/transport.c
@@ -607,8 +607,8 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, 
struct us_data *us)
 * short-circuit all other processing
 */
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
-   usb_stor_dbg(us, "-- command was aborted\n");
-   srb->result = DID_ABORT << 16;
+   usb_stor_dbg("-- command was aborted because of timeout\n");
+   srb->result = DID_TIME_OUT << 16;
goto Handle_Errors;
}
 
@@ -717,8 +717,8 @@ Retry_Sense:
scsi_eh_restore_cmnd(srb, &ses);
 
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
-   usb_stor_dbg(us, "-- auto-sense aborted\n");
-   srb->result = DID_ABORT << 16;
+   usb_stor_dbg("-- auto-sense aborted due to timeout\n");
+   srb->result = DID_TIME_OUT << 16;
 
/* If SANE_SENSE caused this problem, disable it */
if (sense_size != US_SENSE_SIZE) {
diff --git a/drivers/usb/storage/usb.c b/drivers/usb/storage/usb.c
index 5c4fe07..04a68eb 100644
--- a/drivers/usb/storage/usb.c
+++ b/drivers/usb/storage/usb.c
@@ -325,7 +325,7 @@ static int usb_stor_control_thread(void * __us)
 
/* has the command timed out *already* ? */
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
-   us->srb->result = DID_ABORT << 16;
+   us->srb->result = DID_TIME_OUT << 16;
goto SkipForAbort;
}
 
@@ -379,7 +379,8 @@ static int usb_stor_control_thread(void * __us)
scsi_lock(host);
 
/* indicate that the command is done */
-   if (us->srb->result != DID_ABORT << 16) {
+   if ((us->srb->result != DID_ABORT << 16) &&
+   (us->srb->result != DID_TIME_OUT << 16)) {
usb_stor_dbg(us, "scsi cmd done, result=0x%x\n",
 us->srb->result);
us->srb->scsi_done(us->srb);
@@ -390,8 +391,9 @@ SkipForAbort:
 
/* If an abort request was received we need to signal that
 *

RE: [PATCH] usb-storage: scsiglue: Changing the command result

2013-11-15 Thread Vishal Annapurve
Hi,

[PATCH 2/3] keucr: Proper cmd result assignment

This change replaces DID_ABORT with DID_TIMEOUT as a command result
whenever US_FLIDX_TIMED_OUT bit is set.

This change is made to bring USB storage inline with a recent change:

commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8

[SCSI] Handle disk devices which can not process medium access commands
We have experienced several devices which fail in a fashion we do not
currently handle gracefully in SCSI. After a failure these devices will
respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.)
but any command accessing the storage medium will time out.

As the USB storage was setting command result as aborted rather than
timed out, SCSI layer was not recognizing the above mentioned failure
pattern.

Signed-off-by: Vishal Annapurve 
---
 drivers/staging/keucr/transport.c | 6 +++---
 drivers/staging/keucr/usb.c   | 5 +++--
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/keucr/transport.c 
b/drivers/staging/keucr/transport.c
index aeb2186..6f61c20 100644
--- a/drivers/staging/keucr/transport.c
+++ b/drivers/staging/keucr/transport.c
@@ -353,7 +353,7 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, 
struct us_data *us)
we need to short-circuit all other processing */
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
/* pr_info("-- command was aborted\n"); */
-   srb->result = DID_ABORT << 16;
+   srb->result = DID_TIME_OUT << 16;
goto Handle_Errors;
}
 
@@ -412,7 +412,7 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, 
struct us_data *us)
 
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
/* pr_info("-- auto-sense aborted\n"); */
-   srb->result = DID_ABORT << 16;
+   srb->result = DID_TIME_OUT << 16;
goto Handle_Errors;
}
if (temp_result != USB_STOR_TRANSPORT_GOOD) {
@@ -488,7 +488,7 @@ void ENE_stor_invoke_transport(struct scsi_cmnd *srb, 
struct us_data *us)
we need to short-circuit all other processing */
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
/* pr_info("-- command was aborted\n"); */
-   srb->result = DID_ABORT << 16;
+   srb->result = DID_TIME_OUT << 16;
goto Handle_Errors;
}
 
diff --git a/drivers/staging/keucr/usb.c b/drivers/staging/keucr/usb.c
index a84ee63..26ec64c 100644
--- a/drivers/staging/keucr/usb.c
+++ b/drivers/staging/keucr/usb.c
@@ -181,7 +181,7 @@ static int usb_stor_control_thread(void *__us)
 
/* has the command timed out *already* ? */
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
-   us->srb->result = DID_ABORT << 16;
+   us->srb->result = DID_TIME_OUT << 16;
goto SkipForAbort;
}
 
@@ -209,7 +209,8 @@ static int usb_stor_control_thread(void *__us)
scsi_lock(host);
 
/* indicate that the command is done */
-   if (us->srb->result != DID_ABORT << 16) {
+   if ((us->srb->result != DID_ABORT << 16) &&
+   (us->srb->result != DID_TIME_OUT << 16)) {
us->srb->scsi_done(us->srb);
} else {
 SkipForAbort:
-- 
1.8.4.2

Regards,
Vishal

-Original Message-
From: Vishal Annapurve 
Sent: Saturday, November 16, 2013 12:24 PM
To: 'Alan Stern'
Cc: Ming Lei; Linux Kernel Mailing List; linux-usb
Subject: RE: [PATCH] usb-storage: scsiglue: Changing the command result

Hi,

Here are the updated patches:   

[PATCH 1/3] usb: storage: Proper cmd result assignment

This change replaces DID_ABORT with DID_TIMEOUT as a command result whenever 
US_FLIDX_TIMED_OUT bit is set.

This change is made to bring USB storage inline with a recent change:

commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8

[SCSI] Handle disk devices which can not process medium access commands We have 
experienced several devices which fail in a fashion we do not currently handle 
gracefully in SCSI. After a failure these devices will respond to the SCSI 
primary command set (INQUIRY, TEST UNIT READY, etc.) but any command accessing 
the storage medium will time out.

As the USB storage was setting command result as aborted rather than timed out, 
SCSI layer was not recognizing the above mentioned failure pattern.

Signed-off-by: Vishal Annapurve 
---
 drivers/usb/storage/cypress_atacb.c |  1 +
 drivers/usb/storage/isd200.c|  2 +-
 drivers/usb/storage/transport.c |  8 
 drivers/usb/storage/usb.c   | 10 ++
 4 files changed, 12 insert

RE: [PATCH] usb-storage: scsiglue: Changing the command result

2013-11-15 Thread Vishal Annapurve
Hi,

[PATCH 3/3] usb: storage: uas: Proper cmd result assignment

This change replaces DID_ABORT with DID_TIMEOUT as a command result
whenever US_FLIDX_TIMED_OUT bit is set.

This change is made to bring USB storage inline with a recent change:

commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8

[SCSI] Handle disk devices which can not process medium access commands
We have experienced several devices which fail in a fashion we do not
currently handle gracefully in SCSI. After a failure these devices will
respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.)
but any command accessing the storage medium will time out.

As the USB storage was setting command result as aborted rather than
timed out, SCSI layer was not recognizing the above mentioned failure
pattern.

Signed-off-by: Vishal Annapurve 
---
 drivers/usb/storage/uas.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
index d966b59..3a4a44e 100644
--- a/drivers/usb/storage/uas.c
+++ b/drivers/usb/storage/uas.c
@@ -271,7 +271,10 @@ static int uas_try_complete(struct scsi_cmnd *cmnd, const 
char *caller)
usb_free_urb(cmdinfo->data_out_urb);
if (cmdinfo->state & COMMAND_ABORTED) {
scmd_printk(KERN_INFO, cmnd, "abort completed\n");
-   cmnd->result = DID_ABORT << 16;
+   /* Better to check for US_FLIDX_TIMED_OUT
+* bit settings.
+*/
+   cmnd->result = DID_TIME_OUT << 16;
}
cmnd->scsi_done(cmnd);
return 0;
-- 
1.8.4.2

Regards,
Vishal

-Original Message-
From: Vishal Annapurve 
Sent: Saturday, November 16, 2013 12:25 PM
To: 'Alan Stern'
Cc: 'Ming Lei'; 'Linux Kernel Mailing List'; 'linux-usb'
Subject: RE: [PATCH] usb-storage: scsiglue: Changing the command result

Hi,

[PATCH 2/3] keucr: Proper cmd result assignment

This change replaces DID_ABORT with DID_TIMEOUT as a command result whenever 
US_FLIDX_TIMED_OUT bit is set.

This change is made to bring USB storage inline with a recent change:

commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8

[SCSI] Handle disk devices which can not process medium access commands We have 
experienced several devices which fail in a fashion we do not currently handle 
gracefully in SCSI. After a failure these devices will respond to the SCSI 
primary command set (INQUIRY, TEST UNIT READY, etc.) but any command accessing 
the storage medium will time out.

As the USB storage was setting command result as aborted rather than timed out, 
SCSI layer was not recognizing the above mentioned failure pattern.

Signed-off-by: Vishal Annapurve 
---
 drivers/staging/keucr/transport.c | 6 +++---
 drivers/staging/keucr/usb.c   | 5 +++--
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/keucr/transport.c 
b/drivers/staging/keucr/transport.c
index aeb2186..6f61c20 100644
--- a/drivers/staging/keucr/transport.c
+++ b/drivers/staging/keucr/transport.c
@@ -353,7 +353,7 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, 
struct us_data *us)
we need to short-circuit all other processing */
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
/* pr_info("-- command was aborted\n"); */
-   srb->result = DID_ABORT << 16;
+   srb->result = DID_TIME_OUT << 16;
goto Handle_Errors;
}
 
@@ -412,7 +412,7 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, 
struct us_data *us)
 
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
/* pr_info("-- auto-sense aborted\n"); */
-   srb->result = DID_ABORT << 16;
+   srb->result = DID_TIME_OUT << 16;
goto Handle_Errors;
}
if (temp_result != USB_STOR_TRANSPORT_GOOD) { @@ -488,7 +488,7 
@@ void ENE_stor_invoke_transport(struct scsi_cmnd *srb, struct us_data *us)
we need to short-circuit all other processing */
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
/* pr_info("-- command was aborted\n"); */
-   srb->result = DID_ABORT << 16;
+   srb->result = DID_TIME_OUT << 16;
goto Handle_Errors;
}
 
diff --git a/drivers/staging/keucr/usb.c b/drivers/staging/keucr/usb.c index 
a84ee63..26ec64c 100644
--- a/drivers/staging/keucr/usb.c
+++ b/drivers/staging/keucr/usb.c
@@ -181,7 +181,7 @@ static int usb_stor_control_thread(void *__us)
 
/* has the command timed out *already* ? */
if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) {
-   us->srb->result = DID_ABORT << 16;
+

Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private

2024-10-16 Thread Vishal Annapurve
On Wed, Oct 16, 2024 at 2:20 PM David Hildenbrand  wrote:
>
> >> I also don't know how you treat things like folio_test_hugetlb() on
> >> possible assumptions that the VMA must be a hugetlb vma.  I'd confess I
> >> didn't yet check the rest of the patchset yet - reading a large series
> >> without a git tree is sometimes challenging to me.
> >>
> >
> > I'm thinking to basically never involve folio_test_hugetlb(), and the
> > VMAs used by guest_memfd will also never be a HugeTLB VMA. That's
> > because only the HugeTLB allocator is used, but by the time the folio is
> > mapped to userspace, it would have already have been split. After the
> > page is split, the folio loses its HugeTLB status. guest_memfd folios
> > will never be mapped to userspace while they still have a HugeTLB
> > status.
>
> We absolutely must convert these hugetlb folios to non-hugetlb folios.
>
> That is one of the reasons why I raised at LPC that we should focus on
> leaving hugetlb out of the picture and rather have a global pool, and
> the option to move folios from the global pool back and forth to hugetlb
> or to guest_memfd.
>
> How exactly that would look like is TBD.
>
> For the time being, I think we could add a "hack" to take hugetlb folios
> from hugetlb for our purposes, but we would absolutely have to convert
> them to non-hugetlb folios, especially when we split them to small
> folios and start using the mapcount. But it doesn't feel quite clean.

As hugepage folios need to be split up in order to support backing
CoCo VMs with hugepages, I would assume any folio based hugepage
memory allocation will need to go through split/merge cycles through
the guest memfd lifetime.

Plan through next RFC series is to abstract out the hugetlb folio
management within guest_memfd so that any hugetlb specific logic is
cleanly separated out and allows guest memfd to allocate memory from
other hugepage allocators in the future.

>
> Simply starting with a separate global pool (e.g., boot-time allocation
> similar to as done by hugetlb, or CMA) might be cleaner, and a lot of
> stuff could be factored out from hugetlb code to achieve that.

I am not sure if a separate global pool necessarily solves all the
issues here unless we come up with more concrete implementation
details. One of the concerns was the ability of implementing/retaining
HVO while transferring memory between the separate global pool and
hugetlb pool i.e. whether it can seamlessly serve all hugepage users
on the host. Another question could be whether the separate
pool/allocator simplifies the split/merge operations at runtime.

>
> --
> Cheers,
>
> David / dhildenb
>



Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private

2024-10-17 Thread Vishal Annapurve
On Thu, Oct 17, 2024 at 10:46 PM Jason Gunthorpe  wrote:
>
> On Thu, Oct 17, 2024 at 07:11:46PM +0200, David Hildenbrand wrote:
> > On 17.10.24 18:47, Jason Gunthorpe wrote:
> > > On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:
> > >
> > > > My question was more torwards whether gmemfd could still expose the
> > > > possibility to be used in VA forms to other modules that may not support
> > > > fd+offsets yet.
> > >
> > > I keep hearing they don't want to support page pinning on a guestmemfd
> > > mapping, so VA based paths could not work.
> >
> > For shared pages it absolutely must work. That's what I keep hearing :)
>
> Oh that's confusing. I assume non longterm pins desired on shared
> pages though??
>
> Jason

For hugepage support to work, longterm pins on guest private pages
need to be avoided [1], If this somehow was the cause of any confusion
here.

[1] 
https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
(slide 12)



Re: [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup

2024-09-20 Thread Vishal Annapurve
On Wed, Sep 11, 2024 at 1:44 AM Ackerley Tng  wrote:
>
> ...
> +}
> +
> +static void kvm_gmem_evict_inode(struct inode *inode)
> +{
> +   u64 flags = (u64)inode->i_private;
> +
> +   if (flags & KVM_GUEST_MEMFD_HUGETLB)
> +   kvm_gmem_hugetlb_teardown(inode);
> +   else
> +   truncate_inode_pages_final(inode->i_mapping);
> +
> +   clear_inode(inode);
> +}
> +
>  static const struct super_operations kvm_gmem_super_operations = {
> .statfs = simple_statfs,
> +   .evict_inode= kvm_gmem_evict_inode,

Ackerley, can we use free_inode[1] callback to free any special
metadata associated with the inode instead of relying on
super_operations?

[1] https://elixir.bootlin.com/linux/v6.11/source/include/linux/fs.h#L719

> ...


>
> if (size <= 0 || !PAGE_ALIGNED(size))
> return -EINVAL;
> --
> 2.46.0.598.g6f2099f65c-goog
>



Re: [PATCH v7 0/3] x86/tdx: Fix HLT logic execution for TDX VMs

2025-03-21 Thread Vishal Annapurve
On Thu, Feb 27, 2025 at 5:44 PM Vishal Annapurve  wrote:
>
> Direct HLT instruction execution causes #VEs for TDX VMs which is routed
> to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow
> so IRQs need to remain disabled until the TDCALL to ensure that pending
> IRQs are correctly treated as wake events. As per current TDX spec, HLT
> #VE handler doesn't have access to interruptibility state to selectively
> enable interrupts, it ends up enabling interrupts during #VE handling
> before the TDCALL is executed.
>
> Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
> effectively solved this issue for idle routines by defining TDX specific
> idle routine which directly invokes TDCALL while keeping interrupts
> disabled, but missed handling arch_safe_halt(). This series intends to fix
> arch_safe_halt() execution for TDX VMs.
>
> Changes introduced by the series include:
> - Move *halt() variants outside CONFIG_PARAVIRT_XXL and under
>   CONFIG_PARAVIRT [1].
> - Add explicit dependency on CONFIG_PARAVIRT for TDX VMs.
> - Route "sti; hlt" sequences via tdx_safe_halt() for reliability.
> - Route "hlt" sequences via tdx_halt() to avoid unnecessary #VEs.
> - Warn and fail emulation if HLT #VE emulation executes with interrupts
>   enabled.
>
> Changes since v6:
> 1) Addressed Kirills's comments.
> 2) Fixed a build failure.
>
> v6: 
> https://lore.kernel.org/lkml/20250225004704.603652-1-vannapu...@google.com/
>
> Kirill A. Shutemov (1):
>   x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT
>
> Vishal Annapurve (2):
>   x86/tdx: Fix arch_safe_halt() execution for TDX VMs
>   x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling
>
>  arch/x86/Kconfig  |  1 +
>  arch/x86/coco/tdx/tdx.c   | 34 ++-
>  arch/x86/include/asm/irqflags.h   | 40 +++
>  arch/x86/include/asm/paravirt.h   | 20 +++---
>  arch/x86/include/asm/paravirt_types.h |  3 +-
>  arch/x86/include/asm/tdx.h|  4 +--
>  arch/x86/kernel/paravirt.c| 14 ++
>  arch/x86/kernel/process.c |  2 +-
>  8 files changed, 78 insertions(+), 40 deletions(-)
>
> --
> 2.48.1.711.g2feabab25a-goog
>

Just a ping for review of this series if anything more needs to be discussed.



Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-24 Thread Vishal Annapurve
On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao  wrote:
>
> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
> >
> >
> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> > >>> Yan Zhao  writes:
> > >>>
> >  On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
> > > +/*
> > > + * Allocates and then caches a folio in the filemap. Returns a folio 
> > > with
> > > + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> > > + */
> > > +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct 
> > > inode *inode,
> > > +   pgoff_t 
> > > index)
> > > +{
> > > +   struct kvm_gmem_hugetlb *hgmem;
> > > +   pgoff_t aligned_index;
> > > +   struct folio *folio;
> > > +   int nr_pages;
> > > +   int ret;
> > > +
> > > +   hgmem = kvm_gmem_hgmem(inode);
> > > +   folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> > > +   if (IS_ERR(folio))
> > > +   return folio;
> > > +
> > > +   nr_pages = 1UL << huge_page_order(hgmem->h);
> > > +   aligned_index = round_down(index, nr_pages);
> >  Maybe a gap here.
> > 
> >  When a guest_memfd is bound to a slot where slot->base_gfn is not 
> >  aligned to
> >  2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> >  corresponding GFN is not 2M/1G aligned.
> > >>>
> > >>> Thanks for looking into this.
> > >>>
> > >>> In 1G page support for guest_memfd, the offset and size are always
> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
> > >>> slot->npages may not be hugepage aligned.
> > >>>
> > 
> >  However, TDX requires that private huge pages be 2M aligned in GFN.
> > 
> > >>>
> > >>> IIUC other factors also contribute to determining the mapping level in
> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
> > >>> in kvm_x86_ops.
> > >>>
> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
> > >>> will track that and not allow faulting into guest page tables at higher
> > >>> granularity.
> > >>
> > >> lpage_info only checks the alignments of slot->base_gfn and
> > >> slot->base_gfn + npages. e.g.,
> > >>
> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
> >
> > Should it be?
> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
> Right. Good catch. Thanks!
>
> Let me update the example as below:
> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
>
> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
>
> lpage_info indicates that a 2MB mapping is alllowed to cover GPA 4MB and GPA
> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
> 2MB folios, whose physical addresses may not be contiguous.
>
> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
> However, guest_memfd just allocates the same 2MB folio for both faults.
>
>
> >
> > >>
> > >>   -
> > >>   |  |  |  |  |  |  |  |  |
> > >>   8K2M 2M+8K  4M  4M+8K 6M  6M+8K 8M  8M+8K
> > >>
> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
> > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
> > > Sorry, sent too fast this morning. The example is not right. The correct
> > > one is:
> > >
> > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
> > > KVM will create a 2M mapping for them.
> > >
> > > However, in guest_memfd, GFN 4M and GFN 4M+16

Re: [PATCH REGRESSION-FIX] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT

2025-04-15 Thread Vishal Annapurve
> ...
> 6.12.23 fails to build with the following error if CONFIG_XEN_PV is
> not set:
>
> arch/x86/coco/tdx/tdx.c: In function ‘tdx_early_init’:
> arch/x86/coco/tdx/tdx.c:1080:19: error: ‘struct pv_irq_ops’ has no member
> named ‘safe_halt’
>  1080 | pv_ops.irq.safe_halt = tdx_safe_halt;
>   |   ^
> arch/x86/coco/tdx/tdx.c:1081:19: error: ‘struct pv_irq_ops’ has no member
> named ‘halt’
>  1081 | pv_ops.irq.halt = tdx_halt;
>   |   ^
>
> This is because XEN_PV selects PARAVIRT_XXL, and 'safe_halt' and
> 'halt' are only defined for pv_irq_ops if PARAVIRT_XXL is defined.
>
> The build breakage was introduced in 6.12.23 by stable commit
> 805e3ce5e0e3 which is a backport of 9f98a4f4e721 ("x86/tdx: Fix
> arch_safe_halt() execution for TDX VMs").
>
> Consider picking up upstream commit 22cc5ca5de52 ("x86/paravirt:
> Move halt paravirt calls under CONFIG_PARAVIRT") for stable 6.12.y
> which fixes the build regression by moving 'safe_halt' and 'halt'
> out from under the PARAVIRT_XXL config.
>
> This patch is 22cc5ca5de52 backported to 6.12.23.  There were a
> couple of merge conflicts due to the missing upstream commits below:
>

Thanks Brett for looking into this. I have posted a similar patch [1]
and I see that the issue you reported is also being discussed already
[2].

[1] 
https://lore.kernel.org/stable/20250408132341.4175633-1-vannapu...@google.com/
[2] https://lore.kernel.org/stable/20250410180423.GA3430900@ax162/



Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-28 Thread Vishal Annapurve
On Sun, Apr 27, 2025 at 6:08 PM Yan Zhao  wrote:
>
> On Fri, Apr 25, 2025 at 03:45:20PM -0700, Ackerley Tng wrote:
> > Yan Zhao  writes:
> > ...
> > >
> > > For some memory region, e.g., "pc.ram", it's divided into 2 parts:
> > > - one with offset 0, size 0x8000(2G),
> > >   positioned at GPA 0, which is below GPA 4G;
> > > - one with offset 0x8000(2G), size 0x8000(2G),
> > >   positioned at GPA 0x1(4G), which is above GPA 4G.
> > >
> > > For the second part, its slot->base_gfn is 0x1, while 
> > > slot->gmem.pgoff
> > > is 0x8000.
> > >
> >
> > Nope I don't mean to enforce that they are equal, we just need the
> > offsets within the page to be equal.
> >
> > I edited Vishal's code snippet, perhaps it would help explain better:
> >
> > page_size is the size of the hugepage, so in our example,
> >
> >   page_size = SZ_2M;
> >   page_mask = ~(page_size - 1);
> page_mask = page_size - 1  ?
>
> >   offset_within_page = slot->gmem.pgoff & page_mask;
> >   gfn_within_page = (slot->base_gfn << PAGE_SHIFT) & page_mask;
> >
> > We will enforce that
> >
> >   offset_within_page == gfn_within_page;
> For "pc.ram", if it has 2.5G below 4G, it would be configured as follows
> - slot 1: slot->gmem.pgoff=0, base GPA 0, size=2.5G
> - slot 2: slot->gmem.pgoff=2.5G, base GPA 4G, size=1.5G
>
> When binding these two slots to the same guest_memfd created with flag
> KVM_GUEST_MEMFD_HUGE_1GB:
> - binding the 1st slot will succeed;
> - binding the 2nd slot will fail.
>
> What options does userspace have in this scenario?

Userspace can create new gmem files that have aligned offsets. But I
see your point, enforcing alignment at binding time will lead to
wastage of memory. i.e. Your example above could be reworked to have:
- slot 1: slot->gmem.pgoff=0, base GPA 0, size=2.5G, gmem_fd = x, gmem_size = 3G
- slot 2: slot->gmem.pgoff=0, base GPA 4G, size=1.5G, gmem_fd = y,
gmem_size = 2G

This will waste 1G of memory as gmem files will have to be hugepage aligned.

> It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff
> isn't ideal either.
>
> What about something similar as below?
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d2feacd14786..87c33704a748 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct 
> kvm_memory_slot *slot,
> }
>
> *pfn = folio_file_pfn(folio, index);
> -   if (max_order)
> -   *max_order = folio_order(folio);
> +   if (max_order) {
> +   int order;
> +
> +   order = folio_order(folio);
> +
> +   while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & 
> ((1 << order) - 1)))

This sounds better. Userspace will need to avoid this in general or
keep such ranges short so that most of the guest memory ranges can be
mapped at hugepage granularity. So maybe a pr_warn could be spewed
during binding that the alignment is not optimal.

> +   order--;
> +
> +   *max_order = order;
> +   }
>
> *is_prepared = folio_test_uptodate(folio);
> return folio;
>
>
> > >> Adding checks at binding time will allow hugepage-unaligned offsets (to
> > >> be at parity with non-guest_memfd backing memory) but still fix this
> > >> issue.
> > >>
> > >> lpage_info will make sure that ranges near the bounds will be
> > >> fragmented, but the hugepages in the middle will still be mappable as
> > >> hugepages.
> > >>
> > >> [1] 
> > >> https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg



[PATCH v7 0/3] x86/tdx: Fix HLT logic execution for TDX VMs

2025-02-27 Thread Vishal Annapurve
Direct HLT instruction execution causes #VEs for TDX VMs which is routed
to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow
so IRQs need to remain disabled until the TDCALL to ensure that pending
IRQs are correctly treated as wake events. As per current TDX spec, HLT
#VE handler doesn't have access to interruptibility state to selectively
enable interrupts, it ends up enabling interrupts during #VE handling
before the TDCALL is executed.
 
Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
effectively solved this issue for idle routines by defining TDX specific
idle routine which directly invokes TDCALL while keeping interrupts
disabled, but missed handling arch_safe_halt(). This series intends to fix
arch_safe_halt() execution for TDX VMs.

Changes introduced by the series include:
- Move *halt() variants outside CONFIG_PARAVIRT_XXL and under
  CONFIG_PARAVIRT [1].
- Add explicit dependency on CONFIG_PARAVIRT for TDX VMs.
- Route "sti; hlt" sequences via tdx_safe_halt() for reliability.
- Route "hlt" sequences via tdx_halt() to avoid unnecessary #VEs.
- Warn and fail emulation if HLT #VE emulation executes with interrupts
  enabled.

Changes since v6:
1) Addressed Kirills's comments.
2) Fixed a build failure.

v6: https://lore.kernel.org/lkml/20250225004704.603652-1-vannapu...@google.com/
 
Kirill A. Shutemov (1):
  x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT

Vishal Annapurve (2):
  x86/tdx: Fix arch_safe_halt() execution for TDX VMs
  x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling

 arch/x86/Kconfig  |  1 +
 arch/x86/coco/tdx/tdx.c   | 34 ++-
 arch/x86/include/asm/irqflags.h   | 40 +++
 arch/x86/include/asm/paravirt.h   | 20 +++---
 arch/x86/include/asm/paravirt_types.h |  3 +-
 arch/x86/include/asm/tdx.h|  4 +--
 arch/x86/kernel/paravirt.c| 14 ++
 arch/x86/kernel/process.c |  2 +-
 8 files changed, 78 insertions(+), 40 deletions(-)

-- 
2.48.1.711.g2feabab25a-goog




[PATCH v7 2/3] x86/tdx: Fix arch_safe_halt() execution for TDX VMs

2025-02-27 Thread Vishal Annapurve
Direct HLT instruction execution causes #VEs for TDX VMs which is routed
to hypervisor via TDCALL. If HLT is executed in STI-shadow, resulting #VE
handler will enable interrupts before TDCALL is routed to hypervisor
leading to missed wakeup events, as current TDX spec doesn't expose
interruptibility state information to allow #VE handler to selectively
enable interrupts.

Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
prevented the idle routines from executing HLT instruction in STI-shadow.
But it missed the paravirt routine which can be reached via this path
as an example:
kvm_wait()   =>
safe_halt()  =>
raw_safe_halt()  =>
arch_safe_halt() =>
irq.safe_halt()  =>
pv_native_safe_halt()

To reliably handle arch_safe_halt() for TDX VMs, introduce explicit
dependency on CONFIG_PARAVIRT and override paravirt halt()/safe_halt()
routines with TDX-safe versions that execute direct TDCALL and needed
interrupt flag updates. Executing direct TDCALL brings in additional
benefit of avoiding HLT related #VEs altogether.

Cc: sta...@vger.kernel.org
Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
Reviewed-by: Kirill A. Shutemov 
Signed-off-by: Vishal Annapurve 
---
 arch/x86/Kconfig   |  1 +
 arch/x86/coco/tdx/tdx.c| 26 +-
 arch/x86/include/asm/tdx.h |  4 ++--
 arch/x86/kernel/process.c  |  2 +-
 4 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index be2c311f5118..933c046e8966 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -902,6 +902,7 @@ config INTEL_TDX_GUEST
depends on X86_64 && CPU_SUP_INTEL
depends on X86_X2APIC
depends on EFI_STUB
+   depends on PARAVIRT
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
select X86_MCE
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 32809a06dab4..6aad910d119d 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -398,7 +399,7 @@ static int handle_halt(struct ve_info *ve)
return ve_instr_len(ve);
 }
 
-void __cpuidle tdx_safe_halt(void)
+void __cpuidle tdx_halt(void)
 {
const bool irq_disabled = false;
 
@@ -409,6 +410,16 @@ void __cpuidle tdx_safe_halt(void)
WARN_ONCE(1, "HLT instruction emulation failed\n");
 }
 
+static void __cpuidle tdx_safe_halt(void)
+{
+   tdx_halt();
+   /*
+* "__cpuidle" section doesn't support instrumentation, so stick
+* with raw_* variant that avoids tracing hooks.
+*/
+   raw_local_irq_enable();
+}
+
 static int read_msr(struct pt_regs *regs, struct ve_info *ve)
 {
struct tdx_module_args args = {
@@ -1109,6 +1120,19 @@ void __init tdx_early_init(void)
x86_platform.guest.enc_kexec_begin   = tdx_kexec_begin;
x86_platform.guest.enc_kexec_finish  = tdx_kexec_finish;
 
+   /*
+* Avoid "sti;hlt" execution in TDX guests as HLT induces a #VE that
+* will enable interrupts before HLT TDCALL invocation if executed
+* in STI-shadow, possibly resulting in missed wakeup events.
+*
+* Modify all possible HLT execution paths to use TDX specific routines
+* that directly execute TDCALL and toggle the interrupt state as
+* needed after TDCALL completion. This also reduces HLT related #VEs
+* in addition to having a reliable halt logic execution.
+*/
+   pv_ops.irq.safe_halt = tdx_safe_halt;
+   pv_ops.irq.halt = tdx_halt;
+
/*
 * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
 * bringup low level code. That raises #VE which cannot be handled
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b4b16dafd55e..40f9a97371a9 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -58,7 +58,7 @@ void tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
 
-void tdx_safe_halt(void);
+void tdx_halt(void);
 
 bool tdx_early_handle_ve(struct pt_regs *regs);
 
@@ -72,7 +72,7 @@ void __init tdx_dump_td_ctls(u64 td_ctls);
 #else
 
 static inline void tdx_early_init(void) { };
-static inline void tdx_safe_halt(void) { };
+static inline void tdx_halt(void) { };
 
 static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 6da6769d7254..d11956a178df 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -934,7 +934,7 @@ void __init select_idle_routine(void)
static_call_update(x86_idle, mwait_idle);
} else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
pr_

[PATCH v7 1/3] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT

2025-02-27 Thread Vishal Annapurve
From: "Kirill A. Shutemov" 

CONFIG_PARAVIRT_XXL is mainly defined/used by XEN PV guests. For
other VM guest types, features supported under CONFIG_PARAVIRT
are self sufficient. CONFIG_PARAVIRT mainly provides support for
TLB flush operations and time related operations.

For TDX guest as well, paravirt calls under CONFIG_PARVIRT meets
most of its requirement except the need of HLT and SAFE_HLT
paravirt calls, which is currently defined under
CONFIG_PARAVIRT_XXL.

Since enabling CONFIG_PARAVIRT_XXL is too bloated for TDX guest
like platforms, move HLT and SAFE_HLT paravirt calls under
CONFIG_PARAVIRT.

Moving HLT and SAFE_HLT paravirt calls are not fatal and should not
break any functionality for current users of CONFIG_PARAVIRT.

Cc: sta...@vger.kernel.org
Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
Co-developed-by: Kuppuswamy Sathyanarayanan 

Signed-off-by: Kuppuswamy Sathyanarayanan 

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Reviewed-by: Tony Luck 
Reviewed-by: Juergen Gross 
Signed-off-by: Vishal Annapurve 
---
 arch/x86/include/asm/irqflags.h   | 40 +++
 arch/x86/include/asm/paravirt.h   | 20 +++---
 arch/x86/include/asm/paravirt_types.h |  3 +-
 arch/x86/kernel/paravirt.c| 14 ++
 4 files changed, 41 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index cf7fc2b8e3ce..1c2db11a2c3c 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -76,6 +76,28 @@ static __always_inline void 
native_local_irq_restore(unsigned long flags)
 
 #endif
 
+#ifndef CONFIG_PARAVIRT
+#ifndef __ASSEMBLY__
+/*
+ * Used in the idle loop; sti takes one instruction cycle
+ * to complete:
+ */
+static __always_inline void arch_safe_halt(void)
+{
+   native_safe_halt();
+}
+
+/*
+ * Used when interrupts are already enabled or to
+ * shutdown the processor:
+ */
+static __always_inline void halt(void)
+{
+   native_halt();
+}
+#endif /* __ASSEMBLY__ */
+#endif /* CONFIG_PARAVIRT */
+
 #ifdef CONFIG_PARAVIRT_XXL
 #include 
 #else
@@ -97,24 +119,6 @@ static __always_inline void arch_local_irq_enable(void)
native_irq_enable();
 }
 
-/*
- * Used in the idle loop; sti takes one instruction cycle
- * to complete:
- */
-static __always_inline void arch_safe_halt(void)
-{
-   native_safe_halt();
-}
-
-/*
- * Used when interrupts are already enabled or to
- * shutdown the processor:
- */
-static __always_inline void halt(void)
-{
-   native_halt();
-}
-
 /*
  * For spinlocks, etc:
  */
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 041aff51eb50..29e7331a0c98 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -107,6 +107,16 @@ static inline void notify_page_enc_status_changed(unsigned 
long pfn,
PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
 }
 
+static __always_inline void arch_safe_halt(void)
+{
+   PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+   PVOP_VCALL0(irq.halt);
+}
+
 #ifdef CONFIG_PARAVIRT_XXL
 static inline void load_sp0(unsigned long sp0)
 {
@@ -170,16 +180,6 @@ static inline void __write_cr4(unsigned long x)
PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-static __always_inline void arch_safe_halt(void)
-{
-   PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
-   PVOP_VCALL0(irq.halt);
-}
-
 static inline u64 paravirt_read_msr(unsigned msr)
 {
return PVOP_CALL1(u64, cpu.read_msr, msr);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index fea56b04f436..abccfccc2e3f 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -120,10 +120,9 @@ struct pv_irq_ops {
struct paravirt_callee_save save_fl;
struct paravirt_callee_save irq_disable;
struct paravirt_callee_save irq_enable;
-
+#endif
void (*safe_halt)(void);
void (*halt)(void);
-#endif
 } __no_randomize_layout;
 
 struct pv_mmu_ops {
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 1ccaa3397a67..c5bb980b8a67 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -110,6 +110,11 @@ int paravirt_disable_iospace(void)
return request_resource(&ioport_resource, &reserve_ioports);
 }
 
+static noinstr void pv_native_safe_halt(void)
+{
+   native_safe_halt();
+}
+
 #ifdef CONFIG_PARAVIRT_XXL
 static noinstr void pv_native_write_cr2(unsigned long val)
 {
@@ -125,11 +130,6 @@ static noinstr void pv_native_set_debugreg(int regno, 
unsigned long val)
 {
native_set_debugreg(regno, val);
 }
-
-static noinstr void pv_native_safe_halt(void)
-{
-   native_safe_halt();
-}
 #endif
 
 struct pv_info pv_info = {
@@ -186,9 +186,11 @@ struct paravirt_patch_template pv_ops = {
.irq.save_fl 

[PATCH v7 3/3] x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling

2025-02-27 Thread Vishal Annapurve
Direct HLT instruction execution causes #VEs for TDX VMs which is routed
to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow
so IRQs need to remain disabled until the TDCALL to ensure that pending
IRQs are correctly treated as wake events.

Emit warning and fail emulation if IRQs are enabled during HLT #VE handling
to avoid running into scenarios where IRQ wake events are lost resulting in
indefinite HLT execution times.

Reviewed-by: Kirill A. Shutemov 
Signed-off-by: Vishal Annapurve 
---
 arch/x86/coco/tdx/tdx.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 6aad910d119d..a97ddc6a52c3 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -393,6 +393,14 @@ static int handle_halt(struct ve_info *ve)
 {
const bool irq_disabled = irqs_disabled();
 
+   /*
+* HLT with IRQs enabled is unsafe, as an IRQ that is intended to be a
+* wake event may be consumed before requesting HLT emulation, leaving
+* the vCPU blocking indefinitely.
+*/
+   if (WARN_ONCE(!irq_disabled, "HLT emulation with IRQs enabled"))
+   return -EIO;
+
if (__halt(irq_disabled))
return -EIO;
 
-- 
2.48.1.711.g2feabab25a-goog




Re: [PATCH v6 2/3] x86/tdx: Fix arch_safe_halt() execution for TDX VMs

2025-02-27 Thread Vishal Annapurve
On Wed, Feb 26, 2025 at 3:49 AM Kirill A. Shutemov  wrote:
>
> On Tue, Feb 25, 2025 at 12:47:03AM +, Vishal Annapurve wrote:
> > Direct HLT instruction execution causes #VEs for TDX VMs which is routed
> > to hypervisor via TDCALL. If HLT is executed in STI-shadow, resulting #VE
> > handler will enable interrupts before TDCALL is routed to hypervisor
> > leading to missed wakeup events.
> >
> > Current TDX spec doesn't expose interruptibility state information to
> > allow #VE handler to selectively enable interrupts. To bypass this
> > issue, TDX VMs need to replace "sti;hlt" execution with direct TDCALL
> > followed by explicit interrupt flag update.
> >
> > Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
> > prevented the idle routines from executing HLT instruction in STI-shadow.
> > But it missed the paravirt routine which can be reached like this as an
> > example:
> > acpi_safe_halt() =>
> > raw_safe_halt()  =>
> > arch_safe_halt() =>
> > irq.safe_halt()  =>
> > pv_native_safe_halt()
>
> I would rather use paravirt spinlock example. It is less controversial.
> I still see no point in ACPI cpuidle be a thing in TDX guests.
>

I will modify the description to include a paravirt spinlock example.

> >
> > To reliably handle arch_safe_halt() for TDX VMs, introduce explicit
> > dependency on CONFIG_PARAVIRT and override paravirt halt()/safe_halt()
> > routines with TDX-safe versions that execute direct TDCALL and needed
> > interrupt flag updates. Executing direct TDCALL brings in additional
> > benefit of avoiding HLT related #VEs altogether.
> >
> > Cc: sta...@vger.kernel.org
> > Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
> > Signed-off-by: Vishal Annapurve 
>
> Reviewed-by: Kirill A. Shutemov 
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v6 2/3] x86/tdx: Fix arch_safe_halt() execution for TDX VMs

2025-02-27 Thread Vishal Annapurve
On Thu, Feb 27, 2025 at 8:25 AM kernel test robot  wrote:
>
> Hi Vishal,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on tip/x86/core]
> [also build test ERROR on tip/master linus/master v6.14-rc4 next-20250227]
> [cannot apply to tip/x86/tdx tip/auto-latest]
> ...
> All errors (new ones prefixed by >>):
>
>In file included from arch/x86/kernel/process.c:6:
>In file included from include/linux/mm.h:2224:
>include/linux/vmstat.h:504:43: warning: arithmetic between different 
> enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') 
> [-Wenum-enum-conversion]
>  504 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
>  |~ ^
>  505 |item];
>  |
>include/linux/vmstat.h:511:43: warning: arithmetic between different 
> enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') 
> [-Wenum-enum-conversion]
>  511 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
>  |~ ^
>  512 |NR_VM_NUMA_EVENT_ITEMS +
>  |~~
> >> arch/x86/kernel/process.c:937:32: error: use of undeclared identifier 
> >> 'tdx_halt'; did you mean 'tdx_init'?
>  937 | static_call_update(x86_idle, tdx_halt);
>  |  ^~~~
>  |  tdx_init

Will fix this in the next version.



Re: [PATCH V5 1/4] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT

2025-02-20 Thread Vishal Annapurve
On Thu, Feb 20, 2025 at 1:47 PM Dave Hansen  wrote:
>
> On 2/20/25 13:16, Vishal Annapurve wrote:
> > Since enabling CONFIG_PARAVIRT_XXL is too bloated for TDX guest
> > like platforms, move HLT and SAFE_HLT paravirt calls under
> > CONFIG_PARAVIRT.
>
> I guess it's just one patch, but doesn't this expose CONFIG_PARAVIRT=y
> users to what _was_ specific to CONFIG_PARAVIRT_XXL=y? According to the
> changelog, TDX users shouldn't have to use use PARAVIRT_XXL, so
> PARAVIRT=y and PARAVIRT_XXL=n must be an *IMPORTANT* configuration for
> TDX users.
>
> Before this patch, those users would have no way to hit the
> unsafe-for-TDX pv_native_safe_halt(). After this patch, they will hit it.

Before this patch, those users had access to arch_safe_halt() ->
native_safe_halt() path. With this patch, such users can execute
arch_safe_halt -> pv_native_safe_halt() -> native_safe_halt(), so this
patch doesn't cause any additional regression.

>
> So, there are two possibilities:
>
>  1. This patch breaks bisection for an important TDX configuration
>  2. This patch's conjecture that PARAVIRT_XXL=n is important for TDX
> is wrong and it is not necessary in the first place.
>
> What am I missing?



Re: [PATCH V5 2/4] x86/tdx: Route safe halt execution via tdx_safe_halt()

2025-02-20 Thread Vishal Annapurve
On Thu, Feb 20, 2025 at 3:00 PM Dave Hansen  wrote:
>
> On 2/20/25 13:16, Vishal Annapurve wrote:
> > Direct HLT instruction execution causes #VEs for TDX VMs which is routed
> > to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow
> > so IRQs need to remain disabled until the TDCALL to ensure that pending
> > IRQs are correctly treated as wake events.
>
> This isn't quite true. There's only one paravirt safe_halt() and it
> doesn't do HLT or STI.

pv_native_safe_halt() -> native_safe_halt() -> "sti; hlt".

>
> I think it's more true to say that "safe" halts are entered with IRQs
> disabled. They logically do the halt operation and then enable
> interrupts before returning.
>
> > So "sti;hlt" sequence needs to be replaced for TDX VMs with "TDCALL;
> > *_irq_enable()" to keep interrupts disabled during TDCALL execution.
> But this isn't new. TDX already tried to avoid "sti;hlt". It just
> screwed up the implementation.
>
> > Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
> > prevented the idle routines from using "sti;hlt". But it missed the
> > paravirt routine which can be reached like this as an example:
> > acpi_safe_halt() =>
> > raw_safe_halt()  =>
> > arch_safe_halt() =>
> > irq.safe_halt()  =>
> > pv_native_safe_halt()
>
> This, on the other hand, *is* important.
>
> > Modify tdx_safe_halt() to implement the sequence "TDCALL;
> > raw_local_irq_enable()" and invoke tdx_halt() from idle routine which just
> > executes TDCALL without toggling interrupt state. Introduce dependency
> > on CONFIG_PARAVIRT and override paravirt halt()/safe_halt() routines for
> > TDX VMs.
>
> This changelog glosses over one of the key points: Why *MUST* TDX use
> paravirt? It further confuses the reasoning by alluding to the idea that
> "Direct HLT instruction execution ... is routed to hypervisor via TDCALL".
>
> It gives background and a solution, but it's not obvious what the
> problem is or how the solution _fixes_ the problem.
>
> What must TDX now depend on PARAVIRT?
>
> Why not just route the HLT to a TDXCALL via the #VE code?
>
>

Makes sense, I will update the commit message in the next version to
clearly answer these questions and address the above comments.



[PATCH V5 3/4] x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling

2025-02-20 Thread Vishal Annapurve
Direct HLT instruction execution causes #VEs for TDX VMs which is routed
to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow
so IRQs need to remain disabled until the TDCALL to ensure that pending
IRQs are correctly treated as wake events.

Emit warning and fail emulation if IRQs are enabled during HLT #VE handling
to avoid running into scenarios where IRQ wake events are lost resulting in
indefinite HLT execution times.

Reviewed-by: Kirill A. Shutemov 
Signed-off-by: Vishal Annapurve 
---
 arch/x86/coco/tdx/tdx.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 7ab427e85bd3..16ac337df9fa 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -393,6 +393,14 @@ static int handle_halt(struct ve_info *ve)
 {
const bool irq_disabled = irqs_disabled();
 
+   /*
+* HLT with IRQs enabled is unsafe, as an IRQ that is intended to be a
+* wake event may be consumed before requesting HLT emulation, leaving
+* the vCPU blocking indefinitely.
+*/
+   if (WARN_ONCE(!irq_disabled, "HLT emulation with IRQs enabled"))
+   return -EIO;
+
if (__halt(irq_disabled))
return -EIO;
 
-- 
2.48.1.601.g30ceb7b040-goog




[PATCH v6 0/3] x86/tdx: Fix HLT logic execution for TDX VMs

2025-02-24 Thread Vishal Annapurve
Direct HLT instruction execution causes #VEs for TDX VMs which is routed
to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow
so IRQs need to remain disabled until the TDCALL to ensure that pending
IRQs are correctly treated as wake events. As per current TDX spec, HLT
#VE handler doesn't have access to interruptibility state to selectively
enable interrupts, it ends up enabling interrupts during #VE handling
before the TDCALL is executed.
 
Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
effectively solved this issue for idle routines by defining TDX specific
idle routine which directly invokes TDCALL while keeping interrupts
disabled, but missed handling arch_safe_halt(). This series intends to fix
arch_safe_halt() execution for TDX VMs.

Changes introduced by the series include:
- Move *halt() variants outside CONFIG_PARAVIRT_XXL and under
  CONFIG_PARAVIRT [1].
- Add explicit dependency on CONFIG_PARAVIRT for TDX VMs.
- Route "sti; hlt" sequences via tdx_safe_halt() for reliability.
- Route "hlt" sequences via tdx_halt() to avoid unnecessary #VEs.
- Warn and fail emulation if HLT #VE emulation executes with interrupts
  enabled.

Changes since v5:
1) Addressed Dave's comments.
2) Dropped the cleanup patch for now, it can be discussed separately.

v5: https://lore.kernel.org/lkml/20250220211628.1832258-1-vannapu...@google.com/

Kirill A. Shutemov (1):
  x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT

Vishal Annapurve (2):
  x86/tdx: Fix arch_safe_halt() execution for TDX VMs
  x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling

 arch/x86/Kconfig  |  1 +
 arch/x86/coco/tdx/tdx.c   | 34 ++-
 arch/x86/include/asm/irqflags.h   | 40 +++
 arch/x86/include/asm/paravirt.h   | 20 +++---
 arch/x86/include/asm/paravirt_types.h |  3 +-
 arch/x86/include/asm/tdx.h|  2 +-
 arch/x86/kernel/paravirt.c| 14 ++
 arch/x86/kernel/process.c |  2 +-
 8 files changed, 77 insertions(+), 39 deletions(-)

-- 
2.48.1.658.g4767266eb4-goog




[PATCH v6 2/3] x86/tdx: Fix arch_safe_halt() execution for TDX VMs

2025-02-24 Thread Vishal Annapurve
Direct HLT instruction execution causes #VEs for TDX VMs which is routed
to hypervisor via TDCALL. If HLT is executed in STI-shadow, resulting #VE
handler will enable interrupts before TDCALL is routed to hypervisor
leading to missed wakeup events.

Current TDX spec doesn't expose interruptibility state information to
allow #VE handler to selectively enable interrupts. To bypass this
issue, TDX VMs need to replace "sti;hlt" execution with direct TDCALL
followed by explicit interrupt flag update.

Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
prevented the idle routines from executing HLT instruction in STI-shadow.
But it missed the paravirt routine which can be reached like this as an
example:
acpi_safe_halt() =>
raw_safe_halt()  =>
arch_safe_halt() =>
irq.safe_halt()  =>
pv_native_safe_halt()

To reliably handle arch_safe_halt() for TDX VMs, introduce explicit
dependency on CONFIG_PARAVIRT and override paravirt halt()/safe_halt()
routines with TDX-safe versions that execute direct TDCALL and needed
interrupt flag updates. Executing direct TDCALL brings in additional
benefit of avoiding HLT related #VEs altogether.

Cc: sta...@vger.kernel.org
Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
Signed-off-by: Vishal Annapurve 
---
 arch/x86/Kconfig   |  1 +
 arch/x86/coco/tdx/tdx.c| 26 +-
 arch/x86/include/asm/tdx.h |  2 +-
 arch/x86/kernel/process.c  |  2 +-
 4 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index be2c311f5118..933c046e8966 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -902,6 +902,7 @@ config INTEL_TDX_GUEST
depends on X86_64 && CPU_SUP_INTEL
depends on X86_X2APIC
depends on EFI_STUB
+   depends on PARAVIRT
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
select X86_MCE
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 32809a06dab4..6aad910d119d 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -398,7 +399,7 @@ static int handle_halt(struct ve_info *ve)
return ve_instr_len(ve);
 }
 
-void __cpuidle tdx_safe_halt(void)
+void __cpuidle tdx_halt(void)
 {
const bool irq_disabled = false;
 
@@ -409,6 +410,16 @@ void __cpuidle tdx_safe_halt(void)
WARN_ONCE(1, "HLT instruction emulation failed\n");
 }
 
+static void __cpuidle tdx_safe_halt(void)
+{
+   tdx_halt();
+   /*
+* "__cpuidle" section doesn't support instrumentation, so stick
+* with raw_* variant that avoids tracing hooks.
+*/
+   raw_local_irq_enable();
+}
+
 static int read_msr(struct pt_regs *regs, struct ve_info *ve)
 {
struct tdx_module_args args = {
@@ -1109,6 +1120,19 @@ void __init tdx_early_init(void)
x86_platform.guest.enc_kexec_begin   = tdx_kexec_begin;
x86_platform.guest.enc_kexec_finish  = tdx_kexec_finish;
 
+   /*
+* Avoid "sti;hlt" execution in TDX guests as HLT induces a #VE that
+* will enable interrupts before HLT TDCALL invocation if executed
+* in STI-shadow, possibly resulting in missed wakeup events.
+*
+* Modify all possible HLT execution paths to use TDX specific routines
+* that directly execute TDCALL and toggle the interrupt state as
+* needed after TDCALL completion. This also reduces HLT related #VEs
+* in addition to having a reliable halt logic execution.
+*/
+   pv_ops.irq.safe_halt = tdx_safe_halt;
+   pv_ops.irq.halt = tdx_halt;
+
/*
 * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
 * bringup low level code. That raises #VE which cannot be handled
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b4b16dafd55e..393ee2dfaab1 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -58,7 +58,7 @@ void tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
 
-void tdx_safe_halt(void);
+void tdx_halt(void);
 
 bool tdx_early_handle_ve(struct pt_regs *regs);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 6da6769d7254..d11956a178df 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -934,7 +934,7 @@ void __init select_idle_routine(void)
static_call_update(x86_idle, mwait_idle);
} else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
pr_info("using TDX aware idle routine\n");
-   static_call_update(x86_idle, tdx_safe_halt);
+   static_call_update(x86_idle, tdx_halt);
} else {
static_call_update(x86_idle, default_idle);
}
-- 
2.48.1.658.g4767266eb4-goog




[PATCH v6 1/3] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT

2025-02-24 Thread Vishal Annapurve
From: "Kirill A. Shutemov" 

CONFIG_PARAVIRT_XXL is mainly defined/used by XEN PV guests. For
other VM guest types, features supported under CONFIG_PARAVIRT
are self sufficient. CONFIG_PARAVIRT mainly provides support for
TLB flush operations and time related operations.

For TDX guest as well, paravirt calls under CONFIG_PARVIRT meets
most of its requirement except the need of HLT and SAFE_HLT
paravirt calls, which is currently defined under
CONFIG_PARAVIRT_XXL.

Since enabling CONFIG_PARAVIRT_XXL is too bloated for TDX guest
like platforms, move HLT and SAFE_HLT paravirt calls under
CONFIG_PARAVIRT.

Moving HLT and SAFE_HLT paravirt calls are not fatal and should not
break any functionality for current users of CONFIG_PARAVIRT.

Cc: sta...@vger.kernel.org
Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
Co-developed-by: Kuppuswamy Sathyanarayanan 

Signed-off-by: Kuppuswamy Sathyanarayanan 

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Reviewed-by: Tony Luck 
Signed-off-by: Vishal Annapurve 
---
 arch/x86/include/asm/irqflags.h   | 40 +++
 arch/x86/include/asm/paravirt.h   | 20 +++---
 arch/x86/include/asm/paravirt_types.h |  3 +-
 arch/x86/kernel/paravirt.c| 14 ++
 4 files changed, 41 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index cf7fc2b8e3ce..1c2db11a2c3c 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -76,6 +76,28 @@ static __always_inline void 
native_local_irq_restore(unsigned long flags)
 
 #endif
 
+#ifndef CONFIG_PARAVIRT
+#ifndef __ASSEMBLY__
+/*
+ * Used in the idle loop; sti takes one instruction cycle
+ * to complete:
+ */
+static __always_inline void arch_safe_halt(void)
+{
+   native_safe_halt();
+}
+
+/*
+ * Used when interrupts are already enabled or to
+ * shutdown the processor:
+ */
+static __always_inline void halt(void)
+{
+   native_halt();
+}
+#endif /* __ASSEMBLY__ */
+#endif /* CONFIG_PARAVIRT */
+
 #ifdef CONFIG_PARAVIRT_XXL
 #include 
 #else
@@ -97,24 +119,6 @@ static __always_inline void arch_local_irq_enable(void)
native_irq_enable();
 }
 
-/*
- * Used in the idle loop; sti takes one instruction cycle
- * to complete:
- */
-static __always_inline void arch_safe_halt(void)
-{
-   native_safe_halt();
-}
-
-/*
- * Used when interrupts are already enabled or to
- * shutdown the processor:
- */
-static __always_inline void halt(void)
-{
-   native_halt();
-}
-
 /*
  * For spinlocks, etc:
  */
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 041aff51eb50..29e7331a0c98 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -107,6 +107,16 @@ static inline void notify_page_enc_status_changed(unsigned 
long pfn,
PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
 }
 
+static __always_inline void arch_safe_halt(void)
+{
+   PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+   PVOP_VCALL0(irq.halt);
+}
+
 #ifdef CONFIG_PARAVIRT_XXL
 static inline void load_sp0(unsigned long sp0)
 {
@@ -170,16 +180,6 @@ static inline void __write_cr4(unsigned long x)
PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-static __always_inline void arch_safe_halt(void)
-{
-   PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
-   PVOP_VCALL0(irq.halt);
-}
-
 static inline u64 paravirt_read_msr(unsigned msr)
 {
return PVOP_CALL1(u64, cpu.read_msr, msr);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index fea56b04f436..abccfccc2e3f 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -120,10 +120,9 @@ struct pv_irq_ops {
struct paravirt_callee_save save_fl;
struct paravirt_callee_save irq_disable;
struct paravirt_callee_save irq_enable;
-
+#endif
void (*safe_halt)(void);
void (*halt)(void);
-#endif
 } __no_randomize_layout;
 
 struct pv_mmu_ops {
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 1ccaa3397a67..c5bb980b8a67 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -110,6 +110,11 @@ int paravirt_disable_iospace(void)
return request_resource(&ioport_resource, &reserve_ioports);
 }
 
+static noinstr void pv_native_safe_halt(void)
+{
+   native_safe_halt();
+}
+
 #ifdef CONFIG_PARAVIRT_XXL
 static noinstr void pv_native_write_cr2(unsigned long val)
 {
@@ -125,11 +130,6 @@ static noinstr void pv_native_set_debugreg(int regno, 
unsigned long val)
 {
native_set_debugreg(regno, val);
 }
-
-static noinstr void pv_native_safe_halt(void)
-{
-   native_safe_halt();
-}
 #endif
 
 struct pv_info pv_info = {
@@ -186,9 +186,11 @@ struct paravirt_patch_template pv_ops = {
.irq.save_fl= __PV_IS_C

[PATCH v6 3/3] x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling

2025-02-24 Thread Vishal Annapurve
Direct HLT instruction execution causes #VEs for TDX VMs which is routed
to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow
so IRQs need to remain disabled until the TDCALL to ensure that pending
IRQs are correctly treated as wake events.

Emit warning and fail emulation if IRQs are enabled during HLT #VE handling
to avoid running into scenarios where IRQ wake events are lost resulting in
indefinite HLT execution times.

Reviewed-by: Kirill A. Shutemov 
Signed-off-by: Vishal Annapurve 
---
 arch/x86/coco/tdx/tdx.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 6aad910d119d..a97ddc6a52c3 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -393,6 +393,14 @@ static int handle_halt(struct ve_info *ve)
 {
const bool irq_disabled = irqs_disabled();
 
+   /*
+* HLT with IRQs enabled is unsafe, as an IRQ that is intended to be a
+* wake event may be consumed before requesting HLT emulation, leaving
+* the vCPU blocking indefinitely.
+*/
+   if (WARN_ONCE(!irq_disabled, "HLT emulation with IRQs enabled"))
+   return -EIO;
+
if (__halt(irq_disabled))
return -EIO;
 
-- 
2.48.1.658.g4767266eb4-goog




[PATCH V5 0/4] x86/tdx: Fix HLT logic execution for TDX VMs

2025-02-20 Thread Vishal Annapurve
Direct HLT instruction execution causes #VEs for TDX VMs which is routed
to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow
so IRQs need to remain disabled until the TDCALL to ensure that pending
IRQs are correctly treated as wake events. So "sti;hlt" sequence needs to
be replaced for TDX VMs with TDCALL execution followed by enabling of
interrupts.

Changes introduced by the series include:
- Move *halt() variants outside CONFIG_PARAVIRT_XXL and under
  CONFIG_PARAVIRT [1].
- Route "sti; hlt" sequences via tdx_safe_halt() for reliability.
- Route "hlt" sequences via tdx_halt() to avoid unnecessary #VEs.
- Add explicit dependency on CONFIG_PARAVIRT for TDX VMs.
- Warn and fail emulation if HLT #VE emulation executes with interrupts
  enabled.
- Clean up TDX specific idle routine override.

Changes since v4:
1) Addressed Kirill's comments.

v4: https://lore.kernel.org/lkml/20250212000747.3403836-1-vannapu...@google.com/

Changes since v3:
1) Addressed comments from Sean.
2) Included [1] in the series to fix the scenarios where
CONFIG_PARAVIRT_XXL could be disabled.
v3: https://lore.kernel.org/all/20250206222714.1079059-1-vannapu...@google.com/

[1] 
https://lore.kernel.org/lkml/20210517235008.257241-1-sathyanarayanan.kuppusw...@linux.intel.com/

Kirill A. Shutemov (1):
  x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT

Vishal Annapurve (3):
  x86/tdx: Route safe halt execution via tdx_safe_halt()
  x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling
  x86/tdx: Remove TDX specific idle routine

 arch/x86/Kconfig  |  1 +
 arch/x86/coco/tdx/tdx.c   | 30 +++-
 arch/x86/include/asm/irqflags.h   | 40 +++
 arch/x86/include/asm/paravirt.h   | 20 +++---
 arch/x86/include/asm/paravirt_types.h |  3 +-
 arch/x86/include/asm/tdx.h|  2 --
 arch/x86/kernel/paravirt.c| 14 ++
 arch/x86/kernel/process.c |  3 --
 8 files changed, 71 insertions(+), 42 deletions(-)

-- 
2.48.1.601.g30ceb7b040-goog




[PATCH V5 4/4] x86/tdx: Remove TDX specific idle routine

2025-02-20 Thread Vishal Annapurve
With explicit dependency on CONFIG_PARAVIRT and TDX specific
halt()/safe_halt() routines in place, default_idle() is safe to execute for
TDX VMs. Remove TDX specific idle routine override which is now
redundant.

Signed-off-by: Vishal Annapurve 
---
 arch/x86/coco/tdx/tdx.c| 2 +-
 arch/x86/include/asm/tdx.h | 2 --
 arch/x86/kernel/process.c  | 3 ---
 3 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 16ac337df9fa..46f7bb82c8b7 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -407,7 +407,7 @@ static int handle_halt(struct ve_info *ve)
return ve_instr_len(ve);
 }
 
-void __cpuidle tdx_halt(void)
+static void __cpuidle tdx_halt(void)
 {
const bool irq_disabled = false;
 
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 393ee2dfaab1..6769d1da4c80 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -58,8 +58,6 @@ void tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
 
-void tdx_halt(void);
-
 bool tdx_early_handle_ve(struct pt_regs *regs);
 
 int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index d11956a178df..9b21989c283b 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -932,9 +932,6 @@ void __init select_idle_routine(void)
if (prefer_mwait_c1_over_halt()) {
pr_info("using mwait in idle threads\n");
static_call_update(x86_idle, mwait_idle);
-   } else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
-   pr_info("using TDX aware idle routine\n");
-   static_call_update(x86_idle, tdx_halt);
} else {
static_call_update(x86_idle, default_idle);
}
-- 
2.48.1.601.g30ceb7b040-goog




[PATCH V5 2/4] x86/tdx: Route safe halt execution via tdx_safe_halt()

2025-02-20 Thread Vishal Annapurve
Direct HLT instruction execution causes #VEs for TDX VMs which is routed
to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow
so IRQs need to remain disabled until the TDCALL to ensure that pending
IRQs are correctly treated as wake events. So "sti;hlt" sequence needs to
be replaced for TDX VMs with "TDCALL; *_irq_enable()" to keep interrupts
disabled during TDCALL execution.

Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
prevented the idle routines from using "sti;hlt". But it missed the
paravirt routine which can be reached like this as an example:
acpi_safe_halt() =>
raw_safe_halt()  =>
arch_safe_halt() =>
irq.safe_halt()  =>
pv_native_safe_halt()

Modify tdx_safe_halt() to implement the sequence "TDCALL;
raw_local_irq_enable()" and invoke tdx_halt() from idle routine which just
executes TDCALL without toggling interrupt state. Introduce dependency
on CONFIG_PARAVIRT and override paravirt halt()/safe_halt() routines for
TDX VMs.

Cc: sta...@vger.kernel.org
Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
Signed-off-by: Vishal Annapurve 
---
 arch/x86/Kconfig   |  1 +
 arch/x86/coco/tdx/tdx.c| 22 +-
 arch/x86/include/asm/tdx.h |  2 +-
 arch/x86/kernel/process.c  |  2 +-
 4 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 87198d957e2f..afcdbc9693dc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -902,6 +902,7 @@ config INTEL_TDX_GUEST
depends on X86_64 && CPU_SUP_INTEL
depends on X86_X2APIC
depends on EFI_STUB
+   depends on PARAVIRT
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
select X86_MCE
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 32809a06dab4..7ab427e85bd3 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -398,7 +399,7 @@ static int handle_halt(struct ve_info *ve)
return ve_instr_len(ve);
 }
 
-void __cpuidle tdx_safe_halt(void)
+void __cpuidle tdx_halt(void)
 {
const bool irq_disabled = false;
 
@@ -409,6 +410,16 @@ void __cpuidle tdx_safe_halt(void)
WARN_ONCE(1, "HLT instruction emulation failed\n");
 }
 
+static void __cpuidle tdx_safe_halt(void)
+{
+   tdx_halt();
+   /*
+* "__cpuidle" section doesn't support instrumentation, so stick
+* with raw_* variant that avoids tracing hooks.
+*/
+   raw_local_irq_enable();
+}
+
 static int read_msr(struct pt_regs *regs, struct ve_info *ve)
 {
struct tdx_module_args args = {
@@ -1109,6 +1120,15 @@ void __init tdx_early_init(void)
x86_platform.guest.enc_kexec_begin   = tdx_kexec_begin;
x86_platform.guest.enc_kexec_finish  = tdx_kexec_finish;
 
+   /*
+* "sti;hlt" execution in TDX guests will induce a #VE in the STI-shadow
+* which will enable interrupts before HLT TDCALL inocation possibly
+* resulting in missed wakeup events. Modify all possible HLT
+* execution paths to use TDCALL for performance/reliability reasons.
+*/
+   pv_ops.irq.safe_halt = tdx_safe_halt;
+   pv_ops.irq.halt = tdx_halt;
+
/*
 * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
 * bringup low level code. That raises #VE which cannot be handled
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b4b16dafd55e..393ee2dfaab1 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -58,7 +58,7 @@ void tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
 
-void tdx_safe_halt(void);
+void tdx_halt(void);
 
 bool tdx_early_handle_ve(struct pt_regs *regs);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 6da6769d7254..d11956a178df 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -934,7 +934,7 @@ void __init select_idle_routine(void)
static_call_update(x86_idle, mwait_idle);
} else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
pr_info("using TDX aware idle routine\n");
-   static_call_update(x86_idle, tdx_safe_halt);
+   static_call_update(x86_idle, tdx_halt);
} else {
static_call_update(x86_idle, default_idle);
}
-- 
2.48.1.601.g30ceb7b040-goog




[PATCH V5 1/4] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT

2025-02-20 Thread Vishal Annapurve
From: "Kirill A. Shutemov" 

CONFIG_PARAVIRT_XXL is mainly defined/used by XEN PV guests. For
other VM guest types, features supported under CONFIG_PARAVIRT
are self sufficient. CONFIG_PARAVIRT mainly provides support for
TLB flush operations and time related operations.

For TDX guest as well, paravirt calls under CONFIG_PARVIRT meets
most of its requirement except the need of HLT and SAFE_HLT
paravirt calls, which is currently defined under
CONFIG_PARAVIRT_XXL.

Since enabling CONFIG_PARAVIRT_XXL is too bloated for TDX guest
like platforms, move HLT and SAFE_HLT paravirt calls under
CONFIG_PARAVIRT.

Moving HLT and SAFE_HLT paravirt calls are not fatal and should not
break any functionality for current users of CONFIG_PARAVIRT.

Cc: sta...@vger.kernel.org
Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
Co-developed-by: Kuppuswamy Sathyanarayanan 

Signed-off-by: Kuppuswamy Sathyanarayanan 

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Reviewed-by: Tony Luck 
Signed-off-by: Vishal Annapurve 
---
 arch/x86/include/asm/irqflags.h   | 40 +++
 arch/x86/include/asm/paravirt.h   | 20 +++---
 arch/x86/include/asm/paravirt_types.h |  3 +-
 arch/x86/kernel/paravirt.c| 14 ++
 4 files changed, 41 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index cf7fc2b8e3ce..1c2db11a2c3c 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -76,6 +76,28 @@ static __always_inline void 
native_local_irq_restore(unsigned long flags)
 
 #endif
 
+#ifndef CONFIG_PARAVIRT
+#ifndef __ASSEMBLY__
+/*
+ * Used in the idle loop; sti takes one instruction cycle
+ * to complete:
+ */
+static __always_inline void arch_safe_halt(void)
+{
+   native_safe_halt();
+}
+
+/*
+ * Used when interrupts are already enabled or to
+ * shutdown the processor:
+ */
+static __always_inline void halt(void)
+{
+   native_halt();
+}
+#endif /* __ASSEMBLY__ */
+#endif /* CONFIG_PARAVIRT */
+
 #ifdef CONFIG_PARAVIRT_XXL
 #include 
 #else
@@ -97,24 +119,6 @@ static __always_inline void arch_local_irq_enable(void)
native_irq_enable();
 }
 
-/*
- * Used in the idle loop; sti takes one instruction cycle
- * to complete:
- */
-static __always_inline void arch_safe_halt(void)
-{
-   native_safe_halt();
-}
-
-/*
- * Used when interrupts are already enabled or to
- * shutdown the processor:
- */
-static __always_inline void halt(void)
-{
-   native_halt();
-}
-
 /*
  * For spinlocks, etc:
  */
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 041aff51eb50..29e7331a0c98 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -107,6 +107,16 @@ static inline void notify_page_enc_status_changed(unsigned 
long pfn,
PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
 }
 
+static __always_inline void arch_safe_halt(void)
+{
+   PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+   PVOP_VCALL0(irq.halt);
+}
+
 #ifdef CONFIG_PARAVIRT_XXL
 static inline void load_sp0(unsigned long sp0)
 {
@@ -170,16 +180,6 @@ static inline void __write_cr4(unsigned long x)
PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-static __always_inline void arch_safe_halt(void)
-{
-   PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
-   PVOP_VCALL0(irq.halt);
-}
-
 static inline u64 paravirt_read_msr(unsigned msr)
 {
return PVOP_CALL1(u64, cpu.read_msr, msr);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index fea56b04f436..abccfccc2e3f 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -120,10 +120,9 @@ struct pv_irq_ops {
struct paravirt_callee_save save_fl;
struct paravirt_callee_save irq_disable;
struct paravirt_callee_save irq_enable;
-
+#endif
void (*safe_halt)(void);
void (*halt)(void);
-#endif
 } __no_randomize_layout;
 
 struct pv_mmu_ops {
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 1ccaa3397a67..c5bb980b8a67 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -110,6 +110,11 @@ int paravirt_disable_iospace(void)
return request_resource(&ioport_resource, &reserve_ioports);
 }
 
+static noinstr void pv_native_safe_halt(void)
+{
+   native_safe_halt();
+}
+
 #ifdef CONFIG_PARAVIRT_XXL
 static noinstr void pv_native_write_cr2(unsigned long val)
 {
@@ -125,11 +130,6 @@ static noinstr void pv_native_set_debugreg(int regno, 
unsigned long val)
 {
native_set_debugreg(regno, val);
 }
-
-static noinstr void pv_native_safe_halt(void)
-{
-   native_safe_halt();
-}
 #endif
 
 struct pv_info pv_info = {
@@ -186,9 +186,11 @@ struct paravirt_patch_template pv_ops = {
.irq.save_fl= __PV_IS_C