[PATCH] mmc: Reducing cache operations in the host driver
Hi, I am attaching a patch which attempts to reduce the cache operations while doing MMC transactions. I have tested it only on arm and the tests performed with benchmarks like iozone/bonnie showed that the data integrity is maintained while I/O bandwidth is increased. I have tested it with K3.1 and I believe it can apply to 3.12 also. My understanding from the important API's dealing with DMA memory is as follows: 1) dma_map_sg/ dma_sync_sg_for_device -> make sure that cache is flushed after CPU is done updating the memory allocated for DMA and is called before giving control of DMA memory to the device. 2) dma_unmap_sg/dma_sync_sg_for_cpu -> Make sure that cache is invalidated before reading from the DMA area which was used by the device to write the data. About the patch: Changes in sdhci_adma_table_pre make sure that we only flush if we have updated DMA area after the call to dma_map_sg. Changes in sdhci_adma_table_post take care of following: 1) Remove invalidation of cache for memory locations which are going to be updated by CPU, as they are not being read. 2) Perform the unmap of sg before CPU accesses DMA area as the changes we did for unaligned cases might get lost due to invalidation afterwards. I was not able to induce unaligned buffer accesses using normal filesystem/raw device operations. Maybe that's why this issue was not discovered so far. 3) Only drawback is sg->dma_address gets used after the call to dma_unmap_sg. I would like to understand if this patch can cause any regressions for any of the architectures or with the MMC functionality. Thanks & Regards, Vishal Annapurve Author: Vishal Annapurve MMC: Remove unnecessary cache operations 1) This change removes unnecessary cache operations happening after and before DMA setup in MMC host driver. Signed-off-by: Vishal Annapurve diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c index 6e43c84..d5cd0ae 100644 --- a/drivers/mmc/host/sdhci.c +++ b/drivers/mmc/host/sdhci.c @@ -434,6 +434,7 @@ static int sdhci_adma_table_pre(struct sdhci_host *host, dma_addr_t addr; dma_addr_t align_addr; int len, offset; + int align_buf_modified_len = 0; struct scatterlist *sg; int i; @@ -498,6 +499,7 @@ static int sdhci_adma_table_pre(struct sdhci_host *host, align += 4; align_addr += 4; + align_buf_modified_len += 4; desc += 8; @@ -538,9 +540,10 @@ static int sdhci_adma_table_pre(struct sdhci_host *host, /* * Resync align buffer as we might have changed it. */ - if (data->flags & MMC_DATA_WRITE) { + if ((data->flags & MMC_DATA_WRITE) && + align_buf_modified_len) { dma_sync_single_for_device(mmc_dev(host->mmc), - host->align_addr, 128 * 4, direction); + host->align_addr, align_buf_modified_len, direction); } host->adma_addr = dma_map_single(mmc_dev(host->mmc), @@ -583,9 +586,10 @@ static void sdhci_adma_table_post(struct sdhci_host *host, dma_unmap_single(mmc_dev(host->mmc), host->align_addr, 128 * 4, direction); + dma_unmap_sg(mmc_dev(host->mmc), data->sg, + data->sg_len, direction); + if (data->flags & MMC_DATA_READ) { - dma_sync_sg_for_cpu(mmc_dev(host->mmc), data->sg, - data->sg_len, direction); align = host->align_buffer; @@ -602,9 +606,6 @@ static void sdhci_adma_table_post(struct sdhci_host *host, } } } - - dma_unmap_sg(mmc_dev(host->mmc), data->sg, - data->sg_len, direction); } static u8 sdhci_calc_timeout(struct sdhci_host *host, struct mmc_command *cmd)
[PATCH] usb-storage: scsiglue: Changing the command result
Hi, There was a recent commit in mainline for the scsi devices which do not respond properly to medium access command: commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8 [SCSI] Handle disk devices which can not process medium access commands We have experienced several devices which fail in a fashion we do not currently handle gracefully in SCSI. After a failure these devices will respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.) but any command accessing the storage medium will time out. I came across a USB drive which showed similar problem and what I see is usb storage is still not able to cope with such devices properly. The control flow downwards is like: scsi_times_out --> Setting cmd->result as DID_TIME_OUT scsi_error_handler scsi_unjam_host scsi_eh_abort_cmds command_abort(sets US_FLIDX_TIMED_OUT for us->dflags calls stop_transport, and waits for)usb_stor_control_thread (which is waiting for transport call to return inside usb_stor_invoke_transport) both usb_stor_control_thread and usb_stor_invoke_transport check for us->dflags timed_out bit and set the result as DID_ABORT and signal completion for command_abort to complete .. sd_eh_action checks for cmd->result and finds out that it's DID_ABORT rather than DID_TIME_OUT. This patch updates the command result to be TIME_OUT explicitly before returning from command_abort in scsiglue.c. I would like to know if this patch can work out for such USB Storage devices? What would be the better way to do the same? Regards, Vishal Annapurve From: Vishal Annapurve Subject: [PATCH] usb-storage: Changing the command result This change updates the returned result to scsi layer to use DID_TIMEOUT flag as a result status rather than DID_ABORT for commands aborted due to timeout. Signed-off-by: Vishal Annapurve --- drivers/usb/storage/scsiglue.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/usb/storage/scsiglue.c b/drivers/usb/storage/scsiglue.c index a74e9d8..e7b532e 100644 --- a/drivers/usb/storage/scsiglue.c +++ b/drivers/usb/storage/scsiglue.c @@ -353,6 +353,15 @@ static int command_abort(struct scsi_cmnd *srb) /* Wait for the aborted command to finish */ wait_for_completion(&us->notify); + /* Given that this bit would be only set in case of timedout + * command, we can return with DID_TIME_OUT, as scsi layer needs + * to use this result rather than DID_ABORT. + * This is a WR to avoid removing DID_ABORT altogether + * and before the final control goes to scsi layer change the + * result to be timedout. + */ + us->srb->result = DID_TIME_OUT << 16; + return SUCCESS; } -- 1.8.4
Re: [RFC PATCH 00/39] 1G page support for guest_memfd
On Fri, Sep 13, 2024 at 6:08 PM Du, Fan wrote: > > ... > > > > Hello, > > > > This patchset is our exploration of how to support 1G pages in guest_memfd, > > and > > how the pages will be used in Confidential VMs. > > > > The patchset covers: > > > > + How to get 1G pages > > + Allowing mmap() of guest_memfd to userspace so that both private and > > shared > > Hi Ackerley > > Thanks for posting new version :) > > W.r.t above description and below patch snippet from Patch 26-29, > Does this new design aim to backup shared and private GPA with a single > Hugetlb spool which equal VM instance total memory? Yes. > > By my understanding, before this new changes, shared memfd and gmem fd > has dedicate hugetlb pool, that's two copy/reservation of hugetlb spool. Selftests attached to this series use single gmem fd to back guest memory. > > Does Qemu require new changes as well? I'd like to have a test of this series > if you can share Qemu branch? > We are going to discuss this RFC series and related issues at LPC. Once the next steps are finalized, the plan will be to send out an improved version. You can use/modify the selftests that are part of this series to test this feature with software protected VMs for now. Qemu will require changes for this feature on top of already floated gmem integration series [1] that adds software protected VM support to Qemu. If you are interested in testing this feature with TDX VMs then it needs multiple series to set up the right test environment (including [2]). We haven't considered posting Qemu patches and it will be a while before we can get to it. [1] https://patchew.org/QEMU/20230914035117.3285885-1-xiaoyao...@intel.com/ [2] https://patchwork.kernel.org/project/kvm/cover/20231115071519.2864957-1-xiaoyao...@intel.com/
RE: [PATCH] usb-storage: scsiglue: Changing the command result
Hi Alan, USB storage maybe just has to say that the abort occurred. By setting the US_FLIDX_TIMED_OUT bit USB storage is getting signaled that the reason was time out and the command is being aborted. Now, it's arguable whether to change the implication of US_FLIDX_TIMED_OUT bit for scsi - USB storage bridge or for entire usb storage Or maybe scsi has decided to abort so it should override the result. Regards, Vishal -Original Message- From: Alan Stern [mailto:st...@rowland.harvard.edu] Sent: Tuesday, October 15, 2013 7:03 PM To: Vishal Annapurve Cc: Ming Lei; Linux Kernel Mailing List; linux-usb Subject: RE: [PATCH] usb-storage: scsiglue: Changing the command result On Tue, 15 Oct 2013, Vishal Annapurve wrote: > Hi Alan, > > This issue seems more related to the devices using SCSI protocol and > the changes otherwise will be at more places giving the same end result. > > I think as the comment says over the command_abort function, > intentional result change should only happen in case of timeout. usb-storage doesn't know _why_ a command was aborted; it knows only that the abort occurred. If you look carefully at the code, you'll see that the result is set to DID_ABORT only when the US_FLIDX_TIMED_OUT bit is set, and this bit gets set only when a SCSI abort occurs. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] usb-storage: scsiglue: Changing the command result
Hi Alan, What I wanted to say was If the bit US_FLIDX_TIMED_OUT can have more meanings than timed out then maybe it would be best to override the results after usb-storage is done with the command maybe in scsi layer itself who aborted it in the first place. My concern was that overriding the result in usb storage or scsi layers will have more side effects than doing it in scsiglue.c. And by scsi-usb storage bridge what I meant was specifically the code in scsiglue. Question about your last mail, do you want to change all the occurrences of DID_ABORT from usb-storage to DID_TIMEOUT? Regards, Vishal -Original Message- From: Alan Stern [mailto:st...@rowland.harvard.edu] Sent: Tuesday, October 15, 2013 10:22 PM To: Vishal Annapurve Cc: Ming Lei; Linux Kernel Mailing List; linux-usb Subject: RE: [PATCH] usb-storage: scsiglue: Changing the command result On Tue, 15 Oct 2013, Vishal Annapurve wrote: > Hi Alan, > > USB storage maybe just has to say that the abort occurred. By setting > the US_FLIDX_TIMED_OUT bit USB storage is getting signaled that the > reason was time out and the command is being aborted. No. By setting the US_FLIDX_TIMED_OUT bit, usb-storage indicates that the command was aborted. This doesn't indicate anything about the reason for the abort. (Maybe this bit's name wasn't chosen very well.) > Now, it's arguable whether to change the implication of > US_FLIDX_TIMED_OUT bit for scsi - USB storage bridge or for entire usb > storage I don't understand this. What's the difference between "scsi - USB storage bridge" and "entire usb storage"? Aren't they the same thing? > Or maybe scsi has > decided to abort so it should override the result. Of course the SCSI midlayer has decided to abort. That's the only way this bit can get set. But usb-storage doesn't know why SCSI decided to abort. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] usb-storage: scsiglue: Changing the command result
Hi, Attaching the new patch which will replace all the occurrences of DID_ABORT with DID_TIMOUT in USB Storage whenever it sees US_FLIDX_TIMED_OUT bit is set. Regards, Vishal -Original Message- From: Alan Stern [mailto:st...@rowland.harvard.edu] Sent: Friday, October 18, 2013 4:10 PM To: Vishal Annapurve Cc: Ming Lei; Linux Kernel Mailing List; linux-usb Subject: RE: [PATCH] usb-storage: scsiglue: Changing the command result On Fri, 18 Oct 2013, Vishal Annapurve wrote: > Hi Alan, > > What I wanted to say was If the bit US_FLIDX_TIMED_OUT can have more > meanings than timed out then maybe it would be best to override the > results after usb-storage is done with the command maybe in scsi layer > itself who aborted it in the first place. US_FLIDX_TIMED_OUT has one very specific meaning: command_abort() was called. Since command_abort() is the .eh_abort_handler routine, US_FLIDX_TIMED_OUT means that the SCSI layer decided to abort the command. Does the SCSI layer ever abort a command for any reason other than a timeout? If not, you may conclude that US_FLIDX_TIMED_OUT indicates a timeout. But if it does, you should not make this conclusion. > My concern was that overriding the result in usb storage or scsi > layers will have more side effects than doing it in scsiglue.c. > And by scsi-usb storage bridge what I meant was specifically the code > in scsiglue. > > Question about your last mail, do you want to change all the > occurrences of DID_ABORT from usb-storage to DID_TIMEOUT? Put it this way: There's no good reason for changing some of them but not all of them. And if you're going to change them at all, it makes no sense to first set the result to DID_ABORT and then change it to DID_TIMEOUT. You should simply set it to DID_TIMEOUT in the first place. Alan Stern usb_storage_change_abort_to_timeout.patch Description: usb_storage_change_abort_to_timeout.patch
RE: [PATCH] usb-storage: scsiglue: Changing the command result
Hi Alan, This issue seems more related to the devices using SCSI protocol and the changes otherwise will be at more places giving the same end result. I think as the comment says over the command_abort function, intentional result change should only happen in case of timeout. Regards, Vishal -Original Message- From: Alan Stern [mailto:st...@rowland.harvard.edu] Sent: Tuesday, October 15, 2013 5:55 PM To: Ming Lei Cc: Vishal Annapurve; Linux Kernel Mailing List; linux-usb Subject: Re: [PATCH] usb-storage: scsiglue: Changing the command result On Tue, 15 Oct 2013, Ming Lei wrote: > On Thu, Sep 26, 2013 at 7:35 PM, Vishal Annapurve > wrote: > > Hi, > > > > There was a recent commit in mainline for the scsi devices which do > > not respond properly to medium access command: > > > > commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8 > > > > [SCSI] Handle disk devices which can not process medium access > > commands We have experienced several devices which fail in a fashion > > we do not currently handle gracefully in SCSI. After a failure these > > devices will respond to the SCSI primary command set (INQUIRY, TEST > > UNIT READY, etc.) but any command accessing the storage medium will time > > out. > > > > I came across a USB drive which showed similar problem and what I > > see is usb storage is still not able to cope with such devices properly. > > > > The control flow downwards is like: > > scsi_times_out --> Setting cmd->result as DID_TIME_OUT > > scsi_error_handler scsi_unjam_host scsi_eh_abort_cmds > > command_abort(sets US_FLIDX_TIMED_OUT for us->dflags > > calls stop_transport, > > and waits for)usb_stor_control_thread > > (which is waiting for > > > > transport call to return inside > > > > usb_stor_invoke_transport) > > > > both usb_stor_control_thread and > > > > usb_stor_invoke_transport > > check for > > us->dflags timed_out bit and > > set the result > > as DID_ABORT > > and signal > > completion for command_abort > > to complete > > .. > > sd_eh_action > > checks for cmd->result and finds out that it's DID_ABORT rather than > > DID_TIME_OUT. > > > > This patch updates the command result to be TIME_OUT explicitly > > before returning from command_abort in scsiglue.c. > > > > I would like to know if this patch can work out for such USB Storage > > devices? What would be the better way to do the same? > > Looks your diagnose is correct, and patch should be doable, the only > side effect is that previous returned DID_ABORT in srb->result is > replaced with DID_TIME_OUT now. > > Another way is to implement .eh_timed_out callback to return different > error for the two failure, but looks not a big deal. Instead of overriding the result code, a better way to do this would simply be to change the places where srb->result is currently set to DID_ABORT << 16. Just make it DID_TIME_OUT << 16 instead. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] usb-storage: scsiglue: Changing the command result
Hi Greg, Attached are the patches. Regards, Vishal On Monday 09 December 2013 07:29 AM, Greg KH wrote: On Sat, Nov 16, 2013 at 12:23:50PM +0530, Vishal Annapurve wrote: Hi, Here are the updated patches: Can you please resend these in a format which I can apply them in without having to hand-edit all three of them? thanks, greg k-h 01-PATCH-usb-storage-Proper-cmd-result-assignment.patch.gz Description: GNU Zip compressed data 02-keucr-Proper-cmd-result-assignment.patch.gz Description: GNU Zip compressed data 03-usb-storage-uas-Proper-cmd-result-assignment.patch.gz Description: GNU Zip compressed data
Re: [PATCH] usb-storage: scsiglue: Changing the command result
Hi Greg, Does this look fine? I will send over other patches once you confirm. Patch set 1: - [PATCH 1/3] usb: storage: Proper cmd result assignment This change replaces DID_ABORT with DID_TIMEOUT as a command result whenever US_FLIDX_TIMED_OUT bit is set. This change is made to bring USB storage inline with a recent change: commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8 [SCSI] Handle disk devices which can not process medium access commands We have experienced several devices which fail in a fashion we do not currently handle gracefully in SCSI. After a failure these devices will respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.) but any command accessing the storage medium will time out. As the USB storage was setting command result as aborted rather than timed out, SCSI layer was not recognizing the above mentioned failure pattern. Signed-off-by: Vishal Annapurve --- drivers/usb/storage/cypress_atacb.c | 1 + drivers/usb/storage/isd200.c| 2 +- drivers/usb/storage/transport.c | 8 drivers/usb/storage/usb.c | 10 ++ 4 files changed, 12 insertions(+), 9 deletions(-) --- diff --git a/drivers/usb/storage/cypress_atacb.c b/drivers/usb/storage/cypress_atacb.c index 8514a2d..3477ca19 100644 --- a/drivers/usb/storage/cypress_atacb.c +++ b/drivers/usb/storage/cypress_atacb.c @@ -168,6 +168,7 @@ static void cypress_atacb_passthrough(struct scsi_cmnd *srb, struct us_data *us) */ if ((srb->result != (DID_ERROR << 16) && srb->result != (DID_ABORT << 16)) && +srb->result != (DID_TIME_OUT << 16) && save_cmnd[2] & 0x20) { struct scsi_eh_save ses; unsigned char regs[8]; diff --git a/drivers/usb/storage/isd200.c b/drivers/usb/storage/isd200.c index 599d8bf..ffd5d58 100644 --- a/drivers/usb/storage/isd200.c +++ b/drivers/usb/storage/isd200.c @@ -703,7 +703,7 @@ static void isd200_invoke_transport( struct us_data *us, /* abort processing: the bulk-only transport requires a reset * following an abort */ Handle_Abort: -srb->result = DID_ABORT << 16; +srb->result = DID_TIME_OUT << 16; /* permit the reset transfer to take place */ clear_bit(US_FLIDX_ABORTING, &us->dflags); diff --git a/drivers/usb/storage/transport.c b/drivers/usb/storage/transport.c index 22c7d43..6a90161 100644 --- a/drivers/usb/storage/transport.c +++ b/drivers/usb/storage/transport.c @@ -607,8 +607,8 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, struct us_data *us) * short-circuit all other processing */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { -usb_stor_dbg(us, "-- command was aborted\n"); -srb->result = DID_ABORT << 16; +usb_stor_dbg("-- command was aborted because of timeout\n"); +srb->result = DID_TIME_OUT << 16; goto Handle_Errors; } @@ -717,8 +717,8 @@ Retry_Sense: scsi_eh_restore_cmnd(srb, &ses); if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { -usb_stor_dbg(us, "-- auto-sense aborted\n"); -srb->result = DID_ABORT << 16; +usb_stor_dbg("-- auto-sense aborted due to timeout\n"); +srb->result = DID_TIME_OUT << 16; /* If SANE_SENSE caused this problem, disable it */ if (sense_size != US_SENSE_SIZE) { diff --git a/drivers/usb/storage/usb.c b/drivers/usb/storage/usb.c index 5c4fe07..04a68eb 100644 --- a/drivers/usb/storage/usb.c +++ b/drivers/usb/storage/usb.c @@ -325,7 +325,7 @@ static int usb_stor_control_thread(void * __us) /* has the command timed out *already* ? */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { -us->srb->result = DID_ABORT << 16; +us->srb->result = DID_TIME_OUT << 16; goto SkipForAbort; } @@ -379,7 +379,8 @@ static int usb_stor_control_thread(void * __us) scsi_lock(host); /* indicate that the command is done */ -if (us->srb->result != DID_ABORT << 16) { +if ((us->srb->result != DID_ABORT << 16) && +(us->srb->result != DID_TIME_OUT << 16)) { usb_stor_dbg(us, "scsi cmd done, result=0x%x\n", us->srb->result); us->srb->scsi_done(us->srb); @@ -390,8 +391,9 @@ SkipForAbort: /* If an abort request was received we need to signal that * the abort has finished. The proper test for this is - * the TIMED_OUT flag, not srb->result == DID_ABORT, because - * the timeout might have occurred after the command had + * the TIMED_OUT flag, not srb->result == DID_ABORT or + * srb->result =
Re: [PATCH] usb-storage: scsiglue: Changing the command result
Hi Greg, There was problem with my email client. Hope now it is fine. --- [PATCH 1/3] usb: storage: Proper cmd result assignment This change replaces DID_ABORT with DID_TIMEOUT as a command result whenever US_FLIDX_TIMED_OUT bit is set. This change is made to bring USB storage inline with a recent change: commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8 [SCSI] Handle disk devices which can not process medium access commands We have experienced several devices which fail in a fashion we do not currently handle gracefully in SCSI. After a failure these devices will respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.) but any command accessing the storage medium will time out. As the USB storage was setting command result as aborted rather than timed out, SCSI layer was not recognizing the above mentioned failure pattern. Signed-off-by: Vishal Annapurve --- drivers/usb/storage/cypress_atacb.c | 1 + drivers/usb/storage/isd200.c| 2 +- drivers/usb/storage/transport.c | 8 drivers/usb/storage/usb.c | 10 ++ 4 files changed, 12 insertions(+), 9 deletions(-) --- diff --git a/drivers/usb/storage/cypress_atacb.c b/drivers/usb/storage/cypress_atacb.c index 8514a2d..3477ca19 100644 --- a/drivers/usb/storage/cypress_atacb.c +++ b/drivers/usb/storage/cypress_atacb.c @@ -168,6 +168,7 @@ static void cypress_atacb_passthrough(struct scsi_cmnd *srb, struct us_data *us) */ if ((srb->result != (DID_ERROR << 16) && srb->result != (DID_ABORT << 16)) && + srb->result != (DID_TIME_OUT << 16) && save_cmnd[2] & 0x20) { struct scsi_eh_save ses; unsigned char regs[8]; diff --git a/drivers/usb/storage/isd200.c b/drivers/usb/storage/isd200.c index 599d8bf..ffd5d58 100644 --- a/drivers/usb/storage/isd200.c +++ b/drivers/usb/storage/isd200.c @@ -703,7 +703,7 @@ static void isd200_invoke_transport( struct us_data *us, /* abort processing: the bulk-only transport requires a reset * following an abort */ Handle_Abort: - srb->result = DID_ABORT << 16; + srb->result = DID_TIME_OUT << 16; /* permit the reset transfer to take place */ clear_bit(US_FLIDX_ABORTING, &us->dflags); diff --git a/drivers/usb/storage/transport.c b/drivers/usb/storage/transport.c index 22c7d43..6a90161 100644 --- a/drivers/usb/storage/transport.c +++ b/drivers/usb/storage/transport.c @@ -607,8 +607,8 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, struct us_data *us) * short-circuit all other processing */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { - usb_stor_dbg(us, "-- command was aborted\n"); - srb->result = DID_ABORT << 16; + usb_stor_dbg("-- command was aborted because of timeout\n"); + srb->result = DID_TIME_OUT << 16; goto Handle_Errors; } @@ -717,8 +717,8 @@ Retry_Sense: scsi_eh_restore_cmnd(srb, &ses); if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { - usb_stor_dbg(us, "-- auto-sense aborted\n"); - srb->result = DID_ABORT << 16; + usb_stor_dbg("-- auto-sense aborted due to timeout\n"); + srb->result = DID_TIME_OUT << 16; /* If SANE_SENSE caused this problem, disable it */ if (sense_size != US_SENSE_SIZE) { diff --git a/drivers/usb/storage/usb.c b/drivers/usb/storage/usb.c index 5c4fe07..04a68eb 100644 --- a/drivers/usb/storage/usb.c +++ b/drivers/usb/storage/usb.c @@ -325,7 +325,7 @@ static int usb_stor_control_thread(void * __us) /* has the command timed out *already* ? */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { - us->srb->result = DID_ABORT << 16; + us->srb->result = DID_TIME_OUT << 16; goto SkipForAbort; } @@ -379,7 +379,8 @@ static int usb_stor_control_thread(void * __us) scsi_lock(host); /* indicate that the command is done */ - if (us->srb->result != DID_ABORT << 16) { + if ((us->srb->result != DID_ABORT << 16) && + (us->srb->result != DID_TIME_OUT << 16)) { usb_stor_dbg(us, "scsi cmd done, result=0x%x\n", us->srb->result); us->srb->scsi_done(us->srb); @@ -390,8 +391,9 @@ SkipForAbort: /* If an abort request was received we need
RE: [PATCH] usb-storage: scsiglue: Changing the command result
Hi Alan, Here is the new patch: From: Vishal Annapurve Date: Sat, 26 Oct 2013 21:10:11 +0530 Subject: [PATCH] usb: storage: Proper cmd result assignment This change replaces DID_ABORT with DID_TIMEOUT as a command result whenever US_FLIDX_TIMED_OUT bit is set. This change is made to bring USB storage inline with a recent change: commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8 [SCSI] Handle disk devices which can not process medium access commands We have experienced several devices which fail in a fashion we do not currently handle gracefully in SCSI. After a failure these devices will respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.) but any command accessing the storage medium will time out. As the USB storage was setting command result as aborted rather than timed out, SCSI layer was not recognizing the above mentioned failure pattern. Change-Id: Ic58e2247fed11649f4dbea56382354ba2fe0be1b --- drivers/staging/keucr/transport.c | 6 +++--- drivers/staging/keucr/usb.c | 5 +++-- drivers/usb/storage/cypress_atacb.c | 1 + drivers/usb/storage/isd200.c| 2 +- drivers/usb/storage/transport.c | 8 drivers/usb/storage/usb.c | 9 + 6 files changed, 17 insertions(+), 14 deletions(-) diff --git a/drivers/staging/keucr/transport.c b/drivers/staging/keucr/transport.c index 1a8837d..ac0e3c2 100644 --- a/drivers/staging/keucr/transport.c +++ b/drivers/staging/keucr/transport.c @@ -312,7 +312,7 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, struct us_data *us) we need to short-circuit all other processing */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { /* pr_info("-- command was aborted\n"); */ - srb->result = DID_ABORT << 16; + srb->result = DID_TIME_OUT << 16; goto Handle_Errors; } @@ -371,7 +371,7 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, struct us_data *us) if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { /* pr_info("-- auto-sense aborted\n"); */ - srb->result = DID_ABORT << 16; + srb->result = DID_TIME_OUT << 16; goto Handle_Errors; } if (temp_result != USB_STOR_TRANSPORT_GOOD) { @@ -447,7 +447,7 @@ void ENE_stor_invoke_transport(struct scsi_cmnd *srb, struct us_data *us) we need to short-circuit all other processing */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { /* pr_info("-- command was aborted\n"); */ - srb->result = DID_ABORT << 16; + srb->result = DID_TIME_OUT << 16; goto Handle_Errors; } diff --git a/drivers/staging/keucr/usb.c b/drivers/staging/keucr/usb.c index 66aad3a..79405be 100644 --- a/drivers/staging/keucr/usb.c +++ b/drivers/staging/keucr/usb.c @@ -200,7 +200,7 @@ static int usb_stor_control_thread(void * __us) /* has the command timed out *already* ? */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { - us->srb->result = DID_ABORT << 16; + us->srb->result = DID_TIME_OUT << 16; goto SkipForAbort; } @@ -234,7 +234,8 @@ static int usb_stor_control_thread(void * __us) scsi_lock(host); /* indicate that the command is done */ - if (us->srb->result != DID_ABORT << 16) + if ((us->srb->result != DID_ABORT << 16) && + (us->srb->result != DID_TIME_OUT << 16)) { us->srb->scsi_done(us->srb); } diff --git a/drivers/usb/storage/cypress_atacb.c b/drivers/usb/storage/cypress_atacb.c index c844718..af24894 100644 --- a/drivers/usb/storage/cypress_atacb.c +++ b/drivers/usb/storage/cypress_atacb.c @@ -168,6 +168,7 @@ static void cypress_atacb_passthrough(struct scsi_cmnd *srb, struct us_data *us) */ if ((srb->result != (DID_ERROR << 16) && srb->result != (DID_ABORT << 16)) && + srb->result != (DID_TIME_OUT << 16) && save_cmnd[2] & 0x20) { struct scsi_eh_save ses; unsigned char regs[8]; diff --git a/drivers/usb/storage/isd200.c b/drivers/usb/storage/isd200.c index ffc4193..28b688b 100644 --- a/drivers/usb/storage/isd200.c +++ b/drivers/usb/storage/isd200.c @@ -705,7 +705,7 @@ static void isd200_invoke_transport( struct us_data *us, /* abort processing: the bulk-only transport requires a reset * following an abort *
RE: [PATCH] usb-storage: scsiglue: Changing the command result
Hi, Here are the updated patches: [PATCH 1/3] usb: storage: Proper cmd result assignment This change replaces DID_ABORT with DID_TIMEOUT as a command result whenever US_FLIDX_TIMED_OUT bit is set. This change is made to bring USB storage inline with a recent change: commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8 [SCSI] Handle disk devices which can not process medium access commands We have experienced several devices which fail in a fashion we do not currently handle gracefully in SCSI. After a failure these devices will respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.) but any command accessing the storage medium will time out. As the USB storage was setting command result as aborted rather than timed out, SCSI layer was not recognizing the above mentioned failure pattern. Signed-off-by: Vishal Annapurve --- drivers/usb/storage/cypress_atacb.c | 1 + drivers/usb/storage/isd200.c| 2 +- drivers/usb/storage/transport.c | 8 drivers/usb/storage/usb.c | 10 ++ 4 files changed, 12 insertions(+), 9 deletions(-) diff --git a/drivers/usb/storage/cypress_atacb.c b/drivers/usb/storage/cypress_atacb.c index 8514a2d..3477ca19 100644 --- a/drivers/usb/storage/cypress_atacb.c +++ b/drivers/usb/storage/cypress_atacb.c @@ -168,6 +168,7 @@ static void cypress_atacb_passthrough(struct scsi_cmnd *srb, struct us_data *us) */ if ((srb->result != (DID_ERROR << 16) && srb->result != (DID_ABORT << 16)) && + srb->result != (DID_TIME_OUT << 16) && save_cmnd[2] & 0x20) { struct scsi_eh_save ses; unsigned char regs[8]; diff --git a/drivers/usb/storage/isd200.c b/drivers/usb/storage/isd200.c index 599d8bf..ffd5d58 100644 --- a/drivers/usb/storage/isd200.c +++ b/drivers/usb/storage/isd200.c @@ -703,7 +703,7 @@ static void isd200_invoke_transport( struct us_data *us, /* abort processing: the bulk-only transport requires a reset * following an abort */ Handle_Abort: - srb->result = DID_ABORT << 16; + srb->result = DID_TIME_OUT << 16; /* permit the reset transfer to take place */ clear_bit(US_FLIDX_ABORTING, &us->dflags); diff --git a/drivers/usb/storage/transport.c b/drivers/usb/storage/transport.c index 22c7d43..6a90161 100644 --- a/drivers/usb/storage/transport.c +++ b/drivers/usb/storage/transport.c @@ -607,8 +607,8 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, struct us_data *us) * short-circuit all other processing */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { - usb_stor_dbg(us, "-- command was aborted\n"); - srb->result = DID_ABORT << 16; + usb_stor_dbg("-- command was aborted because of timeout\n"); + srb->result = DID_TIME_OUT << 16; goto Handle_Errors; } @@ -717,8 +717,8 @@ Retry_Sense: scsi_eh_restore_cmnd(srb, &ses); if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { - usb_stor_dbg(us, "-- auto-sense aborted\n"); - srb->result = DID_ABORT << 16; + usb_stor_dbg("-- auto-sense aborted due to timeout\n"); + srb->result = DID_TIME_OUT << 16; /* If SANE_SENSE caused this problem, disable it */ if (sense_size != US_SENSE_SIZE) { diff --git a/drivers/usb/storage/usb.c b/drivers/usb/storage/usb.c index 5c4fe07..04a68eb 100644 --- a/drivers/usb/storage/usb.c +++ b/drivers/usb/storage/usb.c @@ -325,7 +325,7 @@ static int usb_stor_control_thread(void * __us) /* has the command timed out *already* ? */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { - us->srb->result = DID_ABORT << 16; + us->srb->result = DID_TIME_OUT << 16; goto SkipForAbort; } @@ -379,7 +379,8 @@ static int usb_stor_control_thread(void * __us) scsi_lock(host); /* indicate that the command is done */ - if (us->srb->result != DID_ABORT << 16) { + if ((us->srb->result != DID_ABORT << 16) && + (us->srb->result != DID_TIME_OUT << 16)) { usb_stor_dbg(us, "scsi cmd done, result=0x%x\n", us->srb->result); us->srb->scsi_done(us->srb); @@ -390,8 +391,9 @@ SkipForAbort: /* If an abort request was received we need to signal that *
RE: [PATCH] usb-storage: scsiglue: Changing the command result
Hi, [PATCH 2/3] keucr: Proper cmd result assignment This change replaces DID_ABORT with DID_TIMEOUT as a command result whenever US_FLIDX_TIMED_OUT bit is set. This change is made to bring USB storage inline with a recent change: commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8 [SCSI] Handle disk devices which can not process medium access commands We have experienced several devices which fail in a fashion we do not currently handle gracefully in SCSI. After a failure these devices will respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.) but any command accessing the storage medium will time out. As the USB storage was setting command result as aborted rather than timed out, SCSI layer was not recognizing the above mentioned failure pattern. Signed-off-by: Vishal Annapurve --- drivers/staging/keucr/transport.c | 6 +++--- drivers/staging/keucr/usb.c | 5 +++-- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/staging/keucr/transport.c b/drivers/staging/keucr/transport.c index aeb2186..6f61c20 100644 --- a/drivers/staging/keucr/transport.c +++ b/drivers/staging/keucr/transport.c @@ -353,7 +353,7 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, struct us_data *us) we need to short-circuit all other processing */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { /* pr_info("-- command was aborted\n"); */ - srb->result = DID_ABORT << 16; + srb->result = DID_TIME_OUT << 16; goto Handle_Errors; } @@ -412,7 +412,7 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, struct us_data *us) if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { /* pr_info("-- auto-sense aborted\n"); */ - srb->result = DID_ABORT << 16; + srb->result = DID_TIME_OUT << 16; goto Handle_Errors; } if (temp_result != USB_STOR_TRANSPORT_GOOD) { @@ -488,7 +488,7 @@ void ENE_stor_invoke_transport(struct scsi_cmnd *srb, struct us_data *us) we need to short-circuit all other processing */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { /* pr_info("-- command was aborted\n"); */ - srb->result = DID_ABORT << 16; + srb->result = DID_TIME_OUT << 16; goto Handle_Errors; } diff --git a/drivers/staging/keucr/usb.c b/drivers/staging/keucr/usb.c index a84ee63..26ec64c 100644 --- a/drivers/staging/keucr/usb.c +++ b/drivers/staging/keucr/usb.c @@ -181,7 +181,7 @@ static int usb_stor_control_thread(void *__us) /* has the command timed out *already* ? */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { - us->srb->result = DID_ABORT << 16; + us->srb->result = DID_TIME_OUT << 16; goto SkipForAbort; } @@ -209,7 +209,8 @@ static int usb_stor_control_thread(void *__us) scsi_lock(host); /* indicate that the command is done */ - if (us->srb->result != DID_ABORT << 16) { + if ((us->srb->result != DID_ABORT << 16) && + (us->srb->result != DID_TIME_OUT << 16)) { us->srb->scsi_done(us->srb); } else { SkipForAbort: -- 1.8.4.2 Regards, Vishal -Original Message- From: Vishal Annapurve Sent: Saturday, November 16, 2013 12:24 PM To: 'Alan Stern' Cc: Ming Lei; Linux Kernel Mailing List; linux-usb Subject: RE: [PATCH] usb-storage: scsiglue: Changing the command result Hi, Here are the updated patches: [PATCH 1/3] usb: storage: Proper cmd result assignment This change replaces DID_ABORT with DID_TIMEOUT as a command result whenever US_FLIDX_TIMED_OUT bit is set. This change is made to bring USB storage inline with a recent change: commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8 [SCSI] Handle disk devices which can not process medium access commands We have experienced several devices which fail in a fashion we do not currently handle gracefully in SCSI. After a failure these devices will respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.) but any command accessing the storage medium will time out. As the USB storage was setting command result as aborted rather than timed out, SCSI layer was not recognizing the above mentioned failure pattern. Signed-off-by: Vishal Annapurve --- drivers/usb/storage/cypress_atacb.c | 1 + drivers/usb/storage/isd200.c| 2 +- drivers/usb/storage/transport.c | 8 drivers/usb/storage/usb.c | 10 ++ 4 files changed, 12 insert
RE: [PATCH] usb-storage: scsiglue: Changing the command result
Hi, [PATCH 3/3] usb: storage: uas: Proper cmd result assignment This change replaces DID_ABORT with DID_TIMEOUT as a command result whenever US_FLIDX_TIMED_OUT bit is set. This change is made to bring USB storage inline with a recent change: commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8 [SCSI] Handle disk devices which can not process medium access commands We have experienced several devices which fail in a fashion we do not currently handle gracefully in SCSI. After a failure these devices will respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.) but any command accessing the storage medium will time out. As the USB storage was setting command result as aborted rather than timed out, SCSI layer was not recognizing the above mentioned failure pattern. Signed-off-by: Vishal Annapurve --- drivers/usb/storage/uas.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c index d966b59..3a4a44e 100644 --- a/drivers/usb/storage/uas.c +++ b/drivers/usb/storage/uas.c @@ -271,7 +271,10 @@ static int uas_try_complete(struct scsi_cmnd *cmnd, const char *caller) usb_free_urb(cmdinfo->data_out_urb); if (cmdinfo->state & COMMAND_ABORTED) { scmd_printk(KERN_INFO, cmnd, "abort completed\n"); - cmnd->result = DID_ABORT << 16; + /* Better to check for US_FLIDX_TIMED_OUT +* bit settings. +*/ + cmnd->result = DID_TIME_OUT << 16; } cmnd->scsi_done(cmnd); return 0; -- 1.8.4.2 Regards, Vishal -Original Message- From: Vishal Annapurve Sent: Saturday, November 16, 2013 12:25 PM To: 'Alan Stern' Cc: 'Ming Lei'; 'Linux Kernel Mailing List'; 'linux-usb' Subject: RE: [PATCH] usb-storage: scsiglue: Changing the command result Hi, [PATCH 2/3] keucr: Proper cmd result assignment This change replaces DID_ABORT with DID_TIMEOUT as a command result whenever US_FLIDX_TIMED_OUT bit is set. This change is made to bring USB storage inline with a recent change: commit18a4d0a22ed6c54b67af7718c305cd010f09ddf8 [SCSI] Handle disk devices which can not process medium access commands We have experienced several devices which fail in a fashion we do not currently handle gracefully in SCSI. After a failure these devices will respond to the SCSI primary command set (INQUIRY, TEST UNIT READY, etc.) but any command accessing the storage medium will time out. As the USB storage was setting command result as aborted rather than timed out, SCSI layer was not recognizing the above mentioned failure pattern. Signed-off-by: Vishal Annapurve --- drivers/staging/keucr/transport.c | 6 +++--- drivers/staging/keucr/usb.c | 5 +++-- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/staging/keucr/transport.c b/drivers/staging/keucr/transport.c index aeb2186..6f61c20 100644 --- a/drivers/staging/keucr/transport.c +++ b/drivers/staging/keucr/transport.c @@ -353,7 +353,7 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, struct us_data *us) we need to short-circuit all other processing */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { /* pr_info("-- command was aborted\n"); */ - srb->result = DID_ABORT << 16; + srb->result = DID_TIME_OUT << 16; goto Handle_Errors; } @@ -412,7 +412,7 @@ void usb_stor_invoke_transport(struct scsi_cmnd *srb, struct us_data *us) if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { /* pr_info("-- auto-sense aborted\n"); */ - srb->result = DID_ABORT << 16; + srb->result = DID_TIME_OUT << 16; goto Handle_Errors; } if (temp_result != USB_STOR_TRANSPORT_GOOD) { @@ -488,7 +488,7 @@ void ENE_stor_invoke_transport(struct scsi_cmnd *srb, struct us_data *us) we need to short-circuit all other processing */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { /* pr_info("-- command was aborted\n"); */ - srb->result = DID_ABORT << 16; + srb->result = DID_TIME_OUT << 16; goto Handle_Errors; } diff --git a/drivers/staging/keucr/usb.c b/drivers/staging/keucr/usb.c index a84ee63..26ec64c 100644 --- a/drivers/staging/keucr/usb.c +++ b/drivers/staging/keucr/usb.c @@ -181,7 +181,7 @@ static int usb_stor_control_thread(void *__us) /* has the command timed out *already* ? */ if (test_bit(US_FLIDX_TIMED_OUT, &us->dflags)) { - us->srb->result = DID_ABORT << 16; +
Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
On Wed, Oct 16, 2024 at 2:20 PM David Hildenbrand wrote: > > >> I also don't know how you treat things like folio_test_hugetlb() on > >> possible assumptions that the VMA must be a hugetlb vma. I'd confess I > >> didn't yet check the rest of the patchset yet - reading a large series > >> without a git tree is sometimes challenging to me. > >> > > > > I'm thinking to basically never involve folio_test_hugetlb(), and the > > VMAs used by guest_memfd will also never be a HugeTLB VMA. That's > > because only the HugeTLB allocator is used, but by the time the folio is > > mapped to userspace, it would have already have been split. After the > > page is split, the folio loses its HugeTLB status. guest_memfd folios > > will never be mapped to userspace while they still have a HugeTLB > > status. > > We absolutely must convert these hugetlb folios to non-hugetlb folios. > > That is one of the reasons why I raised at LPC that we should focus on > leaving hugetlb out of the picture and rather have a global pool, and > the option to move folios from the global pool back and forth to hugetlb > or to guest_memfd. > > How exactly that would look like is TBD. > > For the time being, I think we could add a "hack" to take hugetlb folios > from hugetlb for our purposes, but we would absolutely have to convert > them to non-hugetlb folios, especially when we split them to small > folios and start using the mapcount. But it doesn't feel quite clean. As hugepage folios need to be split up in order to support backing CoCo VMs with hugepages, I would assume any folio based hugepage memory allocation will need to go through split/merge cycles through the guest memfd lifetime. Plan through next RFC series is to abstract out the hugetlb folio management within guest_memfd so that any hugetlb specific logic is cleanly separated out and allows guest memfd to allocate memory from other hugepage allocators in the future. > > Simply starting with a separate global pool (e.g., boot-time allocation > similar to as done by hugetlb, or CMA) might be cleaner, and a lot of > stuff could be factored out from hugetlb code to achieve that. I am not sure if a separate global pool necessarily solves all the issues here unless we come up with more concrete implementation details. One of the concerns was the ability of implementing/retaining HVO while transferring memory between the separate global pool and hugetlb pool i.e. whether it can seamlessly serve all hugepage users on the host. Another question could be whether the separate pool/allocator simplifies the split/merge operations at runtime. > > -- > Cheers, > > David / dhildenb >
Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
On Thu, Oct 17, 2024 at 10:46 PM Jason Gunthorpe wrote: > > On Thu, Oct 17, 2024 at 07:11:46PM +0200, David Hildenbrand wrote: > > On 17.10.24 18:47, Jason Gunthorpe wrote: > > > On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote: > > > > > > > My question was more torwards whether gmemfd could still expose the > > > > possibility to be used in VA forms to other modules that may not support > > > > fd+offsets yet. > > > > > > I keep hearing they don't want to support page pinning on a guestmemfd > > > mapping, so VA based paths could not work. > > > > For shared pages it absolutely must work. That's what I keep hearing :) > > Oh that's confusing. I assume non longterm pins desired on shared > pages though?? > > Jason For hugepage support to work, longterm pins on guest private pages need to be avoided [1], If this somehow was the cause of any confusion here. [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf (slide 12)
Re: [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup
On Wed, Sep 11, 2024 at 1:44 AM Ackerley Tng wrote: > > ... > +} > + > +static void kvm_gmem_evict_inode(struct inode *inode) > +{ > + u64 flags = (u64)inode->i_private; > + > + if (flags & KVM_GUEST_MEMFD_HUGETLB) > + kvm_gmem_hugetlb_teardown(inode); > + else > + truncate_inode_pages_final(inode->i_mapping); > + > + clear_inode(inode); > +} > + > static const struct super_operations kvm_gmem_super_operations = { > .statfs = simple_statfs, > + .evict_inode= kvm_gmem_evict_inode, Ackerley, can we use free_inode[1] callback to free any special metadata associated with the inode instead of relying on super_operations? [1] https://elixir.bootlin.com/linux/v6.11/source/include/linux/fs.h#L719 > ... > > if (size <= 0 || !PAGE_ALIGNED(size)) > return -EINVAL; > -- > 2.46.0.598.g6f2099f65c-goog >
Re: [PATCH v7 0/3] x86/tdx: Fix HLT logic execution for TDX VMs
On Thu, Feb 27, 2025 at 5:44 PM Vishal Annapurve wrote: > > Direct HLT instruction execution causes #VEs for TDX VMs which is routed > to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow > so IRQs need to remain disabled until the TDCALL to ensure that pending > IRQs are correctly treated as wake events. As per current TDX spec, HLT > #VE handler doesn't have access to interruptibility state to selectively > enable interrupts, it ends up enabling interrupts during #VE handling > before the TDCALL is executed. > > Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") > effectively solved this issue for idle routines by defining TDX specific > idle routine which directly invokes TDCALL while keeping interrupts > disabled, but missed handling arch_safe_halt(). This series intends to fix > arch_safe_halt() execution for TDX VMs. > > Changes introduced by the series include: > - Move *halt() variants outside CONFIG_PARAVIRT_XXL and under > CONFIG_PARAVIRT [1]. > - Add explicit dependency on CONFIG_PARAVIRT for TDX VMs. > - Route "sti; hlt" sequences via tdx_safe_halt() for reliability. > - Route "hlt" sequences via tdx_halt() to avoid unnecessary #VEs. > - Warn and fail emulation if HLT #VE emulation executes with interrupts > enabled. > > Changes since v6: > 1) Addressed Kirills's comments. > 2) Fixed a build failure. > > v6: > https://lore.kernel.org/lkml/20250225004704.603652-1-vannapu...@google.com/ > > Kirill A. Shutemov (1): > x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT > > Vishal Annapurve (2): > x86/tdx: Fix arch_safe_halt() execution for TDX VMs > x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling > > arch/x86/Kconfig | 1 + > arch/x86/coco/tdx/tdx.c | 34 ++- > arch/x86/include/asm/irqflags.h | 40 +++ > arch/x86/include/asm/paravirt.h | 20 +++--- > arch/x86/include/asm/paravirt_types.h | 3 +- > arch/x86/include/asm/tdx.h| 4 +-- > arch/x86/kernel/paravirt.c| 14 ++ > arch/x86/kernel/process.c | 2 +- > 8 files changed, 78 insertions(+), 40 deletions(-) > > -- > 2.48.1.711.g2feabab25a-goog > Just a ping for review of this series if anything more needs to be discussed.
Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao wrote: > > On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote: > > > > > > On 4/24/2025 12:25 PM, Yan Zhao wrote: > > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote: > > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote: > > >>> Yan Zhao writes: > > >>> > > On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote: > > > +/* > > > + * Allocates and then caches a folio in the filemap. Returns a folio > > > with > > > + * refcount of 2: 1 after allocation, and 1 taken by the filemap. > > > + */ > > > +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct > > > inode *inode, > > > + pgoff_t > > > index) > > > +{ > > > + struct kvm_gmem_hugetlb *hgmem; > > > + pgoff_t aligned_index; > > > + struct folio *folio; > > > + int nr_pages; > > > + int ret; > > > + > > > + hgmem = kvm_gmem_hgmem(inode); > > > + folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool); > > > + if (IS_ERR(folio)) > > > + return folio; > > > + > > > + nr_pages = 1UL << huge_page_order(hgmem->h); > > > + aligned_index = round_down(index, nr_pages); > > Maybe a gap here. > > > > When a guest_memfd is bound to a slot where slot->base_gfn is not > > aligned to > > 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the > > corresponding GFN is not 2M/1G aligned. > > >>> > > >>> Thanks for looking into this. > > >>> > > >>> In 1G page support for guest_memfd, the offset and size are always > > >>> hugepage aligned to the hugepage size requested at guest_memfd creation > > >>> time, and it is true that when binding to a memslot, slot->base_gfn and > > >>> slot->npages may not be hugepage aligned. > > >>> > > > > However, TDX requires that private huge pages be 2M aligned in GFN. > > > > >>> > > >>> IIUC other factors also contribute to determining the mapping level in > > >>> the guest page tables, like lpage_info and .private_max_mapping_level() > > >>> in kvm_x86_ops. > > >>> > > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info > > >>> will track that and not allow faulting into guest page tables at higher > > >>> granularity. > > >> > > >> lpage_info only checks the alignments of slot->base_gfn and > > >> slot->base_gfn + npages. e.g., > > >> > > >> if slot->base_gfn is 8K, npages is 8M, then for this slot, > > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K); > > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K); > > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K); > > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K); > > > > Should it be? > > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M); > > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M); > > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M); > > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M); > > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K); > Right. Good catch. Thanks! > > Let me update the example as below: > slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range) > > lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB); > lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB); > lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB); > lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB); > lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB); > > lpage_info indicates that a 2MB mapping is alllowed to cover GPA 4MB and GPA > 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two > 2MB folios, whose physical addresses may not be contiguous. > > Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB, > KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB). > However, guest_memfd just allocates the same 2MB folio for both faults. > > > > > > >> > > >> - > > >> | | | | | | | | | > > >> 8K2M 2M+8K 4M 4M+8K 6M 6M+8K 8M 8M+8K > > >> > > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge > > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd. > > >> So, guest_memfd allocates the same huge folio of 2M order for them. > > > Sorry, sent too fast this morning. The example is not right. The correct > > > one is: > > > > > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So, > > > KVM will create a 2M mapping for them. > > > > > > However, in guest_memfd, GFN 4M and GFN 4M+16
Re: [PATCH REGRESSION-FIX] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT
> ... > 6.12.23 fails to build with the following error if CONFIG_XEN_PV is > not set: > > arch/x86/coco/tdx/tdx.c: In function ‘tdx_early_init’: > arch/x86/coco/tdx/tdx.c:1080:19: error: ‘struct pv_irq_ops’ has no member > named ‘safe_halt’ > 1080 | pv_ops.irq.safe_halt = tdx_safe_halt; > | ^ > arch/x86/coco/tdx/tdx.c:1081:19: error: ‘struct pv_irq_ops’ has no member > named ‘halt’ > 1081 | pv_ops.irq.halt = tdx_halt; > | ^ > > This is because XEN_PV selects PARAVIRT_XXL, and 'safe_halt' and > 'halt' are only defined for pv_irq_ops if PARAVIRT_XXL is defined. > > The build breakage was introduced in 6.12.23 by stable commit > 805e3ce5e0e3 which is a backport of 9f98a4f4e721 ("x86/tdx: Fix > arch_safe_halt() execution for TDX VMs"). > > Consider picking up upstream commit 22cc5ca5de52 ("x86/paravirt: > Move halt paravirt calls under CONFIG_PARAVIRT") for stable 6.12.y > which fixes the build regression by moving 'safe_halt' and 'halt' > out from under the PARAVIRT_XXL config. > > This patch is 22cc5ca5de52 backported to 6.12.23. There were a > couple of merge conflicts due to the missing upstream commits below: > Thanks Brett for looking into this. I have posted a similar patch [1] and I see that the issue you reported is also being discussed already [2]. [1] https://lore.kernel.org/stable/20250408132341.4175633-1-vannapu...@google.com/ [2] https://lore.kernel.org/stable/20250410180423.GA3430900@ax162/
Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
On Sun, Apr 27, 2025 at 6:08 PM Yan Zhao wrote: > > On Fri, Apr 25, 2025 at 03:45:20PM -0700, Ackerley Tng wrote: > > Yan Zhao writes: > > ... > > > > > > For some memory region, e.g., "pc.ram", it's divided into 2 parts: > > > - one with offset 0, size 0x8000(2G), > > > positioned at GPA 0, which is below GPA 4G; > > > - one with offset 0x8000(2G), size 0x8000(2G), > > > positioned at GPA 0x1(4G), which is above GPA 4G. > > > > > > For the second part, its slot->base_gfn is 0x1, while > > > slot->gmem.pgoff > > > is 0x8000. > > > > > > > Nope I don't mean to enforce that they are equal, we just need the > > offsets within the page to be equal. > > > > I edited Vishal's code snippet, perhaps it would help explain better: > > > > page_size is the size of the hugepage, so in our example, > > > > page_size = SZ_2M; > > page_mask = ~(page_size - 1); > page_mask = page_size - 1 ? > > > offset_within_page = slot->gmem.pgoff & page_mask; > > gfn_within_page = (slot->base_gfn << PAGE_SHIFT) & page_mask; > > > > We will enforce that > > > > offset_within_page == gfn_within_page; > For "pc.ram", if it has 2.5G below 4G, it would be configured as follows > - slot 1: slot->gmem.pgoff=0, base GPA 0, size=2.5G > - slot 2: slot->gmem.pgoff=2.5G, base GPA 4G, size=1.5G > > When binding these two slots to the same guest_memfd created with flag > KVM_GUEST_MEMFD_HUGE_1GB: > - binding the 1st slot will succeed; > - binding the 2nd slot will fail. > > What options does userspace have in this scenario? Userspace can create new gmem files that have aligned offsets. But I see your point, enforcing alignment at binding time will lead to wastage of memory. i.e. Your example above could be reworked to have: - slot 1: slot->gmem.pgoff=0, base GPA 0, size=2.5G, gmem_fd = x, gmem_size = 3G - slot 2: slot->gmem.pgoff=0, base GPA 4G, size=1.5G, gmem_fd = y, gmem_size = 2G This will waste 1G of memory as gmem files will have to be hugepage aligned. > It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff > isn't ideal either. > > What about something similar as below? > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > index d2feacd14786..87c33704a748 100644 > --- a/virt/kvm/guest_memfd.c > +++ b/virt/kvm/guest_memfd.c > @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct > kvm_memory_slot *slot, > } > > *pfn = folio_file_pfn(folio, index); > - if (max_order) > - *max_order = folio_order(folio); > + if (max_order) { > + int order; > + > + order = folio_order(folio); > + > + while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & > ((1 << order) - 1))) This sounds better. Userspace will need to avoid this in general or keep such ranges short so that most of the guest memory ranges can be mapped at hugepage granularity. So maybe a pr_warn could be spewed during binding that the alignment is not optimal. > + order--; > + > + *max_order = order; > + } > > *is_prepared = folio_test_uptodate(folio); > return folio; > > > > >> Adding checks at binding time will allow hugepage-unaligned offsets (to > > >> be at parity with non-guest_memfd backing memory) but still fix this > > >> issue. > > >> > > >> lpage_info will make sure that ranges near the bounds will be > > >> fragmented, but the hugepages in the middle will still be mappable as > > >> hugepages. > > >> > > >> [1] > > >> https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg
[PATCH v7 0/3] x86/tdx: Fix HLT logic execution for TDX VMs
Direct HLT instruction execution causes #VEs for TDX VMs which is routed to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow so IRQs need to remain disabled until the TDCALL to ensure that pending IRQs are correctly treated as wake events. As per current TDX spec, HLT #VE handler doesn't have access to interruptibility state to selectively enable interrupts, it ends up enabling interrupts during #VE handling before the TDCALL is executed. Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") effectively solved this issue for idle routines by defining TDX specific idle routine which directly invokes TDCALL while keeping interrupts disabled, but missed handling arch_safe_halt(). This series intends to fix arch_safe_halt() execution for TDX VMs. Changes introduced by the series include: - Move *halt() variants outside CONFIG_PARAVIRT_XXL and under CONFIG_PARAVIRT [1]. - Add explicit dependency on CONFIG_PARAVIRT for TDX VMs. - Route "sti; hlt" sequences via tdx_safe_halt() for reliability. - Route "hlt" sequences via tdx_halt() to avoid unnecessary #VEs. - Warn and fail emulation if HLT #VE emulation executes with interrupts enabled. Changes since v6: 1) Addressed Kirills's comments. 2) Fixed a build failure. v6: https://lore.kernel.org/lkml/20250225004704.603652-1-vannapu...@google.com/ Kirill A. Shutemov (1): x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT Vishal Annapurve (2): x86/tdx: Fix arch_safe_halt() execution for TDX VMs x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling arch/x86/Kconfig | 1 + arch/x86/coco/tdx/tdx.c | 34 ++- arch/x86/include/asm/irqflags.h | 40 +++ arch/x86/include/asm/paravirt.h | 20 +++--- arch/x86/include/asm/paravirt_types.h | 3 +- arch/x86/include/asm/tdx.h| 4 +-- arch/x86/kernel/paravirt.c| 14 ++ arch/x86/kernel/process.c | 2 +- 8 files changed, 78 insertions(+), 40 deletions(-) -- 2.48.1.711.g2feabab25a-goog
[PATCH v7 2/3] x86/tdx: Fix arch_safe_halt() execution for TDX VMs
Direct HLT instruction execution causes #VEs for TDX VMs which is routed to hypervisor via TDCALL. If HLT is executed in STI-shadow, resulting #VE handler will enable interrupts before TDCALL is routed to hypervisor leading to missed wakeup events, as current TDX spec doesn't expose interruptibility state information to allow #VE handler to selectively enable interrupts. Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") prevented the idle routines from executing HLT instruction in STI-shadow. But it missed the paravirt routine which can be reached via this path as an example: kvm_wait() => safe_halt() => raw_safe_halt() => arch_safe_halt() => irq.safe_halt() => pv_native_safe_halt() To reliably handle arch_safe_halt() for TDX VMs, introduce explicit dependency on CONFIG_PARAVIRT and override paravirt halt()/safe_halt() routines with TDX-safe versions that execute direct TDCALL and needed interrupt flag updates. Executing direct TDCALL brings in additional benefit of avoiding HLT related #VEs altogether. Cc: sta...@vger.kernel.org Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") Reviewed-by: Kirill A. Shutemov Signed-off-by: Vishal Annapurve --- arch/x86/Kconfig | 1 + arch/x86/coco/tdx/tdx.c| 26 +- arch/x86/include/asm/tdx.h | 4 ++-- arch/x86/kernel/process.c | 2 +- 4 files changed, 29 insertions(+), 4 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index be2c311f5118..933c046e8966 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -902,6 +902,7 @@ config INTEL_TDX_GUEST depends on X86_64 && CPU_SUP_INTEL depends on X86_X2APIC depends on EFI_STUB + depends on PARAVIRT select ARCH_HAS_CC_PLATFORM select X86_MEM_ENCRYPT select X86_MCE diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index 32809a06dab4..6aad910d119d 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -398,7 +399,7 @@ static int handle_halt(struct ve_info *ve) return ve_instr_len(ve); } -void __cpuidle tdx_safe_halt(void) +void __cpuidle tdx_halt(void) { const bool irq_disabled = false; @@ -409,6 +410,16 @@ void __cpuidle tdx_safe_halt(void) WARN_ONCE(1, "HLT instruction emulation failed\n"); } +static void __cpuidle tdx_safe_halt(void) +{ + tdx_halt(); + /* +* "__cpuidle" section doesn't support instrumentation, so stick +* with raw_* variant that avoids tracing hooks. +*/ + raw_local_irq_enable(); +} + static int read_msr(struct pt_regs *regs, struct ve_info *ve) { struct tdx_module_args args = { @@ -1109,6 +1120,19 @@ void __init tdx_early_init(void) x86_platform.guest.enc_kexec_begin = tdx_kexec_begin; x86_platform.guest.enc_kexec_finish = tdx_kexec_finish; + /* +* Avoid "sti;hlt" execution in TDX guests as HLT induces a #VE that +* will enable interrupts before HLT TDCALL invocation if executed +* in STI-shadow, possibly resulting in missed wakeup events. +* +* Modify all possible HLT execution paths to use TDX specific routines +* that directly execute TDCALL and toggle the interrupt state as +* needed after TDCALL completion. This also reduces HLT related #VEs +* in addition to having a reliable halt logic execution. +*/ + pv_ops.irq.safe_halt = tdx_safe_halt; + pv_ops.irq.halt = tdx_halt; + /* * TDX intercepts the RDMSR to read the X2APIC ID in the parallel * bringup low level code. That raises #VE which cannot be handled diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index b4b16dafd55e..40f9a97371a9 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -58,7 +58,7 @@ void tdx_get_ve_info(struct ve_info *ve); bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve); -void tdx_safe_halt(void); +void tdx_halt(void); bool tdx_early_handle_ve(struct pt_regs *regs); @@ -72,7 +72,7 @@ void __init tdx_dump_td_ctls(u64 td_ctls); #else static inline void tdx_early_init(void) { }; -static inline void tdx_safe_halt(void) { }; +static inline void tdx_halt(void) { }; static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; } diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 6da6769d7254..d11956a178df 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -934,7 +934,7 @@ void __init select_idle_routine(void) static_call_update(x86_idle, mwait_idle); } else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) { pr_
[PATCH v7 1/3] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT
From: "Kirill A. Shutemov" CONFIG_PARAVIRT_XXL is mainly defined/used by XEN PV guests. For other VM guest types, features supported under CONFIG_PARAVIRT are self sufficient. CONFIG_PARAVIRT mainly provides support for TLB flush operations and time related operations. For TDX guest as well, paravirt calls under CONFIG_PARVIRT meets most of its requirement except the need of HLT and SAFE_HLT paravirt calls, which is currently defined under CONFIG_PARAVIRT_XXL. Since enabling CONFIG_PARAVIRT_XXL is too bloated for TDX guest like platforms, move HLT and SAFE_HLT paravirt calls under CONFIG_PARAVIRT. Moving HLT and SAFE_HLT paravirt calls are not fatal and should not break any functionality for current users of CONFIG_PARAVIRT. Cc: sta...@vger.kernel.org Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") Co-developed-by: Kuppuswamy Sathyanarayanan Signed-off-by: Kuppuswamy Sathyanarayanan Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Reviewed-by: Tony Luck Reviewed-by: Juergen Gross Signed-off-by: Vishal Annapurve --- arch/x86/include/asm/irqflags.h | 40 +++ arch/x86/include/asm/paravirt.h | 20 +++--- arch/x86/include/asm/paravirt_types.h | 3 +- arch/x86/kernel/paravirt.c| 14 ++ 4 files changed, 41 insertions(+), 36 deletions(-) diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h index cf7fc2b8e3ce..1c2db11a2c3c 100644 --- a/arch/x86/include/asm/irqflags.h +++ b/arch/x86/include/asm/irqflags.h @@ -76,6 +76,28 @@ static __always_inline void native_local_irq_restore(unsigned long flags) #endif +#ifndef CONFIG_PARAVIRT +#ifndef __ASSEMBLY__ +/* + * Used in the idle loop; sti takes one instruction cycle + * to complete: + */ +static __always_inline void arch_safe_halt(void) +{ + native_safe_halt(); +} + +/* + * Used when interrupts are already enabled or to + * shutdown the processor: + */ +static __always_inline void halt(void) +{ + native_halt(); +} +#endif /* __ASSEMBLY__ */ +#endif /* CONFIG_PARAVIRT */ + #ifdef CONFIG_PARAVIRT_XXL #include #else @@ -97,24 +119,6 @@ static __always_inline void arch_local_irq_enable(void) native_irq_enable(); } -/* - * Used in the idle loop; sti takes one instruction cycle - * to complete: - */ -static __always_inline void arch_safe_halt(void) -{ - native_safe_halt(); -} - -/* - * Used when interrupts are already enabled or to - * shutdown the processor: - */ -static __always_inline void halt(void) -{ - native_halt(); -} - /* * For spinlocks, etc: */ diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 041aff51eb50..29e7331a0c98 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -107,6 +107,16 @@ static inline void notify_page_enc_status_changed(unsigned long pfn, PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc); } +static __always_inline void arch_safe_halt(void) +{ + PVOP_VCALL0(irq.safe_halt); +} + +static inline void halt(void) +{ + PVOP_VCALL0(irq.halt); +} + #ifdef CONFIG_PARAVIRT_XXL static inline void load_sp0(unsigned long sp0) { @@ -170,16 +180,6 @@ static inline void __write_cr4(unsigned long x) PVOP_VCALL1(cpu.write_cr4, x); } -static __always_inline void arch_safe_halt(void) -{ - PVOP_VCALL0(irq.safe_halt); -} - -static inline void halt(void) -{ - PVOP_VCALL0(irq.halt); -} - static inline u64 paravirt_read_msr(unsigned msr) { return PVOP_CALL1(u64, cpu.read_msr, msr); diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index fea56b04f436..abccfccc2e3f 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -120,10 +120,9 @@ struct pv_irq_ops { struct paravirt_callee_save save_fl; struct paravirt_callee_save irq_disable; struct paravirt_callee_save irq_enable; - +#endif void (*safe_halt)(void); void (*halt)(void); -#endif } __no_randomize_layout; struct pv_mmu_ops { diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 1ccaa3397a67..c5bb980b8a67 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -110,6 +110,11 @@ int paravirt_disable_iospace(void) return request_resource(&ioport_resource, &reserve_ioports); } +static noinstr void pv_native_safe_halt(void) +{ + native_safe_halt(); +} + #ifdef CONFIG_PARAVIRT_XXL static noinstr void pv_native_write_cr2(unsigned long val) { @@ -125,11 +130,6 @@ static noinstr void pv_native_set_debugreg(int regno, unsigned long val) { native_set_debugreg(regno, val); } - -static noinstr void pv_native_safe_halt(void) -{ - native_safe_halt(); -} #endif struct pv_info pv_info = { @@ -186,9 +186,11 @@ struct paravirt_patch_template pv_ops = { .irq.save_fl
[PATCH v7 3/3] x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling
Direct HLT instruction execution causes #VEs for TDX VMs which is routed to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow so IRQs need to remain disabled until the TDCALL to ensure that pending IRQs are correctly treated as wake events. Emit warning and fail emulation if IRQs are enabled during HLT #VE handling to avoid running into scenarios where IRQ wake events are lost resulting in indefinite HLT execution times. Reviewed-by: Kirill A. Shutemov Signed-off-by: Vishal Annapurve --- arch/x86/coco/tdx/tdx.c | 8 1 file changed, 8 insertions(+) diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index 6aad910d119d..a97ddc6a52c3 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -393,6 +393,14 @@ static int handle_halt(struct ve_info *ve) { const bool irq_disabled = irqs_disabled(); + /* +* HLT with IRQs enabled is unsafe, as an IRQ that is intended to be a +* wake event may be consumed before requesting HLT emulation, leaving +* the vCPU blocking indefinitely. +*/ + if (WARN_ONCE(!irq_disabled, "HLT emulation with IRQs enabled")) + return -EIO; + if (__halt(irq_disabled)) return -EIO; -- 2.48.1.711.g2feabab25a-goog
Re: [PATCH v6 2/3] x86/tdx: Fix arch_safe_halt() execution for TDX VMs
On Wed, Feb 26, 2025 at 3:49 AM Kirill A. Shutemov wrote: > > On Tue, Feb 25, 2025 at 12:47:03AM +, Vishal Annapurve wrote: > > Direct HLT instruction execution causes #VEs for TDX VMs which is routed > > to hypervisor via TDCALL. If HLT is executed in STI-shadow, resulting #VE > > handler will enable interrupts before TDCALL is routed to hypervisor > > leading to missed wakeup events. > > > > Current TDX spec doesn't expose interruptibility state information to > > allow #VE handler to selectively enable interrupts. To bypass this > > issue, TDX VMs need to replace "sti;hlt" execution with direct TDCALL > > followed by explicit interrupt flag update. > > > > Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") > > prevented the idle routines from executing HLT instruction in STI-shadow. > > But it missed the paravirt routine which can be reached like this as an > > example: > > acpi_safe_halt() => > > raw_safe_halt() => > > arch_safe_halt() => > > irq.safe_halt() => > > pv_native_safe_halt() > > I would rather use paravirt spinlock example. It is less controversial. > I still see no point in ACPI cpuidle be a thing in TDX guests. > I will modify the description to include a paravirt spinlock example. > > > > To reliably handle arch_safe_halt() for TDX VMs, introduce explicit > > dependency on CONFIG_PARAVIRT and override paravirt halt()/safe_halt() > > routines with TDX-safe versions that execute direct TDCALL and needed > > interrupt flag updates. Executing direct TDCALL brings in additional > > benefit of avoiding HLT related #VEs altogether. > > > > Cc: sta...@vger.kernel.org > > Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") > > Signed-off-by: Vishal Annapurve > > Reviewed-by: Kirill A. Shutemov > > -- > Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCH v6 2/3] x86/tdx: Fix arch_safe_halt() execution for TDX VMs
On Thu, Feb 27, 2025 at 8:25 AM kernel test robot wrote: > > Hi Vishal, > > kernel test robot noticed the following build errors: > > [auto build test ERROR on tip/x86/core] > [also build test ERROR on tip/master linus/master v6.14-rc4 next-20250227] > [cannot apply to tip/x86/tdx tip/auto-latest] > ... > All errors (new ones prefixed by >>): > >In file included from arch/x86/kernel/process.c:6: >In file included from include/linux/mm.h:2224: >include/linux/vmstat.h:504:43: warning: arithmetic between different > enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') > [-Wenum-enum-conversion] > 504 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS + > |~ ^ > 505 |item]; > | >include/linux/vmstat.h:511:43: warning: arithmetic between different > enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') > [-Wenum-enum-conversion] > 511 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS + > |~ ^ > 512 |NR_VM_NUMA_EVENT_ITEMS + > |~~ > >> arch/x86/kernel/process.c:937:32: error: use of undeclared identifier > >> 'tdx_halt'; did you mean 'tdx_init'? > 937 | static_call_update(x86_idle, tdx_halt); > | ^~~~ > | tdx_init Will fix this in the next version.
Re: [PATCH V5 1/4] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT
On Thu, Feb 20, 2025 at 1:47 PM Dave Hansen wrote: > > On 2/20/25 13:16, Vishal Annapurve wrote: > > Since enabling CONFIG_PARAVIRT_XXL is too bloated for TDX guest > > like platforms, move HLT and SAFE_HLT paravirt calls under > > CONFIG_PARAVIRT. > > I guess it's just one patch, but doesn't this expose CONFIG_PARAVIRT=y > users to what _was_ specific to CONFIG_PARAVIRT_XXL=y? According to the > changelog, TDX users shouldn't have to use use PARAVIRT_XXL, so > PARAVIRT=y and PARAVIRT_XXL=n must be an *IMPORTANT* configuration for > TDX users. > > Before this patch, those users would have no way to hit the > unsafe-for-TDX pv_native_safe_halt(). After this patch, they will hit it. Before this patch, those users had access to arch_safe_halt() -> native_safe_halt() path. With this patch, such users can execute arch_safe_halt -> pv_native_safe_halt() -> native_safe_halt(), so this patch doesn't cause any additional regression. > > So, there are two possibilities: > > 1. This patch breaks bisection for an important TDX configuration > 2. This patch's conjecture that PARAVIRT_XXL=n is important for TDX > is wrong and it is not necessary in the first place. > > What am I missing?
Re: [PATCH V5 2/4] x86/tdx: Route safe halt execution via tdx_safe_halt()
On Thu, Feb 20, 2025 at 3:00 PM Dave Hansen wrote: > > On 2/20/25 13:16, Vishal Annapurve wrote: > > Direct HLT instruction execution causes #VEs for TDX VMs which is routed > > to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow > > so IRQs need to remain disabled until the TDCALL to ensure that pending > > IRQs are correctly treated as wake events. > > This isn't quite true. There's only one paravirt safe_halt() and it > doesn't do HLT or STI. pv_native_safe_halt() -> native_safe_halt() -> "sti; hlt". > > I think it's more true to say that "safe" halts are entered with IRQs > disabled. They logically do the halt operation and then enable > interrupts before returning. > > > So "sti;hlt" sequence needs to be replaced for TDX VMs with "TDCALL; > > *_irq_enable()" to keep interrupts disabled during TDCALL execution. > But this isn't new. TDX already tried to avoid "sti;hlt". It just > screwed up the implementation. > > > Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") > > prevented the idle routines from using "sti;hlt". But it missed the > > paravirt routine which can be reached like this as an example: > > acpi_safe_halt() => > > raw_safe_halt() => > > arch_safe_halt() => > > irq.safe_halt() => > > pv_native_safe_halt() > > This, on the other hand, *is* important. > > > Modify tdx_safe_halt() to implement the sequence "TDCALL; > > raw_local_irq_enable()" and invoke tdx_halt() from idle routine which just > > executes TDCALL without toggling interrupt state. Introduce dependency > > on CONFIG_PARAVIRT and override paravirt halt()/safe_halt() routines for > > TDX VMs. > > This changelog glosses over one of the key points: Why *MUST* TDX use > paravirt? It further confuses the reasoning by alluding to the idea that > "Direct HLT instruction execution ... is routed to hypervisor via TDCALL". > > It gives background and a solution, but it's not obvious what the > problem is or how the solution _fixes_ the problem. > > What must TDX now depend on PARAVIRT? > > Why not just route the HLT to a TDXCALL via the #VE code? > > Makes sense, I will update the commit message in the next version to clearly answer these questions and address the above comments.
[PATCH V5 3/4] x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling
Direct HLT instruction execution causes #VEs for TDX VMs which is routed to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow so IRQs need to remain disabled until the TDCALL to ensure that pending IRQs are correctly treated as wake events. Emit warning and fail emulation if IRQs are enabled during HLT #VE handling to avoid running into scenarios where IRQ wake events are lost resulting in indefinite HLT execution times. Reviewed-by: Kirill A. Shutemov Signed-off-by: Vishal Annapurve --- arch/x86/coco/tdx/tdx.c | 8 1 file changed, 8 insertions(+) diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index 7ab427e85bd3..16ac337df9fa 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -393,6 +393,14 @@ static int handle_halt(struct ve_info *ve) { const bool irq_disabled = irqs_disabled(); + /* +* HLT with IRQs enabled is unsafe, as an IRQ that is intended to be a +* wake event may be consumed before requesting HLT emulation, leaving +* the vCPU blocking indefinitely. +*/ + if (WARN_ONCE(!irq_disabled, "HLT emulation with IRQs enabled")) + return -EIO; + if (__halt(irq_disabled)) return -EIO; -- 2.48.1.601.g30ceb7b040-goog
[PATCH v6 0/3] x86/tdx: Fix HLT logic execution for TDX VMs
Direct HLT instruction execution causes #VEs for TDX VMs which is routed to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow so IRQs need to remain disabled until the TDCALL to ensure that pending IRQs are correctly treated as wake events. As per current TDX spec, HLT #VE handler doesn't have access to interruptibility state to selectively enable interrupts, it ends up enabling interrupts during #VE handling before the TDCALL is executed. Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") effectively solved this issue for idle routines by defining TDX specific idle routine which directly invokes TDCALL while keeping interrupts disabled, but missed handling arch_safe_halt(). This series intends to fix arch_safe_halt() execution for TDX VMs. Changes introduced by the series include: - Move *halt() variants outside CONFIG_PARAVIRT_XXL and under CONFIG_PARAVIRT [1]. - Add explicit dependency on CONFIG_PARAVIRT for TDX VMs. - Route "sti; hlt" sequences via tdx_safe_halt() for reliability. - Route "hlt" sequences via tdx_halt() to avoid unnecessary #VEs. - Warn and fail emulation if HLT #VE emulation executes with interrupts enabled. Changes since v5: 1) Addressed Dave's comments. 2) Dropped the cleanup patch for now, it can be discussed separately. v5: https://lore.kernel.org/lkml/20250220211628.1832258-1-vannapu...@google.com/ Kirill A. Shutemov (1): x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT Vishal Annapurve (2): x86/tdx: Fix arch_safe_halt() execution for TDX VMs x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling arch/x86/Kconfig | 1 + arch/x86/coco/tdx/tdx.c | 34 ++- arch/x86/include/asm/irqflags.h | 40 +++ arch/x86/include/asm/paravirt.h | 20 +++--- arch/x86/include/asm/paravirt_types.h | 3 +- arch/x86/include/asm/tdx.h| 2 +- arch/x86/kernel/paravirt.c| 14 ++ arch/x86/kernel/process.c | 2 +- 8 files changed, 77 insertions(+), 39 deletions(-) -- 2.48.1.658.g4767266eb4-goog
[PATCH v6 2/3] x86/tdx: Fix arch_safe_halt() execution for TDX VMs
Direct HLT instruction execution causes #VEs for TDX VMs which is routed to hypervisor via TDCALL. If HLT is executed in STI-shadow, resulting #VE handler will enable interrupts before TDCALL is routed to hypervisor leading to missed wakeup events. Current TDX spec doesn't expose interruptibility state information to allow #VE handler to selectively enable interrupts. To bypass this issue, TDX VMs need to replace "sti;hlt" execution with direct TDCALL followed by explicit interrupt flag update. Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") prevented the idle routines from executing HLT instruction in STI-shadow. But it missed the paravirt routine which can be reached like this as an example: acpi_safe_halt() => raw_safe_halt() => arch_safe_halt() => irq.safe_halt() => pv_native_safe_halt() To reliably handle arch_safe_halt() for TDX VMs, introduce explicit dependency on CONFIG_PARAVIRT and override paravirt halt()/safe_halt() routines with TDX-safe versions that execute direct TDCALL and needed interrupt flag updates. Executing direct TDCALL brings in additional benefit of avoiding HLT related #VEs altogether. Cc: sta...@vger.kernel.org Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") Signed-off-by: Vishal Annapurve --- arch/x86/Kconfig | 1 + arch/x86/coco/tdx/tdx.c| 26 +- arch/x86/include/asm/tdx.h | 2 +- arch/x86/kernel/process.c | 2 +- 4 files changed, 28 insertions(+), 3 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index be2c311f5118..933c046e8966 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -902,6 +902,7 @@ config INTEL_TDX_GUEST depends on X86_64 && CPU_SUP_INTEL depends on X86_X2APIC depends on EFI_STUB + depends on PARAVIRT select ARCH_HAS_CC_PLATFORM select X86_MEM_ENCRYPT select X86_MCE diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index 32809a06dab4..6aad910d119d 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -398,7 +399,7 @@ static int handle_halt(struct ve_info *ve) return ve_instr_len(ve); } -void __cpuidle tdx_safe_halt(void) +void __cpuidle tdx_halt(void) { const bool irq_disabled = false; @@ -409,6 +410,16 @@ void __cpuidle tdx_safe_halt(void) WARN_ONCE(1, "HLT instruction emulation failed\n"); } +static void __cpuidle tdx_safe_halt(void) +{ + tdx_halt(); + /* +* "__cpuidle" section doesn't support instrumentation, so stick +* with raw_* variant that avoids tracing hooks. +*/ + raw_local_irq_enable(); +} + static int read_msr(struct pt_regs *regs, struct ve_info *ve) { struct tdx_module_args args = { @@ -1109,6 +1120,19 @@ void __init tdx_early_init(void) x86_platform.guest.enc_kexec_begin = tdx_kexec_begin; x86_platform.guest.enc_kexec_finish = tdx_kexec_finish; + /* +* Avoid "sti;hlt" execution in TDX guests as HLT induces a #VE that +* will enable interrupts before HLT TDCALL invocation if executed +* in STI-shadow, possibly resulting in missed wakeup events. +* +* Modify all possible HLT execution paths to use TDX specific routines +* that directly execute TDCALL and toggle the interrupt state as +* needed after TDCALL completion. This also reduces HLT related #VEs +* in addition to having a reliable halt logic execution. +*/ + pv_ops.irq.safe_halt = tdx_safe_halt; + pv_ops.irq.halt = tdx_halt; + /* * TDX intercepts the RDMSR to read the X2APIC ID in the parallel * bringup low level code. That raises #VE which cannot be handled diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index b4b16dafd55e..393ee2dfaab1 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -58,7 +58,7 @@ void tdx_get_ve_info(struct ve_info *ve); bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve); -void tdx_safe_halt(void); +void tdx_halt(void); bool tdx_early_handle_ve(struct pt_regs *regs); diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 6da6769d7254..d11956a178df 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -934,7 +934,7 @@ void __init select_idle_routine(void) static_call_update(x86_idle, mwait_idle); } else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) { pr_info("using TDX aware idle routine\n"); - static_call_update(x86_idle, tdx_safe_halt); + static_call_update(x86_idle, tdx_halt); } else { static_call_update(x86_idle, default_idle); } -- 2.48.1.658.g4767266eb4-goog
[PATCH v6 1/3] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT
From: "Kirill A. Shutemov" CONFIG_PARAVIRT_XXL is mainly defined/used by XEN PV guests. For other VM guest types, features supported under CONFIG_PARAVIRT are self sufficient. CONFIG_PARAVIRT mainly provides support for TLB flush operations and time related operations. For TDX guest as well, paravirt calls under CONFIG_PARVIRT meets most of its requirement except the need of HLT and SAFE_HLT paravirt calls, which is currently defined under CONFIG_PARAVIRT_XXL. Since enabling CONFIG_PARAVIRT_XXL is too bloated for TDX guest like platforms, move HLT and SAFE_HLT paravirt calls under CONFIG_PARAVIRT. Moving HLT and SAFE_HLT paravirt calls are not fatal and should not break any functionality for current users of CONFIG_PARAVIRT. Cc: sta...@vger.kernel.org Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") Co-developed-by: Kuppuswamy Sathyanarayanan Signed-off-by: Kuppuswamy Sathyanarayanan Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Reviewed-by: Tony Luck Signed-off-by: Vishal Annapurve --- arch/x86/include/asm/irqflags.h | 40 +++ arch/x86/include/asm/paravirt.h | 20 +++--- arch/x86/include/asm/paravirt_types.h | 3 +- arch/x86/kernel/paravirt.c| 14 ++ 4 files changed, 41 insertions(+), 36 deletions(-) diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h index cf7fc2b8e3ce..1c2db11a2c3c 100644 --- a/arch/x86/include/asm/irqflags.h +++ b/arch/x86/include/asm/irqflags.h @@ -76,6 +76,28 @@ static __always_inline void native_local_irq_restore(unsigned long flags) #endif +#ifndef CONFIG_PARAVIRT +#ifndef __ASSEMBLY__ +/* + * Used in the idle loop; sti takes one instruction cycle + * to complete: + */ +static __always_inline void arch_safe_halt(void) +{ + native_safe_halt(); +} + +/* + * Used when interrupts are already enabled or to + * shutdown the processor: + */ +static __always_inline void halt(void) +{ + native_halt(); +} +#endif /* __ASSEMBLY__ */ +#endif /* CONFIG_PARAVIRT */ + #ifdef CONFIG_PARAVIRT_XXL #include #else @@ -97,24 +119,6 @@ static __always_inline void arch_local_irq_enable(void) native_irq_enable(); } -/* - * Used in the idle loop; sti takes one instruction cycle - * to complete: - */ -static __always_inline void arch_safe_halt(void) -{ - native_safe_halt(); -} - -/* - * Used when interrupts are already enabled or to - * shutdown the processor: - */ -static __always_inline void halt(void) -{ - native_halt(); -} - /* * For spinlocks, etc: */ diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 041aff51eb50..29e7331a0c98 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -107,6 +107,16 @@ static inline void notify_page_enc_status_changed(unsigned long pfn, PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc); } +static __always_inline void arch_safe_halt(void) +{ + PVOP_VCALL0(irq.safe_halt); +} + +static inline void halt(void) +{ + PVOP_VCALL0(irq.halt); +} + #ifdef CONFIG_PARAVIRT_XXL static inline void load_sp0(unsigned long sp0) { @@ -170,16 +180,6 @@ static inline void __write_cr4(unsigned long x) PVOP_VCALL1(cpu.write_cr4, x); } -static __always_inline void arch_safe_halt(void) -{ - PVOP_VCALL0(irq.safe_halt); -} - -static inline void halt(void) -{ - PVOP_VCALL0(irq.halt); -} - static inline u64 paravirt_read_msr(unsigned msr) { return PVOP_CALL1(u64, cpu.read_msr, msr); diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index fea56b04f436..abccfccc2e3f 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -120,10 +120,9 @@ struct pv_irq_ops { struct paravirt_callee_save save_fl; struct paravirt_callee_save irq_disable; struct paravirt_callee_save irq_enable; - +#endif void (*safe_halt)(void); void (*halt)(void); -#endif } __no_randomize_layout; struct pv_mmu_ops { diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 1ccaa3397a67..c5bb980b8a67 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -110,6 +110,11 @@ int paravirt_disable_iospace(void) return request_resource(&ioport_resource, &reserve_ioports); } +static noinstr void pv_native_safe_halt(void) +{ + native_safe_halt(); +} + #ifdef CONFIG_PARAVIRT_XXL static noinstr void pv_native_write_cr2(unsigned long val) { @@ -125,11 +130,6 @@ static noinstr void pv_native_set_debugreg(int regno, unsigned long val) { native_set_debugreg(regno, val); } - -static noinstr void pv_native_safe_halt(void) -{ - native_safe_halt(); -} #endif struct pv_info pv_info = { @@ -186,9 +186,11 @@ struct paravirt_patch_template pv_ops = { .irq.save_fl= __PV_IS_C
[PATCH v6 3/3] x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling
Direct HLT instruction execution causes #VEs for TDX VMs which is routed to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow so IRQs need to remain disabled until the TDCALL to ensure that pending IRQs are correctly treated as wake events. Emit warning and fail emulation if IRQs are enabled during HLT #VE handling to avoid running into scenarios where IRQ wake events are lost resulting in indefinite HLT execution times. Reviewed-by: Kirill A. Shutemov Signed-off-by: Vishal Annapurve --- arch/x86/coco/tdx/tdx.c | 8 1 file changed, 8 insertions(+) diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index 6aad910d119d..a97ddc6a52c3 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -393,6 +393,14 @@ static int handle_halt(struct ve_info *ve) { const bool irq_disabled = irqs_disabled(); + /* +* HLT with IRQs enabled is unsafe, as an IRQ that is intended to be a +* wake event may be consumed before requesting HLT emulation, leaving +* the vCPU blocking indefinitely. +*/ + if (WARN_ONCE(!irq_disabled, "HLT emulation with IRQs enabled")) + return -EIO; + if (__halt(irq_disabled)) return -EIO; -- 2.48.1.658.g4767266eb4-goog
[PATCH V5 0/4] x86/tdx: Fix HLT logic execution for TDX VMs
Direct HLT instruction execution causes #VEs for TDX VMs which is routed to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow so IRQs need to remain disabled until the TDCALL to ensure that pending IRQs are correctly treated as wake events. So "sti;hlt" sequence needs to be replaced for TDX VMs with TDCALL execution followed by enabling of interrupts. Changes introduced by the series include: - Move *halt() variants outside CONFIG_PARAVIRT_XXL and under CONFIG_PARAVIRT [1]. - Route "sti; hlt" sequences via tdx_safe_halt() for reliability. - Route "hlt" sequences via tdx_halt() to avoid unnecessary #VEs. - Add explicit dependency on CONFIG_PARAVIRT for TDX VMs. - Warn and fail emulation if HLT #VE emulation executes with interrupts enabled. - Clean up TDX specific idle routine override. Changes since v4: 1) Addressed Kirill's comments. v4: https://lore.kernel.org/lkml/20250212000747.3403836-1-vannapu...@google.com/ Changes since v3: 1) Addressed comments from Sean. 2) Included [1] in the series to fix the scenarios where CONFIG_PARAVIRT_XXL could be disabled. v3: https://lore.kernel.org/all/20250206222714.1079059-1-vannapu...@google.com/ [1] https://lore.kernel.org/lkml/20210517235008.257241-1-sathyanarayanan.kuppusw...@linux.intel.com/ Kirill A. Shutemov (1): x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT Vishal Annapurve (3): x86/tdx: Route safe halt execution via tdx_safe_halt() x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling x86/tdx: Remove TDX specific idle routine arch/x86/Kconfig | 1 + arch/x86/coco/tdx/tdx.c | 30 +++- arch/x86/include/asm/irqflags.h | 40 +++ arch/x86/include/asm/paravirt.h | 20 +++--- arch/x86/include/asm/paravirt_types.h | 3 +- arch/x86/include/asm/tdx.h| 2 -- arch/x86/kernel/paravirt.c| 14 ++ arch/x86/kernel/process.c | 3 -- 8 files changed, 71 insertions(+), 42 deletions(-) -- 2.48.1.601.g30ceb7b040-goog
[PATCH V5 4/4] x86/tdx: Remove TDX specific idle routine
With explicit dependency on CONFIG_PARAVIRT and TDX specific halt()/safe_halt() routines in place, default_idle() is safe to execute for TDX VMs. Remove TDX specific idle routine override which is now redundant. Signed-off-by: Vishal Annapurve --- arch/x86/coco/tdx/tdx.c| 2 +- arch/x86/include/asm/tdx.h | 2 -- arch/x86/kernel/process.c | 3 --- 3 files changed, 1 insertion(+), 6 deletions(-) diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index 16ac337df9fa..46f7bb82c8b7 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -407,7 +407,7 @@ static int handle_halt(struct ve_info *ve) return ve_instr_len(ve); } -void __cpuidle tdx_halt(void) +static void __cpuidle tdx_halt(void) { const bool irq_disabled = false; diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 393ee2dfaab1..6769d1da4c80 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -58,8 +58,6 @@ void tdx_get_ve_info(struct ve_info *ve); bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve); -void tdx_halt(void); - bool tdx_early_handle_ve(struct pt_regs *regs); int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport); diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index d11956a178df..9b21989c283b 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -932,9 +932,6 @@ void __init select_idle_routine(void) if (prefer_mwait_c1_over_halt()) { pr_info("using mwait in idle threads\n"); static_call_update(x86_idle, mwait_idle); - } else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) { - pr_info("using TDX aware idle routine\n"); - static_call_update(x86_idle, tdx_halt); } else { static_call_update(x86_idle, default_idle); } -- 2.48.1.601.g30ceb7b040-goog
[PATCH V5 2/4] x86/tdx: Route safe halt execution via tdx_safe_halt()
Direct HLT instruction execution causes #VEs for TDX VMs which is routed to hypervisor via TDCALL. safe_halt() routines execute HLT in STI-shadow so IRQs need to remain disabled until the TDCALL to ensure that pending IRQs are correctly treated as wake events. So "sti;hlt" sequence needs to be replaced for TDX VMs with "TDCALL; *_irq_enable()" to keep interrupts disabled during TDCALL execution. Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") prevented the idle routines from using "sti;hlt". But it missed the paravirt routine which can be reached like this as an example: acpi_safe_halt() => raw_safe_halt() => arch_safe_halt() => irq.safe_halt() => pv_native_safe_halt() Modify tdx_safe_halt() to implement the sequence "TDCALL; raw_local_irq_enable()" and invoke tdx_halt() from idle routine which just executes TDCALL without toggling interrupt state. Introduce dependency on CONFIG_PARAVIRT and override paravirt halt()/safe_halt() routines for TDX VMs. Cc: sta...@vger.kernel.org Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") Signed-off-by: Vishal Annapurve --- arch/x86/Kconfig | 1 + arch/x86/coco/tdx/tdx.c| 22 +- arch/x86/include/asm/tdx.h | 2 +- arch/x86/kernel/process.c | 2 +- 4 files changed, 24 insertions(+), 3 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 87198d957e2f..afcdbc9693dc 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -902,6 +902,7 @@ config INTEL_TDX_GUEST depends on X86_64 && CPU_SUP_INTEL depends on X86_X2APIC depends on EFI_STUB + depends on PARAVIRT select ARCH_HAS_CC_PLATFORM select X86_MEM_ENCRYPT select X86_MCE diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index 32809a06dab4..7ab427e85bd3 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -398,7 +399,7 @@ static int handle_halt(struct ve_info *ve) return ve_instr_len(ve); } -void __cpuidle tdx_safe_halt(void) +void __cpuidle tdx_halt(void) { const bool irq_disabled = false; @@ -409,6 +410,16 @@ void __cpuidle tdx_safe_halt(void) WARN_ONCE(1, "HLT instruction emulation failed\n"); } +static void __cpuidle tdx_safe_halt(void) +{ + tdx_halt(); + /* +* "__cpuidle" section doesn't support instrumentation, so stick +* with raw_* variant that avoids tracing hooks. +*/ + raw_local_irq_enable(); +} + static int read_msr(struct pt_regs *regs, struct ve_info *ve) { struct tdx_module_args args = { @@ -1109,6 +1120,15 @@ void __init tdx_early_init(void) x86_platform.guest.enc_kexec_begin = tdx_kexec_begin; x86_platform.guest.enc_kexec_finish = tdx_kexec_finish; + /* +* "sti;hlt" execution in TDX guests will induce a #VE in the STI-shadow +* which will enable interrupts before HLT TDCALL inocation possibly +* resulting in missed wakeup events. Modify all possible HLT +* execution paths to use TDCALL for performance/reliability reasons. +*/ + pv_ops.irq.safe_halt = tdx_safe_halt; + pv_ops.irq.halt = tdx_halt; + /* * TDX intercepts the RDMSR to read the X2APIC ID in the parallel * bringup low level code. That raises #VE which cannot be handled diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index b4b16dafd55e..393ee2dfaab1 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -58,7 +58,7 @@ void tdx_get_ve_info(struct ve_info *ve); bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve); -void tdx_safe_halt(void); +void tdx_halt(void); bool tdx_early_handle_ve(struct pt_regs *regs); diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 6da6769d7254..d11956a178df 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -934,7 +934,7 @@ void __init select_idle_routine(void) static_call_update(x86_idle, mwait_idle); } else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) { pr_info("using TDX aware idle routine\n"); - static_call_update(x86_idle, tdx_safe_halt); + static_call_update(x86_idle, tdx_halt); } else { static_call_update(x86_idle, default_idle); } -- 2.48.1.601.g30ceb7b040-goog
[PATCH V5 1/4] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT
From: "Kirill A. Shutemov" CONFIG_PARAVIRT_XXL is mainly defined/used by XEN PV guests. For other VM guest types, features supported under CONFIG_PARAVIRT are self sufficient. CONFIG_PARAVIRT mainly provides support for TLB flush operations and time related operations. For TDX guest as well, paravirt calls under CONFIG_PARVIRT meets most of its requirement except the need of HLT and SAFE_HLT paravirt calls, which is currently defined under CONFIG_PARAVIRT_XXL. Since enabling CONFIG_PARAVIRT_XXL is too bloated for TDX guest like platforms, move HLT and SAFE_HLT paravirt calls under CONFIG_PARAVIRT. Moving HLT and SAFE_HLT paravirt calls are not fatal and should not break any functionality for current users of CONFIG_PARAVIRT. Cc: sta...@vger.kernel.org Fixes: bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") Co-developed-by: Kuppuswamy Sathyanarayanan Signed-off-by: Kuppuswamy Sathyanarayanan Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Reviewed-by: Tony Luck Signed-off-by: Vishal Annapurve --- arch/x86/include/asm/irqflags.h | 40 +++ arch/x86/include/asm/paravirt.h | 20 +++--- arch/x86/include/asm/paravirt_types.h | 3 +- arch/x86/kernel/paravirt.c| 14 ++ 4 files changed, 41 insertions(+), 36 deletions(-) diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h index cf7fc2b8e3ce..1c2db11a2c3c 100644 --- a/arch/x86/include/asm/irqflags.h +++ b/arch/x86/include/asm/irqflags.h @@ -76,6 +76,28 @@ static __always_inline void native_local_irq_restore(unsigned long flags) #endif +#ifndef CONFIG_PARAVIRT +#ifndef __ASSEMBLY__ +/* + * Used in the idle loop; sti takes one instruction cycle + * to complete: + */ +static __always_inline void arch_safe_halt(void) +{ + native_safe_halt(); +} + +/* + * Used when interrupts are already enabled or to + * shutdown the processor: + */ +static __always_inline void halt(void) +{ + native_halt(); +} +#endif /* __ASSEMBLY__ */ +#endif /* CONFIG_PARAVIRT */ + #ifdef CONFIG_PARAVIRT_XXL #include #else @@ -97,24 +119,6 @@ static __always_inline void arch_local_irq_enable(void) native_irq_enable(); } -/* - * Used in the idle loop; sti takes one instruction cycle - * to complete: - */ -static __always_inline void arch_safe_halt(void) -{ - native_safe_halt(); -} - -/* - * Used when interrupts are already enabled or to - * shutdown the processor: - */ -static __always_inline void halt(void) -{ - native_halt(); -} - /* * For spinlocks, etc: */ diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 041aff51eb50..29e7331a0c98 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -107,6 +107,16 @@ static inline void notify_page_enc_status_changed(unsigned long pfn, PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc); } +static __always_inline void arch_safe_halt(void) +{ + PVOP_VCALL0(irq.safe_halt); +} + +static inline void halt(void) +{ + PVOP_VCALL0(irq.halt); +} + #ifdef CONFIG_PARAVIRT_XXL static inline void load_sp0(unsigned long sp0) { @@ -170,16 +180,6 @@ static inline void __write_cr4(unsigned long x) PVOP_VCALL1(cpu.write_cr4, x); } -static __always_inline void arch_safe_halt(void) -{ - PVOP_VCALL0(irq.safe_halt); -} - -static inline void halt(void) -{ - PVOP_VCALL0(irq.halt); -} - static inline u64 paravirt_read_msr(unsigned msr) { return PVOP_CALL1(u64, cpu.read_msr, msr); diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index fea56b04f436..abccfccc2e3f 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -120,10 +120,9 @@ struct pv_irq_ops { struct paravirt_callee_save save_fl; struct paravirt_callee_save irq_disable; struct paravirt_callee_save irq_enable; - +#endif void (*safe_halt)(void); void (*halt)(void); -#endif } __no_randomize_layout; struct pv_mmu_ops { diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 1ccaa3397a67..c5bb980b8a67 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -110,6 +110,11 @@ int paravirt_disable_iospace(void) return request_resource(&ioport_resource, &reserve_ioports); } +static noinstr void pv_native_safe_halt(void) +{ + native_safe_halt(); +} + #ifdef CONFIG_PARAVIRT_XXL static noinstr void pv_native_write_cr2(unsigned long val) { @@ -125,11 +130,6 @@ static noinstr void pv_native_set_debugreg(int regno, unsigned long val) { native_set_debugreg(regno, val); } - -static noinstr void pv_native_safe_halt(void) -{ - native_safe_halt(); -} #endif struct pv_info pv_info = { @@ -186,9 +186,11 @@ struct paravirt_patch_template pv_ops = { .irq.save_fl= __PV_IS_C