Thanks for the review, Max! :)
On 4/11/25 18:04, Max Carrara wrote:
On Fri Apr 11, 2025 at 5:08 PM CEST, Daniel Kral wrote:
As mentioned in the Bugzilla and indicated above, I haven't found any
clear indicator for this happening besides that the most affected
devices seem to be USB devices, which use the mentioned UAS kernel
module.
Have you perhaps found any way to test this? I could then try to
replicate this behaviour. Otherwise no hard feelings; I think setting a
shorter timeout for (usually) smaller commands is something we should do
in general.
Unfortunately not. I've tried all (4) USB devices I had on me, but
sadly none of them had those quirks ;). I only tested that the error
path works correctly by substituting the smartctl command with
`sleep 11` (for the timeout) and `sh -c 'exit 3'` (for the non-zero
return).
It'd sure be great if someone with an affected disk could test this
directly; I'll forward it to the Bugzilla entry and forum post so it
might get more coverage.
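For anyone who wants to reproduce the two failure modes without an
affected disk, here is a minimal sketch in Python. It is not the Perl
code from the patch; the `probe` helper and its return values are
made up for illustration, and the stand-in commands mirror the
`sleep` / `sh -c 'exit 3'` substitution described above (with a
shorter sleep so it finishes quickly).

```python
import subprocess


def probe(cmd, timeout):
    """Run cmd and classify the result (hypothetical helper).

    Returns ("ok", stdout), ("exit", code) or ("timeout", None).
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising.
        return ("timeout", None)
    if proc.returncode != 0:
        return ("exit", proc.returncode)
    return ("ok", proc.stdout)


if __name__ == "__main__":
    # Stand-in for a command that hangs longer than the timeout.
    print(probe(["sleep", "2"], timeout=0.5))
    # Stand-in for a command that returns a non-zero exit code.
    print(probe(["sh", "-c", "exit 3"], timeout=5))
```

Both paths should be distinguishable by the caller, which is all the
substitution test above was meant to confirm.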
(That being said, looking through the code of PVE::Tools::run_command---
I'm surprised we don't set a default timeout there at all. I think
introducing one there could perhaps break something unexpected, though,
so I'd rather not touch it.)
Yes, I'd guess that there are some places where $noerr is set but
where a $timeout would now raise an error anyway, AFAICS just as it
does here, so there'd be quite a few places which do not have error
handlers set up. I hope that smartctl is more of an odd case here, as
its timeout is quite high because of reasons.
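To make the concern concrete, a small Python sketch of the behaviour
as I understand it from this thread. This is an assumption-laden
model, not PVE::Tools::run_command itself: the `run` function, its
`noerr` parameter and the error messages are hypothetical stand-ins.

```python
import subprocess


def run(cmd, timeout=None, noerr=False):
    """Hypothetical wrapper: noerr hides non-zero exits, not timeouts."""
    try:
        proc = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Raised even when noerr is set, so a caller that relies on
        # noerr alone has no handler for this path.
        raise RuntimeError(f"command {cmd!r} failed: got timeout") from None
    if proc.returncode != 0 and not noerr:
        raise RuntimeError(
            f"command {cmd!r} failed: exit code {proc.returncode}"
        )
    return proc.returncode


if __name__ == "__main__":
    # Non-zero exit is tolerated with noerr ...
    print(run(["sh", "-c", "exit 3"], noerr=True))
    # ... but a timeout still propagates as an exception.
    try:
        run(["sleep", "2"], timeout=0.5, noerr=True)
    except RuntimeError as err:
        print(err)
```

A caller written with only `noerr` in mind would be caught out by the
second case once a default timeout exists.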
I'm fine lowering the timeout further, but 10 seconds seemed reasonable
if only one disk is affected for now, so that loading takes some time
and not seemingly forever.
Given that I've never had a single device take longer than a split
second, I think this is quite reasonable too.
I was also thinking about just caching which disks have had that
behavior and just not running the command for them, but I thought this
would add more complexity than needed here.
I agree that this would be a little too much; you'd also have to
invalidate cache entries after a certain time / a certain condition etc.
You'd also have to handle the case where the disk starts to magically
respond to `smartctl` again. Better to just keep the timeout here as-is.
Agreed, that would be way too much for this. And as it seems from the
forum, it was probably a faulty controller / firmware (?) anyway [0].
[0] https://forum.proxmox.com/threads/164799/#post-763192
Either way, nice work! For both patches, consider:
Reviewed-by: Max Carrara <m.carr...@proxmox.com>
(Though, I'd still like to test this somehow, if you found a way to do so)
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel