Hi Tobias!

On 2024-03-07T15:28:21+0100, Tobias Burnus <tbur...@baylibre.com> wrote:
> Thomas Schwinge wrote:
>> OK to push the attached "nvptx: 'cuDeviceGetCount' failure is fatal"?
>
> I think the real question is: what does a 'cuDeviceGetCount' fail mean?

Internally to the CUDA stack: the error codes that you've cited below.
Per the state we're in when calling 'cuDeviceGetCount', we only expect
'CUDA_SUCCESS'.  Therefore, in our actual use: anything else means a
fatal condition that we don't attempt to recover from, as for most
other device access failures.

> Does it mean a serious error – or could it just be a permissions issue
> such that the user has no device access but otherwise is fine?

As you can see, we've done a 'cuInit' right before, so in case there was
any permission issue (or similar), that's already settled (in whichever
way) by the time we do the 'cuDeviceGetCount'.

> Because if it is, e.g., a permission problem – just returning '0' (no
> devices) would seem to be the proper solution.
>
> But if it is expected to be always something serious, well, then a fatal
> error makes more sense.

ACK; pushed in commit 37078f241a22c45db6380c5e9a79b4d08054bb3d.


Grüße
 Thomas


> The possible exit codes are:
>
> CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED,
> CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_VALUE
>
> which does not really help.
>
> My impression is that 0 is usually returned if something goes wrong
> (e.g. with permissions) such that an error is a real exception. But all
> three choices seem to make about equally sense: either host fallback
> (with 0 or -1) or a fatal error.
>
> Tobias