Re: nvptx: 'cuDeviceGetCount' failure is fatal

Thomas Schwinge Fri, 08 Mar 2024 07:58:44 -0800

Hi Tobias!

On 2024-03-07T15:28:21+0100, Tobias Burnus <tbur...@baylibre.com> wrote:
> Thomas Schwinge wrote:
>> OK to push the attached "nvptx: 'cuDeviceGetCount' failure is fatal"?
>
> I think the real question is: what does a 'cuDeviceGetCount' fail mean?


Internally to the CUDA stack: the error codes that you've cited below.
Per the state we're in when calling 'cuDeviceGetCount', we only expect
'CUDA_SUCCESS'.  Therefore, in our actual use: anything else means a
fatal condition that we don't attempt to recover from, as with most
other device access failures.

> Does it mean a serious error – or could it just be a permissions issue 
> such that the user has no device access but otherwise is fine?

As you can see, we've done a 'cuInit' right before, so in case there was
any permission issue (or similar), that's already settled (in whichever
way) by the time we do the 'cuDeviceGetCount'.
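
To make the ordering argument concrete, here is a minimal sketch, not the
actual plugin code.  It uses hypothetical stand-ins ('sk_result',
'probe_devices', and the 'stub_*' helpers are all invented for
illustration) so that it builds without 'cuda.h'; only the call shapes
mirror 'cuInit' and 'cuDeviceGetCount'.  Mapping an init failure to host
fallback is an assumption of the sketch; the point is merely that such
failures are decided at init time, so a later count failure is unexpected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for driver result codes, so the sketch builds
   without cuda.h.  These are NOT the real CUresult values.  */
typedef enum {
  SK_SUCCESS,
  SK_ERROR_NO_DEVICE,
  SK_ERROR_NOT_INITIALIZED
} sk_result;

/* Same call shape as cuInit (flags) and cuDeviceGetCount (&count).  */
typedef sk_result (*sk_init_fn) (unsigned int flags);
typedef sk_result (*sk_get_count_fn) (int *count);

/* What the caller should do next.  */
typedef enum {
  PROBE_DEVICES,     /* *n holds the device count (possibly 0).  */
  PROBE_NO_OFFLOAD,  /* Init failed: assumed here to mean host fallback.  */
  PROBE_FATAL        /* Count query failed after a successful init.  */
} probe_outcome;

static probe_outcome
probe_devices (sk_init_fn init, sk_get_count_fn get_count, int *n)
{
  assert (n != NULL);
  *n = 0;

  /* Permission problems and the like surface here, before any count
     query is made.  */
  if (init (0) != SK_SUCCESS)
    return PROBE_NO_OFFLOAD;

  /* Past this point only success is expected; anything else is fatal.  */
  if (get_count (n) != SK_SUCCESS)
    {
      *n = 0;
      return PROBE_FATAL;
    }
  return PROBE_DEVICES;
}

/* Stub drivers for exercising the three paths.  */
static sk_result
stub_init_ok (unsigned int flags)
{
  (void) flags;
  return SK_SUCCESS;
}

static sk_result
stub_init_no_device (unsigned int flags)
{
  (void) flags;
  return SK_ERROR_NO_DEVICE;
}

static sk_result
stub_count_two (int *count)
{
  *count = 2;
  return SK_SUCCESS;
}

static sk_result
stub_count_broken (int *count)
{
  (void) count;
  return SK_ERROR_NOT_INITIALIZED;
}
```

Calling 'probe_devices (stub_init_ok, stub_count_broken, &n)' takes the
fatal path, whereas 'stub_init_no_device' never reaches the count query
at all.  A real caller would presumably report and abort on
'PROBE_FATAL' rather than hand the value back.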

> Because if it is, e.g., a permission problem – just returning '0' (no 
> devices) would seem to be the proper solution.
>
> But if it is expected to be always something serious, well, then a fatal 
> error makes more sense.

ACK; pushed in commit 37078f241a22c45db6380c5e9a79b4d08054bb3d.


Grüße
 Thomas


> The possible exit codes are:
>
> CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED, 
> CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_VALUE
>
> which does not really help.
>
> My impression is that 0 is usually returned if something goes wrong 
> (e.g. with permissions) such that an error is a real exception. But all 
> three choices seem to make about equally sense: either host fallback 
> (with 0 or -1) or a fatal error.
>
> Tobias
