On 11/13/2024 7:46 AM, Felix Kuehling wrote:
>
> On 2024-11-12 2:35, Zhu Lingshan wrote:
>> On 11/12/2024 6:21 AM, Felix Kuehling wrote:
>>> On 2024-11-11 03:08, Zhu Lingshan wrote:
>>>> On 11/5/2024 4:50 AM, Felix Kuehling wrote:
>>>>> On 2024-10-31 22:35, Zhu Lingshan wrote:
>>>>>> On 10/31/2024 11:30 PM, Felix Kuehling wrote:
>>>>>>> On 2024-10-31 6:50, Zhu Lingshan wrote:
>>>>>>>> The ioctl functions may fail, causing the args unreliable.
>>>>>>>> Therefore, the args should not be copied to user space.
>>>>>>>>
>>>>>>>> The return code provides enough information for
>>>>>>>> error handling in user space.
>>>>>>>>
>>>>>>>> This commit checks the return code of the ioctl functions
>>>>>>>> and handles errors appropriately when they fail.
>>>>>>> I have reviewed and rejected this patch before. My opinion has not 
>>>>>>> changed. The existing code copies the ioctl arg structure back to user 
>>>>>>> mode even in error cases because user mode needs additional information 
>>>>>>> from that structure for some ioctls.
>>>>>> how can the user space program distinguish the "good informational 
>>>>>> parameters" from the  "bad default legacy parameters"? There can be 
>>>>>> other user space programs other than thunk.
>>>>>>
>>>>>> what if the user space program doing pulling mode, it can pull the args 
>>>>>> changes because ioctl is usually slower, our code should be robust.
>>>>>>
>>>>>> usually the return code provides enough information for the user space 
>>>>>> programs.
>>>>> I don't understand your concern. Even without your patch, the failing 
>>>>> ioctl still returns the error code to user mode. User mode can safely 
>>>>> ignore additional information returned in the argument structure. You are 
>>>>> raising concerns about performance or robustness. I don't see that either 
>>>>> of those are negatively impacted by copying additional information in the 
>>>>> argument struct to user mode.
>>>> Still the questions:
>>>> 1) how can the user space program distinguish the "good informational 
>>>> parameters" from the  "bad default legacy parameters"? 2) what if the user 
>>>> space program doing pulling mode, pull the args before error code 
>>>> returned. Memory changes are usually faster than error code.
>>> There are no "bad default legacy parameters". Ioctls that were defined to 
>>> return additional information in the parameter structure on errors have 
>>> always done so. This should be documented in the kfd_ioctl.h header, though 
>>> some ioctls have better documentation than others. For a good example, see 
>>> kfd_ioctl_dbg_trap_get_queue_snapshot_args and 
>>> kfd_ioctl_dbg_trap_get_device_snapshot_args, which do return the number of 
>>> queues or devices in the parameter structure if user mode didn't allocate 
>>> enough space.
>>>
>>> Another example is kfd_ioctl_map_memory_to_gpu_args and 
>>> kfd_ioctl_unmap_memory_from_gpu_args, which returns the number of 
>>> successful mappings if the ioctl fails. This is necessary to restart the 
>>> operation after -ERESTARTSYS and skip mappings that were already completed.
>> I believe these are bugs that should be fixed. If request N but only 
>> allocate M where M < N, the kernel space should return a proper error code 
>> and then user space should reduce the requesting number, like N/2.
> That's not how these ioctls work. The ioctl needs to return data for all 
> queues or all devices. User mode may not know that number ahead of time and 
> may not have allocated enough memory. So the ioctl needs to be able to return 
> the required allocation size in the failure case. This is what it's designed 
> to do.
shall we document this? Or how can the developers know about these constraints 
when implementing changes.
>
>> We must keep do defensive programming to reduce potential risks & bugs, keep 
>> our system robust, we should not assume the other side work perfectly as 
>> expected.  
> Exactly. More information makes this easier. Removing information that has 
> always been available breaks the ABI and does not improve robustness.
>
>>>
>>>>> You mention that there can be other user mode clients other than Thunk. 
>>>>> That's true. E.g. rocm-gdb calls KFD ioctls directly. And it depends on 
>>>>> some of the additional information about errors. If you know of other 
>>>>> user mode clients that are broken by the current behaviour, please point 
>>>>> them out.
>>>>>
>>>>> Before anything else, we do not break existing user mode. Your patch 
>>>>> breaks that rule. There is really no room for discussion here. I'm not 
>>>>> seeing any reasonable argument to even consider your proposal.
>>>> If a user space program needs to read arguments to do error recovery, then 
>>>> it is a buggy user space program that should be fixed.
>>>> Usually the error code provides enough information for error handling. Why 
>>>> our KFD user space are exceptive?
>>> See my examples above. User mode is not buggy if it uses documented API 
>>> behaviour, like what I showed above. In the case of -ERESTARTSYS, saving 
>>> information in the argument structure is also necessary for the kernel mode 
>>> driver itself, not just user mode.
>> so we need a list to document what parameters are guaranteed to be safe to 
>> read in user space?
> I don't think it's a safety concern. If you make it one, you need to define 
> what you consider unsafe. From the kernel's point of view, user mode can do 
> whatever they want with the data returned to user mode. If user mode ends up 
> crashing because of it, that's a user mode bug unless the kernel violated 
> some API contract.
>
> That said, I did point you to API documentation that explains exactly what 
> ioctl args contain useful information after certain ioctls return an error.
>
>  And Kernel must always make sure the parameters are reasonable even when 
> ioctls fail in the first place?> I am not sure this is a good practice, we 
> don't see many drivers rely on this kind of design.
>
> I disagree. You can see the same behaviour in drm_ioctl.c:
>
>         retcode = drm_ioctl_kernel(filp, func, kdata, ioctl->flags);
>         if (copy_to_user((void __user *)arg, kdata, out_size) != 0)
>                 retcode = -EFAULT;
>
>       ...
>
> This is the ioctl handler used by many GPU drivers in the kernel, including 
> amdgpu.
>
>> Here is an example explains why this is messy:
>>
>> kfd_ioctl returns -EINVAL if the requesting ioctl number > 
>> AMDKFD_CORE_IOCTL_COUNT. Here no kfd ioctl functions are invoked, therefore 
>> the arguments are not touched, they are legacy values or random values, even 
>> all 0.> 
>> Other kfd ioctl functions, like kfd_ioctl_create_queue and 
>> kfd_ioctl_destroy_queue, they also may return -EINVAL, but the arguments may 
>> be modified. How can the user space tell whether the arguments are modified 
>> by kernel, containing useful information, or not touched at all?
> They know by reading the API documentation.
yeah, so at least we need documentation, at least comments in the source code.

Thanks
Lingshan
>
> Regards,
>   Felix
>
>> And what if the user space pull memory changes of the arguments?> 
>> Thanks
>> Lingshan
>>> Regards,
>>>   Felix
>>>
>>>
>>>> Thanks
>>>> Lingshan
>>>>> Regards,
>>>>>    Felix
>>>>>
>>>>>> Thanks
>>>>>> Lingshan
>>>>>>> Regards,
>>>>>>>    Felix
>>>>>>>
>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan....@amd.com>
>>>>>>>> ---
>>>>>>>>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 3 +++
>>>>>>>>   1 file changed, 3 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
>>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>>>>>> index 3e6b4736a7fe..a184ca0023b5 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>>>>>> @@ -3327,6 +3327,8 @@ static long kfd_ioctl(struct file *filep, 
>>>>>>>> unsigned int cmd, unsigned long arg)
>>>>>>>>       }
>>>>>>>>         retcode = func(filep, process, kdata);
>>>>>>>> +    if (retcode)
>>>>>>>> +        goto err_retcode;
>>>>>>>>         if (cmd & IOC_OUT)
>>>>>>>>           if (copy_to_user((void __user *)arg, kdata, usize) != 0)
>>>>>>>> @@ -3340,6 +3342,7 @@ static long kfd_ioctl(struct file *filep, 
>>>>>>>> unsigned int cmd, unsigned long arg)
>>>>>>>>       if (kdata != stack_kdata)
>>>>>>>>           kfree(kdata);
>>>>>>>>   +err_retcode:
>>>>>>>>       if (retcode)
>>>>>>>>           dev_dbg(kfd_device, "ioctl cmd (#0x%x), arg 0x%lx, ret = 
>>>>>>>> %d\n",
>>>>>>>>                   nr, arg, retcode);

Reply via email to