On Wed, Jul 01, 2026 at 11:29:01AM -0700, H. Peter Anvin wrote:
> On July 1, 2026 10:42:08 AM PDT, "Michal Suchánek" <[email protected]> wrote:
> >The return value of syscall_enter_from_user_mode is used both for the
> >adjusted syscall number and the indicator that a syscall should be
> >skipped.
> >
> >As seccomp can be invoked on any syscall, including invalid ones this
> >somewhat undermines seccomp.
> >
> >While the seccomp variants that terminate the process do not need to
> >care about this for the filter that sets the syscall return value this
> >distinction is required.
> >
> >Pass the syscall number as a pointer to the inline entry functions, and
> >use the return value exclusively for the indication that the syscall is
> >already handled.
> >
> >This should avoid the need for the s390 PIF_SYSCALL_RET_SET which is the
> >workaround for exactly this deficiency.
> >
> >If this is desirable the patch could be split into some series that
> >adjusts the code flow where needed so that the final change is mostly
> >mechanical.
> >
> >There is also another way to handle this problem.
> >
> >With x86 using bit 30 to denote compatibility syscall it sounds like
> >declaring syscall number a 30bit quantity would work.
> >
> >Then bit 31 could be used to denote an invalid syscall that can never be
> >executed, and the -1 returned from syscall_enter_from_user_mode would
> >then be inherently invalid.
> >
> >That is so long as no architectures use syscall numbers outside of this
> >range so far, and the limitation is considered fine.
> >
>
> Negative numbers most definitely not be assigned as valid system calls, not
> now, not ever.

Negativity of a number is a matter of interpretation. Sometimes the
syscall number is declared as int, sometimes long, sometimes unsigned
long. Passing -1 to strtoul generates some bit pattern that can then be
compared to another bit pattern inside a seccomp filter program, for
example.

> Therein lies some serious madness.
>
> I believe setting the syscall number to -1 to skip is an ABI already in e.g.
> ptrace, so I doubt we can just get rid of it anyway.

Yes, and seccomp can set the syscall number to -1 to indicate that it was
already handled, even if the number was -1 to start with. While -1 is not
a valid syscall number, it can still be filtered, at least on some
architectures.

> I would say as follows:
>
> Let's formally define that:
>
> - valid system call numbers are positive 32-bit numbers, using the
> appropriate ABI convention for "int".
>
> - bits [30:n] for some value of n are reserved for architecture-specific
> flags/modes. MIPS uses an offset of 2000 decimal between its syscall ABIs,
> which would imply n ~ 11, although I personally think that is too restrictive
> (MIPS could in fact use such a flag to provide an escape into a larger number
> space if we ever need more than 2000 system calls.)
>
> I would suggest n = 24, at least for now. It is easier to give up additional
> bits later than to claw them back when already used.
>
> Thus:
>
> 1. The type for a system call is int.
>
> 2. A valid system call number is always going to be positive.
>
> 3. Bits [30:24] are available for architecture ABI use. The "architecture
> independent" part of the system call number is therefore 24 bits wide.

Will that also work correctly with seccomp? As I understand it, the
current situation on x86 is that the BPF program passed to seccomp must
itself filter on the compat syscall bit, and I do not see how the syscall
value could be restricted to 24 bits without changing the seccomp filter
API. See e.g.
https://lore.kernel.org/linuxppc-dev/[email protected]/
for sample code.

>
> 4. The exact ABI is platform-specific, obviously, but as a general guideline
> (especially for new platforms/ABIs) should follow the rules for a platform
> "int" if practical. Notably, when passing a value in a register larger than
> 32 bits, which side of the calling interface is responsible for
> sign-extending a value passed in a register. If caller side, the kernel
> should validate, if callee side the kernel should ignore the additional bits
> and do the extension.

Do we even want to play with sign extension? If the syscall number is
>= 1<<n after masking off the flags recognized by the platform (if any),
it's invalid.

> 5. A negative system call number is guaranteed to return -ENOSYS (unless
> intercepted by seccomp, ptrace, or another mechanism under user space
> control.)

Interception by seccomp is exactly the case that's wonky.

> 6. If the platform needs to algorithmically modify the system call number due
> to platform-specific concerns (say, the platform uses a 16-bit special
> purpose register for the syscall number, or it has multiple kernel entry
> points with different behavior), it should if at all possible transcode the
> system call number as necessary to match this convention in APIs that are
> exposed to general kernel code.
>
> For example, in the future I could very much see the IA32 code in the x86
> kernel using bit 29 internally to indicate an ia32 system call, simplifying
> the is_compat implementation on x86. It should not mean that passing bit 29
> to either the syscall instruction or int $0x80 will be accepted.

As I understand the code, it uses bit 30 for that. Maybe I missed
something?

Thanks

Michal
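
For illustration, the range check mentioned above (a syscall number that
is >= 1<<n after masking off the flags recognized by the platform is
invalid) could be sketched in plain userspace C roughly as follows. This
is only a sketch under assumptions, not kernel code: n = 24 is the value
suggested in the thread, and the names SYSCALL_NR_BITS, EXAMPLE_ARCH_FLAG
and syscall_nr_in_range() are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: width of the architecture-independent part (n = 24). */
#define SYSCALL_NR_BITS		24

/* Hypothetical platform flag living in bits [30:24]; here bit 30. */
#define EXAMPLE_ARCH_FLAG	(1u << 30)

static_assert(SYSCALL_NR_BITS < 31, "bit 31 must stay outside the number field");

/*
 * Treat the raw value purely as a bit pattern. After clearing the flags
 * the platform recognizes, anything that does not fit in the low
 * SYSCALL_NR_BITS bits is out of range. No sign extension is involved.
 */
static bool syscall_nr_in_range(uint32_t raw, uint32_t arch_flags)
{
	return (raw & ~arch_flags) < (1u << SYSCALL_NR_BITS);
}

/* The architecture-independent part of an in-range number. */
static uint32_t syscall_nr_generic(uint32_t raw)
{
	return raw & ((1u << SYSCALL_NR_BITS) - 1);
}
```

Because the comparison is unsigned, a -1 (all bits set) falls out of
range whether or not the platform flag is masked off, while a number
carrying only a recognized flag still passes.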
