On Tue, Dec 03, 2024 at 09:39:03PM +0100, Gioele Barabucci wrote:
> NFC would solve both of these "problems":
>
> * Both U+00E9 (é) and U+0065, U+0301 are NFC-normalized to U+00E9,
> * Both U+2126 (Ohm sign) and U+03A9 (omega) are NFC-normalized to U+03A9
>   (omega).
>
> What NFC alone will not solve are homograph collisions: a (U+0061 Latin
> small letter a) and а (U+0430 Cyrillic small letter a) are NFC-normalized to
> different codepoints.
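
For anyone who wants to check the NFC behavior described above, here is a
minimal sketch in Python using only the standard unicodedata module. The
helper and variable names are illustrative only, and the results depend on
the Unicode data shipped with the interpreter. Note that U+2126 (OHM SIGN)
normalizes to U+03A9 (GREEK CAPITAL LETTER OMEGA).

```python
import unicodedata


def nfc(s: str) -> str:
    """Return the NFC (canonical composition) form of s."""
    return unicodedata.normalize("NFC", s)


precomposed_e = "\u00e9"   # é as a single code point
decomposed_e = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
ohm_sign = "\u2126"        # OHM SIGN
greek_omega = "\u03a9"     # GREEK CAPITAL LETTER OMEGA
latin_a = "\u0061"         # LATIN SMALL LETTER A
cyrillic_a = "\u0430"      # CYRILLIC SMALL LETTER A

# Canonically equivalent spellings collapse to one code point sequence.
same_e = nfc(precomposed_e) == nfc(decomposed_e)
same_omega = nfc(ohm_sign) == nfc(greek_omega)

# Homographs are different characters, so NFC keeps them distinct.
homograph_still_differs = nfc(latin_a) != nfc(cyrillic_a)

print(same_e, same_omega, homograph_still_differs)
```

All three values should come out true: the first two show the
interoperability cases being solved, the third shows the homograph case
being left alone.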

NFC also doesn't solve various invisible characters (e.g., zero-width
spaces, bidirectional control characters).  For more information about
all of the various security land mines, see [1].  I also suggest that
people do a Google search on "CVE" and "Unicode".  There has been at
least one interaction where we needed to make a kernel(!) change to
address a security vulnerability, although we decided it wasn't
super-critical because "no sane distribution actually enables the
casefold feature on users' file systems by default".

[1] https://www.unicode.org/reports/tr39/tr39-22.html

The other security consideration is the vast amount of code that you
need to link into security-critical / setuid programs if you are going
to use libunicode.  (And yes, we do include libunicode in the kernel
in order to support casefold.  If you are thinking about potentially
enabling casefold by default on user file systems because Windows and
macOS do it, and we need to appeal to Gen Z'ers in order for Debian to
stay relevant(tm) --- please don't.  :-)

So if we really do want to support Unicode in usernames, may I suggest
that someone implement the smallest possible Unicode canonicalization
library, one which also handles getting rid of all of the *other*
Unicode security traps like invisible characters, bidirectional
control characters, etc., and that it be subjected to rigorous
security audits before we propose linking it into setuid programs?
That would be a Really Good Idea.

This would also reduce bloat in the minimal Debian install required
for installer images, docker containers, etc., since we wouldn't need
to support things like Unicode sorting rules, Unicode case folding,
conversion between the many different Unicode encoding forms, etc.
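
To make the "canonicalize, then reject the other traps" idea concrete,
here is a rough sketch in Python. It is an illustration of one possible
policy, not a proposal for what useradd or shadow should actually do:
the function name canonicalize_username and the choice to reject every
code point in the Unicode "C" (control, format, surrogate, private use,
unassigned) and "Z" (separator) general categories are assumptions made
for the example. It also leans on Python's full unicodedata tables, which
is the opposite of a minimal library, and it does nothing about
homographs.

```python
import unicodedata


def canonicalize_username(name: str) -> str:
    """NFC-normalize name and reject invisible or control code points.

    Hypothetical policy for illustration only: any code point whose
    general category starts with "C" or "Z" is refused.
    """
    normalized = unicodedata.normalize("NFC", name)
    if not normalized:
        raise ValueError("empty username")
    for ch in normalized:
        category = unicodedata.category(ch)
        # Cf covers zero-width and bidirectional control characters;
        # Cc, Cs, Co, Cn cover controls, surrogates, private use, unassigned;
        # Zs, Zl, Zp cover spaces and line/paragraph separators.
        if category[0] in ("C", "Z"):
            raise ValueError(
                f"disallowed code point U+{ord(ch):04X} ({category})"
            )
    return normalized


def is_rejected(name: str) -> bool:
    """Convenience wrapper: True if the sketch policy refuses name."""
    try:
        canonicalize_username(name)
    except ValueError:
        return True
    return False
```

With this policy, "ad\u200bmin" (zero-width space) and "ad\u202emin"
(right-to-left override) are refused, and "e\u0301" is accepted and
returned as "\u00e9". A name starting with Cyrillic U+0430 is still
accepted, so confusable detection along the lines of [1] would have to
be a separate step.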

Cheers,

					- Ted

>
> But these are two different scenarios: the former problem may (and does)
> arise without any wrongdoing from the user's side (a different OS, or a
> different string manipulation library, or a screen keyboard may produce a
> different é), the latter is an attack. The former is an interoperability
> issue, the latter is a security issue.
>
> > While this seems the right thing to do, I think this should be done in
> > useradd (pkg:shadow), in the respective upstream project, so that all
> > Linux distributions get the same behavior.
>
> That's probably the best approach.
>
> Thanks for taking the time to delve into this issue,
>
> --
> Gioele Barabucci