Hi nick,

On Sat, Nov 23, 2024 at 02:48:10AM -0500, nick black wrote:
> Marc Haber left as an exercise for the reader:
> > (1)
> > Should Debian allow UTF-8 user names in the first place or should we
> > restrict names for regular users to some us-ascii near set as well? (I
> > think yes, we should)
>
> I feel strongly yes, despite POSIX admonitions (quoted elsewhere
> in this thread) and sure breakage any number of places.

Thank you, noticed.

> I think
> a test plan would be very desirable (off the top of my head,
> we'd want to check login, the DMs, PAM, OpenSSH, passwd, w,
> framebuffer console input, etc. It would probably also be a good
> idea to loop in other distributions.

Coordinating this test is way beyond what I have available in resources,
most notably time. Our tools have been allowing UTF-8 user names at
least since bookworm (I don't have any bullseye systems left, buster's
adduser does not allow UTF-8). So we are already testing this in a
stable release (albeit unplanned).

Please note that allowing UTF-8 user names by default does not break
compatibility in any place where only 7bit user names are being used.
Debian is not using such user names in anything that we ship. We only
allow them. Actually _doing_ this is still the local admin's decision.
And should they decide to not want this, adduser can be configured to
disallow it.

This thread is mainly about whether we should disallow things in next
stable that are possible in current stable. I think we need good reasons
for that, and I ain't seeing any right now.

> I recommend Chapter 7 of my free book, "Hacking the Planet with
> Notcurses: A Guide to TUIs and Character Semigraphics" for the
> full story (as I understand it) regarding Unicode presentation:
> https://nick-black.com/htp-notcurses.pdf (starts on page 41).

Noted for reading.

> * any upstream tool could say "bad idea" and refuse patches,
>   requiring their long term management,

Depending on how important this tool is, we could get away without
patching and probably not even documenting this failure.

> * the Linux framebuffer console is pretty limited in what
>   glyphs it has available, and the number of glyphs it can
>   support,

Probably, yes. But people working on the Linux framebuffer console are
unlikely to actually use UTF-8 user names, so the only really bad
situation would be a rescue situation.
We could get away with documenting "please use 7bit only user names for
accounts that are likely to be used in system rescue situations".

> * you want installer support if you intend to do this right,

The installer currently allows me to type UTF-8 user names in the entry
fields (and even displays them correctly when one goes through the
dialogs a second time), but rejects them with a validation error
message ("The username you entered is invalid. Note that usernames must
start with a lower-case letter, which can be followed by any combination
of numbers and more lower-case letters, and must be no more than 32
characters long."). That message is inaccurate: it should say
"lower-case us-ascii letters". From a German point of view, "jürgen"
conforms to the rules given in the error message.

> * ubiquitous input for UTF-8 is a pretty complicated story, and

Sites using such letters in user names should know which of them can be
typed.

> * broken localization (or failure to call setlocale()) could be
>   a bigger problem, especially for root/system accounts.

I don't think we should allow UTF-8 characters in the string "root" or
in system account names. And if a local admin decides to do so, Debian
packages should still restrict themselves to using US-ASCII in their
system accounts.

> Other concerns:
>
> You'll likely now be linking libunistring into some
> binaries where it wasn't previously used.

Probably, yes. I hope to get away in adduser without that, since I'd
like to keep adduser's dependencies minimal (it's being used in the
installer).

> Regarding the subset of Unicode characters you'd want to allow,
> this would be best decided using the General Category trait.
> Each codepoint is assigned one of a finite set of General
> Categories.
> We would probably want to allow Letters, Marks, and
> Numbers, and perhaps a whitelist from Punctuation and Symbols
> (Punctuation, connector and Punctuation, dash are probably all
> we'd want) extended from currently supported ispunct(3)
> characters. This data is available from libunistring (and
> probably other places). This eliminates a great swatch of known
> security issues.

Do you have a suggestion for a perl regexp that allows this? My current
development directory has "qr/[\p{Graph}*\.\${}><%'@]+/".

> Names containing invalid UTF-8 sequences ought be rejected.

Agreed. How do I check for this in perl?

> Characters 0-127 would presumably be allowed iff they are now;
> UTF-8 preserves US-ASCII.

I'd rather allow 32-127 only.

> We ought support combining characters up through the Extended
> Grapheme Cluster (a single user-perceived character, roughly a
> glyph, made up of one or more encoded characters). Generally a
> single backspace ought map to an entire EGC.

This is beyond my knowledge of Unicode.

> Regarding canonicalization/normalization, this is a complex
> question without a necessarily correct technical answer. I think
> you'd want to follow the Principle of Least Astonishment; as to what
> would astonish the least, I'd like to hear wider input. But
> Unicode definitely defines multiple normal forms and equivalency
> classes.

I am not sure whether we need this. A local admin is likely to be
consistent with herself in creating user names.

> You now have glyphs which occupy more than one column. Are your
> columnar/tabular programs prepared for that? ﷽𒁭𒐫

Probably not. If that's important for a local admin, they can disallow
such characters and maybe even file a patch against adduser. Quoting the
character just out of curiosity.

> > (2)
> > If the answer to (1) is "allow UTF-8", should we also do that for system
> > users? (I think no, we should not)
>
> I think you should, simply because otherwise you have two paths
> in more places.

Adduser already has different code paths for normal and system accounts.

> > (3)
> > I think that 32 characters/bytes (it's the same if we don't allow UTF-8)
> > is a good limitation for a system user name. But, should we increase
> > that for regular user names? (I think yes)
>
> I hesitate to comment here because who really cares, but does 32
> save us something over 128? 128 seems the default "enough for
> everybody" these days, looking at IPv6 and ZFS.

systemd argues that more than 32 characters are rarely supported in
"older and unmaintained" utilities.

> My printer is administered by
> i̸̒n̴͛e̵̎l̴͝u̷̾c̴̉t̵́å̵b̷͋l̷͐e̴̋m̸̆o̷̚d̴̐ä̸́l̶͝i̷̋t̷͗ẏ̷ȏ̵f̸̃t̶͘h̷͗e̴̿v̶͘i̷̛s̸̈́ì̵b̷̃l̶̎e̷͊.

That really renders strangely here.

> > (6)
> > Does it still make sense to give non-UTF-8-locales special handling
> > (which one?), or can adduser safely assume that any non-ascii locale is
> > UTF-8? Or must I check for locale and reject UTF-8 user names on
> > non-UTF-8 locales? (I hope that we can safely assume UTF-8)
>
> It cannot. "C" is not UTF-8. Assumption of UTF-8 requires a
> properly set LANG and programs calling setlocale(). This, as
> alluded to above, has the potential for a big mess.

Our default is C.UTF-8 and has been like that for a while.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mail address in header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
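
On the two Perl questions asked earlier in this message (a regexp for
the General Category selection, and rejecting invalid UTF-8), a minimal
sketch might look like the following. It assumes the name arrives as raw
bytes, uses the core Encode module for strict decoding, and matches
Letters, Marks and Numbers plus connector and dash punctuation, as
suggested in the quoted text. The helper names decode_strict_utf8 and
is_acceptable_name, the leading-letter requirement and the 32-character
cap are illustrative assumptions, not adduser's actual code.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of adduser): strictly decode raw bytes.
# Returns the decoded character string, or undef if the bytes are not
# well-formed UTF-8. LEAVE_SRC keeps decode() from modifying its input.
sub decode_strict_utf8 {
    my ($bytes) = @_;
    my $chars = eval {
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $chars;    # undef when decode() died on malformed input
}

# General Categories: L (Letter), M (Mark), N (Number),
# Pc (Punctuation, connector), Pd (Punctuation, dash).
# Assumptions for illustration: the name must start with a letter and is
# at most 32 characters (code points, not bytes) long.
my $name_re = qr/\A\p{L}[\p{L}\p{M}\p{N}\p{Pc}\p{Pd}]{0,31}\z/;

# Hypothetical helper: returns 1 if the byte string is valid UTF-8 and
# consists only of the allowed categories, 0 otherwise.
sub is_acceptable_name {
    my ($bytes) = @_;
    my $chars = decode_strict_utf8($bytes);
    return 0 unless defined $chars;
    return $chars =~ $name_re ? 1 : 0;
}
```

With this sketch, the UTF-8 bytes for "jürgen" ("j\xc3\xbcrgen") would be
accepted, while the same name in Latin-1 ("j\xfcrgen") is rejected as
malformed UTF-8, and a name containing a space is rejected by the
category check. Normalization and grapheme clusters are deliberately
left out here.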