On Wed, Jan 08, 2003 at 01:30:09AM -0500, Colin Walters wrote:
> On Tue, 2003-01-07 at 03:07, Jakob Bohm wrote:
>
> > I agree, this is the only way to go. Naive, simple, classic
> > UNIX-style programming should continue to "just work",
>
> Naïve, simple, classic UNIX-style programs are ASCII-only. Then someone
> got the idea to bolt this huge "locale" kludge on top of all of it. It
> is not something to be proud of or emulate.
>
Naive, simple, classic UNIX-style programs (if 8-bit clean) will
implicitly handle UTF8, latin-1, latin-2, Korean DBCS, Arabic, Hebrew,
most old DOS codepages, and generally any encoding which includes
ASCII as a proper subset. The notable exception is certain Japanese
DBCS encodings, in which ASCII byte values can take on a different
meaning when preceded by the wrong lead byte (a concrete example is
sketched below). I am not sure if the common Chinese DBCS encodings
are safe like Korean or unsafe like Japanese. This is what I want to
keep working.

But this pleasant situation presumes that all the system interfaces
(terminal, filesystem, Xlib ...) happen to use the *same* encoding at
any given invocation of the program, at least as far as input/output
to that program is concerned. So my detailed proposal is about getting
UTF8 support to work without breaking this basic programming
assumption.
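To make the Japanese DBCS hazard concrete, here is a quick, untested
C illustration. The byte values are the standard Shift-JIS and UTF8
encodings of U+8868: in Shift-JIS the trail byte happens to be 0x5C,
i.e. ASCII '\', so a naive byte scan misfires, while in UTF8 every
byte of a multi-byte character is >= 0x80, so a byte scan can never
misfire:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char sjis[] = "\x95\x5C";     /* U+8868 in Shift-JIS; trail byte is '\' */
    const char utf8[] = "\xE8\xA1\xA8"; /* the same character in UTF8 */

    printf("'\\' found in Shift-JIS text: %s\n",
           strchr(sjis, '\\') ? "yes (breakage)" : "no");
    printf("'\\' found in UTF8 text:      %s\n",
           strchr(utf8, '\\') ? "yes" : "no (safe)");
    return 0;
}

Korean EUC and the ISO-8859 family never reuse ASCII byte values
inside a high-bit character, which is why 8-bit clean code handles
them implicitly.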
> > I like
> > the idea that I can download any old program written in a past
> > decade and just type make.
>
> Yay for broken software.
>

Again, I assume that the program is 8-bit clean, or I would have to
restrict my input to ASCII anyway today. But if I do restrict my own
input to ASCII for such a broken program, the system should do nothing
which may increase the breakage beyond that manual workaround.

To understand my concrete proposal, it should be seen in the light of
the following general transition plan:

Step S1. Get all the ultra-core software to support UTF8 (items 4 and
6 in the proposal).

Step S2. Now maintainers of other software will have a reasonable
environment in which to start implementing and testing that their
code works with UTF8 variants of locales. And users can actually use
such locales without massive breakage.

Step S3. Make all Debian packages work correctly in the presence of
UTF8 locales. Proposal items 1 to 3 are about making this as trivial
as possible, with more than 90% of current packages (both source and
binary) needing no change at all.

Step S4. While implementing S3, work on creating solutions which
allow processes running in UTF8 locales to interoperate with a world
where some systems and users will continue to use other encodings for
many years to come. Proposal item 5 says that this is the
responsibility of the few pieces of software actually interfacing
with the outside world, not of the many pieces of neutral software
which may or may not happen to be used in those situations. Proposal
item 4 emphasizes that simply having a user interface (such as
libreadline in the shell, ncurses in some full-screen text mode
programs, or Athena or Motif/lesstif widgets in X programs) does not
put a program in that category. Thus character conversion should be
done at the very edge of the system: in the local terminals (vt,
xterm, Xlib), in remote terminal access software (ssh, telnet, tty
wrappers for serial lines, Xlib for remote X terminals), and in
physical storage interfaces (already partially done in the stock
kernel for non-UNIX filesystems).

Step S5. Make UTF8 locales the default.

Step S6. Subject support for other encodings to bit rot, not
deliberate removal.

> > 1. Unless otherwise specified here, or there are very special
> > circumstances, all programs and libraries should assume that all
> > strings they receive or output (including, but not limited to,
> > filenames) are in the same encoding, and make no externally
> > visible character encoding conversion. (This is usually trivial
> > to do, just do nothing).
>
> This is the way things currently work; it is also exceedingly broken.

It is very much not broken: If I set my locale to UTF8, use a UTF8
terminal and all my filesystems present UTF8 at the system call
level, everything works. If I set my locale to latin-1, use a latin-1
terminal and all my filesystems present latin-1 at the system call
level, everything works too. If I set my locale to the predominant
Japanese DBCS encoding, use a Japanese DBCS terminal and all my
filesystems present Japanese DBCS at the system call level, almost
everything works, unless I use one of the few characters whose DBCS
encoding abuses the byte values normally associated with e.g. "/" or
"\\".

And yes, I do use all of these variations on some of my machines,
even though I don't speak the Japanese language personally.

> > 2. If a program really needs to make assumptions about the
> > character encoding of data, it should assume the character
> > encoding specified by the locale.
>
> I think that if you are writing a program today, it is saner to assume
> UTF-8, since that is the future direction.

If the locale says UTF8, then assuming UTF8 is safe. If the locale is
not UTF8, assuming UTF8 is VERY broken. My proposal went on to say
that supporting the UTF8 setting correctly is the most important case
to implement, but a neutral 8-bit clean mode must also be available,
which will handle most other encodings implicitly. Support for legacy
DBCS encodings is not required at all, because it may be too
difficult to add to programs in some situations, and users of those
languages will soon be able to get around the problem by using UTF8.

> > 3. Unless required for security or other functionality, programs
> > and libraries should not object to processing invalid
> > characters. (This increases the user's chance of being able to
> > deal with data in inconsistent or broken encodings, e.g. with
> > commands such as mv M?nch.txt Maench.txt).
>
> I believe that the programs to which you might need to pass invalid
> characters will also be the programs which will not look at or
> manipulate the filenames anyways. 'mv' is a good example of a program
> which we will *not* need to change. It just basically takes its
> arguments and passes them to the rename system call (well obviously it
> is more complicated than that, but that's the basic idea).
>

Here is a simple example: /bin/more needs to count the number of
encoded characters in order to determine when lines will wrap and
thus when to pause output. So /bin/more must recognize the UTF8 (or
other charset) byte values which indicate multi-byte encodings
representing a single character. It may even need to know about
zero-width and double-width characters (a sketch of such a counting
loop follows below). But whatever it does, it should not refuse to
pass through, unmodified, any non-UTF8 data I might feed it, because
I probably have a reason to do that if I do (maybe my locale settings
say UTF8 by mistake, maybe my super-smart terminal does dynamic
character set recognition, maybe I am piping binary data through it
to be processed by the next filter in line). The same applies to
multi-column /bin/ls output, or to my text editor.

A very well known example is perl 5.8. Many existing perl scripts
process pure binary data using string functions. This broke
unnecessarily when perl 5.8 started to assume that all string data
was valid in the user's character set and performed non-reversible
conversions on it in order to handle UNICODE internally. The proposal
says that any future changes to software should not repeat this
mistake.
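To illustrate what a pager has to do (and what it must not do), here
is a rough, untested sketch of such a counting loop, assuming the
caller has done setlocale(LC_CTYPE, ""); invalid bytes are counted as
one column each and are otherwise left alone, so non-UTF8 data still
passes through:

#define _XOPEN_SOURCE 600   /* for wcwidth() */
#include <string.h>
#include <wchar.h>

/* Count display columns of a byte string in the current locale.
 * A pager would use this only to decide where lines wrap; the
 * bytes themselves are still written out unmodified. */
size_t display_width(const char *s, size_t len)
{
    mbstate_t st;
    size_t cols = 0, i = 0;

    memset(&st, 0, sizeof st);
    while (i < len) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s + i, len - i, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* invalid or incomplete sequence: assume one column,
             * resynchronize, and let the raw byte pass through */
            memset(&st, 0, sizeof st);
            cols++;
            i++;
        } else if (n == 0) {
            i++;                     /* embedded NUL byte */
        } else {
            int w = wcwidth(wc);     /* -1, 0, 1 or 2 */
            cols += (w > 0) ? (size_t)w : 0;
            i += n;
        }
    }
    return cols;
}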
> > 4. The low level software which converts keystrokes (or other
> > non-string input) to strings or converts strings to pixels (or
> > other non-string output), is responsible for doing so
> > consistently with the locale of the programs to which it
> > provides this service, unless those programs explicitly specify
> > otherwise.
>
> I generally agree.
>
> > For terminal-style input/output, there will be a tool or library
> > feature (existing or Debian-created) which does two-way
> > conversion of character sets around a pty. This tool can /
> > should be plugged into ssh, telnet, serial line getty and other
> > conduits which allow terminal access from terminals that might
> > have different locales than preferred on a given Debian system.
>
> Such a tool could save us time (perhaps this tool already exists in the
> form of GNU screen, as mentioned by David Starner), but note we can't
> really force users to use it.

The idea is that those Debian packages which provide the interfaces
to external terminals (telnet, ssh, serial line variants of getty)
should be packaged to invoke the tool or feature implicitly by
default, thereby causing all terminals to look like UTF8 terminals
(if the locale charset is UTF8), even if external computers or
hardware terminals really are not (the core of such a wrapper is
sketched below). Since Debian is Free Software, users still have the
freedom to break things, but things should not be broken as shipped.
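As a sketch of the core of such a wrapper (untested; it assumes the
pty plumbing and a select() loop around two such pumps, one per
direction, already exist, and the charset names in a real invocation
would come from the negotiated terminal settings):

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Copy bytes from fd 'in' to fd 'out', converting from charset
 * 'from' to charset 'to', e.g.
 *     pump(STDIN_FILENO, pty_master_fd, "ISO-8859-1", "UTF-8");
 * where pty_master_fd is the (hypothetical) pty master. */
static void pump(int in, int out, const char *from, const char *to)
{
    iconv_t cd = iconv_open(to, from);
    char ibuf[1024];
    size_t have = 0;
    ssize_t n;

    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        exit(1);
    }
    while ((n = read(in, ibuf + have, sizeof ibuf - have)) > 0) {
        char *ip = ibuf;
        size_t il = have + (size_t)n;

        for (;;) {
            char obuf[4096], *op = obuf;
            size_t ol = sizeof obuf;
            size_t r = iconv(cd, &ip, &il, &op, &ol);

            if (op > obuf)
                write(out, obuf, (size_t)(op - obuf));
            if (r != (size_t)-1)
                break;          /* everything converted */
            if (errno == E2BIG)
                continue;       /* output buffer was full; flushed, retry */
            if (errno == EINVAL)
                break;          /* partial character, wait for more input */
            /* EILSEQ: unconvertible byte, substitute and resynchronize */
            write(out, "?", 1);
            ip++;
            il--;
        }
        have = il;              /* keep any partial character */
        memmove(ibuf, ip, have);
    }
    iconv_close(cd);
}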
> > 5. Software which persists or transports strings outside the
> > current process group, such as the name processing in
> > filesystems, should convert strings from the current locale to a
> > common encoding chosen by the implementor, such as UTF8, UTF16,
> > UTF32 or in some cases another encoding. It must be possible to
> > turn off the translation through an extra environment variable,
> > no matter what the locale or its character encoding.
>
> Ugh, I am opposed to any sort of environment variable like this. I
> think it will not be necessary, and will complicate the implementation.

There are some real world tasks (mostly related to system
administration, crash recovery, backup etc.) where the ability to
directly access the raw encodings of filenames etc. is vital, but
correct graphic display of some characters is not. Such tasks need to
run with character set translation turned off, and ditto for any
other unwanted "automatic" assistance. A good example is your
hypothetical script to convert on-disk filenames to UTF8 by renaming
files: this tool obviously needs to bypass UTF8 translation in order
to access the old filenames in the first place. Another is tools
which relate raw disk blocks to the output of e.g. /bin/ls, or to
filenames specified by "/sbin/fstool *.bak".

This is actually one of the big MS mistakes from around 1990. When
they implemented Windows 2.x/3.x/9x on top of MS-DOS, they switched
from the old IBM/DOS encodings (like 437 and 850) to early versions
of latin-1 and friends (known in the MS world as ANSI encodings), and
they added implicit character conversions to some of the file system
interfaces. But they forgot to create a safe and easy way for
sysadmins and advanced users to access and manipulate files whose
names contained non-convertible characters. Even worse, they mandated
that it was the responsibility of individual programs to invoke
conversion functions at the "right" times. This meant that a lot of
programs got it wrong, creating a situation where users had to stick
to pure ASCII or risk exposing untested bugs in strange places.

They never found a way to fix things once the bad spec had been
implemented by all the Windows programs in the world. In the 32-bit
versions of Windows they removed all the non-converting system calls,
thereby removing the problem for the DOS character sets in
filesystems, killing off any differently encoded filenames and moving
those conversions into the kernel; but at the same time they made the
same mistake again for UNICODE.

> > For filenames or other data to which access must be possible
> > even if it is improperly encoded, the translation code should
> > include a well-defined escaping mechanism for accessing invalid
> > character encodings on the medium. This code must not be
> > enabled in other contexts, due to serious security issues (it
> > could e.g. allow bad people to bypass code to filter out shell
> > metacharacters etc.). This escape mechanism should allow things
> > like tar backups to just work, no matter how confused the
> > filenames on a disk.
>
> Not sure how this "escaping mechanism" would be possible, or what it
> would even really do.
>

Assume user X is running on sarge+5, a pure UTF8 setup all the way
through. Assume that filesystem xyzfs stores filenames in another
character set and is subject to automatic implicit conversions. For
some reason he mounts a device containing a few (perhaps only one)
non-UTF8 filenames (perhaps an old removable disc, perhaps NFS,
perhaps a corrupted disc, perhaps a network mount).

Such an escaping mechanism would:

1. Allow the filename to just appear in all sorts of file listings,
file open dialogs etc., without those dialogs doing anything special,
because it is all handled in the conversion routine.

2. Allow the file to be opened and manipulated with any tool the user
might find useful, because the conversion routines allow the filename
to make it through.

3. Allow the file to be backed up and restored, even if the operator
is unaware of the presence of corrupted filenames on the system.

Technically, such a conversion might work as follows:

1. When converting on-device filenames to/from the intermediary
format (probably UTF32), reversibly map any invalid byte values to
some part of the Corporate Zone in UNICODE. The same 256 UNICODE code
points can be used for all character sets; there may already be a
tradition or standard indicating which values to use (a small sketch
of this step follows below).

2. When converting locale-format (UTF8 or otherwise) system call /
library call filenames from/to the intermediary format, reversibly
map any UNICODE code point not in the local encoding to a sequence of
characters indicating the hex UNICODE code point. The locale-encoding
character introducing this escape should be chosen carefully for each
family of character encodings, as that character will become unusable
in filenames for users of that encoding.
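As a sketch of step 1 (the U+EE00 base below is an arbitrary block
inside the private use area, picked for illustration only, precisely
because I do not know which tradition or standard may already exist):

/* Map each undecodable on-device filename byte reversibly into one
 * fixed 256-codepoint block of the UNICODE private use area. */
#define ESCAPE_BASE 0xEE00UL

/* encode a filename byte that failed charset conversion */
unsigned long escape_byte(unsigned char b)
{
    return ESCAPE_BASE + b;             /* reversible by construction */
}

/* recognize and undo the mapping when converting back to the device */
int unescape_byte(unsigned long cp, unsigned char *b)
{
    if (cp >= ESCAPE_BASE && cp <= ESCAPE_BASE + 0xFFUL) {
        *b = (unsigned char)(cp - ESCAPE_BASE);
        return 1;                       /* was an escaped raw byte */
    }
    return 0;                           /* ordinary code point */
}

Step 2, the hex escape toward the locale side, would sit in the same
conversion routine, so individual programs never see any of this.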
> > A mechanism needs to be devised, either in kernel or libc, which
> > allows the conversion of filenames and console i/o to and from
> > the process locale to indeed match the process locale. A
> > similar or identical mechanism should be put in Xlib.
>
> I think it might make sense to have common library functions to do stuff
> like this in glibc.
>

NOT library functions; that is the big MS mistake. It must happen
outside individual programs and libraries in order to avoid creating
an unmaintainable mess, where every programmer must figure out when
to apply which conversion to which data, many get it wrong and create
bugs, design improvements become impossible, and all programmers
waste their time doing unnecessary work.

> > 6. The base software in sarge, such as libc, Xlib, xterm must
> > support UTF8 variants of all locales as soon as possible.
> > Without this, the rest cannot even begin to be implemented.
>
> It already does. I just tried uxterm again for the first time in a
> while, and I'm really impressed with its current level of UTF-8
> support. It can do almost all of UTF-8-demo.txt on my system.
>

I already knew that many xterm clones did it right. But the item says
that ALL the terminal emulators, ALL the local terminal interfaces
(text mode vt, svgatextmode, Xlib text input/output calls) and ALL
the locales defined by the "locales" package must support UTF8 as the
very first step of getting an environment in which UTF8 versions of
packages may ship without causing massive breakage.

> > P.S. I am not a DD, just trying to be helpful and constructive.
>
> Thanks for your comments.

You're welcome.

--
This message is hastily written, please ignore any unpleasant
wordings, do not consider it a binding commitment, even if its
phrasing may indicate so. Its contents may be deliberately or
accidentally untrue. Trademarks and other things belong to their
owners, if any.