On Mon, May 06, 2013 at 02:49:57PM +0200, Andreas Beckmann wrote:
> now might be the right time to start a discussion about release goals
> for jessie.
I would like to propose full UTF-8 support. I don't mean full support for
all of Unicode's finer points, merely the complete eradication of mojibake.
That is, ensuring that /m.o/ matches "möo", or that "ä" sorts as equal to
"a" + combining "¨", is out of scope of this proposal.

I propose the following sub-goals:

1. All programs should, in their default configuration, accept UTF-8 input
   and pass it through uncorrupted. Having to specify an encoding manually
   is acceptable only in a programmatic interface; GUI, std{in,out,err},
   the command line and plain files should work with nothing but LC_CTYPE.
2. All GUI/curses/etc. programs should be able to display UTF-8 output
   where appropriate.
3. All file names must be valid UTF-8.
4. All text files should be encoded in UTF-8.

This proposal doesn't call for the eradication of non-UTF-8 locales, even
though I think that's long overdue. Josselin Mouette proposed that in
#603914, and I agree, but that's material for another flamewar.

Let's discuss the above points in depth:

1. Properly passing UTF-8

Text entered by a user should never get mangled. These days, we can assume
mixed charsets are a thing of the past, so there's no need for special
handling. Programs that can't handle UTF-8 are, mostly, a thing of the
past as well -- but for historic reasons, some are still not configured to
handle it by default. Thus, let's mandate that no per-program steps are
needed.

An example: let's say we have an SQL table foo(a varchar(250)), and we run

  somesqlclient -e "insert into foo values('$x'); select a from foo"

(-e being whatever stands for "execute this statement").

  sqlite3:       ok
  p[ostgre]sql:  ok
  mysql:         doesn't work!

But... the schema was declared as UTF-8, my locale is en_US.UTF-8, so why
doesn't it work? It turns out mysql requires you to call it with an extra
argument, --default-character-set=utf8. There's no binary ABI to maintain,
and compatibility with some historic behaviour makes no sense. I can accept
having to specify the charset in, say, a DBI line, as that's what the API
wants, but on the command line... that's just wrong. Am I supposed to wrap
everything with iconv, and suffer data loss on the way? Setting LANG/LC_foo
should be enough.

Another case, perhaps more controversial, is apache. Just take a look at
how many of Debian's random project pages have mangled encodings somewhere.
To a 0th approximation, well over one third (more for text/plain, such as
logs). And that's with users whose skills are way above average.

These days, producing text that's not in UTF-8 can take quite a bit of
effort, especially with modern GUI tools which don't even really pay lip
service to supporting ancient charsets anymore. Thus, if someone serves
some text in such a charset, he already takes pains even to edit it. One
argument is that because AddDefaultCharset overrides http-equiv, such old
files would be mangled. I'd say, as they already take effort to maintain,
let's let them rot in hell: they are a rare case that stands in the way of
a nearly ubiquitous one working properly. Such an admin can always
configure his server to use an ancient encoding if he wishes to do so.
(The other argument, our own files shipped in /doc/, is dead since apache
2.2.22-4, and is a major part of part 4 of this proposal.)

2. GUI/curses display

With gtk, qt, and probably more, the issue is mostly moot. Other toolkits
might require some work, but typically it's a matter of encoding (part 1
of this proposal): characters have different horizontal widths, so you
already rely on outside functions for things like line wrapping.

Not so much in curses. Here, some characters take two cells (CJK), some
take zero (zero-width spaces), and some take zero but must not be detached
from the previous character (combining characters). The line wrapping
algorithm is actually quite simple, but it needs to be implemented for
every curses program that displays arbitrary strings. Ouch. [I have quite
some experience fixing curses/etc. programs this way, so I pledge priority
help here. gtk/qt/fooxwidgets, not so much.]
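To illustrate the width rules involved, here is a minimal Python sketch
approximating what wcwidth(3) does, using nothing but the stdlib
unicodedata tables -- the real width tables vary between libc and Unicode
versions, so take it as a sketch, not a reference implementation:

  import unicodedata

  def cell_width(ch):
      # combining marks and zero-width/format characters take no cell
      if unicodedata.combining(ch) or unicodedata.category(ch) == 'Cf':
          return 0
      # CJK wide and fullwidth forms take two cells
      if unicodedata.east_asian_width(ch) in ('W', 'F'):
          return 2
      return 1

  def display_width(s):
      # what a curses program needs before it can wrap or pad a line
      return sum(cell_width(ch) for ch in s)

  assert display_width("A\u0308") == 1      # A + combining diaeresis
  assert display_width("\u65e5\u672c") == 4 # two CJK wide characters

Every program that wraps or pads arbitrary strings on a terminal needs
something equivalent, which is exactly the per-program work I'm offering
to help with.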
3. All file names must be UTF-8

This is quite straightforward. Packages shipping non-UTF-8 file names are
already uninstallable on filesystems that operate in characters rather
than bytes. It might be a good idea to forbid nasty stuff like newlines,
tabs, etc. too.

I propose to apply this restriction to source packages as well. If the
Contents-* files are to be believed, the only violator is a single binary
package, with zero source ones, so there'd be no extra work now, and at
most a repack if an upstream regresses. The benefit is less clear than for
binaries, but it's trivial and would prevent unexpected breakage.

4. All shipped text files in UTF-8

We don't want mojibake in provided documentation, config files, etc. With
the number of hackers around, that includes even perl/shell/python/etc.
scripts in /*/bin. In short, all text files.

This could be done by a debhelper tool, possibly driven declaratively by a
file naming the encoding that detected non-UTF-8 text files should be
converted from. If your package contains some files in an ancient
encoding, you would run:

  echo "iso-8859-42 *" >debian/ancient_encoding

and the tool would detect text files, check whether they're already UTF-8,
and if not, convert them from that iso-8859-42. I expect 99% of cases to
use just one such encoding per package, but the above syntax allows
per-file control.

Detecting non-UTF-8 files is easy:
* false positives are impossible
* false negatives are extremely unlikely: combinations of letters that
  would happen to match a valid UTF-8 sequence don't occur naturally, and
  even if they did, every single such combination in the file would need
  to be valid UTF-8.

On the other hand, detecting text files is hard. The best tool so far,
"file", makes so many errors it's useless for this purpose. One could go
by location -- say, declaring stuff in /etc/ and /usr/share/doc/ to be
text unless proven otherwise -- but that's an incomplete hack. Only
hashbangs can be considered reliable, but scripts are not where most
documentation goes. Also, should HTML be considered text or not? Updating
http-equiv is not rocket surgery, but detecting HTML with fancy extensions
can be. A 100% opt-in approach, though, would be way too incomplete. Ideas?

4a. perl and pod

Considering perl to be text raises one more issue: pod. By perl's design,
pod without a specified encoding is treated as ISO-8859-1, even if the
file contains "use utf8;". This is surprising, and many authors use UTF-8
here like everywhere else, with obvious results ("man gdm3" for one
example). Thus, there should be a tool (preferably the one mentioned
above) that checks perl files for pod with an undeclared encoding and
raises the alarm if the file contains any bytes with the high bit set. If
a conversion encoding is specified, such a declaration could be added
automatically.
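Such a check is cheap. A rough Python sketch of the idea (a real tool
would presumably live in debhelper and use perl's own pod machinery; the
directive list below is just a heuristic I'm assuming here):

  import re, sys

  def pod_needs_encoding(path):
      data = open(path, 'rb').read()
      # any pod at all?
      has_pod = re.search(rb'^=(pod|head[1-4]|over|item|begin)\b', data, re.M)
      # an explicit =encoding declaration?
      has_encoding = re.search(rb'^=encoding\s+\S+', data, re.M)
      # bytes that ISO-8859-1 and UTF-8 would interpret differently
      has_high_bit = any(b >= 0x80 for b in data)
      return bool(has_pod) and not has_encoding and has_high_bit

  if __name__ == '__main__':
      for path in sys.argv[1:]:
          if pod_needs_encoding(path):
              print('%s: pod with high-bit bytes but no =encoding' % path)

And if debian/ancient_encoding names a source encoding, the same tool
could insert the matching =encoding line instead of merely warning.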
[I'm at DebConf, so let's discuss.]

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ