On Tue, 2003-01-07 at 04:29, Denis Barbier wrote:
> > but unless someone starts actually _using_ UTF-8, we would never know
> > which tools are broken and which are not (I already found one bug
> > in handling of UTF-8 GPG alias - I'll file the bugreport after some more
> > testing).

Testing our tools' support for UTF-8 on your local system is perfectly fine; I've been doing just that personally. But, ...

> > And remember, this is debian *un*stable, so some breakage is to be
> > expected.

Uploading packages with UTF-8 control fields is not ok. It will, simply put, not work for anyone who's not using a UTF-8 terminal, which is unfortunately probably most of our users at the moment. Just Don't Do It.

If you really want to help push UTF-8, apply my dpkg patch, help find/fix bugs in it, then start ensuring apt-get, aptitude, etc., all grok UTF-8.

> [Could this discussion take place on debian-i18n?]

Actually I think we should probably move to -devel, given how strongly this affects the system in general. Even people who maintain programs which care little for i18n will still have to deal with UTF-8 filenames, and should be UTF-8 aware in general.

It looks to me like at this point almost everyone agrees with the content of my proposal in #99933, and we are discussing implementation details. Agreed? If so, another second would be cool :) And if that is the case, it makes a better argument for moving to -devel.

> Mixing legacy encodings and UTF-8 looks like a bad idea, except that
> we can determine whether strings are UTF-8 encoded or not.

Not with perfect reliability.

> The main problem with text files is that their encoding is not specified.
> All human editable text files must *explicitly* tell their encoding,
> either by their content (like XML/SGML/HTML) or by their file name
> (.txt documentation or man pages must contain their encoding in their
> full name, naming scheme must be standardized). This allows support
> for both UTF-8 and legacy encodings.

You mean like changelog.txt.UTF-8 or changelog.UTF-8.txt? I am pretty much opposed to any sort of proposal of this form.
The reason is that changing programs to recognize our arbitrary scheme for file encodings would be a lot of work. Instead, we could add support to programs to autodetect the charset semi-intelligently from file content, which is what real-world programs like Emacs do today.

> (To Colin: you did not notice any
> problem because ASCII text is UTF-8, but problems arise with all other
> legacy encodings).

Actually I quite frequently notice problems with European names, as well as the copyright character. Do not assume that because my native language is English I do not experience charset problems :)

> A similar approach could be considered for deb control files, a new
> mandatory Encoding field must be added to debian/control (and automatically
> put in other files when needed), which tells encoding used by all control
> files. Dpkg and friends may then perform automatic conversion (to UTF-8 or
> to current user's locale) if desired.

Ugh. I am generally quite opposed to adding an Encoding field, and I bet you'll find the dpkg maintainers are too. It should just be UTF-8, period. If developers really want to, they can generate control from a control.in file by using iconv or similar.
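
To make the "autodetect from content" and "convert with iconv or similar" points concrete, here is a rough sketch in Python. It is only an illustration, not how dpkg or Emacs actually do it: the function names are made up for this example, and the ISO-8859-1 fallback is an assumption (a real tool would have to guess or be told the legacy charset). It also shows why detection is not perfectly reliable.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic check: does the byte string decode cleanly as UTF-8?

    A True result is not proof. Some legacy-encoded byte sequences
    also happen to be valid UTF-8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8(data: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Return data as UTF-8 bytes.

    Data that already looks like UTF-8 is passed through unchanged.
    Anything else is assumed (hypothetically) to be in legacy_encoding
    and is transcoded, much as iconv would do for control.in -> control.
    """
    if looks_like_utf8(data):
        return data
    return data.decode(legacy_encoding).encode("utf-8")


if __name__ == "__main__":
    latin1_name = "M\u00fcller".encode("iso-8859-1")  # a European name in Latin-1
    print(looks_like_utf8(latin1_name))               # not valid UTF-8
    print(to_utf8(latin1_name).decode("utf-8"))       # transcoded name

    # The unreliable case: Latin-1 bytes for "A-tilde, copyright sign"
    # are byte-for-byte the UTF-8 encoding of "e-acute".
    ambiguous = "\u00c3\u00a9".encode("iso-8859-1")
    print(looks_like_utf8(ambiguous))
```

The last case is the reason a validity check alone cannot settle the encoding question; it only works well in practice because such collisions are rare in real text.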