On Tue, Jan 07, 2003 at 09:29:44AM +0100, Radovan Garabik wrote:
[...]
> > > > #99933 goes a lot farther than #174982. First of all, we can't even
> > > > suggest that people use UTF-8 in package control fields until all our
> > > > tools support it. Right now it is just plain broken to put anything but
> > > > ASCII in them.
> > >
> > > But people are putting ISO-8859-1 there, now and then.
> >
> > Yes, and it is fundamentally broken to do so, because our tools do not
> > support it. Displaying it might happen to work on the maintainer's
> > machine, but it will probably fail in many more places around the world,
> > where people use terminals with a different native encoding type.
> >
> > > And I am going to use UTF-8 for Maintainer: in my packages, once
> > > I have new stable mail address (and new UTF-8 GPG alias)
> >
> > Please only use ASCII until the tools support it, and file bugs against
> > packages with control fields with characters not in ASCII. Otherwise
> > you are just worsening the problem by adding yet another encoding to the
> > mix of ISO-8859-1, ISO-8859-2, and who knows what else is already there.
>
> but unless someone starts actually _using_ UTF-8, we would never know
> which tools are broken and which are not (I already found one bug
> in handling of UTF-8 GPG alias - I'll file the bugreport after some more
> testing).
> And remember, this is debian *un*stable, so some breakage is to be
> expected.
[Could this discussion take place on debian-i18n?]

Mixing legacy encodings and UTF-8 looks like a bad idea, except that we
can determine whether a string is UTF-8 encoded or not. So mixing makes
automatic conversion a bit harder, but it is not a real problem.

The main problem with text files is that their encoding is not
specified. All human-editable text files must *explicitly* state their
encoding, either in their content (like XML/SGML/HTML) or in their file
name (.txt documentation and man pages must carry their encoding in
their full name, and the naming scheme must be standardized). This
allows support for both UTF-8 and legacy encodings. (To Colin: you did
not notice any problem because ASCII text is valid UTF-8, but problems
arise with all other legacy encodings.)

A good example is debconf. Joey Hess added encoding information in
1.2.0; legacy encodings are currently the default, and the switch to
UTF-8 will take place when the time comes, without any trouble.
Automatic conversion to the user's locale (including UTF-8) is performed
on output. The only problem is that very few maintainers have switched
to po-debconf so far in order to add encoding information to their
templates files.

A similar approach could be considered for deb control files: a new
mandatory Encoding field would be added to debian/control (and
automatically put in other files when needed), stating the encoding
used by all control files. Dpkg and friends may then perform automatic
conversion (to UTF-8 or to the current user's locale) if desired.

Denis
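
P.S. A minimal Python sketch of the two ideas above: checking whether a byte string is valid UTF-8, and preferring an explicitly declared encoding when one exists. This is only an illustration, not how dpkg or debconf behave. The function name `decode_field`, the `declared` parameter (standing in for the proposed Encoding field) and the ISO-8859-1 fallback are all assumptions made for the example.

```python
from typing import Optional, Tuple


def decode_field(raw: bytes,
                 declared: Optional[str] = None,
                 fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Decode a control-field value and report which encoding was used.

    `declared` plays the role of a hypothetical Encoding field; `fallback`
    is an assumed legacy default, not something any Debian tool defines.
    """
    if declared is not None:
        # An explicit declaration always wins: no guessing is needed.
        return raw.decode(declared), declared
    try:
        # Strict decoding fails on byte sequences that are not valid UTF-8,
        # which is what makes UTF-8 detectable. Pure ASCII always passes.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not UTF-8: we can only guess which legacy encoding was meant.
        return raw.decode(fallback), fallback


if __name__ == "__main__":
    sample = "š".encode("iso-8859-2")          # the single byte 0xB9
    print(decode_field(sample))                 # guessed as ISO-8859-1
    print(decode_field(sample, "iso-8859-2"))   # declared, so decoded as meant
```

The guess in the undeclared case can be wrong: byte 0xB9 is "š" in ISO-8859-2 but "¹" in ISO-8859-1, which is why detection alone is not enough and an explicit declaration is still needed for legacy encodings.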