Bug#99933: Bug#174982: [PROPOSAL]: Debian changelogs should be UTF-8 encoded

Colin Walters Tue, 07 Jan 2003 10:22:34 -0600

On Tue, 2003-01-07 at 04:29, Denis Barbier wrote:

> > but unless someone starts actually _using_ UTF-8, we would never know
> > which tools are broken and which are not (I already found one bug
> > in handling of UTF-8 GPG alias - I'll file the bugreport after some more
> > testing).


Testing our tools' support for UTF-8 on your local system is perfectly
fine; I've been doing just that personally.  But, ...

> > And remember, this is debian *un*stable, so some breakage is to be
> > expected.

Uploading packages with UTF-8 control fields is not ok.  It will, simply
put, not work for anyone who's not using a UTF-8 terminal, which is
unfortunately probably most of our users at the moment.  Just Don't Do
It.

If you really want to help push UTF-8, apply my dpkg patch, help
find/fix bugs in it, then start ensuring apt-get, aptitude, etc., all
grok UTF-8.

> [Could this discussion take place on debian-i18n?]

Actually I think we should probably move to -devel, given how strongly
this affects the system in general.  Even people who maintain programs
which care little for i18n will still have to deal with UTF-8 filenames,
and should be UTF-8 aware in general.

It looks to me like at this point almost everyone agrees with the
content of my proposal in #99933, and we are discussing implementation
details.  Agreed?

If so, another second would be cool :)  And also if that is the case,
then it makes a better argument for moving to -devel.

> Mixing legacy encodings and UTF-8 looks like a bad idea, except that
> we can determine whether strings are UTF-8 encoded or not.  

Not with perfect reliability.
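
To illustrate the point: a strict UTF-8 validity check is a strong hint,
but not proof.  Here is a minimal sketch, assuming Python 3; the helper
name looks_like_utf8 and the sample byte strings are made up for
illustration and are not taken from any Debian tool.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "café" in ISO-8859-1 ends in a lone 0xE9 byte, which is not valid UTF-8,
# so the check correctly rejects it.
latin1_word = "café".encode("iso-8859-1")

# The two ISO-8859-1 characters "Ã©" are the bytes 0xC3 0xA9, which also
# happen to be the valid UTF-8 encoding of "é": a false positive.
latin1_ambiguous = "Ã©".encode("iso-8859-1")

print(looks_like_utf8(latin1_word))
print(looks_like_utf8(latin1_ambiguous))
```

Most legacy-encoded text with non-ASCII characters fails the check, but
some byte sequences are valid in both encodings, so the answer is only a
guess.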

> The main problem with text files is that their encoding is not specified.
> All human editable text files must *explicitly* tell their encoding,
> either by their content (like XML/SGML/HTML) or by their file name
> (.txt documentation or man pages must contain their encoding in their
> full name, naming scheme must be standardized).  This allows support
> for both UTF-8 and legacy encodings.  

You mean like changelog.txt.UTF-8 or changelog.UTF-8.txt?  I am pretty
much opposed to any sort of proposal of this form.  The reason is that
changing programs to recognize our arbitrary scheme for file encodings
would be a lot of work.  Instead, we could add support to programs to
autodetect the charset semi-intelligently from file content, which is
what real-world programs like Emacs do today.
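
As a rough idea of what content-based detection could look like, here is
a small sketch, assuming Python 3.  It is not how Emacs does it; the
function name guess_charset and the ISO-8859-1 fallback are assumptions
chosen for illustration, and a real tool would likely take the fallback
from the user's locale.

```python
def guess_charset(data: bytes, legacy_fallback: str = "iso-8859-1") -> str:
    """Guess a charset from content alone (hypothetical heuristic).

    Order of checks: pure ASCII, then strict UTF-8, then the assumed
    legacy fallback.  The last step is a guess, not a detection.
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return legacy_fallback


print(guess_charset(b"dpkg (1.10) unstable"))
print(guess_charset("\u00a9 2003".encode("utf-8")))
print(guess_charset("\u00a9 2003".encode("iso-8859-1")))
```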

> (To Colin: you did not notice any
> problem because ASCII text is UTF-8, but problems arise with all other
> legacy encodings).

Actually I quite frequently notice problems with European names, as well
as the copyright character.  Do not assume that because my native
language is English, I do not experience charset problems :)

> A similar approach could be considered for deb control files, a new
> mandatory Encoding field must be added to debian/control (and automatically
> put in other files when needed), which tells encoding used by all control
> files.  Dpkg and friends may then perform automatic conversion (to UTF-8 or
> to current user's locale) if desired.

Ugh.  I am generally quite opposed to adding an Encoding field, and I
bet you'll find the dpkg maintainers are too.  It should just be UTF-8,
period.  If developers really want to, they can generate control from a
control.in file by using iconv or similar.
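
For example, the conversion step could be as small as the sketch below,
assuming Python 3.  The function name transcode_to_utf8 is hypothetical,
and the source encoding has to be supplied by the developer since it
cannot be read from the file itself.  It is roughly what
`iconv -f ISO-8859-1 -t UTF-8 control.in > control` would do.

```python
from pathlib import Path


def transcode_to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Rewrite src (in source_encoding) as UTF-8 at dst (hypothetical helper)."""
    # Decode strictly so that a wrong source_encoding fails loudly where
    # the codec can tell, instead of silently writing garbage.
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))


# Usage, assuming debian/control.in is kept in ISO-8859-1:
# transcode_to_utf8(Path("debian/control.in"), Path("debian/control"), "iso-8859-1")
```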
