Bug#99933: second attempt at more comprehensive unicode policy

Colin Walters Thu, 02 Jan 2003 17:06:20 -0600

On Thu, 2003-01-02 at 13:57, Colin Walters wrote:

> #99933 goes a lot farther than #174982.


I have a counter-proposal to #99933, which I have attached.  I believe
it fixes the problems I raised with your proposal, and should also cover
some new areas (like filenames).  I also hopefully fixed James' issue
with the RFC link.

This patch supplants the one in #174982.  It is more ambitious than
#174982, but still does not introduce any "must"s, only "should"s or
weaker. 

Opinions?

--- policy.sgml 2003-01-01 21:59:26.000000000 -0500
+++ policy.sgml.new     2003-01-02 17:14:56.000000000 -0500
@@ -2258,10 +2258,8 @@
        </p>
 
        <p>
-         The entire changelog must be encoded in the
-         <url id="http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2279.html" name="UTF-8">
-         encoding of
-         <url id="http://www.unicode.org/" name="Unicode">.
+         The entire changelog should be encoded in UTF-8; see <ref
+         id="unicode"> for more information.
        </p>
        
        <sect1><heading>Defining alternative changelog formats</heading>
@@ -4190,6 +4188,31 @@
       <sect>
        <heading>Filesystem hierarchy</heading>
 
+       <sect1>
+         <heading>File Names</heading>
+
+         <p>
+           Files included in Debian packages or created by maintainer
+           scripts should have names which are valid UTF-8.  Since
+           UTF-8 is fully backwards compatible with ASCII, few
+           packages will encounter trouble with this.
+         </p>
+
+         <p>
+           Programs should expect filenames in general (whether from
+           a Debian package or created by the user) to be encoded
+           with UTF-8, although programs are encouraged to fall back
+           gracefully to the current locale's encoding if decoding as
+           UTF-8 fails.  Programs included in Debian packages
+           should, when creating new files, encode their names in
+           UTF-8 by default.
+         </p>
+
+         <p>
+           See <ref id="unicode"> for more information on Debian and
+           Unicode.
+         </p>
+       </sect1>
 
        <sect1>
          <heading>Filesystem Structure</heading>
@@ -5414,6 +5437,32 @@
        </p>
       </sect>
 
+      <sect id="unicode">
+       <heading>Unicode</heading>
+
+       <p>
+         Debian is moving towards
+         <url id="http://www.unicode.org/" name="Unicode">,
+         and specifically the <url id="http://www.ietf.org/rfc/rfc2279.txt" name="UTF-8">
+         encoding of Unicode, for representation of character data.
+         Unicode is a universal character set, able to encode all the
+         world's languages.  Using Unicode makes internationalization
+         much easier, since programs will have to deal with only one
+         character set, instead of many different incompatible
+         national variants.
+       </p>
+
+       <p>
+         The UTF-8 encoding of Unicode is designed for Unix-like
+         systems such as Debian.  It is fully backwards compatible
+         with US-ASCII, and is also safe for use in filenames, since
+         no ASCII character appears as part of a multibyte character.
+         It is highly recommended, although not yet required, for
+         programs included in Debian to support Unicode and
+         specifically UTF-8.
+       </p>
+      </sect>
+
       <sect>
        <heading>Environment variables</heading>
 
@@ -7647,6 +7696,42 @@
        </p>
 
        <p>
+         All documentation included in a package should be encoded in
+         UTF-8 (see <ref id="unicode"> for more information).  If
+         upstream documentation is in another character set, the data
+         should be converted during the package build process.
+         <footnote>
+           <p>
+             One good way to do this is to use <prgn>iconv</prgn>, like:
+<example>
+       for file in ChangeLog doc/README doc/INSTALL; do
+         iconv -f ISO-8859-1 -t UTF-8 $file &gt; $file.new && mv $file.new $file
+       done
+</example>
+           </p>
+         </footnote>
+       </p>
+
+       <p>
+         Documentation formats which include a standard means of
+         specifying the character set of the data (such as
+         XML's <tt>encoding</tt> declaration) may, at their option, use
+         another character set, although UTF-8 is still preferred.
+         Additionally, it is recommended for document formats which
+         are capable of specifying the character set of their data,
+         and do not have a default (like HTML), to do so.
+         <footnote>
+           <p>
+             As an example, for HTML documents, the <tt>head</tt>
+             section should include a header like:
+<example>
+  &lt;META content='text/html; charset=UTF-8' http-equiv='Content-Type'/&gt;
+</example>
+           </p>
+         </footnote>
+       </p>
+
+       <p>
          Other formats such as PostScript may be provided at the
          package maintainer's discretion.
        </p>
