Alan Lord wrote:

As a UK (English) based LFS user/builder I have, so far, had no problem using a standard LFS build. However, it has seemed to me that almost all new IT systems and Web Services platforms are using UTF-8 encoding. I am therefore planning to build my next LFS as a UTF-8 system.

You are welcome, but please back up your data with tar-1.15.1 before doing so, as follows:

tar --format=posix -jcf /path/to/home-backup.tar.bz2 /home

The --format=posix switch is essential: it ensures that filenames are converted to UTF-8 if you untar this archive in a UTF-8 locale.

The reason for making such a backup is that filenames with non-ASCII characters are displayed properly only on the system that created them, not on a system whose locale uses a different encoding. File contents have to be re-encoded manually after unpacking.
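As a rough sketch of the manual part, the helpers below list the files whose names contain non-ASCII bytes and re-encode the contents of one text file with iconv. The function names and the OLD_CHARSET variable are made up for this illustration, and ISO-8859-1 is only a guess at what a UK system used before; substitute the real charset of the old locale.

```shell
#!/bin/sh
# Sketch only. OLD_CHARSET is an assumption: set it to the charset of the
# locale that originally wrote the files (e.g. KOI8-R on a Russian system).
OLD_CHARSET=${OLD_CHARSET:-ISO-8859-1}

# Hypothetical helper: print paths under directory $1 whose names contain
# bytes outside printable ASCII. LC_ALL=C makes find compare raw bytes.
list_non_ascii_names() {
    LC_ALL=C find "$1" -name '*[! -~]*'
}

# Hypothetical helper: re-encode text file $1 from OLD_CHARSET to UTF-8.
# The original is overwritten only if iconv converts the whole file.
reencode_to_utf8() {
    tmp=$(mktemp) || return 1
    if iconv -f "$OLD_CHARSET" -t UTF-8 "$1" > "$tmp"; then
        cat "$tmp" > "$1"    # write in place, keeping owner and permissions
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

For example, `list_non_ascii_names /home` shows which names are affected, and `reencode_to_utf8 notes.txt` converts one file. Run it only on plain text files, and only once per file: a second pass would treat the UTF-8 bytes as the old charset and double-encode them.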

My point to this list is really about the "whys"...

* Why is UTF-8 important to Linux users (especially English-speaking ones)?

It is not important, but some people want it. The goal for LFS/BLFS is to provide a system for them that is not more broken than a typical modern RedHat system.

* Why should anyone bother about it in the first place?

Because people such as Markus Kuhn say that a Linux system must support UTF-8. Because LSB includes Li18nux2000 by reference, and Li18nux2000 says that at least UTF-8 must be supported. Because RedHat doesn't support anything else, and other distros are going to "UTF-8 by default".

But do you really listen to all of the above-listed propaganda? Generally, this comes from the USA, from English-only people who use only ASCII and thus don't care about character sets at all and don't see the breakage. As for LSB, they can't implement their own requirements: their "sample implementation" contained unpatched gawk-3.1.4 at some time, and that version has horrible problems with UTF-8. For each RedHat distro before Fedora Core 4, a guide appeared on Russian web sites explaining how to make it work with the good old KOI8-R encoding, because UTF-8 is too broken.

So (IMHO) the only valid reasons for choosing a UTF-8 based locale are:

1. RedHat compatibility at the cost of incompatibility with everyone else (including MS Windows):
   * Need to share files via NFS with systems that already use UTF-8 locales and can't be reconfigured
   * Need to ssh often into systems that already use UTF-8 locales, and rarely to systems that don't.
   * Need to work mainly with UTF-8 encoded text documents, i.e. those coming from people using RedHat systems

2. Just being adventurous (I think this applies to you).

* What does UTF-8 offer that ISO-8859-x does not?

Ability to use more than one non-English language in one text document. Ability to communicate with everyone else who also uses UTF-8.
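The first point can be checked from the shell. This is a minimal sketch, assuming the iconv utility from GNU libc (which exits with an error when a character has no equivalent in the target charset); the helper name fits_in_charset is made up for the illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the UTF-8 text on stdin can be
# represented in charset $1. Assumes glibc iconv, which fails on
# characters that the target charset lacks.
fits_in_charset() {
    iconv -f UTF-8 -t "$1" > /dev/null 2>&1
}

# French and Russian words in one line, written as UTF-8 byte escapes:
# "déjà" followed by "мир".
sample=$(printf 'd\303\251j\303\240 \320\274\320\270\321\200')

for cs in ISO-8859-1 KOI8-R UTF-8; do
    if printf '%s\n' "$sample" | fits_in_charset "$cs"; then
        echo "$cs: representable"
    else
        echo "$cs: not representable"
    fi
done
```

ISO-8859-1 has no Cyrillic letters and KOI8-R has no accented Latin letters, so neither legacy charset can hold this line, while either half on its own fits the matching one.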

Big note: it is possible to edit UTF-8 text documents without using a UTF-8 locale. Just start Kate and select the UTF-8 encoding from the menu. It is a known working and bug-free setup.

I think this could do with a bit of discussion. The USA/English world doesn't, on the face of it, have anything to gain from going to a UTF-8 based format; however, I believe this is the way forward for *everyone* and should perhaps have a greater emphasis in the LFS project as a whole. Aren't most of the major Linux distributions now using UTF-8?

Officially, yes. Unofficially, until very recently only total n00bs in Russia didn't know that they had to revert this in order to get a reasonably bug-free system (they just thought that Linux didn't support Russian well yet: newbies see bugs, but cannot identify fixable regressions). This probably doesn't apply to English systems, though.

If I can help, Alex, I would be happy to help build/develop an English-language UTF-8 system for comparison/analysis.

You are welcome, although it is unlikely that you will find any bugs other than a general slowdown.

--
Alexander E. Patrakov
--
http://linuxfromscratch.org/mailman/listinfo/lfs-dev
FAQ: http://www.linuxfromscratch.org/faq/
Unsubscribe: See the above information page
