Re: The C locale

Thomas Wolff Tue, 29 Sep 2009 04:13:02 -0700

Corinna Vinschen wrote:

On Sep 29 01:03, IWAMURO Motonori wrote:

2009/9/27 IWAMURO Motonori <deenhe...@gmail.com>:

LANG="ja" -> EUCJP
LANG="ja_JP" -> EUCJP

Hmmm, It is a difficult problem.


I think selecting UTF-8 is good because eucJP is legacy.

But, for interoperability with other UNIX-like systems(*), I don't
think selecting UTF-8 is good.

* Solaris: ja, ja_JP -> eucJP
* Linux (Debian): ja -> Unknown, ja_JP -> eucJP

I need to think more...

My conclusion is as follows as a result of hearing other Japanese
people's opinion:

LANG=ja -> UTF-8
LANG=ja_JP -> UTF-8

Because we specify "eucJP" explicitly when we need it.


Hmm.

That's an interesting point.

In theory this sounds like a good idea to be used for all locales which
don't specify the charset explicitly, because that results in using the
same charset, "UTF-8", for all such locales.  "C", "ja" or "en_US"
would all default to UTF-8.

The keyword here again should be compatibility. That means,
unfortunately, that I do not think this is a good idea.

A number of locales have been established on common systems that do not
specify their encoding explicitly (i.e. in their name). Since there is
now more or less a common set of such locales among various Linux and
Unix systems, this seems to be a de-facto standard, although I am not
aware of any more formal definition/listing/description of this.

On a modern Linux system, use the following command to get a list (not
sure if it's appropriate to attach it here):

   for l in $(locale -a)
   do
      # print each locale name next to the charmap it implies
      printf '%s\t%s\n' "$l" "$(LC_ALL="$l" locale charmap 2>/dev/null)"
   done

I have also tried to incorporate a best guess assembly of mappings from
modern systems in my editor mined so it can derive the encoding from the
locale name, so you could also take a working list from there.

I think this list should be used for reference to define the
locale/encoding mapping; other choices may be more attractive but would
only raise problems.
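
A minimal sketch of what such a name-to-encoding lookup could look like,
written as a POSIX shell function. The function name charset_for_locale
is hypothetical, and the small table is only an illustrative assumption
based on values commonly reported by glibc systems (and the ja/ja_JP
mappings quoted earlier in this thread); it is not the list used by mined
or by any particular system.

```shell
# Hypothetical helper: guess the charset implied by a locale name.
# The table is an illustrative assumption, not an authoritative list.
charset_for_locale() {
    name=${1%%@*}                          # drop a modifier such as @euro
    case $name in
        *.*)         printf '%s\n' "${name#*.}" ;;   # explicit codeset wins
        C|POSIX)     echo "ANSI_X3.4-1968" ;;        # assumed: glibc's name for ASCII
        ja|ja_JP)    echo "EUC-JP" ;;                # assumed: traditional default
        en_US|en_GB) echo "ISO-8859-1" ;;            # assumed: traditional default
        *)           echo "unknown"; return 1 ;;     # not in this sketch's table
    esac
}
```

For example, charset_for_locale ja_JP.UTF-8 prints UTF-8 because the
name carries its own encoding, while charset_for_locale ja_JP falls back
to the table entry. A real list would be generated from the output of
the loop above rather than typed by hand.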

The downside is that a user, who needs to work under the default ANSI
codepage for some reason, has to know the name of the default ANSI
codepage.  Right now any user who needs the default ANSI codepage can
simply set LANG to some language code and go ahead, without having to
know the number.  With your solution, that wouldn't be possible anymore
and the user would have to figure out the default ANSI codepage on the
system before being able to use it.

I honestly don't know if that's really a problem, though.  But I don't
want to take that feature away for now.  Anybody having a strong opinion
on this issue?

I wasn't quite aware that the old "codepage:oem" setting didn't strictly
mean "CP850" or "CP437" but apparently the respective system locale.
If that is really needed, maybe the "C" locale should get you there, or
some "OEM" as (I think) Andy proposed. If someone feels the need to
combine a specific language setting with the unspecific "system locale",
well, maybe a pseudo encoding name could be invented to form names like
"en_GB.OEM". Just leaving out the encoding suffix should not have that
effect, as I argued above.


Kind regards,
Thomas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
