On Dec 31, 2007 1:41 PM, ChadDavis <[EMAIL PROTECTED]> wrote:
> When I run 'ls' on a given directory, some of the file names show a question
> mark in the place of a non-supported character.  In trying to understand
> what is happening, I find that I don't understand a couple of fundamentals.
>
> 1) what is the default encoding of my debian system?

On new Etch installs, UTF-8 is the default. On older systems, it depends
on your locale (I'm not sure whether a system upgraded to Etch would be
UTF-8 or not). In the US it would be ISO-8859-1 or ISO-8859-15, I think.
Run the command "locale" and see what it says. Mine says en_US.UTF-8
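
If you would rather check this from a script, here is a minimal Python
sketch. The helper names codeset_of and current_encoding are my own, and
I'm assuming the usual POSIX locale name layout of
language[_territory][.codeset][@modifier]:

```python
import locale


def codeset_of(locale_name: str) -> str:
    """Return the codeset part of a POSIX-style locale name, or '' if absent."""
    # Drop an optional "@modifier" suffix first (e.g. "@euro").
    without_modifier = locale_name.split("@", 1)[0]
    if "." not in without_modifier:
        return ""  # names like "C" or "en_US" carry no explicit codeset
    return without_modifier.split(".", 1)[1]


def current_encoding() -> str:
    """Ask Python which text encoding the current locale implies."""
    return locale.getpreferredencoding()


if __name__ == "__main__":
    print(codeset_of("en_US.UTF-8"))
    print(current_encoding())
```

What current_encoding() returns depends on the machine it runs on, so
treat it as a report of your own settings and nothing more.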

> 2) It seems that a file itself doesn't have any encoding as it is sitting on
> the hard drive -- its just bytes, right?  when a given application picks it
> up, that application will try to read it as a certain encoding -- how is
> that determiniation made?

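All files have some encoding or format. Text files do, of course, but so
do binary files like .jpg or .mp3. Even binary executables and libraries
have a specific format (binary executables are in ELF format on
non-ancient Linux systems). You are right, though, that on disk it is
just bytes: nothing outside the file records which encoding it uses.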

When a text file is opened, I believe most simple apps try to interpret
it based on your system's locale. Some smarter programs may apply
fairly complicated heuristics to determine the encoding. Some
plain-text-based file types, such as XML, declare the encoding near
the beginning of the file.
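
To make that concrete, here is a rough Python sketch of the kind of
guessing a program might do. It is not how any particular application
works; guess_encoding and the ISO-8859-1 fallback are my own choices for
illustration. It looks for an XML declaration first, then tries the
locale's encoding, then falls back:

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the declared encoding name.
XML_DECL = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def guess_encoding(data: bytes, locale_encoding: str = "utf-8") -> str:
    """Guess a text encoding for data; a toy heuristic, easily fooled."""
    declared = XML_DECL.match(data)
    if declared:
        # The file says what it is, so believe it.
        return declared.group(1).decode("ascii")
    try:
        data.decode(locale_encoding)
        return locale_encoding
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so it never fails,
        # which also means it can be silently wrong.
        return "iso-8859-1"


# The same bytes read under two different assumptions:
utf8_bytes = "café".encode("utf-8")
as_utf8 = utf8_bytes.decode("utf-8")         # 'café'
as_latin1 = utf8_bytes.decode("iso-8859-1")  # 'cafÃ©'
```

The last two lines show why the guess matters: the bytes never change,
only the interpretation does.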

> 3) What is the encoding of the file name?  Is this a feature of the
> filesystem?

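This is also based on your locale. As far as I know, the usual Linux
filesystems just store the name as bytes, and programs like ls interpret
those bytes according to the locale.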

Note that if you download a text file that is in, say, Shift-JIS (a common
Japanese encoding), the file and perhaps the filename will still be in
Shift-JIS. Even if your system is UTF-8 and has Japanese fonts installed,
it will not display the file correctly if it simply interprets it based on your
locale.
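
Here is a small Python sketch of that mismatch. The sample string and the
decode_lossy helper are just mine for illustration. Shift-JIS bytes
decoded as UTF-8 produce replacement characters, much like the question
marks you see from ls:

```python
text = "日本語"
raw = text.encode("shift_jis")  # the bytes as they would sit on disk


def decode_lossy(data: bytes, encoding: str) -> str:
    """Decode data, substituting U+FFFD for anything that does not fit."""
    return data.decode(encoding, errors="replace")


shown_wrong = decode_lossy(raw, "utf-8")      # what a UTF-8 locale would see
shown_right = decode_lossy(raw, "shift_jis")  # the original text again

if __name__ == "__main__":
    print(repr(shown_wrong))
    print(shown_right)
```

Having Japanese fonts installed does not help in the first case, because
the bytes were never turned into the right characters to begin with.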

There are programs that can convert between encodings, including the
"convmv" package, which converts only filenames, the package
"utf8-migration-tool" and the "recode" package.
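
The basic idea behind converting a filename can be sketched in a few
lines of Python. This is only an illustration of the principle, not how
convmv or the other tools are actually implemented, and the function
names convert_name and plan_renames are made up. It only plans renames
and never touches the disk:

```python
from typing import List, Tuple


def convert_name(raw_name: bytes, source: str, target: str = "utf-8") -> bytes:
    """Re-encode a filename's bytes from the source to the target encoding."""
    return raw_name.decode(source).encode(target)


def plan_renames(names: List[bytes], source: str,
                 target: str = "utf-8") -> List[Tuple[bytes, bytes]]:
    """Return (old, new) pairs for the names that would change."""
    plan = []
    for name in names:
        try:
            new_name = convert_name(name, source, target)
        except UnicodeError:
            continue  # not valid in the assumed source encoding; leave it
        if new_name != name:
            plan.append((name, new_name))
    return plan
```

In real use the names could come from os.listdir(b"."), which returns raw
bytes when given a bytes path. You still have to know (or guess) the
source encoding yourself.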

> I realize these questions may not be that "smart"; please tell me what I'm
> missing if so.  Also, point me to documentation if you know of some that
> explains all of this.  I couldn't find anything on the topic searching the
> web or debian docs.

For general info start with these wiki pages and some of the other pages
they link to:

http://en.wikipedia.org/wiki/Locale
http://en.wikipedia.org/wiki/Character_encoding


If you want more in-depth programmer-oriented info on Unicode, check
out Joel's article:

http://www.joelonsoftware.com/articles/Unicode.html


There is more Debian-specific info about charsets, locales, etc. in the
Debian Reference section on L10n (Localization) [take out 10 letters]:

http://www.debian.org/doc/manuals/debian-reference/ch-tune.en.html#s-l10n

and in the Debian i18n (internationalization) [take out 18 letters] Guide:

http://www.debian.org/doc/manuals/intro-i18n/index.en.html


Cheers,
Kelly Clowers

