On 8/6/2013 5:49 AM, Cameron Simpson wrote:
On 07Jun2013 04:53, Νίκος Γκρ33κ <nikos.gr...@gmail.com> wrote:
| On Friday, 7 June 2013 11:53:04 AM UTC+3, user Cameron Simpson wrote:
| > | >| Does errors='replace' mean don't break in case of an error?
| >
| > | >Yes. The result will be correct for correct iso-8859-7 and slightly
| > | >mangled for something that would not decode smoothly.
| >
| > | How can it be correct? We have encoded our string in utf-8 and then
| > | we tried to decode it as greek-iso? How can this possibly be
| > | correct?
|
| > If it is a valid iso-8859-7 sequence (which might cover everything,
| > since I expect it is an 8-bit 1:1 mapping from bytes values to a
| > set of codepoints, just like iso-8859-1) then it may decode to the
| > "wrong" characters, but the reverse process (characters encoded as
| > bytes) should produce the original bytes.  With a mapping like this,
| > errors='replace' may mean nothing; there will be no errors because
| > the only Unicode characters in play are all from iso-8859-7 to start
| > with. Of course another string may not be safe.
|
| > Visually, the names will be garbage. And if you go:
| >   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
| > while using the iso-8859-7 locale, the wrong thing will occur
| > (assuming it even works, though I think it should because all these
| > characters are represented in iso-8859-7, yes?)
|
| I understood all the rest; only the quotes above are still unclear to me.
| I can't seem to understand them.
|
| Do you mean that utf-8, latin-iso, greek-iso and ASCII share the same
| codepoints 0-127?

Yes. It is certainly true for utf-8 and latin-iso and ASCII.
I expect it to be so for greek-iso, but have not checked.

They're all essentially the ASCII set plus a range of other character
codepoints for the upper values.  The 8-bit sets iso-8859-1 (which
I take you to mean by "latin-iso") and iso-8859-7 (which I take you
to mean by "greek-iso") are single-byte mappings with the top half
mapped to characters commonly used in a particular region.
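
The claim above can be checked directly. The following is a minimal
sketch that relies only on the codec tables bundled with Python 3; the
names CODECS and unassigned() are illustrative, not from the thread. It
decodes bytes 0-127 with each codec, compares the results, and then
shows one upper-half byte meaning different things in the two 8-bit sets.

```python
# Sketch: compare how four codecs treat the lower and upper byte ranges.
CODECS = ["ascii", "utf-8", "iso-8859-1", "iso-8859-7"]

# Bytes 0-127 should decode to the same text under every codec.
low = bytes(range(128))
low_decoded = {name: low.decode(name) for name in CODECS}
same_low_half = len(set(low_decoded.values())) == 1
print("bytes 0-127 decode identically:", same_low_half)

# The same upper-half byte is a different character in each 8-bit set.
b = bytes([0xE1])
print(b.decode("iso-8859-1"), b.decode("iso-8859-7"))


def unassigned(codec):
    """Return the byte values that the given single-byte codec refuses to decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


# iso-8859-1 assigns every byte; Python's iso-8859-7 table leaves a few gaps.
print([hex(v) for v in unassigned("iso-8859-1")])
print([hex(v) for v in unassigned("iso-8859-7")])
```

One consequence, at least with Python's codec tables: iso-8859-7 is not
quite a full 1:1 mapping of all 256 byte values, so decoding arbitrary
bytes with it can fail unless an errors= handler is given.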

Unicode has a much greater range, but the UTF-8 encoding of Unicode
deliberately has the bottom 0-127 identical to ASCII, and higher
values represented by multibyte sequences in which every byte is
in the 128-255 range. In this way pure ASCII files
are already in UTF-8 (and, in fact, work just fine for the iso-8859-x
encodings as well).
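
A short sketch of the above in Python 3, using part of the file name
from earlier in the thread as sample text. It shows that the ASCII part
encodes to the same bytes either way, that every byte of a multibyte
UTF-8 sequence is 128 or above, and what happens when those bytes are
then decoded as iso-8859-7 with errors='replace'. The variable names
are illustrative only.

```python
text = "Eυχή"  # Latin "E" followed by three Greek letters

utf8 = text.encode("utf-8")
print(utf8)  # b'E\xcf\x85\xcf\x87\xce\xae'

# Pure ASCII text produces identical bytes in ASCII and in UTF-8.
ascii_same = "999-E".encode("utf-8") == "999-E".encode("ascii")

# Every byte belonging to a multibyte UTF-8 sequence is in the 128-255 range.
multibyte = "υχή".encode("utf-8")
all_high = all(byte >= 0x80 for byte in multibyte)

# Decode the UTF-8 bytes with the wrong codec. Byte 0xAE has no character
# assigned in Python's iso-8859-7 table, so it becomes U+FFFD.
mangled = utf8.decode("iso-8859-7", errors="replace")
print(repr(mangled))

# Going back to bytes cannot recover the byte that was replaced.
round_trip = mangled.encode("iso-8859-7", errors="replace")
print(round_trip == utf8)
```

So for this particular string the wrong-codec round trip is lossy: the
replaced byte is gone, which is why errors='replace' avoids an
exception but does not guarantee the original bytes can be rebuilt.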

Hold on!

In the beginning there was ASCII with values 0-127, and then there was Unicode with ASCII's 0-127 plus I don't know how many more?

Now ASCII needs 1 byte to store a single character, while Unicode needs 2 bytes to store a character, because it has more than 256 (2^8) characters to store?

Is this correct?

Now UTF-8, latin-iso, greek-iso etc. are WAYS of storing characters on the hard drive?

Because in some post I have read about the 'UTF-8 encoding of Unicode'.
Can you please explain to me what the difference is between ASCII and Unicode themselves, and then between them and 'charsets'? I'm still confused about this.

Is it like what we say in C++:
'int a',    a variable with name 'a' of type integer.
'char a',   a variable with name 'a' of type char

So, taken from the above example (the closest I could think of), the way I understand them is:

A 'string' can be of type Unicode or ASCII, and that type needs a way (that's a charset) to store this string on the HDD as a sequence of bytes?
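
For anyone following the thread, the distinction being asked about can
be sketched in Python 3, where a str is a sequence of Unicode code
points and an encoding is a rule for turning those code points into
bytes. The sample string and variable names below are made up for
illustration. Note that the number of bytes per character depends on
the encoding chosen, not on Unicode itself.

```python
s = "aα€"  # sample text: ASCII letter, Greek alpha, euro sign

# Code points are just numbers; no byte layout is implied yet.
code_points = [ord(ch) for ch in s]
print(code_points)  # [97, 945, 8364]

# Bytes needed per character under three encodings of Unicode.
sizes = {
    enc: [len(ch.encode(enc)) for ch in s]
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}
print(sizes)

# An 8-bit set stores the characters it covers in one byte each.
alpha_greek = "α".encode("iso-8859-7")
print(alpha_greek)  # b'\xe1'

# ASCII has no byte for a Greek letter, so encoding fails.
try:
    "α".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```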






--
Webhost <http://superhost.gr> && Weblog <http://psariastonafro.wordpress.com>
-- 
http://mail.python.org/mailman/listinfo/python-list
