On 07Jun2013 04:53, Νίκος Γκρ33κ <nikos.gr...@gmail.com> wrote:
| On Friday, 7 June 2013 11:53:04 AM UTC+3, user Cameron Simpson wrote:
| > | >| errors='replace' mean dont break in case or error?
| > | >
| > | >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
| > | >for something that would not decode smoothly.
| > |
| > | How can it be correct? We have encoded out string in utf-8 and then
| > | we tried to decode it as greek-iso? How can this possibly be
| > | correct?
|
| > If it is a valid iso-8859-7 sequence (which might cover everything,
| > since I expect it is an 8-bit 1:1 mapping from bytes values to a
| > set of codepoints, just like iso-8859-1) then it may decode to the
| > "wrong" characters, but the reverse process (characters encoded as
| > bytes) should produce the original bytes. With a mapping like this,
| > errors='replace' may mean nothing; there will be no errors because
| > the only Unicode characters in play are all from iso-8859-7 to start
| > with. Of course another string may not be safe.
|
| > Visually, the names will be garbage. And if you go:
| >   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
| > while using the iso-8859-7 locale, the wrong thing will occur
| > (assuming it even works, though I think it should because all these
| > characters are represented in iso-8859-7, yes?)
|
| All the rest you i understood only the above quotes its still unclear to me.
| I cant see to understand it.
|
| Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st 0-127 codepoints similar?

Yes. It is certainly true for utf-8, latin-iso and ASCII. I expect
it to be so for greek-iso, but have not checked. They're all
essentially the ASCII set plus a range of other character codepoints
for the upper values.

The 8-bit sets iso-8859-1 (which I take you to mean by "latin-iso")
and iso-8859-7 (which I take you to mean by "greek-iso") are
single-byte mappings, with the top half mapped to characters commonly
used in a particular region.

Unicode has a much greater range, but the UTF-8 encoding of Unicode
deliberately has the bottom 0-127 identical to ASCII, and higher
values represented by multibyte sequences in which every byte is in
the 128-255 range. In this way pure ASCII files are already valid
UTF-8 (and, in fact, work just fine for the iso-8859-x encodings as
well).

| For example char 'a' has the value of '65' for all of those character sets?
| Is hat what you mean?

Yes, it has the same value in all of them (though 'a' is 97; it is
'A' that is 65).

| s = 'a' (This is unicode right? Why when we assign a string to
| a variable that string's type is always unicode and does not
| automatically become utf-8 which includes all available world-wide
| characters? Unicode is something different that a character set? )

In Python 3, yes. Strings are unicode. Note that that means they are
sequences of codepoints whose meaning is as for Unicode.

"utf-8" is a byte encoding for Unicode strings. An external storage
format, if you like. Codepoints 0-127 are 1:1 with byte values, and
the higher codepoints require multibyte sequences.

| utf8_byte = s.encode('utf-8')

Unicode string => utf-8 byte encoding.

| Now if we are to decode this back to utf8 we will receive the char 'a'.

Yes.

| I beleive same thing will happen with latin, greek, ascii isos. Correct?
|
| utf8_a = utf8_byte.decode('iso-8859-7')
| latin_a = utf8_byte.decode('iso-8859-1')
| ascii_a = utf8_byte.decode('ascii')
| utf8_a = utf8_byte.decode('iso-8859-7')
|
| Is this correct?

Yes, because of the design decision about the 0-127 codepoints.
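
To make the 0-127 point concrete, here is a small sketch, assuming
Python 3. The helper name encodings_of is made up purely for this
illustration; it compares how 'a' and 'α' come out under each of the
four encodings discussed above.

```python
# Compare how an ASCII character and a Greek character encode.
# encodings_of is a hypothetical helper, not part of any library.
ENCODINGS = ("ascii", "iso-8859-1", "iso-8859-7", "utf-8")


def encodings_of(text):
    """Map each encoding name to the bytes it produces, or None if it cannot encode text."""
    result = {}
    for name in ENCODINGS:
        try:
            result[name] = text.encode(name)
        except UnicodeEncodeError:
            result[name] = None
    return result


ascii_forms = encodings_of("a")   # the same single byte everywhere
greek_forms = encodings_of("α")   # differs per encoding, or fails

print(ascii_forms)
print(greek_forms)

# utf-8 bytes for "α" decoded with the wrong 8-bit charset: no error,
# but two wrong characters instead of one right one.
mojibake = "α".encode("utf-8").decode("iso-8859-7")
# Encoding the wrong characters back recovers the original bytes.
round_trip = mojibake.encode("iso-8859-7")
print(repr(mojibake), round_trip == "α".encode("utf-8"))
```

The point of the last three lines is that a wrong decode can succeed
silently, which is why the names look like garbage rather than
raising an exception.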

| All of those decodes will work even if the encoded bytestring was of utf8 type?
|
| The characters that will not decode correctly are those that their codepoints are greater that > 127 ?
| for example if s = 'α' (greek character equivalent to english 'a')
| Is this what you mean?

Yes, exactly so.

| --------------------------------
|
| Now back to my almost ready files.py script please:
|
|
| #========================================================
| # Collect filenames of the path dir as bytes
| greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' )
|
| for filename in greek_filenames:
|     # Compute 'path/to/filename' in bytes
|     greek_path = b'/home/nikos/public_html/data/apps/' + b'filename'

You don't mean b'filename', which is the literal word "filename".
You mean the variable:

  filename

which is already a bytes object, because you passed a bytes path to
os.listdir().

More probably, you mean:

  dirpath = b'/home/nikos/public_html/data/apps/'
  greek_filenames = os.listdir(dirpath)

  for greek_filename in greek_filenames:
    try:
      filename = greek_filename.decode('iso-8859-7')

and then:

      greek_path = dirpath + greek_filename
      utf8_filename = filename.encode('utf-8')
      utf8_path = dirpath + utf8_filename

|     try:
|         filepath = greek_path.decode('iso-8859-7')
|         # Rename current filename from greek bytes --> utf-8 bytes
|         os.rename( greek_path, filepath.encode('utf-8') )

I would break this up into smaller pieces:

  filepath = greek_path.decode('iso-8859-7')
  # Rename current filename from greek bytes --> utf-8 bytes
  utf8_path = filepath.encode('utf-8')
  os.rename( greek_path, utf8_path )

That way if an exception is thrown you have a much better idea of
exactly which line had a problem.

|     except UnicodeDecodeError:
|         # Since its not a greek bytestring then its a proper utf8 bytestring
|         filepath = greek_path.decode('utf-8')

And here you have a logic error. The idea is ok, but the encode and
os.rename are not relevant to your UnicodeDecodeError check.
So do this:

  import os

  dirpath = b'/home/nikos/public_html/data/apps/'
  greek_filenames = os.listdir(dirpath)

  for greek_filename in greek_filenames:
    try:
      filename = greek_filename.decode('iso-8859-7')
    except UnicodeDecodeError:
      # Since its not a greek bytestring then its a proper utf8 bytestring
      # no need to rename it
      pass
    else:
      # Rename current filename from greek bytes --> utf-8 bytes
      utf8_filename = filename.encode('utf-8')
      greek_path = dirpath + greek_filename
      utf8_path = dirpath + utf8_filename
      os.rename( greek_path, utf8_path )

You should try/except only around exactly the code expected to raise
an exception, not extra stuff.

However, this code won't work. Because iso-8859-7 is an 8-bit
character set, it will almost never fail to decode: nearly every
byte value is a valid byte (as far as I know only 0xAE, 0xD2 and
0xFF are unassigned). So a UnicodeDecodeError is not a reliable sign
that a name was already utf-8.

A better test might be to decode it as utf-8. If that fails, then
_guess_ that it is iso-8859-7 and rename the file, otherwise do not
touch it.

However, the real test is by eye: your program cannot deduce if a
filename is nonsense, but presumably a visual inspection will show
nonsense or sensible names.

So:

  - write a standalone python program to fix a filename (provided
    as sys.argv[1]) using the code above

  - get a utf-8 Putty terminal

  - check the remote locale is utf-8

  - do an "ls"

  - for each nonsense file, run:

      python3 fix_filename.py nonsense-filename

You should augment your rename with a prior os.path.exists() test
to make sure you do not replace an existing file.

[...snip...]
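
Putting the utf-8 test and the os.path.exists() check together, a
standalone fix_filename.py might look roughly like the following.
This is only a sketch, assuming Python 3 on a filesystem that stores
names as raw bytes; the function names utf8_name_for and fix_filename
are invented for the example, and the iso-8859-7 step is still a
guess, exactly as described above.

```python
#!/usr/bin/env python3
# fix_filename.py: rename one (guessed) iso-8859-7 filename to utf-8.
# Sketch only; utf8_name_for and fix_filename are hypothetical names.
import os
import sys


def utf8_name_for(name_bytes):
    """Return utf-8 bytes for a name guessed to be iso-8859-7, or None to leave it alone."""
    try:
        name_bytes.decode('utf-8')
    except UnicodeDecodeError:
        pass  # not valid utf-8, so guess iso-8859-7 below
    else:
        return None  # already valid utf-8 (this includes pure ASCII)
    try:
        return name_bytes.decode('iso-8859-7').encode('utf-8')
    except UnicodeDecodeError:
        return None  # not iso-8859-7 either; needs a human to look


def fix_filename(path_bytes):
    """Rename path_bytes if needed; return the new path, or None if untouched."""
    dirpath, name = os.path.split(path_bytes)
    new_name = utf8_name_for(name)
    if new_name is None:
        return None
    new_path = os.path.join(dirpath, new_name)
    # Refuse to overwrite a file that already has the target name.
    if os.path.exists(new_path):
        raise FileExistsError(os.fsdecode(new_path))
    os.rename(path_bytes, new_path)
    return new_path


if __name__ == '__main__' and len(sys.argv) == 2:
    # os.fsencode recovers the raw bytes of the command line argument.
    result = fix_filename(os.fsencode(sys.argv[1]))
    print('renamed' if result else 'left alone')
```

Working on bytes throughout means the script never depends on the
locale to interpret the old name; only your eyes in the terminal do.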

| ni...@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Error in sys.excepthook:
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] ValueError: underlying buffer has been detached
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173]
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Original exception was:
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] File "/home/nikos/public_html/cgi-bin/files.py", line 71, in <module>
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] os.rename( greek_path, filepath.encode('utf-8') )
| [Fri Jun 07 14:53:17 2013] [error] [client 79.103.41.173] FileNotFoundError: [Errno 2] \\u0394\\u03b5\\u03bd \\u03c5\\u03c0\\u03ac\\u03c1\\u03c7\\u03b5\\u03b9 \\u03c4\\u03ad\\u03c4\\u03bf\\u03b9\\u03bf \\u03b1\\u03c1\\u03c7\\u03b5\\u03af\\u03bf \\u03ae \\u03ba\\u03b1\\u03c4\\u03ac\\u03bb\\u03bf\\u03b3\\u03bf\\u03c2: '/home/nikos/public_html/data/apps/filename'

Well, I would guess 2 things are happening:

  - you construct a literal b'/home/nikos/public_html/data/apps/filename'
    at the top of your script (see my earlier remarks), hence the
    complaint that it does not exist

  - I would guess that the \\uxxxx sequences are a Unicode
    transcription of the error message, transcribed as hex because
    they don't look "printable" in the current locale

Cheers,
-- 
Cameron Simpson <c...@zip.com.au>

Louis Pasteur's theory of germs is ridiculous fiction.
        --Pierre Pachet, Professor of Physiology at Toulouse, 1872
-- 
http://mail.python.org/mailman/listinfo/python-list