On Monday 7 March 2005 21:54, "Martin v. Löwis" wrote:
Hi,

Thank you for your very informative answer. Some interspersed remarks
follow.

>
> I personally would write my applications so that they put the signature
> into files that cannot be concatenated meaningfully (since the
> signature simplifies encoding auto-detection) and leave out the
> signature from files which can be concatenated (as concatenating the
> files will put the signature in the middle of a file).
>

Well, there is no text file that can't be concatenated! Sooner or later,
someone will use "cat" on the text files your application generated. That
will be a lot of fun for the new Unicode-aware "super-cat".

> > I guess that this leading BOM mark are special marking bytes that can't
> > be, in no way, decoded as valid text.
> > Right ?
>
> Wrong. The BOM mark decodes as U+FEFF:
>
> >>> codecs.BOM_UTF8.decode("utf-8")
> u'\ufeff'

By "valid text" I meant actual human-readable natural-language text. My
intent with this question was to make sure that we can easily distinguish
a UTF-8 file with the signature from one without. Your answer implies a
"yes".

> > I also guess that this leading BOM mark is silently ignored by any
> > unicode aware file stream reader to which we already indicated that the
> > file follows the UTF-8 encoding standard.
> > Right ?
>
> No. It should eventually be ignored by the application, but whether the
> stream reader special-cases it or not depends on application needs.
>

Well, for most of us, I think, the need is to transparently decode the
input into a single internal Unicode encoding (UTF-16 for both Java and
Qt; the Qt docs say there might be a need to switch to UTF-32 some day)
and then be able to manipulate this internal text with the usual tools
your programming system provides. By "transparent" I mean, at least,
being able to process both variants of the same UTF-8 encoding
automatically. We should only have to specify "UTF-8" and let the
streamer take care of the rest.

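To make the "transparent" idea concrete, here is a minimal sketch of what
an application has to do itself when the codec does not handle the
signature. It assumes a current Python 3, where decoded text is a "str";
the helper name decode_utf8 is hypothetical and is not part of the codecs
module. Only codecs.BOM_UTF8 comes from the standard library.

```python
import codecs


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading signature if present.

    Hypothetical helper for illustration; not part of the codecs module.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Strip only a signature at the very start of the data.
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


with_signature = codecs.BOM_UTF8 + "héllo".encode("utf-8")
without_signature = "héllo".encode("utf-8")

# Both variants of the same encoding decode to the same text.
print(decode_utf8(with_signature) == decode_utf8(without_signature))

# A signature in the middle of the data (for example after "cat" of two
# signed files) is not touched by this helper and survives as U+FEFF.
print("\ufeff" in decode_utf8(with_signature + with_signature))
```

The helper deliberately leaves any later U+FEFF alone, so the
concatenation problem discussed above still has to be handled separately.
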
BTW, the documentation of the Python "unicode" built-in function says it
returns a "unicode" string, which scarcely means anything. What is
Python's "internal" Unicode encoding?

>
> No; the Python UTF-8 codec is unaware of the UTF-8 signature. It reports
> it to the application when it finds it, and it will never generate the
> signature on its own. So processing the UTF-8 signature is left to the
> application in Python.
>

Ok.

> > In python documentation, I see these constants. The documentation is not
> > clear to which encoding these constants apply. Here's my understanding :
> >
> > BOM : UTF-8 only or UTF-8 and UTF-32 ?
>
> UTF-16.
>
> > BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
> > BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?
>
> UTF-16
>

Ok.

> > Why should I need these constants if codecs decoder can handle them
> > without my help, only specifying the encoding ?
>
> Well, because the codecs don't. It might be useful to add a
> "utf-8-signature" codec some day, which generates the signature on
> encoding, and removes it on decoding.
>

Ok.

My sincere thanks,

Francis Girard

> Regards,
> Martin

--
http://mail.python.org/mailman/listinfo/python-list