Re: Unicode BOM marks

John Roth Tue, 08 Mar 2005 07:55:38 -0800

"Martin v. Löwis" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]

Francis Girard wrote:
Well, no text files can't be concatenated! Sooner or later, someone will use "cat" on the text files your application generated. That will be a lot of fun for the new Unicode-aware "super-cat".
Well, no. For example, Python source code is not typically concatenated,
nor is source code in any other language. The same holds for XML files:
concatenating two XML documents (using cat) gives an ill-formed document
- whether the files start with a UTF-8 signature or not.

And if you're talking HTML and XML, the situation is even worse, since the application absolutely needs to be aware of the signature. HTML might have a <meta ... > directive close to the front to tell you what the encoding is supposed to be, and then again, it might not. You should be able to depend on the first character being a <, but you might not be able to. FitNesse, for example, sends FIT a file that consists of the HTML between the <body> and </body> tags, and nothing else. This situation makes character set detection in PyFit, um, interesting. (Fortunately, I have other ways of dealing with FitNesse, but it's still an issue for batch use.)
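
To make the detection problem concrete, here is a minimal sketch of signature sniffing on a raw byte string, such as a body-only HTML fragment. The function name sniff_bom and the "latin-1" fallback are assumptions for illustration only; this is not how PyFit does it. The BOM constants come from Python's standard codecs module.

```python
import codecs

# Known signatures, longest first, so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data, default="latin-1"):
    """Return (encoding, payload), with any leading signature removed.

    `default` is a hypothetical fallback used when no signature is found;
    a real application would more likely look for a <meta> directive or
    an out-of-band hint before guessing.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return default, data


# A fragment with no <html> or <meta>, only a signature to go on.
encoding, body = sniff_bom(codecs.BOM_UTF8 + b"<table></table>")
text = body.decode(encoding)
```

Without a signature the function can only guess, which is exactly the awkward case described above.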

As for the "super-cat": there is actually no problem with putting U+FEFF
in the middle of some document - applications are supposed to filter it
out. The precise processing instructions in the Unicode standard vary
from Unicode version to Unicode version, but essentially, you are
supposed to ignore the BOM if you see it.


It would be useful for "super-cat" to filter all but the first one, however.
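
A rough sketch of what such a filter could look like for UTF-8 input, working on byte strings that have already been read from the files. The name super_cat is hypothetical, and the sketch assumes every input is UTF-8; it only looks at the start of each input and does not scan for signatures elsewhere.

```python
import codecs


def super_cat(chunks):
    """Concatenate UTF-8 byte strings, keeping only the first input's signature.

    Assumes every chunk is UTF-8. A signature at the start of the first
    chunk is left alone; a signature at the start of any later chunk is
    dropped, so it never ends up in the middle of the output.
    """
    out = []
    for index, chunk in enumerate(chunks):
        if index > 0 and chunk.startswith(codecs.BOM_UTF8):
            chunk = chunk[len(codecs.BOM_UTF8):]
        out.append(chunk)
    return b"".join(out)


first = codecs.BOM_UTF8 + b"one\n"
second = codecs.BOM_UTF8 + b"two\n"
combined = super_cat([first, second])
```

Decoding the result with Python's "utf-8-sig" codec then removes the one remaining leading signature. That codec strips only a signature at the very start of the data, which is why the later ones have to be filtered separately.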

John Roth

Regards,
Martin


--
http://mail.python.org/mailman/listinfo/python-list
