""Martin v. LÃwis"" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
Francis Girard wrote: Well, no, text files can't be concatenated! Sooner or later, someone will use "cat" on the text files your application generated. That will be a lot of fun for the new Unicode-aware "super-cat".
Well, no. For example, Python source code is not typically concatenated, nor is source code in any other language. The same holds for XML files: concatenating two XML documents (using cat) gives an ill-formed document - whether the files start with a UTF-8 signature or not.
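A minimal sketch of the XML point, using the standard library's xml.etree.ElementTree. The helper name is_well_formed and the sample document are made up for illustration; they are not from the thread.

```python
import xml.etree.ElementTree as ET

def is_well_formed(data):
    """Return True if data parses as a single XML document."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True

BOM = b"\xef\xbb\xbf"  # UTF-8 signature
doc = b"<?xml version='1.0' encoding='utf-8'?><root/>"  # hypothetical sample

plain = doc + doc               # what "cat a.xml b.xml" would produce
signed = BOM + doc + BOM + doc  # the same, with a signature on each file

print(is_well_formed(doc))
print(is_well_formed(plain))
print(is_well_formed(signed))
```

Under these assumptions the single document parses and both concatenations are rejected, so the signature is not what breaks "cat" here.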
And if you're talking HTML and XML, the situation is even worse, since
the application absolutely needs to be aware of the signature. HTML might
have a <meta ... > directive close to the front to tell you what the encoding
is supposed to be, and then again, it might not. You should be able to depend
on the first character being a <, but you might not be able to. FitNesse, for
example, sends FIT a file that consists of the HTML between the <body>
and </body> tags, and nothing else. This situation makes character set
detection in PyFit, um, interesting. (Fortunately, I have other ways of
dealing with FitNesse, but it's still an issue for batch use.)
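For what it's worth, here is a rough sketch of signature-based detection using the BOM constants in the standard codecs module. It is not how PyFit does it; the function name sniff_encoding and the latin-1 fallback are assumptions chosen for illustration, and a fragment with no signature and no <meta> directive still leaves you guessing.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32-LE signature
# begins with the same two bytes as the UTF-16-LE signature.
_SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_encoding(data, default="latin-1"):
    """Guess a codec name from a leading signature, else return default.

    The default is an arbitrary fallback, not a detected value.
    """
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name
    return default

# Usage: the returned codecs consume the signature while decoding.
raw = codecs.BOM_UTF8 + b"<table><tr><td>fit.ColumnFixture</td></tr></table>"
text = raw.decode(sniff_encoding(raw))
```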
As for the "super-cat": there is actually no problem with putting U+FEFF in the middle of some document - applications are supposed to filter it out. The precise processing instructions in the Unicode standard vary from Unicode version to Unicode version, but essentially, you are supposed to ignore the BOM if you see it.
It would be useful for "super-cat" to filter all but the first one, however.
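Something along these lines would do it for UTF-8 input. This is only a sketch: super_cat is a hypothetical function, it works on byte strings already read into memory, and it assumes every input is UTF-8.

```python
import codecs

def super_cat(chunks):
    """Concatenate UTF-8 byte strings, keeping at most the first signature."""
    bom = codecs.BOM_UTF8  # b'\xef\xbb\xbf'
    out = []
    for index, data in enumerate(chunks):
        # Drop the signature from every chunk except the first.
        if index > 0 and data.startswith(bom):
            data = data[len(bom):]
        out.append(data)
    return b"".join(out)

# Usage: two signed inputs yield one signature at the front.
joined = super_cat([codecs.BOM_UTF8 + b"one\n", codecs.BOM_UTF8 + b"two\n"])
```

If the first input carries no signature, the result has none either, since later signatures are dropped rather than moved to the front.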
John Roth
Regards, Martin
-- http://mail.python.org/mailman/listinfo/python-list