Hi Mark,

Let me work the other way around, starting at the problem rather than at a potential solution.
There is a file:

  https://cvs.khronos.org/svn/repos/ogl/trunk/ecosystem/public/sdk/docs/man3/glBlendEquationSeparate.xml

It is valid XML. It also has a UTF-8 BOM. It fails to parse in SSAX.

The reason is that the XML standard specifies that if an XML file starts with an <?xml ...?> declaration, the `<' must be the very first character. It also recommends, in a non-normative section, that the BOM (if any) together with the encoding attribute of the XML declaration be used to detect the character encoding (http://www.w3.org/TR/REC-xml/#sec-guessing).

This says to me that, for the purposes of XML, the BOM is actually outside the text of the document. And indeed this makes sense in a way: given that the Windows world seems to always prepend these marks to its text documents, the BOM is a kind of "container".

Anyway, it is someone's responsibility to consume the BOM. So I was thinking: whose? It *can't* be xml->sxml, because it receives a port, and the BOM should only be interpreted in files from disk, not in data from e.g. sockets. So it's the responsibility of something outside the XML code. Whose?

scm_i_scan_for_file_encoding is only called when opening files from disk, in textual mode (without the "b" / O_BINARY flag). It seemed a safe place. I agree it's a bit hacky, but we are talking about the BOM here.

On Tue 29 Jan 2013 09:22, Mark H Weaver <m...@netris.org> writes:

> IMO, our default behavior should allow portable scheme code to write an
> arbitrary string of characters to a file in some encoding, and later
> read it back, without having to worry about whether the string starts
> with something that looks like a BOM, or contains a string that looks
> like a coding declaration.

I agree FWIW.

> Frankly, I consider this to be a potential source of security flaws in
> software built using Guile, and on that basis would advocate removing
> the existing cleverness from 'open-input-file' in stable-2.0.  At the
> very least it should be removed from master.
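To make the spec's non-normative heuristic concrete, here is a minimal sketch in Python (purely illustrative; Guile's actual code is C, and the function name here is made up) of BOM-based detection: the BOM, when present, identifies the encoding and is treated as lying outside the document text, so that after stripping it the `<' of "<?xml" is again the first character.

```python
# Illustrative sketch of the XML "Autodetection of Character Encodings"
# heuristic: a leading BOM identifies the encoding and is not part of
# the document's text.  (UTF-32 BOMs omitted for brevity.)

BOMS = [
    (b'\xef\xbb\xbf', 'utf-8'),
    (b'\xff\xfe', 'utf-16-le'),
    (b'\xfe\xff', 'utf-16-be'),
]

def detect_and_strip_bom(data):
    """Return (encoding-or-None, bytes with any leading BOM removed)."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data

# A UTF-8 BOM followed by an XML declaration, as in the Khronos file:
raw = b'\xef\xbb\xbf<?xml version="1.0" encoding="UTF-8"?><doc/>'
enc, text = detect_and_strip_bom(raw)
# enc is 'utf-8'; text now starts with b'<?xml', satisfying the rule
# that `<' be the first character of the document.
```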
I agree as well.  Want to make a patch?

> Regarding byte-order marks, my preference is that users should explicitly
> consume BOMs if that's what they want (ideally using some convenience
> procedure provided by Guile).  Sometimes consuming the BOM is the wrong
> thing.  For example, if the user is copying a file to another file, or
> to a socket, it may be important to preserve the BOM.

If you are copying a binary file, you should use binary APIs.  Otherwise you can misinterpret the characters, and potentially write them out in a different encoding.  Also, without O_BINARY on Windows, you will end up munging the line endings.  So from a portability perspective, reading a file as characters already implies munging the text.  If you are copying a textual file, you need to know how to decode it, and in that case a BOM can be helpful.  I do not feel strongly about this point, however.

> If others feel strongly that BOMs should be consumed by default, then
> the following compromise is about as far as I'd (reluctantly) consider
> going:
>
> * 'open-input-file' could perhaps auto-consume a BOM at the beginning of
>   the stream, but *only* if the BOM is already in the encoding specified
>   by the user (possibly via an explicit call to 'file-encoding').

The problem is that we have no way of knowing what file encoding the user has specified.  The encoding could come from the environment, or from some fluid that some other piece of code binds.  We are really missing an encoding argument to open-file.

> * BOMs absolutely should *not* be used to determine the encoding unless
>   the user has explicitly asked for coding auto-detection.

OK.

> Having said all this, if 'open-input-file' is changed to no longer call
> 'scm_i_scan_for_file_encoding', then I think it's a fine idea to add
> BOMs to its list of heuristics, though I tend to agree with Mike that a
> coding declaration should take precedence, for the reasons he described.

OK.
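For what it's worth, the compromise above is easy to state in code.  Here is a sketch in Python (not Guile API; the function name and in-memory "file" are hypothetical) of consuming a leading BOM only when it matches the encoding the caller explicitly specified, and leaving the bytes untouched otherwise:

```python
# Sketch of the proposed compromise: auto-consume a BOM only if it is
# already in the encoding specified by the user.
import codecs
import io

_BOM_FOR = {
    'utf-8': codecs.BOM_UTF8,
    'utf-16-le': codecs.BOM_UTF16_LE,
    'utf-16-be': codecs.BOM_UTF16_BE,
}

def open_input_bytes(data, encoding):
    """Open an in-memory textual "port" over DATA, skipping a leading
    BOM only if it matches the caller-specified ENCODING."""
    bom = _BOM_FOR.get(encoding.lower())
    if bom and data.startswith(bom):
        data = data[len(bom):]
    return io.TextIOWrapper(io.BytesIO(data), encoding=encoding)

# With encoding='utf-8', a UTF-8 BOM is silently consumed:
port = open_input_bytes(codecs.BOM_UTF8 + b'<doc/>', 'utf-8')
# port.read() → '<doc/>'

# With encoding='latin-1', the same bytes are NOT a BOM in the
# declared encoding, so they are preserved as ordinary characters.
```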
Incidentally, we should relax the scan-for-encoding requirement that the coding declaration appear in a comment, as we will begin compiling JavaScript, Lua, etc. files in the future.  That would perhaps allow XML encodings to be automatically detected as well.

> What do you think?

I liked that my solution "just worked" with a small amount of code and no changes to the rest of the application.  I can't help but think that requiring the user to put in more code is going to infect an endless set of call sites with little "helper" procedures that aren't going to be more correct in aggregate.

Andy
-- 
http://wingolog.org/