Re: byte-order marks

Mark H Weaver Tue, 29 Jan 2013 00:22:50 -0800

Hi Andy,

Andy Wingo <[email protected]> writes:
> What do people think about this attached patch?


I'm strongly opposed to making 'open-input-file' any more clever than it
already is.  Furthermore, I strongly believe that it should be much less
clever than it is now.  Our basic textual I/O should be robust by
default, and should not second-guess the specified encoding based on
flimsy heuristics that work 99% of the time.

IMO, our default behavior should allow portable scheme code to write an
arbitrary string of characters to a file in some encoding, and later
read it back, without having to worry about whether the string starts
with something that looks like a BOM, or contains a string that looks
like a coding declaration.  The string might be from a network, and thus
potentially from a malicious source.

Frankly, I consider this to be a potential source of security flaws in
software built using Guile, and on that basis would advocate removing
the existing cleverness from 'open-input-file' in stable-2.0.  At the
very least it should be removed from master.

Regarding byte-order marks, my preference is that users should explicitly
consume BOMs if that's what they want (ideally using some convenience
procedure provided by Guile).  Sometimes consuming the BOM is the wrong
thing.  For example, if the user is copying a file to another file, or
to a socket, it may be important to preserve the BOM.

If others feel strongly that BOMs should be consumed by default, then
the following compromise is about as far as I'd (reluctantly) consider
going:

* 'open-input-file' could perhaps auto-consume a BOM at the beginning of
  the stream, but *only* if the BOM is already in the encoding specified
  by the user (possibly via an explicit call to 'file-encoding').  For
  example, if the specified port encoding is UTF-8, then EF BB BF would
  be consumed, but FE FF or FF FE would be left alone.

* BOMs absolutely should *not* be used to determine the encoding unless
  the user has explicitly asked for coding auto-detection.
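
To make the first point concrete, here is a rough sketch of the check I
have in mind.  It is not taken from the patch or from Guile's sources:
'matching_bom_length' is a hypothetical helper, it handles only UTF-8,
and it assumes the encoding name is compared exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Guile's API.  Return the number of
   leading bytes of BUF that should be skipped as a BOM, given the
   encoding the user specified for the port.  A BOM that belongs to any
   other encoding is left alone, so the result is 0 in that case.  */
size_t
matching_bom_length (const unsigned char *buf, size_t len,
                     const char *encoding)
{
  static const unsigned char utf8_bom[] = { 0xEF, 0xBB, 0xBF };

  assert (encoding != NULL);
  assert (buf != NULL || len == 0);

  /* Only the UTF-8 case is sketched here (an assumption made for
     brevity).  The encoding is never guessed from the bytes.  */
  if (strcmp (encoding, "UTF-8") == 0
      && len >= sizeof utf8_bom
      && memcmp (buf, utf8_bom, sizeof utf8_bom) == 0)
    return sizeof utf8_bom;

  return 0;
}
```

With "UTF-8" specified, a buffer starting with EF BB BF yields 3, while
one starting with FE FF or FF FE yields 0 and is passed through
untouched.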

Having said all this, if 'open-input-file' is changed to no longer call
'scm_i_scan_for_file_encoding', then I think it's a fine idea to add
BOMs to that function's list of heuristics, though I tend to agree with
Mike that a coding declaration should take precedence, for the reasons
he described.

However, I strongly believe that 'scm_i_scan_for_file_encoding' is the
wrong place to consume BOMs.

What do you think?

      Mark
