Re: byte-order marks

Ludovic Courtès Tue, 29 Jan 2013 05:27:38 -0800

Andy Wingo <wi...@pobox.com> skribis:

[...]


>> Regarding byte-order marks, my preference is that users should explicitly
>> consume BOMs if that's what they want (ideally using some convenience
>> procedure provided by Guile).  Sometimes consuming the BOM is the wrong
>> thing.  For example, if the user is copying a file to another file, or
>> to a socket, it may be important to preserve the BOM.
>
> If you are copying a binary file, you should use binary APIs.  Otherwise
> you can misinterpret the characters, and potentially write them as a
> different encoding.
>
> Also otherwise, without O_BINARY on Windows, you will end up munging
> line-ends.  So from a portable perspective, reading a file as
> characters already implies munging the text.

Agreed.  Reading textual data implies interpretation of its byte
structure, and the BOM is just part of that meta-data.
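
To make the binary/textual distinction concrete, here is a minimal sketch
in Python rather than Guile. The sample bytes are made up for illustration,
and the codec and newline behaviour shown is Python's, not a statement
about what Guile's ports do.

```python
import codecs
import io

# Sample file contents: UTF-8 BOM, then text with a Windows line ending.
raw = codecs.BOM_UTF8 + b"hello\r\n"

# Binary copy: bytes pass through untouched, so BOM and CRLF both survive.
binary_copy = io.BytesIO(raw).read()

# Textual read: "utf-8-sig" drops a leading BOM, and newline=None
# translates "\r\n" to "\n" while decoding.
text = io.TextIOWrapper(
    io.BytesIO(raw), encoding="utf-8-sig", newline=None
).read()
```

A program that must preserve the BOM would use the binary path; the textual
path has already interpreted both the BOM and the line ending.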

>> If others feel strongly that BOMs should be consumed by default, then
>> the following compromise is about as far as I'd (reluctantly) consider
>> going:
>>
>> * 'open-input-file' could perhaps auto-consume a BOM at the beginning of
>>   the stream, but *only* if the BOM is already in the encoding specified
>>   by the user (possibly via an explicit call to 'file-encoding').
>
> The problem is that we have no way of knowing what file encoding the
> user specifies.  The encoding could come from the environment, or from
> some fluid that some other piece of code binds.  We are really missing
> an encoding argument to open-file.

Well, ‘%default-port-encoding’ is really an argument to ‘open-file’,
though admittedly not a convenient one.  However, there’s no way to open
a file in binary mode when using ‘open-input-file’,
‘call-with-input-file’, etc.

>> Having said all this, if 'open-input-file' is changed to no longer call
>> 'scm_i_scan_for_file_encoding', then I think it's a fine idea to add
>> BOMs to its list of heuristics, though I tend to agree with Mike that a
>> coding declaration should take precedence, for the reasons he described.
>
> OK.  Incidentally we should relax the scan-for-encoding requirement that
> the coding be in a comment, as we will begin compiling javascript, lua,
> etc files in the future.

OTOH, that would make it more likely that the “coding:” sequence is
misinterpreted as a coding declaration in contexts that have nothing to
do with that.
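
The risk can be sketched as follows, in Python rather than Guile.
`scan_for_coding`, its parameters, and the regular expression are
hypothetical and much simpler than whatever 'scm_i_scan_for_file_encoding'
actually does; the comment check is deliberately naive (a ";" inside a
string literal would fool it).

```python
import re

# Hypothetical pattern: the literal "coding:" followed by an encoding name.
CODING_RE = re.compile(r"coding:\s*([-\w.]+)")

def scan_for_coding(header, require_comment=True, comment_prefix=";"):
    """Return the first declared encoding name in header, or None."""
    for line in header.splitlines():
        if require_comment:
            pos = line.find(comment_prefix)
            if pos < 0:
                continue          # no comment on this line, skip it
            line = line[pos:]     # only look inside the comment part
        match = CODING_RE.search(line)
        if match:
            return match.group(1)
    return None

declared = ";; -*- coding: utf-8 -*-\n(display \"hi\")\n"
unrelated = "(display \"coding: latin-1 is old\")\n"
```

With the comment requirement, `scan_for_coding(unrelated)` finds nothing;
with `require_comment=False` the string literal is taken as a declaration
of "latin-1".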

> I liked that my solution "just worked" with a small amount of code and
> no changes to the rest of the application.  I can't help but think that
> requiring the user to put in more code is going to infect an endless set
> of call sites with little "helper" procedures that aren't going to be
> more correct in aggregate.

For textual files, it doesn’t seem unreasonable for ‘open-input-file’ to
consume the BOM, IMO.  It’s not much different from the ‘eol-style’
transcoders.
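
One possible shape for this, combined with the "only if the BOM matches
the specified encoding" rule quoted earlier, sketched in Python rather than
Guile. `read_text`, `BOMS`, and the `consume_bom` flag are hypothetical
names for illustration, not a proposed Guile API.

```python
import codecs

# BOM byte sequences for a few encodings (constants from Python's codecs).
BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}

def read_text(data, encoding, consume_bom=True):
    """Decode data, dropping a leading BOM only if it matches encoding."""
    bom = BOMS.get(encoding.lower())
    if consume_bom and bom is not None and data.startswith(bom):
        data = data[len(bom):]
    return data.decode(encoding)

sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
```

Here `read_text(sample, "utf-16-le")` yields the text without the BOM, while
passing `consume_bom=False` leaves U+FEFF as the first character for callers
that want to see it.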

Ludo’.
