Andy Wingo <wi...@pobox.com> wrote:

[...]
>> Regarding byte-order marks, my preference is that users should explicitly
>> consume BOMs if that's what they want (ideally using some convenience
>> procedure provided by Guile). Sometimes consuming the BOM is the wrong
>> thing. For example, if the user is copying a file to another file, or
>> to a socket, it may be important to preserve the BOM.
>
> If you are copying a binary file, you should use binary APIs. Otherwise
> you can misinterpret the characters, and potentially write them as a
> different encoding.
>
> Also otherwise, without O_BINARY on Windows, you will end up munging
> line-ends. So from a portable perspective, reading a file as
> characters already implies munging the text.

Agreed. Reading textual data implies interpretation of its byte
structure, and the BOM is just part of that meta-data.

>> If others feel strongly that BOMs should be consumed by default, then
>> the following compromise is about as far as I'd (reluctantly) consider
>> going:
>>
>> * 'open-input-file' could perhaps auto-consume a BOM at the beginning of
>>   the stream, but *only* if the BOM is already in the encoding specified
>>   by the user (possibly via an explicit call to 'file-encoding').
>
> The problem is that we have no way of knowing what file encoding the
> user specifies. The encoding could come from the environment, or from
> some fluid that some other piece of code binds. We are really missing
> an encoding argument to open-file.

Well, ‘%default-port-encoding’ is really an argument to ‘open-file’,
though admittedly not a convenient one.

However, there’s no way to open a file in binary mode when using
‘open-input-file’, ‘call-with-input-file’, etc.

>> Having said all this, if 'open-input-file' is changed to no longer call
>> 'scm_i_scan_for_file_encoding', then I think it's a fine idea to add
>> BOMs to its list of heuristics, though I tend to agree with Mike that a
>> coding declaration should take precedence, for the reasons he described.
>
> OK. Incidentally we should relax the scan-for-encoding requirement that
> the coding be in a comment, as we will begin compiling javascript, lua,
> etc files in the future.

OTOH, that would make it more likely that the “coding:” sequence is
misinterpreted as a coding declaration in contexts that have nothing to
do with that.

> I liked that my solution "just worked" with a small amount of code and
> no changes to the rest of the application. I can't help but think that
> requiring the user to put in more code is going to infect an endless set
> of call sites with little "helper" procedures that aren't going to be
> more correct in aggregate.

For textual files, it doesn’t seem unreasonable for ‘open-input-file’ to
consume the BOM, IMO. It’s not much different from the ‘eol-style’
transcoders.

Ludo’.