Hi Andy,

Andy Wingo <wi...@pobox.com> writes:
> What do people think about this attached patch?

I'm strongly opposed to making 'open-input-file' any more clever than it
already is. Furthermore, I strongly believe that it should be much less
clever than it is now.

Our basic textual I/O should be robust by default, and should not
second-guess the specified encoding based on flimsy heuristics that work
99% of the time.

IMO, our default behavior should allow portable Scheme code to write an
arbitrary string of characters to a file in some encoding, and later read
it back, without having to worry about whether the string starts with
something that looks like a BOM, or contains a string that looks like a
coding declaration. The string might be from a network, and thus
potentially from a malicious source.

Frankly, I consider this to be a potential source of security flaws in
software built using Guile, and on that basis would advocate removing the
existing cleverness from 'open-input-file' in stable-2.0. At the very
least it should be removed from master.

Regarding byte-order marks, my preference is that users should explicitly
consume BOMs if that's what they want (ideally using some convenience
procedure provided by Guile). Sometimes consuming the BOM is the wrong
thing. For example, if the user is copying a file to another file, or to
a socket, it may be important to preserve the BOM.

If others feel strongly that BOMs should be consumed by default, then the
following compromise is about as far as I'd (reluctantly) consider going:

* 'open-input-file' could perhaps auto-consume a BOM at the beginning of
  the stream, but *only* if the BOM is already in the encoding specified
  by the user (possibly via an explicit call to 'file-encoding'). For
  example, if the specified port encoding is UTF-8, then EF BB BF would
  be consumed, but FE FF or FF FE would be left alone.

* BOMs absolutely should *not* be used to determine the encoding unless
  the user has explicitly asked for coding auto-detection.

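To make the compromise above concrete, here is a rough sketch of the rule.
It is written in Python rather than Guile, purely for illustration: the
names 'consume_matching_bom' and 'BOM_FOR_ENCODING' are hypothetical and
are not existing Guile API, and the table covers only three encodings.

```python
import codecs
import io

# Hypothetical table: the bytes that encode U+FEFF in each encoding.
BOM_FOR_ENCODING = {
    "utf-8": codecs.BOM_UTF8,         # EF BB BF
    "utf-16be": codecs.BOM_UTF16_BE,  # FE FF
    "utf-16le": codecs.BOM_UTF16_LE,  # FF FE
}


def consume_matching_bom(stream, encoding):
    """Skip a leading BOM only if it is the BOM of `encoding`.

    `stream` is assumed to be a seekable binary stream. Returns True if
    a BOM was consumed. The caller's encoding is never changed: a BOM
    belonging to some other encoding is left in the stream untouched.
    """
    bom = BOM_FOR_ENCODING.get(encoding.lower())
    if bom is None:
        return False  # no known BOM for this encoding; do nothing
    start = stream.tell()
    if stream.read(len(bom)) == bom:
        return True
    stream.seek(start)  # not our BOM: rewind so no bytes are lost
    return False
```

With the port encoding given as UTF-8, a stream starting with EF BB BF
has those three bytes skipped, while one starting with FE FF is read back
byte for byte. In neither case does the BOM influence the encoding.
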
Having said all this, if 'open-input-file' is changed to no longer call
'scm_i_scan_for_file_encoding', then I think it's a fine idea to add BOMs
to its list of heuristics, though I tend to agree with Mike that a coding
declaration should take precedence, for the reasons he described.
However, I strongly believe that 'scm_i_scan_for_file_encoding' is the
wrong place to consume BOMs.

What do you think?

Mark