Hi Mark,

Let me work the other way around, starting at the problem rather than at a potential solution.
There is a file:

  https://cvs.khronos.org/svn/repos/ogl/trunk/ecosystem/public/sdk/docs/man3/glBlendEquationSeparate.xml

It is valid XML. It also has a UTF-8 BOM. It fails to parse in SSAX.

The reason is that the XML standard specifies that if an XML file starts with an <?xml ...?> declaration, the `<' must be the very first character. It also recommends, in a non-normative section, that the BOM (if any) together with the encoding attribute of the XML declaration be used to detect the character encoding (http://www.w3.org/TR/REC-xml/#sec-guessing).

This says to me that, for the purposes of XML, the BOM is actually outside the text of the document. And indeed this makes sense in a way: given that the Windows world seems to always prepend these marks to its text documents, the BOM is a kind of "container".

Anyway, it is someone's responsibility to consume the BOM. So I was thinking: whose? It *can't* be xml->sxml, because it receives a port, and the BOM should only be interpreted in files from disk, not in data from e.g. sockets. So it's the responsibility of something outside the XML code. Whose?

scm_i_scan_for_file_encoding is only called when opening files from disk, in textual mode (without the "b" / O_BINARY flag). It seemed a safe place. I agree it's a bit hacky, but we are talking about the BOM here.

On Tue 29 Jan 2013 09:22, Mark H Weaver <m...@netris.org> writes:

> IMO, our default behavior should allow portable scheme code to write an
> arbitrary string of characters to a file in some encoding, and later
> read it back, without having to worry about whether the string starts
> with something that looks like a BOM, or contains a string that looks
> like a coding declaration.

I agree FWIW.

> Frankly, I consider this to be a potential source of security flaws in
> software built using Guile, and on that basis would advocate removing
> the existing cleverness from 'open-input-file' in stable-2.0.  At the
> very least it should be removed from master.
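To make the spec's non-normative heuristic concrete, here is a minimal sketch in Python (purely illustrative; Guile's actual code is C, and the function name here is made up) of BOM-based detection: the BOM, when present, identifies the encoding and is treated as lying outside the document text, so that after stripping it the `<' of "<?xml" is again the first character.

```python
# Illustrative sketch of the XML "Autodetection of Character Encodings"
# heuristic: a leading BOM identifies the encoding and is not part of
# the document's text.  (UTF-32 BOMs omitted for brevity.)

BOMS = [
    (b'\xef\xbb\xbf', 'utf-8'),
    (b'\xff\xfe', 'utf-16-le'),
    (b'\xfe\xff', 'utf-16-be'),
]

def detect_and_strip_bom(data):
    """Return (encoding-or-None, bytes with any leading BOM removed)."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data

# A UTF-8 BOM followed by an XML declaration, as in the Khronos file:
raw = b'\xef\xbb\xbf<?xml version="1.0" encoding="UTF-8"?><doc/>'
enc, text = detect_and_strip_bom(raw)
# enc is 'utf-8'; text now starts with b'<?xml', satisfying the rule
# that `<' be the first character of the document.
```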
I agree as well.  Want to make a patch?

> Regarding byte-order marks, my preference is that users should explicitly
> consume BOMs if that's what they want (ideally using some convenience
> procedure provided by Guile).  Sometimes consuming the BOM is the wrong
> thing.  For example, if the user is copying a file to another file, or
> to a socket, it may be important to preserve the BOM.

If you are copying a binary file, you should use binary APIs.  Otherwise you can misinterpret the characters, and potentially write them out in a different encoding.  Also, without O_BINARY on Windows, you will end up munging the line endings.  So from a portability perspective, reading a file as characters already implies munging the text.  If you are copying a textual file, you need to know how to decode it, and in that case a BOM can be helpful.  I do not feel strongly about this point, however.

> If others feel strongly that BOMs should be consumed by default, then
> the following compromise is about as far as I'd (reluctantly) consider
> going:
>
> * 'open-input-file' could perhaps auto-consume a BOM at the beginning of
>   the stream, but *only* if the BOM is already in the encoding specified
>   by the user (possibly via an explicit call to 'file-encoding').

The problem is that we have no way of knowing what file encoding the user has specified.  The encoding could come from the environment, or from some fluid that some other piece of code binds.  We are really missing an encoding argument to open-file.

> * BOMs absolutely should *not* be used to determine the encoding unless
>   the user has explicitly asked for coding auto-detection.

OK.

> Having said all this, if 'open-input-file' is changed to no longer call
> 'scm_i_scan_for_file_encoding', then I think it's a fine idea to add
> BOMs to its list of heuristics, though I tend to agree with Mike that a
> coding declaration should take precedence, for the reasons he described.

OK.
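For what it's worth, the compromise above is easy to state in code.  Here is a sketch in Python (not Guile API; the function name and in-memory "file" are hypothetical) of consuming a leading BOM only when it matches the encoding the caller explicitly specified, and leaving the bytes untouched otherwise:

```python
# Sketch of the proposed compromise: auto-consume a BOM only if it is
# already in the encoding specified by the user.
import codecs
import io

_BOM_FOR = {
    'utf-8': codecs.BOM_UTF8,
    'utf-16-le': codecs.BOM_UTF16_LE,
    'utf-16-be': codecs.BOM_UTF16_BE,
}

def open_input_bytes(data, encoding):
    """Open an in-memory textual "port" over DATA, skipping a leading
    BOM only if it matches the caller-specified ENCODING."""
    bom = _BOM_FOR.get(encoding.lower())
    if bom and data.startswith(bom):
        data = data[len(bom):]
    return io.TextIOWrapper(io.BytesIO(data), encoding=encoding)

# With encoding='utf-8', a UTF-8 BOM is silently consumed:
port = open_input_bytes(codecs.BOM_UTF8 + b'<doc/>', 'utf-8')
# port.read() → '<doc/>'

# With encoding='latin-1', the same bytes are NOT a BOM in the
# declared encoding, so they are preserved as ordinary characters.
```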
Incidentally, we should relax the scan-for-encoding requirement that the coding declaration appear in a comment, as we will begin compiling JavaScript, Lua, etc. files in the future.  That would perhaps allow XML encodings to be automatically detected as well.

> What do you think?

I liked that my solution "just worked" with a small amount of code and no changes to the rest of the application.  I can't help but think that requiring the user to put in more code is going to infect an endless set of call sites with little "helper" procedures that aren't going to be more correct in aggregate.

Andy
-- 
http://wingolog.org/