Hi! Mark H Weaver <m...@netris.org> skribis:
> I researched this some more, and discovered that removal of byte-order > marks (BOMs) is the responsibility of iconv, which discards BOMs from > the beginning of streams when using the UTF-16 or UTF-32 encodings, but > *not* for UTF-16LE, UTF-16GE, UTF-32LE, UTF-32GE or any other encoding. > It uses the BOM to determine the endianness of the stream, but other > than that does *not* use it to guess the encoding, so there's no > guesswork involved. (Side note: iconv also inserts a BOM automatically > when writing a stream using UTF-16 or UTF-32). Are you talking about GNU iconv or iconv as specified by POSIX? I can’t see any occurrence of “BOM” at <http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html>. > So thanks to iconv, we get UTF-{16,32} BOM removal for free. > Unfortunately we have a nasty bug in 'get_iconv_codepoint' that leads to > a buffer overrun and assertion failure when 'iconv' discards a BOM. Good catch! > The first patch below fixes this problem. I ended up almost completely > rewriting that function, partly because it was largely structured around > a mistaken assumption that iconv will never consume input without > producing output, and partly because it was quite inefficient (several > unnecessary conditional branches in the loop) and IMO was rather > difficult to read. Great. (I think ‘iconv’ semantics lead to tricky code, no matter what.) > get_iconv_codepoint (SCM port, scm_t_wchar *codepoint, > char buf[SCM_MBCHAR_BUF_SIZE], size_t *len) > { [...] > - for (output_size = 0, output = (char *) utf8_buf, > - bytes_consumed = 0, err = 0; > - err == 0 && output_size == 0 > - && (bytes_consumed == 0 || byte_read != EOF); > - bytes_consumed++) > + for (;;) Clarity is in the eye of the beholder, but to me this is a step backwards. [...] > + /* NOTE: The following test assumes that the only special values > + (other than SCM_ICONV_UNINITIALIZED) are for UTF-8. */ > + if (SCM_ICONV_SPECIAL_P (pt->input_cd)) Probably an indication that a more descriptive name is needed, as Andy noted. > @@ -2247,16 +2279,15 @@ scm_i_set_port_encoding_x (SCM port, const char > *encoding) > new_output_cd = iconv_open (encoding, "UTF-8"); > if (new_output_cd == (iconv_t) -1) Should be SCM_ICONV_UNINITIALIZED? Thanks again for the research and fixes! Ludo’.