Hello all, I researched this some more, and discovered that removal of byte-order marks (BOMs) is the responsibility of iconv, which discards BOMs from the beginning of streams when using the UTF-16 or UTF-32 encodings, but *not* for UTF-16LE, UTF-16GE, UTF-32LE, UTF-32GE or any other encoding. It uses the BOM to determine the endianness of the stream, but other than that does *not* use it to guess the encoding, so there's no guesswork involved. (Side note: iconv also inserts a BOM automatically when writing a stream using UTF-16 or UTF-32).
Note that the discarding of BOMs does *not* happen when opening a port, but instead when reading from a port for the first time (or after calling 'set-port-encoding!', since that creates a fresh iconv descriptor). This is important for a few reasons: It solves the nasty complication of (open-file NAME "r+"), it works for any kind of stream (not just files), and it does not block waiting for input immediately after opening a port. So thanks to iconv, we get UTF-{16,32} BOM removal for free. Unfortunately we have a nasty bug in 'get_iconv_codepoint' that leads to a buffer overrun and assertion failure when 'iconv' discards a BOM. It can be triggered as follows: (use-modules (rnrs io ports)) (let ((port (open-bytevector-input-port #vu8(255 254 97 0)))) (set-port-encoding! port "UTF-16") (read-char port)) The first patch below fixes this problem. I ended up almost completely rewriting that function, partly because it was largely structured around a mistaken assumption that iconv will never consume input without producing output, and partly because it was quite inefficient (several unnecessary conditional branches in the loop) and IMO was rather difficult to read. So what about UTF-8? Since we handle UTF-8 internally, we have to handle UTF-8 BOM removal ourselves. The second patch implements this with the same semantics as iconv does for UTF-{16,32}. It was a bit tricky because we need to keep track of whether we're at the beginning of the stream, and we cannot add any more fields to scm_t_port in stable-2.0. I handled this by adding some more special values for the iconv descriptor pointers. As a bonus, we no longer call 'scm_i_set_port_encoding_x' once per character when using UTF-8 :) The third patch removes the byte-order mark check from scm_i_scan_for_encoding, following Andy's suggestion. Comments and suggestions solicited. Mark
>From ceccaf59267bc98c86aae33809905f26b017ebc8 Mon Sep 17 00:00:00 2001 From: Mark H Weaver <m...@netris.org> Date: Wed, 30 Jan 2013 10:16:37 -0500 Subject: [PATCH 1/3] Rewrite get_iconv_codepoint to fix a bug involving byte-order marks. * libguile/ports.c (get_iconv_codepoint): Rewrite to fix a bug and improve efficiency and clarity. Previously, it incorrectly assumed that iconv would never consume input without producing output, which led to a buffer overrun and subsequent assertion failure. This happens when a byte-order mark is consumed by iconv at the beginning of the stream when using the UTF-16 or UTF-32 encodings. * test-suite/tests/ports.test (unicode byte-order marks (BOMs)): Add tests. --- libguile/ports.c | 95 ++++++++++++++++++++++-------------------- test-suite/tests/ports.test | 96 ++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 145 insertions(+), 46 deletions(-) diff --git a/libguile/ports.c b/libguile/ports.c index 55808e2..a1d6c96 100644 --- a/libguile/ports.c +++ b/libguile/ports.c @@ -1,5 +1,5 @@ -/* Copyright (C) 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2003, 2004, - * 2006, 2007, 2008, 2009, 2010, 2011, 2012 Free Software Foundation, Inc. +/* Copyright (C) 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2003, 2004, 2006, + * 2007, 2008, 2009, 2010, 2011, 2012, 2013 Free Software Foundation, Inc. * * This library is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public License @@ -1292,66 +1292,73 @@ static int get_iconv_codepoint (SCM port, scm_t_wchar *codepoint, char buf[SCM_MBCHAR_BUF_SIZE], size_t *len) { - scm_t_port *pt; - int err, byte_read; - size_t bytes_consumed, output_size; - char *output; + scm_t_port *pt = SCM_PTAB_ENTRY (port); scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE]; + size_t input_size = 0; - pt = SCM_PTAB_ENTRY (port); - - for (output_size = 0, output = (char *) utf8_buf, - bytes_consumed = 0, err = 0; - err == 0 && output_size == 0 - && (bytes_consumed == 0 || byte_read != EOF); - bytes_consumed++) + for (;;) { - char *input; + int byte_read; + char *input, *output; size_t input_left, output_left, done; byte_read = scm_get_byte_or_eof (port); - if (byte_read == EOF) + if (SCM_UNLIKELY (byte_read == EOF)) { - if (bytes_consumed == 0) - { - *codepoint = (scm_t_wchar) EOF; - *len = 0; - return 0; - } - else - continue; + if (SCM_LIKELY (input_size == 0)) + { + *codepoint = (scm_t_wchar) EOF; + *len = input_size; + return 0; + } + else + /* EOF found in the middle of a multibyte character. */ + return EILSEQ; } - buf[bytes_consumed] = byte_read; + buf[input_size++] = byte_read; input = buf; - input_left = bytes_consumed + 1; + input_left = input_size; + output = (char *) utf8_buf; output_left = sizeof (utf8_buf); - done = iconv (pt->input_cd, &input, &input_left, - &output, &output_left); + done = iconv (pt->input_cd, &input, &input_left, &output, &output_left); + if (done == (size_t) -1) { - err = errno; - if (err == EINVAL) - /* Missing input: keep trying. */ - err = 0; + int err = errno; + if (SCM_LIKELY (err == EINVAL)) + /* The input byte sequence did not form a complete + character. Read another byte and try again. */ + continue; + else + return err; } else - output_size = sizeof (utf8_buf) - output_left; - } - - if (SCM_UNLIKELY (output_size == 0)) - /* An unterminated sequence. */ - err = EILSEQ; - else if (SCM_LIKELY (err == 0)) - { - /* Convert the UTF8_BUF sequence to a Unicode code point. */ - *codepoint = utf8_to_codepoint (utf8_buf, output_size); - *len = bytes_consumed; + { + size_t output_size = sizeof (utf8_buf) - output_left; + if (SCM_LIKELY (output_size > 0)) + { + /* iconv generated output. Convert the UTF8_BUF sequence + to a Unicode code point. */ + *codepoint = utf8_to_codepoint (utf8_buf, output_size); + *len = input_size; + return 0; + } + else + { + /* iconv consumed some bytes without producing any output. + Most likely this means that a Unicode byte-order mark + (BOM) was consumed, which should not be included in the + returned buf. Shift any remaining bytes to the beginning + of buf, and continue the loop. */ + memcpy (buf, input, input_left); + input_size = input_left; + continue; + } + } } - - return err; } /* Read a codepoint from PORT and return it in *CODEPOINT. Fill BUF diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test index 613d269..38044c6 100644 --- a/test-suite/tests/ports.test +++ b/test-suite/tests/ports.test @@ -2,7 +2,7 @@ ;;;; Jim Blandy <j...@red-bean.com> --- May 1999 ;;;; ;;;; Copyright (C) 1999, 2001, 2004, 2006, 2007, 2009, 2010, -;;;; 2011, 2012 Free Software Foundation, Inc. +;;;; 2011, 2012, 2013 Free Software Foundation, Inc. ;;;; ;;;; This library is free software; you can redistribute it and/or ;;;; modify it under the terms of the GNU Lesser General Public @@ -24,7 +24,7 @@ #:use-module (ice-9 popen) #:use-module (ice-9 rdelim) #:use-module (rnrs bytevectors) - #:use-module ((rnrs io ports) #:select (open-bytevector-input-port))) + #:use-module ((ice-9 binary-ports) #:select (open-bytevector-input-port))) (define (display-line . args) (for-each display args) @@ -1149,6 +1149,98 @@ +(define (bv-read-test encoding bv) + (let ((port (open-bytevector-input-port bv))) + (set-port-encoding! port encoding) + (read-string port))) + +(with-test-prefix "unicode byte-order marks (BOMs)" + + (pass-if-equal "BOM not discarded from Latin-1 stream" + "\xEF\xBB\xBF\x61" + (bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61))) + + (pass-if-equal "BOM not discarded from Latin-2 stream" + "\u010F\u0165\u017C\x61" + (bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61))) + + (pass-if-equal "BOM not discarded from UTF-16BE stream" + "\uFEFF\x61" + (bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61))) + + (pass-if-equal "BOM not discarded from UTF-16LE stream" + "\uFEFF\x61" + (bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00))) + + (pass-if-equal "BOM not discarded from UTF-32BE stream" + "\uFEFF\x61" + (bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF + #x00 #x00 #x00 #x61))) + + (pass-if-equal "BOM not discarded from UTF-32LE stream" + "\uFEFF\x61" + (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00 + #x61 #x00 #x00 #x00))) + + (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)" + "a" + (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61))) + + (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)" + "\uFEFFa" + (bv-read-test "UTF-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61))) + + (pass-if-equal "BOM not discarded unless at start of UTF-16 stream" + "a\uFEFFb" + (let ((be (bv-read-test "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62))) + (le (bv-read-test "UTF-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00)))) + (if (char=? #\a (string-ref be 1)) + be + le))) + + (pass-if-equal "BOM discarded from start of UTF-16 stream (LE)" + "a" + (bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00))) + + (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)" + "\uFEFFa" + (bv-read-test "UTF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00))) + + (pass-if-equal "BOM discarded from start of UTF-32 stream (BE)" + "a" + (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF #x00 #x00 #x00 #x61))) + + (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)" + "\uFEFFa" + (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF + #x00 #x00 #xFE #xFF + #x00 #x00 #x00 #x61))) + + (pass-if-equal "BOM not discarded unless at start of UTF-32 stream" + "a\uFEFFb" + (let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61 + #x00 #x00 #xFE #xFF + #x00 #x00 #x00 #x62))) + (le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00 + #xFF #xFE #x00 #x00 + #x62 #x00 #x00 #x00)))) + (if (char=? #\a (string-ref be 1)) + be + le))) + + (pass-if-equal "BOM discarded from start of UTF-32 stream (LE)" + "a" + (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00 + #x61 #x00 #x00 #x00))) + + (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)" + "\uFEFFa" + (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00 + #xFF #xFE #x00 #x00 + #x61 #x00 #x00 #x00)))) + + + (define-syntax-rule (with-load-path path body ...) (let ((new path) (old %load-path)) -- 1.7.10.4
>From 65e0cca752e005d75c8eade1c92f084a8518f209 Mon Sep 17 00:00:00 2001 From: Mark H Weaver <m...@netris.org> Date: Wed, 30 Jan 2013 12:21:20 -0500 Subject: [PATCH 2/3] Discard UTF-8 byte-order marks at stream start, and improve efficiency. * libguile/ports.c (SCM_ICONV_UNINITIALIZED, SCM_ICONV_UTF8_AT_START, SCM_ICONV_UTF8_NOT_AT_START, SCM_ICONV_SPECIAL_P, SCM_UNICODE_BOM): New macros. (finalize_port): Use new macros, and black hole the iconv descriptor pointers after closing them. (scm_new_port_table_entry, scm_i_remove_port, get_codepoint): Use new macros. (get_utf8_codepoint): Discard unicode byte-order marks at stream start. After reading a code point, record the fact that we are no longer at the stream start. (scm_i_set_port_encoding_x): Use new macros. Eliminate extraneous test. * test-suite/tests/ports.test (unicode byte-order marks (BOMs)): Add tests. --- libguile/ports.c | 79 ++++++++++++++++++++++++++++++------------- test-suite/tests/ports.test | 12 +++++++ 2 files changed, 67 insertions(+), 24 deletions(-) diff --git a/libguile/ports.c b/libguile/ports.c index a1d6c96..438a15b 100644 --- a/libguile/ports.c +++ b/libguile/ports.c @@ -89,6 +89,16 @@ #define HAVE_FTRUNCATE 1 #endif +/* Special values for the input_cd and output_cd fields of the + scm_t_port structure. FIXME: Use a nicer representation for Guile + 2.1, when we can add more fields to the scm_t_port structure. */ +#define SCM_ICONV_UNINITIALIZED ((iconv_t) -1) +#define SCM_ICONV_UTF8_AT_START ((iconv_t) -3) +#define SCM_ICONV_UTF8_NOT_AT_START ((iconv_t) -4) +#define SCM_ICONV_SPECIAL_P(cd) ((cd) >= (iconv_t) -4) + +#define SCM_UNICODE_BOM 0xFEFF /* Unicode byte-order mark */ + /* The port kind table --- a dynamically resized array of port types. */ @@ -585,10 +595,13 @@ finalize_port (void *ptr, void *data) entry = SCM_PTAB_ENTRY (port); - if (entry->input_cd != (iconv_t) -1) + if (!SCM_ICONV_SPECIAL_P (entry->input_cd)) iconv_close (entry->input_cd); - if (entry->output_cd != (iconv_t) -1) + entry->input_cd = SCM_ICONV_UNINITIALIZED; + + if (!SCM_ICONV_SPECIAL_P (entry->output_cd)) iconv_close (entry->output_cd); + entry->output_cd = SCM_ICONV_UNINITIALIZED; SCM_SETSTREAM (port, 0); SCM_CLR_PORT_OPEN_FLAG (port); @@ -626,8 +639,8 @@ scm_new_port_table_entry (scm_t_bits tag) entry->encoding = enc ? scm_gc_strdup (enc, "port") : NULL; /* The conversion descriptors will be opened lazily. */ - entry->input_cd = (iconv_t) -1; - entry->output_cd = (iconv_t) -1; + entry->input_cd = SCM_ICONV_UNINITIALIZED; + entry->output_cd = SCM_ICONV_UNINITIALIZED; entry->ilseq_handler = scm_i_default_port_conversion_handler (); @@ -682,17 +695,13 @@ scm_i_remove_port (SCM port) p->putback_buf = NULL; p->putback_buf_size = 0; - if (p->input_cd != (iconv_t) -1) - { - iconv_close (p->input_cd); - p->input_cd = (iconv_t) -1; - } - - if (p->output_cd != (iconv_t) -1) - { - iconv_close (p->output_cd); - p->output_cd = (iconv_t) -1; - } + if (!SCM_ICONV_SPECIAL_P (p->input_cd)) + iconv_close (p->input_cd); + p->input_cd = SCM_ICONV_UNINITIALIZED; + + if (!SCM_ICONV_SPECIAL_P (p->output_cd)) + iconv_close (p->output_cd); + p->output_cd = SCM_ICONV_UNINITIALIZED; SCM_SETPTAB_ENTRY (port, 0); @@ -1202,6 +1211,8 @@ get_utf8_codepoint (SCM port, scm_t_wchar *codepoint, } else if ((buf[0] & 0xf0) == 0xe0) { + scm_t_wchar code_pt; + /* 3-byte form. */ byte = scm_peek_byte_or_eof (port); ASSERT_NOT_EOF (byte); @@ -1225,9 +1236,21 @@ get_utf8_codepoint (SCM port, scm_t_wchar *codepoint, buf[2] = (scm_t_uint8) byte; *len = 3; - *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL + code_pt = ((scm_t_wchar) buf[0] & 0x0f) << 12UL | ((scm_t_wchar) buf[1] & 0x3f) << 6UL | (buf[2] & 0x3f); + + if (SCM_UNLIKELY (code_pt == SCM_UNICODE_BOM + && pt->input_cd == SCM_ICONV_UTF8_AT_START)) + { + /* Discard a Unicode byte-order mark (BOM) at the beginning of + the stream. Record the fact that we are no longer at the + beginning, and read another code point. */ + pt->input_cd = SCM_ICONV_UTF8_NOT_AT_START; + return get_utf8_codepoint (port, codepoint, buf, len); + } + + *codepoint = code_pt; } else if (buf[0] >= 0xf0 && buf[0] <= 0xf4) { @@ -1272,6 +1295,9 @@ get_utf8_codepoint (SCM port, scm_t_wchar *codepoint, else goto invalid_seq; + /* Record the fact that we are no longer at the beginning of the + stream, so that future byte-order marks will not be discarded. */ + pt->input_cd = SCM_ICONV_UTF8_NOT_AT_START; return 0; invalid_seq: @@ -1372,12 +1398,13 @@ get_codepoint (SCM port, scm_t_wchar *codepoint, int err; scm_t_port *pt = SCM_PTAB_ENTRY (port); - if (pt->input_cd == (iconv_t) -1) + if (pt->input_cd == SCM_ICONV_UNINITIALIZED) /* Initialize the conversion descriptors, if needed. */ scm_i_set_port_encoding_x (port, pt->encoding); - /* FIXME: In 2.1, add a flag to determine whether a port is UTF-8. */ - if (pt->input_cd == (iconv_t) -1) + /* NOTE: The following test assumes that the only special values + (other than SCM_ICONV_UNINITIALIZED) are for UTF-8. */ + if (SCM_ICONV_SPECIAL_P (pt->input_cd)) err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len); else err = get_iconv_codepoint (port, codepoint, buf, len); @@ -2228,7 +2255,12 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding) /* If ENCODING is UTF-8, then no conversion descriptor is opened because we do I/O ourselves. This saves 100+ KiB for each descriptor. */ - if (strcmp (encoding, "UTF-8")) + if (strcmp (encoding, "UTF-8") == 0) + { + new_input_cd = SCM_ICONV_UTF8_AT_START; + new_output_cd = SCM_ICONV_UNINITIALIZED; + } + else { if (SCM_CELL_WORD_0 (port) & SCM_RDNG) { @@ -2247,16 +2279,15 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding) new_output_cd = iconv_open (encoding, "UTF-8"); if (new_output_cd == (iconv_t) -1) { - if (new_input_cd != (iconv_t) -1) - iconv_close (new_input_cd); + iconv_close (new_input_cd); goto invalid_encoding; } } } - if (pt->input_cd != (iconv_t) -1) + if (!SCM_ICONV_SPECIAL_P (pt->input_cd)) iconv_close (pt->input_cd); - if (pt->output_cd != (iconv_t) -1) + if (!SCM_ICONV_SPECIAL_P (pt->output_cd)) iconv_close (pt->output_cd); pt->input_cd = new_input_cd; diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test index 38044c6..65c3b3f 100644 --- a/test-suite/tests/ports.test +++ b/test-suite/tests/ports.test @@ -1182,6 +1182,18 @@ (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00 #x61 #x00 #x00 #x00))) + (pass-if-equal "BOM discarded from start of UTF-8 stream" + "a" + (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #x61))) + + (pass-if-equal "Only one BOM discarded from start of UTF-8 stream" + "\uFEFFa" + (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61))) + + (pass-if-equal "BOM not discarded unless at start of UTF-8 stream" + "a\uFEFFb" + (bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62))) + (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)" "a" (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61))) -- 1.7.10.4
>From 1e4dde890b0a9a80b26f78d5c82b8ffac9e47689 Mon Sep 17 00:00:00 2001 From: Mark H Weaver <m...@netris.org> Date: Wed, 30 Jan 2013 14:22:00 -0500 Subject: [PATCH 3/3] Remove byte-order mark check from scm_i_scan_for_encoding. * libguile/read.c (scm_i_scan_for_encoding): Remove byte-order mark check. --- libguile/read.c | 14 ++------------ 1 file changed, 2 insertions(+), 12 deletions(-) diff --git a/libguile/read.c b/libguile/read.c index 222891b..75cf2cc 100644 --- a/libguile/read.c +++ b/libguile/read.c @@ -1,5 +1,5 @@ -/* Copyright (C) 1995, 1996, 1997, 1999, 2000, 2001, 2003, 2004, 2006, - * 2007, 2008, 2009, 2010, 2011, 2012 Free Software Foundation, Inc. +/* Copyright (C) 1995, 1996, 1997, 1999, 2000, 2001, 2003, 2004, 2006, 2007, + * 2008, 2009, 2010, 2011, 2012, 2013 Free Software Foundation, Inc. * * This library is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public License @@ -1985,7 +1985,6 @@ scm_i_scan_for_encoding (SCM port) char header[SCM_ENCODING_SEARCH_SIZE+1]; size_t bytes_read, encoding_length, i; char *encoding = NULL; - int utf8_bom = 0; char *pos, *encoding_start; int in_comment; @@ -2030,10 +2029,6 @@ scm_i_scan_for_encoding (SCM port) scm_seek (port, scm_from_int (0), scm_from_int (SEEK_SET)); } - if (bytes_read > 3 - && header[0] == '\xef' && header[1] == '\xbb' && header[2] == '\xbf') - utf8_bom = 1; - /* search past "coding[:=]" */ pos = header; while (1) @@ -2102,11 +2097,6 @@ scm_i_scan_for_encoding (SCM port) /* This wasn't in a comment */ return NULL; - if (utf8_bom && strcmp(encoding, "UTF-8")) - scm_misc_error (NULL, - "the port input declares the encoding ~s but is encoded as UTF-8", - scm_list_1 (scm_from_locale_string (encoding))); - return encoding; } -- 1.7.10.4