[PATCHES] Discard BOMs at stream start for UTF-{8,16,32} encodings

Mark H Weaver Wed, 30 Jan 2013 20:41:12 -0800

Hello all,

I researched this some more, and discovered that removal of byte-order
marks (BOMs) is the responsibility of iconv, which discards BOMs from
the beginning of streams when using the UTF-16 or UTF-32 encodings, but
*not* for UTF-16LE, UTF-16GE, UTF-32LE, UTF-32GE or any other encoding.
It uses the BOM to determine the endianness of the stream, but other
than that does *not* use it to guess the encoding, so there's no
guesswork involved.  (Side note: iconv also inserts a BOM automatically
when writing a stream using UTF-16 or UTF-32).


Note that the discarding of BOMs does *not* happen when opening a port,
but instead when reading from a port for the first time (or after
calling 'set-port-encoding!', since that creates a fresh iconv
descriptor).  This is important for a few reasons: It solves the nasty
complication of (open-file NAME "r+"), it works for any kind of stream
(not just files), and it does not block waiting for input immediately
after opening a port.

So thanks to iconv, we get UTF-{16,32} BOM removal for free.
Unfortunately we have a nasty bug in 'get_iconv_codepoint' that leads to
a buffer overrun and assertion failure when 'iconv' discards a BOM.  It
can be triggered as follows:

  (use-modules (rnrs io ports))
  (let ((port (open-bytevector-input-port #vu8(255 254 97 0))))
    (set-port-encoding! port "UTF-16")
    (read-char port))

The first patch below fixes this problem.  I ended up almost completely
rewriting that function, partly because it was largely structured around
a mistaken assumption that iconv will never consume input without
producing output, and partly because it was quite inefficient (several
unnecessary conditional branches in the loop) and IMO was rather
difficult to read.

So what about UTF-8?  Since we handle UTF-8 internally, we have to
handle UTF-8 BOM removal ourselves.  The second patch implements this
with the same semantics as iconv does for UTF-{16,32}.  It was a bit
tricky because we need to keep track of whether we're at the beginning
of the stream, and we cannot add any more fields to scm_t_port in
stable-2.0.  I handled this by adding some more special values for the
iconv descriptor pointers.  As a bonus, we no longer call
'scm_i_set_port_encoding_x' once per character when using UTF-8 :)

The third patch removes the byte-order mark check from
scm_i_scan_for_encoding, following Andy's suggestion.

Comments and suggestions solicited.

      Mark

>From ceccaf59267bc98c86aae33809905f26b017ebc8 Mon Sep 17 00:00:00 2001
From: Mark H Weaver <m...@netris.org>
Date: Wed, 30 Jan 2013 10:16:37 -0500
Subject: [PATCH 1/3] Rewrite get_iconv_codepoint to fix a bug involving
 byte-order marks.

* libguile/ports.c (get_iconv_codepoint): Rewrite to fix a bug and
  improve efficiency and clarity.  Previously, it incorrectly assumed
  that iconv would never consume input without producing output, which
  led to a buffer overrun and subsequent assertion failure.  This
  happens when a byte-order mark is consumed by iconv at the beginning
  of the stream when using the UTF-16 or UTF-32 encodings.

* test-suite/tests/ports.test (unicode byte-order marks (BOMs)):
  Add tests.
---
 libguile/ports.c            |   95 ++++++++++++++++++++++--------------------
 test-suite/tests/ports.test |   96 ++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 145 insertions(+), 46 deletions(-)

diff --git a/libguile/ports.c b/libguile/ports.c
index 55808e2..a1d6c96 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -1,5 +1,5 @@
-/* Copyright (C) 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2003, 2004,
- *   2006, 2007, 2008, 2009, 2010, 2011, 2012 Free Software Foundation, Inc.
+/* Copyright (C) 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2003, 2004, 2006,
+ *   2007, 2008, 2009, 2010, 2011, 2012, 2013 Free Software Foundation, Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public License
@@ -1292,66 +1292,73 @@ static int
 get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
 		     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
 {
-  scm_t_port *pt;
-  int err, byte_read;
-  size_t bytes_consumed, output_size;
-  char *output;
+  scm_t_port *pt = SCM_PTAB_ENTRY (port);
   scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
+  size_t input_size = 0;
 
-  pt = SCM_PTAB_ENTRY (port);
-
-  for (output_size = 0, output = (char *) utf8_buf,
-	 bytes_consumed = 0, err = 0;
-       err == 0 && output_size == 0
-	 && (bytes_consumed == 0 || byte_read != EOF);
-       bytes_consumed++)
+  for (;;)
     {
-      char *input;
+      int byte_read;
+      char *input, *output;
       size_t input_left, output_left, done;
 
       byte_read = scm_get_byte_or_eof (port);
-      if (byte_read == EOF)
+      if (SCM_UNLIKELY (byte_read == EOF))
 	{
-	  if (bytes_consumed == 0)
-	    {
-	      *codepoint = (scm_t_wchar) EOF;
-	      *len = 0;
-	      return 0;
-	    }
-	  else
-	    continue;
+          if (SCM_LIKELY (input_size == 0))
+            {
+              *codepoint = (scm_t_wchar) EOF;
+              *len = input_size;
+              return 0;
+            }
+          else
+            /* EOF found in the middle of a multibyte character. */
+            return EILSEQ;
 	}
 
-      buf[bytes_consumed] = byte_read;
+      buf[input_size++] = byte_read;
 
       input = buf;
-      input_left = bytes_consumed + 1;
+      input_left = input_size;
+      output = (char *) utf8_buf;
       output_left = sizeof (utf8_buf);
 
-      done = iconv (pt->input_cd, &input, &input_left,
-		    &output, &output_left);
+      done = iconv (pt->input_cd, &input, &input_left, &output, &output_left);
+
       if (done == (size_t) -1)
 	{
-	  err = errno;
-	  if (err == EINVAL)
-	    /* Missing input: keep trying.  */
-	    err = 0;
+	  int err = errno;
+	  if (SCM_LIKELY (err == EINVAL))
+            /* The input byte sequence did not form a complete
+               character.  Read another byte and try again. */
+            continue;
+          else
+            return err;
 	}
       else
-	output_size = sizeof (utf8_buf) - output_left;
-    }
-
-  if (SCM_UNLIKELY (output_size == 0))
-    /* An unterminated sequence.  */
-    err = EILSEQ;
-  else if (SCM_LIKELY (err == 0))
-    {
-      /* Convert the UTF8_BUF sequence to a Unicode code point.  */
-      *codepoint = utf8_to_codepoint (utf8_buf, output_size);
-      *len = bytes_consumed;
+        {
+          size_t output_size = sizeof (utf8_buf) - output_left;
+          if (SCM_LIKELY (output_size > 0))
+            {
+              /* iconv generated output.  Convert the UTF8_BUF sequence
+              to a Unicode code point.  */
+              *codepoint = utf8_to_codepoint (utf8_buf, output_size);
+              *len = input_size;
+              return 0;
+            }
+          else
+            {
+              /* iconv consumed some bytes without producing any output.
+                 Most likely this means that a Unicode byte-order mark
+                 (BOM) was consumed, which should not be included in the
+                 returned buf.  Shift any remaining bytes to the beginning
+                 of buf, and continue the loop. */
+              memcpy (buf, input, input_left);
+              input_size = input_left;
+              continue;
+            }
+        }
     }
-
-  return err;
 }
 
 /* Read a codepoint from PORT and return it in *CODEPOINT.  Fill BUF
diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test
index 613d269..38044c6 100644
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@@ -2,7 +2,7 @@
 ;;;; Jim Blandy <j...@red-bean.com> --- May 1999
 ;;;;
 ;;;; 	Copyright (C) 1999, 2001, 2004, 2006, 2007, 2009, 2010,
-;;;;      2011, 2012 Free Software Foundation, Inc.
+;;;;      2011, 2012, 2013 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -24,7 +24,7 @@
   #:use-module (ice-9 popen)
   #:use-module (ice-9 rdelim)
   #:use-module (rnrs bytevectors)
-  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port)))
+  #:use-module ((ice-9 binary-ports) #:select (open-bytevector-input-port)))
 
 (define (display-line . args)
   (for-each display args)
@@ -1149,6 +1149,98 @@
 
 
 
+(define (bv-read-test encoding bv)
+  (let ((port (open-bytevector-input-port bv)))
+    (set-port-encoding! port encoding)
+    (read-string port)))
+
+(with-test-prefix "unicode byte-order marks (BOMs)"
+
+  (pass-if-equal "BOM not discarded from Latin-1 stream"
+      "\xEF\xBB\xBF\x61"
+    (bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from Latin-2 stream"
+      "\u010F\u0165\u017C\x61"
+    (bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-32BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF
+                                  #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00
+                                  #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-16 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)))
+          (le (bv-read-test "UTF-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00))))
+      (if (char=? #\a (string-ref be 1))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                #x00 #x00 #xFE #xFF
+                                #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-32 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                          #x00 #x00 #xFE #xFF
+                                          #x00 #x00 #x00 #x62)))
+          (le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00
+                                          #xFF #xFE #x00 #x00
+                                          #x62 #x00 #x00 #x00))))
+      (if (char=? #\a (string-ref be 1))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                #xFF #xFE #x00 #x00
+                                #x61 #x00 #x00 #x00))))
+
+
+
 (define-syntax-rule (with-load-path path body ...)
   (let ((new path)
         (old %load-path))
-- 
1.7.10.4

>From 65e0cca752e005d75c8eade1c92f084a8518f209 Mon Sep 17 00:00:00 2001
From: Mark H Weaver <m...@netris.org>
Date: Wed, 30 Jan 2013 12:21:20 -0500
Subject: [PATCH 2/3] Discard UTF-8 byte-order marks at stream start, and
 improve efficiency.

* libguile/ports.c (SCM_ICONV_UNINITIALIZED, SCM_ICONV_UTF8_AT_START,
  SCM_ICONV_UTF8_NOT_AT_START, SCM_ICONV_SPECIAL_P, SCM_UNICODE_BOM):
  New macros.

  (finalize_port): Use new macros, and black hole the iconv descriptor
  pointers after closing them.

  (scm_new_port_table_entry, scm_i_remove_port, get_codepoint): Use new
  macros.

  (get_utf8_codepoint): Discard unicode byte-order marks at stream
  start.  After reading a code point, record the fact that we are no
  longer at the stream start.

  (scm_i_set_port_encoding_x): Use new macros.  Eliminate extraneous
  test.

* test-suite/tests/ports.test (unicode byte-order marks (BOMs)):
  Add tests.
---
 libguile/ports.c            |   79 ++++++++++++++++++++++++++++++-------------
 test-suite/tests/ports.test |   12 +++++++
 2 files changed, 67 insertions(+), 24 deletions(-)

diff --git a/libguile/ports.c b/libguile/ports.c
index a1d6c96..438a15b 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -89,6 +89,16 @@
 #define HAVE_FTRUNCATE 1
 #endif
 
+/* Special values for the input_cd and output_cd fields of the
+   scm_t_port structure.  FIXME: Use a nicer representation for Guile
+   2.1, when we can add more fields to the scm_t_port structure. */
+#define SCM_ICONV_UNINITIALIZED     ((iconv_t) -1)
+#define SCM_ICONV_UTF8_AT_START     ((iconv_t) -3)
+#define SCM_ICONV_UTF8_NOT_AT_START ((iconv_t) -4)
+#define SCM_ICONV_SPECIAL_P(cd)     ((cd) >= (iconv_t) -4)
+
+#define SCM_UNICODE_BOM  0xFEFF  /* Unicode byte-order mark */
+
 
 /* The port kind table --- a dynamically resized array of port types.  */
 
@@ -585,10 +595,13 @@ finalize_port (void *ptr, void *data)
 
 	  entry = SCM_PTAB_ENTRY (port);
 
-	  if (entry->input_cd != (iconv_t) -1)
+	  if (!SCM_ICONV_SPECIAL_P (entry->input_cd))
 	    iconv_close (entry->input_cd);
-	  if (entry->output_cd != (iconv_t) -1)
+          entry->input_cd = SCM_ICONV_UNINITIALIZED;
+
+	  if (!SCM_ICONV_SPECIAL_P (entry->output_cd))
 	    iconv_close (entry->output_cd);
+          entry->output_cd = SCM_ICONV_UNINITIALIZED;
 
 	  SCM_SETSTREAM (port, 0);
 	  SCM_CLR_PORT_OPEN_FLAG (port);
@@ -626,8 +639,8 @@ scm_new_port_table_entry (scm_t_bits tag)
   entry->encoding = enc ? scm_gc_strdup (enc, "port") : NULL;
 
   /* The conversion descriptors will be opened lazily.  */
-  entry->input_cd = (iconv_t) -1;
-  entry->output_cd = (iconv_t) -1;
+  entry->input_cd = SCM_ICONV_UNINITIALIZED;
+  entry->output_cd = SCM_ICONV_UNINITIALIZED;
 
   entry->ilseq_handler = scm_i_default_port_conversion_handler ();
 
@@ -682,17 +695,13 @@ scm_i_remove_port (SCM port)
   p->putback_buf = NULL;
   p->putback_buf_size = 0;
 
-  if (p->input_cd != (iconv_t) -1)
-    {
-      iconv_close (p->input_cd);
-      p->input_cd = (iconv_t) -1;
-    }
-  
-  if (p->output_cd != (iconv_t) -1)
-    {
-      iconv_close (p->output_cd);
-      p->output_cd = (iconv_t) -1;
-    }
+  if (!SCM_ICONV_SPECIAL_P (p->input_cd))
+    iconv_close (p->input_cd);
+  p->input_cd = SCM_ICONV_UNINITIALIZED;
+
+  if (!SCM_ICONV_SPECIAL_P (p->output_cd))
+    iconv_close (p->output_cd);
+  p->output_cd = SCM_ICONV_UNINITIALIZED;
 
   SCM_SETPTAB_ENTRY (port, 0);
 
@@ -1202,6 +1211,8 @@ get_utf8_codepoint (SCM port, scm_t_wchar *codepoint,
     }
   else if ((buf[0] & 0xf0) == 0xe0)
     {
+      scm_t_wchar code_pt;
+
       /* 3-byte form.  */
       byte = scm_peek_byte_or_eof (port);
       ASSERT_NOT_EOF (byte);
@@ -1225,9 +1236,21 @@ get_utf8_codepoint (SCM port, scm_t_wchar *codepoint,
       buf[2] = (scm_t_uint8) byte;
       *len = 3;
 
-      *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
+      code_pt = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
 	| ((scm_t_wchar) buf[1] & 0x3f) << 6UL
 	| (buf[2] & 0x3f);
+
+      if (SCM_UNLIKELY (code_pt == SCM_UNICODE_BOM
+                        && pt->input_cd == SCM_ICONV_UTF8_AT_START))
+        {
+          /* Discard a Unicode byte-order mark (BOM) at the beginning of
+             the stream.  Record the fact that we are no longer at the
+             beginning, and read another code point. */
+          pt->input_cd = SCM_ICONV_UTF8_NOT_AT_START;
+          return get_utf8_codepoint (port, codepoint, buf, len);
+        }
+
+      *codepoint = code_pt;
     }
   else if (buf[0] >= 0xf0 && buf[0] <= 0xf4)
     {
@@ -1272,6 +1295,9 @@ get_utf8_codepoint (SCM port, scm_t_wchar *codepoint,
   else
     goto invalid_seq;
 
+  /* Record the fact that we are no longer at the beginning of the
+     stream, so that future byte-order marks will not be discarded. */
+  pt->input_cd = SCM_ICONV_UTF8_NOT_AT_START;
   return 0;
 
  invalid_seq:
@@ -1372,12 +1398,13 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
   int err;
   scm_t_port *pt = SCM_PTAB_ENTRY (port);
 
-  if (pt->input_cd == (iconv_t) -1)
+  if (pt->input_cd == SCM_ICONV_UNINITIALIZED)
     /* Initialize the conversion descriptors, if needed.  */
     scm_i_set_port_encoding_x (port, pt->encoding);
 
-  /* FIXME: In 2.1, add a flag to determine whether a port is UTF-8.  */
-  if (pt->input_cd == (iconv_t) -1)
+  /* NOTE: The following test assumes that the only special values
+     (other than SCM_ICONV_UNINITIALIZED) are for UTF-8. */
+  if (SCM_ICONV_SPECIAL_P (pt->input_cd))
     err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
   else
     err = get_iconv_codepoint (port, codepoint, buf, len);
@@ -2228,7 +2255,12 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
   /* If ENCODING is UTF-8, then no conversion descriptor is opened
      because we do I/O ourselves.  This saves 100+ KiB for each
      descriptor.  */
-  if (strcmp (encoding, "UTF-8"))
+  if (strcmp (encoding, "UTF-8") == 0)
+    {
+      new_input_cd = SCM_ICONV_UTF8_AT_START;
+      new_output_cd = SCM_ICONV_UNINITIALIZED;
+    }
+  else
     {
       if (SCM_CELL_WORD_0 (port) & SCM_RDNG)
 	{
@@ -2247,16 +2279,15 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
 	  new_output_cd = iconv_open (encoding, "UTF-8");
 	  if (new_output_cd == (iconv_t) -1)
 	    {
-	      if (new_input_cd != (iconv_t) -1)
-		iconv_close (new_input_cd);
+              iconv_close (new_input_cd);
 	      goto invalid_encoding;
 	    }
 	}
     }
 
-  if (pt->input_cd != (iconv_t) -1)
+  if (!SCM_ICONV_SPECIAL_P (pt->input_cd))
     iconv_close (pt->input_cd);
-  if (pt->output_cd != (iconv_t) -1)
+  if (!SCM_ICONV_SPECIAL_P (pt->output_cd))
     iconv_close (pt->output_cd);
 
   pt->input_cd = new_input_cd;
diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test
index 38044c6..65c3b3f 100644
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@@ -1182,6 +1182,18 @@
     (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00
                                   #x61 #x00 #x00 #x00)))
 
+  (pass-if-equal "BOM discarded from start of UTF-8 stream"
+      "a"
+    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-8 stream"
+      "\uFEFFa"
+    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-8 stream"
+      "a\uFEFFb"
+    (bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)))
+
   (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)"
       "a"
     (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61)))
-- 
1.7.10.4

>From 1e4dde890b0a9a80b26f78d5c82b8ffac9e47689 Mon Sep 17 00:00:00 2001
From: Mark H Weaver <m...@netris.org>
Date: Wed, 30 Jan 2013 14:22:00 -0500
Subject: [PATCH 3/3] Remove byte-order mark check from
 scm_i_scan_for_encoding.

* libguile/read.c (scm_i_scan_for_encoding): Remove byte-order mark
  check.
---
 libguile/read.c |   14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/libguile/read.c b/libguile/read.c
index 222891b..75cf2cc 100644
--- a/libguile/read.c
+++ b/libguile/read.c
@@ -1,5 +1,5 @@
-/* Copyright (C) 1995, 1996, 1997, 1999, 2000, 2001, 2003, 2004, 2006,
- *   2007, 2008, 2009, 2010, 2011, 2012 Free Software Foundation, Inc.
+/* Copyright (C) 1995, 1996, 1997, 1999, 2000, 2001, 2003, 2004, 2006, 2007,
+ *   2008, 2009, 2010, 2011, 2012, 2013 Free Software Foundation, Inc.
  * 
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public License
@@ -1985,7 +1985,6 @@ scm_i_scan_for_encoding (SCM port)
   char header[SCM_ENCODING_SEARCH_SIZE+1];
   size_t bytes_read, encoding_length, i;
   char *encoding = NULL;
-  int utf8_bom = 0;
   char *pos, *encoding_start;
   int in_comment;
 
@@ -2030,10 +2029,6 @@ scm_i_scan_for_encoding (SCM port)
       scm_seek (port, scm_from_int (0), scm_from_int (SEEK_SET));
     }
 
-  if (bytes_read > 3 
-      && header[0] == '\xef' && header[1] == '\xbb' && header[2] == '\xbf')
-    utf8_bom = 1;
-
   /* search past "coding[:=]" */
   pos = header;
   while (1)
@@ -2102,11 +2097,6 @@ scm_i_scan_for_encoding (SCM port)
     /* This wasn't in a comment */
     return NULL;
 
-  if (utf8_bom && strcmp(encoding, "UTF-8"))
-    scm_misc_error (NULL,
-		    "the port input declares the encoding ~s but is encoded as UTF-8",
-		    scm_list_1 (scm_from_locale_string (encoding)));
-
   return encoding;
 }
 
-- 
1.7.10.4

[PATCHES] Discard BOMs at stream start for UTF-{8,16,32} encodings

Reply via email to