MoarVM/Rakudo (un)bustage with unsigned chars

Nicholas Clark Mon, 05 May 2014 13:41:22 -0700

So, I'm trying to improve MoarVM portability from the current "it works on
both kinds of systems"* by getting it to build on ARM, which is another
little-endian system. And it went "boom!" in strange ways.


So, trying to replicate why, I wondered if it's because char on ARM is
unsigned, whereas x86 char is signed. Clearly the code works on x86, and
the x86_64 hardware I have access to is nice and fast, and the debugging
toolchain is more mature, etc.

And lo, if I compile on x86_64 with -funsigned-char it explodes, trying to
compile the setting, with an error about

    Could not locate compile-time value for symbol Whatever

and a screenful of backtrace. (Turns out not to be in the same place as ARM.
More on that in the next message)

After some elimination, I found the cause. We have this in
src/strings/utf8.c:

    if (bytes >= 3 && utf8[0] == 0xEF && utf8[1] == 0xBB && utf8[0xBF]) {
        /* disregard UTF-8 BOM if it's present */
        utf8 += 3; bytes -= 3;
    }

utf8 is declared as const char *utf8.
Meaning that when char is signed, the body of that if is never reached,
because signed chars have values in the range -128 to +127, which is never
equal to 0xEF. However, when char is unsigned, such as when I compile with
-funsigned-char, it is reached, it is executed, specifically during
deserialisation:

    (gdb) where
    #0  MVM_string_utf8_decode () at src/strings/utf8.c:205
    #1  0x00007ffff669d274 in deserialize_strings () at src/core/bytecode.c:284
    #2  0x00007ffff66a79b7 in MVM_bytecode_unpack () at src/core/bytecode.c:749
    #3  0x00007ffff669b51a in MVM_cu_from_bytes () at src/core/compunit.c:17
    #4  0x00007ffff669b7fc in MVM_cu_map_from_file () at src/core/compunit.c:58
    #5  0x00007ffff66c6da9 in MVM_load_bytecode () at src/core/loadbytecode.c:31

only once, for the sequence 0xEF 0xBB 0xBF
(hey, I've just noticed the bug in the buggy code)
which was intended to be a 1 character literal U+FEFF, and instead is getting
replaced with an empty string.

Which, for some reason, completely confuses setting compilation.
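
The signedness effect can be seen in isolation with a small sketch. This is
not MoarVM code: both function names are hypothetical, and the explicit
signed char / unsigned char parameter types stand in for what plain char
means on x86 and on ARM respectively.

```c
/* Hypothetical helpers for illustration only; not part of MoarVM. */

/* What the comparison does where plain char is signed (x86, x86_64).
 * The byte 0xEF is read as -17, is promoted to int, and is then compared
 * with the int constant 239, so the result is always 0. Some compilers
 * warn that this comparison is always false. */
static int first_byte_is_ef_signed(const signed char *p) {
    return p[0] == 0xEF;
}

/* What the same comparison does where plain char is unsigned (ARM).
 * The byte 0xEF is read as 239, so the comparison can succeed. */
static int first_byte_is_ef_unsigned(const unsigned char *p) {
    return p[0] == 0xEF;
}
```

Given a buffer holding the bytes 0xEF 0xBB 0xBF, the first helper returns 0
and the second returns 1, which is the difference between the BOM check
being dead code and being live.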

Anyway, if I change the type of utf8 to be MVMuint8 *, then the bug is
repeatable on x86_64 with the default char (ie signed).

So attached are two patches, one to remove the above buggy code, and the
second to change the prototype of MVM_string_utf8_decode to take MVMuint8 *,
as conceptually UTF-8 sequences are unsigned, not signed.
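
For illustration only (this is not in either patch): if BOM skipping were
wanted somewhere such as file-reading code, a check over unsigned bytes might
look like the sketch below. The helper name utf8_bom_length is hypothetical,
and uint8_t from <stdint.h> is assumed here as a stand-in for MVMuint8. Note
that the third test compares buf[2] with 0xBF, rather than using 0xBF as an
index as the removed code did.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of MoarVM. Returns the number of leading
 * bytes (0 or 3) that form a UTF-8 byte order mark. uint8_t is assumed
 * here as a stand-in for MVMuint8. */
static size_t utf8_bom_length(const uint8_t *buf, size_t bytes) {
    /* All three bytes are compared as unsigned values, so the result does
     * not depend on whether plain char is signed on the platform. */
    if (bytes >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}
```

A caller would advance its pointer and reduce its byte count by the returned
value, and would do so only at the start of a file.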

Nicholas Clark

* Win32/Linux, x86/x86_64, RedHat/Debian, Debian/Ubuntu - take your pick :-)

From 95cdb337bb03827967777e89e679a507adeff030 Mon Sep 17 00:00:00 2001
From: Nicholas Clark <n...@ccl4.org>
Date: Mon, 5 May 2014 19:41:54 +0200
Subject: [PATCH 1/5] Remove BOM-discarding code from MVM_string_utf8_decode()

This code is unreachable on a platform where char is signed (ie x86, x86_64),
because there utf8[0] will be between -128 and 127 inclusive, and hence never
equal to 0xEF. On a platform where char is unsigned, the code is
reached, and it kills the Rakudo build. Specifically, this happens because
it munges strings being deserialised, with the particular problem string being
the character U+FEFF, which is used in pre-compiled NQP code. Deserialising
what should be a 1 char string U+FEFF as a 0 char string causes a very strange
"Could not locate compile-time value for symbol Whatever" error when
compiling the Perl 6 setting. Oh the strange side effects of bugs...

If BOM-nomming is needed, it should be in I/O specific code, and only at the
start of files, not in the general-purpose NFG conversion routines.
---
 src/strings/utf8.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/src/strings/utf8.c b/src/strings/utf8.c
index c70da97..87143f3 100644
--- a/src/strings/utf8.c
+++ b/src/strings/utf8.c
@@ -200,10 +200,6 @@ MVMString * MVM_string_utf8_decode(MVMThreadContext *tc, MVMObject *result_type,
     MVMint32 line;
     MVMint32 col;

-    if (bytes >= 3 && utf8[0] == 0xEF && utf8[1] == 0xBB && utf8[0xBF]) {
-        /* disregard UTF-8 BOM if it's present */
-        utf8 += 3; bytes -= 3;
-    }
     orig_bytes = bytes;
     orig_utf8 = utf8;

-- 
1.8.4.2

From 8a18681c3974b33860b03cbd814f9ad56cc5a381 Mon Sep 17 00:00:00 2001
From: Nicholas Clark <n...@ccl4.org>
Date: Mon, 5 May 2014 18:50:00 +0200
Subject: [PATCH 2/5] MVM_string_utf8_decode() should take a MVMuint8 *, not a
 char *

Conceptually, UTF-8 is sequences of Octets, which are unsigned, so use an
unsigned type.
---
 src/strings/utf8.c | 2 +-
 src/strings/utf8.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/strings/utf8.c b/src/strings/utf8.c
index 87143f3..54669de 100644
--- a/src/strings/utf8.c
+++ b/src/strings/utf8.c
@@ -187,7 +187,7 @@ static void *utf8_encode(void *bytes, MVMCodepoint32 cp)
 /* Decodes the specified number of bytes of utf8 into an NFG string, creating
  * a result of the specified type. The type must have the MVMString REPR.
  * Only bring in the raw codepoints for now. */
-MVMString * MVM_string_utf8_decode(MVMThreadContext *tc, MVMObject *result_type, const char *utf8, size_t bytes) {
+MVMString * MVM_string_utf8_decode(MVMThreadContext *tc, MVMObject *result_type, const MVMuint8 *utf8, size_t bytes) {
     MVMString *result = (MVMString *)REPR(result_type)->allocate(tc, STABLE(result_type));
     MVMint32 count = 0;
     MVMCodepoint32 codepoint;
diff --git a/src/strings/utf8.h b/src/strings/utf8.h
index 511a393..fce37c6 100644
--- a/src/strings/utf8.h
+++ b/src/strings/utf8.h
@@ -1,4 +1,4 @@
-MVM_PUBLIC MVMString * MVM_string_utf8_decode(MVMThreadContext *tc, MVMObject *result_type, const char *utf8, size_t bytes);
+MVM_PUBLIC MVMString * MVM_string_utf8_decode(MVMThreadContext *tc, MVMObject *result_type, const MVMuint8 *utf8, size_t bytes);
 MVM_PUBLIC void MVM_string_utf8_decodestream(MVMThreadContext *tc, MVMDecodeStream *ds, MVMint32 *stopper_chars, MVMint32 *stopper_sep);
 MVM_PUBLIC MVMuint8 * MVM_string_utf8_encode_substr(MVMThreadContext *tc,
         MVMString *str, MVMuint64 *output_size, MVMint64 start, MVMint64 length);
-- 
1.8.4.2
