So, I'm trying to improve MoarVM portability from the current "it works on both kinds of systems"* by getting it to build on ARM, which is another little-endian system. And it went "boom!" in strange ways.
So, trying to replicate why, I wondered if it's because char on ARM is unsigned, whereas x86 char is signed. Clearly the code works on x86, and the x86_64 hardware I have access to is nice and fast, and the debugging toolchain is more mature, etc.

And lo, if I compile on x86_64 with -funsigned-char it explodes, trying to compile the setting, with an error about

    Could not locate compile-time value for symbol Whatever

and a screenful of backtrace. (Turns out not to be in the same place as ARM. More on that in the next message.)

After some elimination, I found the cause. We have this in src/strings/utf8.c:

    if (bytes >= 3 && utf8[0] == 0xEF && utf8[1] == 0xBB && utf8[0xBF]) {
        /* disregard UTF-8 BOM if it's present */
        utf8 += 3; bytes -= 3;
    }

utf8 is declared as const char *utf8

Meaning that when char is signed, that code is never reached, because signed chars have values in the range -128 to +127, which is never equal to 0xEF. However, when char is unsigned, such as when I compile with -funsigned-char, it is reached, it is executed, specifically during deserialisation:

    (gdb) where
    #0 MVM_string_utf8_decode () at src/strings/utf8.c:205
    #1 0x00007ffff669d274 in deserialize_strings () at src/core/bytecode.c:284
    #2 0x00007ffff66a79b7 in MVM_bytecode_unpack () at src/core/bytecode.c:749
    #3 0x00007ffff669b51a in MVM_cu_from_bytes () at src/core/compunit.c:17
    #4 0x00007ffff669b7fc in MVM_cu_map_from_file () at src/core/compunit.c:58
    #5 0x00007ffff66c6da9 in MVM_load_bytecode () at src/core/loadbytecode.c:31

only once, for the sequence 0xEF 0xBB 0xBF (hey, I've just noticed the bug in the buggy code) which was intended to be a 1 character literal U+FEFF, and instead is getting replaced with an empty string. Which, for some reason, completely confuses setting compilation.

Anyway, if I change the type of utf8 to be MVMuint8 *, then the bug is repeatable on x86_64 with the default char (ie signed).
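To make the signedness point concrete, here is a minimal sketch of the comparison in isolation. The two helper names are made up for illustration and are not MoarVM functions; the sketch assumes a two's complement machine with 8-bit bytes. The byte is promoted to int before the comparison, so a signed char holding the bit pattern 0xEF compares as -17, while an unsigned char compares as 239.

```c
#include <stddef.h>

/* Hypothetical helper (not MoarVM code): reads the first byte through a
 * signed char, which is what plain char amounts to on x86 and x86_64. */
static int first_byte_equals_as_signed(const void *buf, int expected) {
    const signed char *p = buf;
    /* p[0] is promoted to an int in the range -128..127 */
    return p[0] == expected;
}

/* Hypothetical helper (not MoarVM code): reads the first byte through an
 * unsigned char, which is what plain char amounts to on ARM. */
static int first_byte_equals_as_unsigned(const void *buf, int expected) {
    const unsigned char *p = buf;
    /* p[0] is promoted to an int in the range 0..255 */
    return p[0] == expected;
}
```

Called on a buffer starting with the bytes EF BB BF and an expected value of 0xEF, the signed variant should report no match and the unsigned variant a match, which is the same split as building with the default char versus -funsigned-char on x86_64.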
So attached are two patches, one to remove the above buggy code, and the second to change the prototype of MVM_string_utf8_decode to take MVMuint8 *, as conceptually UTF-8 sequences are unsigned, not signed.

Nicholas Clark

* Win32/Linux, x86/x86_64, RedHat/Debian, Debian/Ubuntu - take your pick :-)

From 95cdb337bb03827967777e89e679a507adeff030 Mon Sep 17 00:00:00 2001
From: Nicholas Clark <n...@ccl4.org>
Date: Mon, 5 May 2014 19:41:54 +0200
Subject: [PATCH 1/5] Remove BOM-discarding code from MVM_string_utf8_decode()

This code is unreachable on a platform where char is signed, because there
utf8[0] will be between -128 and 127 inclusive, and hence never equal to
0xEF. (ie x86, x86_64).

On a platform where char is unsigned, the code is reached, and it kills the
Rakudo build. Specifically, this happens because it munges strings being
deserialised, with the particular problem string being the character U+FEFF,
which is used in pre-compiled NQP code. Deserialising what should be a 1 char
string U+FEFF as a 0 char string causes a very strange
"Could not locate compile-time value for symbol Whatever"
error when compiling the Perl 6 setting. Oh the strange side effects of bugs...

If BOM-nomming is needed, it should be in I/O specific code, and only at the
start of files, not in the general-purpose NFG conversion routines.
---
 src/strings/utf8.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/src/strings/utf8.c b/src/strings/utf8.c
index c70da97..87143f3 100644
--- a/src/strings/utf8.c
+++ b/src/strings/utf8.c
@@ -200,10 +200,6 @@ MVMString * MVM_string_utf8_decode(MVMThreadContext *tc, MVMObject *result_type,
     MVMint32 line;
     MVMint32 col;
 
-    if (bytes >= 3 && utf8[0] == 0xEF && utf8[1] == 0xBB && utf8[0xBF]) {
-        /* disregard UTF-8 BOM if it's present */
-        utf8 += 3; bytes -= 3;
-    }
     orig_bytes = bytes;
     orig_utf8 = utf8;
 
-- 
1.8.4.2

From 8a18681c3974b33860b03cbd814f9ad56cc5a381 Mon Sep 17 00:00:00 2001
From: Nicholas Clark <n...@ccl4.org>
Date: Mon, 5 May 2014 18:50:00 +0200
Subject: [PATCH 2/5] MVM_string_utf8_decode() should take a MVMuint8 *, not a char *

Conceptually, UTF-8 is sequences of Octets, which are unsigned, so use an
unsigned type.
---
 src/strings/utf8.c | 2 +-
 src/strings/utf8.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/strings/utf8.c b/src/strings/utf8.c
index 87143f3..54669de 100644
--- a/src/strings/utf8.c
+++ b/src/strings/utf8.c
@@ -187,7 +187,7 @@ static void *utf8_encode(void *bytes, MVMCodepoint32 cp)
 /* Decodes the specified number of bytes of utf8 into an NFG string, creating
  * a result of the specified type. The type must have the MVMString REPR.
  * Only bring in the raw codepoints for now. */
-MVMString * MVM_string_utf8_decode(MVMThreadContext *tc, MVMObject *result_type, const char *utf8, size_t bytes) {
+MVMString * MVM_string_utf8_decode(MVMThreadContext *tc, MVMObject *result_type, const MVMuint8 *utf8, size_t bytes) {
     MVMString *result = (MVMString *)REPR(result_type)->allocate(tc, STABLE(result_type));
     MVMint32 count = 0;
     MVMCodepoint32 codepoint;
diff --git a/src/strings/utf8.h b/src/strings/utf8.h
index 511a393..fce37c6 100644
--- a/src/strings/utf8.h
+++ b/src/strings/utf8.h
@@ -1,4 +1,4 @@
-MVM_PUBLIC MVMString * MVM_string_utf8_decode(MVMThreadContext *tc, MVMObject *result_type, const char *utf8, size_t bytes);
+MVM_PUBLIC MVMString * MVM_string_utf8_decode(MVMThreadContext *tc, MVMObject *result_type, const MVMuint8 *utf8, size_t bytes);
 MVM_PUBLIC void MVM_string_utf8_decodestream(MVMThreadContext *tc, MVMDecodeStream *ds, MVMint32 *stopper_chars, MVMint32 *stopper_sep);
 MVM_PUBLIC MVMuint8 * MVM_string_utf8_encode_substr(MVMThreadContext *tc, MVMString *str, MVMuint64 *output_size, MVMint64 start, MVMint64 length);
 
-- 
1.8.4.2
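
For what it's worth, if BOM skipping is wanted later in I/O-specific code, a check on unsigned octets might look roughly like the sketch below. This is not part of either patch: utf8_bom_length is a hypothetical name, and uint8_t stands in for MVMuint8 on the assumption that the latter is an unsigned 8-bit type. Note that the third comparison tests the value of the third byte, rather than using 0xBF as an index.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not MoarVM code: returns the number of leading bytes
 * to skip, which is 3 if the buffer starts with the UTF-8 BOM (EF BB BF)
 * and 0 otherwise. Meant to be called once, on the start of a file. */
static size_t utf8_bom_length(const uint8_t *buf, size_t len) {
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}
```

A caller would advance its pointer by the returned count and reduce its length by the same amount before handing the rest to the decoder, so a U+FEFF inside a deserialised string is left alone.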