Timothy Sample <samp...@ngyro.com> writes:

> I’m still looking into this, but I wanted to quickly post this
> reproducer for the Guile bug:
>
>     (use-modules (ice-9 regex))
>     (define str
>       "\U101514\U103ab0\U0f6e6e\U02e278\U01d9eb\U10b996\U1089b5\uea15\U0fa074\U101e41\U02e330\u0177\u2492")
>     (match:substring (string-match "[0-8]+" str))
>
> This triggers the out-of-range error when run with “LC_ALL=C”.

It turns out that all that’s needed is the last code point, U+2492
“Number Eleven Full Stop”, or ‘⒒’.  When Guile converts this to an ASCII
C string using ‘u32_conv_to_encoding’, it is transliterated to “11.”.
The regex (“[0-8]+”) matches the “11” part with start index 0 and end
index 2.  The ‘fixup_multibyte_match’ function does nothing (it only
matters when the locale encoding is multibyte) [1].  Guile then builds
the match vector with the original string but keeps the ASCII offsets.
In other words, it thinks the match substring goes from 0 to 2 in a
string that is only one code point long (again with “LC_ALL=C”):

    ,use (ice-9 regex)
    (string-match "11" "\u2492")
    => #("\u2492" (0 . 2))
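To make the mismatch concrete, here is a small sketch that checks a match
vector’s offsets against its string before extracting anything.  The
‘checked-match-substring’ helper is made up for illustration and is not
part of (ice-9 regex).  The hand-built vector stands in for the result
shown above, so the example does not depend on the locale:

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper: return the matched substring, or #f when the
;; recorded offsets do not fit inside the string stored in the match.
(define (checked-match-substring m)
  (let ((str   (match:string m))
        (start (match:start m))
        (end   (match:end m)))
    (and (<= 0 start end (string-length str))
         (substring str start end))))

;; Stand-in for the match vector shown above: one code point, but
;; offsets that were computed on the three-byte ASCII string "11.".
(define bad-match (vector "\u2492" (cons 0 2)))

(checked-match-substring bad-match)                   ; => #f
(checked-match-substring (string-match "11" "a11b"))  ; => "11"
```

Calling ‘match:substring’ on the stand-in vector instead would ask for
characters 0 to 2 of a one-character string, which is where the
out-of-range error comes from.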

I’m not sure there’s any way to solve this nicely in Guile.  It would be
clearer if the match vector included the string as libc matched it, but
it would still be surprising that the match happens against a string
other than the one passed in.

In Disarchive, I can rewrite the generator without regex.  I’ll do that
and see what I can do about the “Gave up!” issue.
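
As a rough sketch of what a regex-free version of the “[0-8]+” search
could look like (this is not the actual Disarchive code; the names
‘digits-0-8’ and ‘first-digit-run’ are invented for the example), the
built-in string search procedures work on code points directly, so no
locale conversion is involved:

```scheme
;; Characters accepted by the original "[0-8]+" pattern.
(define digits-0-8 (string->char-set "012345678"))

;; Hypothetical helper: return the first run of characters from
;; digits-0-8 in STR, or #f if there is none.
(define (first-digit-run str)
  (let ((start (string-index str digits-0-8)))
    (and start
         (let ((end (or (string-skip str digits-0-8 start)
                        (string-length str))))
           (substring str start end)))))

(first-digit-run "ab11.c")   ; => "11"
(first-digit-run "\u2492")   ; => #f
```

Here ‘⒒’ is never treated as “11.”, so it simply does not match.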

[1] ‘fixup_multibyte_match’ works on the converted-to-ASCII C string,
which means that the byte offsets and code point offsets are the same.
Hence, it has nothing to do.


-- Tim


