Hi Timothy, Ludo,

On +2021-05-03 00:02:09 -0400, Timothy Sample wrote:
> Timothy Sample <samp...@ngyro.com> writes:
>
> > I’m still looking into this, but I wanted to quickly post this
> > reproducer for the Guile bug:
> >
> >     (use-modules (ice-9 regex))
> >     (define str
> >       "\U101514\U103ab0\U0f6e6e\U02e278\U01d9eb\U10b996\U1089b5\uea15\U0fa074\U101e41\U02e330\u0177\u2492")
> >     (match:substring (string-match "[0-8]+" str))
> >
> > This triggers the out-of-range error when run with “LC_ALL=C”.
>
> It turns out that all that’s needed is the last code point, which is
> “Number Eleven Full Stop”, or ‘⒒’.  When Guile converts this to an ASCII
> C string using ‘u32_conv_from_encoding’, it becomes “11.”.  The regex
> (“[0-8]+”) matches the “11” part with start index 0 and end index 2.
> The ‘fixup_multibyte_match’ function does nothing (it only matters when
> the locale encoding is multibyte) [1].  Guile then builds the match
> vector with the original string but keeps the ASCII offsets.  In other
> words, it thinks the match substring goes from 0 to 2 in a single code
> point string:
>
>     ,use (ice-9 regex)
>     (string-match "11" "\u2492")
>     => #("\u2492" (0 . 2))
>
> I’m not sure there’s any way to solve this nicely in Guile.  It would be
> clearer if the match vector included the string as libc matched it, but
> it’s still surprising that the match happens with a different string.
>
> In Disarchive, I can rewrite the generator without regex.  I’ll do that
> and see what I can do about the “Gave up!” issue.
>
> [1] It works on the converted-to-ASCII C string, which means that the
> byte offsets and code point offsets are the same.  Hence, it has nothing
> to do.
>
> -- Tim
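The offset mismatch Tim describes can be sketched outside Guile. The Python snippet below is only an illustration, not Guile's code path: it assumes that NFKD compatibility normalization is a fair stand-in for the locale conversion, since it also turns ‘⒒’ into “11.”. The variable names are made up for the example.

```python
import re
import unicodedata

original = "\u2492"  # NUMBER ELEVEN FULL STOP, a single code point

# Assumption: NFKD stands in for the conversion to an ASCII C string.
# It is not the libc/libunistring routine that Guile actually calls.
ascii_form = unicodedata.normalize("NFKD", original)

# Match against the converted string, as the regex engine would.
match = re.search(r"[0-8]+", ascii_form)
start, end = match.span()

# The offsets are valid for the converted string only.
offsets_fit_original = end <= len(original)

print(repr(ascii_form), (start, end), len(original), offsets_fit_original)
```

The match ends at index 2 in the converted string, while the original string has length 1, so reusing those offsets on the original is out of range.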

What happens with these? (code points in decimal)

     8554 _Ⅺ_ "ROMAN NUMERAL ELEVEN"
     8570 _ⅺ_ "SMALL ROMAN NUMERAL ELEVEN"
     9322 _⑪_ "CIRCLED NUMBER ELEVEN"
     9342 _⑾_ "PARENTHESIZED NUMBER ELEVEN"
     9362 _⒒_ "NUMBER ELEVEN FULL STOP"
     9451 _⓫_ "NEGATIVE CIRCLED NUMBER ELEVEN"
    13155 _㍣_ "IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ELEVEN"
    13290 _㏪_ "IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY ELEVEN"

I would argue that none of these should be "decoded" into ASCII
polyglyphs, since they are atomic character glyphs.  IMO it is an
over-eager transformation to turn them into ASCII polyglyphs.

Superscript and subscript placement metadata is another thing to
consider -- "decode" that to ASCII art?? ;-)

Unicode characters representing mathematical values in other languages
are different.  Those are subject to natural language translation with
locale-dependent semantics.

These might be candidates for that (code points in decimal):

     8544 _Ⅰ_ "ROMAN NUMERAL ONE"
     8545 _Ⅱ_ "ROMAN NUMERAL TWO"
     8546 _Ⅲ_ "ROMAN NUMERAL THREE"
     8547 _Ⅳ_ "ROMAN NUMERAL FOUR"
     8548 _Ⅴ_ "ROMAN NUMERAL FIVE"
     8549 _Ⅵ_ "ROMAN NUMERAL SIX"
     8550 _Ⅶ_ "ROMAN NUMERAL SEVEN"
     8551 _Ⅷ_ "ROMAN NUMERAL EIGHT"
     8552 _Ⅸ_ "ROMAN NUMERAL NINE"
     8553 _Ⅹ_ "ROMAN NUMERAL TEN"
     8554 _Ⅺ_ "ROMAN NUMERAL ELEVEN"
     8555 _Ⅻ_ "ROMAN NUMERAL TWELVE"
     8556 _Ⅼ_ "ROMAN NUMERAL FIFTY"
     8557 _Ⅽ_ "ROMAN NUMERAL ONE HUNDRED"
     8558 _Ⅾ_ "ROMAN NUMERAL FIVE HUNDRED"
     8559 _Ⅿ_ "ROMAN NUMERAL ONE THOUSAND"
     8560 _ⅰ_ "SMALL ROMAN NUMERAL ONE"
     8561 _ⅱ_ "SMALL ROMAN NUMERAL TWO"
     8562 _ⅲ_ "SMALL ROMAN NUMERAL THREE"
     8563 _ⅳ_ "SMALL ROMAN NUMERAL FOUR"
     8564 _ⅴ_ "SMALL ROMAN NUMERAL FIVE"
     8565 _ⅵ_ "SMALL ROMAN NUMERAL SIX"
     8566 _ⅶ_ "SMALL ROMAN NUMERAL SEVEN"
     8567 _ⅷ_ "SMALL ROMAN NUMERAL EIGHT"
     8568 _ⅸ_ "SMALL ROMAN NUMERAL NINE"
     8569 _ⅹ_ "SMALL ROMAN NUMERAL TEN"
     8570 _ⅺ_ "SMALL ROMAN NUMERAL ELEVEN"
     8571 _ⅻ_ "SMALL ROMAN NUMERAL TWELVE"
     8572 _ⅼ_ "SMALL ROMAN NUMERAL FIFTY"
     8573 _ⅽ_ "SMALL ROMAN NUMERAL ONE HUNDRED"
     8574 _ⅾ_ "SMALL ROMAN NUMERAL FIVE HUNDRED"
     8575 _ⅿ_ "SMALL ROMAN NUMERAL ONE THOUSAND"
     8576 _ↀ_ "ROMAN NUMERAL ONE THOUSAND C D"
     8577 _ↁ_ "ROMAN NUMERAL FIVE THOUSAND"
     8578 _ↂ_ "ROMAN NUMERAL TEN THOUSAND"
     8579 _Ↄ_ "ROMAN NUMERAL REVERSED ONE HUNDRED"
     8581 _ↅ_ "ROMAN NUMERAL SIX LATE FORM"
     8582 _ↆ_ "ROMAN NUMERAL FIFTY EARLY FORM"
     8583 _ↇ_ "ROMAN NUMERAL FIFTY THOUSAND"
     8584 _ↈ_ "ROMAN NUMERAL ONE HUNDRED THOUSAND"
    12321 _〡_ "HANGZHOU NUMERAL ONE"
    12322 _〢_ "HANGZHOU NUMERAL TWO"
    12323 _〣_ "HANGZHOU NUMERAL THREE"
    12324 _〤_ "HANGZHOU NUMERAL FOUR"
    12325 _〥_ "HANGZHOU NUMERAL FIVE"
    12326 _〦_ "HANGZHOU NUMERAL SIX"
    12327 _〧_ "HANGZHOU NUMERAL SEVEN"
    12328 _〨_ "HANGZHOU NUMERAL EIGHT"
    12329 _〩_ "HANGZHOU NUMERAL NINE"
    12344 _〸_ "HANGZHOU NUMERAL TEN"
    12345 _〹_ "HANGZHOU NUMERAL TWENTY"
    12346 _〺_ "HANGZHOU NUMERAL THIRTY"

Just my intuitive reaction, no academic creds to back it up ;)
--
Regards,
Bengt Richter