‘regexp-exec’ sometimes gets match boundaries wrong when operating on a
string containing non-ASCII characters in the C locale (this is with
commit af96820e072d18c49ac03e80c6f3466d568dc77d):

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,use(ice-9 regex)
scheme@(guile-user)> (setlocale LC_ALL "C")
$52 = "C"
scheme@(guile-user)> (string-match "start (.*)"
                                   (string-append "start "
                                                   (string (integer->char 1002))))
$53 = #("start \u03ea" (0 . 8) (6 . 8))
scheme@(guile-user)> (match:substring $53 1)
ice-9/boot-9.scm:1683:22: In procedure raise-exception:
Value out of range 6 to< 7: 8

Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
--8<---------------cut here---------------end--------------->8---

The attached program produces more failures at random.  (The example
above works well under a UTF-8 locale.)

So I believe ‘fixup_multibyte_match’ isn’t quite correct.
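
One observation that may help narrow this down: the end offset reported
above (8) is one past the end of the 7-character string, and it happens
to equal the string's length in bytes when encoded as UTF-8.  The sketch
below only checks that arithmetic; it does not call the regexp code, and
the reading that the offsets are left as byte offsets is an assumption,
not something confirmed against the C source.

```scheme
(use-modules (rnrs bytevectors))

;; Same string as in the session above: "start " followed by U+03EA.
(define str
  (string-append "start " (string (integer->char 1002))))

;; Length in characters, which is what match offsets should index.
(define char-count (string-length str))

;; Length in bytes once encoded as UTF-8 (U+03EA takes two bytes).
(define byte-count (bytevector-length (string->utf8 str)))

(format #t "characters: ~a, UTF-8 bytes: ~a~%" char-count byte-count)
```

If that reading is right, the expected match vector would have been
(0 . 7) (6 . 7) rather than (0 . 8) (6 . 8).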

Ludo’.

PS: This originates in <https://issues.guix.gnu.org/77283>.

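(use-modules (ice-9 regex))

(define rx
  (make-regexp "^start (.*)"))

;; Switch to the C locale, where the bug shows up.
(setlocale LC_ALL "C")

;; Loop forever, matching "start " followed by one random character
;; with a code point in [256, 1279], until 'match:substring' fails.
(let loop ()
  (let* ((i (+ 256 (random (expt 2 10))))
         (str (string-append "start " (string (integer->char i)))))
    (with-exception-handler
        (lambda (exc)
          ;; Print the exception and the offending code point, then exit.
          (pk 'exc exc '<-- i)
          (display-backtrace (make-stack #t) (current-error-port))
          (exit 1))
      (lambda ()
        ;; Raises an out-of-range error when the match offsets are wrong.
        (match:substring (regexp-exec rx str) 1)))
    (loop)))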
