Laurent,

Thanks for the idea. Unfortunately, that doesn't seem to make a difference. Here's a search involving only 8000 characters.

> (regexp-match* #px"\\W\\wat" (substring book 24000 32000))
. . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: substring: ending index is smaller than starting index
  ending index: 53
  starting index: 55
  valid range: [0, 8000]
  string: "are in my\r\npossession. Had these been of a political nature, much as I should\r\nprize any relics of such a man, yet th…

I *think* the restriction is on the pattern, not the string being searched.

---

All,

I tried the same code on Racket on Chez, and it seems to work fine. Using Racket CS did let me produce a small example that reveals the issue on Racket 7.2.

On Racket 7.2.0.5 CS

> (regexp-match* #px"\\W\\wat" "a cat érat")
'(" cat" "érat")

On Racket 7.2 (no CS)

> (regexp-match* #px"\\W\\wat" "a cat érat")
. . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: substring: ending index is smaller than starting index
  ending index: 2
  starting index: 4
  valid range: [0, 10]
  string: "a cat érat"

I thought both Racket 7.2 and Racket 7.2 CS were using Matthew's new RegExp implementation, but perhaps not. (Or perhaps there's a lower-level issue.)

Given the forthcoming switch to Chez, it may not be something that anyone wants to deal with, but I thought the smaller example might help someone who wants to attack the problem.

-- SamR

> On Feb 10, 2019, at 10:19 AM, Laurent <laurent.ors...@gmail.com> wrote:
>
> Not sure what the actual problem is, but may be this excerpt from the docs is relevant:
>
> The internal size of a regexp value is limited to 32 kilobytes; this limit roughly corresponds to a source string with 32,000 literal characters or 5,000 operators.
>
> Source: http://docs.racket-lang.org/reference/regexp.html?q=regular%20expressions#%28tech._regular._expression%29
>
> Maybe you need to split the book in chunks of 32kB.
>
> On Sun, Feb 10, 2019 at 3:59 PM Rebelsky, Samuel <rebel...@grinnell.edu> wrote:
> Dear Racket Users,
>
> Some of my students are getting strange results from regexp-match* and I'm hoping that someone on the list might be able to explain what's happening.
>
> They've selected the book at http://www.gutenberg.org/cache/epub/37499/pg37499.txt, which is encoded in UTF-8. The students are searching a book from Project Gutenberg for words that start with any letter and then "at", using the regular expression #px"\\W.at". Here's the behavior we're seeing.
>
> Welcome to DrRacket, version 7.2 [3m].
> Language: racket, with debugging; memory limit: 128 MB.
> > (define port (open-input-file "pg37499.txt"))
> > (define book (port->string port))
> > (close-input-port port)
> > (string-length book)
> 616649
> > (regexp-match* #px"\\W.at" book)
> . . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: substring: ending index is smaller than starting index
>   ending index: 24053
>   starting index: 24055
>   valid range: [0, 616649]
>   string: "\uFEFFProject Gutenberg's Napoleon's Letters to Josephine, by Henry Foljambe Hall\r\n\r\nThis eBook is for the use of any…
>
> I was trying to extract the portion that creates the error. Strangely enough, if we just work with some substrings of the book that include the approximate indices, regexp-match* seems to work fine. But not all such substrings.
>
> > (regexp-match* #px"\\W.at" (substring book 0 25000))
> '(" Dat" " rat" " lat" " Cat" " at" " lat" "\ndat" " dat" " Cat" " bat" " bat" " bat" " lat" "\r\nat" ", at" " rat" " pat" " lat" ", at" " nat" " nat" "\r\nat" "\r\nat" " pat" " nat" "\nnat" "\r\nat" " Cat" "\ngat" " nat")
> > (regexp-match* #px"\\W.at" (substring book 0 100000))
> . . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: substring: ending index is smaller than starting index
>   ending index: 24053
>   starting index: 24055
>   valid range: [0, 100000]
>   string: "\uFEFFProject Gutenberg's Napoleon's Letters to Josephine, by Henry Foljambe Hall\r\n\r\nThis eBook is for the use of any...
> > (regexp-match* #px"\\W.at" (substring book 24000 25000))
> '(" nat")
> > (regexp-match* #px"\\W.at" (substring book 24000 50000))
> . . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: substring: ending index is smaller than starting index
>   ending index: 53
>   starting index: 55
>   valid range: [0, 26000]
>   string: "are in my\r\npossession. Had these been of a political nature, much as I should\r\nprize any relics of such a man, yet th...
>
> We also see the same error with similar patterns that start with \\W.
>
> > (regexp-match* #px"\\W[a-zA-Z]at" (substring book 24000 50000))
> . . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: substring: ending index is smaller than starting index
>   ending index: 53
>   starting index: 55
>   valid range: [0, 26000]
>   string: "are in my\r\npossession. Had these been of a political nature, much as I should\r\nprize any relics of such a man, yet th...
> > (regexp-match* #px"\\W\\wat" (substring book 24000 50000))
> . . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: substring: ending index is smaller than starting index
>   ending index: 53
>   starting index: 55
>   valid range: [0, 26000]
>   string: "are in my\r\npossession. Had these been of a political nature, much as I should\r\nprize any relics of such a man, yet th…
>
> Any ideas?
>
> Thanks!
>
> -- SamR
>
> Samuel A. Rebelsky - He, Him, His
> Professor of Computer Science
> Grinnell College, 1116 8th Avenue, Grinnell, Iowa 50112
>
> The opinions expressed herein are my own, and should not be attributed to Grinnell College, Grinnell's Department of Computer Science, SIGCSE, SIGCAS, any other organizations with which I am affiliated, my family, or even most sentient beings.
>
> --
> You received this message because you are subscribed to the Google Groups "Racket Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to racket-users+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
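
For anyone who wants to check a particular Racket build against the small example in this thread, here is a minimal sketch. `try-match` is a hypothetical helper introduced only for illustration (it is not part of Racket); it returns either the list of matches or the text of the exception, so the two outcomes reported for "a cat érat" can be told apart without the REPL aborting. The second input is an assumed ASCII-only control that was not tested in the thread.

```racket
#lang racket/base

;; try-match: hypothetical helper, not part of Racket.
;; Returns the list of matches, or the exception message as a string
;; if regexp-match* raises an error.
(define (try-match pattern str)
  (with-handlers ([exn:fail? (lambda (e) (exn-message e))])
    (regexp-match* pattern str)))

;; The small example from the thread: the input contains a non-ASCII character.
(define with-accent (try-match #px"\\W\\wat" "a cat érat"))

;; Assumed control input (not from the thread): same text, ASCII only.
(define ascii-only (try-match #px"\\W\\wat" "a cat erat"))

(printf "with accent: ~s\n" with-accent)
(printf "ASCII only:  ~s\n" ascii-only)
```

If `with-accent` prints as a list, the build is probably not affected by this particular input; if it prints as a string, that string is the error message, which is what the thread reports for Racket 7.2 (no CS).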