Dear Racket Users,

Some of my students are getting strange results from regexp-match*, and I'm hoping that someone on the list might be able to explain what's happening.
The students are searching a book from Project Gutenberg for words that start with any letter followed by "at", using the regular expression #px"\\W.at". They've selected the book at http://www.gutenberg.org/cache/epub/37499/pg37499.txt, which is encoded in UTF-8.

Here's the behavior we're seeing.

Welcome to DrRacket, version 7.2 [3m].
Language: racket, with debugging; memory limit: 128 MB.
> (define port (open-input-file "pg37499.txt"))
> (define book (port->string port))
> (close-input-port port)
> (string-length book)
616649
> (regexp-match* #px"\\W.at" book)
. . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: substring: ending index is smaller than starting index
  ending index: 24053
  starting index: 24055
  valid range: [0, 616649]
  string: "\uFEFFProject Gutenberg's Napoleon's Letters to Josephine, by Henry Foljambe Hall\r\n\r\nThis eBook is for the use of any…

I tried to extract the portion that triggers the error. Strangely enough, regexp-match* seems to work fine on some substrings of the book that include the approximate indices, but not on all of them.

> (regexp-match* #px"\\W.at" (substring book 0 25000))
'(" Dat" " rat" " lat" " Cat" " at" " lat" "\ndat" " dat" " Cat" " bat"
  " bat" " bat" " lat" "\r\nat" ", at" " rat" " pat" " lat" ", at" " nat"
  " nat" "\r\nat" "\r\nat" " pat" " nat" "\nnat" "\r\nat" " Cat" "\ngat" " nat")
> (regexp-match* #px"\\W.at" (substring book 0 100000))
. . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: substring: ending index is smaller than starting index
  ending index: 24053
  starting index: 24055
  valid range: [0, 100000]
  string: "\uFEFFProject Gutenberg's Napoleon's Letters to Josephine, by Henry Foljambe Hall\r\n\r\nThis eBook is for the use of any...
> (regexp-match* #px"\\W.at" (substring book 24000 25000))
'(" nat")
> (regexp-match* #px"\\W.at" (substring book 24000 50000))
. . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: substring: ending index is smaller than starting index
  ending index: 53
  starting index: 55
  valid range: [0, 26000]
  string: "are in my\r\npossession. Had these been of a political nature, much as I should\r\nprize any relics of such a man, yet th...

We also see the same error with similar patterns that start with \\W.

> (regexp-match* #px"\\W[a-zA-Z]at" (substring book 24000 50000))
. . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: substring: ending index is smaller than starting index
  ending index: 53
  starting index: 55
  valid range: [0, 26000]
  string: "are in my\r\npossession. Had these been of a political nature, much as I should\r\nprize any relics of such a man, yet th...
> (regexp-match* #px"\\W\\wat" (substring book 24000 50000))
. . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: substring: ending index is smaller than starting index
  ending index: 53
  starting index: 55
  valid range: [0, 26000]
  string: "are in my\r\npossession. Had these been of a political nature, much as I should\r\nprize any relics of such a man, yet th…

Any ideas?

Thanks!

-- SamR

Samuel A. Rebelsky - He, Him, His
Professor of Computer Science
Grinnell College, 1116 8th Avenue, Grinnell, Iowa 50112

The opinions expressed herein are my own, and should not be attributed to Grinnell College, Grinnell's Department of Computer Science, SIGCSE, SIGCAS, any other organizations with which I am affiliated, my family, or even most sentient beings.