Dear Racket Users,

Some of my students are getting strange results from regexp-match* and I'm 
hoping that someone on the list might be able to explain what's happening.  

They've selected the book at 
http://www.gutenberg.org/cache/epub/37499/pg37499.txt, a Project Gutenberg 
text encoded in UTF-8.  The students are searching it for words that start 
with any letter followed by "at", using the regular expression #px"\\W.at".  
Here's the behavior we're seeing.

Welcome to DrRacket, version 7.2 [3m].
Language: racket, with debugging; memory limit: 128 MB.
> (define port (open-input-file "pg37499.txt"))
> (define book (port->string port))
> (close-input-port port)
> (string-length book)
616649
> (regexp-match* #px"\\W.at" book)
. . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: 
substring: ending index is smaller than starting index
  ending index: 24053
  starting index: 24055
  valid range: [0, 616649]
  string: "\uFEFFProject Gutenberg's Napoleon's Letters to Josephine, by Henry 
Foljambe Hall\r\n\r\nThis eBook is for the use of any...

I tried to isolate the portion of the text that triggers the error.  Strangely 
enough, regexp-match* works fine on some substrings of the book that include 
the reported indices, but not on others.

> (regexp-match* #px"\\W.at" (substring book 0 25000))
'(" Dat" " rat" " lat" " Cat" "  at" " lat" "\ndat" " dat" " Cat" " bat" " bat" 
" bat" " lat" "\r\nat" ", at" " rat" " pat" " lat" ", at" " nat" " nat" 
"\r\nat" "\r\nat" " pat" " nat" "\nnat" "\r\nat" " Cat" "\ngat" " nat")
> (regexp-match* #px"\\W.at" (substring book 0 100000))
. . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: 
substring: ending index is smaller than starting index
  ending index: 24053
  starting index: 24055
  valid range: [0, 100000]
  string: "\uFEFFProject Gutenberg's Napoleon's Letters to Josephine, by Henry 
Foljambe Hall\r\n\r\nThis eBook is for the use of any...
> (regexp-match* #px"\\W.at" (substring book 24000 25000))
'(" nat")
> (regexp-match* #px"\\W.at" (substring book 24000 50000))
. . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: 
substring: ending index is smaller than starting index
  ending index: 53
  starting index: 55
  valid range: [0, 26000]
  string: "are in my\r\npossession. Had these been of a political nature, much 
as I should\r\nprize any relics of such a man, yet th...

We also see the same error with similar patterns that start with \\W.  

> (regexp-match* #px"\\W[a-zA-Z]at" (substring book 24000 50000))
. . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: 
substring: ending index is smaller than starting index
  ending index: 53
  starting index: 55
  valid range: [0, 26000]
  string: "are in my\r\npossession. Had these been of a political nature, much 
as I should\r\nprize any relics of such a man, yet th...
> (regexp-match* #px"\\W\\wat" (substring book 24000 50000))
. . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47: 
substring: ending index is smaller than starting index
  ending index: 53
  starting index: 55
  valid range: [0, 26000]
  string: "are in my\r\npossession. Had these been of a political nature, much 
as I should\r\nprize any relics of such a man, yet th...
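
The manual substring experiments above could probably be automated.  What 
follows is only a rough sketch, not something we have run against the book: 
the names fails?, smallest-failing-end, largest-failing-start, and 
shrink-failing are made up for illustration, and the binary search assumes 
that once a prefix (or suffix) triggers the error, every longer one does too, 
which may not hold for this bug.

```racket
#lang racket

;; Hypothetical helper: #t if (proc str) raises an exception, #f otherwise.
(define (fails? proc str)
  (with-handlers ([exn:fail? (lambda (e) #t)])
    (proc str)
    #f))

;; Smallest end index e such that (substring str 0 e) still fails.
;; Assumes the whole string fails, the empty prefix does not, and that
;; failure is monotonic in e (an assumption, not a known property).
(define (smallest-failing-end proc str)
  (let loop ([lo 0] [hi (string-length str)])
    (if (<= (- hi lo) 1)
        hi
        (let ([mid (quotient (+ lo hi) 2)])
          (if (fails? proc (substring str 0 mid))
              (loop lo mid)
              (loop mid hi))))))

;; Largest start index s such that (substring str s) still fails,
;; under the same monotonicity assumption.
(define (largest-failing-start proc str)
  (let loop ([lo 0] [hi (string-length str)])
    (if (<= (- hi lo) 1)
        lo
        (let ([mid (quotient (+ lo hi) 2)])
          (if (fails? proc (substring str mid))
              (loop mid hi)
              (loop lo mid))))))

;; Trim from the right first, then from the left.
(define (shrink-failing proc str)
  (define prefix (substring str 0 (smallest-failing-end proc str)))
  (substring prefix (largest-failing-start proc prefix)))

;; Possible use, with book defined as in the transcript:
;; (shrink-failing (lambda (s) (regexp-match* #px"\\W.at" s))
;;                 (substring book 24000 50000))
```

If the monotonicity assumption is wrong, the result may not be minimal, but 
it should still be a substring that raises the error.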

Any ideas?

Thanks!

-- SamR

Samuel A. Rebelsky - He, Him, His
Professor of Computer Science
Grinnell College, 1116 8th Avenue, Grinnell, Iowa 50112

The opinions expressed herein are my own, and should not be attributed to 
Grinnell College, Grinnell's Department of Computer Science, SIGCSE, SIGCAS, 
any other organizations with which I am affiliated, my family, or even most 
sentient beings.
