I'm not sure what the actual problem is, but maybe this excerpt from the docs
is relevant:

  The internal size of a regexp value is limited to 32 kilobytes; this
limit roughly corresponds to a source string with 32,000 literal characters
or 5,000 operators.

Source:
http://docs.racket-lang.org/reference/regexp.html?q=regular%20expressions#%28tech._regular._expression%29

Maybe you need to split the book into chunks of 32 kB.
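
If you want to try the chunking idea, here is a rough sketch. The helper
names string-chunks and regexp-match*/chunked are made up for this example,
and the default size of 32000 is only borrowed from the figure in the docs
excerpt; I have not confirmed that chunking avoids the error. Note that the
transcript below fails on a 26,000-character substring, so a smaller size
may be needed.

```racket
#lang racket

;; Hypothetical helper: split str into consecutive pieces of at most
;; size characters each (size must be a positive integer).
(define (string-chunks str size)
  (for/list ([start (in-range 0 (string-length str) size)])
    (substring str start (min (string-length str) (+ start size)))))

;; Hypothetical helper: run regexp-match* on each chunk separately and
;; append the results. The default size is an assumption, not a known fix.
(define (regexp-match*/chunked pattern str [size 32000])
  (append-map (lambda (chunk) (regexp-match* pattern chunk))
              (string-chunks str size)))
```

Something like (regexp-match*/chunked #px"\\W.at" book) would then replace
the direct call. One caveat: a match that straddles a chunk boundary is
missed, so the result can differ from a single call on the whole string.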


On Sun, Feb 10, 2019 at 3:59 PM Rebelsky, Samuel <rebel...@grinnell.edu>
wrote:

> Dear Racket Users,
>
> Some of my students are getting strange results from regexp-match* and I'm
> hoping that someone on the list might be able to explain what's happening.
>
> They've selected the book at
> http://www.gutenberg.org/cache/epub/37499/pg37499.txt, which is encoded
> in UTF-8.  The students are searching a book from Project Gutenberg for
> words that start with any letter and then "at", using the regular
> expression #px"\\W.at".  Here's the behavior we're seeing.
>
> Welcome to DrRacket, version 7.2 [3m].
> Language: racket, with debugging; memory limit: 128 MB.
> > (define port (open-input-file "pg37499.txt"))
> > (define book (port->string port))
> > (close-input-port port)
> > (string-length book)
> 616649
> > (regexp-match* #px"\\W.at" book)
> . . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47:
> substring: ending index is smaller than starting index
>   ending index: 24053
>   starting index: 24055
>   valid range: [0, 616649]
>   string: "\uFEFFProject Gutenberg's Napoleon's Letters to Josephine, by
> Henry Foljambe Hall\r\n\r\nThis eBook is for the use of any…
>
> I was trying to extract the portion that creates the error.  Strangely
> enough, if we just work with some substrings of the book that include the
> approximate indices, regexp-match* seems to work fine.  But not all such
> substrings.
>
> > (regexp-match* #px"\\W.at" (substring book 0 25000))
> '(" Dat" " rat" " lat" " Cat" "  at" " lat" "\ndat" " dat" " Cat" " bat" "
> bat" " bat" " lat" "\r\nat" ", at" " rat" " pat" " lat" ", at" " nat" "
> nat" "\r\nat" "\r\nat" " pat" " nat" "\nnat" "\r\nat" " Cat" "\ngat" " nat")
> > (regexp-match* #px"\\W.at" (substring book 0 100000))
> . . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47:
> substring: ending index is smaller than starting index
>   ending index: 24053
>   starting index: 24055
>   valid range: [0, 100000]
>   string: "\uFEFFProject Gutenberg's Napoleon's Letters to Josephine, by
> Henry Foljambe Hall\r\n\r\nThis eBook is for the use of any...
> > (regexp-match* #px"\\W.at" (substring book 24000 25000))
> '(" nat")
> > (regexp-match* #px"\\W.at" (substring book 24000 50000))
> . . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47:
> substring: ending index is smaller than starting index
>   ending index: 53
>   starting index: 55
>   valid range: [0, 26000]
>   string: "are in my\r\npossession. Had these been of a political nature,
> much as I should\r\nprize any relics of such a man, yet th...
>
> We also see the same error with similar patterns that start with \\W.
>
> > (regexp-match* #px"\\W[a-zA-Z]at" (substring book 24000 50000))
> . . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47:
> substring: ending index is smaller than starting index
>   ending index: 53
>   starting index: 55
>   valid range: [0, 26000]
>   string: "are in my\r\npossession. Had these been of a political nature,
> much as I should\r\nprize any relics of such a man, yet th...
> > (regexp-match* #px"\\W\\wat" (substring book 24000 50000))
> . . ../../Applications/Racket v7.2/collects/racket/private/kw.rkt:1325:47:
> substring: ending index is smaller than starting index
>   ending index: 53
>   starting index: 55
>   valid range: [0, 26000]
>   string: "are in my\r\npossession. Had these been of a political nature,
> much as I should\r\nprize any relics of such a man, yet th…
>
> Any ideas?
>
> Thanks!
>
> -- SamR
>
> Samuel A. Rebelsky - He, Him, His
> Professor of Computer Science
> Grinnell College, 1116 8th Avenue, Grinnell, Iowa 50112
>
> The opinions expressed herein are my own, and should not be attributed to
> Grinnell College, Grinnell's Department of Computer Science, SIGCSE,
> SIGCAS, any other organizations with which I am affiliated, my family, or
> even most sentient beings.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Racket Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to racket-users+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
