On Thursday, August 21, 2003, at 07:22 , Benjamin Goldberg wrote:
> A foolish question: can you imagine strings which are lazily read from a file?
> If so, could you imagine such a string, sitting in front of a really really big file, bigger than could fit into memory?
Having a lazily slurped file string simply delays disaster, and opens the door to Very Big Mistakes. Such strings would have to be treated very delicately, or the program would behave very inefficiently or crash. (And let's be frank, a lazily concatenated STRING* is just a tie()d string value; I thought that was leaving the core.)

There's power in such strings, no doubt. There's also TERROR of passing the string to anything, lest your program explode because some CPAN module's author wasn't also TERRIFIED of your input being something other than just a string.

If I'm going to have the potential to load the entire file into memory when I'm the least bit careless, I'd prefer to be up front about it. Anti-action-at-a-distance. I don't need to be deluded into thinking my code is efficient because it reads lazily. (Fact is, it's probably faster if it buffers the file all at once, if it's going to buffer it at all. Certainly more memory-efficient (!). Fewer chunks. Less overhead. But probably faster still to mmap() it.)
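And the up-front alternatives are short to write. A quick Perl 5 sketch, for illustration only: huge.log is a stand-in, and the :mmap PerlIO layer is only there on platforms whose perl was built with mmap() support.

    # Slurp up front: the cost is visible right here at the call site.
    open my $fh, '<', 'huge.log' or die "open: $!";
    my $whole = do { local $/; <$fh> };   # the entire file in memory, on purpose
    close $fh;

    # Or let the kernel page it in via the PerlIO :mmap layer, where available.
    open my $mapped, '<:mmap', 'huge.log' or die "open: $!";
    while (my $line = <$mapped>) {
        # pages fault in as they're touched; no lazy string type involved
    }
    close $mapped;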
And what if your admittedly huge file is larger than 2**32 bytes? (A very real possibility! You said it was too big to fit in memory!) Are you going to suggest that all STRING* consumers on 32-bit platforms emulate 64-bit arithmetic whenever manipulating STRING* lengths?
To efficiently process a Very Large String, you need to *stream* through it, not buffer it. Same applies to infinite strings (generators) or indeterminate strings (generators and sockets). Such strings don't have representable or knowable lengths. STRING*'s *really* *really* should reliably have lengths, I think.
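To be concrete, here's what I mean by streaming, as a plain Perl 5 sketch. The chunk size and file name are invented, and a real consumer would also have to carry state across chunk boundaries, which is exactly the bookkeeping a stream-aware regex engine would own:

    use strict;
    use warnings;

    my $CHUNK = 64 * 1024;                      # fixed, bounded buffer
    open my $fh, '<:raw', 'huge.dat' or die "open: $!";
    my ($buf, $hits) = ('', 0);
    while (read($fh, $buf, $CHUNK)) {
        # work on one chunk at a time; here, count "GET"s
        # (a real consumer must also handle matches that straddle chunk boundaries)
        $hits += () = $buf =~ /GET/g;
    }
    close $fh;
    print "saw $hits GETs\n";

Memory use stays bounded by $CHUNK no matter how large the input grows.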
IMAGINE, if you will, something absolutely crazy:
    grammar HTTPServer {
        rule http { (<request> <commit>)* }

        rule request { <get_request> | <post_request> | ... }

        rule get_request {
            GET <path> <version> <crlf> <header>
            {
                my $file = open(...)
                    or print("403 Access Denied\r\n"), fail;
                print "200 OK\r\n";
                while (<$file>) { print }
                close $file;
            }
        }

        rule post_request {
            POST <path> <version> <crlf> <header>
            {
                # Blahblahblah...
            }
        }

        rule crlf { \r\n }

        rule header { <header_line>* <crlf> <commit> }

        rule header_line {
            ([:alpha:]+): ([^\r\n]* <crlf> ([ \t]+ [^\r\n]* <crlf>)*) <commit>
        }

        # ... more ...
    }
If perl's using a stream rather than buffering to a STRING*, then $sock =~ /<HTTPServer::http>/ could actually work—and quite efficiently. [1] How cool is that? Just imagine trying to apply the same pattern to a more long-lived protocol than HTTP, though—a database connection, maybe, or IRC. Or an HTTP client, which can download lots of data. Using chunky strings? perl, meet rlimit. rlimit, this is perl. [2] Using streams? Network programming becomes crazily easy.
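You can already fake the flavor of that with ordinary Perl 5 streams. This toy responder is only my sketch (port 8080, GET-only, no real parsing), not the grammar above, but notice that its memory use is bounded by one line of input rather than by the size of the conversation:

    use strict;
    use warnings;
    use IO::Socket::INET;

    my $listen = IO::Socket::INET->new(
        LocalPort => 8080,
        Listen    => 5,
        ReuseAddr => 1,
        Proto     => 'tcp',
    ) or die "listen: $!";

    while (my $client = $listen->accept) {
        my $request = <$client>;                        # e.g. "GET /index.html HTTP/1.0\r\n"
        1 while defined($_ = <$client>) && !/^\r?\n$/;  # drain headers a line at a time
        if (defined $request && $request =~ m{^GET (\S+)}) {
            print $client "HTTP/1.0 200 OK\r\n\r\nYou asked for $1\r\n";
        }
        else {
            print $client "HTTP/1.0 403 Access Denied\r\n\r\n";
        }
        close $client;
    }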
--
Gordon Henriksen [EMAIL PROTECTED]
[1] Of course, this requires that the regex engine be coded to think in sequences. The regex engine could keep its own backtracking buffer, and trim that buffer at each commit.
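To be concrete about the shape of it, here's a sketch; StreamBuffer, more(), and commit() are names I'm making up for illustration, not anything Parrot defines:

    package StreamBuffer;
    use strict;
    use warnings;

    sub new {
        my ($class, $fh) = @_;
        return bless { fh => $fh, buf => '', pos => 0 }, $class;
    }

    # Pull another chunk off the stream when the match needs more input.
    sub more {
        my ($self, $want) = @_;
        return read($self->{fh}, $self->{buf}, $want, length $self->{buf});
    }

    # The pattern hit a <commit>: nothing will ever backtrack past this point,
    # so everything to the left of the current position can be thrown away.
    sub commit {
        my ($self) = @_;
        substr($self->{buf}, 0, $self->{pos}, '');
        $self->{pos} = 0;
    }

    1;

The engine would advance pos as it consumes input, so the buffer only ever holds the span it might still need to backtrack over.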
[2] No doubt, unshift hacks[3] could be found to make the lazy slurpy file string not crash. But those are just changes to make strings behave like streams, and they would impose upon STRING* consumers everywhere Very Strange things, like strings which don't know their own length. A string wants to be a string, and a stream wants to be a stream.
[3] Unshift hack #1: Where <commit> appears in the above, exit the grammar, trim the beginning of the string, and re-enter. (But that forces the grammar author to discard the regex state, whereas <commit> imposes no such restriction.)

Unshift hack #2: Tell =~ that <commit> can trim the beginning of the string. (DWIM departs; /cgxism returns.)