Gordon Henriksen wrote:
>
> Now, I don't really have much of an opinion on compound strings in
> general. I do want to address one particular argument, though—the
> lazily slurped file string.
>
> On Thursday, August 21, 2003, at 07:22, Benjamin Goldberg wrote:
>
> > A foolish question: can you imagine strings which are lazily read
> > from a file?
> >
> > If so, could you imagine such a string, sitting in front of a really
> > really big file, bigger than could fit into memory?
>
> Having a lazily slurped file string simply delays disaster, and opens
> the door for Very Big Mistakes. Such strings would have to be treated
> very delicately, or the program would behave very inefficiently or
> crash.
Although Dan's convinced me that STRING*s don't need to be anything other than concrete, wholly-in-memory, non-active buffers of data (for various and sundry reasons), I'm not sure why a lazily slurped file string would need to be treated "delicately". In particular, what would make the program crash? (I'll assume that you're thinking the program might become inefficient due to someone grabbing bytes out of arbitrary offsets, due to having to seek() about here and there all the time. While it's true that this might be inefficient, it's not much worse than skipping around inside of a utf8 string.)

> (And let's be frank, a lazily concatenated STRING* is just a tie()d
> string value—I thought that was leaving the core.) There's power in
> such strings, no doubt. There's also TERROR of passing the string to
> anything lest your program explode because some CPAN module's author
> wasn't also TERRIFIED of your input being something
> not-just-a-string. If I'm going to have the potential to load the
> entire file into memory if I'm the least bit careless,

Why would you have the potential to load the entire file into memory if you're careless?

> I'd prefer to be up front about it. Anti-action-at-a-distance. I
> don't need to be deluded that my code is efficient because it reads
> lazily. (Fact is, it's probably faster if it buffers the file all at
> once, if it's going to buffer it at all. Certainly more
> memory-efficient (!). Fewer chunks. Less overhead. But probably
> faster still to mmap() it.)

Well, I'll certainly agree that mmap()ing is almost certainly faster than any other way of bringing whole files into strings. Hmm, does Parrot's memory system use mmap when really big chunks are requested? IIRC, Perl 5's does.

> And what if your admittedly huge file is larger than 2**32 bytes? (A
> very real possibility! You said it was too big to fit in memory!)
> Are you going to suggest that all STRING* consumers on 32-bit
> platforms emulate 64-bit arithmetic whenever manipulating STRING*
> lengths? Blech.

Yeah, that *would* be annoying. OTOH, they're already emulating 64-bit arithmetic whenever they deal with file offsets. Or perhaps I should be saying, "bad enough that they're already doing that with file offsets, we don't want to have to do it with string lengths, too."

> To efficiently process a Very Large String, you need to *stream*
> through it, not buffer it. Same applies to infinite strings
> (generators) or indeterminate strings (generators and sockets). Such
> strings don't have representable or knowable lengths. STRING*'s
> *really* *really* should reliably have lengths, I think.
>
> IMAGINE, if you will, something absolutely crazy:
>
>     grammar HTTPServer {
>         rule http {
>             (<request> <commit>)*
>         }
>         rule request {
>             <get_request> | <post_request> | ...
>         }
>         rule get_request {
>             GET <path> <version> <crlf>
>             <header>

[snip]

You should have a <commit> after that CRLF there :)

> If perl's using a stream rather than buffering to a STRING*, then
> $sock =~ /<HTTPServer::http>/ could actually work—and quite
> efficiently. [1]
>
> [1] Of course, this requires that the regex engine be coded to think
> in sequences.

Well, PerlString already supports the Iterator interface, and if the regex engine were redesigned to *use* the Iterator interface, and if streams were designed to support the Iterator interface, then passing a stream instead of a string to the regex engine would be easy.
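To make the consumer side of that concrete, here is a toy sketch in plain C -- none of these names (char_iter_t, iter_next, match_literal) are Parrot APIs, and a plain function stands in for the vtable dispatch. The point is only that a matcher which pulls characters one at a time never needs the whole input in memory, so a socket or a lazily-read file could sit behind the iterator just as well as a buffer could.

    /* Toy sketch only: a regex-engine-ish consumer that pulls
     * characters through an iterator instead of indexing a buffer. */
    #include <stdio.h>

    typedef struct {
        const char *src;   /* stands in for a socket or lazy file */
        size_t      pos;
    } char_iter_t;

    /* Return the next character, or EOF when the stream runs dry. */
    static int iter_next(char_iter_t *it) {
        return it->src[it->pos] ? (unsigned char)it->src[it->pos++] : EOF;
    }

    /* "Match" a fixed literal by streaming: compares one character at
     * a time, so the whole input is never held in memory at once. */
    static int match_literal(char_iter_t *it, const char *lit) {
        for (; *lit; lit++)
            if (iter_next(it) != (unsigned char)*lit)
                return 0;
        return 1;
    }

    int main(void) {
        char_iter_t it = { "GET /index.html HTTP/1.0\r\n", 0 };
        printf("%s\n", match_literal(&it, "GET ") ? "matched" : "no match");
        return 0;
    }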
The way that I would envision a stream's support of the iterator interface would be as follows:

    /* Return the integer value at the position named by the key. */
    INTVAL get_integer_keyed(PMC *key) {
        return key_integer(INTERP, key);
    }

    /* Walk forwards: each new key carries the next integer shifted
       off the front of the stream. */
    PMC * nextkey_keyed(PMC *key, INTVAL what) {
        PMC *ret;
        switch (what) {
          case ITERATE_FROM_START:
            key_set_integer(INTERP, key, DYNSELF->shift_integer());
            return key;
          case ITERATE_GET_NEXT:
            ret = key_next(INTERP, key);
            if (ret)
                return ret;
            ret = key_new_integer(INTERP, DYNSELF->shift_integer());
            key_set_next(key, ret);
            return key;
          case ITERATE_GET_PREV:
          case ITERATE_FROM_END:
            /* A stream can only be walked forwards. */
            internal_exception(???, "Unsupported");
        }
        internal_exception(???, "Illegal iterator request");
        return NULL;
    }

> The regex engine could keep its own backtracking buffer, and trim
> that buffer at each commit.

s/buffer/stack/. The regex engine already does. It currently pushes onto its stack pointers into the string it's matching against -- obviously, for a switch to iterators, it would instead push Key.pmc objects onto its stack.

ISTM that the regex compiler can (and probably should) produce code both for arbitrary objects supporting iteration and for strings. That way, if/when a regex is applied to a PerlString (which is the normal case), it will avoid the extra indirection produced by using the iterator interface.

> How cool is that? Just imagine trying to apply the same pattern to a
> more long-lived protocol than HTTP, though—a database connection,
> maybe, or IRC.

Through a database connection? I can envision that for the purpose of implementing the protocol, but if you mean examining the high-level output of a database... I don't see what you mean. For IRC... that can be done.

> Or an HTTP client, which can download lots of data. Using chunky
> strings? perl, meet rlimit. rlimit, this is perl. [2] Using streams?
> Network programming becomes crazily easy.
>
> —
>
> Gordon Henriksen
> [EMAIL PROTECTED]
>
> [2] No doubt, unshift hacks[3] could be found to make the lazy
> slurpy file string not crash. But these are just changes to make
> strings behave like streams, and would impose upon STRING* consumers
> everywhere Very Strange things like those strings which don't know
> their own length. A string wants to be a string, and a stream wants
> to be a stream.

I wasn't considering allowing lazily slurped file strings on anything other than plain files (ones for which perl's "-f" operator returns true). Thus, I can't see how the string wouldn't know its own length.

> [3] Unshift hack #1: Where commit appears in the above, exit the
> grammar, trim the beginning of the string, and re-enter. (But that
> forces the grammar author to discard the regex state, whereas commit
> would offer no such restriction.) Unshift hack #2: Tell =~ that
> <commit> can trim the beginning of the string. (DWIM departs;
> /cgxism returns.)

Trimming off the beginning of the string is the job of the <cut> operator, not the <commit> operator. Hmm... I wonder how <cut> would be done with an iterator. Bleh.

However, <commit> can easily be done, just by removing the bottom of our stack, up to the current point. And popping off an empty stack could raise a catchable exception, indicating that we've attempted to backtrack past a commit. For strings, <commit> frees up part of the backtracking stack, but since the things *on* the stack are just pointers into the string being iterated upon, nothing more is freed. For iterated objects, <commit> frees up part of the backtracking stack, and since the things on the stack are Key.pmc objects, these also get freed.
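To make that concrete, here is a toy sketch of the backtracking stack, again in plain C rather than Parrot's actual API: the saved positions stand in for the Key.pmc objects (or string pointers) the real engine would push, bt_commit() plays the role of <commit>, and the empty-pop case is where the catchable "backtracked past a commit" exception would be raised. In this toy, "everything below the current point" is simply the whole stack.

    /* Toy sketch, not Parrot's regex stack: longs stand in for the
     * Key.pmc objects (or string pointers) the engine would save. */
    #include <stdio.h>

    #define STACK_MAX 64

    typedef struct {
        long items[STACK_MAX];  /* saved positions = backtrack points */
        int  depth;
    } bt_stack_t;

    static void bt_push(bt_stack_t *s, long pos) {
        if (s->depth < STACK_MAX)       /* no overflow handling in this toy */
            s->items[s->depth++] = pos;
    }

    /* Backtrack: pop the most recent saved position.  Popping an empty
     * stack is the "tried to backtrack past a <commit>" case, which a
     * real engine would turn into a catchable exception. */
    static int bt_pop(bt_stack_t *s, long *pos) {
        if (s->depth == 0)
            return 0;                   /* would raise the exception here */
        *pos = s->items[--s->depth];
        return 1;
    }

    /* <commit>: throw away every saved position below the current
     * point.  For strings that only shrinks the stack; when the
     * entries are Key.pmc objects, dropping them also lets those keys
     * be collected. */
    static void bt_commit(bt_stack_t *s) {
        s->depth = 0;
    }

    int main(void) {
        bt_stack_t s = { {0}, 0 };
        long pos;

        bt_push(&s, 10);    /* alternation point at offset 10 */
        bt_push(&s, 25);    /* another at offset 25 */
        bt_commit(&s);      /* <commit>: both points discarded */

        if (!bt_pop(&s, &pos))
            puts("backtracked past a <commit>: raise catchable exception");
        return 0;
    }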
--
$a=24;split//,240513;s/\B/ => /for@@=qw(ac ab bc ba cb ca );{push(@b,$a),($a-=6)^=1 for 2..$a/6x--$|;print "[EMAIL PROTECTED] ]\n";((6<=($a-=6))?$a+=$_[$a%6]-$a%6:($a=pop @b))&&redo;}