Re: Semantics for regexes

Larry Wall Thu, 02 Sep 2004 10:51:54 -0700

On Thu, Sep 02, 2004 at 10:43:48AM -0400, Aaron Sherman wrote:
: On Wed, 2004-09-01 at 17:00, Larry Wall wrote:
: > Okay, except that hypotheticality is an attribute of a variable's
: > value, not of the pad it's in.
: 
: Yes, I think I got that part, and perhaps I was being unclear or am
: still missing something. Here's what I was saying, a slightly different
: way:
: 
: As you enter a rule, you establish a new, free-floating pad. It *is*
: stored on the current pad stack (so that its variables are available to
: the rule and its closures), but, more importantly, it is part of the
: rule's state because it is stored in C<$0>.


So far, so good.

: When you bind a hypothetical it goes into this pad.

You're still confusing hypotheticality with storage (as I was when I
wrote the Apocalypse in question, so I'm certainly not blaming you :-).

If you bind a hypothetical, you are performing a "let" on that variable,
which is very much like a "temp" (a "local" in P5-ese).  That is completely
independent of where the variable is stored.

If you have a variable with a C<?> secondary sigil, as in C<$?x>,
it is stored in the rule's pad.  The C<?> is functioning as a "my" with
respect to the rule.  If there is no C<?>, then either the variable
is declared explicitly with C<my> (or equivalent) in an outer lexical
scope, or it's a package variable (in the absence of strictness).
The location of the variable is completely independent of whether
the variable is hypothetical.

In any case, a variable exists in the pad as soon as the pad is
created, whether it's a lexical pad or a rule pad (which is also
a lexical pad, as it happens--you just don't have to use "my").
In either case, the compiler knows that there's a lexical variable
of the name of either C<$x> or C<$?x>, and the initial pad reflects that.
(If we didn't do that, we'd screw up the closureness of an exception
handler, which in Perl 6 sees the pad for the lexical scope in which
it's embedded, and has to know that $x exists but is undefined even
before it is elaborated.)

: When you unbind a hypothetical (fail/backtrack) it is deleted from this
: pad (its value doesn't just get undef).

Neither of those is true.  It must regress to the *previous* value, which
might or might not be undef.  But it never disappears from its pad.
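
To make the "regress, don't delete" point concrete, here is a minimal sketch in Python rather than Perl 6.  Everything in it is an assumption made for illustration: the pad is modelled as a plain dict, None stands in for undef, and make_pad and let_bind are hypothetical names, not anything from a real implementation.

```python
# Illustrative model only: a pad is a dict, None stands in for undef.
# make_pad and let_bind are hypothetical names, not Perl 6 internals.

def make_pad(names):
    """Create a pad in which every declared name already exists, undefined."""
    return {name: None for name in names}

def let_bind(pad, name, value):
    """Bind hypothetically and return a closure that restores the old value."""
    previous = pad[name]      # a KeyError here would mean an undeclared name
    pad[name] = value
    def undo():
        pad[name] = previous  # regress to the prior value, never delete
    return undo

pad = make_pad(["$?x", "$?y"])
pad["$?x"] = "outer"          # $?x holds a value before any hypothesis is made

undo_x = let_bind(pad, "$?x", "guess")
undo_y = let_bind(pad, "$?y", "guess")

# Backtrack over both bindings, most recent first.
undo_y()
undo_x()

print(pad)  # {'$?x': 'outer', '$?y': None}
```

After backtracking, $?x is back to its earlier defined value and $?y is back to undefined, and both names are still present in the pad.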

: When you return from the rule (and this is the key), you return C<$0>,
: which, along with other state, contains a reference to this pad (and the
: pad, of course contains a circular reference to C<$0>). The caller can
: now do one of two things:
: 
:       * Push this pad onto its stack. Pro: simple and fast

Which stack is that?

:       * Copy each variable from this pad in a "smart" way, searching up
:         the pad stack for a candidate variable to replace, and
:         defaulting to storing it in the inner-most pad as a new lexical.

No, if the lower rule bound anything outside of its scope, it's
already bound.  All the upper rule has to do is decide whether to
bind the lower $0 to some other name in its own upper $0.  (It doesn't
have to.  One post-A5 change is that rules that remember their subtree
are written <?expr>, while rules that throw away their subtree are
written <ws>.)

: I think the second one is the one you are describing (and described in
: A5). The first is, IMHO, the cleaner solution, but I'm not suggesting
: anything really, just pointing out the options.

There is no separate match stack, and there is no copying.  There's
merely the tree of $0 objects, which ends up being your syntax tree
on a successful match.  There is a separate mechanism that keeps
track of C<temp> and C<let> variables, which in some sense acts as a stack.
Its behavior must be intimately tied to backtracking.  It is largely
independent of the lifetimes of the $0 pads, except insofar as there's
a correlation between backtracking over some variables, and wanting
to blow away the entire $0 pretty soon if you happen to backtrack out
of the whole rule.

: My real point is that if you just establish such a free-floating
: "hypopad" (sounds like something Dr. McCoy would use) in the rule, then
: you get all of the hypothetical/backtracking behavior that you want,
: regardless of how the caller integrates the variables with its scope. It
: also keeps rules from having to search up through existing scope levels
: themselves, keeping their complexity constrained to what they know best:
: matching regular expressions and grammars. Perl's calling conventions
: manage all of the extra complexity on return, and that's probably where
: stack-walking code should go anyway.

A "hypopad" is the wrong granularity for hypotheticality.  $2 is usually
more hypothetical than $1 simply because it's based on the $1 hypothesis...
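
One way to picture per-variable granularity is an undo log with marks, sketched below in Python.  This is an assumption about how such a mechanism could look, not a description of any actual Perl 6 implementation; Trail, mark, let, and rollback are hypothetical names, and the captures are modelled as a plain dict.

```python
class Trail:
    """Undo log shared by all hypothetical bindings (illustrative only)."""

    def __init__(self):
        self._entries = []  # each entry: (store, name, previous value)

    def mark(self):
        """Return a position that a later rollback can return to."""
        return len(self._entries)

    def let(self, store, name, value):
        """Record the old value, then bind the new one."""
        self._entries.append((store, name, store[name]))
        store[name] = value

    def rollback(self, mark):
        """Undo bindings made after `mark`, most recent first."""
        while len(self._entries) > mark:
            store, name, previous = self._entries.pop()
            store[name] = previous

captures = {"$1": None, "$2": None}
trail = Trail()

trail.let(captures, "$1", "foo")   # first hypothesis
after_first = trail.mark()
trail.let(captures, "$2", "bar")   # second hypothesis, built on the first

trail.rollback(after_first)        # backtrack over $2 only
state_after_partial = dict(captures)

trail.rollback(0)                  # backtrack out of everything
state_after_full = dict(captures)
```

Rolling back to after_first discards only the $2 binding and leaves $1 in place, which a single all-or-nothing pad could not express.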

: > : Essentially every close-paren triggers binding, and every back-track
: > : over a close-paren triggers clearing.
: > 
: > Yes, that's essentially correct.  My quibble was simply that it may be
: > hard to keep track of what to clear out in the case of calling a
: > failure continuation.
: 
: I'm not sure if that's going to be true or not, as thinking in terms of
: failure continuations hurts my brain ;-) Still, I'm 99% sure that what I
: describe above puts all of the "what to clear" state in the pad that you
: return. Nice and easy.

Sorry to inhabit your 1% unsureness, but that's precisely where I am.
The C<let> mechanism is independent of pad state, just like C<temp>.
A hypothetical variable is just a temporized variable with conditional
rollback on failure.  Nice and easy.  :-)
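
The contrast between temp and let can be sketched as two Python context managers.  This is a loose analogy under stated assumptions: MatchFailure is a hypothetical exception standing in for failure of the enclosing scope, and both storage locations are modelled as plain dicts.

```python
from contextlib import contextmanager

class MatchFailure(Exception):
    """Stands in for failure of the enclosing scope (hypothetical name)."""

@contextmanager
def temp(store, name, value):
    previous = store[name]
    store[name] = value
    try:
        yield
    finally:
        store[name] = previous      # always restored when the scope exits

@contextmanager
def let(store, name, value):
    previous = store[name]
    store[name] = value
    try:
        yield
    except MatchFailure:
        store[name] = previous      # restored only when the scope fails
        raise

# Two different storage locations; neither helper cares which one it gets.
rule_pad = {"$?x": "old"}
package_vars = {"$y": "old"}

with temp(package_vars, "$y", "new"):
    seen_inside_temp = package_vars["$y"]
after_temp = package_vars["$y"]     # restored unconditionally

with let(rule_pad, "$?x", "new"):
    pass                            # scope succeeds
after_success = rule_pad["$?x"]     # the new value is kept

try:
    with let(rule_pad, "$?x", "newer"):
        raise MatchFailure()        # scope fails
except MatchFailure:
    pass
after_failure = rule_pad["$?x"]     # rolled back to the value before this let
```

Neither helper looks at what kind of store it was handed, which is the sense in which the mechanism is independent of pad state.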

Larry
