Re: Perlstorm #0040

2000-09-23 Thread Mark-Jason Dominus


> I lie: the other reason qr{} currently doesn't behave like that is that
> when we interpolate a compiled regexp into a context that requires it be
> recompiled,

Interpolated qr() items shouldn't be recompiled anyway.  They should
be treated as subroutine calls.  Unfortunately, this requires a
reentrant regex engine, which Perl doesn't have.  But I think it's the
right way to go, and it would solve the backreference problem, as well
as many other related problems.




Re: RFC 208 (v2) crypt() default salt

2000-09-21 Thread Mark-Jason Dominus


Bart Lateur:
> >If there are no objections, I will freeze this in twenty-four hours.
> 
> Oh, I have a small one: I feel that this pseudo-random salt should NOT
> affect the standard random generator. I'll clarify: by default, if you
> feed the pseudo-random generator with a certain number, you'll get the
> same sequence of output numbers, every single time. There are
> applications for this. I think that any call to crypt() should NEVER
> change this sequence of numbers, in particular, it should not skip a
> number every time crypt() is called with one parameter.
>
> Therefore, crypt() should have its own pseudo-random generator. A
> simple task, really: same code, but a different seed variable.

I had considered this for the original RFC, but I decided against it.

To implement it, Perl would have to have its own built-in random
number generator, because there is no way to save and restore the old
state of rand() (for example).  It would substantially complicate the
code.
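
For illustration only, here is roughly what a separate generator would involve. This is a sketch, not code from the RFC: the names salt_next and make_salt, the fixed seed, and the choice of a small linear congruential generator are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical private generator: a small linear congruential generator
# whose state is kept apart from the state used by rand().
my $salt_state = 12345;   # assumed seed, fixed here only for illustration

sub salt_next {
    # Constants from the well-known ANSI C sample rand(); illustration only.
    $salt_state = ($salt_state * 1103515245 + 12345) % 2147483648;
    return $salt_state;
}

# The 64 characters traditionally allowed in a crypt() salt.
my @salt_chars = ('.', '/', 0 .. 9, 'A' .. 'Z', 'a' .. 'z');

sub make_salt {
    # Two salt characters, each picked from the private generator.
    return join '', map { $salt_chars[ (salt_next() >> 16) % 64 ] } 1 .. 2;
}
```

Calling make_salt() never touches rand(), so a seeded rand() sequence is undisturbed; the price is a second generator to carry around and seed.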

And the problem you describe is not really a problem.  There has never
been any guarantee that a program would produce the same sequence of
random numbers after a change to the Perl binary.  More recent
versions of Perl use random() or drand48() if they are available,
instead of rand().  A program run under an old version of Perl and
then a newer version that used random() instead of rand() would
generate a different sequence of random numbers depending on which
version of Perl was running it, even if the seed was the same.  This
has never been an issue in the past, so I did not consider it
important.

I will add a note about this to the RFC.  If there are no other
comments, I will freeze it in 24 hours.




Re: Threaded Perl bytecode (was: Re: stackless python)

2000-10-25 Thread Mark-Jason Dominus


> > Joshua N Pritikin writes:
> > : http://www.oreillynet.com/pub/a/python/2000/10/04/stackless-intro.html
> > 
> > Perl 5 is already stackless in that sense, though we never implemented
> > continuations.  The main impetus for going stackless was to make it
> > possible to implement a Forth-style threaded code interpreter, though
> > we never put one of those into production either.

There's a large school of thought in the Lisp world that holds that
full continuations are a bad idea.  See for example:

http://www.deja.com/threadmsg_ct.xp?AN=635369657

Executive summary of this article:  

* Continuations are hard to implement and harder to implement
  efficiently.   Languages with continuations tend to be slower
  because of the extreme generality constraints imposed by the
  presence of continuations.

* Typical uses of continuations are for things like exception
  handling.  Nobody really uses continuations because they are too
  difficult to understand.  Exception handling is adequately served by
  simpler and more efficient catch-throw mechanisms which everyone
  already understands.


Anyone seriously interested in putting continuations into Perl 6 would
probably do well to read the entire thread headed by the article I
cited above.




Critique available

2000-11-02 Thread Mark-Jason Dominus


My critique of the Perl 6 RFC process and following discussion is now
available at

http://www.perl.com/pub/2000/11/perl6rfc.html

Mark-Jason Dominus   [EMAIL PROTECTED]
I am boycotting Amazon. See http://www.plover.com/~mjd/amazon.html for details.




Re: Critique available

2000-11-02 Thread Mark-Jason Dominus


> To strive for balance, I think perl.com's home page should also have the
> links to Larry's ALS talk and slides.

Thanks very much.  I have asked the folks at Songline to arrange this.

We were going to carry these, and in fact the ORA were prepared to
complete Nat's transcript, but then Ask posted them first, so we
didn't go ahead with that.  But you are right, the web page should
have links to them.  This was an oversight on my part.




Fwd: Response to Critique of Perl 6 RFC Process

2000-11-02 Thread Mark-Jason Dominus


Frank Tobin has generously given me permission to forward his comments
to this list.

--- Forwarded Message

Date: Thu, 2 Nov 2000 00:31:42 -0600 (CST)
From: Frank Tobin <[EMAIL PROTECTED]>
X-Sender: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Response to Critique of Perl 6 RFC Process
Message-ID: <[EMAIL PROTECTED]>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

- -BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

I appreciated reading your critique of the RFC process.  I think one
problem that contributed to the mess of "mishandled" RFC's was that there
were no real written guidelines on what authors should do with RFC's in
various circumstances.

For example, when should an RFC be frozen?  How about withdrawn?  Does
withdrawn mean the RFC should be revoked because there is something
inherently bad about it (e.g., wanting a perfect data structure), or can it
also mean the RFC is simply heavily disliked?  Does frozen mean "it looks
good, let's go for it", or does it mean "no further changes will improve
the RFC".

From personal experience, I was the maintainer of RFC 357 (Perl should use
XML for docs instead of POD).  This generated a lot of criticism/debate.  
In general it seemed like the heavy majority of the Perl community was
against it.  I decided to mark the RFC as frozen, while adding a section
in the RFC about how the community was against it (although I didn't go into
much detail, I admit).  I decided against withdrawing it, because I felt
there wasn't anything inherently wrong about the RFC; it was just
disliked.  The problem seemed that "withdrawn" and "frozen" weren't
orthogonal choices.

Perhaps one problem was that there was only one field for the status of an
RFC.  Perhaps two were needed.  One of these would be "Closure:
Open/Closed", which would indicate the activeness of the RFC, and the
other would be "Resolution: Popular/Unpopular/AlreadyDone/Impossible/etc".
Maybe this would've given maintainers the ability to better describe the
status of the RFC; I know it would've made my choice easier.

- - --
Frank Tobin http://www.uiuc.edu/~ftobin/

- -BEGIN PGP SIGNATURE-
Version: GnuPG v1.0.4 (FreeBSD)
Comment: pgpenvelope 2.9.0 - http://pgpenvelope.sourceforge.net/

iEYEARECAAYFAjoBClUACgkQVv/RCiYMT6MBMwCfe0povtY/42rca0qn9E+Sc6pb
7UgAoK7YQ6gp61LjdgZvDXFD77Oao6Gv
=xn0j
- -END PGP SIGNATURE-


--- End of Forwarded Message




Re: Critique available

2000-11-02 Thread Mark-Jason Dominus


> I just figured it was time for a little nudge.

Yes, thank you.  It is on www.perl.com now.




Re: Critique available

2000-11-03 Thread Mark-Jason Dominus


> Anyone think others are needed?

"Stick to the subject."



Garbage collector slowness

2000-12-19 Thread Mark-Jason Dominus

http://www.xanalys.com/software_tools/mm/articles/lang.html#emacs.lisp


Erik Naggum ([EMAIL PROTECTED]) reports: 

 I have run some tests at the U of Oslo with about 100 users who
 generally agreed that Emacs had become faster in the latest Emacs
 pretest. All I had done was to remove the "Garbage collecting"
 message which people perceive as slowing Emacs down and tell them
 that it had been sped up. It is, somehow, permissible for a
 program to take a lot of time doing any other task than
 administrative duties like garbage collection.






Re: Garbage collector slowness

2000-12-19 Thread Mark-Jason Dominus


> "The new version must be better because our gazillion dollar marketing
> campaign said so.  (We didn't really *fix* anything.)  

The part I found interesting was the part about elimination of the message.

Perceived slowness is also important.




Re: Schwartzian Transform

2001-03-28 Thread Mark-Jason Dominus


> So you can say
> 
>   use Memoize;
>   # ...
>   memoize 'f';
>   @sorted = sort { my_compare(f($a),f($b)) } @unsorted
> 
> to get a lot of the effect of the S word.

Yes, and of course the inline version of this technique is also
common:

   @sorted = sort { my $ac = $cache{$a} ||= f($a);
                    my $bc = $cache{$b} ||= f($b);
                    my_compare($ac,$bc);
                  } @unsorted;

Joseph Hall calls this the 'Orcish Maneuver'.

However (I don't know who suggested this, but:)

> > > > >I'd think /perl/ should complain if your comparison function isn't
> > > > >idempotent (if warnings on, of course).  If nothing else, it's probably an
> > > > >indicator that you should be using that schwartz thang.

I have to agree with whoever followed up that this is a really dumb
idea.  It reminds me of the time I was teaching the regex class at
TPC3, and I explained how the /o in

/$foo/o

represents a promise to Perl that $foo will never change, so Perl can
skip the operation of checking to see if it has changed every time the
match is performed.  Then there was a question from someone in the
audience, asking if Perl would emit a warning if $foo changed.

On the other side of the argument, however, I should mention that I've
planned for a long time to write a Sort::Test module which *would*
check to make sure the comparator function behaved properly, and would
report problems.   When you use the module, it would make all your
sorts run really slowly, but you would get a warning if your
comparator was bad. 

Idempotency is not the important thing here.  The *important* property
that the comparator needs, and the one that bad comparators usually
lack is 
    if my_compare(a,b) < 0, and
       my_compare(b,c) < 0, then it should also be the case that
       my_compare(a,c) < 0

for all keys a, b, and c.

Sort::Test would run a quadratic sort such as a bubble sort, and make
sure that this essential condition held true.  Note in particular that
if the comparator has the form { my_compare(f(a),f(b)) }, then it does
not matter if f() is idempotent; what really matters is that
my_compare should have the property above.
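
To make the property concrete, here is a sketch of a brute-force check. It is not the planned Sort::Test module: find_intransitive is a hypothetical helper, it takes the comparator as a code reference with two explicit arguments, and it simply tries every triple rather than instrumenting a sort.

```perl
use strict;
use warnings;

# Hypothetical helper: look for keys x, y, z with cmp(x,y) < 0 and
# cmp(y,z) < 0 but not cmp(x,z) < 0.  Cubic time, so small inputs only.
sub find_intransitive {
    my ($cmp, @keys) = @_;
    for my $x (@keys) {
        for my $y (@keys) {
            next unless $cmp->($x, $y) < 0;
            for my $z (@keys) {
                next unless $cmp->($y, $z) < 0;
                return [$x, $y, $z] unless $cmp->($x, $z) < 0;
            }
        }
    }
    return;   # no counterexample found
}

# A well-behaved comparator.
my $good = sub { $_[0] <=> $_[1] };

# A deliberately inconsistent one, in the style of rock-paper-scissors.
my %beats = (rock => 'scissors', scissors => 'paper', paper => 'rock');
my $bad = sub {
    my ($p, $q) = @_;
    return 0 if $p eq $q;
    return $beats{$p} eq $q ? -1 : 1;
};
```

find_intransitive($good, 1 .. 5) finds nothing, while find_intransitive($bad, qw(rock paper scissors)) returns a triple that breaks the rule.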

I had also planned to have optional checks:

use Sort::Test 'self';

(Make sure that my_compare(a,a) == 0 for all a)

use Sort::Test 'twice';

(Make sure that my_compare(a,b) == my_compare(a,b) for all a,b)

This last is essentially the idempotency restriction again.  The
reason I've never implemented this module is that in perl 5, sort()
cannot be overridden, so the usefulness seemed low; you would have to
rewrite your source code to use it.  I hope this limitation is fixed
in perl 6, because it would be a cool hack.

Finally, another argument in the opposite direction yet again.  It has
always seemed to me that this 'inconsistent sort comparator' thing is
a tempest in a teapot.  In the past it has gotten a lot of attention
because some system libraries have a qsort() function that dumps core
if the comparator is inconsistent.  

To me, this obviously indicates a defective implementation of
qsort().  If the sort function dumps core or otherwise detects an
inconsistent comparator, it is obviously functioning suboptimally.  An
optimal sort will not notice that the comparator is inconsistent,
because the only way you can find out that the comparator is returning
inconsistent results is if you call it in a situation where you
already know what the result should be, and it returns a different
result.  An optimal sort function will not call the comparator if it
already knows what the result should be!

For example, consider the property from above:
    if my_compare(a,b) < 0, and
       my_compare(b,c) < 0, then
       my_compare(a,c) < 0.

If the qsort() already knows that a


Re: Please make "last" work in "grep"

2001-05-10 Thread Mark-Jason Dominus



On (03 May 2001 10:23:15 +0300) you wrote:

> Michael Schwern:
> > 
> > Would be neat if:  my($first) = grep {...} @list;  knew to stop itself, yes.
> > 
> > It also reminds me of mjd's mention of:  my($first) = sort {...} @list;
> > being O(n) if Perl were really Lazy.
> 
> But it would need a completely different algorithm.  

Not precisely.  If you have lazy evaluation, then quicksort is exactly
what is wanted here.  For example, if you implement qsort in the
straightforward way in Haskell, and write

min = first quicksort list;

then it *does* run in O(n) time; in this case quicksort reduces to
Hoare's algorithm for min.

> my ($first, $second, $third) = sort {...} @list;

The Haskell version of this also runs in O(n) time.

> is kind-of plausible.  So we'd definitely want
> 
>   ((undef)x((@list+1)/2), $median) = sort {...} @list;

The Haskell equivalent of this (still using quicksort) runs in O(n log
n) time, which I believe is optimal for finding the median.
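
Perl 5 is not lazy, but the effect can be imitated by hand. The sketch below is an assumption-laden illustration, not anything Perl does for you: first_k is a hypothetical function, keys are compared numerically, and the pivot is simply the first element. It is a quicksort that is told how many leading elements are wanted and never sorts a partition it does not need, which is the work a lazy evaluator would skip.

```perl
use strict;
use warnings;

# Return the first $k elements of the sorted order of @list, visiting
# only the partitions that contribute to those $k elements.
sub first_k {
    my ($k, @list) = @_;
    return () if $k <= 0 || !@list;
    my ($pivot, @rest) = @list;
    my @lo = grep { $_ <  $pivot } @rest;
    my @hi = grep { $_ >= $pivot } @rest;
    my @out = first_k($k, @lo);                 # smallest elements first
    push @out, $pivot                  if @out < $k;
    push @out, first_k($k - @out, @hi) if @out < $k;
    return @out;
}
```

With $k == 1 only one side of each partition is ever examined, so on typical input this behaves like Hoare's selection algorithm rather than a full sort.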




Re: explicitly declare closures???

2001-09-04 Thread Mark-Jason Dominus


Says Dave Mitchell:

> Closures ... can also be dangerous and counter-intuitive, especially to
> the uninitiated. For example, how many people could say what the
> following should output, with and without $x commented out, and why:
> 
> {
>     my $x = "bar";
>     sub foo {
>         # $x  # <- uncommenting this line changes the outcome
>         return sub {$x};
>     }
> }
> print foo()->();
> 

That is confusing, but it is not because closures are confusing.  It
is confusing because it is a BUG.  In Perl 5, named subroutines are
not properly closed.

If the bug were fixed, the result would be 'bar' regardless of whether
or not $x was commented.

This would solve the problems with mod_perl also.

The right way to fix this is not to eliminate closures, or to require
declarations.  The right way to fix this is to FIX THE BUG.
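
For comparison, here is the same shape written entirely with anonymous subroutines, which Perl 5 does close properly. This is only a sketch; the variable name $foo is chosen to mirror the example above.

```perl
use strict;
use warnings;

my $foo;
{
    my $x = "bar";
    # The outer anonymous sub captures $x when it is created at run
    # time, and the inner one captures it in turn.
    $foo = sub { return sub { $x } };
}
my $result = $foo->()->();   # the value of $x at the time of capture
```

Here the inner subroutine sees $x whether or not the outer one mentions it, which is the behavior one would expect from the named version too.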




Re: Objects, methods, attributes, properties, and other related frobnitzes

2003-02-20 Thread Mark Jason Dominus
Dan Sugalski <[EMAIL PROTECTED]>:
> At 2:06 PM + 2/19/03, Peter Haworth wrote:
> >On Fri, 14 Feb 2003 15:56:25 -0500, Dan Sugalski wrote:
> >>  I got clarification. The sequence is:
> >>
> >>  1) Search for method of the matching name in inheritance tree
> >>  2) if #1 fails, search for an AUTOLOAD
> >>  3) if #2 fails (or all AUTOLOADs give up) then do MM dispatch
> >
> >Shouldn't we be traversing the inheritance tree once, doing these three
> >steps at each node until one works, rather than doing each step once for the
> >whole tree. MM dispatch probably complicates this, though.
> 
> No, you have to do it multiple times. AUTOLOAD is a last-chance 
> fallback, so it ought not be called until all other chances have 
> failed.

Pardon me for coming in in the middle, but it seems to me that only
one traversal should be necessary.  The first traversal can accumulate
a temporary linked list of AUTOLOAD subroutines.  If the first
traversal locates an appropriate method, the linked list is discarded.
If no appropriate method is found, control is dispatched to the
AUTOLOAD subroutine at the head of the list, if there is one; if the
list is empty the MM dispatch is tried.
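
A rough Perl 5 rendering of that single traversal follows. It is a sketch under several assumptions: resolve is a hypothetical function, the walk is a plain depth-first search over @ISA, an ordinary array stands in for the linked list, and the multimethod step is represented only by a return value.

```perl
use strict;
use warnings;

# One walk over the inheritance tree: return the method if found,
# otherwise the first AUTOLOAD seen, otherwise signal MM dispatch.
sub resolve {
    my ($class, $method) = @_;
    my (@autoloads, %seen);
    my @todo = ($class);
    while (defined(my $c = shift @todo)) {
        next if $seen{$c}++;
        no strict 'refs';
        return ('method', \&{"${c}::$method"}) if defined &{"${c}::$method"};
        push @autoloads, \&{"${c}::AUTOLOAD"}  if defined &{"${c}::AUTOLOAD"};
        unshift @todo, @{"${c}::ISA"};          # parents next, depth first
    }
    return ('autoload', $autoloads[0]) if @autoloads;
    return ('multimethod');                     # nothing found at all
}

# Example classes, invented for the demonstration.
{ package Animal; sub speak { 'generic noise' } sub AUTOLOAD { 'animal-auto' } }
{ package Dog; our @ISA = ('Animal'); sub AUTOLOAD { 'dog-auto' } }
```

resolve('Dog', 'speak') finds the inherited method and throws the collected AUTOLOADs away; resolve('Dog', 'fly') falls back to the AUTOLOAD found first, here the one in Dog.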


Testing job

2004-04-19 Thread Mark Jason Dominus

I'm writing automated tests for the example code in my book, which
will go into production early next month.  I have the harness and test
apparatus all set up; I wrote a complete set of tests for chapter 6,
and I think I know how I want it done.  But I need help writing the
tests themselves, because time is short and I have a lot of other
stuff to do.

If you would be interested in helping me with this, send me mail right
away.  I believe that my publisher is willing to pay for it, although
I don't know how much.

-D.


Re: RFC 105 (v1) Downgrade or remove "In string @ must be \@" error

2000-08-16 Thread Mark-Jason Dominus


This has already been done for Perl 5.6.1.  Here is what perldelta.pod
has to say.



=head2 Arrays now Always Interpolate Into Double-Quoted Strings

In double-quoted strings, arrays now interpolate, no matter what.  The
behavior in perl 5 was that arrays would interpolate into strings if
the array had been mentioned before the string was compiled, and
otherwise Perl would raise a fatal compile-time error.  In versions
5.000 through 5.003, the error was

    Literal @example now requires backslash

In versions 5.004_01 through 5.6.0, the error was

    In string, @example now must be written as \@example

The idea here was to get people into the habit of writing
C<"fred\@example.com"> when they wanted a literal C<@> sign, just as
they have always written C<"Give me back my \$5"> when they wanted a
literal C<$> sign.

Starting with 5.6.1, when Perl now sees an C<@> sign in a
double-quoted string, it I<always> attempts to interpolate an array,
regardless of whether or not the array has been used or declared
already.  The fatal error has been downgraded to an optional warning:

    Array @example will be interpolated in string

This warns you that C<"fred@example.com"> is going to turn into
C<fred.com> if you don't backslash the C<@>.

See L<http://www.plover.com/~mjd/perl/at-error.html> for more details
about the history here.
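
A short sketch of the three cases, using the address from the text above. The array contents are invented for the example, and the array is declared so that the code also compiles under strict.

```perl
use strict;
use warnings;

my $escaped = "fred\@example.com";   # backslash keeps the @ literal
my $single  = 'fred@example.com';    # single quotes never interpolate

my @example = ('x', 'y');
my $interp  = "fred@example.com";    # @example is interpolated,
                                     # elements joined by $" (a space)
```

The first two strings are identical; the third contains the array's elements in place of "@example".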








TAI time

2000-08-18 Thread Mark-Jason Dominus


TAI is an international time standard.  It has a number of technical
advantages over UTC.  One of these advantages is that it doesn't have
any silly truck with leap seconds.

Dan Bernstein has defined a time format called TAI64 which is based on
TAI.  The format is very simple.  TAI64 is almost compatible with Unix
epoch time.  TAI64 has a resolution of one second and a range of about
300 billion years.

Bernstein has a small, good-quality library of code for manipulating
TAI64 values and for converting them to and from UTC or unix epoch
time.  The library is in the public domain, so there can't be any
license or copyright objection to including it in Perl.

TAI64 has one-second precision, but there are extensions to it,
TAI64N and TAI64A, with nanosecond and attosecond precision.  libtai
handles these extensions also.

libtai has functions to convert calendar dates and times (such as
"March 27 1823") into TAI values and back and to input and output date
and time strings.  It has functions for addition, subtraction, and
comparison of TAI times.  The interface is simple and well-documented.

If we're going to standardize on a single time format for all
platforms, I wish we could choose a good format.  Unix time runs out
in 2038.
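
The 2038 figure comes from the largest value a signed 32-bit time_t can hold. A small sketch, assuming that representation:

```perl
use strict;
use warnings;

# Largest second count that fits in a signed 32-bit integer.
my $last_second = 2**31 - 1;

# The corresponding moment in UTC, as formatted by gmtime.
my $last_moment = scalar gmtime($last_second);
print "$last_moment\n";   # Tue Jan 19 03:14:07 2038
```

One second later a 32-bit counter wraps around; a 64-bit label such as TAI64 does not have this problem.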

The libtai blurb is at:

http://cr.yp.to/libtai.html

I've included this below.

Public-domain source code for libtai:

http://cr.yp.to/libtai/libtai-0.60.tar.gz

The spec for TAI64:

http://cr.yp.to/libtai/tai64.html


BLURB:

libtai is a library for storing and manipulating dates and times.

libtai supports two time scales: (1) TAI64, covering a few hundred
billion years with 1-second precision; (2) TAI64NA, covering the same
period with 1-attosecond precision. Both scales are defined in terms of
TAI, the current international real time standard.

libtai provides an internal format for TAI64, struct tai, designed for
fast time manipulations. The tai_pack() and tai_unpack() routines
convert between struct tai and a portable 8-byte TAI64 storage format.
libtai provides similar internal and external formats for TAI64NA.

libtai provides struct caldate to store dates in year-month-day form. It
can convert struct caldate, under the Gregorian calendar, to a modified
Julian day number for easy date arithmetic.

libtai provides struct caltime to store calendar dates and times along
with UTC offsets. It can convert from struct tai to struct caltime in
UTC, accounting for leap seconds, for accurate date and time display. It
can also convert back from struct caltime to struct tai for user input.
Its overall UTC-to-TAI conversion speed is 100x better than the usual
UNIX mktime() implementation.

This version of libtai requires a UNIX system with gettimeofday(). It
will be easy to port to other operating systems with compilers
supporting 64-bit arithmetic.

The libtai source code is in the public domain.



Re: TAI Time

2000-08-19 Thread Mark-Jason Dominus


I agree with Tim that it's a red herring that unix systems don't
normally have access to a TAI source. 

The proposal under discussion is to use one time format for all
platforms.  So maybe there's a minor difficulty in converting unix
time to TAI time; probably it's not as large as the difficulty in
converting VMS time (for example) to whatever other
platform-independent standard we were going to agree on.

When you have one platform-independent standard, you necessarily
accept that there are going to be conversion difficulties from the
native time format to the standard format.  Converting from unix time
to TAI is one of the smaller such difficulties.

We could hack Dan's library so that it carries the leap second table
internally, and only tries to fall back to the file when the date is
out of range of the internal table.  

I was about to start a discussion of what this would mean for calls
like

localtime(20)

and then I realized that, as usual, I have no idea what the RFC is
actually proposing.






Re: RFC 138 (v1) Eliminate =~ operator.

2000-08-23 Thread Mark-Jason Dominus


It seems to me that there are at least two important things missing
from this proposal.

1. There is no substantive rationale presented for why the change
   would be desirable.

The only reasons you put forth are:

  * The syntax is ugly and unintuitive.

Ugliness is a matter of opinion, and I don't think it has a place
here.  Anyone else could simply reply, "Well, I think the =~ notation
is beautiful and elegant, and your notation is ugly and clumsy," and
there is no arguing with this point of view.

Intuition varies from person to person.  I think the proposal would be
stronger if you would discuss some specific technical problems with
the existing notation.  Normally when we say that a notation is
'unintuitive' what we mean is that it works differently from the way
people expect it to, so that they use it incorrectly.  You have not
provided any examples of how =~ is used incorrectly.

  * It performs a function that is semantically no different from
other forms of argument passing.  

The same could be said for any operator, including +, and in fact some
languages do treat + as a function whose operands are passed as
arguments.  For example, in Lisp,

(my-function arg1 arg2 arg3)

and

(+ arg1 arg2 arg3)

are syntactically identical.  Since your argument here applies as well
to +, -, ->, etc., it is not clear why your proposal is for =~ and not
for +, -, ->, also.  

I think you should add some sections to the proposal explaining what
the benefits of your proposed change would be.



The other thing that I think is missing from the proposal is a
discussion of precedence issues.  For example, you did not say what

/pat/ $x . $y ;

would do.  Is it equivalent to 

/pat/ ($x . $y) ;

or to

(/pat/ $x) . $y ;

?
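
For comparison, the existing notation does have a documented answer to this question: in Perl 5, =~ binds more tightly than the . operator. A small sketch with invented strings:

```perl
use strict;
use warnings;

my ($x, $y) = ('foo', 'bar');

# Parsed as $x . ($y =~ /oba/): the match is tried against 'bar' alone,
# fails, and contributes an empty string.
my $tight = $x . $y =~ /oba/;

# Parentheses are needed to match against the concatenation 'foobar'.
my $loose = ($x . $y) =~ /oba/;
```

Whatever the proposal chooses, it would need to state its rule as explicitly as perlop does for =~.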

I also worry that there may be some lexical issues lurking here.
Are you sure that it's never ambiguous whether a particular / will
indicate the start of a pattern match or a division operator?  I would
like to see some discussion of this.

I have several other complaints (I think you should either remove the
wacky ideas, or treat them fully) but these are my main worries about
the proposal.



Summary of regex-related RFCs so far

2000-08-23 Thread Mark-Jason Dominus


Several RFCs have been issued that relate to regexes or pattern
matching but which predate the perl6-language-regex list.  I have
asked the librarian to transfer ownership of these RFCs to this list.
In the meantime, here is a summary of the outstanding regex-related
RFCs:

72 (v1): The regexp engine should go backward as well as forward. 

It is proposed that the regular expression engine should be
designed so that, when it is in an accepting state, it will
either consume the character after the end of the currently
matching part of the target string, or the character just
before its beginning, depending on instructions embedded in
the regexp itself.

93 (v1): Regex: Support for incremental pattern matching 

This RFC proposes that, in addition to strings, subroutine
references may be bound (with =~ or !~ or implicitly) to a
regular expression. 

110 (v1): counting matches 

Provide a simple way of giving a count of matches of a pattern.

112 (v1): Assignment within a regex 

Provide a simple way of naming and picking out information
from a regex without having to count the brackets.

135 (v1): Require explicit m on matches, even with ?? and // as delimiters. 

C<??> and C<//> are what makes Perl hard to tokenize.
Requiring them to be written C<m??> and C<m//> would
solve this.





Re: RFC 138 (v1) Eliminate =~ operator.

2000-08-23 Thread Mark-Jason Dominus


> I'm not concerned about / being mistaken for division, since that
> ambiguity already exists with bare /pat/ matches. 

Yes, but the current ambiguity is resolved from context in a rather
complicated way.  Nevertheless it turns out that Perl does the right
thing in most cases.  You are proposing to change the context, and
it's not clear that the result will be the right thing as often as in
the past.

It may turn out that the new notation really does have exactly the
same ambiguities, but that's not clear to me now.  All I said was that
I would like to see some discussion of it.




Re: RFC 158 (v1) Regular Expression Special Variables

2000-08-25 Thread Mark-Jason Dominus


> There's also long been talk/thought about making $& and $1 
> and friends magic aliases into the original string, which would
> save that cost.

Please correct me if I'm mistaken, but I believe that that's the way
they are implemented now.  A regex match populates the ->startp and
->endp parts of the regex structure, and the elements of these items
are byte offsets into the original string.  
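
In Perl 5.6 and later these offsets are visible from Perl code as the @- and @+ arrays, so the point can be seen without reading the C source. A sketch with an invented string:

```perl
use strict;
use warnings;

my $str = 'hello world';
my ($whole, $group);
if ($str =~ /o (w\w+)/) {
    # $-[0] and $+[0] delimit the whole match, $-[1] and $+[1] group 1.
    $whole = substr($str, $-[0], $+[0] - $-[0]);   # same text as $&
    $group = substr($str, $-[1], $+[1] - $-[1]);   # same text as $1
}
```

The matched text is recovered from offsets alone, with no copy made at match time.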




Re: RFC 144 (v1) Behavior of empty regex should be simple

2000-08-24 Thread Mark-Jason Dominus


> >I propose that this 'last successful match' behavior be discarded
> >entirely, and that an empty pattern always match the empty string.
> 
> I don't see a consideration for simply s/successful// above, which
> has also been talked about.  

Thanks, I will add this to the next version.  I did consider that, and
I rejected it.  Here's my thinking: s/successful// does make the
feature somewhat more useful, but (a) all those uses are more easily
accomplished with qr() these days, and (b) it's still an
action-at-a-distance effect, which means that it's fragile and that
the behavior of working code can change suddenly and surprisingly when
it is modified.
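
A sketch of the action-at-a-distance effect and of the qr() alternative, using invented strings:

```perl
use strict;
use warnings;

'abc' =~ /b/ or die "setup match failed";

# The empty pattern silently reuses /b/, the last successful pattern,
# so this does not match even though an empty pattern "should".
my $surprise = ('xyz' =~ //) ? 1 : 0;

# With qr() the pattern being reused is named where it is used.
my $pat      = qr/b/;
my $explicit = ('xyz' =~ $pat) ? 1 : 0;
```

Both tests fail to match 'xyz', but only the second one says why at the point of use; inserting another successful match before the first would silently change its meaning.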

If you have remarks about this topic that you think are missing,
please do let me know.




Re: RFC 158 (v1) Regular Expression Special Variables

2000-08-25 Thread Mark-Jason Dominus


> >Please correct me if I'm mistaken, but I believe that that's the way
> >they are implemented now.  A regex match populates the ->startp and
> >->endp parts of the regex structure, and the elements of these items
> >are byte offsets into the original string.  
> 
> I haven't looked at it at all, and perhaps that's something Ilya
> did when creating @+ etc.  So you might be right.  

As far as I know it's the same in 5.000.

I thought the problem with $& was that the regex engine has to adjust
the offsets in the startp/endp arrays every time it scans forward a
character or backtracks a character.  

But maybe the effect of $& is greatly exaggerated or is a relic from
perl4?  Has anyone actually benchmarked this recently?




Re: RFC 158 (v1) Regular Expression Special Variables

2000-08-25 Thread Mark-Jason Dominus


> But maybe the effect of $& is greatly exaggerated or is a relic from
> perl4?  Has anyone actually benchmarked this recently?

Matching with $& enabled is about 40% slower.

http://www.plover.com/~mjd/perl/amper.pl




Re: RFC 145 (v1) Brace-matching for Perl Regular Expressions

2000-08-24 Thread Mark-Jason Dominus


> What exactly is matched by \g and \G is controlled by two new special
> variables, @^g and @^G, which are arrays of strings. 

These sorts of global variables have been a problem in the past.
Since they change the meaning of the \g and \G escapes, I think they
should be pragmas or some other declaration that has a lexical scope.

This puzzle actually pops up in your RFC:

It is a run-time error to compile a regular expression that
contains \g or \G while the @^g and @^G arrays do not contain
the same number of elements.

If it is a run-time error to compile a regex, that means that the
regex compilation is occurring at run time.  That is a recipe for very
slow regexes.  Regex compilation needs to happen at compile time
except in special cases.

If the declarations have lexical scope, then Perl will be able to
optimize regexes that contain \g and \G.  With @^G and @^g global
variables, every regex that uses \g and \G will have to explicitly
examine the global variables every time it wants to match \g or \G,
because the values of @^g and @^G will not be known until run time and
might vary.  But if you have a lexically scoped declaration instead of
a global variable, then Perl will be able to compile \g as if you had
said [()] or whatever, and \G similarly.  This will make the regex
engine run faster.

(As a side note, there is no such variable as $^g, so you will have to
think of something else to call it.  Perhaps ${^Group_Open} and
${^Group_Close}?)

(Also, the \G escape already has a meaning in Perl 5, so it would
probably be better to think of some other name.)

> =head1 PROBLEMS
> 
> How should a \G without a prior \g be interpreted in a regular expression?

I don't think that's a big problem.  One reasonable option is to make
it a compile-time error:

\G without preceding \g in pattern at ...

So presumably Larry will be able to think of other reasonable
behaviors also.

The big problem I see that you didn't address is that you didn't say
what would happen when the target string contains mismatched
parentheses.

Your example was:

$string = "([b - (a + 1)] * 7)";
$string =~ /\g.*?\G/;

Now here \g matches the "(" and sets up \G so that \G will only match
the corresponding ")".  Then .*? matches "[b - (a + 1)] * 7" and \G
matches the ")".  

Now suppose the string were 

$string = "(b - a + 1] * 7)";
$string =~ /\g.*?\G/;

Now what happens here?  \g matches "(" and sets up \G so that \G will
only match the corresponding ")".  Then what?  I'm not sure from your
proposal.  

Your later example (in the 'implementation' section) suggests that '['
and ']' are ignored once \g matches a '('.  If that is true, then in
the example above, the .*?  would match "b - a + 1] * 7".  I think
this won't be what people will want from \g...\G.  We are still going
to get a lot of questions from people asking how to tell if the
delimiters in a string are balanced.
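
For reference, the question people actually ask can be answered today without any regex extension. The following is a sketch only: balanced is a hypothetical function, and the set of delimiters is an assumption.

```perl
use strict;
use warnings;

# Return 1 if every (, [ and { in $s is closed by the right character
# in the right order, 0 otherwise.
sub balanced {
    my ($s) = @_;
    my %close_for = ('(' => ')', '[' => ']', '{' => '}');
    my %is_close  = map { $_ => 1 } values %close_for;
    my @stack;                        # closers we are still waiting for
    for my $ch (split //, $s) {
        if (exists $close_for{$ch}) {
            push @stack, $close_for{$ch};
        }
        elsif ($is_close{$ch}) {
            return 0 unless @stack && pop(@stack) eq $ch;
        }
    }
    return @stack ? 0 : 1;
}
```

It accepts "([b - (a + 1)] * 7)" and rejects "(b - a + 1] * 7)", the mismatched string from the example above.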

(Side note: I'm not sure why you used .*? here instead of .*, since as
I understand your proposal, .* would have done the same thing.  I
suggest that you change .*? to .* or else add a remark about why this
would be different.)

Another ambiguity in your proposal:  You want

[\g]

to match any single open delimiter character.  But then later on you have
an example where @^g contains the string "/*".  What would [\g] do in
this case?

>As it continues scanning, it encounters the "]" between the "f" and the
>")". The \G does not match this "]" character, because the \g must match
>a ")".

You mean \G here instead of \g, don't you?

> sub parse
> {
> my $string = shift;
> while ($string =~ /([^\g])*(\g)(.*?)(\G)([^\g\G]*)/g)

Don't you mean ([^\g]*) instead of ([^\g])* here?





Re: RFC 110 (v3) counting matches

2000-08-28 Thread Mark-Jason Dominus


> Drawing on some of the proposals for extended 'for' syntax:
>   for my($mo, $dy, $yr) ($string =~ /(\d\d)-(\d\d)-(\d\d)/g) {
> ...
>   }
> 
> This still requires that you know how many () matching groups are in
> the RE, of course.  I don't think I would consider that onerous.

If the regex is fixed at compile time, you can simply count.  But if
the regex varies at run time, it's not only onerous, it's pretty near
to impossible.



Re: RFC 110 (v3) counting matches

2000-08-28 Thread Mark-Jason Dominus

> > 1. Return the number of matches
> > 
> > 2. Iterate over each match in sequence
> > 
> > 3. Return list of all matches
> > 
> > 4. Return a list of backreferences
> 
> Please see RFC 164. It can handle all of 1-3. 

You seem to have missed my point.  I'm not asking for a notation that
can do all these four things.  We have such a notation already.

I'm asking for a notation that does these things *orthogonally* and
*consistently*.  

As nearly as I can tell RFC164 doesn't address this at all.
It's basically syntactic sugar for the same mess we have now.

If I am mistaken, please correct me.




Re: RFC 110 (v3) counting matches

2000-08-28 Thread Mark-Jason Dominus


> > $count = () = $string =~ /pattern/g;
> 
> Which I find cute as a demonstration of the Perl's context concept,
> but ugly as hell from usability viewpoint.  

I'd really like to see an RFC that looks into making the following
features more orthogonal:

1. Return the number of matches

2. Iterate over each match in sequence

3. Return list of all matches

4. Return a list of backreferences
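
For reference, this is roughly how each of the four is spelled in
Perl 5 as it stands.  It is only a sketch; the sample string and the
variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $string = "04-23-64 02-13-62 02-01-99 05-13-18 08-10-99";

# 1. Number of matches: a list-context /g match assigned to an empty
#    list, with that assignment then evaluated in scalar context.
my $count = () = $string =~ /\d\d-\d\d-\d\d/g;

# 2. Iterate over each match in sequence: scalar context with /g.
my @starts;
while ($string =~ /\d\d-\d\d-\d\d/g) {
    push @starts, $-[0];              # offset where this match began
}

# 3. List of all matches: list context, /g, no capturing groups.
my @all = $string =~ /\d\d-\d\d-\d\d/g;

# 4. List of backreferences: list context, no /g, capturing groups.
my ($mo, $dy, $yr) = $string =~ /(\d\d)-(\d\d)-(\d\d)/;

print "count=$count first=$all[0] mo=$mo dy=$dy yr=$yr\n";
```

Note that which of the four you get depends on three separate things:
the context, the /g flag, and whether the pattern has parentheses.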


Perl presently uses various combinations of /g and scalar/list context
to get these.  But some useful variants are missed.  For example,
suppose you have a string like this:

"04-23-64 02-13-62 02-01-99 05-13-18 08-10-99"

You can run a loop once for each date:

while ($string =~ /\d\d-\d\d-\d\d/g) {
  ...
}

You can also extract the month-day-year parts of the first date:

($mo, $dy, $yr) =  ($string =~ /(\d\d)-(\d\d)-(\d\d)/);

But there is no convenient way to run the loop once for each date and
split the dates into pieces:

# WRONG
while (($mo, $dy, $yr) = ($string =~ /(\d\d)-(\d\d)-(\d\d)/g)) {
  ...
}

This is an infinite loop.  It sets $mo $dy $yr to 04 23 64, repeatedly.

One solution here is:

while ($string =~ /\d\d-\d\d-\d\d/g) {
  ($mo, $dy, $yr) = ($& =~ /(\d\d)-(\d\d)-(\d\d)/);
  ...
}

Not only do you have to use $&, but you also have to write the pattern
twice.  

Another solution:

@matches = ($string =~ /(\d\d)-(\d\d)-(\d\d)/g);
while (@matches) {
  ($mo, $dy, $yr) = splice @matches, 0, 3;
  ...
}

This is clumsy, and it doesn't work unless you know in advance how
many backreference groups the pattern will contain.  (Perl knows, and
this number is part of the struct regexp, but there is no way to get
Perl to tell you.)
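
A runnable version of the splice approach, as a sketch.  The output
format and the @dates array are my own additions; the hard-wired 3 is
the weakness just described.

```perl
use strict;
use warnings;

my $string = "04-23-64 02-13-62 02-01-99 05-13-18 08-10-99";

# List-context /g returns every backreference of every match, flattened
# into one list: (04, 23, 64, 02, 13, 62, ...).
my @matches = ($string =~ /(\d\d)-(\d\d)-(\d\d)/g);

my @dates;
while (@matches) {
    # The 3 must agree with the number of groups in the pattern above.
    my ($mo, $dy, $yr) = splice @matches, 0, 3;
    push @dates, "$yr/$mo/$dy";
}

print "$_\n" for @dates;
```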

My wish list for better orthogonality is actually a little longer than
the four items above, but the other items are more abstruse.




Re: RFC format

2000-08-29 Thread Mark-Jason Dominus


Nat Torkington writes:
> Mark-Jason Dominus writes:
> > RFC should have a section that addresses the feasibility of
> > translating perl5 to perl6 code if the proposed change is adopted.
> > This section should be required.
> 
> I agree.
> 
> Ziggy, want to patch the sample RFC and the RFC format document?

Since you haven't had a chance to do this yet, I thought it might help
if I supplied a patch.

--- rfc-format.html 2000/08/29 16:33:46 1.1
+++ rfc-format.html 2000/08/29 16:35:30
@@ -44,7 +44,7 @@
 Format
 
 RFCs are written in POD.  rfc-sample.pod is a sample. The important sections are: TITLE, VERSION, ABSTRACT,
-DESCRIPTION, IMPLEMENTATION, and REFERENCES. An optional section is STATUS.
+DESCRIPTION, IMPLEMENTATION, TRANSLATION, and REFERENCES. An optional section is STATUS.
 
 
 A description of each section follows:
@@ -117,6 +117,16 @@
 
 Discussion of the possible implementations. This doesn't have to be
 completely defined down to the char *, instead enough to show that it can be done.
+
+
+
+TRANSLATION
+
+Discussion of the issues involved in translating old Perl 5 code to
+Perl 6 code.  If a Perl 5 feature is being eliminated, can it be
+emulated in Perl 6?  If a feature is being changed, can the old
+behavior be achieved with the new feature?  Remember that it must be
+possible to perform the translation automatically.
 
 
 
--- rfc-sample.pod  2000/08/29 16:38:41 1.1
+++ rfc-sample.pod  2000/08/29 16:38:13
@@ -47,6 +47,17 @@
 new model of signal handling which would make it difficult to reuse
 algorithms and code for systems programming from C.
 
+=head1 TRANSLATION
+
+In the 'Checkpointing' scenario, Perl 5 code would run without change.
+
+In the 'Event Loop' scenario, Perl would be supplied with a module,
+possibly Signal.pm, which provided a magical %SIG hash which would
+emulate the old behavior.  Installing a handler into %SIG would
+actually register an event handler with Perl's event loop.  Using %SIG
+would automatically load this module, similar to the way Errno.pm is
+loaded automatically when %! is used.
+
 =head1 REFERENCES
 
   RFC 6: "Standard Event Loop"



Re: RFC 166 (does-not-match)

2000-08-29 Thread Mark-Jason Dominus


> This is going to need a much better definition...

Yes, that was my point.

I snipped the following discussion, in which you argued against a
suggestion that I advanced only as an example of something that would
not work.

> (?^baz) should behave as (.*)(?{$1 !~ /baz/})

I don't think that's going to do it.  Consider this pattern:

/foo(?^baz)baz/

Here I am trying to match strings like "foobarbaz" and "foo---baz"
that have a foo and a baz separated by something else that is not a
baz.  But with your definition, 

"foobazbaz" =~ /foo(?^baz)baz/

is true, when I wanted it to be false.  This is because the (?^baz)
matches the empty string after the 'o', and the "baz" in the pattern
matches the first baz in the string, instead of the second one.  
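
The same surprise shows up with the nearest thing Perl 5 has today.
The sketch below uses the negative-lookahead idiom as a stand-in for
"something that contains no baz".  That is my substitution, not
Richard's proposed semantics, and the sub names are made up.

```perl
use strict;
use warnings;

# Stand-in for "something that contains no baz": consume one character
# at a time, checking each time that 'baz' does not start here.
my $no_baz = qr/(?:(?!baz).)*/;

sub unanchored { $_[0] =~ /foo${no_baz}baz/       ? 1 : 0 }
sub anchored   { $_[0] =~ /\Afoo${no_baz}baz\z/   ? 1 : 0 }

for my $s ("foobarbaz", "foo---baz", "foobazbaz") {
    printf "%-10s unanchored=%d anchored=%d\n",
        $s, unanchored($s), anchored($s);
}
```

Unanchored, "foobazbaz" still matches: the middle part matches the
empty string and the literal baz takes the first baz.  Only when the
pattern is pinned to both ends of the string does it fail.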

> I think one should outlaw .* before or after a (?^foo) construct, as
> the result is meaningless.

As it stands now the whole notion is meaningless, because you have not
given it a meaning.  

Can you provide a detailed explanation of just what is and what is not
outlawed?  I presume that .+ is also forbidden.  What about a*, .?,
.{3}, etc.?

I wonder if this restriction is really necessary?

> I can tighten the definition up.  If there have been calls for a 
> (?^baz) type construct before, there will be again.  It is a matter of
> getting the definition straightforward and useable.

Yes, I agree completely.  I am looking forward to the next version of
your RFC.




Re: RFC 110 (v3) counting matches

2000-08-29 Thread Mark-Jason Dominus


OK, I think this discussion should be closed.

Richard should add a section to RFC110 that discusses the

$count = () = m/PAT/g;

locution and its advantages and disadvantages compared to his
proposal, duly taking into account the many valuable comments that
have been made.

Thanks to everyone who participated in the discussion.




Proposal for IMPLEMENTATION sections

2000-08-29 Thread Mark-Jason Dominus


The IMPLEMENTATION section of the RFC is supposed to be mandatory, but
there have been an awful lot of RFCs posted that have missing or
evasive IMPLEMENTATION sections.  I found more than 39% of all RFCs
have a missing or incomplete implementation section.  

Here are the results of my survey.  Of 166 total RFCs: (numbers 1-167,
except #41)

RFCs: 24 25 69 70 80 81 106 128 132 147 148 159 164

These 13 ( 8%) had very brief IMPLEMENTATION sections that
didn't contain any substantive discussion.  In these cases I
judged that an implementation section would have been
desirable.  Some RFCs do not need implementation sections.  I
have enumerated these separately below.

In some cases the section was actually flippant.  #147 is a
good example here.

RFCs: 21 26 62 84 88 110 112 131 136 137 140 149 162 165 166

These 15 ( 9%) had no IMPLEMENTATION section at all.  I was
surprised that the librarian had even accepted these, since
that section is not described as 'optional' in the RFC format
document.

RFCs: 97 100

These 2 ( 1%) said that implementation discussion was beyond
the scope of the RFC, which I don't understand, since it
clearly *is* part of the scope of the RFC.


RFCs: 8 12 21 23 31 40 53 54 55 58 59 72 73 93 103 104 120 133 134 150 167

These 21 (13%) contained remarks about the author's ignorance.

For example:
#53: "Dammit, Jim, I'm a doctor, not an engineer!"
#93: "I'll leave that to the internals guys. :-) "
#40: "I've no real concrete ideas on this, sorry."


RFCs: 5 6 14 30 39 45 64 75 87 89 109 113 115 160

These 14 ( 8%) contain IMPLEMENTATION sections, but do not
actually discuss implementation.  Instead, they contain more
or less detailed discussions of the *interfaces* to the
proposed new features.

I recommend that a change be made to the RFC metadocuments to
make the purpose of the implementation section clearer.

This makes a total of 65 (39%) that have missing or bogus
implementation sections.

Of the remainder:

RFCs: 1 2 3 4 10 11 13 17 18 19 22 27 32 35 36 37 38 42 43 44 46 47 48
  49 50 51 52 56 57 60 61 63 65 66 67 71 78 79 82 83 85 86 90 92
  95 96 98 99 108 111 116 117 119 121 123 124 129 130 135 138 139
  142 143 145 146 151 152 153 154 155 156 157 158 161 163

These 75 (45%) appeared to contain explanations of
implementation issues.  In some cases the discussion seemed
clearly deficient, but that's not the problem I'm trying to
address in this message.  I did not try to judge whether or
not the discussion was cogent or to the point, but only
whether a good-faith effort had been made to identify and
discuss the issues.


RFCs: 7 16 33 34 68 74 76 77 91 94 102 107 114 118 121 144

These 16 (10%) said something along the lines of "The
implementation should be straightforward."  I did not try to
judge whether this was actually true.  

RFCs: 9 28 29 101 105 125 126 127 141

These  9 ( 5%) don't contain any substantive discussion of
implementation issues, because it is not appropriate or
necessary.   For example, #125 is "Components in the Perl Core
Should Have Well-Defined APIs and Behavior" and #28 is "Perl
should stay Perl".



Summary:

Have implementation section:            75  (45%)
Should have implementation but do not:  65  (39%)
"Implementation is straightforward":    16  (10%)
Don't need implementation section:       9  ( 5%)


I don't think this is a good thing.  People are proposing all sorts of
stuff without thinking even a little bit about how it might be
implemented.  I think the proposals might be more carefully thought
through if the proposers were not allowed to evade thinking about
implementations.

Not everyone knows enough about Perl's internal design or about
programming design generally to be able to consider the issues.  I
suggest that these people should write to the appropriate working
group chair and ask to be put in touch with someone who can help them
with the internals sections of their RFC.  Then they can work out some
of the details together and we might avoid some of the more obviously
half-baked suggestions.




Re: Proposal for IMPLEMENTATION sections

2000-08-29 Thread Mark-Jason Dominus


> > These 13 ( 8%) had very brief IMPLEMENTATION sections that
> > didn't contain any substantive discussion.  
> > 
> > These 21 (13%) contained remarks about the author's ignorance.
> > 
> > These 15 ( 9%) had no IMPLEMENTATION section at all.  
> 
> The distinction between these three cases is arbitrary and trivial,
> being as they are more a reflection of the authors' tastes.

No, that is not true.  The distinction between the first two groups is
trivial.  The third group is a group of RFCs that were published even
though the supposedly required section was omitted.  

I mentioned the remarks about the authors' ignorance because it seemed
to me that these were people who might have appreciated being hooked
up with someone who could help make their RFCs stronger.

> I wish you had applied the standard more evenly; imho, 97 & 100 had
> good reasons for their cursory treatments of implementation.

Sorry.  I should have put in a disclaimer that I did the survey very
quickly and I didn't try to be consistent.  The important point of the
survey was that many RFCs that should have implementation sections
lack them.  The details about why the section was omitted are there
mostly to pander to curiosity.

I'd like to amend my proposal.  Suppose that the librarian *suggests*
that RFC authors contact the WG chair when they submit RFCs that omit
the implementation section?  That way nobody is forced to do anything,
and many people might be grateful for the service.




Re: RFC 165: Allow variables in a tr///

2000-08-29 Thread Mark-Jason Dominus


> Would there be any interest in adding these two ideas to this RFC:
> 
> 1) tr is not regex function, so it should be regularized to
> 
>tr(SEARCH, REPLACE, MOD, STR)

MOD should be last, because you're frequently going to want to omit MOD.  

But I think this is worth discussing further, because it neatly
accomplishes the goal of the RFC in a straightforward way:

tr('a-z', 'A-Z', $str)

replaces a-z with A-Z, and

tr($foo, $bar, $str)

replaces the characters from $foo with the characters from $bar.
No special syntax is necessary.

People might even stop writing things like

tr/[a-z]/[A-Z]/

if we did that.



Re: RFC 165 (v1) Allow Varibles in tr///

2000-08-29 Thread Mark-Jason Dominus


> =head1 IMPLENTATION
> 
> No idea, but should be straight forward.

I think the reason this hasn't been done before is because it's *not*
quite straightforward.

The way tr/// works is that a 256-byte table is constructed at compile
time that says, for each input character, what output character is
produced.  Then when it's time to apply the tr/// to a string, Perl
iterates over the string one character at a time, looks up each
character in the table, and replaces it with the corresponding
character from the table.

With tr///e, you would have to generate the table at run-time.

This would suggest that you want the same sorts of optimizations that
Perl applies when it encounters a regex that contains variables:

1. Perl should examine the strings to see if they have changed
   since the last time it executed the code

2. It should rebuild the tables only if the strings changed

3. There should be a /o modifier that promises Perl that the
   variables will never change.

The implementation could be analogous to the way m/.../o is
implemented, with two separate op nodes: One that tells Perl
'construct the tables' and one that tells Perl 'transform the
string'.  The 'construct the tables' node would remove itself from the
op tree if it saw that the tr//o modifier was used.
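
To make the idea concrete, here is a sketch in Perl of a table-driven
transliteration that rebuilds its table only when the search or
replacement string changes (points 1 and 2 above).  Every sub name
here is hypothetical.  It handles byte strings only, ignores all the
tr/// modifiers, and is not how the internals are actually written.

```perl
use strict;
use warnings;

# Expand a tr-style set such as 'a-z' into a list of byte values.
sub expand_set {
    my ($spec) = @_;
    my @bytes = map { ord($_) } split //, $spec;
    my @out;
    while (@bytes) {
        my $lo = shift @bytes;
        if (@bytes >= 2 && $bytes[0] == ord('-')) {
            shift @bytes;                    # discard the '-'
            my $hi = shift @bytes;
            push @out, $lo .. $hi;
        }
        else {
            push @out, $lo;
        }
    }
    return @out;
}

# Build the 256-entry table: index is input byte, value is output byte.
sub build_table {
    my ($from, $to) = @_;
    my @src = expand_set($from);
    my @dst = expand_set($to);
    @dst = @src unless @dst;                 # empty replacement: map to self
    my @table = (0 .. 255);                  # start from the identity map
    my %seen;
    for my $i (0 .. $#src) {
        next if $seen{ $src[$i] }++;         # first mention wins, as in tr
        $table[ $src[$i] ] = $i <= $#dst ? $dst[$i] : $dst[-1];
    }
    return \@table;
}

my ($cached_from, $cached_to, $cached_table);
my $rebuilds = 0;
sub rebuild_count { return $rebuilds }

# Transliterate $str, rebuilding the table only if the sets changed.
sub tr_runtime {
    my ($from, $to, $str) = @_;
    unless (defined $cached_table
            && $from eq $cached_from && $to eq $cached_to) {
        $cached_table = build_table($from, $to);
        ($cached_from, $cached_to) = ($from, $to);
        $rebuilds++;
    }
    return join '', map { chr( $cached_table->[ ord($_) ] ) } split //, $str;
}

print tr_runtime('a-z', 'A-Z', 'hello'), "\n";
print tr_runtime('a-z', 'A-Z', 'world'), "\n";   # reuses the cached table
print "tables built: ", rebuild_count(), "\n";
```

A /o-style promise would amount to skipping the two string comparisons
after the first call.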





Re: RFC 165: Allow variables in a tr///

2000-08-29 Thread Mark-Jason Dominus


> When does the structure get built?  That's why eg. tr[a-z][A-Z] 
> brooks no variables, for it is solely at compile time that these
> things occur, and why you must resort to delayed compilation via
> eval qq/.../ to prod the compiler into building you a new one.

Certainly.   But if there were no variable interpolation for regexes,
you could make the same argument about regexes.  I don't see any
reason why the regex solution couldn't or shouldn't be extended to
tr/// also.  If the pattern and replacement sets contain variables,
then table construction can be deferred until run time; if there are
no variables, the table is computed at compile time.

Building a tr/// table is much much simpler and much less work than
compiling a regex, but we don't make people write

eval " \$s =~ m/$pat/ "

when they want to interpolate a string into a regex at run time.
Instead, we take care of it transparently.  tr/// could easily be made
to work the exact same way.
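
Here is the contrast as it stands in Perl 5 today, as a small sketch.
The variable names are only for illustration.

```perl
use strict;
use warnings;

my ($from, $to) = ('a-z', 'A-Z');
my $pat         = 'l+';
my $s           = "hello";

# A regex picks up the run-time value of $pat with no extra ceremony.
my $found = $s =~ m/$pat/ ? 1 : 0;

# tr/// does not interpolate, so the sets have to be spliced into
# source text and compiled with eval.  (Unsafe if $from or $to can
# contain '/' or come from an untrusted source.)
my $changed = eval "\$s =~ tr/$from/$to/";
die $@ if $@;

print "found=$found changed=$changed s=$s\n";
```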

> Maybe you want qt/.../.../ or something.

I don't think a new notation is necessary in this case.  All that's
needed is a small extension to the existing semantics, in a direction
that has already been thoroughly investigated.






Re: RFC 165: Allow variables in a tr///

2000-08-29 Thread Mark-Jason Dominus


> One thing to be careful of there is thread safety.  You can't hand
> the data off the syntax node (the one with the tr op on it), because
> tr/$foo/$bar/ wouldn't work for several threads in it at the same
> time then.

Certainly, but that is true for everything else that is in the op
node, which includes the pattern in m/.../o.  

One of my hopes is that the Perl 6 internals will fix this
long-standing error, in which case the solution they adopt will apply
to tr///e in the same way that it will to m//o and ?? and X...Y
and all the rest.




Re: RFC 110 (v2) counting matches

2000-08-29 Thread Mark-Jason Dominus


> /t is suggested for "counT", as /c is already taken.  Using /t
> without /g would be result in only 0 or 1 being returned, which is
> nearly the existing syntax.

It occurs to me that since none of the capital letters are taken, we
could adopt the convention that a capital letter as a regex modifier
will introduce a *word* which continues up to the next comma.  So for
example:


m/.../Count   (instead of m/.../t)
m/.../iCount  (instead of m/.../it)
m/.../Count,i (instead of m/.../ti)
m/.../Count,Insensitive   (instead of m/.../ti)

That would escape the problem that we are running out of letters and
also the problem that the current letters are hard to remember.
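
The convention is easy to parse.  Here is a sketch of a splitter for
such modifier strings; parse_modifiers() is a made-up name, and the
rule that single lowercase letters remain single-letter flags is my
reading of the examples above.

```perl
use strict;
use warnings;

# Hypothetical: split a modifier string into individual flags.
# A capital letter starts a word that runs to the next comma;
# a lowercase letter is a one-letter flag, as it is today.
sub parse_modifiers {
    my ($mods) = @_;
    my @flags;
    while (length $mods) {
        if ($mods =~ s/^([A-Z][^,]*),?//) {
            push @flags, $1;
        }
        elsif ($mods =~ s/^([a-z])//) {
            push @flags, $1;
        }
        else {
            die "bad modifier string at '$mods'\n";
        }
    }
    return @flags;
}

for my $m ('Count', 'iCount', 'Count,i', 'Count,Insensitive') {
    print "$m => ", join(' + ', parse_modifiers($m)), "\n";
}
```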





Re: RFC 110 (v3) counting matches

2000-08-29 Thread Mark-Jason Dominus


> On Mon, 28 Aug 2000, Mark-Jason Dominus wrote:
> 
> > But there is no convenient way to run the loop once for each date and
> > split the dates into pieces:
> > 
> > # WRONG
> > while (($mo, $dy, $yr) = ($string =~ /(\d\d)-(\d\d)-(\d\d)/g)) {
> >   ...
> > }
> 
> What I use in a script of mine is:
> 
> while ($string =~ /(\d\d)-(\d\d)-(\d\d)/g) {
> ($mo, $dy, $yr) = ($1, $2, $3);
> }
> 
> Although this, of course, also requires that you know the number of
> backreferences. 

The real problem I was trying to discuss was not this particular
application.  I was trying to point out a larger problem, which is
that there are several regex features that are enabled or disabled
depending on what context the match is in, so that if you want one
scalar-context feature and one list-context feature at the same time,
there is no direct way to do it.

> Nicer would be to be able to assign from @matchdata or something
> like that :)

I agree.  There are many operations that would be simpler if there was
a magic array that contained ($1, $2, $3, ...).  If anyone wants to
write an RFC on this, I will help.
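
Something close to this can be pieced together from the @- and @+
offset arrays, assuming a Perl that has them (5.6.0 and later, as far
as I know).  This is only a sketch; each_match() is a made-up name,
not an existing builtin.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback once per match of $re in $string,
# passing every backreference, however many groups $re has.
sub each_match {
    my ($string, $re, $callback) = @_;
    while ($string =~ /$re/g) {
        # $#+ is the number of groups in the last successful match;
        # @- and @+ hold the start and end offsets of each group.
        my @groups = map {
            defined $-[$_] ? substr($string, $-[$_], $+[$_] - $-[$_]) : undef
        } 1 .. $#+;
        $callback->(@groups);
    }
}

my $string = "04-23-64 02-13-62 02-01-99 05-13-18 08-10-99";
my @rows;
each_match($string, qr/(\d\d)-(\d\d)-(\d\d)/, sub { push @rows, [@_] });
print join("/", @$_), "\n" for @rows;
```

The caller never mentions the number of groups, so the same loop works
for a pattern that is only known at run time.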




Re: RFC 110 (v2) counting matches

2000-08-29 Thread Mark-Jason Dominus


> On Tue, 29 Aug 2000 08:47:25 -0400, Mark-Jason Dominus wrote:
> 
> >m/.../Count,Insensitive   (instead of m/.../ti)
> >
> >That would escape the problem that we are running out of letters and
> >also the problem that the current letters are hard to remember.
> 
> Yes, but wouldn't this give us backward compatibility problems? For
> example, code like
> 
>   $result = m/(.)/Insensitive, ord $1;

No, because that is presently a syntax error.  The one you have to
watch out for is:

$result = m/(.)/s,Insensitive, ord $1;

> And, I don't really see the need for the comma.
> 
> m/.../CountInsensitive   (instead of m/.../ti)

I guess, but to me CountInsensitive looks like one option, not two.




Re: RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()

2000-08-29 Thread Mark-Jason Dominus


> Make your suggestions. But I think it is all off-base. None of this  is
> addressing some improvement in working conditions, ease of use, problems
> in the language, etc.

1. I don't agree.

2. This mailing list is also for discussing stylistic improvements to
   the language.  

3. If you think people are talking about the wrong things, then you
   should submit your own RFCs on the right things, instead of
   complaining about what other people are doing.  I have not seen any
   RFCs from you.

>  MJD's // killer RFC is a headache.

I would appreciate a clear discussion of why that is.  That is what we
are here for.  If the RFC does not lay out clearly what problem it is
trying to solve, that is a problem with the RFC and it's something we
should discuss on the list.  

However, this comment by itself is not useful.  

> I don't see how this solves an already existing problem. 

I didn't either, and I objected to RFC138 on that basis.  But Larry
said:

# Well, the fact is, I've been thinking about possible ways to get rid
# of =~ for some time now, so I certainly don't mind brainstorming in
# this direction.

So I consider the metasubject (of whether we should be discussing that
topic at all) to be officially closed.




RFC 166 (does-not-match)

2000-08-29 Thread Mark-Jason Dominus


Richard Proctor's RFC166 says:

> =head2 Matching Not a pattern
> 
> (?^pattern) matches anything that does not match the pattern.  On
> its own, one can use !~ etc to negatively match patterns, but to
> match a pattern that has foo(anything but not baz)bar is currently
> difficult.  With this syntax it would simply be /foo(?^baz)bar/.

The problem with this proposal is that it's really unclear what it
means.

The reason we don't have this feature today is not that it has never
been thought of before.  People have thought of this a hundred times.
The problem is that nobody has ever figured out how it should work.
I don't mean that the implementation is difficult. I mean that nobody
understands what such a feature actually means.   Richard doesn't say
this in his RFC, even for the simple examples he raises.  He just
assumes that it will be obvious, but it isn't.  

"foo-bazbar"  =~ /foo(?^baz)bar/    # true or false?
"foo-baz-bar" =~ /foo(?^baz)bar/    # true or false?

OK, I'm going to try to invent a meaning for (?^baz).  I'm going to
choose what appears to be a reasonable choice, and see what happens.

Let's suppose that what (?^baz) means is "match any substring that is
not 'baz'."  That is a reasonably clear meaning.  Then it behaves like
(.*)(?{$1 ne 'baz'}) does today.  Then the examples above are both
true.

Now let's see how that choice works out.

"foobaz" =~ /foo.*(?^baz)/

This is TRUE, because "foo" matches "foo", ".*" matches "baz", and
"(?^baz)" matches the empty string at the end, which is a substring
that is not "baz".

In fact, with this apparently reasonable choice of meaning for
(?^baz), /foo.*(?^baz)/ will match anything that /foo.*/ will.  The
(?^baz) has hardly any effect at all.

It is a good thing that we did not implement it that way, because it
is sure to become an instant FAQ:  "Why does /foo.*(?^baz)/ match
'foobaz'?"  You are going to see this question in comp.lang.perl.misc
every week.

So this choice I made for the meaning of (?^baz) appears to have been
the wrong one. I could go on and make a different reasonable-seeming
choice and show what was wrong with it, but I don't want to belabor my
point, which is:

Every choice anyone has ever made for the meaning of (?^baz) has
always been the wrong one for one reason or another.  So without a
detailed explanation of what (?^baz) might mean, suggesting that Perl
6 have one is not helpful.  



RFC 166 (disambiguator)

2000-08-29 Thread Mark-Jason Dominus


Richard Proctor suggests that (?) will match the empty string. 
Then it can be inserted into regexes to separate elements that need to
be separated.  For example, /$foo(?)bar/ interpolates the value of
$foo and then looks for that pattern followed by 'bar'.   You cannot
simply write /$foobar/ because then Perl tries to interpolate $foobar,
which is not what you wanted.

1. You can already write /${foo}bar/ to get what you wanted.  This
   solution already works inside of double-quoted strings.  (?) would
   not work inside of double-quoted strings.

2. You can already write /$foo(?:)bar/ to get what you wanted.  This
   is almost identical to what Richard proposed anyway.
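
Both existing spellings can be checked directly.  A small sketch; the
value of $foo and the test string are arbitrary.

```perl
use strict;
use warnings;

my $foo = 'ab+';

# Braces delimit the variable name, so 'bar' is literal text.
my $with_braces = "abbbar" =~ /${foo}bar/ ? 1 : 0;

# An empty non-capturing group separates $foo from the literal 'bar'.
my $with_group  = "abbbar" =~ /$foo(?:)bar/ ? 1 : 0;

# The brace form also works in an ordinary double-quoted string.
my $in_string = "${foo}bar";

print "braces=$with_braces group=$with_group string=$in_string\n";
```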

It is really not clear to me that this problem needs to be solved any
better than it is already.

I suggest that this section be removed from the RFC.

Mark-Jason Dominus   [EMAIL PROTECTED]
I am boycotting Amazon. See http://www.plover.com/~mjd/amazon.html for details.




Re: RFC 110 (v3) counting matches

2000-08-29 Thread Mark-Jason Dominus


> >>solution to execute perl code inside a string, replacing "${\(...)}" and
> >
> >The first one doesn't work, and never did.  You want 
> >@{[]} and @{[scalar ]} instead.
> 
> "Doesn't work"?

I think what Tom means is that (for example)

print "${\(localtime())}\n";

does not produce "Tue Aug 29 19:15:55 2000".

Anyway, this is off-topic for this mailing list, so let's put an end
to this part of the discussion unless it relates somehow to regexes.




Re: RFC 110 (v3) counting matches

2000-08-30 Thread Mark-Jason Dominus


> On Tue, 29 Aug 2000, Mark-Jason Dominus wrote:
> 
> > OK, I think this discussion should be closed.
> 
> I think the bit about "having a special array containing all captured
> matches" might well still live on. The "counting" bit _per se_ is probably
> fairly closed, though.

I didn't mean to close the discussion about counting.  The only part
of the discussion that I thought should be closed was the argument 
about whether 

$count = () = m/.../g;

was a good idea, and the following discussion that was all about
context issues and context operators and had nothing to do with regexes.

Sorry that this was unclear.



Re: RFC 165 (v1) Allow Varibles in tr///

2000-08-29 Thread Mark-Jason Dominus


> Accepting variables in tr// makes no sense. It defeats the purpose of
> tr/// - extremely fast, known transliterations.

The proposal extends tr/// to handle extremely fast transliterations
whose nature is not known at compile time.

> 
> tr///e is the same as s///g:
> 
> tr/$foo/$bar/e  ==  s/$foo/$bar/g

It is nothing of the sort.

$foo = 'fo';
$bar = 'ba';

$s1 = $s2 = "foolproof";

$s1 =~ tr/$foo/$bar/e;
# The result is "baalpraab";

$s2 =~  s/$foo/$bar/g;
# The result is "baolproof"
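
Since tr///e does not exist, the difference can be checked today only
with literal sets.  The sketch below uses tr/fo/ba/, which is what the
proposed form would reduce to for these values of $foo and $bar.

```perl
use strict;
use warnings;

# Character-for-character mapping: every f becomes b, every o becomes a.
(my $s1 = "foolproof") =~ tr/fo/ba/;

# Substring replacement: only the two-character sequence 'fo' is replaced.
(my $s2 = "foolproof") =~ s/fo/ba/g;

print "$s1\n$s2\n";
```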





Re: RFC 165 (v1) Allow Varibles in tr///

2000-08-29 Thread Mark-Jason Dominus


> Note that the 256-byte thing is out the window with Unicode, but that
> I no longer know how it is done.

Thanks.  I was going to mention that, but I forgot before I sent the
message.  The 256-byte thing is still in place with Unicode, but it's
only used on byte strings, not on UTF8 strings.  Since the byte/UTF8
thing might be going out the window in Perl 6, it's hard to speculate
about the implications for tr///.

But I think my main point still stands: We don't have any problem with
reconstructing a (potentially humongous) regex structure at run time,
so I don't see why we should have a problem with reconstructing the
tr/// tables at run time.




Overlapping RFCs 135 138 164

2000-08-29 Thread Mark-Jason Dominus


RFC135: Require explicit m on matches, even with ?? and // as delimiters.

C<?...?> and C</.../> are what makes Perl hard to tokenize.
Requiring them to be written C<m?...?> and C<m/.../> would
solve this.

(Nathan Torkington)

RFC138: Eliminate =~ operator.

Replace EXPR =~ m/.../ with m/.../ EXPR, and similarly for
s/// and tr///. Force an explicit dereference when using
qr/.../. Disallow the implicit treatment of a string as a
regular expression to match against.

(Steve Fink)

RFC164: Replace =~, !~, m//, and s/// with match() and subst()

Several people (including Larry) have expressed a desire to
get rid of C<=~> and C<!~>. This RFC proposes a way to replace
C<m//> and C<s///> with two new builtins, C<match()> and
C<subst()>.

(Nathan Wiger)


I would like to see these three RFCs merged into one if this is
appropriate.  I am calling on the three authors to discuss in private
email how this may be done.  I hope that the discussion will result in
the withdrawal of at least two of the three RFCs, and that this private
discussion produces a new RFC.  The new RFC should discuss the points
raised by all three existing RFCs, should investigate several
solutions in parallel, and should compare them with one another and
contrast the benefits and drawbacks of each one.









Re: RFC 110 (v2) counting matches

2000-08-29 Thread Mark-Jason Dominus


> Mark-Jason Dominus wrote:
> > 
> > m/.../Count   (instead of m/.../t)
> > m/.../iCount  (instead of m/.../it)
> > m/.../Count,i (instead of m/.../ti)
> > m/.../Count,Insensitive   (instead of m/.../ti)
> 
> Blech, no. Please. Less typing good. More typing bad.
> 
> If you're just proposing synonyms, I don't see anyone using these
> besides as mnemnonics. In which case, the key is just making sure
> that we pick good letters.

I was proposing synonyms for the existing options, and an expanded
namespace for future options.

It is perfectly reasonable for common flags to get short names and
uncommon flags to get long names.  For example, I think that if the /c
option had had only a long name, it would have imposed very little
burden on the community, and it would have left /c itself available
for the more useful application of producing a count.

> I don't see us running out of letters. 

The problem is not with running out of letters.  The problem is with
running out of appropriate letters.

I raised this suggestion in response to Richard Proctor's observation
that /c was unavailable for 'count', and suggesting /t instead.  

> Last I checked, m// only takes half a dozen flags. 

m// and s/// presently take eight different flags. (cegimosx) In the
past, several others have been proposed, including /r, /t, and /z.

> And so on. This seems like a much more productive use, otherwise we're
> just wasting characters.

Characters are not in short supply.

Anyway, I will consider the subject closed unless someone produces an
RFC for it.




Re: Proposal for IMPLEMENTATION sections

2000-08-30 Thread Mark-Jason Dominus


> Any requirements on how solid an implementation section should be
> should be left to the working group chairs.

Sorry, I don't understand this.  What is the WGC's role here?





Re: Proposal for IMPLEMENTATION sections

2000-08-30 Thread Mark-Jason Dominus


> On Wed, Aug 30, 2000 at 02:29:33PM -0400, Mark-Jason Dominus wrote:
> > 
> > > Any requirements on how solid an implementation section should be
> > > should be left to the working group chairs.
> > 
> > Sorry, I don't understand this.  What is the WGC's role here?
> 
> My english native language is?  :-)

I didn't have a problem with the parsing.  I had trouble with the meaning.

Suppose a WGC establishes a requirement for the solidity of the
implementation section, and receives an RFC that does not meet the
requirements.  What then?




Re: RFC 72 (v1) The regexp engine should go backward as well as forward.

2000-08-31 Thread Mark-Jason Dominus


> I am unemcumbered by any knowledge of the regex engine implementation, 

Yeah.

But I do know something about it, and I have already expressed my
informed opinion.  Having you come along to say that you don't know
anything about it at all, but that you nevertheless think I am
mistaken, is bizarre.

> It might be possible to unroll this imagined inner test outside the loop -

Perhaps you could study the code in regexec.c for a little bit of
time, say fifteen minutes, and then make this suggestion again in
light of what you discover.

I have no problem with discussing this in more detail, but I don't
think it would be a good use of my time to discuss it with you when
you haven't looked at the code.







Re: $& and copying: rfc 158 (was Re: RFC 110 (v3) counting matches)

2000-08-31 Thread Mark-Jason Dominus


>   MD> One of Uri's suggestions in RFC 158 was to compute $& only for
>   MD> regexes that have a /k modifier.  This would solve the $& problem
>   MD> because Perl would compute $& only when asked to, and not for
>   MD> every other regex in the rest of the program.
> 
> the rfc was about making $& private to the block with the regex and
> only make the copy if /k is used or you use grabbing.

Making $& local to a block is not going to get a performance
improvement.  The reason $1 is block localized is for safety, not
speed.  Consider:

/(...)/;
foo();
print $1;

You might have had to worry that foo() would reset $1 somehow.  But
because $1 is block-localized, you can be sure that it will be
restored automatically when foo() returns.
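
A minimal sketch of that behavior, assuming a hypothetical foo() that
runs its own successful match (foo, demo, and the sample strings are
illustrative names only, not anything from the RFC):

```perl
use strict;
use warnings;

# Hypothetical helper: performs its own match, so $1 changes inside foo().
sub foo {
    my $s = "xyz";
    $s =~ /(z)/ or die "inner match failed";
    return $1;              # "z" while we are still inside foo()
}

sub demo {
    my $s = "abcdef";
    $s =~ /(...)/ or die "outer match failed";
    my $inner = foo();      # foo() does a different successful match
    my $outer = $1;         # the caller's capture is visible again here
    return ($outer, $inner);
}

my ($outer, $inner) = demo();
print "outer=$outer inner=$inner\n";    # prints "outer=abc inner=z"
```

The caller sees its own capture again after foo() returns, which is the
safety property described above; nothing here measures speed.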

The performance gain in your RFC comes from the /k option, regardless
of whether or not $& gets block scope.

> a side question i have is whether this extra copy is a runtime effect or
> compile time. i would imagine runtime with some global flag being
> checked to see if $& is being used. so you could run fast and later load
> a module uses $& which slows you down. 

That doesn't make any sense.  Your proposal says that $& is only set
for regexes that have /k.  Loading a module won't change your non-/k
regexes.

> in any case, i think we have a fair agreement on rfc 158 and i will
> freeze it if there is no further comments on it.

Please add a section that addresses Perl 5 -> Perl 6 translation
issues that will apply if your proposal is adopted.



Re: $& and copying: rfc 158 (was Re: RFC 110 (v3) counting matches)

2000-08-31 Thread Mark-Jason Dominus


> in any case, i think we have a fair agreement on rfc 158 and i will
> freeze it if there is no further comments on it.

In light of this:

 $&  The string matched by the last successful pattern match (not
 counting any matches hidden within a BLOCK or eval() enclosed
 by the current BLOCK).  (Mnemonic: like & in some editors.)
 This variable is read-only and dynamically scoped to the
 current BLOCK.

 (perlvar)

I think you should remove the parts of your proposal about making $& be
autolocalized.

Thanks, Tom.




Re: RFC 72 (v1) The regexp engine should go backward as well as forward.

2000-08-30 Thread Mark-Jason Dominus


The big thing I find missing from this RFC is compelling examples.
You are proposing a major change to the regex engine but you only have
two examples.  Both involve only fixed strings and one of them is
artificial.  I really think you need to discuss in more detail why
this feature would be useful.

You specifically said that you wanted your feature to be able to match
expressions other than fixed strings, but you didn't give any examples
of that.

> With the proposed extension, you could write:
> 
> m/GAAC(?r)(TTAAG|  )/
> 
> and the regexp engine doesn't have to go looking deep into your regexp to
> know where it should start potential matches.

OK, now here it's not really clear why you would want to use your
feature instead of doing something like this instead:

while (m/GAAC/g) {  
  last if substr($_, pos($_)-9, 5) eq 'GAATT';  # the 5 chars before GAAC
  last if ...;
  ...;
}

You could make an argument that yours is more compact, but my version
could easily be wrapped into a subroutine, and it doesn't seem like
a particularly common operation, so it doesn't seem like there needs
to be another way to say this.  Of course, I might have completely
missed the point.  More and better examples would be a great help
here.
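
As a rough sketch of the subroutine wrapping mentioned above
(find_preceded_by and its arguments are hypothetical names, and this
only handles fixed-string prefixes, which is all the RFC's examples
use):

```perl
use strict;
use warnings;

# Return the offset where one of @prefixes starts immediately before an
# occurrence of $anchor in $text, or nothing if there is no such occurrence.
sub find_preceded_by {
    my ($text, $anchor, @prefixes) = @_;
    while ($text =~ /\Q$anchor\E/g) {
        my $start = $-[0];                  # offset where $anchor begins
        for my $prefix (@prefixes) {
            my $len = length $prefix;
            next if $start < $len;          # not enough room before the anchor
            return $start - $len
                if substr($text, $start - $len, $len) eq $prefix;
        }
    }
    return;
}

my $dna = "CCGAATTGAACTT";
my $hit = find_preceded_by($dna, "GAAC", "GAATT");
print "prefix starts at offset $hit\n";     # prints "prefix starts at offset 2"
```

The length check avoids the negative-offset behavior of substr() when
the anchor is too close to the start of the string.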

> As a frivolous illustration, the string 
> 
>   ABCDEFGHIJKLM
> 
> would be matched by:
> 
> m/FG(?r)EDCB(?f)HIJK(?r)A^(?f)LM$/

If I understand your proposal correctly, it will not change the
behavior of the regex if you collect the (?f) and (?r) sections
together.  If this is true, then these all have the same meaning:

 m/FG(?r)EDCB(?f)HIJK(?r)A^(?f)LM$/   # Your example
 m/FGHIJK(?r)EDCB(?r)A^(?f)LM$/
 m/FGHIJK(?r)EDCBA^(?f)LM$/
 m/FGHIJKLM$(?r)EDCBA^/   # Why not just say this?

If I am correct, then it doesn't appear that there is ever any reason
to have more than one (?r) and one (?f) in a single regex.  Also,
since there is in effect an implicit (?f) at the beginning of every
regex, you don't need a (?f) escape at all, as in the example I just
showed.  

Did I misunderstand your proposal?  Or did I miss seeing the
implication of some example that you didn't include?  If I am correct,
I think you should eliminate (?f) from your proposal, since it is not
useful.

> It will be important to know the offset where the match begins, as
> well as where it ends (indeed it would be nice to have that info in
> Perl5 without having to pay the $& performance penalty),
> so in addition to pos(), there might be a function prepos() to
> give the start of the match -- or pos() might return both end and
> start offsets in a list context.

OK, that's very nice, but you say you don't want the $& penalty.
I suspect from your discussion that you don't really understand that
$& penalty.  There are two parts to the $& penalty.

The first part is that maintaining the information for $& has a cost.
Maintaining this information for your prepos() function is going to
incur an identical cost.

The other part of the $& penalty is because $& itself is a global
variable, the penalty has to be paid by every regex in the program.
This is not a problem with the information in $&; it is a problem with
the interface to the information.  If the interface were different, $&
would not be a problem.  For example, if $& were only set on regexes
with a /k modifier, as proposed in RFC158, a lot of the pain of $&
would go away.

Now if something like RFC158 were adopted, then your rationale for
prepos() would go away, because length($&) would no longer be
particularly expensive.  At least, there would be no reason to suppose
it would be more expensive than your proposal.

However, a prepos() function would have exactly the same problem as $&
presently has.  Whenever Perl did a regex match on any regex in the
entire program, it would have no way of knowing whether prepos() might
be called much later, so the cost of computing and storing the
prepos() information would be incurred.

Rather than evading the $& problem, as you suggest, introducing
prepos() is going to make it even worse.

You can evade this problem by making prepos() lexically scoped.  For
example, prepos() information is only computed for regexes that have
the /q modifier on the end, or is only available inside the scope of a
'use prepos' declaration.  Either of these would fix this problem.
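
For comparison, here is a small sketch of reading the start and end
offsets of a match in Perl 5 itself.  It assumes Perl 5.6 or later,
where (as far as I know) the @- and @+ arrays hold those offsets;
match_span is a hypothetical helper name.  It says nothing about what
the bookkeeping costs, which is the real issue above.

```perl
use strict;
use warnings;

# Return (start, end) offsets of the first match of $re in $text,
# or an empty list if there is no match.
sub match_span {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    return ($-[0], $+[0]);      # offsets of the whole match
}

my ($start, $end) = match_span("hello world", qr/wor/);
print "start=$start end=$end length=", $end - $start, "\n";
# prints "start=6 end=9 length=3"
```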

> I have no idea whether this feature will help people parsing right-to-left
> languages; it seems likely to help with bi-directional texts (see RFC 50).

I was wondering that myself, but I don't think it will, because RTL
text is not encoded backwards in the string itself.  It only *prints*
right-to-left.  But I may be mistaken, and I think you should consult
with Roman Parparov on this point before submitting the next revision
of this RFC.

Finally, some general comments: First, it seems to me that if there
were simply a better interface to pos() and to length($&), the need
for this feature would go away.  Let's

Re: RFC 165 (v1) Allow Varibles in tr///

2000-08-29 Thread Mark-Jason Dominus


> > The way tr/// works is that a 256-byte table is constructed at compile
> > time that says for each input character what output character is
> 
> Speaking of which, what's going to happen when there are more than 256
> values to map?


It's already happened, but I forget the details.



Re: RFC 165 (v1) Allow Varibles in tr///

2000-08-30 Thread Mark-Jason Dominus


> Ok, I can understand that.  But, what happens when we get to UTF16?  Aren't
> we talking about 256k per tr///, then?  That seems like a lot of memory
> that is potentially wasted and could lead to some really large footprints.

I don't understand what this discussion has to do with this mailing
list, and I don't understand what your point is.  tr/// has already
been implemented.  It uses a 256-byte table.  tr/// has already been
extended to UTF8 strings, and it takes a certain amount of memory.
Perhaps that amount is 256K, perhaps not.  If it is, what does that
have to do with us here? 

If this discussion should go on anywhere, it should be on the
perl6-internals list.  If you want to register an opinion that 256K
bytes is too expensive, you should do that on perl6-internals.  It is
up to them to figure out if the current implementation is wasteful of
memory and to devise a new implementation if so.

For the record, the UTF8 version of tr/// does not use a plain 256K
table.  It uses a data structure called a 'swash'; this is also the
data structure that is used by the UTF8 versions of 'uc' etc., the
\p{...} regex escapes, and the others.  The swash is based on a hash,
and the code is in utf8.c.
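
To make the byte-table idea concrete, here is a sketch in Perl of a
table-driven translation.  It is only an illustration of the
technique, not the C code Perl actually uses; build_table and
translate are hypothetical names, and the sketch insists on
equal-length lists to stay short.

```perl
use strict;
use warnings;

# Build a 256-entry table: index = input byte value, value = output character.
sub build_table {
    my ($from, $to) = @_;
    die "this sketch needs equal-length lists\n"
        unless length($from) == length($to);
    my @table = map { chr($_) } 0 .. 255;   # identity mapping by default
    my @f = split //, $from;
    my @t = split //, $to;
    # Walk backwards so that the first occurrence of a repeated input
    # character is the one that ends up in the table.
    for my $i (reverse 0 .. $#f) {
        $table[ ord($f[$i]) ] = $t[$i];
    }
    return \@table;
}

# Translate a string one character at a time through the table.
# Characters above 255 are passed through unchanged in this sketch.
sub translate {
    my ($table, $string) = @_;
    return join '', map {
        my $code = ord($_);
        $code < 256 ? $table->[$code] : $_;
    } split //, $string;
}

my $table = build_table("abc", "xyz");
print translate($table, "aabbcc-d"), "\n";  # prints "xxyyzz-d"
```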





Re: RFC 110 (v3) counting matches

2000-08-31 Thread Mark-Jason Dominus


> (mystery: how
> can filling in $& be a lot slower than filling in $1?)

It isn't.  It's the same.  $1 might even be more expensive than $&.

It appears that many people don't understand the problem with $&.  I
will try to explain.

Maintaining the information required by $1 or $& slows down the regex
match, possibly by as much as forty to sixty percent, or more.  (How
much depends on details of the regex and the target string.)

For this reason, Perl has an optimization in it so that if you never
use $& anywhere in your program, Perl never maintains the information,
and every regex in your program runs faster.

But if you do use $& somewhere, Perl cannot apply the optimization,
and it must compute the $& information for every regex in the program.
Every regex becomes much slower.

In particular, if you load a module whose author happened to use $&,
all your regexes get slower, which might be an unpleasant surprise,
since you might not be aware of the cause.

A regex with backreferences is *also* slow.  But using backreferences
in one regex does not make all the *other* regexes slow.  If you have

/(...)/   # regex 1
/.../ # regex 2

Perl knows that it must compute the backreference information for
regex 1, and knows that it can skip computing the backreference
information for regex 2, because regex 2 contains no parentheses.

If you use a module that contains regexes that use backreferences,
those regexes run slowly, but there is no effect on *your* regexes.

The cost is just as high for backreferences as for $&, but the
backreference cost is paid only by regexes that actually need it.

The $& cost is paid by every regex in the entire program, whether they
used it or not.  This is because Perl has no way to tell which regexes
use $& and which do not. 

One of Uri's suggestions in RFC 158 was to compute $& only for regexes
that have a /k modifier.  This would solve the $& problem because Perl
would compute $& only when asked to, and not for every other regex in
the rest of the program.
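
A small sketch of the usual way to stay out of the $& trap in Perl 5:
ask for the matched text with explicit parentheses, so that only this
one regex pays the bookkeeping cost.  first_number is a hypothetical
helper name.

```perl
use strict;
use warnings;

# Return the first run of digits in $text, or undef if there is none.
# The parentheses make this regex (and only this regex) record what it
# matched, so $& is never mentioned anywhere in the program.
sub first_number {
    my ($text) = @_;
    return $text =~ /(\d+)/ ? $1 : undef;
}

print first_number("order 12345 shipped"), "\n";    # prints "12345"
```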




perl6-language-regex summary for 20000831

2000-08-31 Thread Mark-Jason Dominus
ry neatly.  Nathan Wiger pointed out that this
was covered by RFC 164.

I pointed out that the implementation would have to construct the
translation table at run-time, and that this brings in the same issues
as when a regex is constructed at run time.  For example, a new tr///o
option becomes desirable for the same reason that m//o is desirable.
Tom and I had a discussion of these issues, but there do not appear
to be any issues here that do not also come up in connection with
interpolated regexes.
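
For reference, the usual Perl 5 workaround builds the tr/// with a
string eval, and the compile-once concern shows up as the question of
how often that eval runs.  The sketch below compiles once and returns
a closure; make_translator is a hypothetical name, and the input check
is deliberately crude.

```perl
use strict;
use warnings;

# Compile a tr/// for the given lists once and return a closure that
# applies it.  The lists are pasted into source code, so this sketch
# refuses anything that could end the tr/// early or start an escape.
sub make_translator {
    my ($from, $to) = @_;
    die "unsupported character in list\n" if "$from$to" =~ m{[/\\]};
    my $translator =
        eval "sub { my \$s = shift; \$s =~ tr/$from/$to/; return \$s }";
    die $@ unless $translator;
    return $translator;
}

my $rot13 = make_translator('a-zA-Z', 'n-za-mN-ZA-M');
print $rot13->("Hello"), "\n";      # prints "Uryyb"
```

Calling make_translator once and reusing the closure plays the role
that an /o-style option would play for a run-time tr///.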

There was a sidetrack about the implementation of tr/// in the
presence of Unicode strings.

RFC 166: Additions to regexs  (Richard Proctor)

This RFC unfortunately proposes three totally unrelated changes.

Richard proposed a 'does not match' operator, with the example that

/a(?^b)c/

would match ac, axc, a---c, but not abc, a-b-c, or abbbc.  But he did not
include a complete enough explanation of what it would do to enable
anyone to implement it.  (Nobody has been able to produce a sensible
description of such an operator, which is probably why Perl doesn't
have one yet.)  Richard said he would tighten up the definition, but
version 2 has not appeared yet.

Richard also proposed a (?) operator that would match the empty
string.  You would use this in cases like /$foo(?)bar/ where it is
inappropriate to abut $foo and bar.  It was pointed out that
/${foo}bar/ and /$foo(?:)bar/ already work for this purpose.  Richard
agreed that this was what he wanted.

The third proposal was that (?@foo) be taken to interpolate the string
(join "|", @foo).  There was no discussion of this.
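
A sketch of what that third proposal amounts to in existing Perl 5,
written out by hand.  The quotemeta() call is my addition so that
array elements are matched literally; the RFC text quoted here does
not say whether that is intended.

```perl
use strict;
use warnings;

my @foo = ('cat', 'dog', 'a.b');

# Build the alternation by hand: cat|dog|a\.b
my $alternation = join '|', map { quotemeta($_) } @foo;
my $re = qr/\A(?:$alternation)\z/;

for my $word ('dog', 'a.b', 'axb') {
    printf "%-4s %s\n", $word, ($word =~ $re ? "matches" : "does not match");
}
```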
 
RFC 170: Generalize =~ to a special-purpose assignment operator
 (Nathan Wiger)

This is probably the most interesting and far-reaching RFC proposed
this week, but there was essentially no discussion.



Mark-Jason Dominus   [EMAIL PROTECTED]
I am boycotting Amazon. See http://www.plover.com/~mjd/amazon.html for details.




Re: perl6-language-regex summary for 20000831

2000-09-01 Thread Mark-Jason Dominus


> On Thu, Aug 31, 2000 at 12:34:05PM -0400, Mark-Jason Dominus wrote:
> > 
> > perl6-language-regex
> > 
> > Summary report 20000831
> > 
> > RFC 72: The regexp engine should go backward as well as
> > forward. (Peter Heslin)
> > 
> > This topic did not attract much discussion until the very end of the
> > week.  I sent the author a detailed critique, to which he has not
> > responded. 
> 
> As the author in question, I would like to note in my defense that there

I wasn't trying to attack you, so no defense is required.  I was just
reporting on the current status of the RFC.

> I posted a detailed response (within 24 hours) and have now posted a
> revised RFC. 

You did, and it was excellent.  Thanks very much.



Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Mark-Jason Dominus


> >>>>> "Mark-Jason" == Mark-Jason Dominus <[EMAIL PROTECTED]> writes:
> 
> Mark-Jason> I have some ideas about how to do this, and I will try to
> Mark-Jason> write up an RFC this week.
> 
> "You want Icon, you know where to find it..." :)

That's exactly my motivation.  It seems to me that trying to cram Icon
into regexes isn't working well, but that a small transplant of Icon
into the core language might suffice instead of a lot of cramming.




Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Mark-Jason Dominus


> I think what is needed is something along the line of :

Joe McMahon and I are working on something along these lines.



Re: What's in a Regex (was RFC 145)

2000-09-07 Thread Mark-Jason Dominus


> >   2. Many people - including Larry - have voiced their desire
> >  to see =~ die a horrible death
> 
> Please provide a look-up-able reference to Larry's saying that he
> wanted to =~ to die horrible death.  

Larry said:

# Well, the fact is, I've been thinking about possible ways to get rid
# of =~ for some time now, so I certainly don't mind brainstorming in
# this direction.

That is in 
<[EMAIL PROTECTED]>

which is archived at 

http://www.mail-archive.com/perl6-language-regex@perl.org/msg3.html

I think Nathan was exaggerating here, but maybe he knows something I don't.




Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Mark-Jason Dominus


> >...My point is that I think we're approaching this
> >the wrong way.  We're trying to apply more and more parser power into what
> >classically has been the lexer / tokenizer, namely our beloved
> >regular-expression engine.

I've been thinking the same thing.  It seems to me that the attempts to
shoehorn parsers into regex syntax have either been unsuccessful
(yielding an underpowered extension) or illegible or both.

An approach that appears to have been more successful is to find ways
to integrate regexes *into* parser code more effectively.  Damian
Conway's Parse::RecDescent module does this, and so does SNOBOL.

In SNOBOL, if you want to write a pattern that matches balanced
parentheses, it's easy and straightforward and legible:

parenstring = '(' *parenstring ')'  
| *parenstring *parenstring
| span('()')


(span('()') is like [^()]* in Perl.)

The solution in Parse::RecDescent is similar.

Compare this with the solutions that work now:

 # man page solution
 $re = qr{
  \(
(?:
   (?> [^()]+ )# Non-parens without backtracking
 |
   (??{ $re }) # Group with matching parens
 )*
  \)
}x;
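
For anyone who wants to try the man page version, here is a sketch of
a small harness around it.  The pattern is the one shown above; the
lexical declaration, the \A and \z anchors, and the is_balanced_group
helper are my additions.

```perl
use strict;
use warnings;

my $re;                         # declared first so the pattern can refer to itself
$re = qr{
    \(
    (?:
        (?> [^()]+ )            # non-parens without backtracking
      |
        (??{ $re })             # nested group with matching parens
    )*
    \)
}x;

# True if the whole string is one parenthesized group with balanced contents.
sub is_balanced_group {
    my ($string) = @_;
    return $string =~ /\A$re\z/ ? 1 : 0;
}

for my $s ('(a(b)c)', '()', '(a(b)') {
    printf "%-8s %s\n", $s, is_balanced_group($s) ? "balanced" : "not balanced";
}
```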

This is not exactly the same, but I tried a direct translation:

 $re = qr{ \( (??{$re}) \)
 | (??{$re}) (??{$re})
 | (?> [^()]+)
 }x;

and it looks worse and dumps core.  

This works:

qr{
  ^
  (?{ local $d=0 })
  (?:   
  \(
  (?{$d++}) 
   |  
  \)
  (?{$d--})
  (?
(?{$d<0})
(?!) 
  )  
   |  
  (?> [^()]* )
  
  )* 


  (?
(?{$d!=0})  
(?!)
  )
 $
}x;

but it's rather difficult to take seriously.

The solution proposed in the recent RFC 145:

/([^\m]*)(\m)(.*?)(\M)([^\m\M]*)/g

is not a lot better.  David Corbin's alternative looks about the same.

On a different topic from the same barrel, we just got a proposal that
([23,39]) should match only numbers between 23 and 39.  It seems to me
that rather than trying to shoehorn one special-purpose syntax after
another into the regex language, which is already overloaded, that it
would be better to try to integrate regex matching better with Perl
itself.  Then you could use regular Perl code to control things like
numeric ranges.  

Note that at present, you can get the effect of [(23,39)] by writing
this:

(\d+)(?(?{$1 < 23 || $1 > 39})(?!))

which isn't pleasant to look at, but I think it points in the right
direction, because it is a lot more flexible than [(23,39)].  If you
need to fix it to match 23.2 but not 39.5, it is straightforward to do
that:  
  
(\d+(\.\d*)?)(?(?{$1 < 23 || $1 > 39})(?!))

The [(23,39)] notation, however, is doomed.  All you can do is
propose Yet Another Extension for Perl 7.
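
A runnable sketch of the conditional-plus-(?!) trick, wrapped in a
hypothetical helper.  The \b anchors are my addition: without them the
engine can backtrack into a longer number and accept a piece of it
(for example the "23" inside "231").

```perl
use strict;
use warnings;

# Return the first whole number between 23 and 39 found in $text,
# or undef if there is none.
sub first_in_range_23_39 {
    my ($text) = @_;
    return $text =~ /\b(\d+)\b(?(?{ $1 < 23 || $1 > 39 })(?!))/
        ? $1
        : undef;
}

print first_in_range_23_39("ages 17, 31 and 45"), "\n";    # prints "31"
```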

The big problem with 

(\d+)(?(?{$1 < 23 || $1 > 39})(?!))

is that it is hard to read and understand.

The real problem here is that regexes are single strings.  When you
try to compress a programming language into a single string this way,
you end up with something that looks like Befunge or TECO.  We are
going in the same direction here.

Suppose there were an alternative syntax for regexes that did *not*
require that everything be compressed into a single string?  Rather
than trying to pack all of Perl into the regex syntax, bit by bit,
using ever longer and more bizarre punctuation sequences, I think a
better solution would be to try to expose the parts of the regex
engine that we are trying to control.

I have some ideas about how to do this, and I will try to write up an
RFC this week.



perl6-language-regex summary for 20000911

2000-09-11 Thread Mark-Jason Dominus


perl6-language-regex

Summary report 20000911

RFC 72: The regexp engine should go backward as well as
forward. (Peter Heslin)

The author sent a revised version of the RFC.  There seem to be two ideas
here:

1. The lookbehind assertions should work for variable-length
   patterns.  (At present they match only fixed-length strings.)

2. The programmer would be able to optimize the regex match by
   directing the engine to an unlikely part of the pattern first.  For
   example, if you are looking for 'e.*z', and you write

/eX*z/

   the regex engine might look first for an e, then some following
   X's, in hopes of finding a 'z' afterwards.  If there are many e's
   and few z's, this may result in many false starts.  Peter's idea
   seems to be that with 

/z(?r)X*e/

   the regex engine would look first for a 'z', and then for
   *preceding* 'X*', and then for an 'e' before that, which might be
   faster. 

   Bart Lateur said that the regex engine should do this sort of
   optimization automatically.  (In fact it already does in some
   cases.)  

Hugo points out that a variable-length lookbehind might be more powerful.

RFC 93: Regex: Support for incremental pattern matching  (Damian Conway)

No discussion this week.

RFC 110: counting matches  (Richard Proctor)

Richard released version 4 of the RFC, which just adds a couple of
personal opinions  about how 

$number = () = m/.../g;

is ugly.  He says that he is going to add some suggestions from Hugo
van der Sanden and then freeze the RFC at the end of the week.
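
For readers who have not met the idiom, a short sketch of the existing
Perl 5 ways to count matches (the sample string is arbitrary):

```perl
use strict;
use warnings;

my $text = "banana";

# The idiom under discussion: list assignment in scalar context yields
# the number of elements on the right-hand side.
my $count = () = $text =~ /a/g;

# A more explicit alternative: step through the matches one at a time.
my $loop_count = 0;
$loop_count++ while $text =~ /a/g;

print "$count $loop_count\n";   # prints "3 3"
```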

RFC 112: Assignment within a regex  (Richard Proctor)

No discussion.

RFC 138: Eliminate =~ operator.  (Steve Fink)

Steve withdrew this RFC in favor of RFC 164.

RFC 144: Behavior of empty regex should be simple  (Mark Dominus)

Frozen.

RFC 145: Brace-matching for Perl Regular Expressions  (Eric Roode)

David Corbin suggested an alternative syntax.  This sparked a long
series of syntactic suggestions.  Nathan Wiger suggested a special
syntax for matching XML-style open and close tags.  (However, he did
not submit an RFC.)

I *still* think that all the proposals for this functionality are too
limited and too specific.  Others seem to be thinking in the same
direction.  Jonathan Scott Duff asked

What if we just provided deep enough hooks into the RE engine
that specialized parsing constructs like these could easily be
added by those who need them?

Michael Maraist said that more powerful and convenient parsing should
be incorporated into Perl, not into the regex engine.  Tom
Christiansen expressed agreement.  Damian Conway suggested that
people look at Parse::RecDescent and suggested that this and other
parsing modules were the right direction to go.

I sent a note about SNOBOL syntaxes, and promised that Joe McMahon and
I would write an RFC, but it has not appeared yet.

RFC 150: Extend regex syntax to provide for return of a hash of
 matched subpatterns  (Kevin Walker)

Kevin reported back on Python's version of this feature.  He said that
the major deficiency in his proposal is that there is nothing
analogous to the \1 in /(.*)\1/.  He promised a revised RFC, which has
not appeared yet.
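
As a point of comparison only: later Perl 5 releases (5.10 onward, as
far as I know) have named captures, with the %+ hash for the matched
subpatterns and \k<name> as the named analogue of \1.  This is not
the syntax proposed in RFC 150; repeated_half is a hypothetical helper
name.

```perl
use strict;
use warnings;

# If $text consists of some string repeated twice, return that string;
# otherwise return undef.  This is /(.*)\1/ with a name instead of a number.
sub repeated_half {
    my ($text) = @_;
    return $text =~ /\A(?<half>.+)\k<half>\z/ ? $+{half} : undef;
}

print repeated_half("abcabc"), "\n";    # prints "abc"
```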

RFC 158: Regular Expression Special Variables  (Uri Guttman)

It was pointed out that part of Uri's proposal was to make $& block
scoped, but $& is already block scoped.  Uri has not sent a revised RFC.


RFC 164: Replace =~, !~, m//, s///, and tr// with match(), subst(),
 and trade()  (Nathan Wiger)

Surprisingly, there was no discussion about this RFC this week.

RFC 165: Allow variables in tr///  (Richard Proctor)

Richard suggested that he freeze the RFC, but I don't believe he has
sufficiently taken into account the last round of discussion.  I have
asked for a revision.

RFC 166: Additions to regexs  (Richard Proctor)

Richard plans to drop two of the three items here, and retain only
the one that makes

(?@foo)

equivalent to

(??{join '|', @foo})

I pointed out that this is already possible with a compile-time
overloaded constant, and provided a demonstration module.

RFC 170: Generalize =~ to a special-purpose assignment operator
 (Nathan Wiger)

Still little discussion of this.

RFC 197: Numberic Value Ranges In Regular Expressions (David Nichol)
  
There was no discussion.

RFC 198: Boolean Regexes (Richard Proctor)

Richard says that this is a development of the 'negated expression'
idea that he dropped from RFC 166.  

Discussion from RFC 166 pointed out that Richard had not said clearly
what he wanted his (?^...) proposal to do.  Richard proposed a
definition, and I followed up pointing out that even his proposed
definition did not do what he wanted.  Richard then said he would
tighten up the definition, but it appears that he didn't do this.

However, there has been no discussion of this proposal.




Re: RFC 197 (v1) Numberic Value Ranges In Regular Expressions

2000-09-11 Thread Mark-Jason Dominus


I have some trouble understanding just what the proposal is, since the
RFC doesn't contain any examples.  But I gather that you want to usurp
*both* the (...) and the [...] notation for numeric ranges.

This would change the meaning of any code that happened to contain a
regex like this:

/(12.3,45.67)/

That seems to me like a very bad idea.  

Usurping /[...]/ isn't quite as awful an idea, since patterns like
/[12,34]/ are probably rare.  

The behavior you want is already possible without an extension:

/(\d+\.?\d*)  # look for a number
 (?  
   (?{$1 < 37.3 || $1 > 200}) # If it's out of range
   (?!)   # ...then backtrack
 )
/x

I agree that this isn't really pretty, but

1. the proposed notation is really nasty, since it overloads existing
   well-established notations, and

2. I think a better response would be to find a way to use the
   existing features with a prettier notation, since they are much
   more generally applicable than your proposed extension.
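
One possible shape for such a prettier notation, sketched with plain
Perl doing the range test after the regex has found the numbers.
numbers_in_range is a hypothetical name.  Note that this is not
identical to the (?!) version above: it tests each number as a whole
and never backtracks into part of one.

```perl
use strict;
use warnings;

# Return every number in $text that lies between $low and $high inclusive.
sub numbers_in_range {
    my ($text, $low, $high) = @_;
    return grep { $_ >= $low && $_ <= $high } ($text =~ /(\d+\.?\d*)/g);
}

my @found = numbers_in_range("12.3 45.67 150 300", 37.3, 200);
print "@found\n";       # prints "45.67 150"
```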





Re: RFC 72 (v1) The regexp engine should go backward as well as forward.

2000-09-11 Thread Mark-Jason Dominus


> Simply put, I want variable-length lookbehind.  

Why didn't you simply propose that the (?<...) operator be fixed to
support variable-length expressions?  Why so much additional machinery?




Re: RFC 72 (v1) The regexp engine should go backward as well as forward.

2000-09-11 Thread Mark-Jason Dominus


> As to your contention that "at best" (?r) will defeat many present
> optimizations, can you tell me why this will necessarily be so in the
> new engine? 

Let me explain my thinking along these lines.  I've made a number of
assumptions, which may not be correct, and certainly aren't obvious.

I have been supposing all along that the Perl 6 regex engine will
incorporate the Perl 5 regex engine directly.  This may turn out to be
wrong, but I did think it through.  I think this for several reasons:

1.  Writing even a simple regex engine is nontrivial.  Writing a regex
engine as fast and as complicated as Perl's is very difficult.
Even Perl's regex engine was not written from scratch; it was
based on code supplied by Henry Spencer.

2.  Very few people are available who are capable of reimplementing
Perl's regex engine.  The people on this list are clearly not
going to do it.  According to someone on this list, some of the
people here are not even competent to look at the regex engine
code.

More to the point, I don't know of anyone who has volunteered, and
when I try to think of candidates, nobody likely comes to mind.

3.  Regexes are one of Perl's most essential features.  If the regexes
are slow, that is a big problem for Perl.  The existing regex
engine is fast, partly because it has years of optimizations in
it.  To start over would be to throw that all away.

4.  People have tried implementing regex engines along different
principles before, and have not been able to find anything faster
than the current strategy. 

For example, in Perl regexes are compiled into fixed-size
bytecodes; when a regex (such as /a(b|c)d/) contains a branch, the
branch is expressed as a bytecode offset.

It might seem that one could do better: Instead of using
fixed-size bytecodes, compile each regex operator as a C structure
with a pointer to the struct for the next opcode.  A branch
operator will have pointers to two other structures, instead of to
only one.

People have tried this more than once.  It turns out that this is
slower than the bytecode approach.

5.  Larry has already said that he expects that much of the initial
Perl 6 code will actually be Perl 5 code, just as much of the
initial Perl 5 code was actually Perl 4 code.  (See
http://www.mail-archive.com/perl6-language@perl.org/msg01194.html) 

Perhaps the Perl 6 engine will be a fresh reimplementation, but I do
not think that that is very likely, because there is no good reason to
do it and because it does not appear that there is anyone available
and qualified who wants to do it.

Even if the Perl 6 engine *is* a fresh reimplementation, it seems
likely that it will operate on the same principles as the Perl 5
engine.

So I have been supposing that the Perl 6 regex engine will probably
not be rewritten from scratch, and if it *is* rewritten from scratch,
it will probably still look a lot like the Perl 5 regex engine.

As I said, this might be mistaken, but I think that it's the way to bet.





Re: RFC 165: Allow Variables in tr/// (post hugo)

2000-09-11 Thread Mark-Jason Dominus


> I propose adding the first para as a note and moving RFC to frozen soon.

You did not address my points about tr///o and related issues.

I suggest that you submit a revised RFC and then freeze it a week
afterwards if there is still no discussion.




Re: RFC 166 (v1) Additions to regexs

2000-09-11 Thread Mark-Jason Dominus


> (?@foo) is sort of equivalent to (??{join('|',@foo)}), ie it expands into a
> list of alternatives.  One could possibly use just @foo for this.

It just occurs to me that this is already possible.  I've written a
module, 'atq', such that if you write

use atq;

then your regexes may contain the sequence

(?\@foo)

with the meaning that you asked for.  

(The \ is necessary here because (?@foo) already has a meaning under
Perl 5, and I think your proposal must address this.)

Anyway, since this is possible under Perl 5 with a fairly simple
module, I wonder if it really needs to be in the Perl 6 core.  Perhaps
it would be better to propose that the module be added to the Perl 6
standard library?

Module is at

http://www.plover.com/~mjd/perl/atq.tgz




Re: $& and copying: rfc 158 (was Re: RFC 110 (v3) counting matches)

2000-09-11 Thread Mark-Jason Dominus


> > in any case, i think we have a fair agreement on rfc 158 and i will
> > freeze it if there is no further comments on it.
> 
> I think you should remove the parts of your proposal about making $& be
> autolocalized.

If you're not planning to revise your RFC, let me know so that I can
ask the librarian to mark it as withdrawn.




Re: RFC 110 counting matches (post Hugo)

2000-09-11 Thread Mark-Jason Dominus


> I propose adding this note.  His preference for the working of
> /t and /g seems the most appropriate.  Unless I hear any further
> discussion I propose moving this RFC to frozen this week.

Please post a complete, revised version of the RFC *before* you freeze it.




Re: RFC 166 (v1) Additions to regexs

2000-09-12 Thread Mark-Jason Dominus


> > (The \ is necessary here because (?@foo) already has a meaning under
> > Perl 5, and I think your proposal must address this.)
> 
> (?@foo) has no meaning I checked the code

I don't know what you mean, but you're mistaken, because it means to
interpolate @foo as in a double-quoted string.
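
A quick sketch of that interpolation, shown on ordinary double-quoted
strings so that nothing has to compile as a regex (the array contents
are arbitrary):

```perl
use strict;
use warnings;

my @foo = qw(cat dog bird);

# With the default list separator (a single space):
my $default = "(?@foo)";

# With the list separator temporarily set to '|':
my $piped = do { local $" = '|'; "(?@foo)" };

print "$default\n";     # prints "(?cat dog bird)"
print "$piped\n";       # prints "(?cat|dog|bird)"
```

Either way the text (?@foo) already expands to something today, so a
proposal that gives it a new meaning changes existing behavior.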




Re: RFC - Prototype RFC Implementations - Seperating the men from the boys.

2000-09-11 Thread Mark-Jason Dominus


> Bad reasons
> I do not have time.
> I do not have the tuits.

I think it would be a step in the right direction if the WG chairs
actually required RFC authors to maintain their RFCs.




What good are WG chairs?

2000-09-11 Thread Mark-Jason Dominus


> I think it would be a step in the right direction if the WG chairs
> actually required RFC authors to maintain their RFCs.

I also think it would be a step in the right direction if the WG
chairs wrote up summaries like they said they would.  They obviously
don't.

Frankly, I don't really see what the WG chairs are for, unless maybe
it's to play list mom.




Re: RFC 166 (v1) Additions to regexs

2000-09-13 Thread Mark-Jason Dominus


> On Tue, 12 Sep 2000 19:01:35 -0400, Mark-Jason Dominus wrote:
> 
> >I don't know what you mean, but you're mistaken, because it means to
> >interpolate @foo as in a double-quoted string.
> 
> Which is precisely the meaning he wants for it, with $" set to '|'.

"Which is precisely the meaning he wants for it, except for the parts
 that are different."

So it presently has a different meaning.  

So he should say so in the section of the RFC about migration issues.

Why is this so complicated?




Re: XML/HTML-specific ?< and ?> operators?

2000-09-11 Thread Mark-Jason Dominus


> :Anyway, Snobol has a nice heuristic to prevent infinite recursion in
> :cases like this, but I'm not sure it's applicable to the way the Perl
> :regex engine works.  I will think about it.
> 
> It is probably worth adding the heuristic above: anytime you recurse
> into the same re at the same position, there is an infinite loop.


That is basically it, except that in Snobol it is inside out:  Each
recursively interpolated pattern is assumed to match a string of at
least length 1, and if the remaining part of the target string isn't
sufficiently long to match the rest of the pattern after recursion,
then the recursion is skipped.
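
Here is a toy sketch of that length heuristic, written as a plain
Perl recursive matcher rather than anything resembling the real
Snobol or Perl engines.  The grammar P ::= P 'x' | 'x', the function
name ends_of_P, and the $owed parameter are all invented for the
illustration.  Without the length check, the left-recursive branch
would call itself forever at the same position.

```perl
use strict;
use warnings;

# Toy grammar (hypothetical):  P ::= P 'x' | 'x'
# Returns every position at which a match of P starting at $pos can end.
# $owed counts the characters already promised to the enclosing patterns
# that are still waiting for this call to return.
sub ends_of_P {
    my ($str, $pos, $owed) = @_;
    my @ends;
    my $left = length($str) - $pos;

    # Branch 1: P 'x'.  Assume the inner P needs at least 1 character,
    # the trailing 'x' needs 1 more, and the callers need $owed.
    # If the target is too short for all of that, skip the recursion.
    if ($left >= 1 + 1 + $owed) {
        for my $mid (ends_of_P($str, $pos, $owed + 1)) {
            push @ends, $mid + 1 if substr($str, $mid, 1) eq 'x';
        }
    }

    # Branch 2: a single literal 'x'.
    push @ends, $pos + 1 if substr($str, $pos, 1) eq 'x';

    return @ends;
}

for my $s ('xxx', 'xxy') {
    my $full = grep { $_ == length $s } ends_of_P($s, 0, 0);
    print "$s: ", ($full ? "matches" : "does not match"), "\n";
}
```

Each nested recursion raises $owed by one, so the check must
eventually fail and the recursion bottoms out even though the position
never advances.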




Re: XML/HTML-specific ?< and ?> operators?

2000-09-11 Thread Mark-Jason Dominus


> : it looks worse and dumps core.
> 
> That's because the first non-paren forces it to recurse into the
> second branch until you hit REG_INFTY or overflow the stack. Swap
> second and third branches and you have a better chance:

I think something else goes wrong there too.  


>   $re = qr{...}
> (I haven't checked that there aren't other problems with it, though.)

Try this:

"(x)(y)" =~ /^$re$/;

This should match, but it dumps core.  I don't think there is infinite
recursion, although I might be mistaken.

Anyway, Snobol has a nice heuristic to prevent infinite recursion in
cases like this, but I'm not sure it's applicable to the way the Perl
regex engine works.  I will think about it.




Re: what (?x) are in use? (was RFC 145 (alternate approach))

2000-09-11 Thread Mark-Jason Dominus


> In theory, all letters should be reserved to map to future flags for
> the same reason. 

My recollection is that Larry specifically mandated this, and that's
why (?p...) was changed to (??...) in 5.6.0.




Re: Conversion of undef() to string user overridable for easy debugging

2000-09-13 Thread Mark-Jason Dominus


> This reminds me of a related but rather opposite desire I have had
> more than once: a quotish context that would be otherwise like q() but
> with some minimal extra typing I could mark a scalar or an array to be
> expanded as in qq(). 

I have wanted that also, although I don't remember why just now.  (I
think I have some notes somewhere about it.)  I will RFC it if you want.

Note that there's prior art here: It's like Lisp's backquote operator.




Re: RFC 208 (v2) crypt() default salt

2000-09-14 Thread Mark-Jason Dominus


> =head1 TITLE
> 
> crypt() default salt
> 
> =head1 VERSION
> 
>   Maintainer: Mark Dominus <[EMAIL PROTECTED]>
>   Date: 11 Sep 2000
>   Last Modified: 13 Sep 2000
>   Mailing List: [EMAIL PROTECTED]
>   Number: 208
>   Version: 2
>   Status: Developing

If there are no objections, I will freeze this in twenty-four hours.



Re: types that fail to suck

2000-09-12 Thread Mark-Jason Dominus


> You talked about Good Typing at YAPC, but I missed it.  There's a
> discussion of typing on perl6-language.  Do you have notes or a
> redux of your talk available to inform this debate?

http://www.plover.com/~mjd/perl/yak/typing/TABLE_OF_CONTENTS.html
http://www.plover.com/~mjd/perl/yak/typing/typing.html

Executive summary of the talk:

1. Type checking in C and Pascal sucks.

2. Just because static type checking is a failure in C and Pascal
   doesn't mean you have to give up on the idea.

3. Languages like ML have powerful compile-time type checking that is
   successful beyond the wildest imaginings of people who suffered
   from Pascal.

4. It is probably impossible to get static, ML-like type checking into
   Perl without altering it beyond recognition.

5. However, Perl does have some type checking mechanisms, and more are
   coming up.


Maybe I should also mention that last week I had a dream in which I
had a brilliant idea for adding strong compile-time type checking to
Perl, but when I woke up I realized it wasn't going to work.





Re: RFC 166 (v2) Alternative lists and quoting of things

2000-09-15 Thread Mark-Jason Dominus


> (?Q$foo) Quotes the contents of the scalar $foo - equivalent to
> (??{ quotemeta $foo }).

How is this different from

\Q$foo\E

? 
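
For reference, a minimal sketch of what \Q...\E already does in Perl 5
(the sample value of $foo and the name $quoted are made up for the
example):

```perl
use strict;
use warnings;

my $foo = 'a.b*c';                 # hypothetical value with metacharacters

# \Q...\E quotes the interpolated text, so '.' and '*' match literally.
print "literal match\n"       if     'a.b*c' =~ /^\Q$foo\E$/;
print "no match on aXbbc\n"   unless 'aXbbc' =~ /^\Q$foo\E$/;

# Without quoting, the same text is treated as a pattern.
print "unquoted matches aXbbc\n" if 'aXbbc' =~ /^$foo$/;

# quotemeta is the function form of the same quoting.
my $quoted = quotemeta $foo;
print "$quoted\n";                 # a\.b\*c
```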



Re: 'eval' odd thought

2000-09-15 Thread Mark-Jason Dominus


> eval should stay eval.

Yes, and this is the way to do that.  

When you translate a script, the translator should translate things so
that they have the same meanings as they did before.  If it doesn't
also translate eval, then your Perl 5 scripts will be using the Perl 6
eval, which isn't what you wanted.




Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-25 Thread Mark-Jason Dominus


> On Mon, Sep 25, 2000 at 08:56:47PM +0000, Mark-Jason Dominus wrote:
> > I think the proposal that Joe McMahon and I are finishing up now will
> > make these obsolete anyway.
> 
> Good! The less I have to maintain the better...

Sorry, I meant that it would make (??...) and (?{...}) obsolete, not
that it will make your RFC obsolete.  Our proposal is agnostic about
whether (??...) and (?{...}) should be eliminated.




Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-25 Thread Mark-Jason Dominus


I think the proposal that Joe McMahon and I are finishing up now will
make these obsolete anyway.