huffman encoding and bit operators

2001-10-04 Thread esp5

Picked up from slashdot:

> I really dislike having a concatenation operator that's a valid identifier 
> character. There's just no reason for it.

> And I don't completely buy the excuse that we're running out of punctuation
> characters. If you're going to jumble some of them up and talk about the 
> "Huffman" reason for picking multi-character operators for some and not
> for others: let's do a little more housecleaning. Get rid of the bit 
> operators as punctuation characters and make them named operators. That frees 
> up some punctuation for something that's not used very often anyway.

> BTW, if you're under the impression that they're used frequently try:
>  for i in `find /usr/local/lib/perl5 -name '*.pm'`; 
>  do if perl -wc $i >/dev/null 2>&1; then perl -MO=Terse $i | grep 'bit_'; 
>  fi; done >> bitops

> Sure it doesn't pick up everything (late compilation), but in the 1056 files 
> I have there I had exactly 7 occurrences of bit operators. This does not
> impress me enough to use valuable punctuation.

Now, I just tried it on perl5.6.1, and roughly speaking he's right. 
I get eight. They are:

/usr/local/lib/perl5/5.6.1/sun4-solaris/B/Concise.pm
SVOP (0x20c7a0) const  PV (0x217ac4) "bit_and"
SVOP (0x20c7c8) const  PV (0x217ad0) "bit_xor"
SVOP (0x20c7f0) const  PV (0x217adc) "bit_or"
/usr/local/lib/perl5/5.6.1/sun4-solaris/File/Glob.pm
BINOP (0x2003f8) bit_or [21]
/usr/local/lib/perl5/5.6.1/File/Temp.pm
BINOP (0x2c7bf8) bit_or [13]
BINOP (0x2c7b30) bit_or [11]
BINOP (0x2c7f40) bit_or [19]
BINOP (0x2c2118) bit_or [29]

So, maybe this would be a good way to supplement the language design, by 
categorizing how popular given operators are in actual practise?
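
A minimal sketch of such a tally, assuming the B::Terse output has been saved
as text (like the bitops file above). It counts only lines where the op name
itself starts with bit_, so string constants such as the three in
B/Concise.pm are skipped. The sub name tally_bit_ops is made up for
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: tally bit_* ops in saved B::Terse output.
# Only lines of the form "CLASS (0xADDR) opname ..." are counted, so
# constants that merely contain the text "bit_or" are ignored.
sub tally_bit_ops {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        $count{$1}++ if $line =~ /^\s*\w+ \(0x[0-9a-f]+\) (bit_\w+)/;
    }
    return %count;
}

my @sample = (
    'SVOP (0x20c7a0) const  PV (0x217ac4) "bit_and"',
    'BINOP (0x2003f8) bit_or [21]',
    'BINOP (0x2c7bf8) bit_or [13]',
);
my %count = tally_bit_ops(@sample);
print "$_: $count{$_}\n" for sort keys %count;
```

Run over the full output, a table like this per operator would give the
popularity figures directly.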

Ed

(ps- as for the concatenation operator remark, all I can say is that I hope that 
something else is done, else statements like $my_variable _ FUNCTION are going 
to look awful confusing...)




stringification of objects, subroutine refs

2002-05-10 Thread esp5

I was wondering how perl6 would stringify (as in Data::Dumper):

1) objects with 'my' and 'our' variables
2) $.property
3) subroutines (and coderefs)
4) properties (both is/has)

Right now, the fact that subroutines come out as 'sub { "DUMMY" }' is a minor
pain to work around; having all of these as holes would be too much IMO.
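
To make the perl5 behaviour concrete, here is a small sketch. It assumes a
Data::Dumper recent enough to have the $Data::Dumper::Deparse setting, which
hands code references to B::Deparse instead of printing the placeholder; the
data structure is invented for the example.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Indent = 1;

my $data = { name => 'adder', code => sub { return $_[0] + 1 } };

# Default behaviour: the code reference is replaced by a placeholder.
my $plain = Dumper($data);

# With Deparse enabled, Data::Dumper asks B::Deparse for the source.
my $deparsed = do {
    local $Data::Dumper::Deparse = 1;
    Dumper($data);
};

print $plain, $deparsed;
```

The deparsed text is a reconstruction of the sub body, not the original
source, and it does not carry closed-over variables along.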

Ed



regex and xml/html/*ml

2002-06-05 Thread esp5

hmm.

Just read (skimmed) apocalypse 5, had one concern - it looks like we are on a
serious collision course with parsing the various *mls.

before:

m#..etc#

after

m#\\\#

Also, the space being backslashed sort of bugs me. Surely there is going to be
a 'non-x' modifier? And perhaps a modifier to change the character for logical
tags from <> to something else (like <<>>, perhaps?)

Ed



Re: Apoc 5 questions/comments

2002-06-07 Thread esp5

>> Can we please have a 'reverse x' modifier that means "treat whitespace as
>> literals"?

> I'll talk about that with Larry. If he were to approve it, it might possibly
> be :W.

Likewise, could we please have a modifier that makes <> literal, and aliases
<> as something else so *ml can match easier? 

The most serious objection to this was 'well, use modules for matching *ml' - 
which simply points out that the current incarnation of perl6 regex doesn't
handle a very large class of matching problems very well.

Ed



Re: Apoc 5 questions/comments

2002-06-07 Thread esp5


On Fri, Jun 07, 2002 at 05:10:49PM -0400, Trey Harris wrote:
> In a message dated Fri, 7 Jun 2002, [EMAIL PROTECTED] writes:
> > The most serious objection to this was 'well, use modules for matching *ml' -
> > which simply points out that the current incarnation of perl6 regex doesn't
> > handle a very large class of matching problems very well.
> 
> I don't think that's what people were saying at all.  They were saying you
> should use modules, not because it's too hard to do in Perl 6 regexes, but
> because *ml are well-formed, well-published languages and it doesn't make
> sense to reinvent the wheel when you're nearly certain to miss cases
> handled by the standard modules.

hmm. I thought that was perl's forte - doing quick and dirty, small scripts.

No matter how intuitive, well thought-out, or polished, working through 
a module is always going to be more restrictive than doing it through a regular
expression. It might be better in some cases, yes, but sometimes you just
want the freedom to do stuff by hand.


> Unless I'm missing something, I'm assuming that those modules, when
> rewritten in Perl 6, will be able to dump the specialized parsers and go
> to using grammars as given in A5.

No, you're not missing anything. I just don't want to be forced to use 
modules/rules, that's all. 

And I *don't* want to backslash every damn $@#$% < I see in a XML document.
We have syntactic sugar to stop people from having to backslash \ in Windows
paths, to stop people from having to backslash / inside of regular expressions.
I'd argue that being able to match *ml cleanly (and without modules or rules
or APIs) would be a hell of a lot more important.

Ed

(
ps - and no, I don't want to be forced to go back to use perl5's regex. If 
people do, that just shows the shortcomings of the perl6 system, IMO
)



exegesis 5 question: matching negative, multi-byte strings

2002-10-01 Thread esp5

I was wondering what the favored syntax in perl6 would be to match negative
multi-byte strings. In perl 5:

$sql = "select * from a where b union select * from c where d";

my $nonunion = "[^u]|u[^n]|un[^i]|uni[^o]|unio[^n]";
my (@subsqls) = ($sql =~ m"((?:$nonunion)*)");

guaranteeing that the subsqls have all text up to, but not including the string
"union".

I suppose I could say:

rule nonunion { (.*) :: { fail if ($1 =~ m"union$"); } }

although that seems awful slow, and I suppose that I could do the same thing
in perl6 as I did in perl5, although that gets ugly if you need to combine 
matching strings without "union" in them with, say parens:

rule parens {   \( [ <-[()]> + : | <parens> ]*  \) }
rule non_union_non_parens   
{
[< -[()u] > | 
u< -[()n] > | 
un   < -[()i] > | 
uni  < -[()o] > | 
unio < -[()n] > 
] 
}

my (@subsqls) = ($sql =~ m" ([ <parens> | <non_union_non_parens> ]*) ");

And finally, I suppose I could write a sql grammar (which for this application,
and most) is definitely overkill. So I guess I'd like something shorter, 
something where you could say:

< -["union"] >

or 

< -["union"\(\)] >

or 

< -["union""select"\(\)] >

a generic negative, multi-byte string matching mechanism. Any thoughts? 
Am I missing something already present or otherwise obvious?
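
For reference, the perl 5 alternation from the top of this message can be run
as below. This is only a sketch: the \A anchor and the $head name are
additions for the example, and only the first sub-statement is captured.

```perl
use strict;
use warnings;

my $sql = "select * from a where b union select * from c where d";

# Each branch consumes text that cannot be the start of "union" here.
my $nonunion = "[^u]|u[^n]|un[^i]|uni[^o]|unio[^n]";

# Anchor at the start and capture the longest run of non-"union" text.
my ($head) = $sql =~ m"\A((?:$nonunion)*)";

print "[$head]\n";   # [select * from a where b ]
```

Every extra excluded string or character means editing all five branches by
hand, which is the ugliness being complained about.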

Ed



Re: exegesis 5 question: matching negative, multi-byte strings

2002-10-01 Thread esp5

On Tue, Oct 01, 2002 at 01:24:45PM -0600, Luke Palmer wrote:
> 
> > [Negative matching]
> 
> > a generic negative, multi-byte string matching mechanism. Any thoughts? 
> > Am I missing something already present or otherwise obvious?
> 
> Maybe I'm misunderstanding the question, but I think you want negative
> lookahead:
> 
> Perl 5:   /(.*)(?!>union)/
> Perl 6:   /(.*) <!before union>/
> 
> Luke

no, that doesn't work, because of the way regexes operate. The '.*' captures 
everything, and since the string after everything (ie: the end of the string)
doesn't match 'union', the regex succeeds without backtracking. Try it:

perl -e ' $a = "this has the string union in it"; my ($b) = ($a =~ m"(.*)(?!>union)"); 
print $b;'

prints: 

this has the string union in it

not 'this has the string'.
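
The same thing happens with or without the '>' inside the lookahead. A
minimal script version of the test, with variable names chosen for the
example:

```perl
use strict;
use warnings;

my $text = "this has the string union in it";

# Greedy .* runs to the end of the string first; the negative lookahead
# is then tested there, where no "union" follows, so it succeeds at once.
my ($greedy) = $text =~ m/(.*)(?!union)/;

print "captured: [$greedy]\n";
print "same as input\n" if $greedy eq $text;
```

The lookahead is only ever checked at the position where .* stopped, never at
each character along the way.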

Ed




Re: exegesis 5 question: matching negative, multi-byte strings

2002-10-01 Thread esp5

On Tue, Oct 01, 2002 at 06:32:07PM -0400, Mike Lambert wrote:
> > guaranteeing that the subsqls have all text up to, but not including the string
> > "union".
> >
> > I suppose I could say:
> >
> > rule nonunion { (.*) :: { fail if ($1 =~ m"union$"); } }
> 
> What's wrong with: ?
> 
> rule getstuffbeforeunion { (.*?) union | (.*) }
> 
> "a union" => "a "
> "b" => "b"
> 
> Am I missing something here?
> 
> Mike Lambert
> 

hmm... well, it works, but it's not very efficient. It basically 
scans the whole string to the end to see if there is a "union" string, and 
then backtracks to take the alternative. And hence, it's not very scalable. 
It also doesn't 'complexify' very well.
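
For comparison, a rough perl 5 rendering of the rule quoted above. The sub
name and the \A anchor are assumptions made for the sketch, not part of the
original rule.

```perl
use strict;
use warnings;

# Hypothetical perl 5 version of: rule getstuffbeforeunion { (.*?) union | (.*) }
sub stuff_before_union {
    my ($text) = @_;
    $text =~ m/\A(?:(.*?)union|(.*))/s or return;
    # $1 is set when "union" was found, otherwise the second branch matched.
    return defined $1 ? $1 : $2;
}

print "[", stuff_before_union("a union"), "]\n";   # [a ]
print "[", stuff_before_union("b"), "]\n";         # [b]
```

When there is no "union", the first branch has to walk to the end of the
string before the second branch gets a turn.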

Suppose you had a long string of text, and you wanted to 'harden' your regex
against the substring union appearing in double-quoted strings, single-quoted 
strings, etc. etc, without writing a sql parser. I just don't see how to do this
with ? - I would do something like (taking a page from Mr. Friedl's book ) - 

rule regex_matching_sql 
{
[
<-[u()"']>+ : |
<parens> : |
<single_string> : |
<double_string> : |
<non_union>
]*
}

rule parens
{
\(
[
<-["'()]>+  : |
<single_string> : |
<double_string> : |
<parens>
]*
\)
}

rule single_string
{
\' [ <-[\'\\]>+ : | \.\' ]* \'
}

rule double_string
{
\" [ <-[\"\\]>+ : | \.\" ]* \"
}

rule non_union {  [ u < - ['"()n] > | un ... | uni ... | unio ... | u$ ] * }

Of course I could also be missing something, but I just don't see how to do this
with .*?. 

Ed

(ps:
As for:

/(.*) <!before union> /

I'm not sure how that works; and whether or not it's very 'complexifiable' 
(as per above) . If it does a match against every single substring (take all 
characters, look for union, if it exists, roll back a character, do 
the same thing, etc. etc. etc.) then this isn't good enough.  The non_union 
rule listed above is about as efficient as it can get; it does no backtracking,
and it keeps the common matches up front so they match first without 
alternation.
)



Re: exegesis 5 question: matching negative, multi-byte strings

2002-10-01 Thread esp5

On Tue, Oct 01, 2002 at 05:24:43PM -0400, Peter Behroozi wrote:
> On Tue, 2002-10-01 at 16:44, [EMAIL PROTECTED] wrote:
> > doesn't work (just tried it out, not sure why it doesn't) but even if it did,
> > it would be awful slow. It would try one character, look at the next for the 
> > string union, come back for the next character, look for the string union,
> > etc. etc. etc.
> > 
> > whereas
> > 
> > ([^u]+|u[^n])
> > 
> > doesn't do any backtracking at all..
> > 
> > Ed
> 
> perl -e ' $a = "this has the string union in it"; 
> my ($b) = ($a =~ m"((?:(?!union).)*)"); print $b;'
> 
> prints the desired result for me at least.  It also should be comparably

whoops. Must have mistyped. Works for me now.

> efficient to the alternation since the match for the string 'union'
> should fail if the first character is not 'u', etc.  The alternation
> also matches a character at a time except in special cases, where I am
> reasonably sure that the extra overhead from alternation compensates for
> multi-character matching.  This method also does no backtracking for the
> provided example; I am not sure what made you think that it did.
> 
> Peter
> 

well, when I said backtracking, I meant it didn't flip between the current 
character and the next. I couldn't check real numbers doing benchmarking 
because the ?! construct core dumps on both perl-5.6.1 and perl-5.8 on large 
strings.

But when benchmarked on small (30-line) strings using:

my $regex1 = qr{(?:(?!union).)*}sx;
my $regex2 = qr{(?:[^u]+|u[^n]|un[^i]|uni[^o]|unio[^n])*}sx;

timethese
(10,
{   
'questionbang' => sub { my ($b) = ($line =~ m"($regex1)"); },
'alternation'   => sub { my ($b) = ($line =~ m"($regex2)"); }
}
);

I get:

Benchmark: timing 10 iterations of alternation, questionbang...
alternation: 11 wallclock secs (10.66 usr +  0.00 sys = 10.66 CPU) @ 9380.86/s (n=10)
questionbang: 18 wallclock secs (18.81 usr +  0.00 sys = 18.81 CPU) @ 5316.32/s (n=10)

so ?! is a bit slower. It could probably be made faster though.

However, I'm still skeptical about it being a good replacement for the alternation.
Look at my posted message (about making the regex able to handle nested 
parens, etc.) and see if you can come up with an easy way to handle the case I 
mentioned there.

Ed



Re: exegesis 5 question: matching negative, multi-byte strings

2002-10-02 Thread esp5

On Wed, Oct 02, 2002 at 10:39:17AM +0300, Markus Laire wrote:
> On 1 Oct 2002 at 18:47, [EMAIL PROTECTED] wrote:
> 
> > > > all text up to, but not including the string "union".
> > >
> > > rule getstuffbeforeunion { (.*?) union | (.*) }
> > > 
> > > "a union" => "a "
> > > "b" => "b"
> > 
> > hmm... well, it works, but it's not very efficient. It basically 
> > scans the whole string to the end to see if there is a "union" string, and 
> > then backtracks to take the alternative. And hence, it's not very scalable. 
> > It also doesn't 'complexify' very well.
> 
> What about
> 
> Perl 5:   /(.*?)(?:union|$)/
> Perl 6:   /(.*?) [union | $$]/
> 
> or if you want to exclude 'union' from match
> 
> Perl 5:   /(.*?)(?=union|$)/
> Perl 6:   /(.*?) [<before union> | $$]/
> 

that's exceedingly slow, at least by my benchmark. So far, I've got 4 
possibilities:

my $regex1 = qr{(?:(?!union).)*}sx;
my $regex2 = qr{(?:[^u]+|u[^n]|un[^i]|uni[^o]|unio[^n])*}sx;
my $regex3 = qr{(?:[^u]+|(?!union).)*}sx;
my $regex4 = qr{(.*?)(?=union|$)}sx;

timethese
(
10,
{
'questionbang'  => sub { ($line =~ m"($regex1)"); },
'questionbang2' => sub { ($line =~ m"($regex3)"); },
'alternation'   => sub { ($line =~ m"($regex2)"); },
'nongreedy' => sub { ($line =~ m"($regex4)"); },
}
);


which come out:

alternation:  8 wallclock secs ( 7.71 usr +  0.00 sys =  7.71 CPU) @ 12970.17/s (n=10)
questionbang: 17 wallclock secs (16.05 usr +  0.00 sys = 16.05 CPU) @ 6230.53/s (n=10)
questionbang2:  8 wallclock secs ( 7.74 usr +  0.00 sys =  7.74 CPU) @ 12919.90/s (n=10)
nongreedy: 41 wallclock secs (41.74 usr +  0.00 sys = 41.74 CPU) @ 2395.78/s (n=10)


So yes, a form can be constructed out of ?! which is of approximately equal 
speed to the alternation.
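
A self-contained version of the comparison, for anyone who wants to rerun it.
This is a sketch under assumptions: the test string and the iteration count
are made up here (the original $line is not shown), and the script first
checks that all four forms capture the same prefix on this input.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Assumed test input; the original $line is not shown in the thread.
my $line = ("select sum(x) from accounts where y = 1 and z = 2 " x 20)
         . "union select 1";

my %regex = (
    questionbang  => qr{(?:(?!union).)*}sx,
    alternation   => qr{(?:[^u]+|u[^n]|un[^i]|uni[^o]|unio[^n])*}sx,
    questionbang2 => qr{(?:[^u]+|(?!union).)*}sx,
    nongreedy     => qr{(.*?)(?=union|$)}sx,
);

# Check that every form captures the same prefix before timing anything.
my ($expected) = $line =~ m/\A(.*?)union/s;
my %got;
for my $name (sort keys %regex) {
    ($got{$name}) = $line =~ m"($regex{$name})";
    die "$name disagrees\n" unless $got{$name} eq $expected;
}

# Build one timing sub per regex; the count is kept small on purpose.
my %tests;
for my $name (keys %regex) {
    my $re = $regex{$name};
    $tests{$name} = sub { my ($m) = $line =~ m"($re)"; };
}
timethese(2000, \%tests);
```

Absolute numbers will differ by perl version and input, so only the relative
ordering is worth reading from the output.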

However, in straight C, the corresponding time is:

2.31u 0.02s 0:02.37 98.3%

which tells me that a lot of optimisation could be made with a generic 
mechanism for (non)matching multi-byte character classes. The problem has 
to be dealt with anyways when considering unicode... And which form would people
rather type:

(<-[^u]>+|(?!union).)*

or
<-[^'union']>*

I'd say the second scores over the first in intuition, if nothing else...
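
Until something like that exists, the closest perl 5 stand-in is probably a
helper that builds the ?! form from a list of strings. A minimal sketch:
not_strings is a hypothetical name, not an existing function, and it gives
none of the optimiser benefits argued for above.

```perl
use strict;
use warnings;

# Hypothetical helper, not an existing Perl feature: build a perl 5 pattern
# that matches a run of text containing none of the given literal strings.
sub not_strings {
    my (@strings) = @_;
    my $alt = join '|', map { quotemeta($_) } @strings;
    return qr/(?:(?!$alt).)*/s;
}

my $sql = "select * from a where b union select * from c where d";

my $no_union = not_strings('union');
my ($head) = $sql =~ /\A($no_union)/;
print "[$head]\n";    # [select * from a where b ]

# Several strings and single characters can be excluded together.
my $no_union_or_parens = not_strings('union', '(', ')');
my ($plain) = "count (x) union y" =~ /\A($no_union_or_parens)/;
print "[$plain]\n";   # [count ]
```

The call site then reads about as directly as the proposed syntax, at the
cost of a sub call to build the pattern.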

Ed



Re: Perl6 summary for week beginning 2002-09-30

2002-10-06 Thread esp5

> Someone mysteriously known only as "Ed" asked what the favored syntax would be
> to match negative multi-byte strings in Perl 6. It wasn't entirely clear
> what the question was, but one thing is sure: the Perl 6 pattern matching
> engine will have a lot of scope for optimisation.

Oops, sorry, just realized my mailing header info didn't contain my full name
(Ed Peschko). 

Anyways, the point was that multi-byte non-matching support was abysmal, 
and to propose a new syntax.  The fact that the best thing people could come
up with was:

(?:(?!union).)*

after a long brainstorming session of false starts and false solutions just 
points to the fact that it could be simpler.

And I think that making the concept of "character class" more generic 
(into a 'string class' as it were, where alternations can take 
arbitrary-length strings) matches a class of real world problems closely.
Ex: nested begin/end loops, ie 

BEGIN
<-['BEGIN''END']>+ |

END

as well as giving strong hints to the optimiser to do the match fast.
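
For comparison, here is roughly how the nested BEGIN/END case looks in perl 5
today. This sketch assumes Perl 5.10 or later for named recursion; the group
name "block" and the sample text are invented for the example.

```perl
use strict;
use warnings;
use 5.010;    # named recursion (?&name) needs Perl 5.10 or later

# A block is BEGIN, then any mix of plain text and nested blocks, then END.
my $block = qr{
    (?<block>
        BEGIN
        (?:
            (?: (?! BEGIN | END ) . )++   # text containing neither keyword
          |
            (?&block)                     # a nested block
        )*
        END
    )
}sx;

my $text = "x BEGIN a BEGIN b END c END y";
my ($outer) = $text =~ $block;
print "[$outer]\n";    # [BEGIN a BEGIN b END c END]
```

The (?!BEGIN|END). part is exactly the piece that a 'string class' would
replace.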

Ed



Re: exegesis 5 question: matching negative, multi-byte strings

2002-10-07 Thread esp5

On Mon, Oct 07, 2002 at 07:11:08AM -0500, [EMAIL PROTECTED] wrote:
> > match negative multi-byte strings
> 
> 
> in perl5, I'd tend to do
> 
> m/(?:(?!union).)*/is
> 
> or to capture
> 
> m/((?:(?!union).)*)/is

yeah, I'm not arguing that there isn't a solution available, just that the 
solution is convoluted and somewhat un-intuitive (witness the many people
who write (.*)(?!union)).

Anyways, to make that perform with any speed you need to say:

((?:[^u]+|(?!union).)*)

which is uglier than sin...

Ed (Peschko)