huffman encoding and bit operators
Picked up from slashdot:

> I really dislike having a concatenation operator that's a valid identifier
> character. There's just no reason for it.
> And I don't completely buy the excuse that we're running out of punctuation
> characters. If you're going to jumble some of them up and talk about the
> "Huffman" reason for picking multi-character operators for some and not
> for others: let's do a little more housecleaning. Get rid of the bit
> operators as punctuation characters and make them named operators. That frees
> up some punctuation for something that's not used very often anyway.
> BTW, if you're under the impression that they're used frequently try:
>     for i in `find /usr/local/lib/perl5 -name '*.pm'`;
>     do if perl -wc $i >/dev/null 2>&1; then perl -MO=Terse $i | grep 'bit_';
>     fi; done >> bitops
> Sure it doesn't pick up everything (late compilation), but in the 1056 files
> I have there I had exactly 7 occurrences of bit operators. This does not
> impress me enough to use valuable punctuation.

Now, I just tried it on perl5.6.1, and roughly speaking he's right. I get eight. They are:

/usr/local/lib/perl5/5.6.1/sun4-solaris/B/Concise.pm
    SVOP (0x20c7a0) const PV (0x217ac4) "bit_and"
    SVOP (0x20c7c8) const PV (0x217ad0) "bit_xor"
    SVOP (0x20c7f0) const PV (0x217adc) "bit_or"
/usr/local/lib/perl5/5.6.1/sun4-solaris/File/Glob.pm
    BINOP (0x2003f8) bit_or [21]
/usr/local/lib/perl5/5.6.1/File/Temp.pm
    BINOP (0x2c7bf8) bit_or [13]
    BINOP (0x2c7b30) bit_or [11]
    BINOP (0x2c7f40) bit_or [19]
    BINOP (0x2c2118) bit_or [29]

So, maybe this would be a good way to supplement the language design: categorizing how popular given operators are in actual practise?

Ed

(ps- as for the concatenation operator remark, all I can say is that I hope that something else is done, else statements like

    $my_variable _ FUNCTION

are going to look awfully confusing...)
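The idea of ranking operators by how often they appear could be sketched roughly as below. This is only an illustration: `tally_ops` is a hypothetical helper, and it assumes input lines shaped like the `perl -MO=Terse` output quoted in the message ("CLASS (0xADDR) opname ..."). The sample lines are the eight hits listed above.

```perl
use strict;
use warnings;

# Sample lines copied from the message above (Terse-style output).
my @terse_lines = (
    'SVOP (0x20c7a0) const PV (0x217ac4) "bit_and"',
    'SVOP (0x20c7c8) const PV (0x217ad0) "bit_xor"',
    'SVOP (0x20c7f0) const PV (0x217adc) "bit_or"',
    'BINOP (0x2003f8) bit_or [21]',
    'BINOP (0x2c7bf8) bit_or [13]',
    'BINOP (0x2c7b30) bit_or [11]',
    'BINOP (0x2c7f40) bit_or [19]',
    'BINOP (0x2c2118) bit_or [29]',
);

# Hypothetical helper: count op names, assuming each line looks like
# "CLASS (0xADDR) opname ...". Lines that do not fit are skipped.
sub tally_ops {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        next unless $line =~ /^\s*[A-Z]+\s+\(0x[0-9a-f]+\)\s+(\w+)/;
        $count{$1}++;
    }
    return \%count;
}

my $tally = tally_ops(@terse_lines);
printf "%-8s %d\n", $_, $tally->{$_} for sort keys %$tally;
```

Counting by op name rather than grepping for 'bit_' keeps string constants that merely contain an operator's name apart from real uses of the operator.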
stringification of objects, subroutine refs
I was wondering how perl6 would stringify (as in Data::Dumper):

1) objects with 'my' and 'our' variables
2) $.property
3) subroutines (and coderefs)
4) properties (both is/has)

Right now, the fact that subroutines come out as 'sub { "DUMMY" }' is a minor pain to work around; having all of these as holes would be too much IMO.

Ed
regex and xml/html/*ml
hmm. Just read (skimmed) apocalypse 5, had one concern - it looks like we are on a serious collision course with parsing the various *mls.

before: m#..etc#
after m#\\\#

Also, the space being backslashed sort of bugs me. Surely there is going to be a 'non-x' modifier? And perhaps a modifier to change the character for logical tags from <> to something else (like <<>>, perhaps?)

Ed
Re: Apoc 5 questions/comments
>> Can we please have a 'reverse x' modifier that means "treat whitespace as
>> literals"?

> I'll talk about that with Larry. If he were to approve it, it might possibly
> be :W.

Likewise, could we please have a modifier that makes <> literal, and aliases <> as something else so *ml can match easier?

The most serious objection to this was "well, use modules for matching *ml" - which simply points out that the current incarnation of perl6 regex doesn't handle a very large class of matching problems very well.

Ed
Re: Apoc 5 questions/comments
On Fri, Jun 07, 2002 at 05:10:49PM -0400, Trey Harris wrote:
> In a message dated Fri, 7 Jun 2002, [EMAIL PROTECTED] writes:
> > The most serious objection to this was "well, use modules for matching *ml" -
> > which simply points out that the current incarnation of perl6 regex doesn't
> > handle a very large class of matching problems very well.
>
> I don't think that's what people were saying at all. They were saying you
> should use modules, not because it's too hard to do in Perl 6 regexes, but
> because *ml are well-formed, well-published languages and it doesn't make
> sense to reinvent the wheel when you're nearly certain to miss cases
> handled by the standard modules.

hmm. I thought that was perl's forte - doing quick and dirty, small scripts. No matter how intuitive, well thought-out, or polished, working through a module is always going to be more restrictive than doing it through a regular expression. It might be better in some cases, yes, but sometimes you just want the freedom to do stuff by hand.

> Unless I'm missing something, I'm assuming that those modules, when
> rewritten in Perl 6, will be able to dump the specialized parsers and go
> to using grammars as given in A5.

No, you're not missing anything. I just don't want to be forced to use modules/rules, that's all. And I *don't* want to backslash every damn $@#$% < I see in an XML document.

We have syntactic sugar to stop people from having to backslash \ in Windows paths, and to stop people from having to backslash / inside of regular expressions. I'd argue that being able to match *ml cleanly (and without modules or rules or APIs) would be a hell of a lot more important.

Ed

(ps - and no, I don't want to be forced to go back to using perl5's regex. If people do, that just shows the shortcomings of the perl6 system, IMO)
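For reference, the Perl 5 sugar mentioned above looks something like the following. This is a small sketch with made-up sample data (`$path` and `$xml` are illustrative only): picking a different delimiter removes the need to backslash `/`, and `<` and `>` are ordinary characters in a Perl 5 pattern.

```perl
use strict;
use warnings;

# Illustrative sample data, not taken from the thread.
my $path = "/usr/local/lib/perl5";
my $xml  = "<doc><title>Apocalypse 5</title></doc>";

# With the default / delimiter every slash in the path must be escaped.
my ($escaped) = $path =~ /^\/usr\/local\/(\w+)/;

# With m{...} the same pattern needs no backslashes at all.
my ($clean) = $path =~ m{^/usr/local/(\w+)};

# In Perl 5, < and > are plain literals, so tags match as written.
my ($title) = $xml =~ m{<title>(.*?)</title>};

print "$escaped $clean $title\n";
```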
exegesis 5 question: matching negative, multi-byte strings
I was wondering what the favored syntax in perl6 would be to match negative multi-byte strings. In perl 5:

    $sql = "select * from a where b union select * from c where d";
    my $nonunion = "[^u]|u[^n]|un[^i]|uni[^o]|unio[^n]";
    my (@subsqls) = ($sql =~ m"((?:$nonunion)*)");

guaranteeing that the subsqls have all text up to, but not including the string "union".

I suppose I could say:

    rule nonunion { (.*) :: { fail if ($1 =~ m"union$"); } }

although that seems awfully slow. I suppose that I could also do the same thing in perl6 as I did in perl5, although that gets ugly if you need to combine matching strings without "union" in them with, say, parens:

    rule parens { \( [ <-[()]>+ : | <parens> ]* \) }
    rule non_union_non_parens { [ <-[()u]> | u <-[()n]> | un <-[()i]> | uni <-[()o]> | unio <-[()n]> ] }
    my (@subsqls) = ($sql =~ m" ([ <non_union_non_parens> | <parens> ]*) ");

And finally, I suppose I could write a sql grammar (which for this application, and most, is definitely overkill).

So I guess I'd like something shorter, something where you could say:

    <-["union"]>
or
    <-["union"\(\)]>
or
    <-["union""select"\(\)]>

a generic negative, multi-byte string matching mechanism. Any thoughts? Am I missing something already present or otherwise obvious?

Ed
Re: exegesis 5 question: matching negative, multi-byte strings
On Tue, Oct 01, 2002 at 01:24:45PM -0600, Luke Palmer wrote:
> > [Negative matching]
> > a generic negative, multi-byte string matching mechanism. Any thoughts?
> > Am I missing something already present or otherwise obvious?
>
> Maybe I'm misunderstanding the question, but I think you want negative
> lookahead:
>
> Perl 5: /(.*)(?!union)/
> Perl 6: /(.*) <!before union>/
>
> Luke

no, that doesn't work, because of the way regexes operate. The '.*' captures everything, and since the string after everything (ie: the end of the string) doesn't match 'union', the regex succeeds without backtracking. Try it:

    perl -e ' $a = "this has the string union in it";
    my ($b) = ($a =~ m"(.*)(?!union)"); print $b;'

prints:

    this has the string union in it

not 'this has the string'.

Ed
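The difference can be seen side by side in Perl 5. The sketch below is illustrative only; the variable names are made up, and the second pattern applies the lookahead before each character rather than once after a greedy `.*`.

```perl
use strict;
use warnings;

my $text = "this has the string union in it";

# Greedy .* swallows the whole string; the lookahead is then tested at the
# very end, where "union" does not follow, so the match succeeds at once.
my ($greedy) = $text =~ /(.*)(?!union)/;

# Testing the lookahead before every single character stops the match
# at the first position where "union" begins.
my ($per_char) = $text =~ /((?:(?!union).)*)/;

print "greedy:   [$greedy]\n";
print "per-char: [$per_char]\n";
```

Note that the per-character form keeps the trailing space before "union", since only the keyword itself is excluded.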
Re: exegesis 5 question: matching negative, multi-byte strings
On Tue, Oct 01, 2002 at 06:32:07PM -0400, Mike Lambert wrote:
> > guaranteeing that the subsqls have all text up to, but not including the string
> > "union".
> >
> > I suppose I could say:
> >
> > rule nonunion { (.*) :: { fail if ($1 =~ m"union$"); } }
>
> What's wrong with: ?
>
> rule getstuffbeforeunion { (.*?) union | (.*) }
>
> "a union" => "a "
> "b" => "b"
>
> Am I missing something here?
>
> Mike Lambert

hmm... well, it works, but it's not very efficient. It basically scans the whole string to the end to see if there is a "union" string, and then backtracks to take the alternative. And hence, it's not very scalable. It also doesn't 'complexify' very well.

Suppose you had a long string of text, and you wanted to 'harden' your regex against the substring union appearing in double-quoted strings, single-quoted strings, etc. etc, without writing a sql parser. I just don't see how to do this with ? - I would do something like (taking a page from Mr. Friedl's book) -

    rule regex_matching_sql { [ <-[u()"']>+ : | : | : | : | ]* }
    rule parens { \( [ <-["'()]>+ : | : | : | ]* \) }
    rule single_string { \' [ <-[\'\\]>+ : | \.\' ]* \' }
    rule double_string { \" [ <-[\"\\]>+ : | \.\" ]* \" }
    rule non_union { [ u < - ['"()n] > | un ... | uni ... | unio ... | u$ ] * }

Of course I could also be missing something, but I just don't see how to do this with .*?.

Ed

(ps: As for:

    /(.*) /

I'm not sure how that works, or whether it's very 'complexifiable' (as per above). If it does a match against every single substring (take all characters, look for union, if it exists, roll back a character, do the same thing, etc. etc. etc.) then this isn't good enough.

The non_union rule listed above is about as efficient as it can get; it does no backtracking, and it keeps the common matches up front so they match first without alternation.)
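In Perl 5 terms, part of the 'hardening' described above could be sketched as follows. This is an illustration under stated assumptions, not the rules from the message: it handles only single- and double-quoted strings (no nested parens), the names `$sq`, `$dq` and `$before_union` are made up, and backslash is assumed to be the escape character inside quotes.

```perl
use strict;
use warnings;

# Quoted strings are consumed whole, so a "union" inside them is ignored.
# Assumption: backslash escapes the next character inside a quoted string.
my $sq = qr/'(?:[^'\\]+|\\.)*'/s;
my $dq = qr/"(?:[^"\\]+|\\.)*"/s;

# Everything up to the first "union" that is outside any quoted string.
my $before_union = qr/
    (?: [^u'"]+        # common case first: runs with no u and no quote
      | $sq            # a complete single-quoted string
      | $dq            # a complete double-quoted string
      | (?!union) .    # any other character that does not start "union"
    )*
/sx;

# Illustrative input, not taken from the thread.
my $sql = q{select 'union' from a union select "x" from c};
my ($first) = $sql =~ /^($before_union)/;

print "[$first]\n";
```

Putting the run of ordinary characters first follows the point made above about keeping the common matches up front.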
Re: exegesis 5 question: matching negative, multi-byte strings
On Tue, Oct 01, 2002 at 05:24:43PM -0400, Peter Behroozi wrote:
> On Tue, 2002-10-01 at 16:44, [EMAIL PROTECTED] wrote:
> > doesn't work (just tried it out, not sure why it doesn't) but even if it did,
> > it would be awful slow. It would try one character, look at the next for the
> > string union, come back for the next character, look for the string union,
> > etc. etc. etc.
> >
> > whereas
> >
> > ([^u]+|u[^n])
> >
> > doesn't do any backtracking at all..
> >
> > Ed
>
> perl -e ' $a = "this has the string union in it";
> my ($b) = ($a =~ m"((?:(?!union).)*)"); print $b;'
>
> prints the desired result for me at least. It also should be comparably

whoops. Must have mistyped. Works for me now.

> efficient to the alternation since the match for the string 'union'
> should fail if the first character is not 'u', etc. The alternation
> also matches a character at a time except in special cases, where I am
> reasonably sure that the extra overhead from alternation compensates for
> multi-character matching. This method also does no backtracking for the
> provided example; I am not sure what made you think that it did.
>
> Peter

well, when I said backtracking, I meant it didn't flip between the current character and the next. I couldn't check real numbers doing benchmarking because the ?! construct core dumps on both perl-5.6.1 and perl-5.8 on large strings. But when benchmarked on small (30 line) strings using:

    my $regex1 = qr{(?:(?!union).)*}sx;
    my $regex2 = qr{(?:[^u]+|u[^n]|un[^i]|uni[^o]|unio[^n])*}sx;

    timethese (10,
        {
            'questionbang' => sub { my ($b) = ($line =~ m"($regex1)"); },
            'alternation'  => sub { my ($b) = ($line =~ m"($regex2)"); }
        }
    );

I get:

    Benchmark: timing 10 iterations of alternation, questionbang...
    alternation: 11 wallclock secs (10.66 usr + 0.00 sys = 10.66 CPU) @ 9380.86/s (n=10)
    questionbang: 18 wallclock secs (18.81 usr + 0.00 sys = 18.81 CPU) @ 5316.32/s (n=10)

so ?! is a bit slower. It could probably be made faster though.
However, I'm still skeptical of it being a good replacement for the alternation. Look at my posted message (about making the regex able to handle nested parens, etc) and see if you can come up with an easy way to handle the case I mentioned there..

Ed
Re: exegesis 5 question: matching negative, multi-byte strings
On Wed, Oct 02, 2002 at 10:39:17AM +0300, Markus Laire wrote:
> On 1 Oct 2002 at 18:47, [EMAIL PROTECTED] wrote:
>
> > > > all text up to, but not including the string "union".
> > >
> > > rule getstuffbeforeunion { (.*?) union | (.*) }
> > >
> > > "a union" => "a "
> > > "b" => "b"
> >
> > hmm... well, it works, but its not very efficient. It basically
> > scans the whole string to the end to see if there is a "union" string, and
> > then backtracks to take the alternative. And hence, its not very scalable.
> > It also doesn't 'complexify' very well.
>
> What about
>
> Perl 5: /(.*?)(?:union|$)/
> Perl 6: /(.*?) [union | $$]/
>
> or if you want to exclude 'union' from match
>
> Perl 5: /(.*?)(?=union|$)/
> Perl 6: /(.*?) [ | $$]/

that's exceedingly slow, at least by my benchmark. So far, I've got 4 possibilities:

    my $regex1 = qr{(?:(?!union).)*}sx;
    my $regex2 = qr{(?:[^u]+|u[^n]|un[^i]|uni[^o]|unio[^n])*}sx;
    my $regex3 = qr{(?:[^u]+|(?!union).)*}sx;
    my $regex4 = qr{(.*?)(?=union|$)}sx;

    timethese
    (
        10,
        {
            'questionbang'  => sub { ($line =~ m"($regex1)"); },
            'questionbang2' => sub { ($line =~ m"($regex3)"); },
            'alternation'   => sub { ($line =~ m"($regex2)"); },
            'nongreedy'     => sub { ($line =~ m"($regex4)"); },
        }
    );

which come out:

    alternation: 8 wallclock secs ( 7.71 usr + 0.00 sys = 7.71 CPU) @ 12970.17/s (n=10)
    questionbang: 17 wallclock secs (16.05 usr + 0.00 sys = 16.05 CPU) @ 6230.53/s (n=10)
    questionbang2: 8 wallclock secs ( 7.74 usr + 0.00 sys = 7.74 CPU) @ 12919.90/s (n=10)
    nongreedy: 41 wallclock secs (41.74 usr + 0.00 sys = 41.74 CPU) @ 2395.78/s (n=10)

So yes, a form can be constructed out of ?! which is of approximately equal speed to the alternation. However, in straight C, the corresponding time is:

    2.31u 0.02s 0:02.37 98.3%

which tells me that a lot of optimisation could be made with a generic mechanism for (non)matching multi-byte character classes. The problem has to be dealt with anyways when considering unicode...
And which form would people rather type:

    (<-[^u]>+|(?!union).)*

or

    <-[^'union']>*

I'd say the second scores over the first in intuition, if nothing else...

Ed
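Before comparing timings it is worth confirming that the candidates agree on what they capture. The check below is a sketch only: it runs the four Perl 5 patterns from the benchmark (with a capture group wrapped around the first three) against the sample SQL string from the start of the thread; the hash names `%candidate` and `%prefix` are made up.

```perl
use strict;
use warnings;

# Sample input from the first message in this thread.
my $sql = "select * from a where b union select * from c where d";

# The four benchmark candidates, each capturing the text before "union".
my %candidate = (
    questionbang  => qr{((?:(?!union).)*)}sx,
    alternation   => qr{((?:[^u]+|u[^n]|un[^i]|uni[^o]|unio[^n])*)}sx,
    questionbang2 => qr{((?:[^u]+|(?!union).)*)}sx,
    nongreedy     => qr{(.*?)(?=union|$)}sx,
);

my %prefix;
for my $name (sort keys %candidate) {
    # A match in list context returns the captured groups.
    ($prefix{$name}) = $sql =~ $candidate{$name};
    print "$name: [$prefix{$name}]\n";
}
```

This only shows agreement on one input; it says nothing about speed, and other inputs would need their own checks.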
Re: Perl6 summary for week beginning 2002-09-30
> Someone mysteriously known only as "Ed" asked what the favored syntax would be
> to match negative multi-byte strings in Perl 6. It wasn't entirely clear
> what the question was, but one thing is sure: the Perl 6 pattern matching
> engine will have a lot of scope for optimisation.

Oops, sorry, just realized my mailing header info didn't contain my full name (Ed Peschko).

Anyways, the point was that multi-byte non-matching support was abysmal, and to propose a new syntax. The fact that the best thing people could come up with was:

    (?:(?!union).)*

after a long brainstorming session of false starts and false solutions just points to the fact that it could be simpler.

And I think that making the concept of "character class" more generic (into a 'string class' as it were, where alternations can take arbitrary-length strings) would match a class of real world problems closely. Ex: nested begin/end loops, ie

    BEGIN <-['BEGIN''END']>+ | END

as well as giving strong hints to the optimiser to do the match fast.

Ed
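For comparison, here is roughly what the nested BEGIN/END case takes in Perl 5 today. This is a sketch under assumptions: it needs Perl 5.10 or later (named groups, `(?&name)` recursion and possessive quantifiers postdate this thread), and the group name `block` and the sample text are made up.

```perl
use strict;
use warnings;
use 5.010;    # named captures, (?&name) recursion, possessive quantifiers

# Illustrative input with one level of nesting.
my $text = "x BEGIN a BEGIN b END c END y";

my $matched;
if ($text =~ /
        (?<block>
            BEGIN
            (?: (?: (?!BEGIN|END) . )++   # text containing neither keyword
              | (?&block)                 # or a complete nested block
            )*
            END
        )
    /sx) {
    $matched = $+{block};
}

print "[$matched]\n";
```

The `(?: (?!BEGIN|END) . )++` part is the long-hand spelling of the "anything except these strings" idea that the proposed string class would express directly.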
Re: exegesis 5 question: matching negative, multi-byte strings
On Mon, Oct 07, 2002 at 07:11:08AM -0500, [EMAIL PROTECTED] wrote:
> > match negative multi-byte strings
>
> in perl5, I'd tend to do
>
> m/(?:(?!union).)*/is
>
> or to capture
>
> m/((?:(?!union).)*)/is

yeah, I'm not arguing that there isn't a solution available, just that the solution is convoluted and somewhat un-intuitive (witness the many people who write (.*)(?!union)). Anyways, to make that perform with any speed you need to say:

    ((?:[^u]+|(?!union).)*)

which is uglier than sin...

Ed (Peschko)