date:20020118

RE: on parrot strings

2002-01-18 Thread Brent Dax


Jarkko Hietaniemi: 
About the implementation of character classes: since the Unicode code
point range is big, a single big bitmap won't work any more: firstly,
it would be big.  Secondly, for most cases, it would be wastefully
sparse.  A balanced binary tree of (begin, end) points of ranges is
suggested.  That would seem to give the required flexibility and
reasonable compromise between speed and space for implementing the
operations required by both the traditional regular expression
character classes (complement, case-ignorance) and the new Unicode
character class semantics (difference, intersection) (see the Unicode
Technical Report #18, I,
http://www.unicode.org/unicode/reports/tr18/ )

Another, possibly simpler, way would be to use inversion lists:
1-dimensional arrays where even (starting from zero) indices store
the beginnings of ranges belonging to the class, and odd indices
store the beginnings of ranges not belonging to the class.
Note "array" instead of (a linked) "list": with an array one can do
binary search to determine membership (so an inversion list is in
effect a flattened binary tree).  Yet another way would be to use
various two-level table schemes.  The choice of the appropriate data
structure, as always, depends on the expected operational (read vs
modify) mix and the expected data distribution.
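
A minimal sketch of the inversion-list lookup described above, assuming the usual convention that index 0 starts the first range inside the class. The names InvList and invlist_contains are made up for illustration and are not Parrot APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inversion list: sorted code points at which membership flips.
 * Even indices (0, 2, ...) start ranges inside the class,
 * odd indices start ranges outside it. */
typedef struct {
    const uint32_t *bounds;  /* strictly increasing */
    size_t          n;       /* number of boundaries */
} InvList;

/* Return 1 if cp is in the class, 0 otherwise.  O(log n). */
int invlist_contains(const InvList *il, uint32_t cp)
{
    size_t lo = 0, hi;

    assert(il != NULL);
    hi = il->n;
    /* Binary search for the number of boundaries <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (il->bounds[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* An odd count means we are past a "start" but not yet past
     * the matching "end", i.e. inside the class. */
    return (int)(lo & 1);
}
```

With this layout the class [0-9A-Z] is the four-element array { 0x30, 0x3A, 0x41, 0x5B }.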

###

Since I seem to be the main regex hacker for Parrot, I'll respond to
this as best I can.

Currently, we are using bitmaps for character classes.  Well, sort of.
A Bitmap in Parrot is defined like this:

typedef struct bitmap_t {
    char*   bmp;
    STRING* bigchars;
} Bitmap;

Characters <256 are stored as a bitmap in bmp; other characters are
stored in bigchars and linear-searched.  This is a temporary measure,
since Parrot isn't yet dealing with many characters outside of ASCII.
Several schemes have been proposed for the final version; I'm currently
leaning towards an array of arrays of arrays of bitmaps (one level for
each byte of the character):

INTVAL ch;
return bmp->bmp[FIRST_BYTE(ch)][SECOND_BYTE(ch)][THIRD_BYTE(ch)]
               [FORTH_BYTE(ch) >> 3]
       & (1 << (FORTH_BYTE(ch) & 7));

Ungainly, but it works.  It would actually be a bit more
complicated--only the arrays that we actually used would be allocated to
save space--but you get the idea.  (However, I'm quite flexible on the
implementation chosen.  I'll look at the ideas you propose in more
detail; if anyone else has any suggestions, suggest them.)
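
A rough sketch of what the "only allocate the arrays actually used" variant might look like, assuming one table level per byte of a 32-bit character. SparseBitmap, sbm_set, sbm_test and sbm_free are hypothetical names; nothing here is existing Parrot code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Leaf: 256 bits indexed by the lowest byte of the character. */
typedef struct { unsigned char bits[32]; } Leaf;
typedef struct { Leaf   *leaf[256]; } Level3;       /* third byte  */
typedef struct { Level3 *l3[256];   } Level2;       /* second byte */
typedef struct { Level2 *l2[256];   } SparseBitmap; /* first byte  */

#define BYTE_N(ch, n) (((ch) >> (8 * (n))) & 0xFFu)

/* Set the bit for ch, allocating levels on demand.
 * Returns 0 on success, -1 if allocation fails.
 * Assumes calloc'd (all-zero) pointers compare equal to NULL. */
int sbm_set(SparseBitmap *bm, uint32_t ch)
{
    assert(bm != NULL);
    Level2 **p2 = &bm->l2[BYTE_N(ch, 3)];
    if (!*p2 && !(*p2 = calloc(1, sizeof **p2))) return -1;
    Level3 **p3 = &(*p2)->l3[BYTE_N(ch, 2)];
    if (!*p3 && !(*p3 = calloc(1, sizeof **p3))) return -1;
    Leaf **pl = &(*p3)->leaf[BYTE_N(ch, 1)];
    if (!*pl && !(*pl = calloc(1, sizeof **pl))) return -1;
    (*pl)->bits[BYTE_N(ch, 0) >> 3] |=
        (unsigned char)(1u << (BYTE_N(ch, 0) & 7));
    return 0;
}

/* Return 1 if ch is set; a missing level means "not in the class". */
int sbm_test(const SparseBitmap *bm, uint32_t ch)
{
    assert(bm != NULL);
    const Level2 *l2 = bm->l2[BYTE_N(ch, 3)];
    if (!l2) return 0;
    const Level3 *l3 = l2->l3[BYTE_N(ch, 2)];
    if (!l3) return 0;
    const Leaf *lf = l3->leaf[BYTE_N(ch, 1)];
    if (!lf) return 0;
    return (lf->bits[BYTE_N(ch, 0) >> 3] >> (BYTE_N(ch, 0) & 7)) & 1;
}

/* Release every allocated level and leave the bitmap empty. */
void sbm_free(SparseBitmap *bm)
{
    for (int i = 0; i < 256; i++) {
        Level2 *l2 = bm->l2[i];
        if (!l2) continue;
        for (int j = 0; j < 256; j++) {
            Level3 *l3 = l2->l3[j];
            if (!l3) continue;
            for (int k = 0; k < 256; k++)
                free(l3->leaf[k]);
            free(l3);
        }
        free(l2);
        bm->l2[i] = NULL;
    }
}
```

Note the cost: on a machine with 8-byte pointers each intermediate level is a 2KB pointer table, so a single character outside the first block already pulls in several kilobytes.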

As for character encodings, we're forcing everything to UTF-32 in
regular expressions.  No exceptions.  If you use a string in a regex,
it'll be transcoded.  I honestly can't think of a better way to
guarantee efficient string indexing.
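
To illustrate why transcoding buys constant-time indexing, here is a small sketch of a UTF-8 to UTF-32 decoder; once decoded, character i is simply dst[i]. The function name utf8_to_utf32 is an assumption for this example, not Parrot's actual transcoding interface, and the validation is deliberately incomplete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode UTF-8 in src[0..len) into dst (capacity cap code points).
 * Returns the number of code points written, or (size_t)-1 on
 * truncated or malformed input, or if dst is too small.
 * Simplification: overlong forms, surrogates and values above
 * U+10FFFF are not rejected; a real transcoder should reject them. */
size_t utf8_to_utf32(const unsigned char *src, size_t len,
                     uint32_t *dst, size_t cap)
{
    size_t i = 0, n = 0;

    assert(src != NULL && dst != NULL);
    while (i < len) {
        uint32_t cp;
        size_t need;                  /* continuation bytes expected */
        unsigned char b = src[i++];

        if (b < 0x80)                { cp = b;        need = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; need = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; need = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; need = 3; }
        else return (size_t)-1;       /* stray continuation or bad lead */

        if (need > len - i) return (size_t)-1;   /* truncated sequence */
        while (need--) {
            unsigned char c = src[i++];
            if ((c & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (n == cap) return (size_t)-1;
        dst[n++] = cp;
    }
    return n;
}
```

The price is the one discussed in the replies: four bytes per character, however small the code point.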

--Brent Dax
[EMAIL PROTECTED]
Parrot Configure pumpking and regex hacker

 . hawt sysadmin chx0rs
 This is sad. I know of *a* hawt sysamin chx0r.
 I know more than a few.
 obra: There are two? Are you sure it's not the same one?

Does this mean we get Ruby/CLU-style iterators?

2002-01-18 Thread Michael G Schwern


Reading this in Apoc 4

sub mywhile ($keyword, &condition, &block) {
    my $l = $keyword.label;
    while (&condition()) {
        &block();
        CATCH {
            my $t = $!.tag;
            when X::Control::next { die if $t && $t ne $l; next }
            when X::Control::last { die if $t && $t ne $l; last }
            when X::Control::redo { die if $t && $t ne $l; redo }
        }
    }
}

Implies to me:

A &foo prototype means you can have a bare block anywhere in the
arg list (unlike the perl5 syntax).

Calling &foo() does *not* affect the call stack, otherwise the
above would not properly emulate a while loop.

If that's true, can I pull off my custom iterators?
http:[EMAIL PROTECTED]/msg08343.html

Will this:

class File;
sub foreach ($file, &block) {
    # yeah, I know.  The RFC was all about exceptions and I'm
    # not using them in this example.
    open(FILE, $file) || die $!;

    while(<FILE>) {
        &block();
    }

    close FILE;
}

allow this:

File.foreach('/usr/dict/words') { print }

or would the prototype be (&file, &block)?

And would this:

my $caller = caller;
File.foreach('/usr/dict/words') {
    print $caller eq caller ? "ok" : "not ok"
}

be ok or not ok?  It has to be ok if mywhile is going to emulate a
while loop.


-- 

Michael G. Schwern   <[EMAIL PROTECTED]>http://www.pobox.com/~schwern/
Perl Quality Assurance  <[EMAIL PROTECTED]> Kwalitee Is Job One
navy ritual:
first caulk the boards of the deck,
then plug up my ass.
-- japhy

Re: on parrot strings

2002-01-18 Thread Bryan C. Warnock

Thanks, Jarkko.

On Thursday 17 January 2002 23:21, Jarkko Hietaniemi wrote:
> The most important message is that give up on 8-bit bytes, already.
> Time to move on, chop chop.

Do you think/feel/wish/demand that the textual (string) APIs should differ 
from the binary (byte) APIs?  (Both from an internal Parrot perspective and 
at the language level.)

This may be beyond the scope of the document, but do you have an opinion on 
whether strings need to be entirely encapsulated within a single structure, 
or whether "virtual" strings (comprising several disparate substrings) are a 
viable addition?  

typedef struct {
    UINTVAL              size;
    UINTVAL              index;
    UINTVAL              index_offset;
    UINTVAL              last_offset;
    UINTVAL              size_valid:1;
    UINTVAL              offset_valid:1;
    UINTVAL              last_valid:1;
    UINTVAL              continued:1;
    PARROT_STRING        string;
    PARROT_SIZED_STRING  string_continued;
} PARROT_SIZED_STRING;

This was discussed earlier mostly for alleviating some of the headaches 
associated with variable-width encodings. 

-- 
Bryan C. Warnock
[EMAIL PROTECTED]

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi

On Fri, Jan 18, 2002 at 04:51:07AM -0500, Bryan C. Warnock wrote:
> Thanks, Jarkko.
> 
> On Thursday 17 January 2002 23:21, Jarkko Hietaniemi wrote:
> > The most important message is that give up on 8-bit bytes, already.
> > Time to move on, chop chop.
> 
> Do you think/feel/wish/demand that the textual (string) APIs should differ 
> from the binary (byte) APIs?  (Both from an internal Parrot perspective and 
> at the language level.)

I tried to address this issue at two points in the document, "Of Bits
and Bytes", and one paragraph in "TO DO" talking about encoding
conversions and I/O.  But I guess the answer is "yes and yes", I think
the APIs should be different.  It pains my UNIX heart but thinking in
terms of just bytes was a convenient illusion that worked as long as we
kept ourselves to 8-bit byte character sets.  I think the illusion
works no more.

> This may be beyond the scope of the document, but do you have an opinion on 
> whether strings need to be entirely encapsulated within a single structure, 
> or whether "virtual" strings (comprising several disparate substrings) are a 
> viable addition?  
> 
>   typedef struct {
>       UINTVAL              size;
>       UINTVAL              index;
>       UINTVAL              index_offset;
>       UINTVAL              last_offset;
>       UINTVAL              size_valid:1;
>       UINTVAL              offset_valid:1;
>       UINTVAL              last_valid:1;
>       UINTVAL              continued:1;
>       PARROT_STRING        string;
>       PARROT_SIZED_STRING  string_continued;
>   } PARROT_SIZED_STRING

First off, I think virtual strings (if you define strings as "a linear
collection of characters (or bytes)") are a great idea, that's why I
suggested them a while ago even in the context of Perl 5 (though I
admit I also simply liked the proposed name: VVs...)  But I also think
they are high-level enough that they probably should not be any of the
low-level string structures.  For example: one nifty thing you can do
with virtual strings is that they can be read-only windows to another
string, and I don't think the read-onlyness flag belongs to the
low-level strings: it's something coming from above.  Similarly
for virtual strings composed of slices of several other strings:
how do you manage the book-keeping of these other strings?  Too complex:
let's keep the low-level, ummm, low-level.
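
A small sketch of that layering, under the assumption that the low-level string is a plain array of fixed-width code points: the read-only window lives entirely in a separate, higher-level structure that only borrows the base string. LowString and StringWindow are hypothetical types, not Parrot's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical low-level string: knows nothing about windows,
 * read-only flags or any other policy from above. */
typedef struct {
    const uint32_t *chars;
    size_t          len;
} LowString;

/* Higher-level "virtual string": a read-only window onto part of a
 * LowString.  It borrows the base and never copies or modifies it. */
typedef struct {
    const LowString *base;
    size_t           start;
    size_t           len;
} StringWindow;

/* Returns 0 on success, -1 if the window would run past the base. */
int window_init(StringWindow *w, const LowString *base,
                size_t start, size_t len)
{
    assert(w != NULL && base != NULL);
    if (start > base->len || len > base->len - start)
        return -1;
    w->base  = base;
    w->start = start;
    w->len   = len;
    return 0;
}

/* Fetch character i of the window.  Returns 0 on success, -1 if i is
 * out of range (in which case *out is left untouched). */
int window_char_at(const StringWindow *w, size_t i, uint32_t *out)
{
    assert(w != NULL && out != NULL);
    if (i >= w->len)
        return -1;
    *out = w->base->chars[w->start + i];
    return 0;
}
```

Even this toy leaves the book-keeping question open: nothing stops the base string from being freed while a window still points at it.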

> This was discussed earlier mostly for alleviating some of the headaches 
> associated with variable-width encodings. 

If we keep the low-level limited to just a handful of encodings
(I proposed three), and the variable encodings well-behaved (UTF-8 as
opposed to the gnarlier ones), I don't think the burden will be too bad.


-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi


> Since I seem to be the main regex hacker for Parrot, I'll respond to
> this as best I can.
> 
> Currently, we are using bitmaps for character classes.  Well, sort of.
> A Bitmap in Parrot is defined like this:
> 
>   typedef struct bitmap_t {
>       char*   bmp;
>       STRING* bigchars;
>   } Bitmap;
> 
> Characters <256 are stored as a bitmap in bmp; other characters are
> stored in bigchars and linear-searched.  This is a temporary measure,

This is similar to how Perl 5 does them: the low eight bits are in a
32-byte bitmap, the "wide characters" are stored after it (in a funky
data structure, I won't go into more detail so that people won't lose
their lunch/breakfast/meal)

> since Parrot isn't yet dealing with many characters outside of ASCII.
> Several schemes have been proposed for the final version; I'm currently
> leaning towards an array of arrays of arrays of bitmaps (one level for
> each byte of the character):
> 
>   INTVAL ch;
>   return bmp->bmp[FIRST_BYTE(ch)][SECOND_BYTE(ch)][THIRD_BYTE(ch)]
>                  [FORTH_BYTE(ch) >> 3]
>          & (1 << (FORTH_BYTE(ch) & 7));

dup + dup *  ... oh, you meant FOURTH.

> Ungainly, but it works.  It would actually be a bit more
> complicated--only the arrays that we actually used would be allocated to
> save space--but you get the idea.  (However, I'm quite flexible on the
> implementation chosen.  I'll look at the ideas you propose in more
> detail; if anyone else has any suggestions, suggest them.)

Ungainly, yes.

(1) There are 5.125 bytes in Unicode, not four.
(2) I think the above would suffer from the same problem as one common
suggestion, two-level bitmaps (though I think the above would suffer
less, being of finer granularity): the problem is that a lot of
space is wasted, since the "usage patterns" of Unicode character
classes tend to be rather scattered and irregular.  Yes, I see
that you said: "only the arrays that we actually used would be
allocated to save space"-- which reads to me: much complicated
logic both in creation and access to make the data structure *look*
simple.  I'm a firm believer in getting the data structures right,
after which the code to access them almost writes itself.

I would suggest the inversion lists for the first try.  As long as
character classes are not very dynamic once they have been created
(and at least traditionally that has been the case), inversion lists
should work reasonably well.
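
As a sketch of why inversion lists suit the traditional operations: complementing a class only adds or removes a leading boundary at code point 0. This assumes the convention that even indices start in-class ranges and that an unterminated final range runs to the end of the code space; invlist_complement is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the complement of the inversion list in[0..n) to out and
 * return its length.  out must have room for n + 1 entries.
 * If the class already starts at 0, dropping that boundary flips
 * every range; otherwise prepending 0 does the same. */
size_t invlist_complement(const uint32_t *in, size_t n, uint32_t *out)
{
    size_t i, m = 0;

    assert(out != NULL);
    if (n > 0 && in[0] == 0) {
        for (i = 1; i < n; i++)
            out[m++] = in[i];
    } else {
        out[m++] = 0;
        for (i = 0; i < n; i++)
            out[m++] = in[i];
    }
    return m;
}
```

The complement of [0-9], { 0x30, 0x3A }, becomes { 0, 0x30, 0x3A }: everything below '0', then everything from ':' upward.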

> As for character encodings, we're forcing everything to UTF-32 in
> regular expressions.  No exceptions.  If you use a string in a regex,
> it'll be transcoded.  I honestly can't think of a better way to
> guarantee efficient string indexing.

I'm fine with that.  The bloat is of course a shame, but as long as
that's not a real problem for someone, let's not worry about it too
much.


-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

Re: Does this mean we get Ruby/CLU-style iterators?

2002-01-18 Thread Piers Cawley

Michael G Schwern <[EMAIL PROTECTED]> writes:

> Reading this in Apoc 4
>
> sub mywhile ($keyword, &condition, &block) {
>     my $l = $keyword.label;
>     while (&condition()) {
>         &block();
>         CATCH {
>             my $t = $!.tag;
>             when X::Control::next { die if $t && $t ne $l; next }
>             when X::Control::last { die if $t && $t ne $l; last }
>             when X::Control::redo { die if $t && $t ne $l; redo }
>         }
>     }
> }
>
> Implies to me:
>
> A &foo prototype means you can have a bare block anywhere in the
> arg list (unlike the perl5 syntax).
>
> Calling &foo() does *not* affect the call stack, otherwise the
> above would not properly emulate a while loop.
>
> If that's true, can I pull off my custom iterators?
> http:[EMAIL PROTECTED]/msg08343.html
>
> Will this:
>
> class File;
> sub foreach ($file, &block) {
>     # yeah, I know.  The RFC was all about exceptions and I'm
>     # not using them in this example.
>     open(FILE, $file) || die $!;
>
>     while(<FILE>) {
>         &block();
>     }
>
>     close FILE;
> }

Hmm... making up some syntax on the fly. I sort of like the idea of
being able to do

class File;
sub foreach ($file, &block) is Control {
    # 'is Control' declares this as a control sub, which, amongst
    # other things 'hides' itself from caller. (We can currently
    # do something like this already using Hook::LexWrap type
    # tricks.)

    open my $fh, $file or die $!; POST { close $fh }

    while (<$fh>) {
        my @ret = wantarray ?? list &block() :: (scalar &block());
        given $! {
            when c::RETURN { return wantarray ?? @ret :: @ret[0] }
        }
    }
}

This is, of course, dependent on $! not being set to a RETURN control
'exception' in the case where we just fall off the end of the block.

It's also dependent on being able to get continuations from caller
(which would be *so* cool).

> allow this:
>
> File.foreach('/usr/dict/words') { print }

Sounds plausible to me.

> or would the prototype be (&file, &block)?

I prefer the ($file, &block) prototype.

> And would this:
>
> my $caller = caller;
> File.foreach('/usr/dict/words') { 
> print $caller eq caller ? "ok" : "not ok" 
> }
>
> be ok or not ok?  It has to be ok if mywhile is going to emulate a
> while loop.

In theory there's nothing to stop you writing it so that that is the
case. I'd like it to be as simple as adding an attribute to the
function declaration (and if it isn't that simple out of the box, it
will almost certainly be, if not trivial, at least possible to write
something to *make* it that simple...)

-- 
Piers

   "It is a truth universally acknowledged that a language in
possession of a rich syntax must be in need of a rewrite."
 -- Jane Austen?

Apoc 4?

2002-01-18 Thread David Whipp


Michael G Schwern wrote:

> Reading this in Apoc 4 ...

I looked on http://dev.perl.org/perl6/apocalypse/: no sign of Apoc4. Where
do I find this latest installment?


Dave.

Re: Apoc 4?

2002-01-18 Thread Will Coleda


http://www.perl.com/pub/a/2002/01/15/apo4.html

David Whipp wrote:
> 
> Michael G Schwern wrote:
> 
> > Reading this in Apoc 4 ...
> 
> I looked on http://dev.perl.org/perl6/apocalypse/: no sign of Apoc4. Where
> do I find this latest installment?
> 
> Dave.

Re: Apoc 4?

2002-01-18 Thread Dan Sugalski


>Michael G Schwern wrote:
>
>>  Reading this in Apoc 4 ...
>
>I looked on http://dev.perl.org/perl6/apocalypse/: no sign of Apoc4. Where
>do I find this latest installment?

www.perl.com. dev.perl.org must just not have a link yet.

-- 

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
   teddy bears get drunk

RE: on parrot strings

2002-01-18 Thread Hong Zhang


> (1) There are 5.125 bytes in Unicode, not four.
> (2) I think the above would suffer from the same problem as one common
> suggestion, two-level bitmaps (though I think the above would suffer
> less, being of finer granularity): the problem is that a lot of
> space is wasted, since the "usage patterns" of Unicode character
> classes tend to be rather scattered and irregular.  Yes, I see
> that you said: "only the arrays that we actually used would be
> allocated to save space"-- which reads to me: much complicated
> logic both in creation and access to make the data 
> structure *look*
> simple.  I'm a firm believer in getting the data structures right,
> after which the code to access them almost writes itself.
> 
> I would suggest the inversion lists for the first try.  As long as
> character classes are not very dynamic once they have been created
> (and at least traditionally that has been the case), inversion lists
> should work reasonably well.

My proposal is that we should use a mixed method. The standard Unicode
classes, such as \p{IsLu}, can be handled by a standard splitbin table.
Please see Java's java.lang.Character or Python's unicodedata_db.h. I
measured it: to handle all Unicode categories, simple casing, and
decimal digit values, I need a table of about 23KB for Unicode 3.1
(0x0 to 0x10FFFF), or about 15KB for (0x0 to 0xFFFF).

For a simple character class, such as [\p{IsLu}\p{InGreek}], the regex
does not need to emit an optimized bitmap. Instead, the regex just generates
a union: the first term uses the standard Unicode category lookup, and the
second is a simple range.

If the user insists on a fast bitmap, and the character class is not
extremely complicated, we will probably need only a few KB for
each character class.
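
A sketch of the "union of terms" idea, with loud assumptions: the category lookup is replaced by a toy predicate that only knows ASCII capitals (a real one would index a table such as the splitbin tables mentioned above), and ClassTerm and class_union_contains are invented names. The Greek block range U+0370..U+03FF stands in for \p{InGreek}.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a Unicode category table lookup. */
typedef int (*CategoryFn)(uint32_t cp);

typedef enum { TERM_CATEGORY, TERM_RANGE } TermKind;

typedef struct {
    TermKind   kind;
    CategoryFn category;   /* used when kind == TERM_CATEGORY */
    uint32_t   lo, hi;     /* inclusive, used when kind == TERM_RANGE */
} ClassTerm;

/* Toy replacement for \p{IsLu}: ASCII capitals only (an assumption
 * made to keep the example self-contained). */
int toy_is_upper_ascii(uint32_t cp)
{
    return cp >= 'A' && cp <= 'Z';
}

/* A code point is in the class if any term of the union accepts it. */
int class_union_contains(const ClassTerm *terms, size_t n, uint32_t cp)
{
    assert(terms != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const ClassTerm *t = &terms[i];
        int hit = (t->kind == TERM_CATEGORY)
                      ? t->category(cp) != 0
                      : (cp >= t->lo && cp <= t->hi);
        if (hit)
            return 1;
    }
    return 0;
}
```

No per-class bitmap is built at all; the cost moves to lookup time, one test per term.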

> > As for character encodings, we're forcing everything to UTF-32 in
> > regular expressions.  No exceptions.  If you use a string in a regex,
> > it'll be transcoded.  I honestly can't think of a better way to
> > guarantee efficient string indexing.

I don't think UTF-32 will save you much. The Unicode case map is variable
length; combining characters, canonical equivalence, and many other things
will require variable-length mapping. For example, if I only want to
parse /[0-9]+/, why would you want to convert everything to UTF-32? Most of
the time, regcomp() can find out whether the regexp will need complicated
preprocessing. Another example: if I want to search for /resume/e
(equivalent matching), the regex engine can normalize the case, fully
decompose the input string, strip off any combining characters, and do an
8-bit Boyer-Moore search. I bet it will be simpler and faster than using
UTF-32. (BTW, equivalent matching means matching the English spelling
against the French spelling, disregarding diacritics.)

I think we should explore more choices and do some experiments.

Hong

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi


> I don't think UTF-32 will save you much. The unicode case map is variable
> length, combining character, canonical equivalence, and many other thing
> will require variable length mapping. For example, if I only want to

This is true.

> parse /[0-9]+/, why you want to convert everything to UTF-32. Most of
> time, the regcomp() can find out whether this regexp will need complicated
> preprocessing. Another example, if I want to search for /resume/e,
> (equivalent matching), the regex engine can normalize the case, fully 
> decompose input string, strip off any combining character, and do 8-bit

Hmmm.  The above sounds complicated, and not quite what I had in mind
for equivalence matching: I would have just said "both the pattern
and the target need to be normalized, as defined by Unicode".  Then
the comparison and searching reduce to the trivial cases of byte
equivalence and searching (of which B-M is the most popular example).

> Boyer-Moore search. I bet it will be simpler and faster than using UTF-32.
> (BTW, the equivalent matching means match English spelling against French
> spell, disregarding diacritics.)
> 
> I think we should explore more choices and do some experiments.

What do you mean by *we*? :-) I am not a p6-internals regular, nor do
I intend to be; there are only so many hours in a day.  But yes, the
sooner we get into exploration/experiment mode, the better.  The
Unicode mindset *must* be adopted sooner rather than later,
"unwriting" 8-bit-byteism out of the code later is hell.  Hopefully my
little treatise will kick Parrot more or less in the right direction.

> Hong

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi


On Fri, Jan 18, 2002 at 11:44:00AM -0800, Hong Zhang wrote:
> > (1) There are 5.125 bytes in Unicode, not four.
> > (2) I think the above would suffer from the same problem as one common
> > suggestion, two-level bitmaps (though I think the above would suffer
> > less, being of finer granularity): the problem is that a lot of
> > space is wasted, since the "usage patterns" of Unicode character
> > classes tend to be rather scattered and irregular.  Yes, I see
> > that you said: "only the arrays that we actually used would be
> > allocated to save space"-- which reads to me: much complicated
> > logic both in creation and access to make the data 
> > structure *look*
> > simple.  I'm a firm believer in getting the data structures right,
> > after which the code to access them almost writes itself.
> > 
> > I would suggest the inversion lists for the first try.  As long as
> > character classes are not very dynamic once they have been created
> > (and at least traditionally that has been the case), inversion lists
> > should work reasonably well.
> 
> My proposal is we should use mix method. The Unicode standard class,
> such as \p{IsLu}, can be handled by a standard splitbin table. Please
> see Java java.lang.Character or Python unicodedata_db.h. I did 
> measurement on it, to handle all unicode category, simple casing,
> and decimal digit value, I need about 23KB table for Unicode 3.1
> (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).

Don't try to compete with inversion lists on the size: their size is
measured in bytes.  For example "Latin script", which consists of 22
separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44
ints, or 176 bytes. Searching for membership in an inversion list is
O(N log N) (binary search).  "Encoding the whole range" is a non-issue
bordering on a joke: two ints, or 8 bytes.
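
The arithmetic behind those figures, as a tiny sketch assuming 4-byte integers per boundary; the flat-bitmap figure is added only for comparison and is not a number from this thread. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNICODE_CODE_POINTS 0x110000u   /* U+0000 .. U+10FFFF */

/* Bytes needed for an inversion list holding `ranges` separate
 * ranges: two boundaries (begin, end) per range, 4 bytes each. */
size_t invlist_bytes(size_t ranges)
{
    assert(ranges <= SIZE_MAX / (2 * sizeof(uint32_t)));
    return ranges * 2 * sizeof(uint32_t);
}

/* Bytes needed for one flat bitmap with a bit per code point. */
size_t flat_bitmap_bytes(void)
{
    return UNICODE_CODE_POINTS / 8;
}
```

So 22 ranges cost 22 * 2 * 4 = 176 bytes and a single all-inclusive range costs 8, against 139264 bytes for a flat one-bit-per-code-point bitmap.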

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

RE: on parrot strings

2002-01-18 Thread Hong Zhang


> > preprocessing. Another example, if I want to search for /resume/e,
> > (equivalent matching), the regex engine can normalize the case, fully 
> > decompose input string, strip off any combining character, and do 8-bit
> 
> Hmmm.  The above sounds complicated not quite what I had in mind
> for equivalence matching: I would have just said "both the pattern
> and the target need to normalized, as defined by Unicode".  Then 
> the comparison and searching reduce to the trivial cases of byte
> equivalence and searching (of which B-M is the most popular example).

You are right in some sense. But "normalized, as defined by Unicode"
may not be simple. I looked at the Unicode regex report, TR18. It does not
specify the equivalence of "resume" vs "re`sume`", but the user may or may
not want this kind of normalization.

Hong

RE: on parrot strings

2002-01-18 Thread Hong Zhang


> > My proposal is we should use mix method. The Unicode standard class,
> > such as \p{IsLu}, can be handled by a standard splitbin table. Please
> > see Java java.lang.Character or Python unicodedata_db.h. I did 
> > measurement on it, to handle all unicode category, simple casing,
> > and decimal digit value, I need about 23KB table for Unicode 3.1
> > (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).
> 
> Don't try to compete with inversion lists on the size: their size is
> measured in bytes.  For example "Latin script", which consists of 22
> separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44
> ints, or 176 bytes. Searching for membership in an inversion list is
> O(N log N) (binary search).  "Encoding the whole range" is a non-issue
> bordering on a joke: two ints, or 8 bytes.

When I said mixed method, I did intend to include binary search. Binary
search is a win for sparse character classes. But a bitmap is better for
large ones. Python uses a two-level bitmap for the first 64K characters.

Hong

Ex4, Apo5, when ?

2002-01-18 Thread raptor


Did u passed "Bermuda Triangle" :")

raptor

Re: Ex4, Apo5, when ?

2002-01-18 Thread Dan Sugalski


At 10:16 AM +0200 1/18/02, raptor wrote:
>Did u passed "Bermuda Triangle" :")

It may be a bit before Ex4 is done. Damian's on a cruise ship at the 
moment, so even if he's got the time (and I don't think he does) he's 
likely lacking connectivity. I expect he'll give us word at some 
point what the schedule is.

As for A5, that's up to Larry's schedule. It's the RE apocalypse, 
though, so should hopefully be a bit less brain-bending. (And thus 
done sooner) No promises, of course, as I'm not Larry.
-- 

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
   teddy bears get drunk

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi

On Fri, Jan 18, 2002 at 12:20:53PM -0800, Hong Zhang wrote:
> > > My proposal is we should use mix method. The Unicode standard class,
> > > such as \p{IsLu}, can be handled by a standard splitbin table. Please
> > > see Java java.lang.Character or Python unicodedata_db.h. I did 
> > > measurement on it, to handle all unicode category, simple casing,
> > > and decimal digit value, I need about 23KB table for Unicode 3.1
> > > (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).
> > 
> > Don't try to compete with inversion lists on the size: their size is
> > measured in bytes.  For example "Latin script", which consists of 22
> > separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44
> > ints, or 176 bytes. Searching for membership in an inversion list is
> > O(N log N) (binary search).  "Encoding the whole range" is a non-issue
> > bordering on a joke: two ints, or 8 bytes.
> 
> When I said mixed method, I did intend to include binary search. The binary
> search is a win for sparse character class. But bitmap is better for large
> one.

"Better" in what sense?  Smaller?  Certainly not.  Faster?  Maybe, maybe not.
Yes, accessing the right bytes and doing the bit arithmetics is about as
fast as one can hope doing anything in CPUs.  But: the 15KB is quite a lot
of stuff to move around for, say, [0-9].  Yes, bitmaps win in pathological
cases where you, say, choose every other character of the Unicode.

I guess I agree with you that a combination of bitmaps and binary
searchable things (inversion lists or trees) is good, but I guess we
differ in that my gut feeling is that the latter should be the
default, not the bitmaps.

I also think this low-level detail should be completely hidden from,
say, the writers of the regex engine, all they should see is
code_point_in_class(cp, cc), and that the low-level "character class
engine" should dynamically pick whichever low-level implementation is
"best", and naturally that only one of the low-level implementations
is being used (for one character class) at a time: hybrids (meaning
dual book-keeping) sound to me like a fruitful breeding area for bugs.
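
One way this could look, as a sketch only: a tagged union in which each character class carries exactly one representation, and the regex engine sees nothing but code_point_in_class(cp, cc). The CharClass layout and the two representation kinds are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CC_BITMAP8, CC_INVLIST } CCKind;

/* Exactly one representation is active per class (no dual
 * book-keeping); which one is a private decision of this layer. */
typedef struct {
    CCKind kind;
    union {
        unsigned char bitmap[32];      /* one bit per code point 0..255 */
        struct {
            const uint32_t *bounds;    /* sorted membership flip points */
            size_t          n;
        } inv;
    } u;
} CharClass;

/* The only entry point a regex engine would need. */
int code_point_in_class(uint32_t cp, const CharClass *cc)
{
    assert(cc != NULL);
    switch (cc->kind) {
    case CC_BITMAP8:
        return cp < 256 && ((cc->u.bitmap[cp >> 3] >> (cp & 7)) & 1);
    case CC_INVLIST: {
        /* Count boundaries <= cp; an odd count means "inside". */
        size_t lo = 0, hi = cc->u.inv.n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (cc->u.inv.bounds[mid] <= cp)
                lo = mid + 1;
            else
                hi = mid;
        }
        return (int)(lo & 1);
    }
    }
    return 0;
}
```

How the layer picks a representation when a class is built (size thresholds, density heuristics) is left out here; that policy is exactly the part that would need the experiments discussed earlier.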

> Python uses two level bitmap for first 64K character.

And their Unicode implementation is doing how well? :-)

> Hong

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

Re: Ex4, Apo5, when ?

2002-01-18 Thread Michael G Schwern


On Fri, Jan 18, 2002 at 03:35:59PM -0500, Dan Sugalski wrote:
> At 10:16 AM +0200 1/18/02, raptor wrote:
> >Did u passed "Bermuda Triangle" :")
> 
> It may be a bit before Ex4 is done. Damian's on a cruise ship at the 
> moment, so even if he's got the time (and I don't think he does) he's 
> likely lacking connectivity. I expect he'll give us word at some 
> point what the schedule is.

They've got connectivity all right.  We've been getting plenty of
drunken ramblings on IRC from folks on the cruise.


-- 

Michael G. Schwern   <[EMAIL PROTECTED]>http://www.pobox.com/~schwern/
Perl Quality Assurance  <[EMAIL PROTECTED]> Kwalitee Is Job One
Your average appeasement engineer is about as clued-up on computers as
the average computer "hacker" is about B.O.
-- BOFH

Re: Ex4, Apo5, when ?

2002-01-18 Thread Dan Sugalski


At 4:17 PM -0500 1/18/02, Michael G Schwern wrote:
>On Fri, Jan 18, 2002 at 03:35:59PM -0500, Dan Sugalski wrote:
>>  At 10:16 AM +0200 1/18/02, raptor wrote:
>>  >Did u passed "Bermuda Triangle" :")
>>
>>  It may be a bit before Ex4 is done. Damian's on a cruise ship at the
>>  moment, so even if he's got the time (and I don't think he does) he's
>>  likely lacking connectivity. I expect he'll give us word at some
>>  point what the schedule is.
>
>They've got connectivity all right.  We've been getting plenty of
>drunken ramblings on IRC from folks on the cruise.

Well, so much for *that* excuse. :) Bet they're still hard up for 
free time, though.

-- 

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
   teddy bears get drunk

Apo4: PRE, POST

2002-01-18 Thread David Whipp


Apo4, when introducing POST, mentions that there is a
corresponding "PRE" block "for design-by-contract
programmers".

However, I see the POST block being used as a finalize;
and thus allowing (encouraging?) it to have side effects.
I can't help feeling that contract/assertion checking
should not have side effects. Furthermore, there should
be options to turn off PRE/POST processing for higher
performance. Perhaps we'll learn more about contracts
(inc. invariants, inheritance) in a later apo? Will we
still use the Class::Contract module?


Dave.

--
Dave Whipp, Senior Verification Engineer,
Fast-Chip inc., 950 Kifer Rd, Sunnyvale, CA. 94086
tel: 408 523 8071; http://www.fast-chip.com
Opinions my own; statements of fact may be in error.

Re: on parrot strings

2002-01-18 Thread Steve Fink

On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote:
> ints, or 176 bytes. Searching for membership in an inversion list is
> O(N log N) (binary search).  "Encoding the whole range" is a non-issue
> bordering on a joke: two ints, or 8 bytes.

[Clarification from a noncombatant] You meant O(log N).

I like the inversion list idea. But its speed is proportional to the
toothiness of the character class, and while I have good intuition for
what that means in 7-bit US-ASCII, I have no idea how bad it gets for
other languages. "Vowels"? "Capital letters"? Would anyone ever want
to select all Chinese characters with a particular radical?

That's just lookup. We should also consider other character class
operations: union, subtraction, intersection. They're pretty
straightforward and fast (O(N)) for inversion lists. (Yes, all these
operations can be postponed until lookup time, regardless of the
underlying representation, in which case the time of union(C1,C2) is
just the time of C1 + time of C2 + time of an 'or'.)
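
A sketch of the eager O(N) union on inversion lists: walk both sorted boundary arrays once, track whether each input is currently "inside", and emit a boundary only where the combined membership changes. The name invlist_union and the even-index-starts-a-range convention are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Merge inversion lists a[0..na) and b[0..nb) into out, which must
 * have room for na + nb entries.  Returns the length of the result.
 * Each input must be strictly increasing. */
size_t invlist_union(const uint32_t *a, size_t na,
                     const uint32_t *b, size_t nb,
                     uint32_t *out)
{
    size_t i = 0, j = 0, n = 0;
    int in_a = 0, in_b = 0;

    assert(out != NULL);
    while (i < na || j < nb) {
        uint32_t p;
        int was = in_a || in_b;

        /* Next boundary is the smaller of the two heads. */
        if (j >= nb || (i < na && a[i] <= b[j]))
            p = a[i];
        else
            p = b[j];

        /* Every boundary flips membership in the list it belongs to. */
        if (i < na && a[i] == p) { in_a = !in_a; i++; }
        if (j < nb && b[j] == p) { in_b = !in_b; j++; }

        /* Keep p only if membership in the union actually changed. */
        if ((in_a || in_b) != was)
            out[n++] = p;
    }
    return n;
}
```

Replacing both `||` tests on in_a and in_b with `&&` should give intersection by the same walk, which is why these set operations all stay linear in the number of boundaries.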

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi


On Fri, Jan 18, 2002 at 01:40:26PM -0800, Steve Fink wrote:
> On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote:
> > ints, or 176 bytes. Searching for membership in an inversion list is
> > O(N log N) (binary search).  "Encoding the whole range" is a non-issue
> > bordering on a joke: two ints, or 8 bytes.
> 
> [Clarification from a noncombatant] You meant O(log N).

Duh, yes.  At least someone is awake :-)

> I like the inversion list idea. But its speed is proportional to the
> toothiness of the character class, and while I have good intuition for

Yup.

> what that means in 7-bit US-ASCII, I have no idea how bad it gets for
> other languages. "Vowels"? "Capital letters"? Would anyone ever want

As far as I can see, and guesstimate (watch out for waving hands),
it would behave pretty well In Real Life.  If we are talking about the
predefined existing categories like Lu, or Greek script, or Cyrillic
block, they are pretty well localized and not scattershot.
User-specified characters are likely to be well localized to
one or a few scripts.

> to select all Chinese characters with a particular radical?
> 
> That's just lookup. We should also consider other character class
> operations: union, subtraction, intersection. They're pretty
> straightforward and fast (O(N)) for inversion lists. (Yes, all these

Yes, since they are by definition sorted, merging (or negatively
merging) them is pretty simple.
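
A sketch of that merge in Python, for illustration only. It assumes each inversion list is a strictly increasing list of boundaries (even index starts an in-class range, odd index ends it), and the helper name combine is made up. One pass over both lists covers union, intersection and subtraction, which is where the O(N) comes from.

```python
def combine(a, b, op):
    """Merge two inversion lists in one pass.

    op(in_a, in_b) says whether a point belongs to the result; it must
    return False for (False, False), which holds for the three uses below.
    """
    out = []
    i = j = 0
    in_a = in_b = state = False
    while i < len(a) or j < len(b):
        # Pick the smallest boundary not yet consumed.
        if j >= len(b) or (i < len(a) and a[i] <= b[j]):
            point = a[i]
        else:
            point = b[j]
        # Crossing a boundary toggles membership in the list that owns it.
        if i < len(a) and a[i] == point:
            in_a = not in_a
            i += 1
        if j < len(b) and b[j] == point:
            in_b = not in_b
            j += 1
        new_state = op(in_a, in_b)
        if new_state != state:
            out.append(point)
            state = new_state
    return out

def union(a, b):
    return combine(a, b, lambda x, y: x or y)

def intersection(a, b):
    return combine(a, b, lambda x, y: x and y)

def subtraction(a, b):
    return combine(a, b, lambda x, y: x and not y)
```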

> operations can be postponed until lookup time, regardless of the
> underlying representation, in which case the time of union(C1,C2) is
> just the time of C1 + time of C2 + time of an 'or'.)

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

RE: Apo4: PRE, POST

2002-01-18 Thread Garrett Goebel


From: David Whipp [mailto:[EMAIL PROTECTED]]
> 
> Apo4, when introducing POST, mentions that there is a
> corresponding "PRE" block "for design-by-contract
> programmers".
> 
> However, I see the POST block being used as a finalize;
> and thus allowing (encouraging?) it to have side effects.

It may very well be the case that a procedure's POST block could have side
effects. However, if Larry and Damian are on the same frequency... then a
_method_'s PRE/POST blocks will not have side effects. At least that is what
I perhaps incorrectly inferred from a previous discussion, in which Damian
participated, on the perl6-language list about subroutine wrappers,
Hook::LexWrap, or whatever the means to the ends were in that thread.


> I can't help feeling that contract/assertion checking
> should not have side effects. Furthermore, there should
> be options to turn off PRE/POST processing for higher
> performance. Perhaps we'll learn more about contracts
> (inc. invariants, inheritance) in a later apo?

I hope so. I am particularly interested to hear how PRE/POST blocks will
work in the context of methods and inheritance.


> Will we still use the Class::Contract module?

Your guess is as good as mine. It looks like there will be fewer reasons for
most people to use it. Especially if all you need is assertions. IMO: It's
nice just to hear Larry say "design-by-contract" programmers, and know that
he's still talking about Perl ;)

We'll just have to see how Perl6 DBC support works out with regards to
encapsulation, inheritance, and Class::Contract's other odds and ends. But I
imagine support for things like a POST block checking an object against its
previous state via &old, and things like shortening and flattening, etc.
will still require a Class::Contract.
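
For readers unfamiliar with the idea, here is a rough sketch in Python of side-effect-free PRE/POST checks that can be switched off and that compare an object against its previous state. It is not how Perl 6 or Class::Contract does it; the decorator name contract, the CONTRACTS_ENABLED switch and the Counter class are all invented for the example.

```python
import copy
import functools

CONTRACTS_ENABLED = True  # hypothetical global switch to skip checks for speed

def contract(pre=None, post=None):
    """Wrap a method with optional pre- and postcondition checks."""
    def wrap(method):
        @functools.wraps(method)
        def checked(self, *args, **kwargs):
            if not CONTRACTS_ENABLED:
                return method(self, *args, **kwargs)
            if pre is not None and not pre(self, *args, **kwargs):
                raise AssertionError("precondition failed")
            old = copy.deepcopy(self)  # snapshot, playing the role of &old
            result = method(self, *args, **kwargs)
            if post is not None and not post(old, self, result):
                raise AssertionError("postcondition failed")
            return result
        return checked
    return wrap

class Counter:
    def __init__(self):
        self.n = 0

    @contract(pre=lambda self, step: step > 0,
              post=lambda old, new, result: new.n == old.n + result - old.n
              and new.n > old.n)
    def bump(self, step):
        self.n += step
        return self.n
```

The checks only read state and never change it, which is the property David is asking for; a POST block used as a finalizer would not fit this shape.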

Re: Does this mean we get Ruby/CLU-style iterators?

2002-01-18 Thread Dan Sugalski

At 3:37 PM + 1/18/02, Piers Cawley wrote:
>Michael G Schwern <[EMAIL PROTECTED]> writes:
>
>Hmm... making up some syntax on the fly. I sort of like the idea of
>being able to do
>
> class File;
> sub foreach ($file, &block) is Control {
> # 'is Control' declares this as a control sub, which, amongst
> # other things 'hides' itself from caller. (We can currently
> # do something like this already using Hooks::LexWrap type
> # tricks.
>
> open my $fh, $file or die $!; POST { close $fh }
>
> while (<$fh>) {
> my @ret = wantarray ?? list &block() :: (scalar &block());
> given $! {
> when c::RETURN { return wantarray ?? @ret :: @ret[0] }
> }
> }
> }
>
>This is, of course, dependent on $! not being set to a RETURN control
>'exception' in the case where we just fall off the end of the block.

I don't think you'll see $! being set to anything other than real 
errors. Larry may change that, but I'd doubt it. It's more a global 
status than anything else. Exceptions would go elsewhere, I'd hope.

I personally would like to see subs be taggable as transparent to 
yielding, so if you call a sub, and it calls a sub, that inner sub 
could yield out of the caller if the caller was transparent. Not, 
mind, that the scheme doesn't have issues, but...

>It's also dependent on being able to get continuations from caller
>(which would be *so* cool)

For some brainwarping version of cool. :)

>  > allow this:
>>
>>  File.foreach('/usr/dict/words') { print }
>
>Sounds plausible to me.
>
>>  or would the prototype be (&file, &block)?
>
>I prefer the ($file, &block) prototype.

I think it'll be ($file, &block), as that makes the most sense.

-- 

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
   teddy bears get drunk

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi


On Fri, Jan 18, 2002 at 01:40:26PM -0800, Steve Fink wrote:
> On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote:
> > ints, or 176 bytes. Searching for membership in an inversion list is
> > O(N log N) (binary search).  "Encoding the whole range" is a non-issue
> > bordering on a joke: two ints, or 8 bytes.
> 
> [Clarification from a noncombatant] You meant O(log N).
> 
> I like the inversion list idea. But its speed is proportional to the
> toothiness of the character class, and while I have good intuition for
> what that means in 7-bit US-ASCII, I have no idea how bad it gets for
> other languages. "Vowels"? "Capital letters"? Would anyone ever want
> to select all Chinese characters with a particular radical?
> 
> That's just lookup. We should also consider other character class
> operations: union, subtraction, intersection. They're pretty

Complement of an inversion list is neat: insert 0 at the beginning
(and append max+1), unless there already is one, in which case delete
the 0 (and shift the list and delete the max+1).  Again, O(N). 
(One could of course have a bit for a 'negative character class',
but that would in turn complicate the computations.)
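
A sketch of the complement in Python, for illustration only. It assumes a slightly simpler convention than the one described above: the list is a sorted list of boundaries and a final in-class range is left open-ended, so no max+1 entry has to be appended or deleted. The function name is made up.

```python
def complement(inv):
    """Toggle the leading 0 boundary; copying the list makes this O(N)."""
    if inv and inv[0] == 0:
        return inv[1:]      # class started at 0: drop that boundary
    return [0] + inv        # class did not start at 0: the complement does

DIGITS = [0x30, 0x3A]               # 0-9
NOT_DIGITS = complement(DIGITS)     # [0, 0x30, 0x3A]
```

Applying complement twice gives back the original list, and the empty class [] becomes [0], which is the whole range.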

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

Re: on parrot strings

2002-01-18 Thread Steve Fink

On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote:
> Complement of an inversion list is neat: insert 0 at the beginning
> (and append max+1), unless there already is one, in which case delete
> the 0 (and shift the list and delete the max+1).  Again, O(N). 
> (One could of course have a bit for a 'negative character class',
> but that would in turn complicate the computations.)

If we have hybrid notation, we'll be stuck with not only a bit for
that, but also a complete expression tree for character classes.
(Which is necessary if we use a Unicode library that only exposes
property test functions, not numeric ranges.)

We *do* want to have (with some notation)
[[:digit:]\p{FunkyLooking}aeiou except 7], right?

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi


On Fri, Jan 18, 2002 at 02:22:49PM -0800, Steve Fink wrote:
> On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote:
> > Complement of an inversion list is neat: insert 0 at the beginning
> > (and append max+1), unless there already is one, in which case delete
> > the 0 (and shift the list and delete the max+1).  Again, O(N). 
> > (One could of course have a bit for a 'negative character class',
> > but that would in turn complicate the computations.)
> 
> If we have hybrid notation, we'll be stuck with not only a bit for
> that, but also a complete expression tree for character classes.
> (Which is necessary if we use a Unicode library that only exposes
> property test functions, not numeric ranges.)
> 
> We *do* want to have (with some notation)
> [[:digit:]\p{FunkyLooking}aeiou except 7], right?

Of course.  But that is all resolvable in regex compile time.
No expression tree needed.
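
A sketch of what "resolvable at regex compile time" could look like, in Python for illustration only. The property sets are stand-ins: DIGIT covers only ASCII digits and FUNKY is a made-up pair of code points, since \p{FunkyLooking} is not a real property. Python sets are used as scratch space; a real implementation would presumably merge range lists directly.

```python
def to_inversion_list(code_points):
    """Flatten a set of code points into sorted (start, one-past-end) boundaries."""
    inv = []
    prev = None
    for cp in sorted(code_points):
        if prev is None or cp != prev + 1:
            if prev is not None:
                inv.append(prev + 1)   # close the previous run
            inv.append(cp)             # open a new run
        prev = cp
    if prev is not None:
        inv.append(prev + 1)
    return inv

DIGIT = set(range(0x30, 0x3A))          # stand-in for [:digit:], ASCII only
FUNKY = {0x263A, 0x263B}                # hypothetical \p{FunkyLooking}
VOWELS = {ord(c) for c in "aeiou"}

# [[:digit:]\p{FunkyLooking}aeiou except 7], resolved once, up front.
compiled = to_inversion_list((DIGIT | FUNKY | VOWELS) - {ord("7")})
```

At match time only the flat list is consulted; the union and the "except" have already disappeared.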

[[:digit:]\p{FunkyLooking}aeiou$FooBar] is an ickier case,
but even there the constant parts can be resolved in regex
compile time.  (Don't say "locales" or I'll ha've have to hurt you,
for your own good. :-)

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

Re: on parrot strings

2002-01-18 Thread Steve Fink

On Sat, Jan 19, 2002 at 12:28:15AM +0200, Jarkko Hietaniemi wrote:
> On Fri, Jan 18, 2002 at 02:22:49PM -0800, Steve Fink wrote:
> > On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote:
> > > Complement of an inversion list is neat: insert 0 at the beginning
> > > (and append max+1), unless there already is one, in which case delete
> > > the 0 (and shift the list and delete the max+1).  Again, O(N). 
> > > (One could of course have a bit for a 'negative character class',
> > > but that would in turn complicate the computations.)
> > 
> > If we have hybrid notation, we'll be stuck with not only a bit for
> > that, but also a complete expression tree for character classes.
> > (Which is necessary if we use a Unicode library that only exposes
> > property test functions, not numeric ranges.)
> > 
> > We *do* want to have (with some notation)
> > [[:digit:]\p{FunkyLooking}aeiou except 7], right?
> 
> Of course.  But that is all resolvable in regex compile time.
> No expression tree needed.

My point was that if inversion lists are insufficient for describing
all the character classes we might be interested in, then we'll need
the tree. And an example of why inversion lists would be insufficient
is if we have a character API that only allows queries of the sort "is
this character FunkyLooking or not?", rather than "what ranges of
characters are FunkyLooking?" (Unless you want to do "is 0
FunkyLooking? is 1 FunkyLooking? ... is 4294967295 FunkyLooking?" at
compile time.)

> compile time.  (Don't say "locales" or I'll ha've have to hurt you,
> for your own good. :-)

Was the ' in ha've unintentional, or is that an acute accent mark? :-)

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi


> > > We *do* want to have (with some notation)
> > > [[:digit:]\p{FunkyLooking}aeiou except 7], right?
> > 
> > Of course.  But that is all resolvable in regex compile time.
> > No expression tree needed.
> 
> My point was that if inversion lists are insufficient for describing
> all the character classes we might be interested in, then we'll need
> the tree. And an example of why inversion lists would be insufficient
> is if we have a character API that only allows queries of the sort "is
> this character FunkyLooking or not?", rather than "what ranges of
> characters are FunkyLooking?" (Unless you want to do "is 0
> FunkyLooking? is 1 FunkyLooking? ... is 4294967295 FunkyLooking?" at
> compile time.)

I think the answer to that dilemma is obvious: we do want an API that
tells which ranges FunkyLooking covers and guess what: the answers
to such questions can be represented as inversion lists.
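
As a sketch of how such a range API could be fed even when only a yes/no test exists: scan once, ahead of time, and record where the answer flips. Python is used for illustration only; ranges_for is a made-up name, and the scan is limited to 7-bit ASCII here so the result is easy to check.

```python
import unicodedata

def ranges_for(predicate, limit):
    """Scan code points 0..limit-1 and return an inversion list for predicate."""
    inv = []
    inside = False
    for cp in range(limit):
        hit = predicate(cp)
        if hit != inside:      # the answer flipped: record a boundary
            inv.append(cp)
            inside = hit
    if inside:
        inv.append(limit)      # close a range still open at the end of the scan
    return inv

def is_uppercase_letter(cp):
    return unicodedata.category(chr(cp)) == "Lu"

ASCII_LU = ranges_for(is_uppercase_letter, 0x80)   # [0x41, 0x5B]
```

The expensive scan happens once per property (say, when the tables are built), not once per regex.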

> > compile time.  (Don't say "locales" or I'll ha've have to hurt you,
> > for your own good. :-)
> 
> Was the ' in ha've unintentional, or is that an acute accent mark? :-)

I was aiming for pirate accent.  Arr.  Discussing parrots and all.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

Re: Apo4: PRE, POST

2002-01-18 Thread Me


> [concerns over conflation of post-processing and post-assertions]

Having read A4 thoroughly, twice, this was my only real concern
(which contrasted with an overall sense of "wow, this is so cool").

--me

Re: [PATCH] gcc -ansi -pedantic unrealistically strict [APPLIED]

2002-01-18 Thread Dan Sugalski


At 12:51 PM -0500 1/15/02, Andy Dougherty wrote:
>I think the optimal fix here is simply to remove -ansi -pedantic.
>-ansi may well have some uses, but even the gcc man pages say
>"There is no reason to use this option [-pedantic]; it exists only
>to satisfy pedants."

Applied, thanks. (Though I have to believe there's some reason for -pedantic.)
-- 

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
   teddy bears get drunk

Re: [OBNOXIOUS PATCH] docs/running.pod [APPLIED]

2002-01-18 Thread Dan Sugalski


At 9:30 AM -0800 1/15/02, Steve Fink wrote:
>This patch adds docs/running.pod, which lists the various executables
>Parrot currently includes, examples of running them, and mentions of
>where they fail to work. It's more of a cry for help than a useful
>reference. :-) I've been having trouble recently when making changes
>in figuring out whether I broke anything, because any non-default way
>of running the system seems to be already broken. I can't tell what
>brokenness is expected and what isn't.

Applied, with some chagrin. Thanks.
-- 

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
   teddy bears get drunk

Re: Does this mean we get Ruby/CLU-style iterators?

2002-01-18 Thread Piers Cawley

Dan Sugalski <[EMAIL PROTECTED]> writes:
> At 3:37 PM + 1/18/02, Piers Cawley wrote:
>>Michael G Schwern <[EMAIL PROTECTED]> writes:
>>
>>Hmm... making up some syntax on the fly. I sort of like the idea of
>>being able to do
>>
>> class File;
>> sub foreach ($file, &block) is Control {
>> # 'is Control' declares this as a control sub, which, amongst
>> # other things 'hides' itself from caller. (We can currently
>> # do something like this already using Hooks::LexWrap type
>> # tricks.
>>
>> open my $fh, $file or die $!; POST { close $fh }
>>while (<$fh>) {
>> my @ret = wantarray ?? list &block() :: (scalar &block());
>> given $! {
>> when c::RETURN { return wantarray ?? @ret :: @ret[0] }
>> }
>> }
>> }
>>
>>This is, of course, dependent on $! not being set to a RETURN control
>>'exception' in the case where we just fall off the end of the block.
>
> I don't think you'll see $! being set to anything other than real
> errors. Larry may change that, but I'd doubt it. It's more a global
> status than anything else. Exceptions would go elsewhere, I'd hope.

Um... I'm not sure that's how I read the Apocalypse. And if it doesn't
get set how on earth are we going to be able to tell how a block
exited in the case of home rolled looping/iterating constructs where
we're going to want to write:

sub foo {
...
File.foreach($file_path) {
...
return ($someval) if /some_pattern/;
...
}
}

and have foo return. 

Maybe we'll have to have something like:

while (<$fh>) {
try {
temp c::RETURN is Error;
temp c::NEXT is Error;
temp c::REDO is Error;
temp c::LAST is Error;

wantarray ?? list &block() :: (scalar &block());

DEFAULT { throw };
}
   }

Then, because the control structures are temporarily Errors within the
scope of the try block they get thrown up to the first thing that can
handle them. In the case of NEXT/REDO/LAST, that's the while loop, and
in the case of the RETURN, that's the enclosing subroutine. But it
seems kludgy as hell.
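
The same shape can be sketched in a language that already has exceptions but no non-local return from a block. This is Python, for illustration only, and every class and function name below is invented; it is not a claim about how Perl 6 will do it. The home-rolled loop handles Next and Last itself and lets Return travel up to the enclosing sub.

```python
class ControlException(Exception):
    """Base class for non-error control flow."""

class Return(ControlException):
    def __init__(self, value):
        super().__init__(value)
        self.value = value

class Next(ControlException):
    pass

class Last(ControlException):
    pass

def foreach(items, block):
    """User-defined loop: catches Next/Last, does not catch Return."""
    for item in items:
        try:
            block(item)
        except Next:
            continue
        except Last:
            break

def first_match(lines, needle):
    """The enclosing sub: a Return raised inside the block ends up here."""
    def block(line):
        if not line:
            raise Next()          # skip empty lines
        if needle in line:
            raise Return(line)    # "return from first_match", not from block
    try:
        foreach(lines, block)
    except Return as ret:
        return ret.value
    return None
```

The visible try/except in first_match is exactly the part one would want the language to hide.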

> I personally would like to see subs be taggable as transparent to
> yielding, so if you call a sub, and it calls a sub, that inner sub
> could yield out of the caller if the caller was transparent. Not, mind,
> that the scheme doesn't have issues, but...
>[...]
>>It's also dependent on being able to get continuations from caller
>>(which would be *so* cool)
>
> For some brainwarping version of cool. :)

Hmm... the example I wrote which might possibly have used
continuations got wiped 'cos I realised I wasn't exactly clear on how
they were going to work. But I still think being able to grab a
continuation from up the stack somewhere could be handy, allowing
syntax like:

&block.call_from($continuation);

Which is sort of nice, and sort of really, really evil. The thing is,
given continuations and $continuation.want (so I can work out what
context the continuation called in...) I can see how to implement it:

class BLOCK;
sub call_from ($continuation) {
given $continuation.want {
when LIST { $continuation.return(list .yield)   }
default   { $continuation.return(scalar .yield) }
}
}

Of course, I could have got *completely* the wrong end of the stick
about continuations. And this example doesn't do the 'right thing' for
caller, but hey, it's a start.

-- 
Piers

   "It is a truth universally acknowledged that a language in
possession of a rich syntax must be in need of a rewrite."
 -- Jane Austen?

Benchmarking regexps against perl5

2002-01-18 Thread Nicholas Clark


A thought occurred to me a few days ago:

If I remember correctly, attempts to benchmark parrot's developing regular
expressions against perl's regular expressions are proving "disappointing".
However, perl5 has the advantage of a regular expression optimiser as I
understand it, or at least code to work out the optimal place to start a
match, and interesting strategies to discard things that never match.

How hard is it to "knobble" a perl5 to disable the regular expression
optimiser? Surely that would level the playing field, so that parrot's
regexp engine speed would be directly comparable with perl's regexp engine
speed?

And then later perl5 be allowed its optimiser back once parrot has one.

Nicholas Clark
-- 
ENOCHOCOLATE http://www.ccl4.org/~nick/CV.html

Re: on parrot strings

2002-01-18 Thread Nicholas Clark

On Fri, Jan 18, 2002 at 05:24:00PM +0200, Jarkko Hietaniemi wrote:

> > As for character encodings, we're forcing everything to UTF-32 in
> > regular expressions.  No exceptions.  If you use a string in a regex,
> > it'll be transcoded.  I honestly can't think of a better way to
> > guarantee efficient string indexing.
> 
> I'm fine with that.  The bloat is of course a shame, but as long as
> that's not a real problem for someone, let's not worry about it too
> much.

Forcing everything to UTF-32 in the API?
Or just forcing everything to UTF-32 until perl 6.0 is released, as trying
to do UTF-8 (and UTF-16 ...) regexps now is premature optimisation?

To me it seems that making UTF-32 do everything correctly which the real
world can use while encoding optimised versions are written is better than
having a snazzy 4 encoding autoswitcher that is wrong and therefore not
releasable to the world.

But I don't know about how the internals of all these things work, so I
may well be wrong on any technical detail.

Nicholas Clark
-- 
ENOCHOCOLATE http://www.ccl4.org/~nick/CV.html

Re: Apo4: PRE, POST

2002-01-18 Thread Piers Cawley

"Me" <[EMAIL PROTECTED]> writes:

>> [concerns over conflation of post-processing and post-assertions]
>
> Having read A4 thoroughly, twice, this was my only real concern
> (which contrasted with an overall sense of "wow, this is so cool").

I think that people have sort of got used to the fact that Perl 6 is
not going to look quite as much like perl5 as they thought it was going
to. Either that or they've all buggered off...

Personally I'm loving it. The small changes in the syntax are all
coming together to give us something that's going to be far easier to
parse (and therefore far easier to mess with syntactically, which is
what excites me; I've long had the mathematician's view that stuff
becomes so much easier when you have the right notation. A more
mutable perl means that I build myself the right notation and then
solve the problem -- I want to invent my own syntactic sugar if that
makes sense...)

-- 
Piers

   "It is a truth universally acknowledged that a language in
possession of a rich syntax must be in need of a rewrite."
 -- Jane Austen?

A question

2002-01-18 Thread Piers Cawley


Okay boys and girls, what does this print:

my @aaa = qw/1 2 3/;
my @bbb = @aaa;

try {
print "$_\n";
}

for @aaa; @bbb -> my $a; my $b {
print "$a:$b";
}

I'm guessing one of:
1:1
2:2
3:3

or a syntax error, complaining about something near
C<@bbb -> my $a ; my $b {>

In other words, how does the parser distinguish between postfix for
followed by a semicolon, and the new semicolon enhanced 'normal' for?

-- 
Piers

   "It is a truth universally acknowledged that a language in
possession of a rich syntax must be in need of a rewrite."
 -- Jane Austen?

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi


On Fri, Jan 18, 2002 at 11:40:17PM +, Nicholas Clark wrote:
> On Fri, Jan 18, 2002 at 05:24:00PM +0200, Jarkko Hietaniemi wrote:
> 
> > > As for character encodings, we're forcing everything to UTF-32 in
> > > regular expressions.  No exceptions.  If you use a string in a regex,
> > > it'll be transcoded.  I honestly can't think of a better way to
> > > guarantee efficient string indexing.
> > 
> > I'm fine with that.  The bloat is of course a shame, but as long as
> > that's not a real problem for someone, let's not worry about it too
> > much.
> 
> Forcing everything to UTF-32 in the API?

I think Brent meant UTF-32 internally for the regexen.  When you say
/a/, Parrot sees 0x00 0x00 0x00 0x61.

> To me it seems that making UTF-32 do everything correctly which the real
> world can use while encoding optimised versions are written is better than
> having a snazzy 4 encoding autoswitcher that is wrong and therefore not
> releasable to the world.

Now, now.

But yes, maybe selecting *one* first (and getting its implementation
right) would be good, and in that case it's either UTF-16 (which is
reasonably compact, but variable length), or UTF-32 (which is a bit
wasteful, but fixed length, and therefore easy to think in).
So I guess UTF-32 wins.
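
A small illustration of the fixed-length argument, in Python for convenience. It assumes big-endian UTF-32 without a byte order mark; code_point_at is a made-up helper.

```python
def code_point_at(utf32_be: bytes, index: int) -> int:
    """Character N always lives at bytes 4*N .. 4*N+3: no scanning needed."""
    start = 4 * index
    return int.from_bytes(utf32_be[start:start + 4], "big")

pattern = "a".encode("utf-32-be")              # b"\x00\x00\x00\x61"
text = "na\u00efve".encode("utf-32-be")        # 5 characters, 20 bytes
third = code_point_at(text, 2)                 # 0xEF, i with diaeresis

# UTF-16 is more compact but variable length: this character needs two units.
clef = "\U0001D11E"
utf16_units = len(clef.encode("utf-16-be")) // 2   # 2
```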

> But I don't know about how the internals of all these things work, so I
> may well be wrong on any technical detail.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

Re: A question

2002-01-18 Thread Glenn Linderman

That particular example is flawed, because the try expression is turned
into a try statement because the } stands alone on its line.

But if you eliminate a couple newlines between } and for, then your
question makes sense (but the code is not well structured, but hey, maybe
you take out all the newlines for a one-liner...).

The answer in that case is probably a syntax error, and to avoid it, you
put a ; between the } and the for.

Piers Cawley wrote:

> Okay boys and girls, what does this print:
>
> my @aaa = qw/1 2 3/;
> my @bbb = @aaa;
>
> try {
> print "$_\n";
> }
>
> for @aaa; @bbb -> my $a; my $b {
> print "$a:$b";
> }
>
> I'm guessing one of:
> 1:1
> 2:2
> 3:3
>
> or a syntax error, complaining about something near
> C<@bbb -> my $a ; my $b {>
>
> In other words, how does the parser distinguish between postfix for
> followed by a semicolon, and the new semicolon enhanced 'normal' for?
>
> --
> Piers
>
>"It is a truth universally acknowledged that a language in
> possession of a rich syntax must be in need of a rewrite."
>  -- Jane Austen?

--
Glenn
=
Due to the current economic situation, the light at the
end of the tunnel will be turned off until further notice.

Re: Apo4: PRE, POST

2002-01-18 Thread Glenn Linderman

Me wrote:

> > [concerns over conflation of post-processing and post-assertions]
>
> Having read A4 thoroughly, twice, this was my only real concern
> (which contrasted with an overall sense of "wow, this is so cool").
>
> --me

Yes, very, very cool.

I especially liked how RFC 88 was "accepted with caveats" and RFC 119
was "rejected but assimilated", given my personal involvement in that
topic.  Seeing as how all the insufficiencies in RFC 88 that RFC 119 was
trying to cure have been cured extremely well, I am quite a happy
camper.  I never cared what the words were as long as they make sense,
and Larry picked good words.  There are no non-object exceptions, but
given the depth of object integration into the core concepts that seems
to have been accepted for Perl 6 (but was uncertain at the time of RFC
writing), that is not a problem.

Also very cool was the resulting switch statement.  Its integration with
=~ and CATCH is brilliant.  That RFC had a much too large table of DWIM
cases to understand, and Perl 6 still has quite a few, but all of them
seem to DWIM for me, whereas a number of the ones in the RFC seemed
quite contrived and obscure to me.

The only thing that seems somewhat questionable is the elimination of
bare blocks... handy for defining short term variables... a common
metaphor for reading a whole file was

  { local $/; $whole_file = ; }

but I guess putting "do" in front isn't too onerous for the reduced
ambiguity.

--
Glenn
=
Due to the current economic situation, the light at the
end of the tunnel will be turned off until further notice.

Re: Does this mean we get Ruby/CLU-style iterators?

2002-01-18 Thread Larry Wall


Michael G Schwern writes:
: Reading this in Apoc 4
: 
: sub mywhile ($keyword, &condition, &block) {
: my $l = $keyword.label;
: while (&condition()) {
: &block();
: CATCH {
: my $t = $!.tag;
: when X::Control::next { die if $t && $t ne $l; next }
: when X::Control::last { die if $t && $t ne $l; last }
: when X::Control::redo { die if $t && $t ne $l; redo }
: }
: }
: }
: 
: Implies to me:
: 
: A &foo prototype means you can have a bare block anywhere in the
: arg list (unlike the perl5 syntax).

That is correct.

: Calling &foo() does *not* affect the callstack, otherwise the
: above would not properly emulate a while loop.

Maybe it's transparent to caller but not to caller($n).  I'm not sure how
much of a problem this will be.  Inside &block it's a closure, which
carries a lot of the context you need already.  Continuations may be
overkill.

: If that's true, can I pull off my custom iterators?
: http:[EMAIL PROTECTED]/msg08343.html
: 
: Will this:
: 
: class File;
: sub foreach ($file, &block) {
: # yeah, I know.  The RFC was all about exceptions and I'm
: # not using them in this example.
: open(FILE, $file) || die $!;

That's

my $FILE = open $file || die;

and so on.

: while(<FILE>) {
: &block();
: }
: 
: close FILE;
: }
: 
: allow this:
: 
: File.foreach('/usr/dict/words') { print }

File.foreach('/usr/dict/words', { print })

or even (presuming the prototype is available for parsing):

File.foreach '/usr/dict/words' { print }

: or would the prototype be (&file, &block)?
: 
: And would this:
: 
: my $caller = caller;
: File.foreach('/usr/dict/words') { 
: print $caller eq caller ? "ok" : "not ok" 
: }
: 
: be ok or not ok?  It has to be ok if mywhile is going to emulate a
: while loop.

I don't see why the default caller has to be caller(1).  In any event,
user-defined control code will need to be able to get out of the way
of the programmer's expectations.  A return certainly needs to return
from the surrounding lexical sub block, not from a mere bare block.

Larry

Re: Does this mean we get Ruby/CLU-style iterators?

2002-01-18 Thread Larry Wall


Piers Cawley writes:
: Hmm... making up some syntax on the fly. I sort of like the idea of
: being able to do
: 
: class File;
: sub foreach ($file, &block) is Control {
: # 'is Control' declares this as a control sub, which, amongst
: # other things 'hides' itself from caller. (We can currently 
: # do something like this already using Hooks::LexWrap type
: # tricks.

Maybe, but we'll need more explicit parsing control for other things,
so this may fall out of that.

: open my $fh, $file or die $!; POST { close $fh }

More like:

my $fh = open $file or die;
 
: while (<$fh>) {
: my @ret = wantarray ?? list &block() :: (scalar &block());
: given $! {
: when c::RETURN { return wantarray ?? @ret :: @ret[0] }
: }
: }

That "given $!" would have to be a CATCH, or the code would never be
executed on a control exception.

: This is, of course, dependent on $! not being set to a RETURN control
: 'exception' in the case where we just fall off the end of the block.

I'd say that's correct.

: It's also dependent on being able to get continuations from caller
: (which would be *so* cool

Hmm, might not need to go that far.

: > allow this:
: >
: > File.foreach('/usr/dict/words') { print }
: 
: Sounds plausible to me.

We're not using Ruby syntax here.  Any closure is a real argument with
a real formal argument name, and is called via ordinary &block(...)
syntax, not yield.

: > or would the prototype be (&file, &block)?
: 
: I prefer the ($file, &block) prototype.

I don't see why it would ever be &file.  It's just a string.

: > And would this:
: >
: > my $caller = caller;
: > File.foreach('/usr/dict/words') { 
: > print $caller eq caller ? "ok" : "not ok" 
: > }
: >
: > be ok or not ok?  It has to be ok if mywhile is going to emulate a
: > while loop.
: 
: In theory there's nothing to stop you writing it so that that is the
: case. I'd like it to be as simple as adding an attribute to the
: function declaration (and if it isn't that simple out of the box, it
: will almost certainly be, if not trivial, at least possible to write
: something to *make* it that simple...)

Precisely.

Larry

Parrot strings

2002-01-18 Thread Melvin Smith


Anyone have any objection to adding a couple of calls to terminate
and/or return null terminated strings from Parrot strings for places
where an API expects a standard C string?

I'm not sure of the preferred way to handle this. It would be nice to
at least try to terminate the current string buffer first if there is room
in the buffer and only if that fails to do an allocate or copy.
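
The "terminate in place if there is room, otherwise copy" policy might look roughly like this. It is a Python sketch for illustration only; the real thing would be C against Parrot's STRING struct, and the class and function names here are invented, not existing Parrot API.

```python
class CountedString:
    """Hypothetical length-counted string whose buffer may have spare room."""
    def __init__(self, data: bytes, capacity: int):
        if capacity < len(data):
            raise ValueError("capacity smaller than data")
        self.buffer = bytearray(capacity)
        self.buffer[:len(data)] = data
        self.used = len(data)

def c_string_view(s):
    """Return (buffer, copied) where buffer holds the data followed by a NUL."""
    if s.used < len(s.buffer):
        s.buffer[s.used] = 0          # room left: terminate in place, no copy
        return s.buffer, False
    copy = bytearray(s.buffer[:s.used])
    copy.append(0)                    # buffer full: fall back to a copy
    return copy, True
```

Neither path helps with strings that contain embedded NUL bytes; a C API would still see those as truncated.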

Or is it already there and I don't see it?

-Melvin

Re: Does this mean we get Ruby/CLU-style iterators?

2002-01-18 Thread Piers Cawley


Larry Wall <[EMAIL PROTECTED]> writes:

> Michael G Schwern writes:
> : Reading this in Apoc 4
> : 
> : sub mywhile ($keyword, &condition, &block) {
> : my $l = $keyword.label;
> : while (&condition()) {
> : &block();
> : CATCH {
> : my $t = $!.tag;
> : when X::Control::next { die if $t && $t ne $l; next }
> : when X::Control::last { die if $t && $t ne $l; last }
> : when X::Control::redo { die if $t && $t ne $l; redo }
> : }
> : }
> : }
> : 
> : Implies to me:
> : 
> : A &foo prototype means you can have a bare block anywhere in the
> : arg list (unlike the perl5 syntax).
>
> That is correct.
>
> : Calling &foo() does *not* affect the callstack, otherwise the
> : above would not properly emulate a while loop.
>
> Maybe it's transparent to caller but not to caller($n).  I'm not sure how
> much of a problem this will be.  Inside &block it's a closure, which
> carries a lot of the context you need already.  Continuations may be
> overkill.

I think having the caller($n) stack work so that control structures
are transparent no matter where they came from is really, really
important. But we can do that right now by pulling Hooks::LexWrap type
tricks:

temp &CORE::GLOBAL::caller = { ... };

Problem solved. I'd just hoped it was something we'd not have to do
ourselves in the general case.

> : If that's true, can I pull off my custom iterators?
> : http:[EMAIL PROTECTED]/msg08343.html
> : 
> : Will this:
> : 
> : class File;
> : sub foreach ($file, &block) {
> : # yeah, I know.  The RFC was all about exceptions and I'm
> : # not using them in this example.
> : open(FILE, $file) || die $!;
>
> That's
>
> my $FILE = open $file || die;
>
> and so on.
>
> : while(<FILE>) {
> : &block();
> : }
> : 
> : close FILE;
> : }
> : 
> : allow this:
> : 
> : File.foreach('/usr/dict/words') { print }
>
> File.foreach('/usr/dict/words', { print })
>
> or even (presuming the prototype is available for parsing):
>
> File.foreach '/usr/dict/words' { print }

Hmm... does this mean that control structures are just going to be
normal expressions (a la Ruby)? Or are if/for/loop etc going to be
special cases? I really like them not being special cases, but I can
also see that having:

foreach foreach @a { ... } { ... }

be legal syntax would be very weird indeed. Hmm... going the whole
ruby hog would mean that:

{ ... }.foreach @ary;

would be valid. Hmm...

> : or would the prototype be (&file, &block)?
> : 
> : And would this:
> : 
> : my $caller = caller;
> : File.foreach('/usr/dict/words') { 
> : print $caller eq caller ? "ok" : "not ok" 
> : }
> : 
> : be ok or not ok?  It has to be ok if mywhile is going to emulate a
> : while loop.
>
> I don't see why the default caller has to be caller(1).  In any event,
> user-defined control code will need to be able to get out of the way
> of the programmer's expectations.  A return certainly needs to return
> from the surrounding lexical sub block, not from a mere bare block.

And caller has to 'lie' about its stack, because otherwise methods
that get called from within the loop that do caller($n) will get
confused. 

-- 
Piers

   "It is a truth universally acknowledged that a language in
possession of a rich syntax must be in need of a rewrite."
 -- Jane Austen?

Re: A question

2002-01-18 Thread Piers Cawley


[reformatting response for readability and giving Glenn a stiff talking
to]
Glenn Linderman <[EMAIL PROTECTED]> writes:
> Piers Cawley wrote:
>
>> Okay boys and girls, what does this print:
>>
>> my @aaa = qw/1 2 3/;
>> my @bbb = @aaa;
>>
>> try {
>> print "$_\n";
>> }
>>
>> for @aaa; @bbb -> my $a; my $b {
>> print "$a:$b";
>> }
>>
>> I'm guessing one of:
>> 1:1
>> 2:2
>> 3:3
>>
>> or a syntax error, complaining about something near
>> C<@bbb -> my $a ; my $b {>
>>
>> In other words, how does the parser distinguish between postfix for
>> followed by a semicolon, and the new semicolon enhanced 'normal' for?
>
> That particular example is flawed, because the try expression is turned
> into a try statement because the } stands alone on its line.
>
> But if you eliminate a couple newlines between } and for, then your
> question makes sense (but the code is not well structured, but hey, maybe
> you take out all the newlines for a one-liner...).
>
> The answer in that case is probably a syntax error, and to avoid it, you
> put a ; between the } and the for.

Yeah, that's sort of where I got to as well. But I just wanted to make
sure. I confess I'm somewhat wary of the ';' operator, especially
where it's 'unguarded' by brackets, and once I start programming in
Perl 6 then 

for (@aaa ; @bbb -> $a; $b) { ... }

will be one of my personal style guidelines.
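
For comparison, the lockstep iteration that the semicolon form is meant to
express looks like this in Python, where the pairing is always bracketed.
This is a sketch only; pairwise_lines is a hypothetical name, and zip stops
at the shorter list, which may or may not be what Perl 6 ends up doing.

```python
def pairwise_lines(first, second):
    """Walk two lists in lockstep and format each pair as 'a:b'."""
    # zip pairs the first items, then the second items, and so on.
    return [f"{a}:{b}" for a, b in zip(first, second)]
```

Called with two copies of the list 1 2 3, it gives the 1:1, 2:2, 3:3
result guessed at above.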

-- 
Piers

   "It is a truth universally acknowledged that a language in
possession of a rich syntax must be in need of a rewrite."
 -- Jane Austen?

Re: on parrot strings

2002-01-18 Thread Piers Cawley


Hong Zhang <[EMAIL PROTECTED]> writes:

>> > preprocessing. Another example, if I want to search for /resume/e,
>> > (equivalent matching), the regex engine can normalize the case, fully 
>> > decompose input string, strip off any combining character, and do 8-bit
>> 
>> Hmmm.  The above sounds complicated, and not quite what I had in mind
>> for equivalence matching: I would have just said "both the pattern
>> and the target need to be normalized, as defined by Unicode".  Then 
>> the comparison and searching reduce to the trivial cases of byte
>> equivalence and searching (of which B-M is the most popular example).
>
> You are right in some sense. But "normalized, as defined by Unicode"
> may not be simple. I looked at the Unicode regex TR18. It does not
> specify equivalence of "resume" vs "re`sume`", but a user may or may
> not want this kind of normalization.

But e` and e are different letters, man. And re`sume` and resume are
different words, come to that. If the user wants something that'll
match 'em both then the pattern should surely be:

   /r[ee`]sum[ee`]/

Of course, it might be nice to have something that lets us do

   /r\any_accented(e)sum\any_accented(e)/

(or some such, notation is terrible I know), but my point is that such
searches should be explicit.
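
The two approaches in this exchange can be sketched with Python's standard
unicodedata and re modules. The function names are hypothetical, and this
is one possible reading of "strip off any combining character", not
anything Parrot implements. explicit_match spells out the accepted letters
in the pattern; equivalent_match decomposes both sides, drops combining
marks and ignores case.

```python
import re
import unicodedata


def strip_accents(text):
    """Decompose to NFD and drop combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def explicit_match(text):
    """Accept only what the pattern spells out: e or e-acute in two places."""
    # Compose first so that 'e' + U+0301 is seen as the single letter U+00E9.
    composed = unicodedata.normalize("NFC", text)
    return re.fullmatch("r[e\u00e9]sum[e\u00e9]", composed) is not None


def equivalent_match(word, text):
    """Implicit equivalence: compare after stripping accents and case."""
    return strip_accents(word).casefold() == strip_accents(text).casefold()
```

The explicit form matches only the accents listed in the character class.
The equivalence form also matches any other accented e, which is the
behaviour a user may or may not want.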

-- 
Piers

   "It is a truth universally acknowledged that a language in
possession of a rich syntax must be in need of a rewrite."
 -- Jane Austen?
