Re: Idea for safe signal handling by a byte code interpreter

2001-03-23 Thread Neil Jerram

> "Karl" == Karl M Hegbloom <[EMAIL PROTECTED]> writes:

Karl>  Then, from strategic points within the VM, just as the
Karl> emacsen check for QUIT, you'd check for that signal flag or
Karl> counter, and run the signal handlers from a bottom half of
Karl> some kind.  This way, you know that the interpreter is in a
Karl> consistent state and that you're not halfway through
Karl> something that ought to be more atomic, or whatever.

Karl>  There should be a way to say that "this is a critical
Karl> section; don't call signal handler here" in the scripting
Karl> language.

As far as I understand (both your point and the Guile code), this is
what Guile does already.  The signal handler for signal SIG sets
got_signal[SIG] to 1, and marks a system async to run at the next safe
point.  (A system async is simply a thunk that can be marked for
asynchronous execution.)  Guile checks whether it is safe to run
system asyncs in the macros SCM_ALLOW_INTS and SCM_REALLOW_INTS; thus,
in the Guile C code, the code between SCM_DEFER_INTS and
SCM_ALLOW_INTS, or between SCM_REDEFER_INTS and SCM_REALLOW_INTS, is a
critical section as you suggest.
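
A minimal Perl-level sketch of the same pattern may help; the names here
(%pending, %deferred, do_one_unit_of_work) are invented for illustration
and are not Guile's:

    use strict;
    use warnings;

    my %pending;                              # signal name => count, set by handlers
    $SIG{INT}  = sub { $pending{INT}++ };     # handlers do nothing but record the signal
    $SIG{TERM} = sub { $pending{TERM}++ };

    my %deferred = (                          # the "bottom half" thunks, run later
        INT  => sub { print "handling SIGINT at a safe point\n" },
        TERM => sub { print "handling SIGTERM, exiting\n"; exit 0 },
    );

    while (1) {
        do_one_unit_of_work();                # e.g. one opcode; nothing half-done afterwards
        for my $sig (grep { $pending{$_} } keys %pending) {
            my $n = delete $pending{$sig};    # safe point: run the deferred handlers now
            $deferred{$sig}->() for 1 .. $n;
        }
    }

    sub do_one_unit_of_work { sleep 1 }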

Best regards,
Neil



Re: Schwartzian Transform

2001-03-23 Thread James Mastros

On Thu, Mar 22, 2001 at 11:13:47PM -0500, John Porter wrote:
> Brent Dax wrote:
> > Someone else showed a very ugly syntax with an anonymous
> > hash, and I was out to prove there was a prettier way to do it.
> Do we want prettier?  Or do we want more useful?
> Perl is not exactly known for its pretty syntax.
If you have to explicitly specify both the forward and inverse transforms,
then it isn't very useful -- it's nothing more than map/sort/map.  OTOH, if
you only have to specify the forward mapping, it becomes more useful.  Thus,
I think the best syntax is
tsort({xform}, {compare}, @list), where the {}s are anon blocks or curried
expressions (same thing) and xform specifies the forward mapping (IE (lc
^_)) and compare specifies the comparator (IE (^_ cmp ^_)).

This would always (do the equiv to) create a LoL in the inner map, sort on
the ->[0] elem, and extract the ->[1] elem.  Thus, it might not be as
efficient as a hand-crafted Schwartzian, but will be at least as efficient as
a naive straight sort (except in pathological cases, like tsort((^_),
(^_<=>^_), @list)).
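
Something like this rough Perl 5 sketch of tsort (plain subs standing in
for the curried blocks; the name and interface are only what's proposed
above, nothing settled):

    sub tsort {
        my ($xform, $compare, @list) = @_;
        return map  { $_->[1] }                       # extract the original element
               sort { $compare->($a->[0], $b->[0]) }  # compare the transformed keys
               map  { [ $xform->($_), $_ ] } @list;   # build [key, element] pairs once
    }

    # e.g. case-insensitive sort: forward mapping is lc, comparator is cmp
    my @sorted = tsort(sub { lc $_[0] }, sub { $_[0] cmp $_[1] },
                       qw(Banana apple Cherry));      # apple Banana Cherry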

   -=- James Mastros
-- 
The most beautiful thing we can experience is the mysterious.  It is the
source of all true art and science.  He to whom this emotion is a stranger,
who can no longer pause to wonder and stand wrapt in awe, is as good as dead.
-=- Albert Einstein
AIM: theorbtwo   homepage: http://www.rtweb.net/theorb/



Re: Unicode handling

2001-03-23 Thread Jarkko Hietaniemi

On Fri, Mar 23, 2001 at 02:50:05PM -0500, Dan Sugalski wrote:
> At 02:27 PM 3/23/2001 -0500, Uri Guttman wrote:
> > > "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes:
> >
> >   DS> U doesn't really signal "glyph" to me, but we are sort of limited
> >   DS> in what we have left. We still need a zero-width assertion for
> >   DS> glyph boundary within regexes themselves.
> >
> >how about \C? it doesn't seem to be taken and would mean char boundary (not
> >exactly a glyph but close enough).
> 
> That's got the unfortunate mental association with C's char for lots of 
> folks, and I know I'd probably get it stuck to codepoint rather than glyph 
> if I didn't use it much.

*cough* \C *is* taken.

> >also \U has a meaning in double quotish strings.

"\Uindeed."

> >
> >uri
> >
> >--
> >Uri Guttman  -  [EMAIL PROTECTED]  --  http://www.sysarch.com
> >SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting
> >The Perl Books Page  ---  http://www.sysarch.com/cgi-bin/perl_books
> >The Best Search Engine on the Net  --  http://www.northernlight.com
> 
> 
>   Dan
> 
> --"it's like this"---
> Dan Sugalski  even samurai
> [EMAIL PROTECTED] have teddy bears and even
>   teddy bears get drunk

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: Unicode handling

2001-03-23 Thread Dan Sugalski

At 11:26 PM 3/23/2001 +, Dave Mitchell wrote:
>Dan Sugalski <[EMAIL PROTECTED]> doodled:
> > At 11:09 PM 3/23/2001 +, Simon Cozens wrote:
> > >For instance, chr() will produce Unicode codepoints. But you can 
> pretend that
> > >they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. 
> I hope
> > >and suspect there'll be an equivalent of "use bytes" which makes chr(256)
> > >either blow up or wrap around.
> >
> > Actually no it won't. If the string you're doing a chr on is tagged as
> > EBCDIC, you'll get the EBCDIC value. Yes, it does mean that this:
> >
> > chr($foo) == chr($bar);
> >
> > could evaluate to false if one of the strings is EBCDIC and the other
> > isn't.
>
>Err, perhaps I'm being dumb here - but surely $foo and $bar aren't
>typed strings, they're just numbers (or strings which match /^\d+$/) ???

D'oh! Too much blood in my caffeine stream. Yeah, I was thinking of ord.

chr will emit a character of the type appropriate to the current default 
string context. The default context will probably be settable at compile 
time, or be the platform native type, alterable somehow. Probably "use 
blah;" but that's a language design issue. :)

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




RE: Unicode handling

2001-03-23 Thread Dan Sugalski

At 11:05 AM 3/23/2001 -0600, Garrett Goebel wrote:

>From: Nicholas Clark [mailto:[EMAIL PROTECTED]]
> >
> > On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> > > 1) All Unicode data perl does regular expressions against
> > >will be in Normalization Form C, except for...
> > > 2) Regexes tagged to run against a decomposed form will
> > >instead be run against data in Normalization Form D.
> > >   (What the tag is at the perl level is  up for grabs. I'd
> > >   personally choose a D suffix)
> > > 3) Perl won't otherwise force any normalization on data
> > >already in Unicode format.
> >
> > So if I understand that correctly, running a regexp against a
> > scalar will cause that scalar to become normalized in a
> > defined way (C or  D, depending on regexp)
>
>I'm not sure whether to read that as resulting in scalar being normalized, 
>or if the "data perl does the regular expressions against" would be a 
>normalized copy of that scalar's value.

It could be either way.

>Wouldn't normalizing the scalar lose information? I don't know Unicode, 
>but surely someone must have a use for storing strings in both NFC and 
>NFD. Is it valid to intermix both forms? Isn't there a need to preserve 
>the data in its original encoding? I don't like the idea of the language 
>losing information without the programmer's permission.

Whether normalizing loses information seems to depend on your definition of 
"lose". When you take a Unicode string and put it into either NFC or NFD, 
the result is equivalent, but not the same. The Unicode standard specifies 
what characters and character sequences are equivalent. When you're dealing 
with Unicode data, you're not supposed to care about the actual code 
points, as far as I can tell. (With the possible exception of general 
things like "must be NFC" or "must be NFD")
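
A small illustration, using the Unicode::Normalize module (an assumption
for the example; nothing here is settled Perl 6 behaviour): the precomposed
and decomposed spellings of e-acute differ codepoint by codepoint but
normalize to the same string.

    use Unicode::Normalize qw(NFC NFD);

    my $composed   = "\x{e9}";        # LATIN SMALL LETTER E WITH ACUTE
    my $decomposed = "e\x{301}";      # 'e' + COMBINING ACUTE ACCENT

    print $composed eq $decomposed ? "same codepoints\n" : "different codepoints\n";
    print NFC($composed) eq NFC($decomposed) ? "equivalent under NFC\n" : "huh?\n";
    printf "NFD form is %d codepoints long\n", length NFD($composed);   # 2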


Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




RE: Unicode handling

2001-03-23 Thread Dan Sugalski

At 01:26 PM 3/23/2001 -0500, NeonEdge wrote:
>Dan Sugalski wrote:
> >If we do, then something as simple as this:
> >
> >   while (<IN>) {
> > $count++ if /bar/;
> > print OUT $_;
> >   }
> >
> >would potentially result in the output file being rather different from the
> >input file. Equivalent, yes, but different. Whether that's bad or not is an
> >open question.
>
>I don't believe that any scalar defined within the parsed application space
>should be transformed permanently. There shouldn't be any difference between
>the input file and the output file in the above example (it could cause 
>issues with non-Perl apps).

This is not unreasonable. I'd prefer not to carry around multiple versions 
of strings if it can be avoided, though. That can add up really quickly.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: Unicode handling

2001-03-23 Thread Dan Sugalski

At 11:52 AM 3/23/2001 -0800, Hong Zhang wrote:
> > >I recommend to use 'u' flag, which indicates all operations are performed
> > >against unicode grapheme/glyph. By default re is performed on codepoint.
> >
> > U doesn't really signal "glyph" to me, but we are sort of limited in what
> > we have left. We still need a zero-width assertion for glyph boundary
> > within regexes themselves.
>
>The 'u' flag means "advanced unicode feature(s)", which includes "always
>matching against glyph/grapheme, not codepoint". What it really means is
>up to discussion.  I think we probably still need "glyph" or "grapheme"
>boundary in some cases.

Fair enough. I think there are some cases where there's a base/combining 
pair of codepoints that don't map to a single combined-character code 
point. Not matching on a glyph boundary could make things really odd, but 
I'd hate to have the checking code on by default, since that'd slow down 
the common case where the string in NFC won't have those.

> > >We need the character equivalence construct, such as [[=a=]], which
> > >matches "a", "A ACUTE".
> >
> > Yeah, we really need a big list of these. PDD anyone?
>
>I don't think we need a big list here. The [[=a=]] is part of POSIX 1003.2
>regex syntax, also [[.ch.]]. Perl 5 does not support this syntax. We can
>implement it in Perl 6.

That's a separate issue I think I'll dodge for right now.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: Unicode handling

2001-03-23 Thread Dan Sugalski

At 10:48 PM 3/23/2001 +, Simon Cozens wrote:
>On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> > Yes, I realize that point 5 may result in someone getting a meaningless
> > Unicode string. Too bad--it is *not* the place of a programming 
> language to
> > enforce validity on data. That's the programmer's job.
>
>But points 4 and 5 do enforce Unicode on everyone. Not that I'm particularly
>upset by that idea, but... :)

Nah, they only apply to data that perl's tagged as Unicode, either because 
its input stream is marked that way or because the program explicitly 
converted the data. A plain:

   open FOO, "some.file";
   while (<FOO>) { whatever($_) }

probably won't be dealing with Unicode data. (Unless for some reason perl's 
been told that all files are Unicode by default)

I expect the default data types for data that comes from files will be 
either binary or ascii for most systems, EBCDIC on OS/390 systems, and 
potentially Unicode on Windows and non-US systems.

Dealing with typed strings might make binmode (and perhaps corresponding 
asciimode, unicodemode, or ebcdicmode) more frequently used. Or not.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: Unicode handling

2001-03-23 Thread Bryan C. Warnock

On Friday 23 March 2001 14:48, you wrote:
> In Unicode, there's theoretically no locale. Theoretically...

Well, yes, but Unicode makes no pretenses about encoding the world's 
languages - just the various symbols used by the world's languages.

If you want to orient Perl so that it remains(?) data-oriented, even when 
processing text, as an independent underlayer to locale processing, then 
that's fine, I guess, as long as you aren't forcing *someone's* locale onto 
it at that lower layer.

If you want to orient it so that it processes the text as...  well, a textual 
representation of a language, then you'll have to consider locale issues.
At some point.  

Okay, now I see Hong's response.  Yes, I'm understanding now.  Not character 
equivalence from a linguistic perspective, but simply

/({base glyph}{combining glyphs}*)/

Okay, I'll go back to lurking.



-- 
Bryan C. Warnock
[EMAIL PROTECTED]



RE: Unicode handling

2001-03-23 Thread NeonEdge

Dan Sugalski wrote:
>If we do, then something as simple as this:
>
>   while (<IN>) {
> $count++ if /bar/;
> print OUT $_;
>   }
>
>would potentially result in the output file being rather different from the
>input file. Equivalent, yes, but different. Whether that's bad or not is an
>open question.

I don't believe that any scalar defined within the parsed application space
should be transformed permanently. There shouldn't be any difference between
the input file and the output file in the above example (it could cause issues
with non-Perl apps).
I think the rule should be to store normalized scalars as separate from the
original and leave the original unaffected.
There are specific cases where it would be OK to normalize the original, such
as error strings, and other scalars used internally by Perl. Perhaps the
developer could 'use normalize' to force the scalars to be normalized for
optimization purposes, but Perl shouldn't force normalization.
Grant M.





Re: Unicode handling

2001-03-23 Thread Dan Sugalski

At 02:31 PM 3/23/2001 -0500, Bryan C. Warnock wrote:
>On Friday 23 March 2001 14:18, Dan Sugalski wrote:
> > At 01:30 PM 3/22/2001 -0800, Hong Zhang wrote:
> > > > 6) There will be a glyph boundary/non-glyph boundary pair of regex
> > > > characters to match the word/non-word boundary ones we already have.
> > >
> > >(While
> > >
> > > > I'd personally like \g and \G, that won't work as \G is already taken)
> > > >
> > > > I also realize that the decomposition flag on regexes would mean that
> > > > s/A/B/D would turn A ACUTE to B ACUTE, which is meaningless. See the
> > > > previous paragraph.
> > >
> > >I recommend to use 'u' flag, which indicates all operations are performed
> > >against unicode grapheme/glyph. By default re is performed on codepoint.
> >
> > U doesn't really signal "glyph" to me, but we are sort of limited in what
> > we have left. We still need a zero-width assertion for glyph boundary
> > within regexes themselves.
> >
> > >We need the character equivalence construct, such as [[=a=]], which
> > >matches "a", "A ACUTE".
> >
> > Yeah, we really need a big list of these. PDD anyone?
> >
>
>But surely this is a locale issue, and not an encoding one?  Not every
>language recognizes the same character equivalences.

In Unicode, there's theoretically no locale. Theoretically...

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: Unicode handling

2001-03-23 Thread Dan Sugalski

At 01:30 PM 3/22/2001 -0800, Hong Zhang wrote:
> > 6) There will be a glyph boundary/non-glyph boundary pair of regex
> > characters to match the word/non-word boundary ones we already have.
>(While
> > I'd personally like \g and \G, that won't work as \G is already taken)
> >
> > I also realize that the decomposition flag on regexes would mean that
> > s/A/B/D would turn A ACUTE to B ACUTE, which is meaningless. See the
> > previous paragraph.
>
>I recommend to use 'u' flag, which indicates all operations are performed
>against unicode grapheme/glyph. By default re is performed on codepoint.

U doesn't really signal "glyph" to me, but we are sort of limited in what 
we have left. We still need a zero-width assertion for glyph boundary 
within regexes themselves.

>We need the character equivalence construct, such as [[=a=]], which
>matches "a", "A ACUTE".

Yeah, we really need a big list of these. PDD anyone?

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




RE: Unicode handling

2001-03-23 Thread Garrett Goebel

From: Nicholas Clark [mailto:[EMAIL PROTECTED]]
> 
> On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> > 1) All Unicode data perl does regular expressions against 
> >will be in Normalization Form C, except for...
> > 2) Regexes tagged to run against a decomposed form will 
> >instead be run against data in Normalization Form D.
> >   (What the tag is at the perl level is  up for grabs. I'd
> >   personally choose a D suffix)
> > 3) Perl won't otherwise force any normalization on data 
> >already in Unicode format.
> 
> So if I understand that correctly, running a regexp against a 
> scalar will cause that scalar to become normalized in a
> defined way (C or  D, depending on regexp)

I'm not sure whether to read that as resulting in scalar being normalized,
or if the "data perl does the regular expressions against" would be a
normalized copy of that scalar's value.

Wouldn't normalizing the scalar lose information? I don't know Unicode, but
surely someone must have a use for storing strings in both NFC and NFD. Is
it valid to intermix both forms? Isn't there a need to preserve the data in
its original encoding? I don't like the idea of the language losing
information without the programmer's permission.


> > 5) Any character-based call (ord, substr, whatever) will 
> >deal with whatever code-points are at the location
> >specified. If the string is LATIN SMALL LETTER A, 
> >COMBINING ACUTE ACCENT and someone does a 
> >substr($foo, 1, 1) on it, you get back the single
> >character COMBINING ACUTE ACCENT, and an ord would
> >return the value 769.
> 
> So if you do (ord, substr, whatever) on a scalar without 
> knowing where it has been, you have no idea whether you're
> working on normalised or not. And in fact the same scalar
> may be come denormalised:
> 
>   $bar = substr $foo, 3, 1;
>   &frob ($foo);
>   $baz = substr $foo, 3, 1;

Hmm... if I put on my "everything is an object in Perl 6" blinders, wouldn't
that be:

$foo : utf8d = "timtowtdi"; 
$bar : utf8  = substr $foo, 3, 1;
$baz : char8 = substr($foo,0,3) . substr($bar,3,3) . "tdi";

o  $foo would be normalized to NFD
o  substr would know what $foo is and operate on it per NFD
o  $bar would be normalized to NFC.
o  $baz would work with byte characters indeterminately

i.e., substr, ord, length would DWIM based on what type of string it is.


>  $foo =~ /^$bar$/;# did I need to \Q \E this?
>
> might be true at the same time as
> 
>  $foo ne $bar

Have the match operate on a copy $bar normalized to whatever $foo is.


> I'm in two minds about this. It feels like it would be hard
> to implement the internals to make eq work on normalized
> forms without either
> 
> 1: causing it to not be read only, hence UTF8 in might not be UTF8 out
>because it had been part of an eq
> 
> or
> 
> 2: having to double buffer almost every scalar, with both the 
> original UTF8
>and a (cached copy) normalized form

I really don't want to see #1. Do my naive suggestions get around #2?

Garrett



Re: Unicode handling

2001-03-23 Thread Bryan C. Warnock

On Friday 23 March 2001 14:18, Dan Sugalski wrote:
> At 01:30 PM 3/22/2001 -0800, Hong Zhang wrote:
> > > 6) There will be a glyph boundary/non-glyph boundary pair of regex
> > > characters to match the word/non-word boundary ones we already have.
> >
> >(While
> >
> > > I'd personally like \g and \G, that won't work as \G is already taken)
> > >
> > > I also realize that the decomposition flag on regexes would mean that
> > > s/A/B/D would turn A ACUTE to B ACUTE, which is meaningless. See the
> > > previous paragraph.
> >
> >I recommend to use 'u' flag, which indicates all operations are performed
> >against unicode grapheme/glyph. By default re is performed on codepoint.
>
> U doesn't really signal "glyph" to me, but we are sort of limited in what
> we have left. We still need a zero-width assertion for glyph boundary
> within regexes themselves.
>
> >We need the character equivalence construct, such as [[=a=]], which
> >matches "a", "A ACUTE".
>
> Yeah, we really need a big list of these. PDD anyone?
>

But surely this is a locale issue, and not an encoding one?  Not every 
language recognizes the same character equivalences.


-- 
Bryan C. Warnock
[EMAIL PROTECTED]



Re: Unicode handling

2001-03-23 Thread Uri Guttman

> "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes:

  DS> U doesn't really signal "glyph" to me, but we are sort of limited
  DS> in what we have left. We still need a zero-width assertion for
  DS> glyph boundary within regexes themselves.

how about \C? it doesn't seem to be taken and would mean char boundary (not
exactly a glyph but close enough).

also \U has a meaning in double quotish strings.

uri

-- 
Uri Guttman  -  [EMAIL PROTECTED]  --  http://www.sysarch.com
SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting
The Perl Books Page  ---  http://www.sysarch.com/cgi-bin/perl_books
The Best Search Engine on the Net  --  http://www.northernlight.com



Re: Unicode handling

2001-03-23 Thread Simon Cozens

On Fri, Mar 23, 2001 at 03:15:41PM -0800, Brad Hughes wrote:
> Simon Cozens wrote:
> [...]
> > I'm just not sure it's fair on Old World hackers. Will there be a way to stop
> > Perl upgrading stuff to Unicode on the way in?
> 
> and I'm probably not the only Old World hacker that would
> prefer a build option to simply eliminate Unicode support altogether...

Eh, no, read it again. (I had to.) It won't interfere with Old World hackers
at all. Data coming in won't be implicitly converted to Unicode, so programs
under Perl 6 should see legacy data the same way as they do under 5.6.0; if
you pretend that Unicode isn't there, it won't bother you. 

For instance, chr() will produce Unicode codepoints. But you can pretend that
they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. I hope
and suspect there'll be an equivalent of "use bytes" which makes chr(256)
either blow up or wrap around.

So we're not exactly forcing Unicode down people's throats. (Damn.)

-- 
"Even more amazing was the realization that God has Internet access.  I
wonder if He has a full newsfeed?"
(By Matt Welsh)



Re: Unicode handling

2001-03-23 Thread Damien Neil

On Fri, Mar 23, 2001 at 06:16:58PM -0500, Dan Sugalski wrote:
> At 11:09 PM 3/23/2001 +, Simon Cozens wrote:
> >For instance, chr() will produce Unicode codepoints. But you can pretend that
> >they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. I hope
> >and suspect there'll be an equivalent of "use bytes" which makes chr(256)
> >either blow up or wrap around.
> 
> Actually no it won't. If the string you're doing a chr on is tagged as 
> EBCDIC, you'll get the EBCDIC value. Yes, it does mean that this:
> 
> chr($foo) == chr($bar);
> 
> could evaluate to false if one of the strings is EBCDIC and the other 
> isn't. Odd but I don't see a good reason not to. Otherwise we'd want to 
> force everything to Unicode, and then what do we do if one of the strings 
> is plain binary data?

Are you thinking of ord rather than chr?  I can't seem to make the
above make sense otherwise.  chr takes a number, not a string as its
argument...

Your initial description of character set handling didn't mention
that different strings can be tagged as having different encodings,
and didn't cover the implications of this.  Could you give a list
of the specific occasions when the encoding of a string would be
visible to a programmer?

- Damien



Re: Unicode handling

2001-03-23 Thread Dan Sugalski

At 10:56 AM 3/23/2001 -0800, Damien Neil wrote:
>On Fri, Mar 23, 2001 at 12:38:04PM -0500, Dan Sugalski wrote:
> >while (<IN>) {
> >  $count++ if /bar/;
> >  print OUT $_;
> >}
>
>I would find it surprising for this to have different output
>than input.  Other people's mileage may vary.

I can understand that.

>In general, however, I think I would prefer to be required to
>explicitly normalize my data (via a function, pragma, or option
>set on a filehandle) than have data change unexpectedly behind
>my back.

But since the data is equivalent, and more importantly Unicode, it's not 
supposed to matter to you. Whether it *does* or not is a separate 
question... :)

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: Unicode handling

2001-03-23 Thread Damien Neil

On Fri, Mar 23, 2001 at 06:31:13PM -0500, Dan Sugalski wrote:
> >Err, perhaps I'm being dumb here - but surely $foo and $bar aren't
> >typed strings, they're just numbers (or strings which match /^\d+$/) ???
> 
> D'oh! Too much blood in my caffeine stream. Yeah, I was thinking of ord.
> 
> chr will emit a character of the type appropriate to the current default 
> string context. The default context will probably be settable at compile 
> time, or be the platform native type, alterable somehow. Probably "use 
> blah;" but that's a language design issue. :)

Ah, this answers the puzzlement in the message I just sent. :>

So the results of ord are dependent on a global setting for "current
character set" or some such, not on the encoding of the string that
is passed to it?

  - Damien



Re: Unicode handling

2001-03-23 Thread Hong Zhang

> >I recommend to use 'u' flag, which indicates all operations are performed
> >against unicode grapheme/glyph. By default re is performed on codepoint.
>
> U doesn't really signal "glyph" to me, but we are sort of limited in what
> we have left. We still need a zero-width assertion for glyph boundary
> within regexes themselves.

The 'u' flag means "advanced unicode feature(s)", which includes "always
matching against glyph/grapheme, not codepoint". What it really means is
up for discussion.  I think we probably still need "glyph" or "grapheme"
boundary in some cases.

> >We need the character equivalence construct, such as [[=a=]], which
> >matches "a", "A ACUTE".
>
> Yeah, we really need a big list of these. PDD anyone?

I don't think we need a big list here. The [[=a=]] is part of POSIX 1003.2
regex syntax, also [[.ch.]]. Perl 5 does not support this syntax. We can
implement it in Perl 6.

For more advanced equivalence, we can offload the job to a collation library.

Hong




Re: Distributive -> and indirect slices

2001-03-23 Thread Rick Welykochy

Simon Cozens wrote:
> 
> On Mon, Mar 19, 2001 at 08:30:31AM -0800, Peter Scott wrote:
> > Seen http://dev.perl.org/rfc/82.pod?
> 
> I hadn't. I'm surprised it didn't give the PDL people screaming fits.
> But no, I wouldn't do it like that. It has:
> 
>  @b = (1,2,3);
>  @c = (2,4,6);
>  @d = @b * @c;   # Returns (2,8,18)
> 
> Where I would have @d = (2,4,6,4,8,12,6,12,18);

The first example above applies a scalar operator to two
vectors element-wise, as in APL, and can be generalised for
all arithmetic operators.

The second example is the outer (cross) product
of two vectors.


> However, this isn't great language design; it's applying a specific solution
> to a specific problem. Better is to solve the general problem, and have all
> operators overloadable even on non-objects, so the user can define how this
> sort of thing works.

Precisely. The operators for arrays (lists) should be defined
orthogonally and completely in a mathematically consistent
manner to be of general purpose use.

Here is one example of such a definition, freely
adapted from the APL syntax:


1. scalar array operators:  @a ? @b

   For any numerical operator '?', two compatible arrays
   can be combined with that operator producing a new array
   as follows:

   @r = @a ? @b 

   where

   $r[i] = $a[i] ? $b[i]

   complexity: if @a and @b are not the same length, the shorter
   is extended with the appropriate scalar value, i.e. 0 for
   additive and 1 for multiplicative operators.

   example:
   @a = (1,2,3);
   @b = (2,4,6);
   @r = @a + @b;   # @r = (3,6,9)

   As well, mixed array and scalar operands are allowed:

   @a + 10 produces (11,12,13)
   20 - @b produces (18,16,14)


2. reduction:  ?/@a

   Reduction introduces the new diglyphs that end in '/',
   and are preceded by a numerical operator, i.e.

   +/  -/  */  //

   Any numerical array can be reduced to a scalar value over
   a given numerical operator '?' as follows:

   $r = ?/@a

   where

   $r = $a[0] ? $a[1] ? $a[2] ...

   example:
   @a = (2,4,6);
   $r = */@a;  # $r = 48
   

3. inner product:  @a ?/! @b

   Inner product introduces the new triglyphs with '/' as the
   middle character, and surrounded by numerical operator, i.e.

   +/* */- -/* -/- etc.

   For any numerical operators '?' and '!', two compatible arrays
   can be combined with those operators producing a scalar
   inner product, as follows:

   $r = @a ?/! @b 

   where

   $r = ?/ (@a ! @b)

   i.e.

   $r = ($a[0] ! $b[0]) ? ($a[1] ! $b[1]) ? ...

   example:
   @a = (1,2,3);
   @b = (2,4,6);
   $r = @a +/* @b;   # @r = +/(2,8,18) = 28


4. outer product:  @a @? @b

   Outer product introduces the new diglyphs that start with  '@',
   and end with a numerical operator, i.e.

   @+ @- @* @/

   For any numerical operator '?', any two arrays can be combined
   into the cross-product with those operators producing a new array
   containing the outer product. Each row of the outer product is
   concatenated to the next to produce the result, best illustrated
   by the following example:

   @a = (1,2,3);
   @b = (2,4,6);
   @r = @a @* @b;

   @r = (2,4,6), (4,8,12), (6,12,18) = (2,4,6,4,8,12,6,12,18)


Of course, there are probably syntactic nightmares in introducing
the diglyphs and triglyphs mentioned above into perl.   
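
For a feel of the semantics without the new glyphs, here's a rough sketch
of the four operations as ordinary Perl subs (names like zip_with and
reduce_over are invented, and equal-length arrays are assumed):

    use List::Util qw(reduce);

    sub zip_with {                    # 1. element-wise:  @a ? @b
        my ($op, $xs, $ys) = @_;
        return map { $op->($xs->[$_], $ys->[$_]) } 0 .. $#$xs;
    }
    sub reduce_over {                 # 2. reduction:  ?/@a
        my ($op, @list) = @_;
        return reduce { $op->($a, $b) } @list;
    }
    sub inner_product {               # 3. @a ?/! @b  ==  ?/(@a ! @b)
        my ($red, $op, $xs, $ys) = @_;
        return reduce_over($red, zip_with($op, $xs, $ys));
    }
    sub outer_product {               # 4. @a @? @b, rows concatenated
        my ($op, $xs, $ys) = @_;
        return map { my $x = $_; map { $op->($x, $_) } @$ys } @$xs;
    }

    my $add = sub { $_[0] + $_[1] };
    my $mul = sub { $_[0] * $_[1] };
    print join(',', zip_with($add, [1,2,3], [2,4,6])), "\n";        # 3,6,9
    print reduce_over($mul, 2, 4, 6), "\n";                         # 48
    print inner_product($add, $mul, [1,2,3], [2,4,6]), "\n";        # 28
    print join(',', outer_product($mul, [1,2,3], [2,4,6])), "\n";   # 2,4,6,4,8,12,6,12,18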

--
Rick Welykochy || Praxis Services Pty Limited



PDD for coding conventions

2001-03-23 Thread Dave Mitchell

About a month ago I started working on a PDD for how code should
be commented; some while later Paolo Molaro <[EMAIL PROTECTED]>
submitted a draft PDD ('PDD X') on "Perl API conventions".

This got me thinking that, rather than accumulating lots of micro PDDs,
we should have a single one entitled "coding conventions" that includes
sections on naming and API conventions, how to comment code, etc etc.

Then the FAQs can simply state
"before you contribute src code, make sure you have thoroughly read PDD X".

Provisionally I think it should have the following sections:

* Coding style

largely lifted from Porting/patching.pod, eg function names start in
column 0, indent = 4, etc etc

* Naming conventions

how macros, variables (global or otherwise), structs, files, APIs, plus
anything else you can think of, should be named.
- based on Paolo's work

* Commenting conventions

how individual items such as functions, macros etc should be commented,
plus how larger scale things (such as src files and implementation decisions)
should be commented.
- based on my work.

* Portability guidelines

The basic dos and don'ts of writing portable code, especially with Perl in
mind - eg whether to assume ANSI C, things not to assume about int
sizes, and anything else you can think of.
- someone would need to write this.

* Performance guidelines

The basic dos and don'ts of writing code that runs well on modern processors,
eg the effect of caches and pipelines (avoid those branches, man!),
are globals Good or Evil (or Chaotic Neutral...).
- someone would need to write this.


Waddayafink? If people don't object, I'll begin drafting.


* Dave Mitchell, Senior Technical Consultant
* Fretwell-Downing Informatics Ltd, UK.  [EMAIL PROTECTED]
* Tel: +44 114 281 6113.The usual disclaimers
*
* Standards (n). Battle insignia or tribal totems






Re: Unicode handling

2001-03-23 Thread Damien Neil

On Fri, Mar 23, 2001 at 12:38:04PM -0500, Dan Sugalski wrote:
>while (<IN>) {
>  $count++ if /bar/;
>  print OUT $_;
>}

I would find it surprising for this to have different output
than input.  Other people's mileage may vary.

In general, however, I think I would prefer to be required to
explicitly normalize my data (via a function, pragma, or option
set on a filehandle) than have data change unexpectedly behind
my back.

 - Damien



Re: Unicode handling

2001-03-23 Thread Dan Sugalski

At 11:09 PM 3/23/2001 +, Simon Cozens wrote:
>For instance, chr() will produce Unicode codepoints. But you can pretend that
>they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. I hope
>and suspect there'll be an equivalent of "use bytes" which makes chr(256)
>either blow up or wrap around.

Actually no it won't. If the string you're doing a chr on is tagged as 
EBCDIC, you'll get the EBCDIC value. Yes, it does mean that this:

chr($foo) == chr($bar);

could evaluate to false if one of the strings is EBCDIC and the other 
isn't. Odd but I don't see a good reason not to. Otherwise we'd want to 
force everything to Unicode, and then what do we do if one of the strings 
is plain binary data?

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: Safe signals and perl 6

2001-03-23 Thread Uri Guttman

> "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes:

  DS> Generally speaking, signals will be treated as generic async events in perl 
  DS> 6, since that's what they are. (The ones that aren't, like SIGBUS, really 
  DS> aren't things that perl code can catch...) They're going to be treated 
  DS> pretty much like any other event, or so the plan is at least.

  DS> Uri's working on an event handling PDD for perl 6 IIRC, so when
  DS> that comes we can work from there.

thanx for reminding me to work on it. it has been back burnered for a
little while.

uri

-- 
Uri Guttman  -  [EMAIL PROTECTED]  --  http://www.sysarch.com
SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting
The Perl Books Page  ---  http://www.sysarch.com/cgi-bin/perl_books
The Best Search Engine on the Net  --  http://www.northernlight.com



Re: Unicode handling

2001-03-23 Thread Dave Mitchell

Dan Sugalski <[EMAIL PROTECTED]> doodled:
> At 11:09 PM 3/23/2001 +, Simon Cozens wrote:
> >For instance, chr() will produce Unicode codepoints. But you can pretend that
> >they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. I hope
> >and suspect there'll be an equivalent of "use bytes" which makes chr(256)
> >either blow up or wrap around.
> 
> Actually no it won't. If the string you're doing a chr on is tagged as 
> EBCDIC, you'll get the EBCDIC value. Yes, it does mean that this:
> 
> chr($foo) == chr($bar);
> 
> could evaluate to false if one of the strings is EBCDIC and the other 
> isn't.

Err, perhaps I'm being dumb here - but surely $foo and $bar aren't
typed strings, they're just numbers (or strings which match /^\d+$/) ???




Re: Schwartzian Transform

2001-03-23 Thread Mark Koopman

i have to put my 2 cents in...
after reading all the discussion so far about the Schwartz,
i feel that map{} sort map{} is perfect in its syntax.  
if you code and understand Perl (i've seen situations where
these aren't always both happening at the same time) and knowingly 
use the building block functions, sort and map, to create an
abstraction like the Schwartzian transform, then why do you 
need to come up with special syntax or use a Sort::Module, as
it was suggested, to achieve just the same thing.  my point
is that i wonder if it's useful for Perl or people who write
Perl, to bundle a map and sort function into some special 
schwartzian syntax, is the goal just to abstract another layer
above the transform itself?  why not just keep using map{} sort
map {}, if it's a well understood concept?
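
for reference, the idiom in question in its usual form -- e.g. sorting the
plain files in the current directory by size, stat()ing each file only once:

    my @by_size = map  { $_->[1] }                 # unwrap the file name
                  sort { $a->[0] <=> $b->[0] }     # sort on the cached size
                  map  { [ -s $_, $_ ] }           # stat each file exactly once
                  grep { -f } glob('*');
    print "$_\n" for @by_size;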

monty


James Mastros wrote:
> 
> On Thu, Mar 22, 2001 at 11:13:47PM -0500, John Porter wrote:
> > Brent Dax wrote:
> > > Someone else showed a very ugly syntax with an anonymous
> > > hash, and I was out to prove there was a prettier way to do it.
> > Do we want prettier?  Or do we want more useful?
> > Perl is not exactly known for its pretty syntax.
> If you have to explicitly specify both the forward and inverse transforms,
> then it isn't very useful -- it's nothing more than map/sort/map.  OTOH, if
> you only have to specify the forward mapping, it becomes more useful.  Thus,
> I think the best syntax is
> tsort({xform}, {compare}, @list), where the {}s are anon blocks or curried
> expressions (same thing) and xform specifies the forward mapping (IE (lc
> ^_)) and compare specifies the comparator (IE (^_ cmp ^_)).
> 
> This would always (do the equiv to) create a LoL in the inner map, sort on
> the ->[0] elem, and extract the ->[1] elem.  Thus, it might not be as
> efficient as a hand-crafted Schwartzian, but will be at least as efficient as
> a naive straight sort (except in pathological cases, like tsort((^_),
> (^_<=>^_), @list)).
> 
>-=- James Mastros
> --
> The most beautiful thing we can experience is the mysterious.  It is the
> source of all true art and science.  He to whom this emotion is a stranger,
> who can no longer pause to wonder and stand wrapt in awe, is as good as dead.
> -=- Albert Einstein
> AIM: theorbtwo   homepage: http://www.rtweb.net/theorb/

-- 
Mark Koopman
Software Engineer

WebSideStory, Inc

10182 Telesis Court
San Diego CA  92121
858.546.1182.##.318
858.546.0480.fax

perl -e '
eval(lc(join("",
map ({chr}(q(
49877273766940
80827378843973
32767986693280
69827639463932
39883673434341
))=~/../g;'



Re: Unicode handling

2001-03-23 Thread Simon Cozens

On Fri, Mar 23, 2001 at 05:56:19PM -0500, Dan Sugalski wrote:
> Nah, they only apply to data that perl's tagged as Unicode, either because 
> its input stream is marked that way or because the program explicitly 
> converted the data.

Oh, colour me dull. I read

4) Data converted to Unicode (from ASCII, EBCDIC, one of the JIS
encodings, or whatever) will be done into NFC.

as meaning that data in ASCII, EBCDIC, or whatever will be converted to
Unicode in NFC.

Now I know what you mean, I think the rules you've described are perfect.

-- 
Heh, heh, heh, heh... the NOISE of a bursar CHEWING Proctors' Memoranda.
- Henry Braun is Oxford Zippy



Re: Unicode handling

2001-03-23 Thread Nicholas Clark

On Fri, Mar 23, 2001 at 03:08:35PM -0500, Dan Sugalski wrote:
> I'm half tempted, since this is a Unicode-only feature, to use a non-ASCII 
> character.
> 
> \SMILEY FACE, perhaps?

that makes it kind of hard to edit perl scripts that use this feature on
any good old fashioned 8 bit xterm.
Let alone some crufty 7 bit serial login.
I think it would be a bad thing to effectively mandate that to use certain
features you had to use a Unicode aware editing system

Nicholas Clark



Re: Unicode handling

2001-03-23 Thread Dan Sugalski

At 11:41 PM 3/22/2001 +, Nicholas Clark wrote:
>On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> > 1) All Unicode data perl does regular expressions against will be in
> > Normalization Form C, except for...
> > 2) Regexes tagged to run against a decomposed form will instead be run
> > against data in Normalization Form D. (What the tag is at the perl 
> level is
> > up for grabs. I'd personally choose a D suffix)
> > 3) Perl won't otherwise force any normalization on data already in Unicode
> > format.
>
>So if I understand that correctly, running a regexp against a scalar will
>cause that scalar to become normalized in a defined way (C or D, depending
>on regexp)

It will be run against a normalized version of the data in the scalar, yes. 
Whether that forces the scalar to be normalized or not is something I 
hadn't thought of. If we do, then something as simple as this:

   while (<IN>) {
 $count++ if /bar/;
 print OUT $_;
   }

would potentially result in the output file being rather different from the 
input file. Equivalent, yes, but different. Whether that's bad or not is an 
open question.

> > 5) Any character-based call (ord, substr, whatever) will deal with 
> whatever
> > code-points are at the location specified. If the string is LATIN SMALL
> > LETTER A, COMBINING ACUTE ACCENT and someone does a substr($foo, 1, 1) on
> > it, you get back the single character COMBINING ACUTE ACCENT, and an ord
> > would return the value 769.
>
>So if you do (ord, substr, whatever) on a scalar without knowing where it has
>been, you have no idea whether you're working on normalised or not.

Potentially, yes. If it's important, you force normalization on it.

>And in fact the same scalar may be come denormalised:
>
>   $bar = substr $foo, 3, 1;
>   &frob ($foo);
>   $baz = substr $foo, 3, 1;
>
>[so $bar and $baz differ] if someone runs it against a regular expression
>[in this case inside the subroutine &frob. Hmm, but currently you can
>make changes to parameters as they are pass-by-reference]
>
>   $bar = substr $foo, 3, 1;
>   $foo =~ /foo/;# This is not read only in perl6
>   $baz = substr $foo, 3, 1;
>
>But this is documented - if you want (ord, substr, whatever) on a string
>to make sense, you must explicitly normalized it to the form you want before
>hand, and not use any of the documented-as-normalizing operators on it
>without normalizing it again.

It's generally safe to say that if you want data to make sense period, you 
need to make sure it's sensible first. Unicode with combining characters 
does tend to exacerbate things, but it's not a new problem.

>And by implication of the above (particularly rule 3), eq compares
>codepoints, not normalized forms.

I hadn't thought about eq, gt, or lt. (Or sort, for that matter)

eq probably ought to work against codepoints, and be done with it. 
gt/lt/sort should normalize and use the Unicode sorting stuff to determine 
where things stand. I don't think they should alter the data, as we may be 
promoting non-unicode data to unicode for comparisons. (If we're comparing 
ASCII and Unicode scalars, or even something odd like Shift-JIS and EBCDIC 
scalars)

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: Distributive -> and indirect slices

2001-03-23 Thread Simon Cozens

On Mon, Mar 19, 2001 at 08:30:31AM -0800, Peter Scott wrote:
> Seen http://dev.perl.org/rfc/82.pod?

I hadn't. I'm surprised it didn't give the PDL people screaming fits.
But no, I wouldn't do it like that. It has:

 @b = (1,2,3);
 @c = (2,4,6);
 @d = @b * @c;   # Returns (2,8,18)

Where I would have @d = (2,4,6,4,8,12,6,12,18);

However, this isn't great language design; it's applying a specific solution
to a specific problem. Better is to solve the general problem, and have all
operators overloadable even on non-objects, so the user can define how this
sort of thing works.

-- 
I want you to know that I create nice things like this because it
pleases the Author of my story.  If this bothers you, then your notion
of Authorship needs some revision.  But you can use perl anyway. :-)
- Larry Wall



Re: Unicode handling

2001-03-23 Thread Larry Wall

Jarkko Hietaniemi writes:
: *cough* \C *is* taken.
: 
: > >also \U has a meaning in double quotish strings.
: 
: "\Uindeed."

Bear in mind we are redesigning the language.  If there's a botch we
can think about fixing it.

Though maybe not on -internals...   :-)

Larry



Re: Unicode handling

2001-03-23 Thread Dan Sugalski

At 02:27 PM 3/23/2001 -0500, Uri Guttman wrote:
> > "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes:
>
>   DS> U doesn't really signal "glyph" to me, but we are sort of limited
>   DS> in what we have left. We still need a zero-width assertion for
>   DS> glyph boundary within regexes themselves.
>
>how about \C? it doesn't seem to be taken and would mean char boundary (not
>exactly a glyph but close enough).

That's got the unfortunate mental association with C's char for lots of 
folks, and I know I'd probably get it stuck to codepoint rather than glyph 
if I didn't use it much.

>also \U has a meaning in double quotish strings.
>
>uri
>
>--
>Uri Guttman  -  [EMAIL PROTECTED]  --  http://www.sysarch.com
>SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting
>The Perl Books Page  ---  http://www.sysarch.com/cgi-bin/perl_books
>The Best Search Engine on the Net  --  http://www.northernlight.com


Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Safe signals and perl 6

2001-03-23 Thread Dan Sugalski

Generally speaking, signals will be treated as generic async events in perl 
6, since that's what they are. (The ones that aren't, like SIGBUS, really 
aren't things that perl code can catch...) They're going to be treated 
pretty much like any other event, or so the plan is at least.
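
A loose sketch of what "signals as generic async events" could look like
(the queue, event shape, and dispatch table here are invented purely for
illustration): the handler only enqueues an event, and the main loop
dispatches signal events exactly like any other event.

    use strict;
    use warnings;

    my @queue;                                 # one queue for every kind of event
    $SIG{INT} = sub { push @queue, { type => 'signal', name => 'INT' } };

    my %dispatch = (
        signal => sub { print "signal event: $_[0]{name}\n" },
        timer  => sub { print "timer event\n" },
    );

    my $next_tick = time + 2;
    while (1) {
        if (time >= $next_tick) {              # a non-signal event source
            push @queue, { type => 'timer' };
            $next_tick = time + 2;
        }
        while (my $event = shift @queue) {     # dispatch everything uniformly
            $dispatch{ $event->{type} }->($event);
        }
        sleep 1;                               # stand-in for real work
    }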

Uri's working on an event handling PDD for perl 6 IIRC, so when that comes 
we can work from there.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: Unicode handling

2001-03-23 Thread Dan Sugalski

At 02:06 PM 3/23/2001 -0600, Jarkko Hietaniemi wrote:
>On Fri, Mar 23, 2001 at 02:50:05PM -0500, Dan Sugalski wrote:
> > At 02:27 PM 3/23/2001 -0500, Uri Guttman wrote:
> > > > "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes:
> > >
> > >   DS> U doesn't really signal "glyph" to me, but we are sort of limited
> > >   DS> in what we have left. We still need a zero-width assertion for
> > >   DS> glyph boundary within regexes themselves.
> > >
> > >how about \C? it doesn't seem to be taken and would mean char boundary 
> (not
> > >exactly a glyph but close enough).
> >
> > That's got the unfortunate mental association with C's char for lots of
> > folks, and I know I'd probably get it stuck to codepoint rather than glyph
> > if I didn't use it much.
>
>*cough* \C *is* taken.
>
> > >also \U has a meaning in double quotish strings.
>
>"\Uindeed."

I'm half tempted, since this is a Unicode-only feature, to use a non-ASCII 
character.

\SMILEY FACE, perhaps?

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: Unicode handling

2001-03-23 Thread Dan Sugalski

At 08:14 PM 3/23/2001 +, Nicholas Clark wrote:
>On Fri, Mar 23, 2001 at 03:08:35PM -0500, Dan Sugalski wrote:
> > I'm half tempted, since this is a Unicode-only feature, to use a non-ASCII
> > character.
> >
> > \SMILEY FACE, perhaps?
>
>that makes it kind of hard to edit perl scripts that use this feature on
>any good old fashioned 8 bit xterm.
>Let alone some crufty 7 bit serial login.
>I think it would be a bad thing to effectively mandate that to use certain
>features you had to use a Unicode aware editing system

Point. Never mind--lots of folks with non-Unicode aware terminals and 
editors will be writing code that handles unicode data.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: Unicode handling

2001-03-23 Thread Hong Zhang

> > >We need the character equivalence construct, such as [[=a=]], which
> > >matches "a", "A ACUTE".
> >
> > Yeah, we really need a big list of these. PDD anyone?
> >
> 
> But surely this is a locale issue, and not an encoding one?  Not every 
> language recognizes the same character equivalences.

Let me clarify it. The "character equivalence", assuming [[~a~]] syntax,
means matching a sequence of a single letter 'a' followed by any number of
combining characters. I believe we can handle this without considering
locale. Whether it is still useful is up for discussion. At least it is
trivial to implement.
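
In Perl 5 regex terms that is roughly the following (the sample string and
the use of \p{M}/\P{M} are just one way to illustrate it):

    my $str = "a\x{301}bc";                    # 'a' + COMBINING ACUTE ACCENT, then "bc"

    if ($str =~ /\A(a\p{M}*)/) {               # the [[~a~]] idea: base 'a' plus marks
        printf "matched %d codepoints as one 'a'\n", length $1;      # 2
    }

    my @graphemes = $str =~ /(\P{M}\p{M}*)/g;  # split into base+marks chunks
    printf "%d graphemes in a %d-codepoint string\n",
           scalar @graphemes, length $str;     # 3 graphemes, 4 codepoints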

Hong




Re: Distributive -> and indirect slices

2001-03-23 Thread John Porter

Simon Cozens wrote:
> Better is to solve the general problem, and have all
> operators overloadable even on non-objects, so the user
> can define how this sort of thing works.

Even better is to let the user have access to the real
objects by which "non-objects", i.e. normal variables,
are implemented.  That is, to remove the "non-object"-ness
of normal variables.

(I'm not disagreeing with Simon, just twisting the idea a
little.)

-- 
John Porter




Re: Unicode handling

2001-03-23 Thread Simon Cozens

On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> Yes, I realize that point 5 may result in someone getting a meaningless 
> Unicode string. Too bad--it is *not* the place of a programming language to 
> enforce validity on data. That's the programmer's job.

But points 4 and 5 do enforce Unicode on everyone. Not that I'm particularly
upset by that idea, but... :) 

open FH, $datafile or die $!; undef $/;
$foo = <FH>;
die "Confusing" if -s $datafile != length $foo;

I'm just not sure it's fair on Old World hackers. Will there be a way to stop
Perl upgrading stuff to Unicode on the way in?

-- 
 i've dreamed in Perl many time, last night i dreamed in Make,
and that just sucks.



Re: Unicode handling

2001-03-23 Thread Dan Sugalski

At 01:07 PM 3/23/2001 -0800, Larry Wall wrote:
>Jarkko Hietaniemi writes:
>: *cough* \C *is* taken.
>:
>: > >also \U has a meaning in double quotish strings.
>:
>: "\Uindeed."
>
>Bear in mind we are redesigning the language.  If there's a botch we
>can think about fixing it.
>
>Though maybe not on -internals...   :-)

Good point. It's enough for us to say the regex engine will do glyph breaks 
and glyph instead of character semantics. We can pass on the language-level 
bits to someone else. (I hear we have someone doing that language thing... :)

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk