Re: Schwartzian Transform

2001-03-22 Thread Dan Brian

Could someone summarize the arguments for such an operator? Doing so, to
me, seems to subtract from the scripting domain something which belongs
there. Teaching the transform in classes is a wonderful way to both
illustrate the power of Perl's map, and more importantly, help programmers
understand the beauty of compact Perl. I'd hate to see that relegated to
the "how-we-used-to-do-it" column in the name of making it easier.

IMO the very quest for a name would be reason enough to not do it.
"map_sort_map"? That begs the question. And since Randal asks that it not
be named after him ... (I heard he filed a trademark on Schwartzian, so
that's out. :)

On 22 Mar 2001, Randal L. Schwartz wrote:

> > "Brent" == Brent Dax <[EMAIL PROTECTED]> writes:
> 
> Brent>   @s = schwartzian(
> 
> Please, if we're going to add an operator, let's not call it schwartzian!
> I have enough trouble already telling people how to spell my name. :)
> 
> Maybe I should have a kid named "Ian", so I can see on a roster some day:
> 
> Schwartz,Ian
> 
> :-)
> 
> -- 
> Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
> <[EMAIL PROTECTED]> http://www.stonehenge.com/merlyn/
> Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
> See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
> 




Re: PDD 4: Internal data types

2001-03-22 Thread Hong Zhang

> > The normalization has something to do with encoding. If you compare two
> > strings with the same encoding, of course you don't have to care about it.
>
> Of course you do. Think about it.

I said "you don't have to". You can use "==" for codepoint comparison, and
something like "Normalizer.compare(a, b)" for lexical comparison, like Java.
It may not be the best solution, but it is doable and acceptable.

> If I'm comparing "(Greek letter lower case alpha with tonos)" with "(Greek
> letter lower case alpha)(+tonos)" I want them to compare equal. One string is
> normalized, the other isn't; how they're encoded is irrelevant, you still have
> to care about normalization. (This is where Perl 5 currently falls over)
>
> Normalization has utterly nothing at all to do with encoding. Nothing.

Please, let's not fight over wording. For most encodings I know of, the concept
of normalization does not even exist. What is your definition of normalization?

> Now, since we have to normalize strings in some cases (like the comparison
> above) when the user hasn't explicitly asked for it, let's not make things
> like length() and substr() dependent on whether or not the string is
> normalized, eh? The *last* thing I want to happen is this:
>
> $a = "(Greek letter lower case alpha with tonos)"
> print length $a; # 1
> if ($a eq "(Greek letter lower case alpha)(+tonos)") {
> # (Which it damned well ought to)
>
> print length $a; # 2! HA! Surprise! $a had to be normalized!
> }

I fully understand this. This is one of the reasons I propose UTF-8 as the
sole encoding. If length() and substr() depend on the string's internal
encoding, are they still useful? Who can handle such a magic length()?

I still believe UTF-8 is the best choice. Random string access is just
not important, at least, to me.

Let's not fight over string encoding. I would like to see some suggestions
about how to handle normalization transparently. Making length()/substr()
depend on encoding/normalization (whatever they are) does not make sense to me.

Hong




Re: Schwartzian Transform

2001-03-22 Thread Dan Brian

> this would have to be a proper module and not a builtin op. there is no
> reason to make this built in.

This was essentially my point with regards to naming this op
"map_sort_map". Just explaining the function of the op negates its
usefulness *as* an op, because of the complexity of extracting the keys in
order, and the subsequent comparisons. Imagine the perldoc entry.




Re: Schwartzian Transform

2001-03-22 Thread John Porter

Zenon Zabinski wrote:
> Personally, I have never used the Schwartzian Transform ...
> so I may not be fully knowledgeable of its usefulness. 
> 
> do you need to understand the 
> intricacies if you can just cut and paste and just change a few 
> variables? 

Not to be harsh, but you probably *do* need to understand
the "intricacies" of the ST if you want to be able to
contribute usefully to the resolution of this issue.

-- 
John Porter




Re: Idea for safe signal handling by a byte code interpreter

2001-03-22 Thread Hong Zhang

Here is some of my experience from the HotSpot port for Linux.

>  I've read, in the glibc info manuals, that a similar situation
>  exists in C programming -- you don't want to do a lot inside the
>  signal handler; just set a flag and return, then check that flag from
>  your main loop, and run a "bottom half".

It is much more limited than what you read. Even sprintf() does not
work well. sprintf() supports "%m", which refers to errno. The errno
is "#define errno *__errno_location()", which uses thread_self().
If you install a signal handler with an alternate signal stack,
sprintf() will crash immediately, even if you use an empty format string.

>  I've looked, a little, (and months ago at that) at the LibREP (ala
>  "sawfish") virtual machine.  It's a pretty good indirect threaded VM
>  that uses techniques pioneered by Forth engines.  It utilizes the GCC
>  ability to take the address of a label to build a jump table indexed
>  by opcode.  Very efficient.

It is not very portable. I don't believe it will be any faster than
switch case.

>  What if, at the C level, you had a signal handler that sets or
>  increments a flag or counter, stuffs a struct with information about
>  the signal's context, then pushes (by "push", I mean "(cons v ls)",
>  not "(append! ls v)" 'whatever ;-) that struct on a stack...

I don't believe there is any way to push anything on the stack inside
signal handler without breaking the interpreter. Remember the signal
context is not useful outside signal handler.

For synchronous signals, such as SIGSEGV, we can use a regular signal
handler or a win32 structured exception handler.

For asynchronous signal handlers, we have to do some magic.
If you don't need the signal context (most of the time), you can use a
generic signal handler:

void perl_signal_handler(int sig) {
    /* this is thread safe, SMP safe, nested signal safe */
    atomic_increment(&signal_table[sig]);
    /* set general flag for all async event */
    async_flag = 1;
}
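
A minimal, compilable sketch of the same flag-and-poll idea follows. It is
not the HotSpot or Perl code: watch_signal(), drain_signals(), pending[] and
MAX_SIGNALS are hypothetical names, and it records a per-signal flag in a
volatile sig_atomic_t instead of calling atomic_increment(), so repeated
deliveries of one signal between two polls are coalesced into a single event.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

#define MAX_SIGNALS 65  /* assumption: large enough for Linux signal numbers */

static volatile sig_atomic_t pending[MAX_SIGNALS];
static volatile sig_atomic_t async_flag;

/* Handler: only async-signal-safe work here, no malloc() and no stdio. */
static void note_signal(int sig)
{
    if (sig > 0 && sig < MAX_SIGNALS) {
        pending[sig] = 1;
        async_flag = 1;
    }
}

/* Install the handler for one signal; returns 0 on success. */
int watch_signal(int sig)
{
    struct sigaction sa;

    assert(sig > 0 && sig < MAX_SIGNALS);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = note_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}

/* "Bottom half": meant to be called from the main loop between opcodes.
   Calls deliver(sig) for each signal seen since the last call (deliver may
   be NULL) and returns how many distinct signals were seen. */
int drain_signals(void (*deliver)(int sig))
{
    int seen = 0;
    int sig;

    if (!async_flag)
        return 0;
    async_flag = 0;
    for (sig = 1; sig < MAX_SIGNALS; sig++) {
        if (pending[sig]) {
            pending[sig] = 0;
            seen++;
            if (deliver)
                deliver(sig);
        }
    }
    return seen;
}
```

The cheap test of async_flag is what the main loop pays on every iteration;
the scan over pending[] only happens after a signal has actually arrived.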

If you really need the signal context, you have to use a dedicated thread:

void* thread_function(void* arg) {
    /* sigwaitinfo() returns the signal number, or -1 on error */
    while (sigwaitinfo(&sigset, &siginfo) > 0) {
        /* handle signal here */
    }
    /* something wrong */
    return NULL;
}
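
To make the sigwaitinfo() approach concrete, here is a small sketch that
blocks a signal and then collects it, together with its siginfo_t, from
ordinary code rather than from a handler. The helper names block_signal()
and wait_for_signal() are hypothetical. For brevity it is single-threaded
and uses sigprocmask(); a threaded interpreter would presumably block the
signals with pthread_sigmask() before creating any threads and run the wait
loop in the dedicated thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Block `sig` so that it stays pending instead of interrupting normal
   code. Returns 0 on success. */
int block_signal(int sig)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, sig);
    return sigprocmask(SIG_BLOCK, &set, NULL);
}

/* Wait until `sig` is pending, then copy its context into *info.
   Returns the signal number, or -1 on error. */
int wait_for_signal(int sig, siginfo_t *info)
{
    sigset_t set;

    assert(info != NULL);
    sigemptyset(&set);
    sigaddset(&set, sig);
    return sigwaitinfo(&set, info);
}
```

Because the signal is consumed synchronously, the code that inspects the
siginfo_t is free to allocate memory or call stdio, which a handler is not.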

Hong




Re: Idea for safe signal handling by a byte code interpreter

2001-03-22 Thread Karl M. Hegbloom

> "Hong" == Hong Zhang <[EMAIL PROTECTED]> writes:

>> What if, at the C level, you had a signal handler that sets or
>> increments a flag or counter, stuffs a struct with information about
>> the signal's context, then pushes (by "push", I mean "(cons v ls)",
>> not "(append! ls v)" 'whatever ;-) that struct on a stack...

Hong> I don't believe there is any way to push anything on the stack inside
Hong> signal handler without breaking the interpreter. Remember the signal
Hong> context is not useful outside signal handler.

 I don't mean "the stack", but "a stack"; one created just for this purpose.

 I am a beginner, and likely very naive.

-- 
mailto: (Karl M. Hegbloom) [EMAIL PROTECTED]
http://www.microsharp.com
phone://USA/WA/360-260-2066



Re: Schwartzian Transform

2001-03-22 Thread Randal L. Schwartz

> "Dan" == Dan Brian <[EMAIL PROTECTED]> writes:

Dan> IMO the very quest for a name would be reason enough to not do it.
Dan> "map_sort_map"? That begs the question. And since Randal asks that it not
Dan> be named after him ... (I heard he filed a trademark on Schwartzian, so
Dan> that's out. :)

It's just that it means nothing, except to those who probably already
know how to open-code the map-sort-map.  So there's no gain.  Maybe we
could just teach sort how to do this neater, so there's no new
operator.  Like maybe

   sort { $a/$b expression } { transforming expression, glued with $_ } @list

so $a->[0] is guaranteed to be the original element, and the list-return
value of the second block becomes $a->[1]... $a->[$#$a].

So, to sort case insensitive (bad example :):

@sorted = sort { $a->[1] cmp $b->[1] } { uc } @list;

or to sort on GCOS and then username of password lines:

@sorted = sort { $a->[5] cmp $b->[5] or $a->[1] cmp $b->[1] }
  { split /:/ } `cat /etc/passwd`;

That captures the canonical ST pretty well, where $a->[0] is always
the original element.
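
For readers who have not met the idiom, the decorate / sort / undecorate
steps can be sketched in C, using the case-insensitive example. This is only
an illustration of the technique, not a proposal for how a builtin would be
implemented; struct decorated and sort_by_upper() are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct decorated {
    const char *orig;   /* plays the role of $a->[0] */
    char *key;          /* plays the role of $a->[1] */
};

static int by_key(const void *a, const void *b)
{
    const struct decorated *x = a, *y = b;
    return strcmp(x->key, y->key);
}

/* Sort `items` in place, ignoring case. Returns 0 on success, or -1 if
   memory runs out (items are then left untouched). */
int sort_by_upper(const char **items, size_t n)
{
    struct decorated *tmp;
    size_t i, j;
    int rc = -1;

    assert(items != NULL || n == 0);
    tmp = calloc(n ? n : 1, sizeof *tmp);
    if (tmp == NULL)
        return -1;

    /* Decorate: compute each key exactly once. */
    for (i = 0; i < n; i++) {
        size_t len = strlen(items[i]);
        tmp[i].orig = items[i];
        tmp[i].key = malloc(len + 1);
        if (tmp[i].key == NULL)
            goto done;
        for (j = 0; j <= len; j++)   /* <= also copies the terminator */
            tmp[i].key[j] = (char)toupper((unsigned char)items[i][j]);
    }

    /* Sort on the cached keys only. */
    qsort(tmp, n, sizeof *tmp, by_key);

    /* Undecorate: keep just the original elements. */
    for (i = 0; i < n; i++)
        items[i] = tmp[i].orig;
    rc = 0;

done:
    for (i = 0; i < n; i++)
        free(tmp[i].key);            /* unfilled slots hold NULL */
    free(tmp);
    return rc;
}
```

The point of the idiom is visible in by_key(): the comparison function never
recomputes the key, however many times the sort calls it.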

Maybe we can call this operator "schwart". :)




Re: Idea for safe signal handling by a byte code interpreter

2001-03-22 Thread Keisuke Nishida

At Thu, 22 Mar 2001 13:37:29 -0800,
John Harper wrote:
> 
> |>  I've looked, a little, (and months ago at that) at the LibREP (ala
> |>  "sawfish") virtual machine.  It's a pretty good indirect threaded VM
> |>  that uses techniques pioneered by Forth engines.  It utilizes the GCC
> |>  ability to take the address of a label to build a jump table indexed
> |>  by opcode.  Very efficient.
> |
> |It is not very portable. I don't believe it will be any faster than
> |switch case.
> 
> IIRC it was at least 10% faster; portability is maintained by falling
> back to switch on non-gcc systems

In my experiment with Guile VM (which uses the same technique as rep's),
it is 50% faster:

 [ labels as values ]
 gcil@user> ,time (fib 30)
 $1 = 1346269
 clock utime stime cutime cstime gctime
  3.14  3.02  0.12   0.00   0.00   0.00

 [ switch ]
 gcil@user> ,time (fib 30)
 $1 = 1346269
 clock utime stime cutime cstime gctime
  4.76  4.70  0.06   0.00   0.00   0.00

This is a result of 30,000,000 VM instruction calls and purely
reflects the difference between the two, I believe.

Kei



Re: Schwartzian Transform

2001-03-22 Thread John Porter

Brent Dax wrote:
> 
> Someone else showed a very ugly syntax with an anonymous
> hash, and I was out to prove there was a prettier way to do it.

Do we want prettier?  Or do we want more useful?
Perl is not exactly known for its pretty syntax.


-- 
John Porter




RE: Schwartzian Transform

2001-03-22 Thread Brent Dax


>> "Brent" == Brent Dax <[EMAIL PROTECTED]> writes:

> Brent>   @s = schwartzian(

> Please, if we're going to add an operator, let's not call it schwartzian!
> I have enough trouble already telling people how to spell my name. :)

Which is why my real suggestion was a 'tsort' ('tsort' eq 'transform and
sort') operator.  Someone else showed a very ugly syntax with an anonymous
hash, and I was out to prove there was a prettier way to do it.

BTW, I don't think 'schwartz' as the function name would be a good idea
either.  Then I'd have to write something silly like
schwartz {$a <=> $b} {s/foo/bar/} @ary; #I see your Schwartz is as big as
                                        #mine... --Dark Helmet

> Maybe I should have a kid named "Ian", so I can see on a roster some day:
>
> Schwartz,Ian
>
> :-)

:^)

--Brent Dax
[EMAIL PROTECTED]

This e-mail is a circumvention device as defined by the Digital Millennium
Copyright Act.

#qrpff
s''$/=\2048;while(<>){G=29;R=142;if((@a=unqT="C*",_)[20]&48){D=89;_=unqb24,q
T,@
b=map{ord
qB8,unqb8,qT,_^$a[--D]}@INC;s/...$/1$&/;Q=unqV,qb25,_;H=73;O=$b[4]<<9
|256|$b[3];Q=Q>>8^(P=(E=255)&(Q>>12^Q>>4^Q/8^Q))<<17,O=O>>8^(E&(F=(S=O>>14&7
^O)
^S*8^S<<6))<<9,_=(map{U=_%16orE^=R^=110&(S=(unqT,"\xb\ntd\xbz\x14d")[_/16%8]
);E
^=(72,@z=(64,72,G^=12*(U-2?0:S&17)),H^=_%64?12:0,@z)[_%8]}(16..271))[_]^((D>
>=8
)+=P+(~F&E))for@a[128..$#a]}print+qT,@a}';s/[D-HO-U_]/\$$&/g;s/q/pack+/g;eva
l




Re: Unicode handling

2001-03-22 Thread Hong Zhang

> 6) There will be a glyph boundary/non-glyph boundary pair of regex
> characters to match the word/non-word boundary ones we already have. (While
> I'd personally like \g and \G, that won't work as \G is already taken)
>
> I also realize that the decomposition flag on regexes would mean that
> s/A/B/D would turn A ACUTE to B ACUTE, which is meaningless. See the
> previous paragraph.

I recommend using a 'u' flag, which indicates that all operations are
performed against Unicode graphemes/glyphs. By default, regexes operate on
code points. We also need a character equivalence construct, such as [[=a=]],
which matches "a", "A ACUTE".

Hong




Re: Idea for safe signal handling by a byte code interpreter

2001-03-22 Thread Hong Zhang

> >> What if, at the C level, you had a signal handler that sets or
> >> increments a flag or counter, stuffs a struct with information about
> >> the signal's context, then pushes (by "push", I mean "(cons v ls)",
> >> not "(append! ls v)" 'whatever ;-) that struct on a stack...
>
> Hong> I don't believe there is any way to push anything on the stack inside
> Hong> signal handler without breaking the interpreter. Remember the signal
> Hong> context is not useful outside signal handler.
>
>  I don't mean "the stack", but "a stack"; one created just for this purpose.

"A stack" can still overflow too easily, and it will be difficult to manage
in a threaded environment; malloc() is not allowed inside a signal handler.
A simple signal count will be much easier to deal with.

I tried to give a concrete solution here. I have used this solution in the
HotSpot Java virtual machine for Linux, and it works fine.

Hong




Re: Schwartzian Transform

2001-03-22 Thread Randal L. Schwartz

> "Brent" == Brent Dax <[EMAIL PROTECTED]> writes:

Brent>   @s = schwartzian(

Please, if we're going to add an operator, let's not call it schwartzian!
I have enough trouble already telling people how to spell my name. :)

Maybe I should have a kid named "Ian", so I can see on a roster some day:

Schwartz,Ian

:-)




Re: PDD 4: Internal data types

2001-03-22 Thread Buddha Buck

At 11:14 AM 03-22-2001 -0800, Hong Zhang wrote:

>Please, let's not fight over wording. For most encodings I know of, the concept
>of normalization does not even exist. What is your definition of normalization?

To me, the usual definition of "normalization" is conversion of something 
into a standard form, especially when there are multiple equivalent forms 
it could be in.

Since there are multiple ways within Unicode to express a single character 
that are considered (by Unicode) to be identical, conversion into a single 
common form is necessary for comparison purposes.

Example:  The sequence of Unicode code points 006E 0061 0069 0308 0076 0065 
and the sequence 006E 0061 00EF 0076 0065 both represent the same string in 
Unicode (the English word "naive", with a diaeresis over the i).  Both 
represent 5-character strings, and both are supposed to compare 
identically.  However, they use a different sequence of code points to 
represent one particular character: the 'i' with a diaeresis: 0069 0308 
versus 00EF.

If we have $naive5 and $naive6 be variables containing the two example 
strings, what do we want as the value of the following expressions?

   $naive5 eq $naive6;
   length($naive5);
   length($naive6);
and so forth.

As far as my very limited understanding of the Unicode standard goes, they 
should compare equal, and both have a length of 5.  But their encoded byte 
sequences may not be identical.

>I fully understand this. This is one of the reasons I propose UTF-8 as the
>sole encoding. If length() and substr() depend on the string's internal
>encoding, are they still useful? Who can handle such a magic length()?

UTF-8 encoding doesn't fix the above problem.  UTF-8 would still encode the 
two strings differently, because they have different code point 
sequences.  For that matter, so would any of the other encoding 
suggestions.   As such, for the above problem, encoding is pretty much a 
non-issue.
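
As a concrete check of that claim, the sketch below encodes both code point
sequences as UTF-8 by hand. The function utf8_encode() and the array names
are hypothetical, and the encoder deliberately handles only code points
below U+10000, which is all this example needs.

```c
#include <assert.h>
#include <stddef.h>

/* Encode code points below U+10000 as UTF-8 into `out` and return the
   number of bytes written. `out` needs room for 3 bytes per code point. */
size_t utf8_encode(const unsigned *cps, size_t n, unsigned char *out)
{
    size_t len = 0, i;

    for (i = 0; i < n; i++) {
        unsigned cp = cps[i];
        assert(cp < 0x10000u);
        if (cp < 0x80) {
            out[len++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[len++] = (unsigned char)(0xC0 | (cp >> 6));
            out[len++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[len++] = (unsigned char)(0xE0 | (cp >> 12));
            out[len++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[len++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return len;
}

/* The two spellings from the example: i + COMBINING DIAERESIS versus the
   precomposed U+00EF. */
static const unsigned naive6[] = { 0x6E, 0x61, 0x69, 0x308, 0x76, 0x65 };
static const unsigned naive5[] = { 0x6E, 0x61, 0xEF, 0x76, 0x65 };
```

Encoding naive6 gives 7 bytes (6E 61 69 CC 88 76 65) and naive5 gives 6
bytes (6E 61 C3 AF 76 65), so a byte-wise comparison of the UTF-8 forms
still reports the two strings as different.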




Idea for safe signal handling by a byte code interpreter

2001-03-22 Thread Karl M. Hegbloom


 I've not researched this at all... perhaps it's a "known" way of
 doing things and there is research writing out there already, etc...
 I've not even looked at this point.  I have about 30 minutes to
 outline this and bounce it off of you all this morning. 8-)

 I was reading Lincoln D. Stein's "Network Programming with Perl" (the
 one with that very interesting and nicely done Yggdrasil on the
 cover), and got into the part about signal handling.  He says that
 it's not safe to do a lot of stuff inside a signal handler; that it
 can even cause Perl to crash.

 I've read, in the glibc info manuals, that a similar situation
 exists in C programming -- you don't want to do a lot inside the
 signal handler; just set a flag and return, then check that flag from
 your main loop, and run a "bottom half".

 I've looked, a little, (and months ago at that) at the LibREP (ala
 "sawfish") virtual machine.  It's a pretty good indirect threaded VM
 that uses techniques pioneered by Forth engines.  It utilizes the GCC
 ability to take the address of a label to build a jump table indexed
 by opcode.  Very efficient.

 What if, at the C level, you had a signal handler that sets or
 increments a flag or counter, stuffs a struct with information about
 the signal's context, then pushes (by "push", I mean "(cons v ls)",
 not "(append! ls v)" 'whatever ;-) that struct on a stack...

 When, in your language, you do the equivalent of Perl:

 $SIG{' (lambda (v) (do-something-with v)))
 (else 'whatever))

-- 
mailto: (Karl M. Hegbloom) [EMAIL PROTECTED]
http://www.microsharp.com
phone://USA/WA/360-260-2066



Re: PDD 4: Internal data types

2001-03-22 Thread Simon Cozens

On Thu, Mar 22, 2001 at 11:14:53AM -0800, Hong Zhang wrote:
> Please, let's not fight over wording. For most encodings I know of, the
> concept of normalization does not even exist.

*boggle*. I don't think we're talking about the same Unicode.

> What is your definition of normalization?
 
Well, either canonical or compatibility decomposition, followed by optional
canonical composition. (I'm expecting us to use normalisation form C, which is
canon. decomp + canon. comp.)

Again, decomposition and composition are thoroughly and utterly
encoding-independent.

UTR #15 and the Unicode Standard cover this. See also the Normalization FAQ at
http://www.unicode.org/unicode/faq/normalization.html

> I still believe UTF-8 is the best choice. Random string access is just
> not important, at least, to me.

Fine, it may not be important to you and, if it makes you happy, UTF8 encoded
is supported in Dan's variable type PDD. But I predict that random string
access is a *huge* part of Perl's operation.

-- 
As the saying goes, if you give a man a fish, he eats for a day. If you
teach him to grep for fish, he'll leave you alone all weekend. If you
encourage him to beg for fish, pretty soon c.l.p.misc will smell like a
three-week-dead trout. -- Tom Phoenix, c.l.p.misc.



Re: Unicode handling

2001-03-22 Thread Nicholas Clark

On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> 1) All Unicode data perl does regular expressions against will be in 
> Normalization Form C, except for...
> 2) Regexes tagged to run against a decomposed form will instead be run 
> against data in Normalization Form D. (What the tag is at the perl level is 
> up for grabs. I'd personally choose a D suffix)
> 3) Perl won't otherwise force any normalization on data already in Unicode 
> format.

So if I understand that correctly, running a regexp against a scalar will
cause that scalar to become normalized in a defined way (C or D, depending
on regexp)

> 5) Any character-based call (ord, substr, whatever) will deal with whatever 
> code-points are at the location specified. If the string is LATIN SMALL 
> LETTER A, COMBINING ACUTE ACCENT and someone does a substr($foo, 1, 1) on 
> it, you get back the single character COMBINING ACUTE ACCENT, and an ord 
> would return the value 769.

So if you do (ord, substr, whatever) on a scalar without knowing where it has
been, you have no idea whether you're working on normalised or not.
And in fact the same scalar may become denormalised:

  $bar = substr $foo, 3, 1;
  &frob ($foo);
  $baz = substr $foo, 3, 1;

[so $bar and $baz differ] if someone runs it against a regular expression
[in this case inside the subroutine &frob. Hmm, but currently you can
make changes to parameters as they are pass-by-reference]

  $bar = substr $foo, 3, 1;
  $foo =~ /foo/;# This is not read only in perl6
  $baz = substr $foo, 3, 1;

But this is documented - if you want (ord, substr, whatever) on a string
to make sense, you must explicitly normalize it to the form you want
beforehand, and not use any of the documented-as-normalizing operators on it
without normalizing it again.

And by implication of the above (particularly rule 3), eq compares
codepoints, not normalized forms.
Hmm. So

 $foo =~ /^$bar$/;  # did I need to \Q \E this?

might be true at the same time as

 $foo ne $bar

I'm in two minds about this. It feels like it would be hard to implement the
internals to make eq work on normalized forms without
either

1: causing it to not be read only, hence UTF8 in might not be UTF8 out
   because it had been part of an eq

or

2: having to double buffer almost every scalar, with both the original UTF8
   and a (cached copy) normalized form

but at this point I'll shut up as I expect I'm ignorant of an RFC on how this
works without hitting either of the above problems.

Nicholas Clark



Re: Idea for safe signal handling by a byte code interpreter

2001-03-22 Thread John Harper

Hong Zhang writes:
|>  I've looked, a little, (and months ago at that) at the LibREP (ala
|>  "sawfish") virtual machine.  It's a pretty good indirect threaded VM
|>  that uses techniques pioneered by Forth engines.  It utilizes the GCC
|>  ability to take the address of a label to build a jump table indexed
|>  by opcode.  Very efficient.
|
|It is not very portable. I don't believe it will be any faster than
|switch case.

IIRC it was at least 10% faster; portability is maintained by falling
back to switch on non-gcc systems

One of the problems with switch is that it checks that the switched
value is within the range of the jump table (even if you switch on an
eight bit value, and have exactly 256 cases)

See this paper for more details:

@InProceedings{ertl93,
  author =   "M. Anton Ertl",
  title ="A Portable {Forth} Engine",
  booktitle ="EuroFORTH '93 conference proceedings",
  year = "1993",
  address =  "Mari\'ansk\'e L\'azn\`e (Marienbad)",
  url =  "http://www.complang.tuwien.ac.at/papers/ertl93.ps.Z",
}
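
A toy illustration of the two dispatch styles under discussion, with the
fallback to switch on compilers that lack the extension. It is not taken
from librep or Guile; the opcode set and the names run_switch() and
run_threaded() are made up for the example, and it says nothing about the
relative speed on any particular machine.

```c
#include <assert.h>

enum { OP_INC, OP_DEC, OP_HALT, OP_COUNT };

/* Portable dispatch: one switch inside a loop. */
long run_switch(const unsigned char *code)
{
    long acc = 0;

    for (;;) {
        switch (*code++) {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        default:      assert(!"bad opcode"); return acc;
        }
    }
}

/* Threaded dispatch: every opcode body jumps straight to the next one
   through a table of label addresses (the GCC "labels as values"
   extension). There is no range check, so the bytecode must be valid. */
long run_threaded(const unsigned char *code)
{
#if defined(__GNUC__)
    static void *const table[OP_COUNT] = { &&do_inc, &&do_dec, &&do_halt };
    long acc = 0;

#define NEXT() goto *table[*code++]
    NEXT();
do_inc:  acc++; NEXT();
do_dec:  acc--; NEXT();
do_halt: return acc;
#undef NEXT
#else
    return run_switch(code);    /* fallback without the extension */
#endif
}
```

Both functions return the same result for valid bytecode; the difference is
only in how control gets from the end of one opcode to the start of the next.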

John




Unicode handling

2001-03-22 Thread Dan Sugalski

At the moment, I'm not particularly inclined to argue unicode. Short of 
Larry handing down an edict and invoking Rule #1, the following rules will 
be in effect:

1) All Unicode data perl does regular expressions against will be in 
Normalization Form C, except for...
2) Regexes tagged to run against a decomposed form will instead be run 
against data in Normalization Form D. (What the tag is at the perl level is 
up for grabs. I'd personally choose a D suffix)
3) Perl won't otherwise force any normalization on data already in Unicode 
format.
4) Data converted to Unicode (from ASCII, EBCDIC, one of the JIS encodings, 
or whatever) will be done into NFC.
5) Any character-based call (ord, substr, whatever) will deal with whatever 
code-points are at the location specified. If the string is LATIN SMALL 
LETTER A, COMBINING ACUTE ACCENT and someone does a substr($foo, 1, 1) on 
it, you get back the single character COMBINING ACUTE ACCENT, and an ord 
would return the value 769.
6) There will be a glyph boundary/non-glyph boundary pair of regex 
characters to match the word/non-word boundary ones we already have. (While 
I'd personally like \g and \G, that won't work as \G is already taken)
7) There will be a unicode package shipped standard with nfc() and nfd() 
calls to put things in normalization form C and D, respectively.
8) We will provide an I/O filter to convert into some unicode normalization 
form or other, as well as to convert to and from UTF8, 16, and 32. (Both 
big and little endian for UTF16 and 32, though perl internally will handle 
the 16 and 32 bit integers in whatever's native for the platform)

All of this is completely independent of whether the Unicode data is in 
UTF8, UTF16, UTF32, Morse Code, or trinary.

Yes, I realize that point 5 may result in someone getting a meaningless 
Unicode string. Too bad--it is *not* the place of a programming language to 
enforce validity on data. That's the programmer's job.

I also realize that the decomposition flag on regexes would mean that 
s/A/B/D would turn A ACUTE to B ACUTE, which is meaningless. See the 
previous paragraph.

To be really blunt, unless someone foresees the world coming to an end or 
becoming really annoying because of one of the above rules, I think that 
should put an end to discussions of "whether or not". Discussions of "how" 
are now in order.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: PDD 4: Internal data types

2001-03-22 Thread Simon Cozens

On Tue, Mar 06, 2001 at 01:21:20PM -0800, Hong Zhang wrote:
> The normalization has something to do with encoding. If you compare two
> strings with the same encoding, of course you don't have to care about it.

Of course you do. Think about it.

If I'm comparing "(Greek letter lower case alpha with tonos)" with "(Greek
letter lower case alpha)(+tonos)" I want them to compare equal. One string is
normalized, the other isn't; how they're encoded is irrelevant, you still have
to care about normalization. (This is where Perl 5 currently falls over)

Normalization has utterly nothing at all to do with encoding. Nothing.

Now, since we have to normalize strings in some cases (like the comparison
above) when the user hasn't explicitly asked for it, let's not make things
like length() and substr() dependent on whether or not the string is
normalized, eh? The *last* thing I want to happen is this:

$a = "(Greek letter lower case alpha with tonos)"
print length $a; # 1
if ($a eq "(Greek letter lower case alpha)(+tonos)") {
# (Which it damned well ought to)

print length $a; # 2! HA! Surprise! $a had to be normalized!
}
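
One way to avoid that surprise is to compare normalized copies and leave the
operands alone. The sketch below is a toy: its composition table holds only
two pairs (alpha + U+0301, and i + U+0308), whereas real NFC needs the full
Unicode data and canonical reordering. The names toy_compose() and
toy_equal() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy composition table: base, combining mark, precomposed form. */
static const unsigned pairs[][3] = {
    { 0x03B1, 0x0301, 0x03AC },   /* Greek alpha + acute (tonos) */
    { 0x0069, 0x0308, 0x00EF },   /* i + diaeresis */
};

/* Write a composed copy of `in` to `out` (capacity >= n) and return the
   length of the copy. The input is not modified. */
size_t toy_compose(const unsigned *in, size_t n, unsigned *out)
{
    size_t len = 0, i, k;

    for (i = 0; i < n; i++) {
        unsigned cp = in[i];
        if (i + 1 < n) {
            for (k = 0; k < sizeof pairs / sizeof pairs[0]; k++) {
                if (pairs[k][0] == cp && pairs[k][1] == in[i + 1]) {
                    cp = pairs[k][2];
                    i++;          /* consume the combining mark */
                    break;
                }
            }
        }
        out[len++] = cp;
    }
    return len;
}

/* Equality on composed copies; neither argument changes length. */
int toy_equal(const unsigned *a, size_t na, const unsigned *b, size_t nb)
{
    unsigned ca[64], cb[64];
    size_t la, lb, i;

    assert(na <= 64 && nb <= 64);
    la = toy_compose(a, na, ca);
    lb = toy_compose(b, nb, cb);
    if (la != lb)
        return 0;
    for (i = 0; i < la; i++)
        if (ca[i] != cb[i])
            return 0;
    return 1;
}
```

With this shape, the precomposed string still has length 1 and the
decomposed one still has length 2 after the comparison, at the cost of
building temporary copies on every comparison.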

Please see my Unicode RFCs.

-- 
Hanlon's Razor:
Never attribute to malice that which is adequately explained
by stupidity.



Re: Schwartzian Transform

2001-03-22 Thread Uri Guttman

> "RLS" == Randal L Schwartz <[EMAIL PROTECTED]> writes:

  RLS> sort { $a/$b expression } { transforming expression, glued with $_ } @list

  RLS> so $a->[0] is guaranteed to be the original element, and the list-return
  RLS> value of the second block becomes $a->[1]... $a->[$#$a].

  RLS> So, to sort case insensitive (bad example :):

  RLS> @sorted = sort { $a->[1] cmp $b->[1] } { uc } @list;

  RLS> or to sort on GCOS and then username of password lines:

  RLS> @sorted = sort { $a->[5] cmp $b->[5] or $a->[1] cmp $b->[1] }
  RLS> { split /:/ } `cat /etc/passwd`;

  RLS> That captures the canonical ST pretty well, where $a->[0] is always
  RLS> the original element.

what everyone is missing in this thread, is that the real and sometimes
tricky work is in key extraction and not the map/sort/map. there is no
easy way to describe how to extract multiple keys in the correct order
and then how to do the proper (ascending/descending, string/numeric)
comparisons on them. there are too many possibilities. i explored this
in depth as i designed the Sort::Records module. i had to invent a mini
language to describe all the possible key extractions and
comparisons. think along the lines of getopt::long but more powerful and
you can see the issues. try to describe how to extract this without
using perl code:

(split( ':', $rec->[2]{foo} ))[2]

and sort that numerically in descending order.

now add 2 more keys.
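
to show what that looks like when it is spelled out as code, here is a rough
C sketch of one such key: the third ':'-separated field, compared
numerically in descending order, with the whole record as an ascending
tie-break. the names field_as_long() and sort_by_third_field() are
hypothetical, and this is not how Sort::Records works.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct keyed {
    const char *rec;   /* original record */
    long num;          /* extracted numeric key */
};

/* Return field `index` (0-based) of a ':'-separated record as a long,
   or 0 if the record has too few fields. */
long field_as_long(const char *rec, int index)
{
    while (index > 0 && rec != NULL) {
        rec = strchr(rec, ':');
        if (rec != NULL)
            rec++;              /* step past the separator */
        index--;
    }
    return rec ? strtol(rec, NULL, 10) : 0;
}

static int num_desc_then_rec(const void *a, const void *b)
{
    const struct keyed *x = a, *y = b;

    if (x->num != y->num)
        return (x->num < y->num) ? 1 : -1;   /* numeric, descending */
    return strcmp(x->rec, y->rec);           /* string, ascending */
}

/* Sort records in place; returns 0 on success, -1 if memory runs out. */
int sort_by_third_field(const char **recs, size_t n)
{
    struct keyed *tmp;
    size_t i;

    assert(recs != NULL || n == 0);
    tmp = malloc((n ? n : 1) * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    for (i = 0; i < n; i++) {
        tmp[i].rec = recs[i];
        tmp[i].num = field_as_long(recs[i], 2);   /* extract key once */
    }
    qsort(tmp, n, sizeof *tmp, num_desc_then_rec);
    for (i = 0; i < n; i++)
        recs[i] = tmp[i].rec;
    free(tmp);
    return 0;
}
```

every additional key means another field in the struct, another extraction
function and another branch in the comparator, which is the part a single
builtin would have to describe somehow.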

this would have to be a proper module and not a builtin op. there is no
reason to make this built in.

uri

-- 
Uri Guttman  -  [EMAIL PROTECTED]  --  http://www.sysarch.com
SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting
The Perl Books Page  ---  http://www.sysarch.com/cgi-bin/perl_books
The Best Search Engine on the Net  --  http://www.northernlight.com