Re: Idea for safe signal handling by a byte code interpreter
> "Karl" == Karl M Hegbloom <[EMAIL PROTECTED]> writes: Karl> Then, from strategic points within the VM, just as the Karl> emacsen check for QUIT, you'd check for that signal flag or Karl> counter, and run the signal handlers from a bottom half of Karl> some kind. This way, you know that the interpretter is in a Karl> consistent state and that you're not halfway through Karl> something that ought be more atomic, or whatever. Karl> There should be a way to say that "this is a critical Karl> section; don't call signal handler here" in the scripting Karl> language. As far as I understand (both your point and the Guile code), this is what Guile does already. The signal handler for signal SIG sets got_signal[SIG] to 1, and marks a system async to run at the next safe point. (A system async is simply a thunk that can be marked for asynchronous execution.) Guile checks whether it is safe to run system asyncs in the macros SCM_ALLOW_INTS and SCM_REALLOW_INTS; thus, in the Guile C code, the code between SCM_DEFER_INTS and SCM_ALLOW_INTS, or between SCM_REDEFER_INTS and SCM_REALLOW_INTS, is a critical section as you suggest. Best regards, Neil
Re: Schwartzian Transform
On Thu, Mar 22, 2001 at 11:13:47PM -0500, John Porter wrote:
> Brent Dax wrote:
> > Someone else showed a very ugly syntax with an anonymous
> > hash, and I was out to prove there was a prettier way to do it.
> Do we want prettier? Or do we want more useful?
> Perl is not exactly known for its pretty syntax.

If you have to explicitly specify both the forward and inverse transforms,
then it isn't very useful -- it's nothing more than map/sort/map.  OTOH, if
you only have to specify the forward mapping, it becomes more useful.  Thus,
I think the best syntax is

    tsort({xform}, {compare}, @list)

where the {}s are anon blocks or curried expressions (same thing), xform
specifies the forward mapping (i.e. (lc ^_)), and compare specifies the
comparator (i.e. (^_ cmp ^_)).

This would always (do the equivalent of) create a LoL in the inner map, sort
on the ->[0] elem, and extract the ->[1] elem.  Thus, it might not be as
efficient as a hand-crafted Schwartzian, but it will be at least as efficient
as a naive straight sort (except in pathological cases, like
tsort((^_), (^_<=>^_), @list)).

   -=- James Mastros
--
The most beautiful thing we can experience is the mysterious. It is the
source of all true art and science. He to whom this emotion is a stranger,
who can no longer pause to wonder and stand wrapt in awe, is as good as dead.
   -=- Albert Einstein
AIM: theorbtwo  homepage: http://www.rtweb.net/theorb/
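A Perl 5 sketch of what such a tsort() could do internally.  The name and
argument order are the poster's proposal, and the ^_ currying syntax is
speculative Perl 6, so plain subrefs stand in for the curried blocks here:

    use strict;
    use warnings;

    sub tsort {
        my ($xform, $compare, @list) = @_;
        return map  { $_->[1] }                        # unwrap original element
               sort { $compare->($a->[0], $b->[0]) }   # compare transformed keys
               map  { [ $xform->($_), $_ ] }           # build [key, element] pairs
               @list;
    }

    my @sorted = tsort(sub { lc $_[0] },
                       sub { $_[0] cmp $_[1] },
                       qw(Banana apple Cherry));
    print "@sorted\n";    # apple Banana Cherry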
Re: Unicode handling
On Fri, Mar 23, 2001 at 02:50:05PM -0500, Dan Sugalski wrote: > At 02:27 PM 3/23/2001 -0500, Uri Guttman wrote: > > > "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes: > > > > DS> U doesn't really signal "glyph" to me, but we are sort of limited > > DS> in what we have left. We still need a zero-width assertion for > > DS> glyph boundary within regexes themselves. > > > >how about \C? it doesn't seem to be taken and would mean char boundary (not > >exactly a glyph but close enough). > > That's got the unfortunate mental association with C's char for lots of > folks, and I know I'd probably get it stuck to codepoint rather than glyph > if I didn't use it much. *cough* \C *is* taken. > >also \U has a meaning in double quotish strings. "\Uindeed." > > > >uri > > > >-- > >Uri Guttman - [EMAIL PROTECTED] -- http://www.sysarch.com > >SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting > >The Perl Books Page --- http://www.sysarch.com/cgi-bin/perl_books > >The Best Search Engine on the Net -- http://www.northernlight.com > > > Dan > > --"it's like this"--- > Dan Sugalski even samurai > [EMAIL PROTECTED] have teddy bears and even > teddy bears get drunk -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen
Re: Unicode handling
At 11:26 PM 3/23/2001 +, Dave Mitchell wrote: >Dan Sugalski <[EMAIL PROTECTED]> doodled: > > At 11:09 PM 3/23/2001 +, Simon Cozens wrote: > > >For instance, chr() will produce Unicode codepoints. But you can > pretend that > > >they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. > I hope > > >and suspect there'll be an equivalent of "use bytes" which makes chr(256) > > >either blow up or wrap around. > > > > Actually no it won't. If the string you're doing a chr on is tagged as > > EBCDIC, you'll get the EBCDIC value. Yes, it does mean that this: > > > > chr($foo) == chr($bar); > > > > could evaluate to false if one of the strings is EBCDIC and the other > > isn't. > >Err, perhaps I'm being dumb here - but surely $foo and $bar arent >typed strings, they're just numbers (or strings which match /^\d+$/) ??? D'oh! Too much blood in my caffeine stream. Yeah, I was thinking of ord. chr will emit a character of the type appropriate to the current default string context. The default context will probably be settable at compile time, or be the platform native type, alterable somehow. Probably "use blah;" but that's a language design issue. :) Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
RE: Unicode handling
At 11:05 AM 3/23/2001 -0600, Garrett Goebel wrote: >From: Nicholas Clark [mailto:[EMAIL PROTECTED]] > > > > On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote: > > > 1) All Unicode data perl does regular expressions against > > >will be in Normalization Form C, except for... > > > 2) Regexes tagged to run against a decomposed form will > > >instead be run against data in Normalization Form D. > > > (What the tag is at the perl level is up for grabs. I'd > > > personally choose a D suffix) > > > 3) Perl won't otherwise force any normalization on data > > >already in Unicode format. > > > > So if I understand that correctly, running a regexp against a > > scalar will cause that scalar to become normalized in a > > defined way (C or D, depending on regexp) > >I'm not sure whether to read that as resulting in scalar being normalized, >or if the "data perl does the regular expressions against" would be a >normalized copy of that scalar's value. It could be either way. >Wouldn't normalizing the scalar lose information? I don't know Unicode, >but surely someone must have a use for storing strings in both NFC and >NFD. Is it valid to intermix both forms? Isn't there a need to preserve >the data in its original encoding? I don't like the idea of the language >losing information without the programmer's permission. Whether normalizing loses information seems to depend on your definition of "lose". When you take a Unicode string and put it into either NFC or NFD, the result is equivalent, but not the same. The Unicode standard specifies what characters and character sequences are equivalent. When you're dealing with Unicode data, you're not supposed to care about the actual code points, as far as I can tell. (With the possible exception of general things like "must be NFC" or "must be NFD") Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
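The equivalent-but-not-the-same point can be seen with the CPAN
Unicode::Normalize module (a later addition to Perl's core, used here purely
as an illustration, not as part of the Perl 6 design):

    use Unicode::Normalize qw(NFC NFD);

    my $composed   = "\x{00E9}";       # LATIN SMALL LETTER E WITH ACUTE
    my $decomposed = "e\x{0301}";      # 'e' followed by COMBINING ACUTE ACCENT

    print $composed eq $decomposed ? "equal\n" : "not equal as raw codepoints\n";
    print NFC($decomposed) eq $composed   ? "identical after NFC\n" : "differ\n";
    print NFD($composed)   eq $decomposed ? "identical after NFD\n" : "differ\n";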
RE: Unicode handling
At 01:26 PM 3/23/2001 -0500, NeonEdge wrote:
>Dan Sugalski wrote:
> >If we do, then something as simple as this:
> >
> >    while (<>) {
> >        $count++ if /bar/;
> >        print OUT $_;
> >    }
> >
> >would potentially result in the output file being rather different from the
> >input file. Equivalent, yes, but different. Whether that's bad or not is an
> >open question.
>I don't believe that any scalar defined within the parsed application space
>should be transformed permanently. There shouldn't be any difference between
>the input file and the output file in the above example (it could cause
>issues with non-Perl apps).

This is not unreasonable. I'd prefer not to carry around multiple versions of
strings if it can be avoided, though. That can add up really quickly.

Dan

--"it's like this"---
Dan Sugalski                           even samurai
[EMAIL PROTECTED]                      have teddy bears and even
                                       teddy bears get drunk
Re: Unicode handling
At 11:52 AM 3/23/2001 -0800, Hong Zhang wrote: > > >I recommend to use 'u' flag, which indicates all operations are performed > > >against unicode grapheme/glyph. By default re is performed on codepoint. > > > > U doesn't really signal "glyph" to me, but we are sort of limited in what > > we have left. We still need a zero-width assertion for glyph boundary > > within regexes themselves. > >The 'u' flag means "advanced unicode feature(s)", which includes "always >matching against glyph/grapheme, not codepoint". What it really means is >up to discussion. I think we probably still need "glyph" or "grapheme" >boundary in some cases. Fair enough. I think there are some cases where there's a base/combining pair of codepoints that don't map to a single combined-character code point. Not matching on a glyph boundary could make things really odd, but I'd hate to have the checking code on by default, since that'd slow down the common case where the string in NFC won't have those. > > >We need the character equivalence construct, such as [[=a=]], which > > >matches "a", "A ACUTE". > > > > Yeah, we really need a big list of these. PDD anyone? > >I don't think we need a big list here. The [[=a=]] is part of POSIX 1003.2 >regex syntax, also [[.ch.]]. Perl 5 does not support these syntax. We can >implement in Perl 6. That's a separate issue I think I'll dodge for right now. Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode handling
At 10:48 PM 3/23/2001 +0000, Simon Cozens wrote:
>On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> > Yes, I realize that point 5 may result in someone getting a meaningless
> > Unicode string. Too bad--it is *not* the place of a programming language to
> > enforce validity on data. That's the programmer's job.
>
>But points 4 and 5 do enforce Unicode on everyone. Not that I'm particularly
>upset by that idea, but... :)

Nah, they only apply to data that perl's tagged as Unicode, either because
its input stream is marked that way or because the program explicitly
converted the data. A plain:

    open FOO, "some.file";
    while (<FOO>) { whatever($_) }

probably won't be dealing with Unicode data. (Unless for some reason perl's
been told that all files are Unicode by default.) I expect the default data
types for data that comes from files will be either binary or ASCII for most
systems, EBCDIC on OS/390 systems, and potentially Unicode on Windows and
non-US systems.

Dealing with typed strings might make binmode (and perhaps corresponding
asciimode, unicodemode, or ebcdicmode) more frequently used. Or not.

Dan

--"it's like this"---
Dan Sugalski                           even samurai
[EMAIL PROTECTED]                      have teddy bears and even
                                       teddy bears get drunk
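For comparison, tagging data by its input stream is roughly how things later
surfaced in Perl 5.8's I/O layers; the snippet below is offered only as an
illustration of the idea, not as the Perl 6 interface being designed here:

    use strict;
    use warnings;

    # Data read through this handle is decoded and flagged as Unicode text;
    # a handle opened without a layer keeps handing back raw bytes.
    open my $fh, '<:encoding(UTF-8)', 'some.file' or die "open: $!";
    while (my $line = <$fh>) {
        print length($line), " characters\n";   # counts characters, not bytes
    }
    close $fh;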
Re: Unicode handling
On Friday 23 March 2001 14:48, you wrote:
> In Unicode, there's theoretically no locale. Theoretically...

Well, yes, but Unicode makes no pretenses about encoding the world's
languages - just the various symbols used by the world's languages.

If you want to orient Perl so that it remains(?) data-oriented, even when
processing text, as an independent underlayer to locale processing, then
that's fine, I guess, as long as you aren't forcing *someone's* locale onto
it at that lower layer. If you want to orient it so that it processes the
text as... well, a textual representation of a language, then you'll have to
consider locale issues. At some point.

Okay, now I see Hong's response. Yes, I'm understanding now. Not character
equivalence from a linguistic perspective, but simply
/({base glyph}{combining glyphs}*)/

Okay, I'll go back to lurking.

--
Bryan C. Warnock
[EMAIL PROTECTED]
RE: Unicode handling
Dan Sugalski wrote:
>If we do, then something as simple as this:
>
>    while (<>) {
>        $count++ if /bar/;
>        print OUT $_;
>    }
>
>would potentially result in the output file being rather different from the
>input file. Equivalent, yes, but different. Whether that's bad or not is an
>open question.

I don't believe that any scalar defined within the parsed application space
should be transformed permanently. There shouldn't be any difference between
the input file and the output file in the above example (it could cause
issues with non-Perl apps). I think the rule should be to store normalized
scalars separately from the original and leave the original unaffected.

There are specific cases where it would be OK to normalize the original,
such as error strings and other scalars used internally by Perl. Perhaps the
developer could 'use normalize' to force the scalars to be normalized for
optimization purposes, but Perl shouldn't force normalization.

Grant M.
Re: Unicode handling
At 02:31 PM 3/23/2001 -0500, Bryan C. Warnock wrote: >On Friday 23 March 2001 14:18, Dan Sugalski wrote: > > At 01:30 PM 3/22/2001 -0800, Hong Zhang wrote: > > > > 6) There will be a glyph boundary/non-glyph boundary pair of regex > > > > characters to match the word/non-word boundary ones we already have. > > > > > >(While > > > > > > > I'd personally like \g and \G, that won't work as \G is already taken) > > > > > > > > I also realize that the decomposition flag on regexes would mean that > > > > s/A/B/D would turn A ACUTE to B ACUTE, which is meaningless. See the > > > > previous paragraph. > > > > > >I recommend to use 'u' flag, which indicates all operations are performed > > >against unicode grapheme/glyph. By default re is performed on codepoint. > > > > U doesn't really signal "glyph" to me, but we are sort of limited in what > > we have left. We still need a zero-width assertion for glyph boundary > > within regexes themselves. > > > > >We need the character equivalence construct, such as [[=a=]], which > > >matches "a", "A ACUTE". > > > > Yeah, we really need a big list of these. PDD anyone? > > > >But surely this is a locale issue, and not an encoding one? Not every >language recognizes the same character equivalences. In Unicode, there's theoretically no locale. Theoretically... Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode handling
At 01:30 PM 3/22/2001 -0800, Hong Zhang wrote: > > 6) There will be a glyph boundary/non-glyph boundary pair of regex > > characters to match the word/non-word boundary ones we already have. >(While > > I'd personally like \g and \G, that won't work as \G is already taken) > > > > I also realize that the decomposition flag on regexes would mean that > > s/A/B/D would turn A ACUTE to B ACUTE, which is meaningless. See the > > previous paragraph. > >I recommend to use 'u' flag, which indicates all operations are performed >against unicode grapheme/glyph. By default re is performed on codepoint. U doesn't really signal "glyph" to me, but we are sort of limited in what we have left. We still need a zero-width assertion for glyph boundary within regexes themselves. >We need the character equivalence construct, such as [[=a=]], which >matches "a", "A ACUTE". Yeah, we really need a big list of these. PDD anyone? Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
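As a point of reference, recent Perl 5 releases already expose one piece of
this: \X in a regex matches a whole combining-character sequence, which is
roughly the grapheme/glyph unit under discussion.  A small illustration (not
the proposed Perl 6 'u' flag or boundary assertion):

    use strict;
    use warnings;

    my $s = "a\x{0301}bc";          # 'a' + COMBINING ACUTE ACCENT, then 'b', 'c'

    my @codepoints = split //, $s;  # 4 codepoints
    my @graphemes  = $s =~ /(\X)/g; # 3 grapheme-ish units

    printf "%d codepoints, %d graphemes\n",
           scalar @codepoints, scalar @graphemes;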
RE: Unicode handling
From: Nicholas Clark [mailto:[EMAIL PROTECTED]]
>
> On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> > 1) All Unicode data perl does regular expressions against
> >    will be in Normalization Form C, except for...
> > 2) Regexes tagged to run against a decomposed form will
> >    instead be run against data in Normalization Form D.
> >    (What the tag is at the perl level is up for grabs. I'd
> >    personally choose a D suffix)
> > 3) Perl won't otherwise force any normalization on data
> >    already in Unicode format.
>
> So if I understand that correctly, running a regexp against a
> scalar will cause that scalar to become normalized in a
> defined way (C or D, depending on regexp)

I'm not sure whether to read that as resulting in the scalar being
normalized, or if the "data perl does the regular expressions against" would
be a normalized copy of that scalar's value.

Wouldn't normalizing the scalar lose information? I don't know Unicode, but
surely someone must have a use for storing strings in both NFC and NFD. Is
it valid to intermix both forms? Isn't there a need to preserve the data in
its original encoding? I don't like the idea of the language losing
information without the programmer's permission.

> > 5) Any character-based call (ord, substr, whatever) will
> >    deal with whatever code-points are at the location
> >    specified. If the string is LATIN SMALL LETTER A,
> >    COMBINING ACUTE ACCENT and someone does a
> >    substr($foo, 1, 1) on it, you get back the single
> >    character COMBINING ACUTE ACCENT, and an ord would
> >    return the value 769.
>
> So if you do (ord, substr, whatever) on a scalar without
> knowing where it has been, you have no idea whether you're
> working on normalised or not. And in fact the same scalar
> may become denormalised:
>
> $bar = substr $foo, 3, 1;
> &frob ($foo);
> $baz = substr $foo, 3, 1;

Hmm... if I put on my "everything is an object in Perl 6" blinders, wouldn't
that be:

    $foo : utf8d = "timtowtdi";
    $bar : utf8  = substr $foo, 3, 1;
    $baz : char8 = substr($foo,0,3) . substr($bar,3,3) . "tdi";

o $foo would be normalized to NFD
o substr would know what $foo is and operate on it per NFD
o $bar would be normalized to NFC
o $baz would work with byte characters indeterminately

i.e., substr, ord, length would DWIM based on what type of string it is.

> $foo =~ /^$bar$/;    # did I need to \Q \E this?
>
> might be true at the same time as
>
> $foo ne $bar

Have the match operate on a copy of $bar normalized to whatever $foo is.

> I'm in two minds about this. It feels like it would be hard
> to implement the internals to make eq work on normalized
> forms without either
>
> 1: causing it to not be read only, hence UTF8 in might not be UTF8 out
>    because it had been part of an eq
>
> or
>
> 2: having to double buffer almost every scalar, with both the original UTF8
>    and a (cached copy) normalized form

I really don't want to see #1. Do my naive suggestions get around #2?

Garrett
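Rule 5's behavior can be seen in current Perl 5, where substr and ord
already work on individual codepoints (COMBINING ACUTE ACCENT is U+0301,
i.e. decimal 769); this is only an illustration of the rule, not new Perl 6
behavior:

    use strict;
    use warnings;

    my $foo = "a\x{0301}";            # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

    my $second = substr($foo, 1, 1);  # just the combining accent, not "a-acute"
    printf "U+%04X (%d)\n", ord($second), ord($second);   # U+0301 (769)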
Re: Unicode handling
On Friday 23 March 2001 14:18, Dan Sugalski wrote: > At 01:30 PM 3/22/2001 -0800, Hong Zhang wrote: > > > 6) There will be a glyph boundary/non-glyph boundary pair of regex > > > characters to match the word/non-word boundary ones we already have. > > > >(While > > > > > I'd personally like \g and \G, that won't work as \G is already taken) > > > > > > I also realize that the decomposition flag on regexes would mean that > > > s/A/B/D would turn A ACUTE to B ACUTE, which is meaningless. See the > > > previous paragraph. > > > >I recommend to use 'u' flag, which indicates all operations are performed > >against unicode grapheme/glyph. By default re is performed on codepoint. > > U doesn't really signal "glyph" to me, but we are sort of limited in what > we have left. We still need a zero-width assertion for glyph boundary > within regexes themselves. > > >We need the character equivalence construct, such as [[=a=]], which > >matches "a", "A ACUTE". > > Yeah, we really need a big list of these. PDD anyone? > But surely this is a locale issue, and not an encoding one? Not every language recognizes the same character equivalences. -- Bryan C. Warnock [EMAIL PROTECTED]
Re: Unicode handling
> "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes: DS> U doesn't really signal "glyph" to me, but we are sort of limited DS> in what we have left. We still need a zero-width assertion for DS> glyph boundary within regexes themselves. how about \C? it doesn't seem to be taken and would mean char boundary (not exactly a glyph but close enough). also \U has a meaning in double quotish strings. uri -- Uri Guttman - [EMAIL PROTECTED] -- http://www.sysarch.com SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting The Perl Books Page --- http://www.sysarch.com/cgi-bin/perl_books The Best Search Engine on the Net -- http://www.northernlight.com
Re: Unicode handling
On Fri, Mar 23, 2001 at 03:15:41PM -0800, Brad Hughes wrote: > Simon Cozens wrote: > [...] > > I'm just not sure it's fair on Old World hackers. Will there be a way to stop > > Perl upgrading stuff to Unicode on the way in? > > and I'm probably not the only Old World hacker that would > prefer a build option to simply eliminate Unicode support altogether... Eh, no, read it again. (I had to.) It won't interfere with Old World hackers at all. Data coming in won't be implicitly converted to Unicode, so programs under Perl 6 should see legacy data the same way as they do under 5.6.0; if you pretend that Unicode isn't there, it won't bother you. For instance, chr() will produce Unicode codepoints. But you can pretend that they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. I hope and suspect there'll be an equivalent of "use bytes" which makes chr(256) either blow up or wrap around. So we're not exactly forcing Unicode down people's throats. (Damn.) -- "Even more amazing was the realization that God has Internet access. I wonder if He has a full newsfeed?" (By Matt Welsh)
Re: Unicode handling
On Fri, Mar 23, 2001 at 06:16:58PM -0500, Dan Sugalski wrote: > At 11:09 PM 3/23/2001 +, Simon Cozens wrote: > >For instance, chr() will produce Unicode codepoints. But you can pretend that > >they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. I hope > >and suspect there'll be an equivalent of "use bytes" which makes chr(256) > >either blow up or wrap around. > > Actually no it won't. If the string you're doing a chr on is tagged as > EBCDIC, you'll get the EBCDIC value. Yes, it does mean that this: > > chr($foo) == chr($bar); > > could evaluate to false if one of the strings is EBCDIC and the other > isn't. Odd but I don't see a good reason not to. Otherwise we'd want to > force everything to Unicode, and then what do we do if one of the strings > is plain binary data? Are you thinking of ord rather than chr? I can't seem to make the above make sense otherwise. chr takes a number, not a string as its argument... Your initial description of character set handling didn't mention that different strings can be tagged as having different encodings, and didn't cover the implications of this. Could you give a list of the specific occasions when the encoding of a string would be visible to a programmer? - Damien
Re: Unicode handling
At 10:56 AM 3/23/2001 -0800, Damien Neil wrote:
>On Fri, Mar 23, 2001 at 12:38:04PM -0500, Dan Sugalski wrote:
> >    while (<>) {
> >        $count++ if /bar/;
> >        print OUT $_;
> >    }
>
>I would find it surprising for this to have different output
>than input.  Other people's mileage may vary.

I can understand that.

>In general, however, I think I would prefer to be required to
>explicitly normalize my data (via a function, pragma, or option
>set on a filehandle) than have data change unexpectedly behind
>my back.

But since the data is equivalent, and more importantly Unicode, it's not
supposed to matter to you. Whether it *does* or not is a separate
question... :)

Dan

--"it's like this"---
Dan Sugalski                           even samurai
[EMAIL PROTECTED]                      have teddy bears and even
                                       teddy bears get drunk
Re: Unicode handling
On Fri, Mar 23, 2001 at 06:31:13PM -0500, Dan Sugalski wrote: > >Err, perhaps I'm being dumb here - but surely $foo and $bar arent > >typed strings, they're just numbers (or strings which match /^\d+$/) ??? > > D'oh! Too much blood in my caffeine stream. Yeah, I was thinking of ord. > > chr will emit a character of the type appropriate to the current default > string context. The default context will probably be settable at compile > time, or be the platform native type, alterable somehow. Probably "use > blah;" but that's a language design issue. :) Ah, this answers the puzzlement in the message I just sent. :> So the results of ord are dependent on a global setting for "current character set" or some such, not on the encoding of the string that is passed to it? - Damien
Re: Unicode handling
> >I recommend to use 'u' flag, which indicates all operations are performed
> >against unicode grapheme/glyph. By default re is performed on codepoint.
>
> U doesn't really signal "glyph" to me, but we are sort of limited in what
> we have left. We still need a zero-width assertion for glyph boundary
> within regexes themselves.

The 'u' flag means "advanced unicode feature(s)", which includes "always
matching against glyph/grapheme, not codepoint". What it really means is up
for discussion. I think we probably still need a "glyph" or "grapheme"
boundary in some cases.

> >We need the character equivalence construct, such as [[=a=]], which
> >matches "a", "A ACUTE".
>
> Yeah, we really need a big list of these. PDD anyone?

I don't think we need a big list here. [[=a=]] is part of POSIX 1003.2 regex
syntax, as is [[.ch.]]. Perl 5 does not support this syntax; we can
implement it in Perl 6. For even more advanced equivalence, we can offload
the job to a collation library.

Hong
Re: Distributive -> and indirect slices
Simon Cozens wrote:
>
> On Mon, Mar 19, 2001 at 08:30:31AM -0800, Peter Scott wrote:
> > Seen http://dev.perl.org/rfc/82.pod?
>
> I hadn't. I'm surprised it didn't give the PDL people screaming fits.
> But no, I wouldn't do it like that. It has:
>
>     @b = (1,2,3);
>     @c = (2,4,6);
>     @d = @b * @c; # Returns (2,8,18)
>
> Where I would have @d = (2,4,6,4,8,12,6,12,18);

The first example above is called the scalar product of two vectors in APL,
and can be generalised for all arithmetic operators. The second example is
called the outer or cross-product of two vectors.

> However, this isn't great language design; it's applying a specific solution
> to a specific problem. Better is to solve the general problem, and have all
> operators overloadable even on non-objects, so the user can define how this
> sort of thing works.

Precisely. The operators for arrays (lists) should be defined orthogonally
and completely in a mathematically consistent manner to be of general
purpose use. Here is one example of such a definition, freely adapted from
the APL syntax:

1. scalar array operators:  @a ? @b

   For any numerical operator '?', two compatible arrays can be combined
   with that operator, producing a new array as follows:

       @r = @a ? @b    where $r[i] = $a[i] ? $b[i]

   complexity: if @a and @b are not the same length, the shorter is extended
   with the appropriate scalar value, i.e. 0 for additive and 1 for
   multiplicative operators.

   example:

       @a = (1,2,3);
       @b = (2,4,6);
       @r = @a + @b;    # @r = (3,6,9)

   As well, mixed array and scalar operands are allowed:

       @a + 10    produces (11,12,13)
       20 - @b    produces (18,16,14)

2. reduction:  ?/@a

   Reduction introduces new digraphs that end in '/' and are preceded by a
   numerical operator, i.e.  +/  -/  */  //

   Any numerical array can be reduced to a scalar value over a given
   numerical operator '?' as follows:

       $r = ?/@a    where $r = $a[0] ? $a[1] ? $a[2] ...

   example:

       @a = (2,4,6);
       $r = */@a;    # $r = 48

3. inner product:  @a ?/! @b

   Inner product introduces new trigraphs with '/' as the middle character,
   surrounded by numerical operators, i.e.  +/*  */-  -/*  -/-  etc.

   For any numerical operators '?' and '!', two compatible arrays can be
   combined with those operators, producing a scalar inner product, as
   follows:

       $r = @a ?/! @b    where $r = ?/ (@a ! @b)
       i.e. $r = ($a[0] ! $b[0]) ? ($a[1] ! $b[1]) ? ...

   example:

       @a = (1,2,3);
       @b = (2,4,6);
       $r = @a +/* @b;    # $r = +/(2,8,18) = 28

4. outer product:  @a @? @b

   Outer product introduces new digraphs that start with '@' and end with a
   numerical operator, i.e.  @+  @-  @*  @/

   For any numerical operator '?', any two arrays can be combined into the
   cross-product with that operator, producing a new array containing the
   outer product. Each row of the outer product is concatenated to the next
   to produce the result, best illustrated by the following example:

       @a = (1,2,3);
       @b = (2,4,6);
       @r = @a @* @b;

       @r = (2,4,6), (4,8,12), (6,12,18)
          = (2,4,6,4,8,12,6,12,18)

Of course, there are probably syntactic nightmares in introducing the
digraphs and trigraphs mentioned above into perl.

--
Rick Welykochy || Praxis Services Pty Limited
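For readers who want to play with these semantics now, the element-wise
operator, the reduction, the inner product, and the outer product can all be
written in present-day Perl 5 with map; this is only a sketch of the
semantics described above, not the proposed digraph/trigraph syntax:

    use strict;
    use warnings;

    my @a = (1, 2, 3);
    my @b = (2, 4, 6);

    # 1. element-wise ("scalar array") addition: @a + @b
    my @sum = map { $a[$_] + $b[$_] } 0 .. $#a;       # (3, 6, 9)

    # 2. reduction: */@b over the multiplicative operator
    my $product = 1;
    $product *= $_ for @b;                            # 2 * 4 * 6 = 48

    # 3. inner product: @a +/* @b
    my $inner = 0;
    $inner += $a[$_] * $b[$_] for 0 .. $#a;           # 2 + 8 + 18 = 28

    # 4. outer product: @a @* @b, flattened row by row
    my @outer = map { my $x = $_; map { $x * $_ } @b } @a;
    print "@outer\n";    # 2 4 6 4 8 12 6 12 18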
PDD for coding conventions
About a month ago I started working on a PDD for how code should be
commented; some while later Paolo Molaro <[EMAIL PROTECTED]> submitted a
draft PDD ('PDD X') on "Perl API conventions".

This got me thinking that, rather than accumulating lots of micro PDDs, we
should have a single one entitled "coding conventions" that includes
sections on naming and API conventions, how to comment code, etc etc. Then
the FAQs can simply state "before you contribute src code, make sure you
have thoroughly read PDD X".

Provisionally I think it should have the following sections:

* Coding style

  Largely lifted from Porting/patching.pod, eg function names start in
  column 0, indent = 4, etc etc.

* Naming conventions

  How macros, variables (global or otherwise), structs, files, APIs, plus
  anything else you can think of, should be named.
  - based on Paolo's work

* Commenting conventions

  How individual items such as functions, macros etc should be commented,
  plus how larger-scale things (such as src files and implementation
  decisions) should be commented.
  - based on my work

* Portability guidelines

  The basic dos and don'ts of writing portable code, especially with Perl
  in mind - eg whether to assume ANSI C, things not to assume about int
  sizes, and anything else you can think of.
  - someone would need to write this

* Performance guidelines

  The basic dos and don'ts of writing code that runs well on modern
  processors, eg the effect of caches and pipelines (avoid those branches,
  man!), are globals Good or Evil (or Chaotic Neutral...).
  - someone would need to write this

Waddayafink? If people don't object, I'll begin drafting.

* Dave Mitchell, Senior Technical Consultant
* Fretwell-Downing Informatics Ltd, UK.   [EMAIL PROTECTED]
* Tel: +44 114 281 6113.                  The usual disclaimers
*
* Standards (n). Battle insignia or tribal totems
Re: Unicode handling
On Fri, Mar 23, 2001 at 12:38:04PM -0500, Dan Sugalski wrote:
>    while (<>) {
>        $count++ if /bar/;
>        print OUT $_;
>    }

I would find it surprising for this to have different output than input.
Other people's mileage may vary.

In general, however, I think I would prefer to be required to explicitly
normalize my data (via a function, pragma, or option set on a filehandle)
than have data change unexpectedly behind my back.

 - Damien
Re: Unicode handling
At 11:09 PM 3/23/2001 +, Simon Cozens wrote: >For instance, chr() will produce Unicode codepoints. But you can pretend that >they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. I hope >and suspect there'll be an equivalent of "use bytes" which makes chr(256) >either blow up or wrap around. Actually no it won't. If the string you're doing a chr on is tagged as EBCDIC, you'll get the EBCDIC value. Yes, it does mean that this: chr($foo) == chr($bar); could evaluate to false if one of the strings is EBCDIC and the other isn't. Odd but I don't see a good reason not to. Otherwise we'd want to force everything to Unicode, and then what do we do if one of the strings is plain binary data? Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Safe signals and perl 6
> "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes: DS> Generally speaking, signals will be treated as generic async events in perl DS> 6, since that's what they are. (The ones that aren't, like SIGBUS, really DS> aren't things that perl code can catch...) They're going to be treated DS> pretty much like any other event, or so the plan is at least. DS> Uri's working on an event handling PDD for perl 6 IIRC, so when DS> that comes we can work from there. thanx for reminding me to work on it. it has been back burnered for a little while. uri -- Uri Guttman - [EMAIL PROTECTED] -- http://www.sysarch.com SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting The Perl Books Page --- http://www.sysarch.com/cgi-bin/perl_books The Best Search Engine on the Net -- http://www.northernlight.com
Re: Unicode handling
Dan Sugalski <[EMAIL PROTECTED]> doodled: > At 11:09 PM 3/23/2001 +, Simon Cozens wrote: > >For instance, chr() will produce Unicode codepoints. But you can pretend that > >they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. I hope > >and suspect there'll be an equivalent of "use bytes" which makes chr(256) > >either blow up or wrap around. > > Actually no it won't. If the string you're doing a chr on is tagged as > EBCDIC, you'll get the EBCDIC value. Yes, it does mean that this: > > chr($foo) == chr($bar); > > could evaluate to false if one of the strings is EBCDIC and the other > isn't. Err, perhaps I'm being dumb here - but surely $foo and $bar arent typed strings, they're just numbers (or strings which match /^\d+$/) ???
Re: Schwartzian Transform
i have to put my 2 cents in...  after reading all the discussion so far
about the Schwartz, i feel that map{} sort map{} is perfect in its syntax.
if you code and understand Perl (i've seen situations where these aren't
always both happening at the same time) and knowingly use the building-block
functions, sort and map, to create an abstraction like the Schwartzian
transform, then why do you need to come up with special syntax, or use a
Sort::Module as was suggested, to achieve just the same thing?

my point is that i wonder whether it's useful for Perl, or for people who
write Perl, to bundle a map and a sort function into some special
schwartzian syntax.  is the goal just to abstract another layer above the
transform itself?  why not just keep using map{} sort map{}, if it's a
well-understood concept?

monty

James Mastros wrote:
>
> On Thu, Mar 22, 2001 at 11:13:47PM -0500, John Porter wrote:
> > Brent Dax wrote:
> > > Someone else showed a very ugly syntax with an anonymous
> > > hash, and I was out to prove there was a prettier way to do it.
> > Do we want prettier? Or do we want more useful?
> > Perl is not exactly known for its pretty syntax.
> If you have to explicitly specify both the forward and inverse transforms,
> then it isn't very useful -- it's nothing more than map/sort/map. OTOH, if
> you only have to specify the forward mapping, it becomes more useful. Thus,
> I think the best syntax is
> tsort({xform}, {compare}, @list), where the {}s are anon blocks or curried
> expressions (same thing) and xform specifies the forward mapping (i.e. (lc
> ^_)) and compare specifies the comparator (i.e. (^_ cmp ^_)).
>
> This would always (do the equiv to) create a LoL in the inner map, sort on
> the ->[0] elem, and extract the ->[1] elem. Thus, it might not be as
> efficient as a hand-crafted schwartzian, but will be at least as efficient
> as a naive straight sort (except in pathological cases, like tsort((^_),
> (^_<=>^_), @list)).
>
>    -=- James Mastros
> --
> The most beautiful thing we can experience is the mysterious. It is the
> source of all true art and science. He to whom this emotion is a stranger,
> who can no longer pause to wonder and stand wrapt in awe, is as good as dead.
>    -=- Albert Einstein
> AIM: theorbtwo  homepage: http://www.rtweb.net/theorb/

--
Mark Koopman
Software Engineer
WebSideStory, Inc
10182 Telesis Court
San Diego CA 92121
858.546.1182.##.318
858.546.0480.fax

perl -e ' eval(lc(join("", map ({chr}(q( 49877273766940 80827378843973 32767986693280 69827639463932 39883673434341 ))=~/../g;'
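For concreteness, the plain map/sort/map spelling being defended reads like
this (sorting by length here; any key extraction works the same way):

    use strict;
    use warnings;

    my @words  = qw(hippopotamus ox zebra gnu);

    my @sorted = map  { $_->[1] }                 # 3. unwrap the original strings
                 sort { $a->[0] <=> $b->[0] }     # 2. sort on the precomputed key
                 map  { [ length($_), $_ ] }      # 1. compute the key once per item
                 @words;

    print "@sorted\n";    # ox gnu zebra hippopotamus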
Re: Unicode handling
On Fri, Mar 23, 2001 at 05:56:19PM -0500, Dan Sugalski wrote: > Nah, they only apply to data that perl's tagged as Unicode, either because > its input stream is marked that way or because the program explicitly > converted the data. Oh, colour me dull. I read 4) Data converted to Unicode (from ASCII, EBCDIC, one of the JIS encodings, or whatever) will be done into NFC. as meaning that data in ASCII, EBCDIC, or whatever will be converted to Unicode in NFC. Now I know what you mean, I think the rules you've described are perfect. -- Heh, heh, heh, heh... the NOISE of a bursar CHEWING Proctors' Memoranda. - Henry Braun is Oxford Zippy
Re: Unicode handling
On Fri, Mar 23, 2001 at 03:08:35PM -0500, Dan Sugalski wrote: > I'm half tempted, since this is a Unicode-only feature, to use a non-ASCII > character. > > \SMILEY FACE, perhaps? that makes it kind of hard to edit perl scripts that use this feature on any good old fashioned 8 bit xterm. Let alone some crufty 7 bit serial login. I think it would be a bad thing to effectively mandate that to use certain features you had to use a Unicode aware editing system Nicholas Clark
Re: Unicode handling
At 11:41 PM 3/22/2001 +0000, Nicholas Clark wrote:
>On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> > 1) All Unicode data perl does regular expressions against will be in
> > Normalization Form C, except for...
> > 2) Regexes tagged to run against a decomposed form will instead be run
> > against data in Normalization Form D. (What the tag is at the perl level is
> > up for grabs. I'd personally choose a D suffix)
> > 3) Perl won't otherwise force any normalization on data already in Unicode
> > format.
>
>So if I understand that correctly, running a regexp against a scalar will
>cause that scalar to become normalized in a defined way (C or D, depending
>on regexp)

It will be run against a normalized version of the data in the scalar, yes.
Whether that forces the scalar to be normalized or not is something I hadn't
thought of. If we do, then something as simple as this:

    while (<>) {
        $count++ if /bar/;
        print OUT $_;
    }

would potentially result in the output file being rather different from the
input file. Equivalent, yes, but different. Whether that's bad or not is an
open question.

> > 5) Any character-based call (ord, substr, whatever) will deal with whatever
> > code-points are at the location specified. If the string is LATIN SMALL
> > LETTER A, COMBINING ACUTE ACCENT and someone does a substr($foo, 1, 1) on
> > it, you get back the single character COMBINING ACUTE ACCENT, and an ord
> > would return the value 769.
>
>So if you do (ord, substr, whatever) on a scalar without knowing where it has
>been, you have no idea whether you're working on normalised or not.

Potentially, yes. If it's important, you force normalization on it.

>And in fact the same scalar may become denormalised:
>
> $bar = substr $foo, 3, 1;
> &frob ($foo);
> $baz = substr $foo, 3, 1;
>
>[so $bar and $baz differ] if someone runs it against a regular expression
>[in this case inside the subroutine &frob. Hmm, but currently you can
>make changes to parameters as they are pass-by-reference]
>
> $bar = substr $foo, 3, 1;
> $foo =~ /foo/;    # This is not read only in perl6
> $baz = substr $foo, 3, 1;
>
>But this is documented - if you want (ord, substr, whatever) on a string
>to make sense, you must explicitly normalize it to the form you want
>beforehand, and not use any of the documented-as-normalizing operators on it
>without normalizing it again.

It's generally safe to say that if you want data to make sense, period, you
need to make sure it's sensible first. Unicode with combining characters
does tend to exacerbate things, but it's not a new problem.

>And by implication of the above (particularly rule 3), eq compares
>codepoints, not normalized forms.

I hadn't thought about eq, gt, or lt. (Or sort, for that matter.) eq
probably ought to work against codepoints and be done with it. gt/lt/sort
should normalize and use the Unicode sorting stuff to determine where things
stand. I don't think they should alter the data, as we may be promoting
non-Unicode data to Unicode for comparisons. (If we're comparing ASCII and
Unicode scalars, or even something odd like Shift-JIS and EBCDIC scalars.)

Dan

--"it's like this"---
Dan Sugalski                           even samurai
[EMAIL PROTECTED]                      have teddy bears and even
                                       teddy bears get drunk
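The "Unicode sorting stuff" here is the Unicode Collation Algorithm; as an
illustration only (this is the CPAN Unicode::Collate module later bundled
with Perl 5.8, not the Perl 6 interface under discussion), collation-aware
comparison already ignores the raw codepoint spelling:

    use Unicode::Collate;

    my $collator = Unicode::Collate->new();

    my $composed   = "r\x{00E9}sum\x{00E9}";      # precomposed e-acute
    my $decomposed = "re\x{0301}sume\x{0301}";    # e + combining acute

    print $collator->eq($composed, $decomposed)
        ? "collation sees them as equal\n"
        : "collation sees them as different\n";

    print $composed eq $decomposed
        ? "eq sees them as equal\n"
        : "eq compares raw codepoints\n";

    my @sorted = $collator->sort(qw(zebra Apple apple));   # UCA order, not codepoint order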
Re: Distributive -> and indirect slices
On Mon, Mar 19, 2001 at 08:30:31AM -0800, Peter Scott wrote: > Seen http://dev.perl.org/rfc/82.pod? I hadn't. I'm surprised it didn't give the PDL people screaming fits. But no, I wouldn't do it like that. It has: @b = (1,2,3); @c = (2,4,6); @d = @b * @c; # Returns (2,8,18) Where I would have @d = (2,4,6,4,8,12,6,12,18); However, this isn't great language design; it's applying a specific solution to a specific problem. Better is to solve the general problem, and have all operators overloadable even on non-objects, so the user can define how this sort of thing works. -- I want you to know that I create nice things like this because it pleases the Author of my story. If this bothers you, then your notion of Authorship needs some revision. But you can use perl anyway. :-) - Larry Wall
Re: Unicode handling
Jarkko Hietaniemi writes: : *cough* \C *is* taken. : : > >also \U has a meaning in double quotish strings. : : "\Uindeed." Bear in mind we are redesigning the language. If there's a botch we can think about fixing it. Though maybe not on -internals... :-) Larry
Re: Unicode handling
At 02:27 PM 3/23/2001 -0500, Uri Guttman wrote: > > "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes: > > DS> U doesn't really signal "glyph" to me, but we are sort of limited > DS> in what we have left. We still need a zero-width assertion for > DS> glyph boundary within regexes themselves. > >how about \C? it doesn't seem to be taken and would mean char boundary (not >exactly a glyph but close enough). That's got the unfortunate mental association with C's char for lots of folks, and I know I'd probably get it stuck to codepoint rather than glyph if I didn't use it much. >also \U has a meaning in double quotish strings. > >uri > >-- >Uri Guttman - [EMAIL PROTECTED] -- http://www.sysarch.com >SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting >The Perl Books Page --- http://www.sysarch.com/cgi-bin/perl_books >The Best Search Engine on the Net -- http://www.northernlight.com Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Safe signals and perl 6
Generally speaking, signals will be treated as generic async events in perl 6, since that's what they are. (The ones that aren't, like SIGBUS, really aren't things that perl code can catch...) They're going to be treated pretty much like any other event, or so the plan is at least. Uri's working on an event handling PDD for perl 6 IIRC, so when that comes we can work from there. Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode handling
At 02:06 PM 3/23/2001 -0600, Jarkko Hietaniemi wrote: >On Fri, Mar 23, 2001 at 02:50:05PM -0500, Dan Sugalski wrote: > > At 02:27 PM 3/23/2001 -0500, Uri Guttman wrote: > > > > "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes: > > > > > > DS> U doesn't really signal "glyph" to me, but we are sort of limited > > > DS> in what we have left. We still need a zero-width assertion for > > > DS> glyph boundary within regexes themselves. > > > > > >how about \C? it doesn't seem to be taken and would mean char boundary > (not > > >exactly a glyph but close enough). > > > > That's got the unfortunate mental association with C's char for lots of > > folks, and I know I'd probably get it stuck to codepoint rather than glyph > > if I didn't use it much. > >*cough* \C *is* taken. > > > >also \U has a meaning in double quotish strings. > >"\Uindeed." I'm half tempted, since this is a Unicode-only feature, to use a non-ASCII character. \SMILEY FACE, perhaps? Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode handling
At 08:14 PM 3/23/2001 +, Nicholas Clark wrote: >On Fri, Mar 23, 2001 at 03:08:35PM -0500, Dan Sugalski wrote: > > I'm half tempted, since this is a Unicode-only feature, to use a non-ASCII > > character. > > > > \SMILEY FACE, perhaps? > >that makes it kind of hard to edit perl scripts that use this feature on >any good old fashioned 8 bit xterm. >Let alone some crufty 7 bit serial login. >I think it would be a bad thing to effectively mandate that to use certain >features you had to use a Unicode aware editing system Point. Never mind--lots of folks with non-Unicode aware terminals and editors will be writing code that handles unicode data. Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode handling
> > >We need the character equivalence construct, such as [[=a=]], which
> > >matches "a", "A ACUTE".
> >
> > Yeah, we really need a big list of these. PDD anyone?
>
> But surely this is a locale issue, and not an encoding one? Not every
> language recognizes the same character equivalences.

Let me clarify. The "character equivalence", assuming [[~a~]] syntax, means
matching a sequence of a single letter 'a' followed by any number of
combining characters. I believe we can handle this without considering
locale. Whether it is still useful is up for discussion. At least it is
trivial to implement.

Hong
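In Perl 5 regex terms, that reading of [[~a~]] can be approximated today
with the \pM property (any combining mark), independent of locale; the
pattern below is just an illustration of the idea, not the proposed syntax:

    use strict;
    use warnings;

    my $str = "a\x{0301} a\x{0308} a";   # a-acute, a-diaeresis, bare a

    # A single letter 'a' followed by any number of combining marks:
    my @hits = $str =~ /(a\pM*)/g;
    print scalar(@hits), " matches\n";   # 3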
Re: Distributive -> and indirect slices
Simon Cozens wrote: > Better is to solve the general problem, and have all > operators overloadable even on non-objects, so the user > can define how this sort of thing works. Even better is to let the user have access to the real objects by which "non-objects", i.e. normal variables, are implemented. That is, to remove the "non-object"-ness of normal variables. (I'm not disagreeing with Simon, just twisting the idea a little.) -- John Porter
Re: Unicode handling
On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> Yes, I realize that point 5 may result in someone getting a meaningless
> Unicode string. Too bad--it is *not* the place of a programming language to
> enforce validity on data. That's the programmer's job.

But points 4 and 5 do enforce Unicode on everyone. Not that I'm particularly
upset by that idea, but... :)

    open FH, $datafile or die $!;
    undef $/;
    $foo = <FH>;
    die "Confusing" if -s $datafile != length $foo;

I'm just not sure it's fair on Old World hackers. Will there be a way to
stop Perl upgrading stuff to Unicode on the way in?

--
i've dreamed in Perl many time, last night i dreamed in Make,
and that just sucks.
Re: Unicode handling
At 01:07 PM 3/23/2001 -0800, Larry Wall wrote: >Jarkko Hietaniemi writes: >: *cough* \C *is* taken. >: >: > >also \U has a meaning in double quotish strings. >: >: "\Uindeed." > >Bear in mind we are redesigning the language. If there's a botch we >can think about fixing it. > >Though maybe not on -internals... :-) Good point. It's enough for us to say the regex engine will do glyph breaks and glyph instead of character semantics. We can pass on the language-level bits to someone else. (I hear we have someone doing that language thing... :) Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk