Re: PDD for code comments ????
David L. Nicol <[EMAIL PROTECTED]> writes:
>Jarkko Hietaniemi wrote:
>
>> Some sort of simple markup embedded within the C comments. Hey, let's
>> extend pod! Hey, let's use XML! Hey, let's use SGML! Hey, let's use
>> XHTML! Hey, let's use lout! Hey, ...
>
>Either run pod through a pod puller before the C preprocessor gets to
>the code, or figure out a set of macros that can quote and ignore pod.
>
>The second is Yet Another Halting Problem so we go with the first?
>
>Which means a little program to depod the source before building it,
>or a -HASPOD extension to gcc
>
>Or just getting in the habit of writing
>
>/*
>=pod
>
>
>and
>
>=cut
>*/

Perhaps we could teach pod that /* was alias for =pod and */ an alias for =cut ?

--
Nick Ing-Simmons <[EMAIL PROTECTED]>
Via, but not speaking for: Texas Instruments Ltd.
Re: Unicode handling
Simon Cozens wrote:
[...]
> I'm just not sure it's fair on Old World hackers. Will there be a way to stop
> Perl upgrading stuff to Unicode on the way in?

And I'm probably not the only Old World hacker who would prefer a build option to simply eliminate Unicode support altogether...
Re: PDD for code comments ????
Nick Ing-Simmons <[EMAIL PROTECTED]> opined:
> >Either run pod through a pod puller before the C preprocessor gets to
> >the code, or figure out a set of macros that can quote and ignore pod.
> >
> >The second is Yet Another Halting Problem so we go with the first?
> >
> >Which means a little program to depod the source before building it,
> >or a -HASPOD extension to gcc
> >
> >Or just getting in the habit of writing
> >
> >/*
> >=pod
> >
> >
> >and
> >
> >=cut
> >*/
>
> Perhaps we could teach pod that /* was alias for =pod
> and */ an alias for =cut ?

or possibly

/*=foo is an alias for =foo,
and */ is an alias for =cut only after a /*= has been encountered.

EG

/*=for apidoc sv_upgrade

Upgrade an SV to a more complex form. Use C. See C.

*/

rather than

/*
=for apidoc sv_upgrade

Upgrade an SV to a more complex form. Use C. See C.

=cut
*/
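The aliasing rule proposed above can be sketched as a small stand-alone extractor. This is only an illustration in Python, not anything from the Perl distribution: the function name `extract_pod` is hypothetical, and it assumes the opening `/*=` and the closing `*/` each sit at the start of their own line, which is the line-oriented case under discussion.

```python
def extract_pod(c_source: str) -> str:
    """Return the pod found in '/*=' ... '*/' comment blocks of c_source."""
    pod_lines = []
    in_pod = False
    for line in c_source.splitlines():
        stripped = line.strip()
        if not in_pod:
            # Only a comment that opens with '/*=' at the start of a line
            # begins a pod block; ordinary comments are ignored.
            if stripped.startswith("/*="):
                in_pod = True
                pod_lines.append(stripped[2:])  # '/*=for x' becomes '=for x'
        elif stripped == "*/":
            # '*/' acts as '=cut', but only after a '/*=' has been seen.
            in_pod = False
            pod_lines.append("=cut")
            pod_lines.append("")
        else:
            pod_lines.append(line)
    return "\n".join(pod_lines)
```

Because `*/` is only treated as `=cut` while a pod block is open, ordinary C comments elsewhere in the file pass through without producing stray pod.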
RE: Unicode handling
From: Dan Sugalski [mailto:[EMAIL PROTECTED]]
> At 11:09 PM 3/23/2001 +, Simon Cozens wrote:
> >
> > For instance, chr() will produce Unicode codepoints. But
> > you can pretend that they're ASCII codepoints, it's only
> > the EBCDIC folk that'll get hurt. I hope and suspect
> > there'll be an equivalent of "use bytes" which makes
> > chr(256) either blow up or wrap around.
>
> Actually no it won't. If the string you're doing a chr on is
> tagged as EBCDIC, you'll get the EBCDIC value. Yes, it does
> mean that this:
>
>    chr($foo) == chr($bar);
>
> could evaluate to false if one of the strings is EBCDIC and
> the other isn't. Odd but I don't see a good reason not to.
> Otherwise we'd want to force everything to Unicode, and then
> what do we do if one of the strings is plain binary data?

Someone please clue me in. A pointer to an RFC which defines the use of colons in Perl6 among other things would help.

Why not have subsequent uses of : on the same variable name perform a cast? Or perhaps better, return the cast value?

  $foo : EBCDIC = 'A';    # declares $foo as EBCDIC typed string
  $bar = $foo : UTF8-C;   # assigns cast value of $foo to $bar
  $bar := UTF8-D;         # casts $bar to different string type
  ord($foo) != ord($bar);
  ord($foo : utf8) == ord($bar : utf8);

If there is a default string type... perhaps it could be:

  ord($foo:) == ord($bar:);

Garrett
RE: Unicode handling
At 09:09 AM 3/26/2001 -0600, Garrett Goebel wrote:
>Someone please clue me in. A pointer to an RFC which defines the use of
>colons in Perl6 among other things would help.
>
>Why not have subsequent uses of : on the same variable name perform a cast?
>Or perhaps better, return the cast value?

We're sort of tossing around pseudo-code. Nailing down the syntax isn't really the place of the -internals list, as Larry's rightly pointed out. The ord opcode will be handed a string tagged with its type, and can Do The Right Thing given that tag. How you tag it at the perl level's another issue, and one we can dodge as it's not our decision really.

Dan

--"it's like this"---
Dan Sugalski        even samurai
[EMAIL PROTECTED]   have teddy bears and even
                    teddy bears get drunk
Re: Unicode handling
At 05:09 PM 3/23/2001 -0800, Damien Neil wrote:
>So the results of ord are dependent on a global setting for "current
>character set" or some such, not on the encoding of the string that
>is passed to it?

Nope, ord is dependent on the string it gets, as those strings know what their encoding is. chr is the one dependent on the current default encoding.

Dan
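The split described here (ord looks at the string's own tag, chr falls back on the current default) can be illustrated with a rough Python sketch. Everything in it is an assumption made for illustration: `TaggedString`, `tagged_ord` and `tagged_chr` are hypothetical names, Python's `cp037` codec stands in for "EBCDIC", and only single-byte encodings are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedString:
    data: bytes      # raw bytes as stored
    encoding: str    # tag carried with the string (a Python codec name)


def tagged_ord(s: TaggedString) -> int:
    # ord consults only the string it is given: for a single-byte encoding
    # the answer is the first stored byte, whatever the default encoding is.
    return s.data[0]


def tagged_chr(n: int, default_encoding: str) -> TaggedString:
    # chr has no string to inspect, so its result is tagged with the
    # current default encoding.
    return TaggedString(bytes([n]), default_encoding)


ascii_a = TaggedString("A".encode("ascii"), "ascii")
ebcdic_a = TaggedString("A".encode("cp037"), "cp037")
```

With these definitions `tagged_ord(ascii_a)` is 65 and `tagged_ord(ebcdic_a)` is 193, so two strings holding the same letter compare unequal under ord, which is the oddity the thread goes on to discuss.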
Re: PDD for code comments ????
At 02:58 PM 3/26/2001 +0100, Dave Mitchell wrote:
>Nick Ing-Simmons <[EMAIL PROTECTED]> opined:
> > Perhaps we could teach pod that /* was alias for =pod
> > and */ an alias for =cut ?
>
>or possibly
>
>/*=foo is an alias for =foo,
>and */ is an alias for =cut only after a /*= has been encountered.

I think I'd rather we built a simple extractor rather than teach pod about how to extract itself from C comments, but if we're going to do that, tell it that */ is an alias for =cut, and that it should strip out leading /* on lines...

Dan
Re: Unicode handling
Dan Sugalski <[EMAIL PROTECTED]> writes:
>At 05:09 PM 3/23/2001 -0800, Damien Neil wrote:
>>So the results of ord are dependent on a global setting for "current
>>character set" or some such, not on the encoding of the string that
>>is passed to it?
>
>Nope, ord is dependent on the string it gets, as those strings know what
>their encoding is.

And the code knows what it wants: if I am in an EBCDIC context then I am going to expect the ord values to be the ones I am used to. This is the main pain with 5.7.*'s EBCDIC scheme - making

  ord('A') == 193

true :-/

> chr is the one dependent on the current default encoding.

You are going to see both used in legacy stuff.

--
Nick Ing-Simmons
Re: Unicode handling
At 05:45 PM 3/26/2001 +, [EMAIL PROTECTED] wrote:
>Dan Sugalski <[EMAIL PROTECTED]> writes:
> >At 05:09 PM 3/23/2001 -0800, Damien Neil wrote:
> >>So the results of ord are dependent on a global setting for "current
> >>character set" or some such, not on the encoding of the string that
> >>is passed to it?
> >
> >Nope, ord is dependent on the string it gets, as those strings know what
> >their encoding is.
>
>And the code knows what it wants: if I am in an EBCDIC context
>then I am going to expect ord to be ones I am used to.
>This is the main pain with 5.7.*'s EBCDIC scheme - making
>
>ord('A') == 193
>
>true :-/

That would be true if EBCDIC was the default encoding, otherwise false. If the code cares, it can do a "use encoding qw(Unicode);" to force things.

> > chr is the one dependent on the current default encoding.
>
>You are going to see both used in legacy stuff.

No doubt. Hopefully that'll go away as people are in a position to force things the way they want them.

Dan
Re: Unicode handling
On Mon, Mar 26, 2001 at 11:32:46AM -0500, Dan Sugalski wrote:
> At 05:09 PM 3/23/2001 -0800, Damien Neil wrote:
> >So the results of ord are dependent on a global setting for "current
> >character set" or some such, not on the encoding of the string that
> >is passed to it?
>
> Nope, ord is dependent on the string it gets, as those strings know what
> their encoding is. chr is the one dependent on the current default encoding.

So $c = chr(ord($c)) could change $c? That seems odd.

In what other circumstances will the encoding of a string be visible to the programmer? Not when printing the string to a file handle, I would think -- that should be controlled by the encoding on the handle. Are there any other cases where encoding matters?

- Damien
Re: Unicode handling
At 04:34 PM 3/24/2001 -0800, Dave Storrs wrote:
> I'll just toss my 0.01 cents in...my thought here is that this
>thread has now tied up a lot of cycles from a lot of very smart, very
>experienced people without resulting in an answer that is clearly The
>Right Thing. Whatever we do, there is a problem at some point...if we do
>normalizations internally for some functions, then you end up with a
>situation like the code above, which looks like it should produce
>identical input and output files, but won't necessarily. OTOH, if we
>don't do normalizations, then (e.g.) length() can return different values
>for different representations of the same string.

For length, I'd as soon it returned the number of code points, but glyphs and bytes are also valid return values.

Part of the problem isn't so much an argument over functionality as one of mapping. We have a number of things we're trying to wedge into one or two functions, and that's always going to cause problems. (It might just be that we haven't wrapped our brains completely around the possibility that we can actually change the language... :)

There's also the problem of extending the functionality without inconveniencing the current crop of perl programmers. Getting Unicode working is grand, but we can't make life more difficult for folks that don't use or need it. And I'm a touch nervous about arranging things so that perl just happens to Do The Right Thing in most circumstances, like defining length to return the number of code points, since that'll bite folks when the accidental functionality fails for some reason.

> My suggestion is, let's punt on this one...make it the
>programmer's responsibility to ensure that Unicode strings are represented
>in the desired way.

For a good bit of this I'd agree completely.

Dan
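The length() problem raised in the quoted text is easy to reproduce. The following is a small Python illustration (not Perl, and not a proposal for how perl should behave) using the standard `unicodedata` module: the same visible character has a different code point count depending on whether it is stored composed or decomposed.

```python
import unicodedata

composed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT, two code points

# The two strings render the same but differ in length and compare unequal
# until one of them is normalized to the other's form.
lengths = (len(composed), len(decomposed))
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
equal_after_nfd = unicodedata.normalize("NFD", composed) == decomposed
```

If normalization happened silently inside some functions, a program that reads `decomposed` and writes it back could emit `composed` instead, which is the "identical input and output files" concern in the message above.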
Re: Unicode handling
Dan Sugalski <[EMAIL PROTECTED]> writes:
>>This the main pain with 5.7.*'s EBCDIC scheme - making
>>
>>ord('A') == 193
>>
>>true :-/
>
>That would be true if EBCDIC was the default encoding, otherwise false.

But what about

  our $var;

  {
    use encoding 'US-ascii';
    $var = 'A';
  }

  {
    use encoding 'ibm-1047';  # EBCDIC
    if (ord($var) == 193) {
    }
    else {
    }
  }

>>You are going to see both used in legacy stuff.
>
>No doubt. Hopefully that'll go away as people are in a position to force
>things the way they want them.

A. World has been de-facto ASCII for years - even Japanese codings make it easy - but EBCDIC is still there.
B. But do they want them the way we want them?

>I don't see any reason not to have the encoding lexically scoped and
>settable via use. Probably either "use encoding qw(EBCDIC);" or "use
>ebcdic;". The former would be easier to extend, but it's Larry's call.

Remember that Larry's initial idea was 'use utf8' was going to be lexical. Then reality kicked in and we had to tag the data to retain our sanity and "zillions" of spots needed tweaking to translate on demand.

So if you want lexical scoped encoding make sure the infrastructure can scale to cope...

--
Nick Ing-Simmons
Re: Unicode handling
At 02:52 AM 3/25/2001 -0500, Philip Newton wrote: >On Fri, 23 Mar 2001, Dan Sugalski wrote: > > > At 02:31 PM 3/23/2001 -0500, Bryan C. Warnock wrote: > > >On Friday 23 March 2001 14:18, Dan Sugalski wrote: > > > > At 01:30 PM 3/22/2001 -0800, Hong Zhang wrote: > > > > >We need the character equivalence construct, such as [[=a=]], which > > > > >matches "a", "A ACUTE". > > > > > > > > Yeah, we really need a big list of these. PDD anyone? > > > > > >But surely this is a locale issue, and not an encoding one? Not every > > >language recognizes the same character equivalences. > > > > In Unicode, there's theoretically no locale. Theoretically... > >But it still has special-case mappings such as LATIN SMALL LETTER I can >map to either LATIN CAPITAL LETTER I or LATIN CAPITAL LETTER I WITH DOT >ABOVE, depending on whether your is Turkish-y or not. That's >kind of like locale, even if you don't call it that. (And IIRC, the >mapping of uppercase(LATIN LETTER SHARP S) to "SS" is also a special case >for German.) Gack. Well, that's still not nearly as bad as the current "what character does this 8-bit byte map to" stuff we have with locales now. I can deal with switching mapping tables. Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
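"Switching mapping tables" for the special cases quoted above might look roughly like the Python sketch below. The table `SPECIAL_UPPER`, the language key `"tr"` and the function `upper_for` are all hypothetical names for illustration. Note that Python's own `str.upper()` already maps LATIN SMALL LETTER SHARP S to "SS" unconditionally, so only the Turkish dotted i needs a per-language override here.

```python
# Per-language overrides applied before the default Unicode uppercase mapping.
SPECIAL_UPPER = {
    "tr": {"i": "\u0130"},  # Turkish: i -> LATIN CAPITAL LETTER I WITH DOT ABOVE
}


def upper_for(text: str, lang: str = "") -> str:
    """Uppercase text, consulting the override table for lang first."""
    table = SPECIAL_UPPER.get(lang, {})
    return "".join(table.get(ch, ch.upper()) for ch in text)
```

For example, `upper_for("i")` gives "I" while `upper_for("i", "tr")` gives U+0130; the caller has to say which table applies, which is locale-like information under another name.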
Re: Unicode handling
Damien Neil <[EMAIL PROTECTED]> writes:
>On Mon, Mar 26, 2001 at 11:32:46AM -0500, Dan Sugalski wrote:
>> At 05:09 PM 3/23/2001 -0800, Damien Neil wrote:
>> >So the results of ord are dependent on a global setting for "current
>> >character set" or some such, not on the encoding of the string that
>> >is passed to it?
>>
>> Nope, ord is dependent on the string it gets, as those strings know what
>> their encoding is. chr is the one dependent on the current default encoding.
>
>So $c = chr(ord($c)) could change $c? That seems odd.

It changes its _representation_ (e.g. from 0x41,ASCII to 0xC1,EBCDIC) but not its "fundamental" 'LATIN CAPITAL LETTER A'-ness. Then of course someone will want it to be the number 0x41 and not do that 'cos they are using chr/ord to mess with JPEG image data... So there needs to be a 'binary' encoding which they can use.

>
>In what other circumstances will the encoding of a string be
>visible to the programmer?

One of the snags in perl5.7.* is that there isn't an easy way for the programmer to get at the encoding. 'use bytes' exposes it but does not tell you what it is.

>Not when printing the string to
>a file handle, I would think -- that should be controlled by
>the encoding on the handle. Are there any other cases where
>encoding matters?
>
> - Damien

--
Nick Ing-Simmons
Re: Unicode handling
Dan Sugalski <[EMAIL PROTECTED]> writes:
>
>For length, I'd as soon it returned the number of code points, but glyphs
>and bytes are also valid return values.

And that may be where it belongs - at the language level

  chars($s)  == 120
  bytes($s)  == 480
  glyphs($s) == 360

length($s) is 17.34 femto furlongs on a 24" 1023x767 display in 'Courier' ;-)

--
Nick Ing-Simmons
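Splitting length into separate, explicitly named measures could be sketched as follows. This is a Python illustration only: the function names echo the ones in the message but are hypothetical, `byte_count` depends on the encoding passed in, and `glyphs` is a crude approximation that simply skips combining marks rather than doing real grapheme segmentation.

```python
import unicodedata


def chars(s: str) -> int:
    # Number of code points.
    return len(s)


def byte_count(s: str, encoding: str = "utf-8") -> int:
    # Number of bytes once the string is encoded; varies with the encoding.
    return len(s.encode(encoding))


def glyphs(s: str) -> int:
    # Rough approximation: count code points that are not combining marks.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


sample = "e\u0301a"   # 'e' + COMBINING ACUTE ACCENT + 'a'
```

For `sample`, `chars` gives 3, `byte_count` gives 4 in UTF-8 and 12 in UTF-32, and `glyphs` gives 2, so a single length() could defensibly return any of several numbers.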
Re: Unicode handling
On Mon, Mar 26, 2001 at 06:16:00PM +, [EMAIL PROTECTED] wrote:
> Damien Neil <[EMAIL PROTECTED]> writes:
> >On Mon, Mar 26, 2001 at 11:32:46AM -0500, Dan Sugalski wrote:
> >> At 05:09 PM 3/23/2001 -0800, Damien Neil wrote:
> >> >So the results of ord are dependent on a global setting for "current
> >> >character set" or some such, not on the encoding of the string that
> >> >is passed to it?
> >>
> >> Nope, ord is dependent on the string it gets, as those strings know what
> >> their encoding is. chr is the one dependent on the current default encoding.
> >
> >So $c = chr(ord($c)) could change $c? That seems odd.
>
> It changes its _representation_ (e.g. from 0x41,ASCII to 0xC1,EBCDIC)
> but not its "fundamental" 'LATIN CAPITAL LETTER A'-ness.
> Then of course someone will want it to be the number 0x41 and not do
> that 'cos they are using chr/ord to mess with JPEG image data...
> So there needs to be a 'binary' encoding which they can use.

That doesn't seem to be what Dan was saying, however. It would make perfect sense to me for chr(ord($c)) to return $c in a different encoding. (Assuming, of course, that $c is a single character.)

Assume ord is dependent on the current default encoding.

  use utf8; # set default encoding.
  my $e : ebcdic = 'a';
  my $u = chr(ord($e));

If ord is dependent on the current default encoding, I would expect the above to leave the UTF-8 string "a" in $u. This makes sense to me.

If ord is dependent on the encoding of the string it gets, as Dan was saying, then ord($e) is 0x81, and $u is "\x81". This seems strange.

Hmm. It suddenly occurs to me that I may have been misinterpreting: ord is dependent on both the encoding of its argument (to determine the logical character contained in that argument) and the current default encoding (to determine the value in the current character set representing that character).

- Damien
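The two readings of ord in this message can be put side by side in a short Python sketch. All names are hypothetical, `cp037` stands in for "ebcdic", and `latin-1` is used as the default encoding purely so that every byte value decodes; the message itself uses UTF-8 as the default.

```python
def ord_by_tag(data: bytes, tag: str) -> int:
    # Reading 1: the number is the stored byte value in the string's own
    # encoding; the tag is not otherwise consulted for single-byte data.
    return data[0]


def ord_via_default(data: bytes, tag: str, default: str) -> int:
    # Reading 2: use the tag to find the logical character, then report
    # that character's value in the current default encoding.
    return data.decode(tag).encode(default)[0]


def chr_in_default(n: int, default: str) -> str:
    # chr interprets the number in the current default encoding.
    return bytes([n]).decode(default)


e = "a".encode("cp037")   # the EBCDIC byte for 'a'
```

Under reading 1, `chr_in_default(ord_by_tag(e, "cp037"), "latin-1")` yields "\x81"; under reading 2, `chr_in_default(ord_via_default(e, "cp037", "latin-1"), "latin-1")` yields "a". Only the second makes chr(ord($c)) preserve the character.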
RE: Unicode handling
From: Dan Sugalski [mailto:[EMAIL PROTECTED]]
> At 05:09 PM 3/23/2001 -0800, Damien Neil wrote:
> > So the results of ord are dependent on a global setting for
> > "current character set" or some such, not on the encoding
> > of the string that is passed to it?
>
> Nope, ord is dependent on the string it gets, as those
> strings know what their encoding is. chr is the one dependent
> on the current default encoding.

Are built-ins like chr going to be nailed to one encoding at compile time, or will we be able to toggle the default encoding setting at runtime?

Besides having the ord opcode dispatched by the string tag, will it be possible to have the chr opcode dispatched by the type of return value wanted?

  $foo:ASCII = chr(65);
  @foo:ASCII = map chr($_), 80, 69, 82, 76;

I assume internals-wise that this is similar to whether a function was called in a scalar, array, or void context... But it further raises the specter of multiple dispatch to include typing.

Garrett
RE: Unicode handling
At 11:42 AM 3/26/2001 -0600, Garrett Goebel wrote:
>From: Dan Sugalski [mailto:[EMAIL PROTECTED]]
> > At 05:09 PM 3/23/2001 -0800, Damien Neil wrote:
> > > So the results of ord are dependent on a global setting for
> > > "current character set" or some such, not on the encoding
> > > of the string that is passed to it?
> >
> > Nope, ord is dependent on the string it gets, as those
> > strings know what their encoding is. chr is the one dependent
> > on the current default encoding.
>
>Are built-ins like chr going to be nailed to one encoding at compile time,
>or will we be able to toggle the default encoding setting at runtime?

I don't see any reason not to have the encoding lexically scoped and settable via use. Probably either "use encoding qw(EBCDIC);" or "use ebcdic;". The former would be easier to extend, but it's Larry's call.

>Besides having the ord opcode dispatched by the string tag, will it be
>possible to have the chr opcode dispatched by the type of return value
>wanted?
>
>$foo:ASCII = chr(65);
>@foo:ASCII = map chr($_), 80, 69, 82, 76;
>
>I assume internals-wise that this is similar to whether a function was
>called in a scalar, array, or void context... But further raises the specter
>of multiple dispatch to include typing.

That's an interesting question, and it depends on how the chr operator's defined. If it's done via the variable vtables it's sort of easy (sort of) in some ways, and if not it's sort of easy in others. There's no reason it can't be done, the question is whether it's useful enough to justify the speed penalty.

Dan
Re: PDD for code comments ????
Nick Ing-Simmons wrote:

> Perhaps we could teach pod that /* was alias for =pod
> and */ an alias for =cut ?

That won't work because pod/cut is strictly line-based and C-style comments are strictly stream-based.

--
David Nicol 816.235.1187 [EMAIL PROTECTED]
He who says it's impossible shouldn't interrupt the one doing it.
Re: PDD for code comments ????
On Mon, Mar 26, 2001 at 01:23:36PM -0600, David L. Nicol wrote:
> Nick Ing-Simmons wrote:
> > Perhaps we could teach pod that /* was alias for =pod
> > and */ an alias for =cut ?
>
> that won't work because pod/cut is strictly line-based and C-style
> comments are strictly stream-based.

Damn. Could you patch embed.pl in the Perl 5 distribution, because it does more or less what Nick suggested perfectly well right now. :)

--
"Having just ordered 40 books and discovered I have no change out of a grand, I'm thinking of getting a posse together and going after some publishers. I'd walk into a petrol station and buy lots of petrol on Monday, too, but I think I'd get funny looks. More funny looks." - Mark Dickerson
Re: Unicode handling
Damien Neil <[EMAIL PROTECTED]> writes:
>> >So $c = chr(ord($c)) could change $c? That seems odd.
>>
>> It changes its _representation_ (e.g. from 0x41,ASCII to 0xC1,EBCDIC)
>> but not its "fundamental" 'LATIN CAPITAL LETTER A'-ness.
>> Then of course someone will want it to be the number 0x41 and not do
>> that 'cos they are using chr/ord to mess with JPEG image data...
>> So there needs to be a 'binary' encoding which they can use.
>
>That doesn't seem to be what Dan was saying, however.

And Dan is the one "in charge" on this list - so my perl5.7-ish view may be wrong.

>It would make
>perfect sense to me for chr(ord($c)) to return $c in a different
>encoding. (Assuming, of course, that $c is a single character.)
>
>Assume ord is dependent on the current default encoding.
>
> use utf8; # set default encoding.
> my $e : ebcdic = 'a';
> my $u = chr(ord($e));
>
>If ord is dependent on the current default encoding, I would expect
>the above to leave the UTF-8 string "a" in $u. This makes sense to
>me.

Good.

>
>If ord is dependent on the encoding of the string it gets, as Dan
>was saying, then ord($e) is 0x81,

It could still be 0x81 (from ebcdic) with the encoding carried along with the _number_ if we thought that worth the trouble. (It isn't too bad for assignment, but it is far from clear what 2 (ebcdic) * 0xA1 (iso_8859_7) might mean - perhaps we drop the tag if anything other than + or - happens.)

>and $u is "\x81". This seems
>strange.
>
>Hmm. It suddenly occurs to me that I may have been misinterpreting:
>ord is dependent on both the encoding of its argument (to determine
>the logical character contained in that argument) and the current
>default encoding (to determine the value in the current character set
>representing that character).
>
> - Damien

--
Nick Ing-Simmons
Re: PDD for code comments ????
David L. Nicol <[EMAIL PROTECTED]> writes:
>Nick Ing-Simmons wrote:
>
>> Perhaps we could teach pod that /* was alias for =pod
>> and */ an alias for =cut ?
>
>that won't work because pod/cut is strictly line-based and C-style
>comments are strictly stream-based.

I was not suggesting we hunt down

  2 ** /* */ 14 / /* */ 17

type comments. I was just suggesting that pod could recognise line oriented block comments.

/*=for ...

*/

vs

/*
=for ...

=cut
*/

--
Nick Ing-Simmons
Re: Unicode handling
On Mon, Mar 26, 2001 at 08:37:05PM +, [EMAIL PROTECTED] wrote:
> >If ord is dependent on the encoding of the string it gets, as Dan
> >was saying, then ord($e) is 0x81,
>
> It could still be 0x81 (from ebcdic) with the encoding carried
> along with the _number_ if we thought that worth the trouble.

I'm going to go away and whimper in pain for a bit, now.

"I thought chr(0x61) was 'a'."
"It is, but that's an EBCDIC number."

- Damien
RE: Unicode handling
From: Damien Neil [mailto:[EMAIL PROTECTED]]
> On Mon, Mar 26, 2001 at 08:37:05PM +, [EMAIL PROTECTED] wrote:
> > >
> > > If ord is dependent on the encoding of the string it gets, as Dan
> > > was saying, then ord($e) is 0x81,
> >
> > It could still be 0x81 (from ebcdic) with the encoding carried
> > along with the _number_ if we thought that worth the trouble.
>
> I'm going to go away and whimper in pain for a bit, now.
>
> "I thought chr(0x61) was 'a'." "It is, but that's an EBCDIC number."

That's assuming the default string encoding has been set to EBCDIC... so that chr would be translating numbers into EBCDIC tagged strings.

So far we've only been talking about global encoding defaults which tag how an opcode is dispatched. Is anyone interested in typed 'want' contexts?

  use encoding 'EBCDIC';
  $foo:ASCII = chr(0x61);

Garrett