Re: question regarding rules and bytes vs characters

2004-07-11 Thread Ph. Marek
> : Hello everybody,
> :
> : I'm about to teach myself perl6 (after using perl5 for some time).
>
> I'm also trying to learn perl6 after using perl5 for some time.  :-)
I wouldn't even try to compare you and me  :-)

> Pretty close.  The way it's set up currently, $len is a reference
> to a variable external to the rule, so $len is likely to fail under
> stricture unless you've declared "my $len" somewhere.  To make the
> variable automatically scope to the rule, you have to use $?len
> these days.
ok.

> : Furthermore, perl6 is said to be Unicode-ready.
> : So I put the :u0 modifier in the data regex; will that DWIM if I try to
> : match a Unicode string with that rule?
>
> It should.  However (and this is a really big however), you'll have
> to be very careful that something earlier hasn't converted one form
> of Unicode to another on you.  For instance, if your string came in
> as UTF-8, and your I/O layer translated it internally to UTF-32 or
> some such, you're just completely hosed.  When you're working at the
> bytes level, you must know the encoding of your string.
>
> So the natural reaction is to open your I/O handle :raw to get binary
> data into your string.  Then you try to match Unicode graphemes with [
> :u2 . ] and discover that *that* doesn't work.  Which is obvious when
> you consider that Perl has no way of knowing which Unicode encoding
> the binary data is in, so it's gonna consider it to be something like
> Latin-1 unless you tell it otherwise.  So you'll probably have to
> cast the binary string to whatever its actual encoding is (potentially
> lying about the binary parts, which we may or may not get away with,
> depending on who validates the string when), or maybe we just need
> to define rules like  and  for use
> under the :u0 regime.
Of course the file must be opened in binary mode; otherwise line endings and
the like inside the binary data can get mangled, which is bad.

So Perl/Parrot can't autodetect the kind of encoding.
But maybe it should be possible to do something like
[:utf16be_codepoint]? Len: $?len:=(\d+) \n
$?data:=([:raw .]<$len>) \n
i.e., say that the conversion to Unicode is optional?
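
For comparison, here is a minimal sketch of the same "Len: N, then N raw bytes" idea in Python rather than Perl 6, since the rule syntax above is still speculative. The function name read_record and the exact record layout (a "Len: " header line, the payload, then a newline) are assumptions made for illustration. The point it shows is the one discussed above: the stream stays in binary mode, only the ASCII header is interpreted as text, and the payload is decoded later, and only if the caller knows its encoding.

```python
import io

def read_record(stream):
    """Read one 'Len: N\\n<N raw bytes>\\n' record from a binary stream.

    Hypothetical helper: the record layout is an assumption for this sketch.
    """
    header = stream.readline()                 # bytes up to and including b"\n"
    if not header.startswith(b"Len: "):
        raise ValueError("missing 'Len: ' header")
    length = int(header[5:].decode("ascii"))   # the header itself is plain ASCII
    data = stream.read(length)                 # exactly N bytes, left undecoded
    if len(data) != length:
        raise ValueError("truncated payload")
    if stream.read(1) != b"\n":
        raise ValueError("missing newline after payload")
    return data

# The length counts bytes, not characters: "n\u00e9" is 2 characters but
# 3 bytes in UTF-8.
payload = "n\u00e9".encode("utf-8")
stream = io.BytesIO(b"Len: 3\n" + payload + b"\n")
data = read_record(stream)
text = data.decode("utf-8")   # decoding is optional and explicit
```

Decoding happens only on the last line, so a payload that is not text at all passes through untouched.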

> : Is anything known about the internals of pattern matching whether the
> : hypothetical variables will consume (double) space?
> : I'm asking because I imagine getting a tag like "Len: 2" and then
> : having problems with 256MB RAM. Matching shouldn't be a problem according
> : to Apocalypse 5 (see the chapter "RFC 093: Regex: Support for incremental
> : pattern matching"), but might I have trouble using the matched data?
>
> My understanding is that Parrot implements copy-on-write, so you should
> be okay there.
ok, thank you.

> Even the late ones?  :-)
even them - this is the *only* answer I received.

Again:
> : Thank you for all answers!

> Larry
Phil


RE: The .bytes/.codepoints/.graphemes methods

2004-07-11 Thread Austin Hastings


> -----Original Message-----
> From: Jonadab the Unsightly One [mailto:[EMAIL PROTECTED]
> Austin Hastings <[EMAIL PROTECTED]> writes:
>
> > I think this is something that we'll want as a "mode", a la
> > case-insensitivity. Think of it as "mark insensitivity."
>
> Makes sense to me, but...
>
> > Maybe it can just roll into :i?
>
> It will probably get used in _conjunction_ with
> case-insensitivity quite a lot, but I suspect people will want
> to be able to use one without the other.
>
> Since mark-insensitivity is probably mostly a non-issue
> in the ASCII world, it would probably be a better candidate than
> average for being turned on using a unicode character, if we're
> running low on letters for designating these rules.

How about :i ?

:) :) :)

=Austin
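
To make "mark insensitivity" concrete, here is a rough sketch in Python, not a proposal for how a Perl 6 modifier would actually behave. It assumes that "ignoring marks" means decomposing to NFD and dropping combining characters; the helper names strip_marks and mark_insensitive_equal are made up for this example. Case folding is kept as a separate switch, in line with the point above that people will want one without the other.

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD and drop combining marks (accents, diaereses, ...)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for non-combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def mark_insensitive_equal(a, b, ignore_case=False):
    """Compare two strings ignoring marks, and optionally ignoring case too."""
    a, b = strip_marks(a), strip_marks(b)
    if ignore_case:
        a, b = a.casefold(), b.casefold()
    return a == b

same_marks_only = mark_insensitive_equal("r\u00e9sum\u00e9", "resume")
case_still_matters = mark_insensitive_equal("R\u00e9sum\u00e9", "resume")
both_ignored = mark_insensitive_equal("R\u00e9sum\u00e9", "resume", ignore_case=True)
```

Characters that have no canonical decomposition (for example "ø") are left alone by this approach, so it is only an approximation of what a real modifier would need to do.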