FWIW

Since release 2023.02, there's a Unicode class, with a class method .version:

$ raku -e 'say Unicode.version'
v15.0
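For comparison (a sketch in Python, not Raku; purely illustrative), other runtimes expose the version of their bundled Unicode Character Database in much the same way, e.g. Python's stdlib unicodedata module:

```python
import unicodedata

# unidata_version reports the UCD version this Python build ships with,
# analogous in spirit to Raku's Unicode.version.
ucd = unicodedata.unidata_version
print(ucd)
```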

> On 9 Dec 2024, at 15:36, William Michels <w...@caa.columbia.edu> wrote:
> 
> Nudging this conversation, ...to follow progress since 2020.
> 
> Anyone want to chime in? 
> 
> Is a $*UNICODE dynamic variable a possibility?
> 
> Related:  I'm re-reading Matéu's comment, which (I think) says to let ICU 
> live in a module somewhere.
> 
> Best Regards, Bill.
> 
>> On Sep 29, 2020, at 21:19, Matthew Stuckwisch <ma...@softastur.org> wrote:
>> 
>> In #raku it was mentioned that it would be nice to have a $*UNICODE variable 
>> of sorts that reports back the version, but I'm not sure how that would work 
>> from an implementation POV.
>> 
>> I'm also late to the discussion, so pardon me jumping back a bit.  
>> Basically, ICU is something that lets you quickly add in robust Unicode 
>> support.  But it's also a swiss army knife and overkill for what Raku 
>> generally needs (at whichever layer it's implemented), and also limiting in 
>> some ways, because you become beholden to their structures which, as Samantha 
>> pointed out, don't work for MoarVM's approach.  Rolling your own has a lot 
>> of advantages.
>> 
>> Beyond the UCD and UCA (sorting), everything else really should go into 
>> module land, since it's heavily based on an ever-changing and growing CLDR, 
>> and even then, there are good arguments for putting sorting in module 
>> space too.  For reasons like performance, code clarity, data size, etc., 
>> companies have rolled their own ICU-like libraries (Google's Closure for JS, 
>> TwitterCLDR in Ruby, etc.) running on the same CLDR data.  In Raku (shameless 
>> self-plug), a lot is already available in the Intl namespace.  There are 
>> actually some very cool things that can be done mixing CLDR and Raku, like 
>> creating new character-class-like tokens, or even extending built-ins — they 
>> just don't have any business being near core, just... core-like :-)
>> 
>> Matéu
>> 
>> 
>> PS: For understanding some of Samantha's incredible work, her talks at the 
>> Amsterdam convention are really great, and Perl Weekly has an archive of her 
>> grant write ups:
>> Articles: https://perlweekly.com/a/samantha-mcvey.html
>> High End Unicode in Perl 6: https://www.youtube.com/watch?v=Oj_lgf7A2LM
>> Unicode Internals of Perl 6: https://www.youtube.com/watch?v=9Vv7nUUDdeA
>> 
>> 
>>> On Sep 29, 2020, at 3:14 PM, William Michels via perl6-users 
>>> <perl6-us...@perl.org> wrote:
>>> 
>>> Thank you, Samantha!
>>> 
>>> An outstanding question is one posed by Joseph Brenner--that
>>> is--knowing which version of the Unicode standard is supported by
>>> Raku. I grepped through two files, one called "unicode.c" and the
>>> other called "unicode_db.c". They're both located in rakudo at:
>>> /rakudo/rakudo-2020.06/nqp/MoarVM/src/strings/ .
>>> 
>>> Below are the first 4 lines of my grep results. As you can see
>>> (above/below), rakudo-2020.06 supports Unicode12.1.0:
>>> 
>>> ~$ raku -ne '.say if .grep(/unicode/)'
>>> ~/rakudo/rakudo-2020.06/nqp/MoarVM/src/strings/unicode_db.c
>>> # For terms of use, see http://www.unicode.org/terms_of_use.html
>>> # The UAXes can be accessed at 
>>> http://www.unicode.org/versions/Unicode12.1.0/
>>> From http://unicode.org/copyright.html#Exhibit1 on 2017-11-28:
>>> Distributed under the Terms of Use in http://www.unicode.org/copyright.html.
>>> <TRUNCATED>
>>> 
>>> It would be really interesting to follow your Unicode work, Samantha.
>>> The ideas you propose are interesting, and everyone hopes for speed
>>> improvements. Is there any place Raku users can go to read
>>> updates--maybe a grant report, blog, or GitHub issue? Or maybe right
>>> here, on the Perl6-Users mailing list? Thanks in advance.
>>> 
>>> Best, Bill.
>>> 
>>> W. Michels, Ph.D.
>>> 
>>> 
>>> 
>>> On Sun, Sep 27, 2020 at 4:03 AM Samantha McVey <samant...@posteo.net> wrote:
>>>> 
>>>> So MoarVM uses its own database of the UCD. One nice thing is this can
>>>> probably be faster than calling into ICU to look up information on each
>>>> codepoint in a long string. Secondly, it implements its own text data
>>>> structures, so the nice ICU features for doing that would be difficult
>>>> to use.
>>>> 
>>>> In my opinion, it could make sense to use ICU for things like localized
>>>> collation (sorting). It also could make sense to use ICU for Unicode
>>>> property lookups for properties that don't have to do with grapheme
>>>> segmentation or casing. This would be a lot of work, but if something
>>>> like this were implemented it would probably happen in the context of a
>>>> larger rethinking of how we use Unicode. Though everything is complicated
>>>> by the fact that we support lots of complicated regular expressions on
>>>> different Unicode properties. I guess first I'd start by benchmarking the
>>>> speed of ICU and comparing it to the current implementation.
>>>> 
>>>> 
>> 
> 
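Samantha's closing suggestion (benchmark property-lookup speed before committing to ICU) could be sketched along these lines. The snippet is Python against its built-in UCD, purely to illustrate the shape of such a micro-benchmark, not a measurement of ICU or MoarVM:

```python
import timeit
import unicodedata

# A long mixed-script string, a plausible stress case for per-codepoint lookups.
text = "Hél😀lo, wörld! " * 10_000

def categories(s):
    # Look up the General_Category property for every codepoint.
    return [unicodedata.category(c) for c in s]

# Time several runs; a real comparison would run the same loop against an
# ICU binding and against MoarVM's internal lookup.
elapsed = timeit.timeit(lambda: categories(text), number=5)
print(f"{len(text)} codepoints x 5 runs: {elapsed:.3f}s")
```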
