Nice. Happy to discuss how I might be helpful: implementation, API
design, etc.

For the work I’m doing on UAX 29, the key API is unicode.Is. I am satisfied
with the perf so far. unicode.Is dominates the profiling, but that’s to be
expected, as my scanner is basically a tight loop evaluating rune
categories. Certainly open to using a different trie-driven API.

On Fri, Apr 17, 2020 at 1:47 AM <m...@golang.org> wrote:

> Most of the x/text packages use tries and not rangetables. These allow
> arbitrary data (as long as it fits in an int) to be associated with runes
> and allow operating on utf8 without having to convert to runes. See
> https://godoc.org/golang.org/x/text/internal/triegen. But that’s not a
> requirement.
>
> The package
> https://godoc.org/golang.org/x/text/internal/gen/bitfield converts Go
> structs to ints and can be used to pack the rune data in a convenient way.
>
> Furthermore, the package
> https://godoc.org/golang.org/x/text/internal/ucd
> can be used for reading UCD files.
>
> And the package
> https://godoc.org/golang.org/x/text/internal/gen
> can be used to generate Go tables other than the trie and includes
> utilities to generate canonical x/text files, such as including the Unicode
> and CLDR versions.
>
> The top-level file gen.go is used to orchestrate building x/text and
> captures dependencies between packages.
>
> I may have some designs lying around for the API.
>
> On Thu, 16 Apr 2020 at 21:46 Matt Sherman <mwsher...@gmail.com> wrote:
>
>> Great. Yes, the data files are here:
>> https://unicode.org/reports/tr41/tr41-26.html#Props0
>>
>> I’ve done a proof of concept here: https://github.com/clipperhouse/uax29
>>
>> To do it properly, I assume we’d want to use the house style here?
>> https://github.com/golang/text/blob/master/unicode/rangetable/gen.go
>>
>> On Thu, Apr 16, 2020 at 1:52 PM <m...@golang.org> wrote:
>>
>>> Yes that would be interesting. Especially if it can be generated from
>>> the Unicode raw data upon updates.
>>>
>>> On Wed, 15 Apr 2020 at 23:56 Ian Lance Taylor <i...@golang.org> wrote:
>>>
>>>> [ +mpvl ]
>>>>
>>>> On Wed, Apr 15, 2020 at 2:30 PM Matt Sherman <mwsher...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi, I am working on a tokenizer based on Unicode text segmentation
>>>> (UAX 29). I am wondering if there would be an interest in adding range
>>>> tables for word break categories to the x/text or unicode packages. It
>>>> appears they could be code-gen’d alongside the rest of the range tables.
>>>> >
>>>> > Pardon if this is already being done and I have missed it. I see some
>>>> mention of those categories (e.g. ALetter) in other places.
>>>> >
>>>> > My code is here. Thanks.
>>>> >
>>>> > --
>>>> > You received this message because you are subscribed to the Google
>>>> Groups "golang-nuts" group.
>>>> > To unsubscribe from this group and stop receiving emails from it,
>>>> send an email to golang-nuts+unsubscr...@googlegroups.com.
>>>> > To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/golang-nuts/2a058556-da51-46d0-a41b-28e323541332%40googlegroups.com
>>>> .
>>>>
>>>

