Nice. Well, happy to discuss how I might be helpful — implementation, API design, etc.
For the work I’m doing on UAX 29, the key API is unicode.Is. I am satisfied with the perf so far. unicode.Is dominates the profiling, but that’s to be expected, as my scanner is basically a tight loop evaluating rune categories. Certainly open to using a different trie-driven API.

On Fri, Apr 17, 2020 at 1:47 AM <m...@golang.org> wrote:

> Most of the x/text packages use tries and not rangetables. These allow
> arbitrary data (as long as it fits in an int) to be associated with runes
> and allow operating on utf8 without having to convert to runes.
> https://godoc.org/golang.org/x/text/internal/triegen. But that’s not a
> requirement.
>
> The package
> https://godoc.org/golang.org/x/text/internal/gen/bitfield converts Go
> structs to ints and can be used to pack the rune data in a convenient way.
>
> Furthermore Package
> https://godoc.org/golang.org/x/text/internal/ucd
> can be used for reading UCD files
>
> And Package
> https://godoc.org/golang.org/x/text/internal/gen
> can be used to generate Go tables other than the trie and includes
> utilities to generate canonical x/text files, such as including the Unicode
> and CLDR versions.
>
> The top-level file gen.go is used to orchestrate building x/text and
> captures dependencies between packages.
>
> I may have some designs lying around for the API.
>
> On Thu, 16 Apr 2020 at 21:46 Matt Sherman <mwsher...@gmail.com> wrote:
>
>> Great. Yes, the data files are here:
>> https://unicode.org/reports/tr41/tr41-26.html#Props0
>>
>> I’ve done a proof of concept here: https://github.com/clipperhouse/uax29
>>
>> To do it properly, I assume we’d want to use the house style here?
>> https://github.com/golang/text/blob/master/unicode/rangetable/gen.go
>>
>> On Thu, Apr 16, 2020 at 1:52 PM <m...@golang.org> wrote:
>>
>>> Yes that would be interesting. Especially if it can be generated from
>>> the Unicode raw data upon updates.
>>>
>>> On Wed, 15 Apr 2020 at 23:56 Ian Lance Taylor <i...@golang.org> wrote:
>>>
>>>> [ +mpvl ]
>>>>
>>>> On Wed, Apr 15, 2020 at 2:30 PM Matt Sherman <mwsher...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi, I am working on a tokenizer based on Unicode text segmentation
>>>> > (UAX 29). I am wondering if there would be an interest in adding range
>>>> > tables for word break categories to the x/text or unicode packages. It
>>>> > appears they could be code-gen’d alongside the rest of the range tables.
>>>> >
>>>> > Pardon if this is already being done and I have missed it. I see some
>>>> > mention of those categories (e.g. ALetter) in other places.
>>>> >
>>>> > My code is here. Thanks.
>>>> >
>>>> > --
>>>> > You received this message because you are subscribed to the Google
>>>> > Groups "golang-nuts" group.
>>>> > To unsubscribe from this group and stop receiving emails from it,
>>>> > send an email to golang-nuts+unsubscr...@googlegroups.com.
>>>> > To view this discussion on the web visit
>>>> > https://groups.google.com/d/msgid/golang-nuts/2a058556-da51-46d0-a41b-28e323541332%40googlegroups.com