Re: [go-nuts] How to test for unicode property "Variation Selector"

Konstantin Khomoutov Fri, 10 Mar 2017 07:03:56 -0800

On Fri, 10 Mar 2017 05:03:25 -0800 (PST)
JohnGB <jgbeck...@gmail.com> wrote:


> I have some text that I'm processing which includes emoji.  I'm
> trying to strip out all the emoji by checking whether a rune
> fulfils unicode.IsSymbol().  This strips out the emoji, but it seems
> that some emoji include a unicode variation selector that is not
> stripped out.  So I also need to check if a rune is a variation
> selector, but I don't see any sane way of doing that.  Is there a
> direct way to check if a rune is a variation selector, or is it
> possibly a bug that the variation selector is not being grouped with
> the emoji as part of the rune?  
> 
> Either way, the only robust method that I can think of now would be
> to create my own function to parse the unicode.Variation_Selector
> RangeTable to check for this.  But I'm hoping there is a simpler way
> that I'm missing.
> 
> The variation selector that I'm currently getting is 0xFE0F, but I'd
> like my code to not rely on only ever getting this variation selector.

NFD is supposed to force decomposition of the base and combining
characters so that the latter follow the former, in certain order.
So, to me, it appears that to remove all or certain combining
characters, you should roll like this:

1) Get the NFD form of your text.
2) Iterate over its runes.
3) As soon as you detect a non-combining-character rune,
   examine it, and if it's deemed as requiring removal of some of
   the CCs which may follow it, check each of the following runes
   for the property of being a CC and drop the ones you don't want.
   When you detect a non-CC, go to (3).
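
The steps above can be sketched roughly as follows.  This is only an
illustration under a few assumptions: the input is taken to be in NFD
already (so the sketch needs nothing outside the standard library),
"being a CC" is approximated by membership in the general category Mn
(nonspacing mark), and the names stripMarksAfter and dropMarks are
made up for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripMarksAfter walks the runes of nfd, which is assumed to be in NFD
// already, and drops every nonspacing mark (category Mn) that follows a
// base rune for which dropMarks returned true.
// Both names are hypothetical; they exist only for this sketch.
func stripMarksAfter(nfd string, dropMarks func(base rune) bool) string {
	var sb strings.Builder
	drop := false
	for _, r := range nfd {
		if unicode.Is(unicode.Mn, r) {
			// A mark: skip it if the last base rune asked for that.
			if drop {
				continue
			}
		} else {
			// A base rune: decide what happens to the marks after it.
			drop = dropMarks(r)
		}
		sb.WriteRune(r)
	}
	return sb.String()
}

func main() {
	// "o" followed by U+0303 is the NFD form of 'õ'.
	s := "\u231B\uFE0E Ho\u0303la"

	// Drop the marks that follow symbols, keep all the others.
	out := stripMarksAfter(s, unicode.IsSymbol)
	fmt.Printf("%+q\n", out)
}
```

Whether Mn is the right test for your data is a separate question; the
predicate is passed in so that the decision stays with the caller.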

So I tried to walk over an NFD form of a Unicode string including
an emoji symbol followed by the variation selector U+FE0E, as well as
the characters 'õ' and 'ß', to see what properties the decomposed runes
will have.

Here's the program:

----------------8<----------------
package main

import (
        "fmt"
        "golang.org/x/text/unicode/norm"
        "unicode/utf8"
)

const s = "\u231B\uFE0E Hõla, Straße!"

func main() {
        var di norm.Iter

        di.InitString(norm.NFD, s)

        for !di.Done() {
                b := di.Next()

                fmt.Printf("L=%d: %#v\n", len(b), b)

                at := 0
                for {
                        rb := b[at:]

                        r, n := utf8.DecodeRune(rb)

                        pr := norm.NFD.Properties(rb)
                        ccc := pr.CCC()

                        fmt.Printf("\tr=U+%04X, n=%d; ccc=%d\n", r, n, ccc)

                        at += n
                        if at == len(b) {
                                break
                        }
                }
        }

        fmt.Println("OK")
}
----------------8<----------------

And with the recent "master" of golang.org/x/text, I got the following
printout:

L=3: []byte{0xe2, 0x8c, 0x9b}
        r=U+231B, n=3; ccc=0
L=3: []byte{0xef, 0xb8, 0x8e}
        r=U+FE0E, n=3; ccc=0
L=1: []byte{0x20}
        r=U+0020, n=1; ccc=0
L=1: []byte{0x48}
        r=U+0048, n=1; ccc=0
L=3: []byte{0x6f, 0xcc, 0x83}
        r=U+006F, n=1; ccc=0
        r=U+0303, n=2; ccc=230
L=1: []byte{0x6c}
        r=U+006C, n=1; ccc=0
L=1: []byte{0x61}
        r=U+0061, n=1; ccc=0
L=1: []byte{0x2c}
        r=U+002C, n=1; ccc=0
L=1: []byte{0x20}
        r=U+0020, n=1; ccc=0
L=1: []byte{0x53}
        r=U+0053, n=1; ccc=0
L=1: []byte{0x74}
        r=U+0074, n=1; ccc=0
L=1: []byte{0x72}
        r=U+0072, n=1; ccc=0
L=1: []byte{0x61}
        r=U+0061, n=1; ccc=0
L=2: []byte{0xc3, 0x9f}
        r=U+00DF, n=2; ccc=0
L=1: []byte{0x65}
        r=U+0065, n=1; ccc=0
L=1: []byte{0x21}
        r=U+0021, n=1; ccc=0

That is, NFD had decomposed 'õ' into 'o' (U+006F) and a combining
tilde (U+0303), while both the 'ß' and the U+231B U+FE0E sequence were
left as is.

Notice that the only rune on which ccc != 0 is U+0303.

If my assumption is correct, the properties of U+FE0E should have
included the fact that it's a CC because [1] states those "variation
selectors" are CCs.  So I'd say currently the Unicode data in
golang.org/x/text has a bug.
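
Independently of the CCC question, the standard "unicode" package
exports the Variation_Selector range table the original post mentions,
and unicode.Is() tests a rune against any such table, so no manual
parsing of the table is needed.  A minimal sketch; the helper names
isVariationSelector and stripVariationSelectors are made up here:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isVariationSelector reports whether r has the Unicode property
// Variation_Selector, using the range table from the standard library.
// The function name is hypothetical.
func isVariationSelector(r rune) bool {
	return unicode.Is(unicode.Variation_Selector, r)
}

// stripVariationSelectors returns s with all variation selectors removed.
// strings.Map drops a rune when the mapping function returns a negative value.
func stripVariationSelectors(s string) string {
	return strings.Map(func(r rune) rune {
		if isVariationSelector(r) {
			return -1
		}
		return r
	}, s)
}

func main() {
	// Show the property next to the general category Mn for a few runes.
	for _, r := range []rune{0x231B, 0xFE0E, 0xFE0F, 0x0303} {
		fmt.Printf("U+%04X: variation selector=%t, Mn=%t\n",
			r, isVariationSelector(r), unicode.Is(unicode.Mn, r))
	}

	fmt.Printf("%+q\n", stripVariationSelectors("\u231B\uFE0E Hi"))
}
```

This does not rely on any particular selector such as U+FE0F showing
up; it covers whatever the table shipped with your Go version lists.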

1. http://www.unicode.org/charts/PDF/UFE00.pdf
