On Fri, 10 Mar 2017 05:03:25 -0800 (PST) JohnGB <jgbeck...@gmail.com> wrote:
> I have some text that I'm processing which includes emoji. I'm
> trying to strip out all the emoji by checking whether a rune
> fulfils unicode.IsSymbol(). This strips out the emoji, but it seems
> that some emoji include a unicode variation selector that is not
> stripped out. So I also need to check if a rune is a variation
> selector, but I don't see any sane way of doing that. Is there a
> direct way to check if a rune is a variation selector, or is it
> possibly a bug that the variation selector is not being grouped with
> the emoji as part of the rune?
>
> Either way, the only robust method that I can think of now would be
> to create my own function to parse the unicode.Variation_Selector
> RangeTable to check for this. But I'm hoping there is a simpler way
> that I'm missing.
>
> The variation selector that I'm currently getting is 0xFE0F, but I'd
> like my code to not rely on only ever getting this variation selector.

NFD is supposed to force decomposition of the base and combining
characters so that the latter follow the former, in a certain order.

So, to me, it appears that to remove all or certain combining
characters (CCs), you should proceed like this:

1) Get the NFD form of your text.
2) Iterate over its runes.
3) As soon as you detect a non-combining-character rune, examine it,
   and if it's deemed as requiring removal of some of the CCs which
   may follow it, start checking whether the following runes have the
   property of being a CC. When you detect a non-CC, go to (3).

So I tried to walk over an NFD form of a Unicode string including an
emoji symbol followed by U+FE0E, as well as the characters 'õ' and 'ß',
to see what properties the decomposed runes would have.
Here's the program:

----------------8<----------------
package main

import (
	"fmt"
	"unicode/utf8"

	"golang.org/x/text/unicode/norm"
)

// An emoji (U+231B) followed by a variation selector (U+FE0E),
// plus 'õ' and 'ß' for comparison.
const s = "\u231B\uFE0E Hõla, Straße!"

func main() {
	var di norm.Iter
	di.InitString(norm.NFD, s)
	for !di.Done() {
		// Each call returns the bytes of the next NFD segment.
		b := di.Next()
		fmt.Printf("L=%d: %#v\n", len(b), b)
		// Decode every rune in the segment and print its
		// canonical combining class (CCC).
		for at := 0; at < len(b); {
			rb := b[at:]
			r, n := utf8.DecodeRune(rb)
			pr := norm.NFD.Properties(rb)
			ccc := pr.CCC()
			fmt.Printf("\tr=U+%04X, n=%d; ccc=%d\n", r, n, ccc)
			at += n
		}
	}
	fmt.Println("OK")
}
----------------8<----------------

And with the recent "master" of golang.org/x/text, I got the
following printout:

L=3: []byte{0xe2, 0x8c, 0x9b}
	r=U+231B, n=3; ccc=0
L=3: []byte{0xef, 0xb8, 0x8e}
	r=U+FE0E, n=3; ccc=0
L=1: []byte{0x20}
	r=U+0020, n=1; ccc=0
L=1: []byte{0x48}
	r=U+0048, n=1; ccc=0
L=3: []byte{0x6f, 0xcc, 0x83}
	r=U+006F, n=1; ccc=0
	r=U+0303, n=2; ccc=230
L=1: []byte{0x6c}
	r=U+006C, n=1; ccc=0
L=1: []byte{0x61}
	r=U+0061, n=1; ccc=0
L=1: []byte{0x2c}
	r=U+002C, n=1; ccc=0
L=1: []byte{0x20}
	r=U+0020, n=1; ccc=0
L=1: []byte{0x53}
	r=U+0053, n=1; ccc=0
L=1: []byte{0x74}
	r=U+0074, n=1; ccc=0
L=1: []byte{0x72}
	r=U+0072, n=1; ccc=0
L=1: []byte{0x61}
	r=U+0061, n=1; ccc=0
L=2: []byte{0xc3, 0x9f}
	r=U+00DF, n=2; ccc=0
L=1: []byte{0x65}
	r=U+0065, n=1; ccc=0
L=1: []byte{0x21}
	r=U+0021, n=1; ccc=0

That is, NFD decomposed 'õ' into 'o' (U+006F) and the combining tilde
(U+0303), while 'ß' and the U+231B U+FE0E sequence were both left as
is. Notice that the only rune for which ccc != 0 is U+0303.

If my assumption is correct, the properties of U+FE0E should have
included the fact that it's a CC, because [1] states those "variation
selectors" are CCs. So I'd say the Unicode data in golang.org/x/text
currently has a bug.

1. http://www.unicode.org/charts/PDF/UFE00.pdf
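
For the original question, here is a minimal sketch of step (3) that
uses only the standard library: unicode.Is() can test a rune against
the unicode.Variation_Selector RangeTable directly, so there's no need
to parse the table by hand. The names isVariationSelector and
stripSymbols are hypothetical helpers made up for this example, and it
assumes that "emoji" means "whatever unicode.IsSymbol() reports", as in
the quoted message. One caveat on the experiment above: CCC() reports
the Canonical_Combining_Class, which the Unicode data lists as 0 for
the variation selectors even though their general category is Mn, so
ccc=0 for U+FE0E may not by itself indicate bad data.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isVariationSelector reports whether r has the Unicode
// Variation_Selector property (hypothetical helper name).
func isVariationSelector(r rune) bool {
	return unicode.Is(unicode.Variation_Selector, r)
}

// stripSymbols (hypothetical helper) removes every rune for which
// unicode.IsSymbol is true, together with any variation selectors
// that immediately follow a removed rune. Selectors that follow a
// rune which was kept are left in place.
func stripSymbols(s string) string {
	var b strings.Builder
	dropped := false // was the most recent non-selector rune removed?
	for _, r := range s {
		switch {
		case isVariationSelector(r):
			if !dropped {
				b.WriteRune(r)
			}
		case unicode.IsSymbol(r):
			dropped = true
		default:
			dropped = false
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", stripSymbols("\u231B\uFE0F Hõla, Straße!"))
}
```

Since the check goes through the whole Variation_Selector table, the
code does not depend on the selector being U+FE0F specifically.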