It actually appears that Swift already links against ICU. I'll see if I can hook Swift up to ICU's grapheme separation code.
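
For reference, the regional indicator handling described in the quoted message below amounts to pairing the indicators off two at a time. Here is a rough Swift sketch of that idea. It is not ICU's code. The helper name regionalIndicatorBreaks is made up for illustration, it assumes the input is a plain array of scalars, and it only reports breaks that fall inside a run of regional indicators:

```swift
// Regional indicator symbols occupy U+1F1E6 through U+1F1FF.
func isRegionalIndicator(_ scalar: Unicode.Scalar) -> Bool {
    return (0x1F1E6...0x1F1FF).contains(scalar.value)
}

// Hypothetical helper: returns the scalar offsets at which a grapheme break
// falls inside a run of regional indicators, pairing them two at a time.
func regionalIndicatorBreaks(in scalars: [Unicode.Scalar]) -> [Int] {
    var breaks: [Int] = []
    var runLength = 0  // indicators seen so far in the current run
    for (index, scalar) in scalars.enumerated() {
        if isRegionalIndicator(scalar) {
            // The third, fifth, ... indicator in a run starts a new flag.
            if runLength > 0 && runLength % 2 == 0 {
                breaks.append(index)
            }
            runLength += 1
        } else {
            runLength = 0
        }
    }
    return breaks
}

// Two flags pasted together: four regional indicator scalars.
let flags = Array("\u{1F1FA}\u{1F1F8}\u{1F1E8}\u{1F1E6}".unicodeScalars)
print(regionalIndicatorBreaks(in: flags))  // [2]
```

The point is only that the decision needs a running count, which a rule that looks at two adjacent scalars cannot keep.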

On Sun, Dec 20, 2015 at 10:41 PM, Michael Buckley <mich...@buckleyisms.com> wrote:

> After reading through the ICU sources, if I understand them correctly, ICU
> uses the Aho–Corasick algorithm to determine grapheme breaks, word breaks
> and line breaks, and then does some post-processing after matching using
> the algorithm.
>
> This allows ICU to solve the regional indicator problem by including a
> pattern that matches 3 regional indicator characters in a row and inserts a
> grapheme break after the second. This does not actually modify the string
> by adding a zero-width space or anything like that.
>
> While this approach can solve the regional indicator problem efficiently,
> it cannot solve the problem with the zero-width joiner emoji sequences as
> easily. This is because Aho–Corasick is linear in the length of the text +
> the total length of the patterns + the number of matches, and the emoji
> problem would require a pattern for every emoji sequence we want to
> support.
>
> However, after reading UTR #51 again, we may want to treat all emoji
> joined by a ZWJ as a single extended grapheme cluster, whether they form a
> known sequence or not. That's because UTR #51 leaves the exact sequences as
> implementation-defined. It includes a list of currently-known implemented
> sequences, but allows for implementers to add their own sequences.
>
> Which means that Ubuntu could, for example, support a sequence of DOG FACE
> + ZWJ + PILE OF POO, and represent it with a glyph of a dog doing its
> business. We basically have two options here. We could treat Swift as an
> Apple-platform centric language and implement only the sequences that
> appear on Apple platforms, or we could implement a rule that any emoji +
> ZWJ + any emoji has no break. As Dmitri pointed out, this would mean Swift
> would report strings of invalid sequences as a single character, which
> could be confusing.
> But I posit that the situation we have now, reporting valid strings as
> multiple characters, is also confusing, and much more likely. It's unlikely
> that anyone is going to stick a ZWJ between emoji unless they intend to
> make a sequence from it.
>
> Incidentally, this is what ICU does. You can test this yourself in
> TextEdit by typing HEAVY BLACK HEART followed by ZWJ ad infinitum, then
> pressing the left arrow key once and watching TextEdit treat the sequence
> as a single character, causing the cursor to jump to the beginning of the
> string. ICU, however, does hard-code the emoji that are currently used by
> Apple emoji sequences, so you can't do the same thing with PILE OF POO.
> This makes sense in an ICU context, since it's only implementing the Apple
> sequences, but if we want Swift to be more platform-agnostic, we would want
> this behavior for any emoji.
>
> ICU's implementation fixes the regional indicator problem, but the
> implementation is large and moderately complicated. Just throwing this out
> there, but would it be possible to add ICU as a dependency to Swift and
> just use its implementation? I'm sure this would be a nightmare to work out
> license and logistics-wise. (It would probably necessitate that ICU
> development be opened up to the same degree that other Swift dependencies
> are.) I also understand that adding any dependencies at all is less than
> ideal. But this seems like a perfect situation for some code sharing. We
> have a moderately large and complicated library that is being updated with
> new emoji support when new emoji are added anyway. It's fast, it's already
> well-used, and we'd have to duplicate a lot of what it does to solve the
> same problems if we didn't use it.
>
> As a bonus, we could link to the system-supplied libicu on OS X and iOS,
> so Swift apps would automatically get the latest emoji support when users
> update their OSs. We would still have to bundle it for other OSs.
>
> I know that there are a lot of downsides to making it a dependency, but I
> wanted to throw the idea out there to see if it made sense.
>
> On Fri, Dec 18, 2015 at 6:22 AM, Michael Buckley <mich...@buckleyisms.com>
> wrote:
>
>> Thanks for the response, Dmitri. My comments inline below.
>>
>> On Fri, Dec 18, 2015 at 3:29 AM, Dmitri Gribenko <griboz...@gmail.com>
>> wrote:
>>>
>>> One thing to do would be to check Apple's ICU implementation, which
>>> (I think) implements some extra handling for UTR #51
>>> (http://opensource.apple.com/release/os-x-1011/) to see how it deals
>>> with this, whether it introduces tailoring, and if so, in what way.
>>
>> I will look into that. I had always thought that would have been part of
>> Core Text, and not open sourced. It is great to know that it is
>> open-sourced.
>>
>>> My primary concern with the fix in the PR is that it seems to change the
>>> segmentation behavior for other sequences. The grapheme cluster
>>> segmentation algorithm is local and stateless. It only looks at two
>>> adjacent Unicode scalars. This means that adding a rule like "ZWJ
>>> no_boundary Emoji" will affect all sequences, even those that are not a
>>> grouping as defined in UTR #51 (for example, "Latin letter, ZWJ, Emoji":
>>> the three scalars would be grouped).
>>
>> Apologies, I forgot to mention that disadvantage. It does change the
>> segmentation behavior for other sequences, which was one of the reasons I
>> was on the fence about whether this should go through the swift-evolution
>> process.
>>
>>> This is the same issue as multiple flags pasted together (which are
>>> represented as regional indicator characters). The current algorithm just
>>> does not have enough information to split them apart; it needs to look at
>>> a wider part of the string.
>>
>> I could be reading the Unicode standard incorrectly, but it appears that
>> this might be the intended behavior for the flag characters.
>> I definitely agree that it's not ideal.
>>
>>> I would be much happier with a solution that only changed the
>>> segmentation for the cases covered by the TR, but I understand it might
>>> have performance implications. I think we should try to add such a
>>> tailoring, and benchmark it.
>>
>> Just so that I understand what you mean by tailoring, you mean switching
>> to a possibly stateful algorithm which can consider more than just two
>> adjacent characters when grouping, right?
>>
>>> The change that adds the first tailoring to the algorithm might be
>>> significant enough. But I think it would be a question of whether we want
>>> any tailoring at all, not about specific tailoring.
>>
>> Thanks for the clarification. Just to be sure, if this change wasn't as
>> problematic, but still changed the behavior of Swift.String, you're saying
>> it would not be important enough for swift-evolution? As a concrete
>> example, if I was just proposing to fix the skin tone emoji, but not the
>> ZWJ sequences, would it be considered just a bug fix?
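
To make the trade-off in the quoted discussion concrete, here is a rough Swift sketch contrasting a rule that sees only two adjacent scalars with one that also looks at the scalar before the ZWJ. Everything here is an assumption made for illustration. The function names are hypothetical, isEmoji is a crude stand-in that checks two hard-coded code point ranges rather than the real Unicode emoji data, and none of this is how the standard library or ICU actually implements segmentation:

```swift
let zwj: Unicode.Scalar = "\u{200D}"  // ZERO WIDTH JOINER

// Crude stand-in for an emoji test (assumption): two hard-coded blocks only.
// It misses many emoji, for example HEAVY BLACK HEART (U+2764).
func isEmoji(_ scalar: Unicode.Scalar) -> Bool {
    let value = scalar.value
    return (0x1F300...0x1F6FF).contains(value) || (0x1F900...0x1F9FF).contains(value)
}

// Pairwise rule ("ZWJ no_boundary Emoji"): looks only at two adjacent scalars.
func pairwiseNoBreak(_ previous: Unicode.Scalar, _ next: Unicode.Scalar) -> Bool {
    return previous == zwj && isEmoji(next)
}

// Contextual rule ("any emoji + ZWJ + any emoji"): also needs the scalar
// that came before the ZWJ, so it is no longer a two-scalar decision.
func contextualNoBreak(_ beforeZWJ: Unicode.Scalar?,
                       _ previous: Unicode.Scalar,
                       _ next: Unicode.Scalar) -> Bool {
    guard let first = beforeZWJ else { return false }
    return isEmoji(first) && previous == zwj && isEmoji(next)
}

let dog: Unicode.Scalar = "\u{1F436}"  // DOG FACE
let poo: Unicode.Scalar = "\u{1F4A9}"  // PILE OF POO
let latinA: Unicode.Scalar = "a"

// The pairwise rule cannot see what precedes the ZWJ, so it suppresses the
// break before PILE OF POO whether a letter or an emoji came first.
print(pairwiseNoBreak(zwj, poo))            // true
print(contextualNoBreak(latinA, zwj, poo))  // false
print(contextualNoBreak(dog, zwj, poo))     // true
```

The contextual version avoids grouping "Latin letter, ZWJ, Emoji", at the cost of carrying one extra scalar of state through the segmentation loop. Whether that cost is measurable would need benchmarking.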

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev