RE: Re: [lang] StringUtils.abbreviate is not emoji aware, breaks surrogate pairs

2025-04-14 Thread Carsten Kirschner
Hello Gary, I updated the unit test and, I think, removed the guessing part. This page shows nicely how the family grapheme is composed: https://utf-8-visualizer.ardis.lu/?q=%F0%9F%91%A8%F0%9F%8F%BB%E2%80%8D%F0%9F%91%A9%F0%9F%8F%BB%E2%80%8D%F0%9F%91%A6%F0%9F%8F%BB%E2%80%8D%F0%9F%91%A6%F0%9F%8F%BB
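
For readers who do not want to open the visualizer, the sequence in the link above can also be inspected directly in Java. This is only an illustrative sketch: the class name `FamilyEmojiBreakdown` is made up for this example, and the string constant was decoded by hand from the percent-encoded query string of the URL.

```java
import java.nio.charset.StandardCharsets;

public final class FamilyEmojiBreakdown {

    // Decoded from the percent-encoded query string of the visualizer link.
    // Four people, each followed by the light skin tone modifier U+1F3FB,
    // joined by ZERO WIDTH JOINER (U+200D).
    public static final String FAMILY =
        "\uD83D\uDC68\uD83C\uDFFB\u200D"    // U+1F468 man   + U+1F3FB + ZWJ
        + "\uD83D\uDC69\uD83C\uDFFB\u200D"  // U+1F469 woman + U+1F3FB + ZWJ
        + "\uD83D\uDC66\uD83C\uDFFB\u200D"  // U+1F466 boy   + U+1F3FB + ZWJ
        + "\uD83D\uDC66\uD83C\uDFFB";       // U+1F466 boy   + U+1F3FB

    public static void main(String[] args) {
        // Number of UTF-16 code units (what String.length() reports).
        System.out.println("chars:       " + FAMILY.length());
        // Number of Unicode code points.
        System.out.println("code points: " + FAMILY.codePointCount(0, FAMILY.length()));
        // Number of bytes once encoded as UTF-8.
        System.out.println("UTF-8 bytes: " + FAMILY.getBytes(StandardCharsets.UTF_8).length);
        // One line per code point, in order.
        FAMILY.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

A single visible glyph therefore spans 19 `char` values, 11 code points, and 41 UTF-8 bytes, so an index-based cut can land in the middle of it in several different ways.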

Re: [lang] StringUtils.abbreviate is not emoji aware, breaks surrogate pairs

2025-04-14 Thread Gary Gregory
Hi Carsten, Could you provide a unit test with the expected behavior? The example you gave has console output and assertions commented out, both of which are undesirable. Instead of me guessing, I'd rather you manage expectations and provide a failing/passing set of assertions. TY! Gary On Sun,

Re: [lang] StringUtils.abbreviate is not emoji aware, breaks surrogate pairs

2025-04-13 Thread Gary Gregory
I created https://issues.apache.org/jira/browse/LANG-1770 to track this report. Gary On Fri, Apr 11, 2025 at 10:15 AM Carsten Kirschner wrote: > > Hello, > > The current commons lang3 StringUtils.abbreviate (3.17.0) implementation will > destroy 4 byte emoji characters and larger grapheme clust

[lang] StringUtils.abbreviate is not emoji aware, breaks surrogate pairs

2025-04-11 Thread Carsten Kirschner
Hello, The current commons lang3 StringUtils.abbreviate (3.17.0) implementation will destroy 4-byte emoji characters and larger grapheme clusters. I know that handling graphemes correctly before Java 20 is not possible, but at least a code point aware solution with String.offsetByCodePoints could
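
The code point aware approach mentioned in this message could look roughly like the sketch below. It is not the Commons Lang implementation and not the fix tracked in LANG-1770; the class `CodePointAbbreviate` and its `abbreviate` method are hypothetical names used only for illustration. The sketch measures the width in code points instead of `char` values and uses `String.offsetByCodePoints` to find the cut position, so a surrogate pair is never split.

```java
public final class CodePointAbbreviate {

    private static final String ELLIPSIS = "...";

    /**
     * Hypothetical helper (not part of Commons Lang): shortens {@code str} to at most
     * {@code maxCodePoints} code points, ellipsis included, without cutting through
     * a surrogate pair.
     */
    public static String abbreviate(String str, int maxCodePoints) {
        if (maxCodePoints < ELLIPSIS.length() + 1) {
            throw new IllegalArgumentException("Minimum abbreviation width is 4");
        }
        if (str == null || str.codePointCount(0, str.length()) <= maxCodePoints) {
            return str; // nothing to shorten
        }
        // char index just past the first (maxCodePoints - 3) code points
        int end = str.offsetByCodePoints(0, maxCodePoints - ELLIPSIS.length());
        return str.substring(0, end) + ELLIPSIS;
    }

    public static void main(String[] args) {
        // 'a', 'b', U+1F600 (one code point, two chars), then "cdefgh"
        String s = "ab\uD83D\uDE00cdefgh";

        // Cutting at a char index can leave a lone high surrogate behind.
        String naive = s.substring(0, 3) + ELLIPSIS;
        System.out.println(Character.isHighSurrogate(naive.charAt(2))); // true

        // Cutting at a code point boundary keeps the pair intact.
        String safe = abbreviate(s, 6);
        System.out.println(safe.equals("ab\uD83D\uDE00...")); // true
    }
}
```

This only protects surrogate pairs. A multi-code-point grapheme cluster, such as a ZWJ family sequence, can still be cut between its code points, which is the limitation the message itself acknowledges.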