core.git: lingucomponent/source

Mike Kaganski (via logerrit) Sat, 23 Nov 2024 01:04:09 -0800

 lingucomponent/source/hyphenator/hyphen/hyphenimp.cxx |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


New commits:
commit 9c14ec81b6c25c7932964382f306dadfefeda518
Author:     Mike Kaganski <mike.kagan...@collabora.com>
AuthorDate: Sat Nov 23 09:52:53 2024 +0500
Commit:     Mike Kaganski <mike.kagan...@collabora.com>
CommitDate: Sat Nov 23 10:03:41 2024 +0100

    tdf#164006: Only use original word's positions, ignore extra encoded length
    
    The encoding of the string passed to Hunspell/hyphen service depends on the
    encoding of the dictionary itself. When the usual UTF-8 encoding is used,
    the resulting octet string may be longer than the original UTF-16 code unit
    count. In that case, the buffer receiving the positions will be
    correspondingly longer. But on return, the buffer will only contain data
    in positions corresponding to the characters, not code units (it is unclear
    if we even need to pass a buffer that large). So just as the following loop
    only iterates up to nWord's length, the calculation of hyphen count must use
    that length, too, not the length of encWord.
    
    I suspect that the use of UTF-16 code units as hyphen positions is wrong;
    it will break on SMP surrogate pairs. The proper approach would be to
    iterate over code points. However, I don't have data to test, so let it be
    TODO/LATER.
    
    Change-Id: Ieed5e696e03cb22e3b48fabc14537372bbe74363
    Reviewed-on: https://gerrit.libreoffice.org/c/core/+/177077
    Reviewed-by: Mike Kaganski <mike.kagan...@collabora.com>
    Tested-by: Jenkins
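
To illustrate the length mismatch described in the commit message, here is a
minimal standalone C++ sketch. It does not use LibreOffice or libhyphen code:
utf16Length and countHyphens are hypothetical helpers, the input is assumed to
be well-formed UTF-8, and the buffer contents in the usage note are made up.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of UTF-16 code units needed to represent a
// well-formed UTF-8 string (no validation is attempted).
std::size_t utf16Length(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8)
    {
        if ((c & 0xC0) == 0x80)
            continue;             // continuation byte: belongs to a previous lead byte
        n += (c >= 0xF0) ? 2 : 1; // 4-byte sequences need a surrogate pair in UTF-16
    }
    return n;
}

// Hypothetical helper: count odd entries (hyphenation points) among the first
// wordLen slots only, ignoring any extra slots allocated for the encoded form.
int countHyphens(const std::vector<char>& hyphens, std::size_t wordLen)
{
    int count = 0;
    for (std::size_t i = 0; i < wordLen && i < hyphens.size(); ++i)
        if (hyphens[i] & 1)
            ++count;
    return count;
}
```

For "Müller" encoded as UTF-8, the byte string has 7 octets while the original
word has 6 UTF-16 code units. If the seventh slot of the positions buffer
happened to hold an odd value, a count bounded by the encoded length would
include it, while a count bounded by the word length would not.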

diff --git a/lingucomponent/source/hyphenator/hyphen/hyphenimp.cxx b/lingucomponent/source/hyphenator/hyphen/hyphenimp.cxx
index c528318dc33d..46071a987f5c 100644
--- a/lingucomponent/source/hyphenator/hyphen/hyphenimp.cxx
+++ b/lingucomponent/source/hyphenator/hyphen/hyphenimp.cxx
@@ -785,7 +785,8 @@ Reference< XPossibleHyphens > SAL_CALL Hyphenator::createPossibleHyphens( const
 
         sal_Int32 nHyphCount = 0;
 
-        for ( sal_Int32 i = 0; i < encWord.getLength(); i++)
+        // FIXME: shouldn't we iterate code points instead?
+        for (sal_Int32 i = 0; i < nWord.getLength(); i++)
         {
             if (hyphens[i]&1)
                 nHyphCount++;
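
The FIXME in the hunk above asks whether the loop should iterate code points
rather than UTF-16 code units. As a sketch of what that could look like in
plain C++ (not the LibreOffice string API, and not a proposed patch), the
hypothetical helper codePointStarts below walks a std::u16string and records
the code-unit index at which each code point begins, treating a high surrogate
followed by a low surrogate as one code point.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: return the starting code-unit index of every code point
// in a UTF-16 string. Unpaired surrogates are treated as single code points.
std::vector<std::size_t> codePointStarts(const std::u16string& s)
{
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        starts.push_back(i);
        const char16_t c = s[i];
        const bool isHigh = c >= 0xD800 && c <= 0xDBFF;
        if (isHigh && i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i; // skip the low surrogate; it belongs to the same code point
    }
    return starts;
}
```

With a string such as u"a\U00010400b", which occupies four code units, the
helper yields three starting indices, so a per-code-point loop would run three
times where a per-code-unit loop runs four times. Whether the hyphen positions
returned by the library line up with code points in that case is exactly what
the commit message leaves untested.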
