Re: New ICU versions, C API: the details

Herbert Dürr Sun, 04 May 2025 22:15:10 -0700

Hi Damjan, thanks for working on this!

Regarding the ICU layout engine a switch to Harfbuzz is the right option, 
because simply disabling/dropping ICU-LE would mean dropping the support for 
complex-text-layout languages like Arabic, Hebrew, Hindi, Thai, etc.


Herbert

On May 4, 2025 7:43:13 PM GMT+02:00, Damjan Jovanovic <dam...@apache.org> wrote:
>Hi
>
>As promised, here are the details of how I patched OpenOffice to use
>(system) ICU 76.1 on FreeBSD by converting all C++ calls to ICU to use
>ICU's C API instead, the issues that were found, and how you can try it out.
>
>----------------------------------------
>Unnecessary dependencies
>----------------------------------------
>cui didn't actually need ICU as a dependency, and ICU has been deleted from
>its makefile and precompiled header.
>
>sc does not really need the ICU layout engine, and this was removed from
>its makefiles.
>
>There may still be other modules that specify ICU in their prj/build.lst,
>header files, and/or makefiles, but don't actually need it.
>
>-----------------------
>GNU make bugs
>-----------------------
>Building with system ICU broke in i18npool, and probably never really
>worked since I ported it to gbuild, due to what looks like a bug in GNU
>make, because "-include" isn't really optional like it should be. I patched
>it to use a conditional "include" instead.
>
>---------------------------------------------------------------
>ICU layout engine ("icule") no longer exists
>---------------------------------------------------------------
>Building vcl broke for several reasons, one of which is that the ICU layout
>engine stopped existing! It was poorly maintained and previously deprecated:
>https://web.archive.org/web/20160130092624/http://userguide.icu-project.org/layoutengine
>https://unicode-org.atlassian.net/browse/ICU-10530
>Then in 2016 it was completely dropped from the ICU project:
>https://unicode-org.atlassian.net/browse/ICU-12708
>They recommend HarfBuzz as a replacement, and HarfBuzz does have a ICU
>Layout Engine compatibility API we could use (of unknown quality and
>completeness). HarfBuzz has become quite a popular library. Many
>applications and frameworks use it, including LibreOffice and Java. It's
>MIT licensed.
>
>However at the moment we don't need to do anything. Just commenting out
>this line in main/vcl/source/glyphs/gcach_layout.cxx:
>#define ENABLE_ICU_LAYOUT
>and removing "icule" from vcl makefiles, was enough to get AOO to build and
>run. I think it falls back to using Freetype.
>
>Another good thing about the layout engine removal is that the ICU patch to
>misc/icu/source/layout/ArabicShaping.cpp will no longer be necessary - that
>file was in the layout engine, and whatever problem that patch fixed,
>either won't exist in whatever is doing layout now (Freetype?), or will be
>that project's responsibility. Also it's worth noting: HarfBuzz was
>developed by an Iranian developer, so correct rendering of Arabic fonts was
>probably a top priority.
>
>------------------------------------
>i18npool
>------------------------------------
>i18npool needed more patching than everything else put together, and had to
>be broken up into several sections:
>
>---------------------------------------------------------------
>i18npool: break iterator rule files don't compile
>---------------------------------------------------------------
>OpenOffice doesn't just use the break iterators provided by ICU, but also
>makes custom ones, by compiling its own break rule files in
>i18npool/source/breakiterator/data/*.
>
>The file i18npool/source/breakiterator/data/LICENSE_INFO describes how
>these files were forked from ICU 3.2 in 2004, and then modified. ICU
>changed since then, and now line.txt doesn't compile because of the
>"!!LBCMNoChain;" option which ICU dropped, while sent.txt doesn't compile
>due to some other error.
>
>Using "diff" to compare our line.txt and sent.txt to ICU's different
>versions, they were forked from (or later updated to) what seem more like
>ICU 3.6, which used the rules from Unicode version 5, while the latest ICU
>are based on Unicode 12 for sent.txt and Unicode 14 for line.txt.
>
>sent.txt was customized by:
>- adding "$Thai         = [:Script = Thai:];"
>- adding "$LettersEx = [$OLetter $Upper $Lower $Numeric $Close $STerm]
>($Extend | $Format)*;"
>- adding "$LettersEx* $Thai $LettersEx* ($ATermEx | $SpEx)*;"
>- changing the first line of rule 12 to:
>"[[^$STerm $ATerm $Close $Sp $Sep $Format $Extend $Thai]{bof}] ($Extend |
>$Format | $Close | $Sp)* [^$Thai];"
>
>which seems like quite a hack, adding Thai-specific rules to a file
>intended to find general sentence boundaries, which doesn't mention other
>languages.
>
>line.txt has more changes, but I still trust the Unicode 14 rules more than
>our (questionably) patched Unicode 5 rules.
>
>The code in main/i18npool/source/breakiterator/breakiterator_unicode.cxx
>method BreakIterator_Unicode::loadICUBreakIterator() will fall back to
>general break iterators when custom ones cannot be loaded, so I've deleted
>our sent.txt and line.txt, so that it uses ICU's own ones.
>
>-------------------------------------------------------------------------------------------------
>i18npool: unknown data length when loading custom break iterator rules
>-------------------------------------------------------------------------------------------------
>The custom break rules are stored in ICU's binary rule format, and get
>loaded with udata_open(), which returns a UDataMemory*, which OpenOffice
>was ultimately passing to the RuleBasedBreakIterator constructor. Now we
>can't call that constructor from C, the only RuleBasedBreakIterator
>constructor we can call (through ubrk_openBinaryRules()) is a different one
>that takes a pointer and a length ("ruleLength"). [ICU is really a C++
>library with a C wrapper around some of its features. The C++ API is more
>general and flexible.] While it's possible to get the data pointer from
>UDataMemory* by using udata_getMemory(), there is no public function to get
>the data length.
>
>By reading the ICU code for the RuleBasedBreakIterator constructor, I see
>that, at present, that ruleLength is not really used. That constructor only
>checks that it is large enough to hold its internal RBBIDataHeader data
>structure, and that it is large enough to hold the file data specified in
>RBBIDataHeader. Therefore what I've done is pass 1 billion as the
>ruleLength, to get past these checks, which has allowed it to work, but
>could break later if the ICU internals change.
>
>------------------------------------------------------------------------------------------------------------
>i18npool: subclassing RuleBasedBreakIterator to call protected
>setBreakType()
>------------------------------------------------------------------------------------------------------------
>OpenOffice subclasses RuleBasedBreakIterator in order to call the protected
>setBreakType() method inside it, an ugly hack that messes with ICU
>internals and only happened to work by pure luck. This subclassing cannot
>be done from C, but even if it could, in recent ICU, setBreakType() has
>been deleted. What was OpenOffice using setBreakType() for? When the custom
>break rules are loaded, it uses setBreakType() to set whether to break on
>characters, or words, or sentences, etc., presumably allowing one set of
>break rules to be used for multiple types of breaks, instead of needing a
>unique set of break rules for each type of break.
>
>This was the ICU commit in which fBreakType was removed:
>---snip---
>commit 8640bee541ac93bb79cef32ad0643b2ffd09dc3a
>Merge: 44b2617d449 d7f2cd98d3a
>Author: Andy Heninger <andy.henin...@gmail.com>
>Date:   Wed Feb 21 23:10:10 2018 +0000
>
>    ICU-10688 Remove redundant break type logic from BreakIterators. Merge
>to trunk.
>
>    X-SVN-Rev: 40967
>---snip---
>
>and as per https://unicode-org.atlassian.net/browse/ICU-10688 this was
>first released in ICU 61.1.
>
>I have stopped trying to subclass RuleBasedBreakIterator or set fBreakType,
>and just left it to use the break type from the rules, but we would
>presumably need ICU >= 61.1 for that to work correctly.
>
>-----------------------------------------------------------------
>i18npool: collator rule files don't compile
>-----------------------------------------------------------------
>Just like OpenOffice customizes break iterators a lot, it also customizes
>freaking collators a lot too...
>
>The following 2 files fail to build:
>main/i18npool/source/collator/data/ko_charset.txt
>main/i18npool/source/collator/data/zh_TW_charset.txt
>because the first causes gencoll_rule to fail with "Rule parsering error",
>and the second causes an infinite loop in gencoll_rule, in the new version
>of ICU.
>
>The za_TW_charset.txt was the only file using "\uNNNN" style escape
>sequences to form unicode characters, and when I replaced them by the
>actual unicode characters using the Java code below, it began working again:
>
>---snip---
>        try (BufferedReader reader = new BufferedReader(
>                new InputStreamReader(
>                        new
>FileInputStream("/path/to/openoffice/openoffice-git/main/i18npool/source/collator/data/zh_TW_charset.txt"),
>                        Charsets.UTF_8));
>            FileWriter fileWriter = new FileWriter("/tmp/zh",
>Charsets.UTF_8)) {
>            String line;
>            while ((line = reader.readLine()) != null) {
>                for (int idx = line.indexOf("\\u"); idx >= 0; idx =
>line.indexOf("\\u")) {
>                    int codePoint = Integer.parseInt(line.substring(idx +
>2, idx + 6), 16);
>                    line = line.substring(0, idx) +
>Character.toString(codePoint) + line.substring(idx + 6);
>                }
>                fileWriter.write(line);
>                fileWriter.write(0xa);
>            }
>        }
>---snip---
>
>ko_charset.txt was deleted as it couldn't be fixed (even if only 1
>character is left, it still breaks), and I doubt we still need it.
>
>
>------------------
>Reflection
>------------------
>I don't understand why such invasive customizations of ICU were done within
>OpenOffice. Couldn't the original OpenOffice developers have sent ICU
>patches upstream instead?
>
>-------------------------
>Remaining tasks
>-------------------------
>1. The ICU module itself has to be upgraded to a newer ICU. I recommend the
>very latest version, as we need at least 61.1 for the
>RuleBasedBreakIterator stuff, and we know 76.1 works from my system ICU
>test. This upgrade will be difficult, as Windows's old MSVC version cannot
>compile the new ICU, so we might have to package prebuilt binaries instead,
>or use Clang to compile it. For other platforms, where we have greater
>compiler freedom, we would need to patch dmake to use a different -std=...
>for ICU relative to other modules.
>2. The changes are significant and could break many things. There are no
>tests for the i18npool module of any kind. Under main/qadevOOo, there are
>some "i18n" tests, including the very tests we need for the Korean
>collation. However as I mentioned on previous occasions, these tests are
>not run during the build, and just sitting there unused. They used some
>ancient custom testing framework. I really want to get the tests running
>and see the results before we merge this, so I'll probably work on those
>next.
>
>------------------
>Trying it out
>------------------
>While further work remains, I've pushed my changes into the "icu-c-api"
>branch, if anyone wants to try it out. Just remember to pass
>--with-system-icu to ./configure, to use the system ICU on *nix. I believe
>it should still work with the internal ICU 4.2.1 we use now too, but
>haven't tested in a while (the only place it previously broke was the data
>files - the C code works equally well with ICU 4.2.1 and 76.1).
>
>Regards
>Damjan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@openoffice.apache.org
For additional commands, e-mail: dev-h...@openoffice.apache.org

Re: New ICU versions, C API: the details

Reply via email to