New ICU versions, C API: the details

Damjan Jovanovic Sun, 04 May 2025 10:44:17 -0700

Hi

As promised, here are the details of how I patched OpenOffice to use
(system) ICU 76.1 on FreeBSD by converting all C++ calls to ICU to use
ICU's C API instead, the issues that were found, and how you can try it out.


----------------------------------------
Unnecessary dependencies
----------------------------------------
cui didn't actually need ICU as a dependency, and ICU has been deleted from
its makefile and precompiled header.

sc does not really need the ICU layout engine, and this was removed from
its makefiles.

There may still be other modules that specify ICU in their prj/build.lst,
header files, and/or makefiles, but don't actually need it.

-----------------------
GNU make bugs
-----------------------
Building with system ICU broke in i18npool, and probably never really
worked since I ported it to gbuild, due to what looks like a bug in GNU
make, because "-include" isn't really optional like it should be. I patched
it to use a conditional "include" instead.

---------------------------------------------------------------
ICU layout engine ("icule") no longer exists
---------------------------------------------------------------
Building vcl broke for several reasons, one of which is that the ICU layout
engine stopped existing! It was poorly maintained and previously deprecated:
https://web.archive.org/web/20160130092624/http://userguide.icu-project.org/layoutengine
https://unicode-org.atlassian.net/browse/ICU-10530
Then in 2016 it was completely dropped from the ICU project:
https://unicode-org.atlassian.net/browse/ICU-12708
They recommend HarfBuzz as a replacement, and HarfBuzz does have a ICU
Layout Engine compatibility API we could use (of unknown quality and
completeness). HarfBuzz has become quite a popular library. Many
applications and frameworks use it, including LibreOffice and Java. It's
MIT licensed.

However at the moment we don't need to do anything. Just commenting out
this line in main/vcl/source/glyphs/gcach_layout.cxx:
#define ENABLE_ICU_LAYOUT
and removing "icule" from vcl makefiles, was enough to get AOO to build and
run. I think it falls back to using Freetype.

Another good thing about the layout engine removal is that the ICU patch to
misc/icu/source/layout/ArabicShaping.cpp will no longer be necessary - that
file was in the layout engine, and whatever problem that patch fixed,
either won't exist in whatever is doing layout now (Freetype?), or will be
that project's responsibility. Also it's worth noting: HarfBuzz was
developed by an Iranian developer, so correct rendering of Arabic fonts was
probably a top priority.

------------------------------------
i18npool
------------------------------------
i18npool needed more patching than everything else put together, and had to
be broken up into several sections:

---------------------------------------------------------------
i18npool: break iterator rule files don't compile
---------------------------------------------------------------
OpenOffice doesn't just use the break iterators provided by ICU, but also
makes custom ones, by compiling its own break rule files in
i18npool/source/breakiterator/data/*.

The file i18npool/source/breakiterator/data/LICENSE_INFO describes how
these files were forked from ICU 3.2 in 2004, and then modified. ICU
changed since then, and now line.txt doesn't compile because of the
"!!LBCMNoChain;" option which ICU dropped, while sent.txt doesn't compile
due to some other error.

Using "diff" to compare our line.txt and sent.txt to ICU's different
versions, they were forked from (or later updated to) what seem more like
ICU 3.6, which used the rules from Unicode version 5, while the latest ICU
are based on Unicode 12 for sent.txt and Unicode 14 for line.txt.

sent.txt was customized by:
- adding "$Thai         = [:Script = Thai:];"
- adding "$LettersEx = [$OLetter $Upper $Lower $Numeric $Close $STerm]
($Extend | $Format)*;"
- adding "$LettersEx* $Thai $LettersEx* ($ATermEx | $SpEx)*;"
- changing the first line of rule 12 to:
"[[^$STerm $ATerm $Close $Sp $Sep $Format $Extend $Thai]{bof}] ($Extend |
$Format | $Close | $Sp)* [^$Thai];"

which seems like quite a hack, adding Thai-specific rules to a file
intended to find general sentence boundaries, which doesn't mention other
languages.

line.txt has more changes, but I still trust the Unicode 14 rules more than
our (questionably) patched Unicode 5 rules.

The code in main/i18npool/source/breakiterator/breakiterator_unicode.cxx
method BreakIterator_Unicode::loadICUBreakIterator() will fall back to
general break iterators when custom ones cannot be loaded, so I've deleted
our sent.txt and line.txt, so that it uses ICU's own ones.

-------------------------------------------------------------------------------------------------
i18npool: unknown data length when loading custom break iterator rules
-------------------------------------------------------------------------------------------------
The custom break rules are stored in ICU's binary rule format, and get
loaded with udata_open(), which returns a UDataMemory*, which OpenOffice
was ultimately passing to the RuleBasedBreakIterator constructor. Now we
can't call that constructor from C, the only RuleBasedBreakIterator
constructor we can call (through ubrk_openBinaryRules()) is a different one
that takes a pointer and a length ("ruleLength"). [ICU is really a C++
library with a C wrapper around some of its features. The C++ API is more
general and flexible.] While it's possible to get the data pointer from
UDataMemory* by using udata_getMemory(), there is no public function to get
the data length.

By reading the ICU code for the RuleBasedBreakIterator constructor, I see
that, at present, that ruleLength is not really used. That constructor only
checks that it is large enough to hold its internal RBBIDataHeader data
structure, and that it is large enough to hold the file data specified in
RBBIDataHeader. Therefore what I've done is pass 1 billion as the
ruleLength, to get past these checks, which has allowed it to work, but
could break later if the ICU internals change.

------------------------------------------------------------------------------------------------------------
i18npool: subclassing RuleBasedBreakIterator to call protected
setBreakType()
------------------------------------------------------------------------------------------------------------
OpenOffice subclasses RuleBasedBreakIterator in order to call the protected
setBreakType() method inside it, an ugly hack that messes with ICU
internals and only happened to work by pure luck. This subclassing cannot
be done from C, but even if it could, in recent ICU, setBreakType() has
been deleted. What was OpenOffice using setBreakType() for? When the custom
break rules are loaded, it uses setBreakType() to set whether to break on
characters, or words, or sentences, etc., presumably allowing one set of
break rules to be used for multiple types of breaks, instead of needing a
unique set of break rules for each type of break.

This was the ICU commit in which fBreakType was removed:
---snip---
commit 8640bee541ac93bb79cef32ad0643b2ffd09dc3a
Merge: 44b2617d449 d7f2cd98d3a
Author: Andy Heninger <andy.henin...@gmail.com>
Date:   Wed Feb 21 23:10:10 2018 +0000

    ICU-10688 Remove redundant break type logic from BreakIterators. Merge
to trunk.

    X-SVN-Rev: 40967
---snip---

and as per https://unicode-org.atlassian.net/browse/ICU-10688 this was
first released in ICU 61.1.

I have stopped trying to subclass RuleBasedBreakIterator or set fBreakType,
and just left it to use the break type from the rules, but we would
presumably need ICU >= 61.1 for that to work correctly.

-----------------------------------------------------------------
i18npool: collator rule files don't compile
-----------------------------------------------------------------
Just like OpenOffice customizes break iterators a lot, it also customizes
freaking collators a lot too...

The following 2 files fail to build:
main/i18npool/source/collator/data/ko_charset.txt
main/i18npool/source/collator/data/zh_TW_charset.txt
because the first causes gencoll_rule to fail with "Rule parsering error",
and the second causes an infinite loop in gencoll_rule, in the new version
of ICU.

The za_TW_charset.txt was the only file using "\uNNNN" style escape
sequences to form unicode characters, and when I replaced them by the
actual unicode characters using the Java code below, it began working again:

---snip---
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(
                        new
FileInputStream("/path/to/openoffice/openoffice-git/main/i18npool/source/collator/data/zh_TW_charset.txt"),
                        Charsets.UTF_8));
            FileWriter fileWriter = new FileWriter("/tmp/zh",
Charsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (int idx = line.indexOf("\\u"); idx >= 0; idx =
line.indexOf("\\u")) {
                    int codePoint = Integer.parseInt(line.substring(idx +
2, idx + 6), 16);
                    line = line.substring(0, idx) +
Character.toString(codePoint) + line.substring(idx + 6);
                }
                fileWriter.write(line);
                fileWriter.write(0xa);
            }
        }
---snip---

ko_charset.txt was deleted as it couldn't be fixed (even if only 1
character is left, it still breaks), and I doubt we still need it.


------------------
Reflection
------------------
I don't understand why such invasive customizations of ICU were done within
OpenOffice. Couldn't the original OpenOffice developers have sent ICU
patches upstream instead?

-------------------------
Remaining tasks
-------------------------
1. The ICU module itself has to be upgraded to a newer ICU. I recommend the
very latest version, as we need at least 61.1 for the
RuleBasedBreakIterator stuff, and we know 76.1 works from my system ICU
test. This upgrade will be difficult, as Windows's old MSVC version cannot
compile the new ICU, so we might have to package prebuilt binaries instead,
or use Clang to compile it. For other platforms, where we have greater
compiler freedom, we would need to patch dmake to use a different -std=...
for ICU relative to other modules.
2. The changes are significant and could break many things. There are no
tests for the i18npool module of any kind. Under main/qadevOOo, there are
some "i18n" tests, including the very tests we need for the Korean
collation. However as I mentioned on previous occasions, these tests are
not run during the build, and just sitting there unused. They used some
ancient custom testing framework. I really want to get the tests running
and see the results before we merge this, so I'll probably work on those
next.

------------------
Trying it out
------------------
While further work remains, I've pushed my changes into the "icu-c-api"
branch, if anyone wants to try it out. Just remember to pass
--with-system-icu to ./configure, to use the system ICU on *nix. I believe
it should still work with the internal ICU 4.2.1 we use now too, but
haven't tested in a while (the only place it previously broke was the data
files - the C code works equally well with ICU 4.2.1 and 76.1).

Regards
Damjan

New ICU versions, C API: the details

Reply via email to