Hi As promised, here are the details of how I patched OpenOffice to use (system) ICU 76.1 on FreeBSD by converting all C++ calls to ICU to use ICU's C API instead, the issues that were found, and how you can try it out.
---------------------------------------- Unnecessary dependencies ---------------------------------------- cui didn't actually need ICU as a dependency, and ICU has been deleted from its makefile and precompiled header. sc does not really need the ICU layout engine, and this was removed from its makefiles. There may still be other modules that specify ICU in their prj/build.lst, header files, and/or makefiles, but don't actually need it. ----------------------- GNU make bugs ----------------------- Building with system ICU broke in i18npool, and probably never really worked since I ported it to gbuild, due to what looks like a bug in GNU make, because "-include" isn't really optional like it should be. I patched it to use a conditional "include" instead. --------------------------------------------------------------- ICU layout engine ("icule") no longer exists --------------------------------------------------------------- Building vcl broke for several reasons, one of which is that the ICU layout engine stopped existing! It was poorly maintained and previously deprecated: https://web.archive.org/web/20160130092624/http://userguide.icu-project.org/layoutengine https://unicode-org.atlassian.net/browse/ICU-10530 Then in 2016 it was completely dropped from the ICU project: https://unicode-org.atlassian.net/browse/ICU-12708 They recommend HarfBuzz as a replacement, and HarfBuzz does have a ICU Layout Engine compatibility API we could use (of unknown quality and completeness). HarfBuzz has become quite a popular library. Many applications and frameworks use it, including LibreOffice and Java. It's MIT licensed. However at the moment we don't need to do anything. Just commenting out this line in main/vcl/source/glyphs/gcach_layout.cxx: #define ENABLE_ICU_LAYOUT and removing "icule" from vcl makefiles, was enough to get AOO to build and run. I think it falls back to using Freetype. Another good thing about the layout engine removal is that the ICU patch to misc/icu/source/layout/ArabicShaping.cpp will no longer be necessary - that file was in the layout engine, and whatever problem that patch fixed, either won't exist in whatever is doing layout now (Freetype?), or will be that project's responsibility. Also it's worth noting: HarfBuzz was developed by an Iranian developer, so correct rendering of Arabic fonts was probably a top priority. ------------------------------------ i18npool ------------------------------------ i18npool needed more patching than everything else put together, and had to be broken up into several sections: --------------------------------------------------------------- i18npool: break iterator rule files don't compile --------------------------------------------------------------- OpenOffice doesn't just use the break iterators provided by ICU, but also makes custom ones, by compiling its own break rule files in i18npool/source/breakiterator/data/*. The file i18npool/source/breakiterator/data/LICENSE_INFO describes how these files were forked from ICU 3.2 in 2004, and then modified. ICU changed since then, and now line.txt doesn't compile because of the "!!LBCMNoChain;" option which ICU dropped, while sent.txt doesn't compile due to some other error. Using "diff" to compare our line.txt and sent.txt to ICU's different versions, they were forked from (or later updated to) what seem more like ICU 3.6, which used the rules from Unicode version 5, while the latest ICU are based on Unicode 12 for sent.txt and Unicode 14 for line.txt. sent.txt was customized by: - adding "$Thai = [:Script = Thai:];" - adding "$LettersEx = [$OLetter $Upper $Lower $Numeric $Close $STerm] ($Extend | $Format)*;" - adding "$LettersEx* $Thai $LettersEx* ($ATermEx | $SpEx)*;" - changing the first line of rule 12 to: "[[^$STerm $ATerm $Close $Sp $Sep $Format $Extend $Thai]{bof}] ($Extend | $Format | $Close | $Sp)* [^$Thai];" which seems like quite a hack, adding Thai-specific rules to a file intended to find general sentence boundaries, which doesn't mention other languages. line.txt has more changes, but I still trust the Unicode 14 rules more than our (questionably) patched Unicode 5 rules. The code in main/i18npool/source/breakiterator/breakiterator_unicode.cxx method BreakIterator_Unicode::loadICUBreakIterator() will fall back to general break iterators when custom ones cannot be loaded, so I've deleted our sent.txt and line.txt, so that it uses ICU's own ones. ------------------------------------------------------------------------------------------------- i18npool: unknown data length when loading custom break iterator rules ------------------------------------------------------------------------------------------------- The custom break rules are stored in ICU's binary rule format, and get loaded with udata_open(), which returns a UDataMemory*, which OpenOffice was ultimately passing to the RuleBasedBreakIterator constructor. Now we can't call that constructor from C, the only RuleBasedBreakIterator constructor we can call (through ubrk_openBinaryRules()) is a different one that takes a pointer and a length ("ruleLength"). [ICU is really a C++ library with a C wrapper around some of its features. The C++ API is more general and flexible.] While it's possible to get the data pointer from UDataMemory* by using udata_getMemory(), there is no public function to get the data length. By reading the ICU code for the RuleBasedBreakIterator constructor, I see that, at present, that ruleLength is not really used. That constructor only checks that it is large enough to hold its internal RBBIDataHeader data structure, and that it is large enough to hold the file data specified in RBBIDataHeader. Therefore what I've done is pass 1 billion as the ruleLength, to get past these checks, which has allowed it to work, but could break later if the ICU internals change. ------------------------------------------------------------------------------------------------------------ i18npool: subclassing RuleBasedBreakIterator to call protected setBreakType() ------------------------------------------------------------------------------------------------------------ OpenOffice subclasses RuleBasedBreakIterator in order to call the protected setBreakType() method inside it, an ugly hack that messes with ICU internals and only happened to work by pure luck. This subclassing cannot be done from C, but even if it could, in recent ICU, setBreakType() has been deleted. What was OpenOffice using setBreakType() for? When the custom break rules are loaded, it uses setBreakType() to set whether to break on characters, or words, or sentences, etc., presumably allowing one set of break rules to be used for multiple types of breaks, instead of needing a unique set of break rules for each type of break. This was the ICU commit in which fBreakType was removed: ---snip--- commit 8640bee541ac93bb79cef32ad0643b2ffd09dc3a Merge: 44b2617d449 d7f2cd98d3a Author: Andy Heninger <andy.henin...@gmail.com> Date: Wed Feb 21 23:10:10 2018 +0000 ICU-10688 Remove redundant break type logic from BreakIterators. Merge to trunk. X-SVN-Rev: 40967 ---snip--- and as per https://unicode-org.atlassian.net/browse/ICU-10688 this was first released in ICU 61.1. I have stopped trying to subclass RuleBasedBreakIterator or set fBreakType, and just left it to use the break type from the rules, but we would presumably need ICU >= 61.1 for that to work correctly. ----------------------------------------------------------------- i18npool: collator rule files don't compile ----------------------------------------------------------------- Just like OpenOffice customizes break iterators a lot, it also customizes freaking collators a lot too... The following 2 files fail to build: main/i18npool/source/collator/data/ko_charset.txt main/i18npool/source/collator/data/zh_TW_charset.txt because the first causes gencoll_rule to fail with "Rule parsering error", and the second causes an infinite loop in gencoll_rule, in the new version of ICU. The za_TW_charset.txt was the only file using "\uNNNN" style escape sequences to form unicode characters, and when I replaced them by the actual unicode characters using the Java code below, it began working again: ---snip--- try (BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream("/path/to/openoffice/openoffice-git/main/i18npool/source/collator/data/zh_TW_charset.txt"), Charsets.UTF_8)); FileWriter fileWriter = new FileWriter("/tmp/zh", Charsets.UTF_8)) { String line; while ((line = reader.readLine()) != null) { for (int idx = line.indexOf("\\u"); idx >= 0; idx = line.indexOf("\\u")) { int codePoint = Integer.parseInt(line.substring(idx + 2, idx + 6), 16); line = line.substring(0, idx) + Character.toString(codePoint) + line.substring(idx + 6); } fileWriter.write(line); fileWriter.write(0xa); } } ---snip--- ko_charset.txt was deleted as it couldn't be fixed (even if only 1 character is left, it still breaks), and I doubt we still need it. ------------------ Reflection ------------------ I don't understand why such invasive customizations of ICU were done within OpenOffice. Couldn't the original OpenOffice developers have sent ICU patches upstream instead? ------------------------- Remaining tasks ------------------------- 1. The ICU module itself has to be upgraded to a newer ICU. I recommend the very latest version, as we need at least 61.1 for the RuleBasedBreakIterator stuff, and we know 76.1 works from my system ICU test. This upgrade will be difficult, as Windows's old MSVC version cannot compile the new ICU, so we might have to package prebuilt binaries instead, or use Clang to compile it. For other platforms, where we have greater compiler freedom, we would need to patch dmake to use a different -std=... for ICU relative to other modules. 2. The changes are significant and could break many things. There are no tests for the i18npool module of any kind. Under main/qadevOOo, there are some "i18n" tests, including the very tests we need for the Korean collation. However as I mentioned on previous occasions, these tests are not run during the build, and just sitting there unused. They used some ancient custom testing framework. I really want to get the tests running and see the results before we merge this, so I'll probably work on those next. ------------------ Trying it out ------------------ While further work remains, I've pushed my changes into the "icu-c-api" branch, if anyone wants to try it out. Just remember to pass --with-system-icu to ./configure, to use the system ICU on *nix. I believe it should still work with the internal ICU 4.2.1 we use now too, but haven't tested in a while (the only place it previously broke was the data files - the C code works equally well with ICU 4.2.1 and 76.1). Regards Damjan