Hi Damjan, thanks for working on this! Regarding the ICU layout engine a switch to Harfbuzz is the right option, because simply disabling/dropping ICU-LE would mean dropping the support for complex-text-layout languages like Arabic, Hebrew, Hindi, Thai, etc.
Herbert On May 4, 2025 7:43:13 PM GMT+02:00, Damjan Jovanovic <dam...@apache.org> wrote: >Hi > >As promised, here are the details of how I patched OpenOffice to use >(system) ICU 76.1 on FreeBSD by converting all C++ calls to ICU to use >ICU's C API instead, the issues that were found, and how you can try it out. > >---------------------------------------- >Unnecessary dependencies >---------------------------------------- >cui didn't actually need ICU as a dependency, and ICU has been deleted from >its makefile and precompiled header. > >sc does not really need the ICU layout engine, and this was removed from >its makefiles. > >There may still be other modules that specify ICU in their prj/build.lst, >header files, and/or makefiles, but don't actually need it. > >----------------------- >GNU make bugs >----------------------- >Building with system ICU broke in i18npool, and probably never really >worked since I ported it to gbuild, due to what looks like a bug in GNU >make, because "-include" isn't really optional like it should be. I patched >it to use a conditional "include" instead. > >--------------------------------------------------------------- >ICU layout engine ("icule") no longer exists >--------------------------------------------------------------- >Building vcl broke for several reasons, one of which is that the ICU layout >engine stopped existing! It was poorly maintained and previously deprecated: >https://web.archive.org/web/20160130092624/http://userguide.icu-project.org/layoutengine >https://unicode-org.atlassian.net/browse/ICU-10530 >Then in 2016 it was completely dropped from the ICU project: >https://unicode-org.atlassian.net/browse/ICU-12708 >They recommend HarfBuzz as a replacement, and HarfBuzz does have a ICU >Layout Engine compatibility API we could use (of unknown quality and >completeness). HarfBuzz has become quite a popular library. Many >applications and frameworks use it, including LibreOffice and Java. It's >MIT licensed. > >However at the moment we don't need to do anything. Just commenting out >this line in main/vcl/source/glyphs/gcach_layout.cxx: >#define ENABLE_ICU_LAYOUT >and removing "icule" from vcl makefiles, was enough to get AOO to build and >run. I think it falls back to using Freetype. > >Another good thing about the layout engine removal is that the ICU patch to >misc/icu/source/layout/ArabicShaping.cpp will no longer be necessary - that >file was in the layout engine, and whatever problem that patch fixed, >either won't exist in whatever is doing layout now (Freetype?), or will be >that project's responsibility. Also it's worth noting: HarfBuzz was >developed by an Iranian developer, so correct rendering of Arabic fonts was >probably a top priority. > >------------------------------------ >i18npool >------------------------------------ >i18npool needed more patching than everything else put together, and had to >be broken up into several sections: > >--------------------------------------------------------------- >i18npool: break iterator rule files don't compile >--------------------------------------------------------------- >OpenOffice doesn't just use the break iterators provided by ICU, but also >makes custom ones, by compiling its own break rule files in >i18npool/source/breakiterator/data/*. > >The file i18npool/source/breakiterator/data/LICENSE_INFO describes how >these files were forked from ICU 3.2 in 2004, and then modified. ICU >changed since then, and now line.txt doesn't compile because of the >"!!LBCMNoChain;" option which ICU dropped, while sent.txt doesn't compile >due to some other error. > >Using "diff" to compare our line.txt and sent.txt to ICU's different >versions, they were forked from (or later updated to) what seem more like >ICU 3.6, which used the rules from Unicode version 5, while the latest ICU >are based on Unicode 12 for sent.txt and Unicode 14 for line.txt. > >sent.txt was customized by: >- adding "$Thai = [:Script = Thai:];" >- adding "$LettersEx = [$OLetter $Upper $Lower $Numeric $Close $STerm] >($Extend | $Format)*;" >- adding "$LettersEx* $Thai $LettersEx* ($ATermEx | $SpEx)*;" >- changing the first line of rule 12 to: >"[[^$STerm $ATerm $Close $Sp $Sep $Format $Extend $Thai]{bof}] ($Extend | >$Format | $Close | $Sp)* [^$Thai];" > >which seems like quite a hack, adding Thai-specific rules to a file >intended to find general sentence boundaries, which doesn't mention other >languages. > >line.txt has more changes, but I still trust the Unicode 14 rules more than >our (questionably) patched Unicode 5 rules. > >The code in main/i18npool/source/breakiterator/breakiterator_unicode.cxx >method BreakIterator_Unicode::loadICUBreakIterator() will fall back to >general break iterators when custom ones cannot be loaded, so I've deleted >our sent.txt and line.txt, so that it uses ICU's own ones. > >------------------------------------------------------------------------------------------------- >i18npool: unknown data length when loading custom break iterator rules >------------------------------------------------------------------------------------------------- >The custom break rules are stored in ICU's binary rule format, and get >loaded with udata_open(), which returns a UDataMemory*, which OpenOffice >was ultimately passing to the RuleBasedBreakIterator constructor. Now we >can't call that constructor from C, the only RuleBasedBreakIterator >constructor we can call (through ubrk_openBinaryRules()) is a different one >that takes a pointer and a length ("ruleLength"). [ICU is really a C++ >library with a C wrapper around some of its features. The C++ API is more >general and flexible.] While it's possible to get the data pointer from >UDataMemory* by using udata_getMemory(), there is no public function to get >the data length. > >By reading the ICU code for the RuleBasedBreakIterator constructor, I see >that, at present, that ruleLength is not really used. That constructor only >checks that it is large enough to hold its internal RBBIDataHeader data >structure, and that it is large enough to hold the file data specified in >RBBIDataHeader. Therefore what I've done is pass 1 billion as the >ruleLength, to get past these checks, which has allowed it to work, but >could break later if the ICU internals change. > >------------------------------------------------------------------------------------------------------------ >i18npool: subclassing RuleBasedBreakIterator to call protected >setBreakType() >------------------------------------------------------------------------------------------------------------ >OpenOffice subclasses RuleBasedBreakIterator in order to call the protected >setBreakType() method inside it, an ugly hack that messes with ICU >internals and only happened to work by pure luck. This subclassing cannot >be done from C, but even if it could, in recent ICU, setBreakType() has >been deleted. What was OpenOffice using setBreakType() for? When the custom >break rules are loaded, it uses setBreakType() to set whether to break on >characters, or words, or sentences, etc., presumably allowing one set of >break rules to be used for multiple types of breaks, instead of needing a >unique set of break rules for each type of break. > >This was the ICU commit in which fBreakType was removed: >---snip--- >commit 8640bee541ac93bb79cef32ad0643b2ffd09dc3a >Merge: 44b2617d449 d7f2cd98d3a >Author: Andy Heninger <andy.henin...@gmail.com> >Date: Wed Feb 21 23:10:10 2018 +0000 > > ICU-10688 Remove redundant break type logic from BreakIterators. Merge >to trunk. > > X-SVN-Rev: 40967 >---snip--- > >and as per https://unicode-org.atlassian.net/browse/ICU-10688 this was >first released in ICU 61.1. > >I have stopped trying to subclass RuleBasedBreakIterator or set fBreakType, >and just left it to use the break type from the rules, but we would >presumably need ICU >= 61.1 for that to work correctly. > >----------------------------------------------------------------- >i18npool: collator rule files don't compile >----------------------------------------------------------------- >Just like OpenOffice customizes break iterators a lot, it also customizes >freaking collators a lot too... > >The following 2 files fail to build: >main/i18npool/source/collator/data/ko_charset.txt >main/i18npool/source/collator/data/zh_TW_charset.txt >because the first causes gencoll_rule to fail with "Rule parsering error", >and the second causes an infinite loop in gencoll_rule, in the new version >of ICU. > >The za_TW_charset.txt was the only file using "\uNNNN" style escape >sequences to form unicode characters, and when I replaced them by the >actual unicode characters using the Java code below, it began working again: > >---snip--- > try (BufferedReader reader = new BufferedReader( > new InputStreamReader( > new >FileInputStream("/path/to/openoffice/openoffice-git/main/i18npool/source/collator/data/zh_TW_charset.txt"), > Charsets.UTF_8)); > FileWriter fileWriter = new FileWriter("/tmp/zh", >Charsets.UTF_8)) { > String line; > while ((line = reader.readLine()) != null) { > for (int idx = line.indexOf("\\u"); idx >= 0; idx = >line.indexOf("\\u")) { > int codePoint = Integer.parseInt(line.substring(idx + >2, idx + 6), 16); > line = line.substring(0, idx) + >Character.toString(codePoint) + line.substring(idx + 6); > } > fileWriter.write(line); > fileWriter.write(0xa); > } > } >---snip--- > >ko_charset.txt was deleted as it couldn't be fixed (even if only 1 >character is left, it still breaks), and I doubt we still need it. > > >------------------ >Reflection >------------------ >I don't understand why such invasive customizations of ICU were done within >OpenOffice. Couldn't the original OpenOffice developers have sent ICU >patches upstream instead? > >------------------------- >Remaining tasks >------------------------- >1. The ICU module itself has to be upgraded to a newer ICU. I recommend the >very latest version, as we need at least 61.1 for the >RuleBasedBreakIterator stuff, and we know 76.1 works from my system ICU >test. This upgrade will be difficult, as Windows's old MSVC version cannot >compile the new ICU, so we might have to package prebuilt binaries instead, >or use Clang to compile it. For other platforms, where we have greater >compiler freedom, we would need to patch dmake to use a different -std=... >for ICU relative to other modules. >2. The changes are significant and could break many things. There are no >tests for the i18npool module of any kind. Under main/qadevOOo, there are >some "i18n" tests, including the very tests we need for the Korean >collation. However as I mentioned on previous occasions, these tests are >not run during the build, and just sitting there unused. They used some >ancient custom testing framework. I really want to get the tests running >and see the results before we merge this, so I'll probably work on those >next. > >------------------ >Trying it out >------------------ >While further work remains, I've pushed my changes into the "icu-c-api" >branch, if anyone wants to try it out. Just remember to pass >--with-system-icu to ./configure, to use the system ICU on *nix. I believe >it should still work with the internal ICU 4.2.1 we use now too, but >haven't tested in a while (the only place it previously broke was the data >files - the C code works equally well with ICU 4.2.1 and 76.1). > >Regards >Damjan --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@openoffice.apache.org For additional commands, e-mail: dev-h...@openoffice.apache.org