Hi Damjan, I am not a developer, so I can't comment on the technical issues.
But generally I like the idea of reducing dependencies and simplifying the code. Maybe we can create a separate git branch for this task? Regards, Matthias Am 15.10.22 um 10:32 schrieb Damjan Jovanovic: > Hi > > We currently use 2 XML libraries in our C/C++ code, expat and libxml2. > This is unnecessary, one of them can be removed, and I propose we > remove expat. > > expat is a small library that only does SAX parsing, and lacks DOM, > lacks XSLT, lacks schema validation or any more advanced features we > use. By comparison, libxml2 is probably the most complete open-source > XML library in the world for C, with endless features, many users, > stable API, no mandatory dependencies, portable, highly standards > compliant and compatible with other XML libraries, performs well > (https://xmlbench.sourceforge.net/results/benchmark200910/index.html), > is a requirement for libxslt, is permissively licensed, takes security > seriously (the maintainer made money by fixing its security bugs in > Google's bug bounties), actively developed, documented, nice readable > source code, and it's generally of very high quality. > > We generally don't like unnecessary dependencies, and expat has been > difficult to upgrade to newer versions due to compiler issues. Since > expat only implements a small subset of libxml2's features, could we > replace it with libxml2? > > Where is expat used? The output of "grep expat */prj/build.lst" gives > us these modules: > lucene > openssl > sax > shell > tools > > lucene and openssl don't really require expat. Removing expat from > their build.lst and stopping the expat/ module from building still > gets lucene and openssl to build successfully, so clearly their > dependency on expat was always unnecessary. As of commit > 461facd3d94599cc708db50510b7e42483407720 they no longer depend on > expat. > > tools probably also has an unnecessary dependency on expat, but this > is harder to prove as tools -> i18pool -> sax -> expat, so let's > revisit this later. > > Only the sax and shell modules really use expat. > > In sax, it is used to implement the com.sun.star.xml.sax.Parser UNO > service (com.sun.star.comp.extensions.xml.sax.ParserExpat > implementation name), which is used fairly widely throughout AOO, > including for loading OpenDocument files (sax -> svx -> xmloff), but > all the expat usage is contained in a single file: > sax/source/expatwrap/sax_expat.cxx. > > In shell, it is used to implement a SAX parser internal to that > module, which is then consumed by several libraries, including > source/unix/sysshell which deals with the recently used file list in > $HOME/.recently-used. > > -------------------------------- > Porting sax to libxml2 > -------------------------------- > > Patching the sax module to use libxml2 instead of expat was successful > enough to get AOO to build, run and load documents, even though it's > still incomplete. > > Most of the SAX callbacks are compatible between expat and libxml2, > and converting those was easy. String types had to be changed, and > sometimes the parameter list has to be reordered, or expat-only > parameters removed. CDATA is reported with a single event in libxml2, > which needed conversion to startCDATA + characters + endCDATA events > on our UNO callbacks. > > Unfortunately expat's XML_SetExternalEntityRefHandler() works quite > differently on libxml2, and has been commented out for now while I try > to understand what external entities are and how you generally load > them. Also expat's XML_SetDefaultHandlerExpand() is not yet > implemented and I am not sure what to do there. > > One gotcha was that when there are no attributes on an element, > libxml2 calls the startElement() callback with NULL attributes, while > expat called it with attributes to a 1 element char** array containing > NULL. This was crashing i18npool's saxparser during the build, in > SaxExpatParser_Impl::callbackStartElement() during the conversion of > attributes to its attribute list. Luckily it was easy to fix. > > The code for sax_expat.cxx has become quite a bit shorter, because > libxml2 always uses UTF-8 unlike expat which can be built with UTF-16 > or UTF-8, so the XmlNChar2OUString() and XmlChar2OUString() functions > have been removed. Also the getErrorMessage() function no longer has > to have a list of hardcoded error codes with our error message for > each, as libxml2's xmlCtxtGetLastError() returns the official error > message. > > Also the sax module uses the SAX1 API. libxml2 supports both SAX1 and > SAX2, but SAX1 is optional, and will be disabled if > LIBXML_SAX1_ENABLED is undefined in libxml2's > include/libxml/xmlversion.h. Currently it's hardcoded to 1, so it > should always be defined, but when we use the system libxml2, we don't > know that for sure. For now, I've added a check for > LIBXML_SAX1_ENABLED and preprocessor #error if missing. It's also > possible for the sax module to convert SAX2 callbacks to SAX1 on the > UNO side, but that's inefficient (eg. we'd have to concatenate > namespace + ":" + localname, which the consuming code will later just > split up again). Ideally we would make a XDocumentHandler2 UNO > interface based on SAX2, and upstream UNO consumers would use that, so > we could use SAX2 throughout AOO. This is more of a long-term plan > though, for now SAX1 should be fine. > > Even with these issues, it works perfectly, building and running and > loading many documents. I am sure libxml2 is being used to load > documents instead of expat, because a debugger breakpoint on expat's > XML_Parse() doesn't get triggered. > > -------------------------------- > Porting shell to libxml2 > -------------------------------- > > I haven't started this yet (but you can ;). It seems easier, fewer > expat APIs are used, and since the module is internal and we control > both the library and its consumers, it might even be possible to use > SAX2 throughout, instead of the custom parsing of tag names into > namespaces that it currently does. > > -------------- > In closing > -------------- > > They say "release early, release often" is what made open-source > software good, so I am attaching a patch with what I've done so far. > Please review, test, and if you can, help implement > XML_SetExternalEntityRefHandler() and XML_SetDefaultHandlerExpand(), > and port the shell module. > > It would also be good to have benchmarks of how long it takes to load > a large document with expat vs libxml2. > > Thank you > Damjan > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@openoffice.apache.org > For additional commands, e-mail: dev-h...@openoffice.apache.org
smime.p7s
Description: S/MIME Cryptographic Signature