Hi Damjan,

I am not a developer, so I can't comment on the technical issues.

But generally I like the idea of reducing dependencies and simplifying
the code.

Maybe we can create a separate git branch for this task?

Regards,

   Matthias

Am 15.10.22 um 10:32 schrieb Damjan Jovanovic:
> Hi
>
> We currently use 2 XML libraries in our C/C++ code, expat and libxml2.
> This is unnecessary, one of them can be removed, and I propose we
> remove expat.
>
> expat is a small library that only does SAX parsing, and lacks DOM,
> lacks XSLT, lacks schema validation or any more advanced features we
> use. By comparison, libxml2 is probably the most complete open-source
> XML library in the world for C, with endless features, many users,
> stable API, no mandatory dependencies, portable, highly standards
> compliant and compatible with other XML libraries, performs well
> (https://xmlbench.sourceforge.net/results/benchmark200910/index.html),
> is a requirement for libxslt, is permissively licensed, takes security
> seriously (the maintainer made money by fixing its security bugs in
> Google's bug bounties), actively developed, documented, nice readable
> source code, and it's generally of very high quality.
>
> We generally don't like unnecessary dependencies, and expat has been
> difficult to upgrade to newer versions due to compiler issues. Since
> expat only implements a small subset of libxml2's features, could we
> replace it with libxml2?
>
> Where is expat used? The output of "grep expat */prj/build.lst" gives
> us these modules:
> lucene
> openssl
> sax
> shell
> tools
>
> lucene and openssl don't really require expat. Removing expat from
> their build.lst and stopping the expat/ module from building still
> gets lucene and openssl to build successfully, so clearly their
> dependency on expat was always unnecessary. As of commit
> 461facd3d94599cc708db50510b7e42483407720 they no longer depend on
> expat.
>
> tools probably also has an unnecessary dependency on expat, but this
> is harder to prove as tools -> i18pool -> sax -> expat, so let's
> revisit this later.
>
> Only the sax and shell modules really use expat.
>
> In sax, it is used to implement the com.sun.star.xml.sax.Parser UNO
> service (com.sun.star.comp.extensions.xml.sax.ParserExpat
> implementation name), which is used fairly widely throughout AOO,
> including for loading OpenDocument files (sax -> svx -> xmloff), but
> all the expat usage is contained in a single file:
> sax/source/expatwrap/sax_expat.cxx.
>
> In shell, it is used to implement a SAX parser internal to that
> module, which is then consumed by several libraries, including
> source/unix/sysshell which deals with the recently used file list in
> $HOME/.recently-used.
>
> --------------------------------
> Porting sax to libxml2
> --------------------------------
>
> Patching the sax module to use libxml2 instead of expat was successful
> enough to get AOO to build, run and load documents, even though it's
> still incomplete.
>
> Most of the SAX callbacks are compatible between expat and libxml2,
> and converting those was easy. String types had to be changed, and
> sometimes the parameter list has to be reordered, or expat-only
> parameters removed. CDATA is reported with a single event in libxml2,
> which needed conversion to startCDATA + characters + endCDATA events
> on our UNO callbacks.
>
> Unfortunately expat's XML_SetExternalEntityRefHandler() works quite
> differently on libxml2, and has been commented out for now while I try
> to understand what external entities are and how you generally load
> them. Also expat's XML_SetDefaultHandlerExpand() is not yet
> implemented and I am not sure what to do there.
>
> One gotcha was that when there are no attributes on an element,
> libxml2 calls the startElement() callback with NULL attributes, while
> expat called it with attributes to a 1 element char** array containing
> NULL. This was crashing i18npool's saxparser during the build, in
> SaxExpatParser_Impl::callbackStartElement() during the conversion of
> attributes to its attribute list. Luckily it was easy to fix.
>
> The code for sax_expat.cxx has become quite a bit shorter, because
> libxml2 always uses UTF-8 unlike expat which can be built with UTF-16
> or UTF-8, so the XmlNChar2OUString() and XmlChar2OUString() functions
> have been removed. Also the getErrorMessage() function no longer has
> to have a list of hardcoded error codes with our error message for
> each, as libxml2's xmlCtxtGetLastError() returns the official error
> message.
>
> Also the sax module uses the SAX1 API. libxml2 supports both SAX1 and
> SAX2, but SAX1 is optional, and will be disabled if
> LIBXML_SAX1_ENABLED is undefined in libxml2's
> include/libxml/xmlversion.h. Currently it's hardcoded to 1, so it
> should always be defined, but when we use the system libxml2, we don't
> know that for sure. For now, I've added a check for
> LIBXML_SAX1_ENABLED and preprocessor #error if missing. It's also
> possible for the sax module to convert SAX2 callbacks to SAX1 on the
> UNO side, but that's inefficient (eg. we'd have to concatenate
> namespace + ":" + localname, which the consuming code will later just
> split up again). Ideally we would make a XDocumentHandler2 UNO
> interface based on SAX2, and upstream UNO consumers would use that, so
> we could use SAX2 throughout AOO. This is more of a long-term plan
> though, for now SAX1 should be fine.
>
> Even with these issues, it works perfectly, building and running and
> loading many documents. I am sure libxml2 is being used to load
> documents instead of expat, because a debugger breakpoint on expat's
> XML_Parse() doesn't get triggered.
>
> --------------------------------
> Porting shell to libxml2
> --------------------------------
>
> I haven't started this yet (but you can ;). It seems easier, fewer
> expat APIs are used, and since the module is internal and we control
> both the library and its consumers, it might even be possible to use
> SAX2 throughout, instead of the custom parsing of tag names into
> namespaces that it currently does.
>
> --------------
> In closing
> --------------
>
> They say "release early, release often" is what made open-source
> software good, so I am attaching a patch with what I've done so far.
> Please review, test, and if you can, help implement
> XML_SetExternalEntityRefHandler() and XML_SetDefaultHandlerExpand(),
> and port the shell module.
>
> It would also be good to have benchmarks of how long it takes to load
> a large document with expat vs libxml2.
>
> Thank you
> Damjan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@openoffice.apache.org
> For additional commands, e-mail: dev-h...@openoffice.apache.org

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature

Reply via email to