+1 for this move. nice move Damjan.
I read through the patch and it looks nice on first glance.
How about we create a ticket and a (PR) branch for this?
This would help to collaborate on this point. we could maybe push out a
test version quickly to create regression tests
All the best
Peter
Am 15.10.22 um 10:32 schrieb Damjan Jovanovic:
Hi
We currently use 2 XML libraries in our C/C++ code, expat and libxml2.
This is unnecessary, one of them can be removed, and I propose we
remove expat.
expat is a small library that only does SAX parsing, and lacks DOM,
lacks XSLT, lacks schema validation or any more advanced features we
use. By comparison, libxml2 is probably the most complete open-source
XML library in the world for C, with endless features, many users,
stable API, no mandatory dependencies, portable, highly standards
compliant and compatible with other XML libraries, performs well
(https://xmlbench.sourceforge.net/results/benchmark200910/index.html),
is a requirement for libxslt, is permissively licensed, takes security
seriously (the maintainer made money by fixing its security bugs in
Google's bug bounties), actively developed, documented, nice readable
source code, and it's generally of very high quality.
We generally don't like unnecessary dependencies, and expat has been
difficult to upgrade to newer versions due to compiler issues. Since
expat only implements a small subset of libxml2's features, could we
replace it with libxml2?
Where is expat used? The output of "grep expat */prj/build.lst" gives
us these modules:
lucene
openssl
sax
shell
tools
lucene and openssl don't really require expat. Removing expat from
their build.lst and stopping the expat/ module from building still
gets lucene and openssl to build successfully, so clearly their
dependency on expat was always unnecessary. As of commit
461facd3d94599cc708db50510b7e42483407720 they no longer depend on
expat.
tools probably also has an unnecessary dependency on expat, but this
is harder to prove as tools -> i18pool -> sax -> expat, so let's
revisit this later.
Only the sax and shell modules really use expat.
In sax, it is used to implement the com.sun.star.xml.sax.Parser UNO
service (com.sun.star.comp.extensions.xml.sax.ParserExpat
implementation name), which is used fairly widely throughout AOO,
including for loading OpenDocument files (sax -> svx -> xmloff), but
all the expat usage is contained in a single file:
sax/source/expatwrap/sax_expat.cxx.
In shell, it is used to implement a SAX parser internal to that
module, which is then consumed by several libraries, including
source/unix/sysshell which deals with the recently used file list in
$HOME/.recently-used.
--------------------------------
Porting sax to libxml2
--------------------------------
Patching the sax module to use libxml2 instead of expat was successful
enough to get AOO to build, run and load documents, even though it's
still incomplete.
Most of the SAX callbacks are compatible between expat and libxml2,
and converting those was easy. String types had to be changed, and
sometimes the parameter list has to be reordered, or expat-only
parameters removed. CDATA is reported with a single event in libxml2,
which needed conversion to startCDATA + characters + endCDATA events
on our UNO callbacks.
Unfortunately expat's XML_SetExternalEntityRefHandler() works quite
differently on libxml2, and has been commented out for now while I try
to understand what external entities are and how you generally load
them. Also expat's XML_SetDefaultHandlerExpand() is not yet
implemented and I am not sure what to do there.
One gotcha was that when there are no attributes on an element,
libxml2 calls the startElement() callback with NULL attributes, while
expat called it with attributes to a 1 element char** array containing
NULL. This was crashing i18npool's saxparser during the build, in
SaxExpatParser_Impl::callbackStartElement() during the conversion of
attributes to its attribute list. Luckily it was easy to fix.
The code for sax_expat.cxx has become quite a bit shorter, because
libxml2 always uses UTF-8 unlike expat which can be built with UTF-16
or UTF-8, so the XmlNChar2OUString() and XmlChar2OUString() functions
have been removed. Also the getErrorMessage() function no longer has
to have a list of hardcoded error codes with our error message for
each, as libxml2's xmlCtxtGetLastError() returns the official error
message.
Also the sax module uses the SAX1 API. libxml2 supports both SAX1 and
SAX2, but SAX1 is optional, and will be disabled if
LIBXML_SAX1_ENABLED is undefined in libxml2's
include/libxml/xmlversion.h. Currently it's hardcoded to 1, so it
should always be defined, but when we use the system libxml2, we don't
know that for sure. For now, I've added a check for
LIBXML_SAX1_ENABLED and preprocessor #error if missing. It's also
possible for the sax module to convert SAX2 callbacks to SAX1 on the
UNO side, but that's inefficient (eg. we'd have to concatenate
namespace + ":" + localname, which the consuming code will later just
split up again). Ideally we would make a XDocumentHandler2 UNO
interface based on SAX2, and upstream UNO consumers would use that, so
we could use SAX2 throughout AOO. This is more of a long-term plan
though, for now SAX1 should be fine.
Even with these issues, it works perfectly, building and running and
loading many documents. I am sure libxml2 is being used to load
documents instead of expat, because a debugger breakpoint on expat's
XML_Parse() doesn't get triggered.
--------------------------------
Porting shell to libxml2
--------------------------------
I haven't started this yet (but you can ;). It seems easier, fewer
expat APIs are used, and since the module is internal and we control
both the library and its consumers, it might even be possible to use
SAX2 throughout, instead of the custom parsing of tag names into
namespaces that it currently does.
--------------
In closing
--------------
They say "release early, release often" is what made open-source
software good, so I am attaching a patch with what I've done so far.
Please review, test, and if you can, help implement
XML_SetExternalEntityRefHandler() and XML_SetDefaultHandlerExpand(),
and port the shell module.
It would also be good to have benchmarks of how long it takes to load
a large document with expat vs libxml2.
Thank you
Damjan
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@openoffice.apache.org
For additional commands, e-mail: dev-h...@openoffice.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@openoffice.apache.org
For additional commands, e-mail: dev-h...@openoffice.apache.org