On Mon, Oct 29, 2012 at 9:18 AM, Shlomi Fish <shlo...@shlomifish.org> wrote:

> On Mon, 29 Oct 2012 10:09:53 +0200
> Shlomi Fish <shlo...@shlomifish.org> wrote:
>
> > Hi Octavian,
> >
> > On Sun, 28 Oct 2012 17:45:15 +0200
> > "Octavian Rasnita" <orasn...@gmail.com> wrote:
> >
> > > From: "Shlomi Fish" <shlo...@shlomifish.org>
> > >
> > > Hi Octavian,
> > >
> > >
> > >
> > > Hi Shlomi,
> > >
> > > I tried to use XML::LibXML::Reader, which uses the pull parser, and I
> > > read that:
> > >
> > > ""
> > > However, it is also possible to mix Reader with DOM. At every point the
> > > user may copy the current node (optionally expanded into a complete
> > > sub-tree) from the processed document to another DOM tree, or to
> > > instruct the Reader to collect sub-document in form of a DOM tree
> > > ""
> > >
> > > So I tried:
> > >
> > > use XML::LibXML::Reader;
> > >
> > > my $xml = 'path/to/xml/file.xml';
> > >
> > > my $reader = XML::LibXML::Reader->new( location => $xml ) or die "cannot read $xml";
> > >
> > > while ( $reader->nextElement( 'Lexem' ) ) {
> > >     my $id = $reader->getAttribute( 'id' ); #works fine
> > >
> > >     my $doc = $reader->document;
> > >
> > >     my $timestamp = $doc->getElementsByTagName( 'Timestamp' ); # Doesn't work well
> > >     my @lexem_text = $doc->getElementsByTagName( 'Form' ); # Doesn't work fine
> > >
> > > }
> > >
> >
> > I'm not sure you should do ->document. I cannot tell you off-hand how to
> > do it right, but I can try to investigate when I have some spare cycles.
> >
>
> OK, after a short amount of investigation, I found that this program works:
>
> [CODE]
>
> use strict;
> use warnings;
>
> use XML::LibXML::Reader;
>
> my $xml = 'Lexems.xml';
>
> my $reader = XML::LibXML::Reader->new( location => $xml ) or die "cannot read $xml";
>
> while ( $reader->nextElement( 'Lexem' ) ) {
>     my $id = $reader->getAttribute( 'id' ); #works fine
>
>     my $doc = $reader->copyCurrentNode(1);
>     my $timestamp = $doc->getElementsByTagName( 'Timestamp' );
>     my @lexem_text = $doc->getElementsByTagName( 'Form' );
> }
>
> [/CODE]
>
> Note that you can also use XPath for looking up XML information.
>
> Regards,
>
>         Shlomi Fish
>
>
> --
> -----------------------------------------------------------------
> Shlomi Fish       http://www.shlomifish.org/
> List of Text Processing Tools - http://shlom.in/text-proc
>
> Sophie: Let’s suppose you have a table with 2^n cups…
> Jack: Wait a second! Is ‘n’ a natural number?
>
> Please reply to list if it's a mailing list post - http://shlom.in/reply .
>
A little late, I know, but still...

Last year I was asked to process a large number of XML files: 2x 1.6M files
that needed to be compared element by element and, allowing for some fuzzy
logic, had to come out the same. Things like floating point precision could
change (1.00 = 1), and in some cases data could show up in a different order
(repeating elements for multiple items on an order). The whole idea was that
system A, which took flat text output from a mainframe and translated it to
XML for consumption by a web service, was being replaced by system B, which
did the same thing but on an entirely different software stack.
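As a rough illustration of that fuzzy value comparison (a hypothetical
sketch, not the project code; the only rule shown is the numeric one
mentioned above):

[CODE]

use strict;
use warnings;

use Scalar::Util qw(looks_like_number);

# Compare two text values the fuzzy way: numeric values are compared
# numerically (so '1.00' equals '1'), everything else as plain strings.
sub values_match {
    my ( $left, $right ) = @_;
    if ( looks_like_number($left) && looks_like_number($right) ) {
        return $left == $right;    # 1.00 == 1
    }
    return $left eq $right;
}

print values_match( '1.00', '1' )  ? "same\n" : "differ\n";    # same
print values_match( 'foo', 'bar' ) ? "same\n" : "differ\n";    # differ

[/CODE]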

Of course this needed to go as fast as possible, as we simply could not sit
around for a few days while the computer did its thing. LibXML was my
saviour, and using XPath was the fastest solution. Though it is possible to
do the DOM thing, you end up with the DOM calls being translated to XPath
under the hood (at least the performance seemed to indicate that). After a
lot of testing, and after trying pretty much every XML parser I could find,
LibXML with XPath really was the fastest.
If you are going for speed you will want to avoid every copy operation you
can, and use references as much as possible. Even though a memory copy of
some 100 bytes is a very fast operation, over a few million files the little
time it takes adds up to a lot longer than you would like.
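A minimal sketch of the XPath approach, assuming the same Lexems.xml layout
as the Reader code quoted above (Lexem elements carrying an id attribute,
with Timestamp and Form children):

[CODE]

use strict;
use warnings;

use XML::LibXML;

my $xml = 'Lexems.xml';
my $doc = XML::LibXML->load_xml( location => $xml );

# One XPath query pulls out every Lexem; the relative queries then walk
# into each node without copying subtrees around.
for my $lexem ( $doc->findnodes('//Lexem') ) {
    my $id      = $lexem->getAttribute('id');
    my ($stamp) = $lexem->findnodes('./Timestamp');
    my @forms   = $lexem->findnodes('./Form');
    printf "%s: %s (%d Form elements)\n",
        $id, ( $stamp ? $stamp->textContent : '?' ), scalar @forms;
}

[/CODE]

Note that unlike the Reader, load_xml pulls the whole file into memory,
which is fine for millions of small files but not for one huge one.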

When you are looking at speed, first and foremost try to avoid anything that
would slow you down. A copy of information is slow, so don't do it if you
can avoid it. A reference to a memory location is slightly harder to work
with in programming, but a lot faster. A translation from DOM calls to XPath
would take you time to do by hand, and the computer needs that same time at
run time; if it is pure speed you are after, avoid this as well.
Once you are sure you are as fast as you can be, add a benchmark to the code
and try individual optimisations that may or may not be faster... you would
be surprised how the perl internals are sometimes a lot faster with some
operations than with others, even when your gut feeling says otherwise.
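Perl's core Benchmark module makes that kind of head-to-head measurement
cheap to write. The two candidates below are placeholders, not anything from
the project; the point is the harness:

[CODE]

use strict;
use warnings;

use Benchmark qw(cmpthese);

my $text = 'The quick brown fox jumps over the lazy dog ' x 100;

# Two ways to count the letter 'o' in a string; which one wins is
# exactly the kind of thing you measure instead of guessing.
cmpthese( -2, {    # run each candidate for about 2 CPU seconds
    tr_count    => sub { my $n = ( $text =~ tr/o// ) },
    match_count => sub { my $n = () = ( $text =~ /o/g ) },
} );

[/CODE]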

In my case, as it was a once-in-every-25-years kind of major change, I
didn't do too much benchmarking, since the code would be discarded at the
end of the project (well, stored in a dusty old SVN repository for others to
reuse and, realistically, never looked at again). I got it fast enough for a
regular daily run of 1.6M files for as long as the project needed, enough to
feel comfortable with the tests before going to production. But if your code
is to live a longer life, more benchmarking becomes worth the effort with
every few additional months that the code is expected to be around and in
regular use.

Regards,

Rob
