On Mon, Oct 29, 2012 at 9:18 AM, Shlomi Fish <shlo...@shlomifish.org> wrote:
> On Mon, 29 Oct 2012 10:09:53 +0200
> Shlomi Fish <shlo...@shlomifish.org> wrote:
>
> > Hi Octavian,
> >
> > On Sun, 28 Oct 2012 17:45:15 +0200
> > "Octavian Rasnita" <orasn...@gmail.com> wrote:
> >
> > > From: "Shlomi Fish" <shlo...@shlomifish.org>
> > > > Hi Octavian,
> > >
> > > Hi Shlomi,
> > >
> > > I tried to use XML::LibXML::Reader, which uses the pull parser, and I
> > > read that:
> > >
> > > ""
> > > However, it is also possible to mix Reader with DOM. At every point the
> > > user may copy the current node (optionally expanded into a complete
> > > sub-tree) from the processed document to another DOM tree, or to
> > > instruct the Reader to collect sub-document in form of a DOM tree
> > > ""
> > >
> > > So I tried:
> > >
> > > use XML::LibXML::Reader;
> > >
> > > my $xml = 'path/to/xml/file.xml';
> > >
> > > my $reader = XML::LibXML::Reader->new( location => $xml )
> > >     or die "cannot read $xml";
> > >
> > > while ( $reader->nextElement( 'Lexem' ) ) {
> > >     my $id = $reader->getAttribute( 'id' ); # works fine
> > >
> > >     my $doc = $reader->document;
> > >
> > >     my $timestamp  = $doc->getElementsByTagName( 'Timestamp' ); # doesn't work well
> > >     my @lexem_text = $doc->getElementsByTagName( 'Form' );      # doesn't work fine
> > > }
> >
> > I'm not sure you should do ->document. I cannot tell you off-hand how to
> > do it right, but I can try to investigate when I have some spare cycles.
>
> OK, after a short amount of investigation, I found that this program works:
>
> [CODE]
>
> use strict;
> use warnings;
>
> use XML::LibXML::Reader;
>
> my $xml = 'Lexems.xml';
>
> my $reader = XML::LibXML::Reader->new( location => $xml )
>     or die "cannot read $xml";
>
> while ( $reader->nextElement( 'Lexem' ) ) {
>     my $id = $reader->getAttribute( 'id' ); # works fine
>
>     # copyCurrentNode(1) copies the current node expanded into a
>     # complete sub-tree, so DOM lookups work on the copy.
>     my $doc = $reader->copyCurrentNode(1);
>     my $timestamp  = $doc->getElementsByTagName( 'Timestamp' );
>     my @lexem_text = $doc->getElementsByTagName( 'Form' );
> }
>
> [/CODE]
>
> Note that you can also use XPath for looking up XML information.
>
> Regards,
>
>     Shlomi Fish

A little late, I know, but still... Last year I was asked to process a large
number of XML files: 2x 1.6M files that had to be compared element by element
and, with some fuzzy logic applied, had to be the same. Things like
floating-point formatting could change (1.00 = 1), and in some cases data
could show up in a different order (repeating elements for multiple items on
an order). The background: system A, which took flat text output from a
mainframe and translated it to XML for consumption by a web service, was being
replaced by system B, which did the same thing on an entirely different
software stack. Of course, this needed to go as fast as possible, as we simply
could not sit around for a few days while the computer did its thing. LibXML
was my saviour, and using XPath was the fastest solution.
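To make that concrete, here is a minimal sketch of the XPath approach with
XML::LibXML. The file names, XPath expressions, and fields are invented for
illustration; they are not from the actual project:

[CODE]

use strict;
use warnings;

use Scalar::Util qw(looks_like_number);
use XML::LibXML;

# Hypothetical file names - substitute the real output of both systems.
my $doc_a = XML::LibXML->load_xml( location => 'system_a/order.xml' );
my $doc_b = XML::LibXML->load_xml( location => 'system_b/order.xml' );

# One XPath query per field; findvalue() returns the matched text content.
for my $path ( '/Order/@id', '/Order/Total', '/Order/Customer/Name' ) {
    my $val_a = $doc_a->findvalue($path);
    my $val_b = $doc_b->findvalue($path);

    # Compare numerically where possible, so 1.00 and 1 count as equal.
    my $same = ( looks_like_number($val_a) && looks_like_number($val_b) )
             ? $val_a == $val_b
             : $val_a eq $val_b;

    print "mismatch at $path: '$val_a' vs '$val_b'\n" unless $same;
}

[/CODE]

The speed comes from findvalue() evaluating the XPath expression inside
libxml2 itself instead of walking the tree from Perl.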
Though it is possible to do the DOM thing, you end up with the DOM access
being translated to XPath under the hood (at least the performance seemed to
indicate that). After a lot of testing with pretty much every XML parser I
could find, LibXML with XPath really was the fastest.

If you are going for speed, you will want to avoid every copy operation you
can and use references as much as possible. Even though a memory copy of some
100 bytes is a very fast operation, over a few million files the little time
it takes adds up to a lot longer than you would like. When speed comes first
and foremost, avoid anything that slows you down: a copy of the data is slow,
so don't make one if you can avoid it. A reference to a memory location is
slightly harder to work with, but a lot faster. A translation from DOM to
XPath takes you time to write, and it costs the computer time as well; if it
is pure speed you are after, avoid that too.

Once you are sure you are as fast as you can be, add a benchmark to the code
(see the P.S. below for a minimal sketch) and try individual optimisations
that might or might not be faster... you would be surprised how the Perl
internals are sometimes a lot faster with some operations than with others,
even when your gut feeling says otherwise.

In my case, as it was a once-in-25-years kind of major change, I didn't do
too much benchmarking, since the code would be discarded at the end of the
project (well, stored in a dusty old SVN repository for others to reuse and,
realistically, never looked at again). I got it fast enough to run the full
1.6M files daily for as long as the project needed to feel comfortable with
the tests before going to production. But if your code is to live a longer
life, more benchmarking becomes worth the effort with every few additional
months that the code is expected to be around and in regular use.

Regards,

Rob
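P.S. For the benchmarking itself, the core Benchmark module is enough. A
minimal, made-up sketch comparing a copy against a reference (the payload is
invented; measure your own data):

[CODE]

use strict;
use warnings;

use Benchmark qw(cmpthese);

my $payload = 'x' x 100_000;    # stand-in for one XML record

# A negative count means "run each sub for about that many CPU seconds".
cmpthese( -3, {
    copy => sub { my $data = $payload;  length $data;  },
    ref  => sub { my $data = \$payload; length $$data; },
} );

[/CODE]
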