From: "Beginner" <[EMAIL PROTECTED]>
> I have a huge XML file, 1.7GB, 53080215 lines. I am trying to extract 
> an attribute from each record (code=). I have several problems, one 
> of which is that the size of the file makes it painful to test my 
> scripts and methods for parsing.
> 
> I would like to extract a few hundred records (by any means) so I can 
> experiment.  I think XPath is the way to go here. The file 
> (currently) sits on a *nix system but I was going to do the parsing 
> on a Win32 workstation rather than steal all the memory on a 
> server.

If all you want (for now) is a smaller file for testing, and the 
file structure is not complex (i.e. the tag immediately below the 
root is repeated many times and there is no "footer" you'd need to 
keep), you can take the first N megabytes, search for the last 
closing tag, throw away everything after that tag and append the 
closing root tag.
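
Something like this minimal sketch, assuming the records are 
<record> elements under a single <root> element (both names are 
guesses, adjust them to match your file):

  use strict;
  use warnings;

  my $bytes = 10 * 1024 * 1024;   # how much of the file to keep

  open my $in, '<', 'big.xml' or die "big.xml: $!";
  read $in, my $chunk, $bytes;
  close $in;

  # throw away the partial record at the end ...
  my $pos = rindex $chunk, '</record>';
  die "no </record> in the first $bytes bytes\n" if $pos < 0;
  $chunk = substr $chunk, 0, $pos + length '</record>';

  # ... and close the document again so it parses on its own
  open my $out, '>', 'sample.xml' or die "sample.xml: $!";
  print $out $chunk, "\n</root>\n";
  close $out;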

To process the real file you'll definitely need something that 
doesn't attempt to parse the whole file and build a huge data 
structure or maze of objects; you need something stream or chunk 
oriented: XML::Twig, XML::Rules (I promise I'll release a version 
that doesn't waste memory on whitespace around the tags, if you 
don't need it, within a week, see 
http://www.perlmonks.org/?node_id=654367. The changes are done and 
seem to be working ...), some SAX-oriented module, XML::Parser, ...
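
For example, XML::Twig can call a handler for each record as it is 
parsed and then forget it, so memory use stays flat no matter how 
big the file gets. A sketch, again assuming the records are 
<record> elements carrying the code= attribute (adjust the names 
to your data):

  use strict;
  use warnings;
  use XML::Twig;

  XML::Twig->new(
      twig_handlers => {
          record => sub {
              my ($twig, $elt) = @_;
              print $elt->att('code'), "\n";
              $twig->purge;   # discard everything parsed so far
          },
      },
  )->parsefile('big.xml');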
 
And you should definitely do that processing on the same box that 
stores the file ... you do not want to keep wasting network 
bandwidth with 1.7GB if all you need is a few pieces of info.

Jenda
===== [EMAIL PROTECTED] === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed 
to get drunk and croon as much as they like.
        -- Terry Pratchett in Sourcery

