Re: Trying to parse a HUGE(1gb) xml file

2011-01-13 Thread Aahz
In article , Stefan Behnel wrote: > Try > > import xml.etree.cElementTree as etree > > instead. Note the leading "c", which hints at the C implementation of ElementTree. It's much faster and much more memory-friendly than the Python implementation. Thanks! I updated our codebase this …
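The import suggested above can be written with a fallback; note as a caveat not from the thread: on Python 3.3+ the plain `xml.etree.ElementTree` module is already backed by the C accelerator, and the `cElementTree` alias was removed in Python 3.9, so modern code should use the fallback branch.

```python
# Prefer the explicit C accelerator where it exists (Python 2 / early 3.x);
# on modern Python 3 the plain module is C-accelerated automatically.
try:
    import xml.etree.cElementTree as etree  # removed in Python 3.9
except ImportError:
    import xml.etree.ElementTree as etree   # modern Python

# Tiny smoke test on a made-up config snippet:
root = etree.fromstring("<config><param name='mtu'>1500</param></config>")
print(root.find("param").get("name"))  # → mtu
```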

Re: Trying to parse a HUGE(1gb) xml file

2010-12-31 Thread dontcare
You should look into vtd-xml, available in C, C++, Java and C#. On Dec 20, 11:34 am, spaceman-spiff wrote: > Hi c.l.p folks > > This is a rather long post, but i wanted to include all the details & everything i have tried so far myself, so please bear with me & read the entire boringly long …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-29 Thread wingoo
Maybe you can try http://vtd-xml.sourceforge.net/

Re: Trying to parse a HUGE(1gb) xml file

2010-12-28 Thread Roy Smith
In article , "BartC" wrote: > Still, that's 27 times as much as it need be. Readability is fine, but why does the full, expanded, human-readable textual format have to be stored on disk too, and for every single instance? Well, I know the answer to that one. The particular XML feed I'm …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-28 Thread Sherm Pendley
"BartC" writes: >> Roy Smith, 28.12.2010 00:21: >>> To go back to my earlier example of FALSE > Isn't it possible for XML to define a shorter alias for these tags? Isn't there a shortcut available for … in simple examples like this (I seem to remember something like this …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-28 Thread BartC
"Stefan Behnel" wrote in message news:mailman.335.1293516506.6505.python-l...@python.org... Roy Smith, 28.12.2010 00:21: To go back to my earlier example of FALSE using 432 bits to store 1 bit of information, stuff like that doesn't happen in marked-up text documents. Most of the …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-28 Thread Adam Tauno Williams
On Tue, 2010-12-28 at 07:08 +0100, Stefan Behnel wrote: > Roy Smith, 28.12.2010 00:21: > > To go back to my earlier example of FALSE using 432 bits to store 1 bit of information, stuff like that doesn't happen in marked-up text documents. Most of the file is CDATA (do they …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-27 Thread Stefan Behnel
Alan Meyer, 28.12.2010 01:29: On 12/27/2010 4:55 PM, Stefan Behnel wrote: From my experience, SAX is only practical for very simple cases where little state is involved when extracting information from the parse events. A typical example is gathering statistics based on single tags - not a very …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-27 Thread Stefan Behnel
Alan Meyer, 28.12.2010 03:18: By the way Stefan, please don't take any of my comments as complaints. I don't. After all, this discussion is more about the general data format than the specific tools. I use lxml more and more in my work. It's fast, functional and pretty elegant. I've …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-27 Thread Stefan Behnel
Roy Smith, 28.12.2010 00:21: To go back to my earlier example of FALSE using 432 bits to store 1 bit of information, stuff like that doesn't happen in marked-up text documents. Most of the file is CDATA (do they still use that term in XML, or was that an SGML-ism only?). The markup …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-27 Thread Alan Meyer
By the way Stefan, please don't take any of my comments as complaints. I use lxml more and more in my work. It's fast, functional and pretty elegant. I've written a lot of code on a lot of projects in my 35-year career but I don't think I've written anything anywhere near as useful to …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-27 Thread Alan Meyer
On 12/27/2010 6:21 PM, Roy Smith wrote: ... In the old days, they used to say, "Nobody ever got fired for buying IBM". Relational databases have pretty much gotten to that point. That's _exactly_ the comparison I had in mind too. I once worked for a company that made a pitch to a big …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-27 Thread Alan Meyer
On 12/27/2010 4:55 PM, Stefan Behnel wrote: ... From my experience, SAX is only practical for very simple cases where little state is involved when extracting information from the parse events. A typical example is gathering statistics based on single tags - not a very common use case. Anything …
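The "statistics based on single tags" case mentioned above is roughly the only one where a SAX handler stays this small, since no parse state needs to be tracked between events. A minimal sketch (the document and tag names are made up):

```python
# Count how often each element name occurs -- the stateless SAX use case.
import xml.sax
from collections import Counter

class TagCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        # One event per opening tag; no nesting state needed for counting.
        self.counts[name] += 1

handler = TagCounter()
xml.sax.parseString(b"<a><b/><b/><c/></a>", handler)
print(dict(handler.counts))  # → {'a': 1, 'b': 2, 'c': 1}
```

Anything that must correlate a tag with its ancestors or siblings forces the handler to maintain an explicit stack, which is where SAX code starts to sprawl.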

Re: Trying to parse a HUGE(1gb) xml file

2010-12-27 Thread Roy Smith
Alan Meyer wrote: > On 12/26/2010 3:15 PM, Tim Harig wrote: > I agree with you but, as you say, it has become a de facto standard. As a result, we often need to use it unless there is some strong reason to use something else. This is certainly true. In the rarified world of Usenet, we …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-27 Thread Tim Harig
On 2010-12-27, Alan Meyer wrote: > On 12/26/2010 3:15 PM, Tim Harig wrote: > ... >> The problem is that XML has become such a de facto standard that it is used automatically, without thought, even when there are much better alternatives available. > I agree with you but, as you say, it has …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-27 Thread Adam Tauno Williams
On Mon, 2010-12-27 at 22:55 +0100, Stefan Behnel wrote: > Alan Meyer, 27.12.2010 21:40: > > On 12/21/2010 3:16 AM, Stefan Behnel wrote: > >> Adam Tauno Williams, 20.12.2010 20:49: > > ... > >>> You need to process the document as a stream of elements; aka SAX. > >> IMHO, this is the worst advice …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-27 Thread Stefan Behnel
Alan Meyer, 27.12.2010 21:40: On 12/21/2010 3:16 AM, Stefan Behnel wrote: Adam Tauno Williams, 20.12.2010 20:49: ... You need to process the document as a stream of elements; aka SAX. IMHO, this is the worst advice you can give. Why do you say that? I would have thought that using SAX …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-27 Thread Alan Meyer
On 12/26/2010 3:15 PM, Tim Harig wrote: ... The problem is that XML has become such a de facto standard that it is used automatically, without thought, even when there are much better alternatives available. I agree with you but, as you say, it has become a de facto standard. As a result, we …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-27 Thread Alan Meyer
On 12/21/2010 3:16 AM, Stefan Behnel wrote: Adam Tauno Williams, 20.12.2010 20:49: ... You need to process the document as a stream of elements; aka SAX. IMHO, this is the worst advice you can give. Why do you say that? I would have thought that using SAX in this application is an …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-26 Thread Tim Harig
On 2010-12-26, Stefan Behnel wrote: > Tim Harig, 26.12.2010 10:22: >> On 2010-12-26, Stefan Behnel wrote: >>> Tim Harig, 26.12.2010 02:05: On 2010-12-25, Nobody wrote: > On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote: >> Of course, one advantage of XML is that with so much …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-26 Thread Stefan Behnel
Tim Harig, 26.12.2010 10:22: On 2010-12-26, Stefan Behnel wrote: Tim Harig, 26.12.2010 02:05: On 2010-12-25, Nobody wrote: On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote: Of course, one advantage of XML is that with so much redundant text, it compresses well. We typically see gzip …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-26 Thread Tim Harig
On 2010-12-26, Stefan Behnel wrote: > Tim Harig, 26.12.2010 02:05: >> On 2010-12-25, Nobody wrote: >>> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote: Of course, one advantage of XML is that with so much redundant text, it compresses well. We typically see gzip compression ratios …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-26 Thread Tim Harig
On 2010-12-26, Nobody wrote: > On Sun, 26 Dec 2010 01:05:53 +, Tim Harig wrote: >>> XML is typically processed sequentially, so you don't need to create a decompressed copy of the file before you start processing it. >> Sometimes XML is processed sequentially. When the markup …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-26 Thread Stefan Behnel
Tim Harig, 26.12.2010 02:05: On 2010-12-25, Nobody wrote: On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote: Of course, one advantage of XML is that with so much redundant text, it compresses well. We typically see gzip compression ratios of 20:1. But, that just means you can archive them …
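The 20:1 figure quoted above comes from the poster's real feed; the effect is easy to reproduce. A toy illustration with a deliberately repetitive synthetic document (the tag names are invented, and a document this regular compresses far better than typical real data):

```python
# Why markup-heavy XML compresses well: the tags are pure repetition.
import gzip

doc = b"<rows>" + b"<row><flag>FALSE</flag></row>" * 1000 + b"</rows>"
packed = gzip.compress(doc)
print(f"{len(doc)} -> {len(packed)} bytes, "
      f"ratio {len(doc) / len(packed):.0f}:1")
```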

Re: Trying to parse a HUGE(1gb) xml file

2010-12-25 Thread Nobody
On Sun, 26 Dec 2010 01:05:53 +, Tim Harig wrote: >> XML is typically processed sequentially, so you don't need to create a decompressed copy of the file before you start processing it. > Sometimes XML is processed sequentially. When the markup footprint is large enough it must be. …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-25 Thread Tim Harig
On 2010-12-25, Adam Tauno Williams wrote: > On Sat, 2010-12-25 at 22:34 +, Nobody wrote: >> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote: >> XML is typically processed sequentially, so you don't need to create a >> decompressed copy of the file before you start processing it. > > Yep.
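The point above about not needing a decompressed copy on disk can be sketched as follows; the file name, tag name, and helper function are hypothetical, not from the thread:

```python
# Stream-parse a gzip-compressed XML file: gzip.open decompresses on the
# fly, and iterparse consumes the stream without loading the whole tree.
import gzip
import xml.etree.ElementTree as etree

def count_records(path, tag):
    n = 0
    with gzip.open(path, "rb") as f:
        for event, elem in etree.iterparse(f, events=("end",)):
            if elem.tag == tag:
                n += 1
            elem.clear()  # discard processed subtrees to keep memory flat
    return n
```

The same pattern works for any file-like object, so a socket or an HTTP response body can be fed to `iterparse` just as easily.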

Re: Trying to parse a HUGE(1gb) xml file

2010-12-25 Thread Tim Harig
On 2010-12-25, Nobody wrote: > On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote: >>> XML works extremely well for large datasets. > One advantage it has over many legacy formats is that there are no inherent 2^31/2^32 limitations. Many binary formats inherently cannot support files larger …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-25 Thread BartC
"Adam Tauno Williams" wrote in message news:mailman.287.1293319780.6505.python-l...@python.org... On Sat, 2010-12-25 at 22:34 +, Nobody wrote: On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote: >> XML works extremely well for large datasets. One advantage it has over many legacy formats …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-25 Thread Adam Tauno Williams
On Sat, 2010-12-25 at 22:34 +, Nobody wrote: > On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote: > >> XML works extremely well for large datasets. > One advantage it has over many legacy formats is that there are no inherent 2^31/2^32 limitations. Many binary formats inherently cannot …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-25 Thread Nobody
On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote: >> XML works extremely well for large datasets. One advantage it has over many legacy formats is that there are no inherent 2^31/2^32 limitations. Many binary formats inherently cannot support files larger than 2GiB or 4GiB due to the use of 32 …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-25 Thread Stefan Sonnenberg-Carstens
On 25.12.2010 20:41, Roy Smith wrote: In article, Adam Tauno Williams wrote: XML works extremely well for large datasets. Barf. I'll agree that there are some nice points to XML. It is portable. It is (to a certain extent) human readable, and in a pinch you can use standard text tools …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-25 Thread Roy Smith
In article , Adam Tauno Williams wrote: > XML works extremely well for large datasets. Barf. I'll agree that there are some nice points to XML. It is portable. It is (to a certain extent) human readable, and in a pinch you can use standard text tools to do ad-hoc queries (i.e. grep for …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-25 Thread Tim Harig
On 2010-12-25, Steve Holden wrote: > On 12/23/2010 4:34 PM, Stefan Sonnenberg-Carstens wrote: >> For large datasets I always have huge question marks if one says "xml". >> But I don't want to start a flame war. I would agree; but, you don't always have the choice over the data format that you …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-25 Thread Adam Tauno Williams
"Steve Holden" wrote: > On 12/23/2010 4:34 PM, Stefan Sonnenberg-Carstens wrote: >> For large datasets I always have huge question marks if one says "xml". >> But I don't want to start a flame war. > I agree people abuse the "spirit of XML" using it to transfer gigabytes of data. How so? I …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-25 Thread Stefan Behnel
Steve Holden, 25.12.2010 16:55: On 12/23/2010 4:34 PM, Stefan Sonnenberg-Carstens wrote: For large datasets I always have huge question marks if one says "xml". But I don't want to start a flame war. I agree people abuse the "spirit of XML" using it to transfer gigabytes of data. I keep …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-25 Thread Steve Holden
On 12/23/2010 4:34 PM, Stefan Sonnenberg-Carstens wrote: > For large datasets I always have huge question marks if one says "xml". > But I don't want to start a flame war. I agree people abuse the "spirit of XML" using it to transfer gigabytes of data, but what else are they to use? regards Steve

Re: Trying to parse a HUGE(1gb) xml file

2010-12-23 Thread Stefan Sonnenberg-Carstens
On 23.12.2010 21:27, Nobody wrote: On Wed, 22 Dec 2010 23:54:34 +0100, Stefan Sonnenberg-Carstens wrote: Normally (what is normal, anyway?) such files are auto-generated, and are something that has an apparent similarity with a database query result, encapsulated in xml. Most of the time the …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-23 Thread Nobody
On Wed, 22 Dec 2010 23:54:34 +0100, Stefan Sonnenberg-Carstens wrote: > Normally (what is normal, anyway?) such files are auto-generated, and are something that has an apparent similarity with a database query result, encapsulated in xml. > Most of the time the structure is the same for every "row" …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-22 Thread Stefan Sonnenberg-Carstens
On 20.12.2010 20:34, spaceman-spiff wrote: Hi c.l.p folks This is a rather long post, but I wanted to include all the details & everything I have tried so far myself, so please bear with me & read the entire boringly long post. I am trying to parse a ginormous (~1GB) xml file. 0. I am …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-22 Thread John Nagle
On 12/20/2010 12:33 PM, Adam Tauno Williams wrote: On Mon, 2010-12-20 at 12:29 -0800, spaceman-spiff wrote: I need to detect them & then for each one, I need to copy all the content between the element's start & end tags & create a smaller xml file. Yep, do that a lot; via iterparse. 1. Can you …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-21 Thread Stefan Behnel
spaceman-spiff, 20.12.2010 21:29: I am sorry I left out what exactly I am trying to do. 0. Goal: I am looking for a specific element; there are several 10s/100s of occurrences of that element in the 1GB xml file. The contents of the xml is just a dump of config parameters from a packet switch (…

Re: Trying to parse a HUGE(1gb) xml file

2010-12-21 Thread Stefan Behnel
Adam Tauno Williams, 20.12.2010 20:49: On Mon, 2010-12-20 at 11:34 -0800, spaceman-spiff wrote: This is a rather long post, but I wanted to include all the details & everything I have tried so far myself, so please bear with me & read the entire boringly long post. I am trying to parse a ginormous …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-20 Thread Tim Harig
On 2010-12-20, spaceman-spiff wrote: > 0. Goal: I am looking for a specific element; there are several 10s/100s of occurrences of that element in the 1GB xml file. The contents of the xml is just a dump of config parameters from a packet switch (although imho, the contents of the xml don't …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-20 Thread Terry Reedy
On 12/20/2010 2:49 PM, Adam Tauno Williams wrote: Yes, this is a terrible technique; most examples are crap. Yes, this is using DOM. DOM is evil and the enemy, full-stop. You're still using DOM; DOM is evil. For serial processing, DOM is superfluous superstructure. For random access …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-20 Thread Adam Tauno Williams
On Mon, 2010-12-20 at 12:29 -0800, spaceman-spiff wrote: > I need to detect them & then for each one, I need to copy all the content between the element's start & end tags & create a smaller xml file. Yep, do that a lot; via iterparse. > 1. Can you point me to some examples/samples of using SAX …
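The iterparse approach recommended here can be sketched as follows: walk the big file as a stream and write each matching element out as its own small XML document. The element name and path helper are placeholders, not taken from the original poster's switch dump:

```python
# Split every occurrence of `tag` out of a large XML file into its own
# small file, without ever holding the whole document in memory.
import xml.etree.ElementTree as etree

def split_elements(source, tag, make_path):
    count = 0
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # The subtree is complete at its "end" event; write it out.
            etree.ElementTree(elem).write(make_path(count),
                                          encoding="utf-8",
                                          xml_declaration=True)
            count += 1
            elem.clear()  # free the subtree we just wrote
    return count
```

Clearing each matched element after writing it is what keeps memory flat; without it, the growing root keeps every processed subtree alive.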

Re: Trying to parse a HUGE(1gb) xml file

2010-12-20 Thread spaceman-spiff
Hi Usenet First up, thanks for your prompt reply. I will make sure I read RFC 1855 before posting again, but right now I'm chasing a hard deadline :) I am sorry I left out what exactly I am trying to do. 0. Goal: I am looking for a specific element; there are several 10s/100s of occurrences of that …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-20 Thread Adam Tauno Williams
On Mon, 2010-12-20 at 11:34 -0800, spaceman-spiff wrote: > Hi c.l.p folks > This is a rather long post, but I wanted to include all the details & everything I have tried so far myself, so please bear with me & read the entire boringly long post. > I am trying to parse a ginormous (~1GB) xml …

Re: Trying to parse a HUGE(1gb) xml file

2010-12-20 Thread Tim Harig
[Wrapped to meet RFC 1855 Netiquette Guidelines] On 2010-12-20, spaceman-spiff wrote: > This is a rather long post, but I wanted to include all the details & everything I have tried so far myself, so please bear with me & read the entire boringly long post. > I am trying to parse a ginormous …