On Apr 3, 8:51 am, Steve Holden <[EMAIL PROTECTED]> wrote:
> bijeshn wrote:
> > On Apr 2, 5:37 pm, Chris <[EMAIL PROTECTED]> wrote:
> >> [EMAIL PROTECTED] wrote:
> >>> Hi all,
> >>> i have an XML file with the following structure::
> >>> <r1>
> >>> <r2>-----|
> >>> <r3>     |
> >>> <r4>     |
> >>> .        |
> >>> .        | --------------------> constitutes one record.
> >>> .        |
> >>> .        |
> >>> .        |
> >>> </r4>    |
> >>> </r3>    |
> >>> </r2>----|
> >>> <r2>
> >>> .
> >>> .
> >>> . -----------------------|
> >>> .                        |
> >>> .                        |
> >>> .                        |----------------------> there are n records in between....
> >>> .                        |
> >>> .                        |
> >>> .                        |
> >>> . -----------------------|
> >>> .
> >>> .
> >>> </r2>
> >>> <r2>-----|
> >>> <r3>     |
> >>> <r4>     |
> >>> .        |
> >>> .        | --------------------> constitutes one record.
> >>> .        |
> >>> .        |
> >>> .        |
> >>> </r4>    |
> >>> </r3>    |
> >>> </r2>----|
> >>> </r1>
> >>> Here <r1> is the main root tag of the XML, and <r2>...</r2>
> >>> constitutes one record. What I would like to do is
> >>> to extract everything (xml tags and data) between nth <r2> tag and
> >>> (n+k)th <r2> tag. The extracted data is to be
> >>> written down to a separate file.
> >>> Thanks...
> >> You could create a generator expression out of it:
>
> >> txt = """<r1>
> >> <r2><r3><r4>1</r4></r3></r2>
> >> <r2><r3><r4>2</r4></r3></r2>
> >> <r2><r3><r4>3</r4></r3></r2>
> >> <r2><r3><r4>4</r4></r3></r2>
> >> <r2><r3><r4>5</r4></r3></r2>
> >> </r1>
> >> """
> >> l = len(txt.split('r2>'))-1
> >> a = ('<r2>%sr2>'%i for j,i in enumerate(txt.split('r2>')) if 0 < j < l
> >> and i.replace('>','').replace('<','').strip())
>
> >> Now you have a generator you can iterate through with a.next() or
> >> alternatively you could just create a list out of it by replacing the
> >> outer parens with square brackets.
>
> > Hmmm... will look into it.. Thanks
>
> > the XML file is almost a TB in size...
>
> Good grief. When will people stop abusing XML this way?
>
> > so SAX will have to be the parser.... i'm thinking of doing something
> > to split the file using SAX
> > ... Any suggestions on those lines..? If there are any other parsers
> > suitable, please suggest...
>
> You could try pulldom, but the documentation is disgraceful.
>
> ElementTree.iterparse *might* help.
>
> regards
>  Steve
>
> --
> Steve Holden        +1 571 484 6266   +1 800 494 3119
> Holden Web LLC              http://www.holdenweb.com/
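For what it's worth, the ElementTree.iterparse route suggested above could look roughly like the sketch below. It is untested against anything close to a TB and rests on a few assumptions of mine: Python 3 and the standard-library xml.etree.ElementTree, <r2> records that are direct children of <r1> and never nested inside each other, no XML namespaces, and a 0-based record index. The name extract_records and its parameters are made up for illustration.

```python
import xml.etree.ElementTree as ET


def extract_records(source, out, start, count, record_tag="r2", root_tag="r1"):
    """Stream `source` and copy `count` records, beginning with record
    number `start` (0-based), to the text stream `out`.

    `source` is a filename or a binary file object; `out` is any object
    with a write() method that accepts str. Returns the number written.
    """
    seen = 0      # records encountered so far
    written = 0   # records copied to `out`

    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element

    out.write("<%s>\n" % root_tag)
    for event, elem in context:
        # Only act once a whole record has been read.
        if event != "end" or elem.tag != record_tag:
            continue
        if start <= seen < start + count:
            elem.tail = "\n"  # one record per line in the output
            out.write(ET.tostring(elem, encoding="unicode"))
            written += 1
        seen += 1
        # Drop everything parsed so far so memory use stays bounded.
        root.clear()
        if seen >= start + count:
            break  # no need to read the rest of the file
    out.write("</%s>\n" % root_tag)
    return written
```

Something like extract_records("big.xml", open("part.xml", "w", encoding="utf-8"), n, k) would then write records n through n+k-1 into part.xml (both file names are placeholders). The root.clear() call is what keeps the tree from growing with the input; without it iterparse still builds the whole document in memory.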
I abuse it because I can (and because I don't generally work with XML
files larger than 20-30 MB) :)
And the OP never said the XML file was 1 TB in size, which makes things
different.
--
http://mail.python.org/mailman/listinfo/python-list