bijeshn wrote:
> On Apr 2, 5:37 pm, Chris <[EMAIL PROTECTED]> wrote:
>> [EMAIL PROTECTED] wrote:
>>> Hi all,
>>>
>>> i have an XML file with the following structure::
>>>
>>> <r1>
>>> <r2>-----|
>>> <r3>     |
>>> <r4>     |
>>> .        |
>>> .        | --------------------> constitutes one record.
>>> .        |
>>> .        |
>>> .        |
>>> </r4>    |
>>> </r3>    |
>>> </r2>----|
>>> <r2>
>>> .
>>> .
>>> .  -----------------------|
>>> .                         |
>>> .                         |
>>> .                         |----------------------> there are n records in between....
>>> .                         |
>>> .                         |
>>> .                         |
>>> .  -----------------------|
>>> .
>>> .
>>> </r2>
>>> <r2>-----|
>>> <r3>     |
>>> <r4>     |
>>> .        |
>>> .        | --------------------> constitutes one record.
>>> .        |
>>> .        |
>>> .        |
>>> </r4>    |
>>> </r3>    |
>>> </r2>----|
>>> </r1>
>>>
>>> Here <r1> is the main root tag of the XML, and <r2>...</r2>
>>> constitutes one record. What I would like to do is
>>> to extract everything (xml tags and data) between nth <r2> tag
>>> and (n+k)th <r2> tag. The extracted data is to be
>>> written down to a separate file.
>>>
>>> Thanks...
>>
>> You could create a generator expression out of it:
>>
>> txt = """<r1>
>> <r2><r3><r4>1</r4></r3></r2>
>> <r2><r3><r4>2</r4></r3></r2>
>> <r2><r3><r4>3</r4></r3></r2>
>> <r2><r3><r4>4</r4></r3></r2>
>> <r2><r3><r4>5</r4></r3></r2>
>> </r1>
>> """
>> l = len(txt.split('r2>'))-1
>> a = ('<r2>%sr2>'%i for j,i in enumerate(txt.split('r2>')) if 0 < j < l
>> and i.replace('>','').replace('<','').strip())
>>
>> Now you have a generator you can iterate through with a.next() or
>> alternatively you could just create a list out of it by replacing the
>> outer parens with square brackets.
>
> Hmmm... will look into it.. Thanks
>
> the XML file is almost a TB in size...
>

Good grief. When will people stop abusing XML this way?

> so SAX will have to be the parser.... i'm thinking of doing something
> to split the file using SAX
> ... Any suggestions on those lines..? If there are any other parsers
> suitable, please suggest...

You could try pulldom, but the documentation is disgraceful.

ElementTree.iterparse *might* help.

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/
-- 
http://mail.python.org/mailman/listinfo/python-list
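
For what it's worth, here is a minimal sketch of how the ElementTree.iterparse suggestion above might look for this job. It is only an illustration under a few assumptions: the function name extract_records and its parameters are made up for this example, records are counted from 0, "n to n+k" is read as "k records starting at record n", the root tag is assumed to carry no attributes worth keeping, and the output is wrapped in the root tag again so that it stays well-formed. It has not been tried on anything near a terabyte.

```python
import xml.etree.ElementTree as ET


def extract_records(source, dest, start, count, record_tag="r2", root_tag="r1"):
    """Copy `count` records, beginning at 0-based record `start`, to dest.

    source: binary file object holding the XML (hypothetical big input)
    dest:   binary file object that receives the extracted records
    Returns the number of records actually written.
    """
    # Request "start" events as well, so the root element can be grabbed
    # first and the nesting depth can be tracked.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root tag

    dest.write(("<%s>\n" % root_tag).encode("ascii"))
    depth = 0    # nesting depth below the root element
    seen = 0     # complete records encountered so far
    written = 0  # records copied to dest
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # Only a direct child of the root counts as a record.
        if depth == 0 and elem.tag == record_tag:
            if start <= seen < start + count:
                elem.tail = None  # tail may be incomplete at this point
                dest.write(ET.tostring(elem) + b"\n")
                written += 1
            seen += 1
            root.clear()  # detach finished records so memory stays flat
            if seen >= start + count:
                break     # no need to parse the rest of the file
    dest.write(("</%s>\n" % root_tag).encode("ascii"))
    return written

# Possible usage (file names are placeholders):
#   with open("big.xml", "rb") as src, open("part.xml", "wb") as dst:
#       extract_records(src, dst, start=1000, count=500)
```

The parser still has to read through the first n records before it reaches the interesting ones, so this does not make the file any smaller, but it should avoid building the whole tree in memory, and it stops reading as soon as the last wanted record has been written.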