On 2010-12-20, spaceman-spiff <ashish.mak...@gmail.com> wrote:

> 0. Goal: I am looking for a specific element..there are several 10s/100s
> occurrences of that element in the 1gb xml file. The contents of the xml
> is just a dump of config parameters from a packet switch (although imho,
> the contents of the xml dont matter)
Then you need:

1. To detect whenever you move inside the type of element you are seeking
and whenever you move out of it. As long as these elements cannot be
nested inside each other, this is a simple binary flag. If they can be
nested, you will need to maintain some kind of level count, or recursively
decompose each level.

2. Once you have obtained a complete element (from its start tag to its
end tag), to test whether it is the single correct element you are
looking for.

Something like this (untested) will work if the target tag cannot be
nested inside another target tag:

    import xml.sax

    class TagSearcher(xml.sax.ContentHandler):

        def startDocument(self):
            self.inTarget = False

        def startElement(self, name, attrs):
            if name == targetName:
                self.inTarget = True
            elif self.inTarget:
                pass  # save element information

        def endElement(self, name):
            if name == targetName:
                self.inTarget = False
                # Test the saved information to see if you have the
                # one you want:
                #
                # If it's the piece you are looking for, process the
                # information you have saved.
                #
                # If not, discard the accumulated information and
                # move on.

        def characters(self, content):
            if self.inTarget:
                pass  # save the content

    yourHandler = TagSearcher()
    yourParser = xml.sax.make_parser()
    yourParser.parse(inputXML, yourHandler)

Then you just walk through the document, picking up and discarding each
target element in turn, until you have the one you are looking for.

> I need to detect them & then for each 1, i need to copy all the content
> b/w the element's start & end tags & create a smaller xml file.

Easy enough; but with SAX you will have to recreate the tags from the
information they contain, because tags are not passed through the
characters() events; so you will need to save the information from each
tag as you come across it. This could probably be done more automatically
using saxutils.XMLGenerator, but I haven't actually worked with it
before.
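For what it's worth, here is a minimal, self-contained sketch of that
XMLGenerator idea (untested against a 1 GB file; the element name 'param'
and the tiny inline document are just made-up stand-ins for your switch
config). It echoes every event between a target's start and end tags into
an XMLGenerator, so each occurrence comes back out as its own well-formed
XML fragment that you could write to a smaller file. A depth counter
stands in for the simple boolean so nested targets don't break it:

    import io
    import xml.sax
    from xml.sax.saxutils import XMLGenerator

    class ElementExtractor(xml.sax.ContentHandler):
        """Copy each <target> element, tags included, into its own buffer."""

        def __init__(self, target):
            super().__init__()
            self.target = target
            self.depth = 0        # >0 while inside a target element
            self.buffer = None
            self.generator = None
            self.results = []     # one XML string per extracted element

        def startElement(self, name, attrs):
            if name == self.target:
                if self.depth == 0:
                    # New occurrence: start a fresh buffer and writer.
                    self.buffer = io.StringIO()
                    self.generator = XMLGenerator(self.buffer)
                self.depth += 1
            if self.depth:
                # Re-emit the tag we would otherwise lose.
                self.generator.startElement(name, attrs)

        def characters(self, content):
            if self.depth:
                self.generator.characters(content)

        def endElement(self, name):
            if self.depth:
                self.generator.endElement(name)
            if name == self.target:
                self.depth -= 1
                if self.depth == 0:
                    self.results.append(self.buffer.getvalue())

    handler = ElementExtractor('param')
    xml.sax.parseString(
        b'<cfg><param id="1">a</param><x/><param id="2">b</param></cfg>',
        handler)
    for fragment in handler.results:
        print(fragment)

Each entry in handler.results is a complete start-tag-to-end-tag copy, so
"creating a smaller xml file" is just a matter of writing one entry out,
optionally with an XML declaration in front of it.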
xml.dom.pulldom also looks interesting.

> 1. Can you point me to some examples/samples of using SAX, especially
> ones dealing with really large XML files.

There is nothing special about large files with SAX. SAX is very simple:
it walks through the document and calls the functions you give it as it
reaches the various events. Your callback functions (methods of a
handler) do everything with the information; SAX does nothing more than
call your functions. There are events for reaching a start tag, an end
tag, and the characters between tags, as well as some for beginning and
ending a document.

> 2. This brings me to another q. which i forgot to ask in my OP (original
> post). Is simply opening the file, & using reg ex to look for the element
> i need, a *good* approach? While researching my problem, some article
> seemed to advise against this, especially since its known apriori that
> the file is an xml & since regex code gets complicated very quickly &
> is not very readable.
>
> But is that just a "style"/"elegance" issue, & for my particular problem
> (detecting a certain element, & then creating (writing) a smaller xml
> file corresponding to each pair of start & end tags of said element),
> is the open file & regex approach something you would recommend?

It isn't an invalid approach if it works for your situation; I have used
it before for very simple problems. The thing is, XML is a context-free
data format, which makes it difficult to write precise regular
expressions for it, especially where tags of the same type can be
nested. It can be very error prone: it's really easy to have a regex
work for your tests and then fail, either by matching too much or by
failing to match, because you didn't anticipate a given piece of data.
I wouldn't consider it a robust solution.

--
http://mail.python.org/mailman/listinfo/python-list