On Dec 11, 4:23 pm, nnguyen <nguy...@gmail.com> wrote:
> I need expat to parse this block of xml:
>
> <datafield tag="991">
>   <subfield code="b">c-P&P</subfield>
>   <subfield code="h">LOT 3677</subfield>
>   <subfield code="m">(F)</subfield>
> </datafield>
>
> I need to parse the xml and return a dictionary that follows roughly
> the same layout as the xml. Currently the code for the class handling
> this is:
>
> class XML2Map():
>
>     def __init__(self):
>         """ """
>         self.parser = expat.ParserCreate()
>
>         self.parser.StartElementHandler = self.start_element
>         self.parser.EndElementHandler = self.end_element
>         self.parser.CharacterDataHandler = self.char_data
>
>         self.map = [] #not a dictionary
>
>         self.current_tag = ''
>         self.current_subfields = []
>         self.current_sub = ''
>         self.current_data = ''
>
>     def parse_xml(self, xml_text):
>         self.parser.Parse(xml_text, 1)
>
>     def start_element(self, name, attrs):
>         if name == 'datafield':
>             self.current_tag = attrs['tag']
>
>         elif name == 'subfield':
>             self.current_sub = attrs['code']
>
>     def char_data(self, data):
>         self.current_data = data
>
>     def end_element(self, name):
>         if name == 'subfield':
>             self.current_subfields.append([self.current_sub,
>                                            self.current_data])
>
>         elif name == 'datafield':
>             self.map.append({'tag': self.current_tag, 'subfields':
>                              self.current_subfields})
>             self.current_subfields = [] #resetting the values for next subfields
>
> Right now my problem is that when it's parsing the subfield element
> with the data "c-P&P", it's not taking the whole data, but instead
> it's breaking it into "c-P", "&", "P". i'm not an expert with expat,
> and I couldn't find a lot of information on how it handles specific
> entities.
>
> In the resulting map, instead of:
>
> {'tag': u'991', 'subfields': [[u'b', u'c-P&P'], [u'h', u'LOT 3677'],
> [u'm', u'(F)']], 'inds': [u' ', u' ']}
>
> I get this:
>
> {'tag': u'991', 'subfields': [[u'b', u'P'], [u'h', u'LOT 3677'],
> [u'm', u'(F)']], 'inds': [u' ', u' ']}
>
> In the debugger, I can see that current_data gets assigned "c-P", then
> "&", and then "P".
>
> Any ideas on any expat tricks I'm missing out on? I'm also inclined to
> try another parser that can keep the string together when there are
> entities, or at least ampersands.
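For anyone hitting the same symptom: as far as I can tell this is normal expat behaviour rather than a bug. The CharacterDataHandler is not guaranteed to receive an element's text in one call; expat may deliver it in several chunks, and an entity reference is one of the places where it splits. Assigning `self.current_data = data` therefore keeps only the last chunk ("P"). Below is a minimal sketch of the usual workaround, which is to collect the chunks and join them when the element ends. It assumes Python 3, and it assumes the real input spells the ampersand as `&amp;` (a bare `&` would not be well-formed XML). The `chunks` attribute and the `SAMPLE` string are names made up for this illustration.

```python
from xml.parsers import expat

# Hypothetical sample input; the ampersand is assumed to be escaped as &amp;.
SAMPLE = (
    '<datafield tag="991">'
    '<subfield code="b">c-P&amp;P</subfield>'
    '<subfield code="h">LOT 3677</subfield>'
    '<subfield code="m">(F)</subfield>'
    '</datafield>'
)


class XML2Map:
    def __init__(self):
        self.parser = expat.ParserCreate()
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.char_data

        self.map = []  # list of dicts, one per datafield
        self.current_tag = ''
        self.current_subfields = []
        self.current_sub = ''
        self.chunks = []  # text pieces seen since the current subfield opened

    def parse_xml(self, xml_text):
        self.parser.Parse(xml_text, True)

    def start_element(self, name, attrs):
        if name == 'datafield':
            self.current_tag = attrs['tag']
        elif name == 'subfield':
            self.current_sub = attrs['code']
            self.chunks = []  # drop any whitespace seen between elements

    def char_data(self, data):
        # May be called several times per element, so append, don't overwrite.
        self.chunks.append(data)

    def end_element(self, name):
        if name == 'subfield':
            text = ''.join(self.chunks)
            self.current_subfields.append([self.current_sub, text])
            self.chunks = []
        elif name == 'datafield':
            self.map.append({'tag': self.current_tag,
                             'subfields': self.current_subfields})
            self.current_subfields = []  # reset for the next datafield


converter = XML2Map()
converter.parse_xml(SAMPLE)
print(converter.map)
```

Setting `parser.buffer_text = True` on the pyexpat parser is another option that may reduce how often the text is split, but joining the chunks in the end handler does not depend on how expat chooses to divide the data.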
I forgot to mention: ignore the 'inds' entry in the quoted output above. It comes from another part of the XML I had to parse and isn't relevant to this discussion.

-- 
http://mail.python.org/mailman/listinfo/python-list