New submission from Brendan O'Connor:

(This is Python 2.7, so I'm using string vs. unicode terminology.)

When I use ElementTree.fromstring() and read the .text field on nodes, the value is usually a string object, but in rare cases it's a unicode object. I'm parsing many XML documents of newspaper text [1]; on one subset of the data, out of 5 million nodes, ~200 of them have a unicode object for the .text field.

I think this is all related to http://bugs.python.org/issue11033 but I can't figure out how, exactly. I'm passing in strings to ElementTree.fromstring() like you're supposed to.

The workaround is to defensively convert the .text value to unicode [3].

[1] data is http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2012T21
[2] my processing code is https://github.com/brendano/gigaword_conversion/blob/master/annogw2justsent.py
[3]
def convert_to_unicode(mystr):
    if isinstance(mystr, unicode): return mystr
    if isinstance(mystr, str): return mystr.decode('utf8')

----------
messages: 191496
nosy: Brendan.OConnor
priority: normal
severity: normal
status: open
title: ElementTree.fromstring non-deterministically gives unicode text data
type: behavior
versions: Python 2.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18268>
_______________________________________
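
For illustration, the defensive conversion described in the report could be applied to every node right after parsing. This is only a sketch, not the reporter's actual code: the names `to_text` and `iter_texts` are hypothetical, the input is assumed to be UTF-8 encoded, and the `text_type` switch is there only so the same snippet also runs on Python 3, where .text is always a text object.

```python
import sys
import xml.etree.ElementTree as ET

# Text type: `unicode` on Python 2, `str` on Python 3.
# The name `unicode` is only evaluated when running on Python 2.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def to_text(value, encoding="utf8"):
    """Return `value` as a text object (hypothetical helper).

    None (a node without text) and values that are already text pass
    through unchanged; byte strings are decoded, assuming UTF-8.
    """
    if value is None or isinstance(value, text_type):
        return value
    return value.decode(encoding)


def iter_texts(xml_bytes):
    """Yield (tag, text) for every node, with text normalized (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    for node in root.iter():
        yield node.tag, to_text(node.text)


if __name__ == "__main__":
    sample = b"<doc><p>caf\xc3\xa9</p><p>plain</p></doc>"
    for tag, text in iter_texts(sample):
        print("%s %r" % (tag, text))
```

Unlike the two-branch version in [3], this variant passes None through explicitly, since nodes without text have .text set to None.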