On 2011-11-29, Stefan Behnel wrote:
> Adam Funk, 29.11.2011 13:57:
>> On 2011-11-28, Stefan Behnel wrote:
>>> If the name "big_json" is supposed to hint at a large set of data, you may
>>> want to use something other than minidom. Take a look at the
>>> xml.etree.cElementTree module instead, which is substantially more memory
>>> efficient.
>>
>> Well, the input file in this case contains one big JSON list of
>> reasonably sized elements, each of which I'm turning into a separate
>> XML file. The output files range from 600 to 6000 bytes.
>
> It's also substantially easier to use, but if your XML writing code works
> already, why change it.

That module looks useful --- thanks for the tip. (TBH, I'm using
minidom mainly because I've used it before and the API is similar to
the DOM APIs I've used in other languages.)

> You should read up on Unicode a bit.

It wouldn't do me any harm. :-)

>>>> I thought this would force all the output to be valid, but xmlstarlet
>>>> gives some errors like these on a few documents:
>>>>
>>>> PCDATA invalid Char value 7
>>>> PCDATA invalid Char value 31
>>>
>>> This strongly hints at a broken encoding, which can easily be triggered by
>>> your erroneous encode-and-encode cycles above.
>>
>> No, I've checked the JSON input and those exact control characters are
>> there too.
>
> Ah, right, I didn't look closely enough. Those are forbidden in XML:
>
> http://www.w3.org/TR/REC-xml/#charsets
>
> It's sad that minidom (apparently) lets them pass through without even a
> warning.

Yes, it is! I've now found this, which seems to fix the problem:

http://bitkickers.blogspot.com/2011/05/stripping-control-characters-in-python.html
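
For anyone who hits the same xmlstarlet errors, here is a minimal sketch of the general idea: drop every character that falls outside the XML 1.0 "Char" production before the text reaches minidom. This is not necessarily what the linked post does. It assumes Python 3 `str` input, and the names `strip_invalid_xml_chars` and `_INVALID_XML_CHARS` are made up for illustration.

```python
import re

# Everything NOT allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# (hypothetical helper name; assumes Python 3 text strings)
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)


# Char values 7 (BEL) and 31 (US) are removed; tab and newline are kept.
cleaned = strip_invalid_xml_chars("a\x07b\x1fc\td\n")
```

Applying it to each string taken from the JSON input, before creating text nodes or attributes, should keep those control characters out of the output files. Whether silently dropping them is acceptable depends on the data; replacing them with a visible placeholder would be an equally reasonable choice.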