2009/6/10 Nick Matzke <mat...@berkeley.edu>:
> Hi all,
>
> So I'm parsing an XML file returned from a database. However, the database
> entries have occasional non-ASCII characters, and this is crashing my
> parsers.
>
> Is there some handy function out there that will schlep through a file like
> this, and do something like fix the characters that it can recognize, and
> delete those that it can't? Basically, like the BBEdit "convert to ASCII"
> menu option under "Text".
>
> I googled some on this, but nothing obvious came up that wasn't specific to
> fixing one or a few characters.
>
> Thanks!
> Nick
>
> --
> http://mail.python.org/mailman/listinfo/python-list

Hi,
depending on the circumstances, there are probably more sophisticated ways
(what does "fix the characters" mean?), but were you maybe thinking of
something like:

>>> u"aáčbüêcîßd".encode("ascii", "ignore")
'abcd'

?

It might be important to ensure that you won't lose any useful information:
where are the unexpected characters coming from, and could the problem
possibly be fixed at that source?

hth,
  vbr
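
If "fix the characters" means keeping the base letter of an accented character instead of dropping it, something along the following lines might be closer to the BBEdit behaviour. This is only a minimal sketch in Python 3 syntax; `to_ascii` is a hypothetical helper name, not a library function. It assumes that decomposing with `unicodedata.normalize("NFKD", ...)` and then discarding whatever is still non-ASCII is an acceptable definition of "fix":

```python
import unicodedata


def to_ascii(text):
    """Keep base letters of accented characters, drop the rest (hypothetical helper)."""
    # NFKD splits e.g. "á" into "a" followed by a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # "ignore" then drops the combining marks and any character
    # that has no ASCII decomposition at all (e.g. "ß").
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("aáčbüêcîßd"))  # prints: aacbuecid
```

Characters without a decomposition, such as "ß", are still silently dropped, so the question of whether any useful information gets lost applies here as well.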