Hi folks,

Please help me with international string issues. I put together an AJAX discography search engine

http://www.xfeedme.com/discs/discography.html

using data from the FreeDB music database

http://www.freedb.org/

Unfortunately, FreeDB has a lot of junk in it, including randomly mixed character encodings for international strings. As an expedient, I decided to just delete all characters that weren't ASCII so I could get the thing running. Now I look through the log files and notice that a certain category of user immediately homes in on this and finds it amusing to see how badly I've mangled the strings :(. I presume they chuckle and make disparaging remarks about the "united states of ascii" and then leave, never to return.

Question: what is a good strategy for taking an 8-bit string of unknown encoding and recovering the largest amount of reasonable information from it (translated to UTF-8 if needed)? The string might be in any of the myriad encodings that predate Unicode. Has anyone done this in Python already? The output must be clean UTF-8 suitable for arbitrary XML parsers.

Thanks, -- Aaron Watters

===
As someone once remarked to Schubert "take me to your leider" (sorry about that). -- Tom Lehrer
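
To make the question concrete, here is a minimal sketch of one possible fallback strategy, not a definitive answer. It assumes Python 3 and byte-string input. The function name to_clean_utf8 and the candidate list are hypothetical, and the order of the candidates is a guess that would need tuning against the real FreeDB data. The idea: try strict UTF-8 first, then a short list of legacy encodings, fall back to Latin-1 (which accepts every byte), and finally drop any characters that XML 1.0 does not allow.

```python
import re

# Hypothetical candidate list: strict UTF-8 first, then a common
# Western legacy encoding. Order and contents are assumptions.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")

# Matches any character outside the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_XML_INVALID = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def to_clean_utf8(raw: bytes) -> bytes:
    """Best-effort conversion of bytes in an unknown encoding to UTF-8."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(enc)  # strict: raises on invalid bytes
            break
        except UnicodeDecodeError:
            continue
    else:
        # Latin-1 maps every byte to a code point, so this cannot fail,
        # although the result may be wrong for non-Western text.
        text = raw.decode("latin-1")

    # Remove characters that an XML 1.0 parser would reject.
    text = _XML_INVALID.sub("", text)
    return text.encode("utf-8")


if __name__ == "__main__":
    print(to_clean_utf8("Björk".encode("latin-1")).decode("utf-8"))
```

A fixed list like this can only guess. It will misread text in encodings that are not on the list, so a statistical detector (the chardet package is one such tool) may do better on mixed data.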