recycling internationalized garbage

aaronwmail-usenet Wed, 08 Mar 2006 06:25:46 -0800

Hi folks,

Please help me with international string issues:
I put together an AJAX discography search engine


http://www.xfeedme.com/discs/discography.html

using data from the FreeDB music database

http://www.freedb.org/

Unfortunately FreeDB has a lot of junk in it, including
randomly mixed character encodings for international
strings.  As an expedient I decided to just delete all
characters that weren't ASCII, so I could get the thing
running.  Now I look through the log files and notice that
a certain category of user immediately homes in on this
and finds it amusing to see how badly I've mangled
the strings :(.  I presume they chuckle and make
disparaging remarks about "united states of ascii"
and then leave never to return.
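
To make the problem concrete, here is a minimal sketch of what that
expedient amounts to. It assumes Python 3 and byte-string input; the
helper name strip_non_ascii is made up for illustration and is not the
code the site actually runs.

```python
def strip_non_ascii(raw: bytes) -> str:
    """Drop every byte outside 7-bit ASCII (hypothetical helper)."""
    return raw.decode("ascii", errors="ignore")


# An accented letter simply vanishes, whichever encoding it arrived in.
print(strip_non_ascii("Björk".encode("latin-1")))  # prints "Bjrk"
print(strip_non_ascii("Björk".encode("utf-8")))    # prints "Bjrk"
```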

Question: what is a good strategy for taking an 8-bit
string of unknown encoding and recovering the largest
amount of reasonable information from it (translated to
UTF-8 if needed)?  The string might be in any of the
myriad encodings that predate Unicode.  Has anyone
done this in Python already?  The output must be clean
UTF-8 suitable for arbitrary XML parsers.
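
One possible shape for such a strategy, offered only as a rough sketch
under stated assumptions: try strict UTF-8 first, then a short list of
single-byte fallbacks, and finally strip whatever XML 1.0 forbids. The
candidate list (utf-8, then cp1252, then latin-1) is a guess rather than
anything known about the FreeDB data, and the function names are
hypothetical. Python 3 is assumed.

```python
import re

# Assumed candidate order; strict UTF-8 goes first because random
# legacy-encoded bytes rarely happen to form valid UTF-8.
CANDIDATES = ("utf-8", "cp1252")

# Matches any character outside the set XML 1.0 permits in a document.
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def decode_guess(raw: bytes) -> str:
    """Return the first strict decode that succeeds (hypothetical helper)."""
    for encoding in CANDIDATES:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            pass
    # latin-1 maps every byte to a code point, so this never fails.
    return raw.decode("latin-1")


def to_clean_utf8(raw: bytes) -> bytes:
    """Decode by guesswork, drop XML-illegal characters, re-encode as UTF-8."""
    return _XML_ILLEGAL.sub("", decode_guess(raw)).encode("utf-8")


print(to_clean_utf8("Björk".encode("latin-1")).decode("utf-8"))  # prints "Björk"
```

A sketch like this will silently mis-decode multi-byte legacy encodings
as cp1252, so it only addresses the Western European part of the
problem; anything beyond that would need real encoding detection.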

Thanks,  -- Aaron Watters

===

As someone once remarked to Schubert
"take me to your lieder" (sorry about that).
   -- Tom Lehrer

-- 
http://mail.python.org/mailman/listinfo/python-list
