Re: recycling internationalized garbage

2006-03-16 Thread Ross Ridge
Martin v. Löwis wrote: > So "valid" yes; "meaningful" no. Therefore, for all practical > purposes, 8-bit single-byte character sets *will not* produce > byte sequences that are valid in UTF-8 (although they could - > it just won't happen). > > > In fact I can't think of any multi-byte encoding tha

Re: recycling internationalized garbage

2006-03-16 Thread Fredrik Lundh
"Martin v. Löwis" wrote: >> It should be obvious that any 8-bit single-byte character set can >> produce byte sequences that are valid in UTF-8. > > It is certainly possible to interpret UTF-8 data as if they were > in a specific single-byte encoding. However, the text you then > obtain is not mea

Re: recycling internationalized garbage

2006-03-15 Thread Martin v. Löwis
Ross Ridge wrote: > It should be obvious that any 8-bit single-byte character set can > produce byte sequences that are valid in UTF-8. It is certainly possible to interpret UTF-8 data as if they were in a specific single-byte encoding. However, the text you then obtain is not meaningful in any l
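
The distinction being argued here (bytes from a single-byte character set can be valid UTF-8, yet ordinary text in that set almost never is) can be illustrated with a small Python 3 sketch. This is an editorial example, not code from the thread: the helper name `is_valid_utf8` and both sample strings are assumptions chosen for illustration.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as UTF-8 without error (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Ordinary ISO 8859-1 text: "cafe" with an acute accent on the final letter.
# The accented letter is the single byte 0xE9, which UTF-8 treats as the start
# of a three-byte sequence, so the bytes are rejected as UTF-8.
ordinary = "caf\u00e9".encode("iso-8859-1")

# Contrived ISO 8859-1 text: capital A with tilde followed by a copyright sign.
# Its bytes are 0xC3 0xA9, which happen to form one valid UTF-8 sequence.
contrived = "\u00c3\u00a9".encode("iso-8859-1")

print(is_valid_utf8(ordinary))
print(is_valid_utf8(contrived))
```

The second string is valid in both encodings, but read as ISO 8859-1 it is a character pair that is unlikely to occur in real text, which is the sense of "valid yes, meaningful no" in this exchange.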

Re: recycling internationalized garbage

2006-03-15 Thread Fredrik Lundh
> Unless someone has any other ideas I'm > giving up now. btw, have you looked at using http://musicbrainz.org/products/server/download.html instead? they appear to guarantee UTF-8 (to the extent that *they* have managed to autodecode the FreeDB junk, of course). not sure how complete it i

Re: recycling internationalized garbage

2006-03-15 Thread Fredrik Lundh
Ross Ridge wrote: > Despite this malicious and false accusation, your post only confirms > what I wrote above is true and what Martin wrote was false. Even with > the desperate and absurd semantic game you tried to play, like falsely > equating "fairly reliably" with "reliably", in a database as

Re: recycling internationalized garbage

2006-03-15 Thread Ross Ridge
Ross Ridge wrote: > It should be obvious that any 8-bit single-byte character set can > produce byte sequences that are valid in UTF-8. Fredrik Lundh wrote: > it should be fairly obvious that you don't know much about UTF-8... Despite this malicious and false accusation, your post only confirms w

Re: recycling internationalized garbage

2006-03-15 Thread Fredrik Lundh
Martin wrote: > > The point is that you can tell UTF-8 reliably. RFC 3629 says "fairly reliably" rather than "reliably", but they mean the same thing... > > If the data decodes > > as UTF-8, it *is* UTF-8, because no other encoding in the world > > produces the same byte sequences (except for AS

Re: recycling internationalized garbage

2006-03-15 Thread Ross Ridge
Martin v. Löwis wrote: > The point is that you can tell UTF-8 reliably. If the data decodes > as UTF-8, it *is* UTF-8, because no other encoding in the world > produces the same byte sequences (except for ASCII, which is > a UTF-8 subset). It should be obvious that any 8-bit single-byte character

Re: recycling internationalized garbage

2006-03-14 Thread Serge Orlov
[EMAIL PROTECTED] wrote: > Unless someone has any other ideas I'm > giving up now. Fredrik also suggested http://chardet.feedparser.org/ which is a port of Mozilla's character detection algorithm to pure Python. It works pretty well for web pages, since I haven't seen garbled Russian text for year

Re: recycling internationalized garbage

2006-03-14 Thread Martin v. Löwis
Ross Ridge wrote:
> [EMAIL PROTECTED] wrote:
>> try:
>>     (uni, dummy) = utf8dec(s)
>> except:
>>     (uni, dummy) = iso88591dec(s, 'ignore')
>
> Is there really any point in even trying to decode with UTF-8? You
> might as well just assume ISO 8859-1.
The point is that you c

Re: recycling internationalized garbage

2006-03-14 Thread Ross Ridge
[EMAIL PROTECTED] wrote:
> try:
>     (uni, dummy) = utf8dec(s)
> except:
>     (uni, dummy) = iso88591dec(s, 'ignore')
Is there really any point in even trying to decode with UTF-8? You might as well just assume ISO 8859-1.
Ross Ridge

Re: recycling internationalized garbage

2006-03-14 Thread aaronwmail-usenet
Regarding cleaning of mixed string encodings in the discography search engine http://www.xfeedme.com/discs/discography.html
Following 's suggestion I came up with this:
    utf8enc = codecs.getencoder("utf8")
    utf8dec = codecs.getdecoder("utf8")
    iso88591dec = codecs.getdecoder("iso-8859-1")
    def chec
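
The message above is cut off in this archive view, so the original function is not reproduced here. The strategy discussed in the thread (try UTF-8 first, fall back to ISO 8859-1 if that fails) might look roughly like this in Python 3. The function name `decode_utf8_or_latin1` is an assumption, and the sketch is not the poster's code.

```python
def decode_utf8_or_latin1(raw: bytes) -> str:
    """Decode raw as UTF-8 when it is valid UTF-8, otherwise as ISO 8859-1.

    Hypothetical helper: only these two encodings are considered, so text in
    any other legacy encoding will still come out garbled.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO 8859-1 assigns a character to every byte value, so this
        # fallback cannot raise; it may still produce the wrong characters.
        return raw.decode("iso-8859-1")


# The same name encoded two ways comes back identical.
name = "L\u00f6wis"
print(decode_utf8_or_latin1(name.encode("utf-8")) == name)
print(decode_utf8_or_latin1(name.encode("iso-8859-1")) == name)
```

Catching `UnicodeDecodeError` specifically, rather than using a bare `except:`, keeps unrelated errors visible. Pure ASCII input takes the UTF-8 branch, since ASCII is a subset of UTF-8.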

Re: recycling internationalized garbage

2006-03-08 Thread Ross Ridge
[EMAIL PROTECTED] wrote: > Question: what is a good strategy for taking an 8bit > string of unknown encoding and recovering the largest > amount of reasonable information from it (translated to > utf8 if needed)? Copy the string unmodified to the WWW page and ensure your page doesn't identify the

Re: recycling internationalized garbage

2006-03-08 Thread garabik-news-2005-05
Fredrik Lundh <[EMAIL PROTECTED]> wrote: > "[EMAIL PROTECTED]" wrote: > >> Question: what is a good strategy for taking an 8bit >> string of unknown encoding and recovering the largest >> amount of reasonable information from it (translated to >> utf8 if needed)? The string might be in any of the

Re: recycling internationalized garbage

2006-03-08 Thread Fredrik Lundh
"[EMAIL PROTECTED]" wrote: > Question: what is a good strategy for taking an 8bit > string of unknown encoding and recovering the largest > amount of reasonable information from it (translated to > utf8 if needed)? The string might be in any of the > myriad encodings that predate unicode. Has an

recycling internationalized garbage

2006-03-08 Thread aaronwmail-usenet
Hi folks, Please help me with international string issues: I put together an AJAX discography search engine http://www.xfeedme.com/discs/discography.html using data from the FreeDB music database http://www.freedb.org/ Unfortunately FreeDB has a lot of junk in it, including randomly mixed char