Martin v. Löwis wrote:
> So "valid" yes; "meaningful" no. Therefore, for all practical
> purposes, 8-bit single-byte character sets *will not* produce
> byte sequences that are valid in UTF-8 (although they could -
> it just won't happen).
>
> > In fact I can't think of any multi-byte encoding tha
"Martin v. Löwis" wrote:
>> It should be obvious that any 8-bit single-byte character set can
>> produce byte sequences that are valid in UTF-8.
>
> It is certainly possible to interpret UTF-8 data as if they were
> in a specific single-byte encoding. However, the text you then
> obtain is not mea
Ross Ridge wrote:
> It should be obvious that any 8-bit single-byte character set can
> produce byte sequences that are valid in UTF-8.
It is certainly possible to interpret UTF-8 data as if they were
in a specific single-byte encoding. However, the text you then
obtain is not meaningful in any l
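
For illustration, here is a minimal Python 3 sketch of the two claims being argued over in this exchange. The helper name is_valid_utf8 and the sample strings are assumptions made up for this example, not anything posted in the thread. It shows that a single-byte ISO 8859-1 string can happen to be valid UTF-8, while an ordinary accented word in the same encoding is not.

```python
def is_valid_utf8(raw):
    """Return True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# "\u00c3\u00a9" is the two-character ISO 8859-1 string: A-tilde, copyright sign.
odd_latin1 = "\u00c3\u00a9".encode("iso-8859-1")    # bytes C3 A9
# "caf\u00e9" is an ordinary accented word in ISO 8859-1.
plain_latin1 = "caf\u00e9".encode("iso-8859-1")     # bytes 63 61 66 E9

print(is_valid_utf8(odd_latin1))    # C3 A9 is a well-formed two-byte sequence
print(is_valid_utf8(plain_latin1))  # E9 is a lead byte with no continuation bytes
```

The first string is exactly the kind of text a reader would rarely type on purpose, which is the "valid but not meaningful" case; the second is what real ISO 8859-1 data tends to look like.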
> Unless someone has any other ideas I'm
> giving up now.
btw, have you looked at using
http://musicbrainz.org/products/server/download.html
instead? they appear to guarantee UTF-8 (to the extent that *they* have managed
to autodecode the FreeDB junk, of course). not sure how complete it i
Ross Ridge wrote:
> Despite this malicious and false accusation, your post only confirms
> what I wrote above is true and what Martin wrote was false. Even with
> the desperate and absurd semantic game you tried to play, like falsely
> equating "fairly reliably" with "reliably", in a database as
Ross Ridge wrote:
> It should be obvious that any 8-bit single-byte character set can
> produce byte sequences that are valid in UTF-8.
Fredrik Lundh wrote:
> it should be fairly obvious that you don't know much about UTF-8...
Despite this malicious and false accusation, your post only confirms
w
Martin wrote:
> > The point is that you can tell UTF-8 reliably.
RFC 3629 says "fairly reliably" rather than "reliably", but they mean
the same thing...
> > If the data decodes
> > as UTF-8, it *is* UTF-8, because no other encoding in the world
> > produces the same byte sequences (except for AS
Martin v. Löwis wrote:
> The point is that you can tell UTF-8 reliably. If the data decodes
> as UTF-8, it *is* UTF-8, because no other encoding in the world
> produces the same byte sequences (except for ASCII, which is
> a UTF-8 subset).
It should be obvious that any 8-bit single-byte character
[EMAIL PROTECTED] wrote:
> Unless someone has any other ideas I'm
> giving up now.
Fredrik also suggested http://chardet.feedparser.org/ which is a port of
Mozilla's character detection algorithm to pure Python. It works pretty
well for web pages, since I haven't seen garbled Russian text for year
Ross Ridge wrote:
> [EMAIL PROTECTED] wrote:
>
>>try:
>>    (uni, dummy) = utf8dec(s)
>>except:
>>    (uni, dummy) = iso88591dec(s, 'ignore')
>
>
> Is there really any point in even trying to decode with UTF-8? You
> might as well just assume ISO 8859-1.
The point is that you c
[EMAIL PROTECTED] wrote:
> try:
>     (uni, dummy) = utf8dec(s)
> except:
>     (uni, dummy) = iso88591dec(s, 'ignore')
Is there really any point in even trying to decode with UTF-8? You
might as well just assume ISO 8859-1.
Ross Ridge
--
http://mail.pyth
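
A rough Python 3 sketch of the fallback strategy quoted in the message above: try strict UTF-8 first and fall back to ISO 8859-1 only if that fails. The function name decode_best_effort and its return shape are assumptions for illustration, not the original poster's code, and the sketch catches UnicodeDecodeError specifically instead of using a bare except.

```python
def decode_best_effort(raw):
    """Return (text, encoding_used) for bytes of unknown encoding.

    Hypothetical helper: strict UTF-8 first, then ISO 8859-1.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on any malformed sequence.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO 8859-1 assigns a code point to every byte value,
        # so this decode cannot fail (though the result may be wrong).
        return raw.decode("iso-8859-1"), "iso-8859-1"

print(decode_best_effort("caf\u00e9".encode("utf-8")))
print(decode_best_effort(b"caf\xe9"))
```

Trying UTF-8 first costs little, because bytes written in a single-byte encoding seldom form well-formed UTF-8 by accident, whereas the ISO 8859-1 step accepts anything and so only makes sense as the last resort.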
Regarding cleaning of mixed string encodings in
the discography search engine
http://www.xfeedme.com/discs/discography.html
Following 's suggestion I came up with this:
utf8enc = codecs.getencoder("utf8")
utf8dec = codecs.getdecoder("utf8")
iso88591dec = codecs.getdecoder("iso-8859-1")
def chec
[EMAIL PROTECTED] wrote:
> Question: what is a good strategy for taking an 8bit
> string of unknown encoding and recovering the largest
> amount of reasonable information from it (translated to
> utf8 if needed)?
Copy the string unmodified to the WWW page and ensure your page doesn't
identify the
Fredrik Lundh <[EMAIL PROTECTED]> wrote:
> "[EMAIL PROTECTED]" wrote:
>
>> Question: what is a good strategy for taking an 8bit
>> string of unknown encoding and recovering the largest
>> amount of reasonable information from it (translated to
>> utf8 if needed)? The string might be in any of the
"[EMAIL PROTECTED]" wrote:
> Question: what is a good strategy for taking an 8bit
> string of unknown encoding and recovering the largest
> amount of reasonable information from it (translated to
> utf8 if needed)? The string might be in any of the
> myriad encodings that predate unicode. Has an
Hi folks,
Please help me with international string issues:
I put together an AJAX discography search engine
http://www.xfeedme.com/discs/discography.html
using data from the FreeDB music database
http://www.freedb.org/
Unfortunately FreeDB has a lot of junk in it, including
randomly mixed char