On 17 Sep, 02:55, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
> On Sep 16, 9:27 pm, "Gabriel Genellina" <[EMAIL PROTECTED]>
> wrote:
>
> > On Sun, 16 Sep 2007 21:58:09 -0300, [EMAIL PROTECTED]
> > <[EMAIL PROTECTED]> wrote:
>
> > >> I'm eagerly awaiting publication of your professional specification
> > >> for correctly detecting the encoding of an arbitrary stream of
> > >> bytes
>
> > > The very presence of an algorithm to detect encoding is a bug.
> > > Files with the .txt extension should always be treated as ANSI
> > > even if they contain binary data.
>
> > Why ANSI?
>
> Because that's the absence of encoding?

Are you kidding?

> > Because it's convenient to *you*?
>
> No, it's ANSI unless told otherwise.

Oh, yes, surely it's a joke. (Anyway, *which* ANSI standard? AFAIK, the
Windows character set has never been standardized by ANSI.)

> > What about the rest of the world that doesn't speak
> > English or, even worse, doesn't use the Latin alphabet?
>
> When the rest of the world creates the next
> generation of computers, THEY can choose the
> defaults.

No comment.

> > What do you mean by "binary data"?
>
> 8-bit, ASCII is only 7-bit.

Being "binary" as opposed to "text" has nothing to do with the number of
bits. "¡Olé!" is text, and contains characters outside the ASCII set. A
signal with range 0-63 can be encoded into 6 bits, but it's binary data,
not text.

> > Notepad is not interpreting the file as
> > "binary", it's text,
>
> And will treat non-ASCII data as if it were ASCII.

I think you were complaining about the opposite situation.

> > but interpreted using the wrong encoding.
>
> So that's not a serious bug? To decide that a file
> is Unicode despite the absence of the appropriate
> markers?

Which are "the appropriate markers"? A BOM is not always required, and
Notepad supported Unicode even before the BOM was invented.
Please redirect your bug reports to [EMAIL PROTECTED]

> > As every character goes into the same code block, the heuristic concludes
> > that the text is some Eastern language encoded in UTF-16.
>
> But...but...Notepad doesn't have a UTF-16 option.

What it calls "Unicode" is in fact UTF-16, or UCS-2 on some previous
Windows versions.

> > This is the "Well you are speed" phrase interpreted as UTF-16:
> > u'\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465'
>
> How can you tell from that that it's UTF-16? If there's
> something stored in addition to those 18 bytes, you're
> being misleading.

*I* can tell it's not, but Notepad (which presumably calls IsTextUnicode)
cannot, and I can't blame it given such a small sample of less than 20
bytes.

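The 18-byte example can be reproduced in a few lines. This is only a minimal sketch, assuming Python 3 and little-endian UTF-16 (the byte order used on Windows); the variable names and the `has_utf16_bom` helper are made up for illustration and say nothing about how Notepad or IsTextUnicode work internally.

```python
import codecs

# The 18 ASCII bytes of the sample phrase.
raw = b"Well you are speed"

# Read as ASCII, they are plain English text.
as_ascii = raw.decode("ascii")

# Read as little-endian UTF-16, each pair of bytes becomes one code unit,
# and all nine fall in the CJK Unified Ideographs block (U+4E00..U+9FFF).
as_utf16 = raw.decode("utf-16-le")


def has_utf16_bom(data: bytes) -> bool:
    """Hypothetical helper: report whether data starts with a UTF-16 BOM."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


print(as_ascii)
print([hex(ord(ch)) for ch in as_utf16])
print(all(0x4E00 <= ord(ch) <= 0x9FFF for ch in as_utf16))
print(has_utf16_bom(raw))
```

Both decodings succeed and the sample carries no BOM, so the bytes alone do not say which reading was intended; that is all a heuristic has to work with here.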
> > > Notepad should never be
> > > allowed to try to decide what the encoding is if the open
> > > dialog has the encoding set to ANSI.
>
> > I'm using notepad.exe version 5.1.2600.2180 (XP SP2 fully updated) and
> > that's exactly what happens. I have to explicitly select Unicode in order
> > to see those Han characters.
>
> So which is worse, you having to tell it that it's
> Unicode or Notepad deciding on its own that a file
> is Unicode when it isn't.

I don't know, and I don't care, and I don't use Notepad.

--
Gabriel Genellina
-- http://mail.python.org/mailman/listinfo/python-list