[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-02 Thread Julien ÉLIE
Julien ÉLIE added the comment:
> Unless I'm mistaken, Content-Type should only apply to the body, not the
> headers. Either the headers use UTF-8 (RFC 3977), or they should be
> MIME-encoded. Everything else is undecodable.
Yes, of course. Such articles are not RFC-compliant. You're not mis

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-02 Thread Antoine Pitrou
Antoine Pitrou added the comment:
> Antoine, a news client could guess it because of the Content-Type:
> header field (in this example, it mentions charset="gb2312").
> Yet, articles without a Content-Type: header field exist in the
> wild...
Unless I'm mistaken, Content-Type should only apply

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-02 Thread Julien ÉLIE
Julien ÉLIE added the comment: Antoine, a news client could guess it because of the Content-Type: header field (in this example, it mentions charset="gb2312"). Yet, articles without a Content-Type: header field exist in the wild... There is no way to always make the right guess, unfortunately.
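
The guessing strategy described above could be sketched as follows. This is only an illustration under stated assumptions: `decode_header_bytes` is a hypothetical helper, not part of nntplib, and falling back to UTF-8 with "surrogateescape" is one possible policy when no usable charset is found.

```python
from email.message import Message
from typing import Optional

def decode_header_bytes(raw: bytes, content_type: Optional[str] = None) -> str:
    """Decode raw header bytes, guessing the charset from Content-Type.

    Hypothetical helper: the guess can be wrong, and a missing or unknown
    charset falls back to UTF-8 with the "surrogateescape" error handler.
    """
    charset = None
    if content_type:
        probe = Message()
        probe["Content-Type"] = content_type
        charset = probe.get_content_charset()
    if charset:
        try:
            return raw.decode(charset)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong charset: use the fallback below
    return raw.decode("utf-8", errors="surrogateescape")

raw_subject = "你好".encode("gb2312")   # made-up subject, raw gb2312 bytes
guessed = decode_header_bytes(raw_subject, 'text/plain; charset="gb2312"')
unguessed = decode_header_bytes(raw_subject)  # no Content-Type available
```

Without a Content-Type, the result contains lone surrogates rather than readable text, but the original bytes can still be recovered by encoding with the same error handler.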

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-02 Thread Antoine Pitrou
Antoine Pitrou added the comment:
> And my text string is "\xC9ric", that's all.
You mean b"\xC9ric", right?
> If you look at the source of the articles, you will for instance see
> that the Subject: header field is not MIME-encoded. It is directly
> written in gb2312.
How is an NNTP client

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-02 Thread Julien ÉLIE
Julien ÉLIE added the comment: David, the headers are not at all supposed to be "utf-8" encoded. For instance, have a look at the cn.bbs.comp.lang.python newsgroup: http://groups.google.fr/group/cn.bbs.comp.lang.python If you look at the source of the articles, you will for instance see that t

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-01 Thread R. David Murray
R. David Murray added the comment: That's not what you opened the bug about, though, according to the title. I discussed the headers-in-things-other-than HEAD/ARTICLE, and Antoine was of the opinion that they were "supposed" to be utf-8 and that in any case using surrogate escape was good eno

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-01 Thread Julien ÉLIE
Julien ÉLIE added the comment: Maybe the bug should be reopened -- or the subject changed -- because the real issue is when I read:
# Incompatible changes from the 2.x nntplib:
# - all commands are encoded as UTF-8 data (using the "surrogateescape"
#   error handler), except for raw message da

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-01 Thread Julien ÉLIE
Julien ÉLIE added the comment: Éric: there is no notion of encoding in a few NNTP commands. Regarding AUTHINFO, the real string that I should have written is:
AUTHINFO USER \xC9ric
7-bit bytes are considered to be encoded in ASCII. 8-bit bytes are just 8-bit bytes. No encoding. The news cli

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-01 Thread Julien ÉLIE
Julien ÉLIE added the comment: David: no, the RFC does not mention UTF-8 about AUTHINFO. Please note the subtlety:
command =/ authinfo-sasl-command /
    authinfo-user-command /
    authinfo-pass-command
authinfo-sasl-command = "AUTHINFO" WS "SASL" WS mechanism [WS initi

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-01 Thread R. David Murray
R. David Murray added the comment: Éric: UTF-8 (IIUC the RFC says "SHOULD be UTF-8"). Julien: yes, there are differences in the way printing to the console works between 2.x and 3.x, and this has caused some surprises for Windows users, where the default console codec is a bit limited. So ye

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-01 Thread Éric Araujo
Éric Araujo added the comment: But É cannot be transferred as is. It needs to be encoded to bytes using some encoding. What encoding is correct?

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-01 Thread Julien ÉLIE
Julien ÉLIE added the comment: Yes, you're right. I meant to say that AUTHINFO is not expecting a UTF-8-encoded string. For instance: AUTHINFO USER Éric is valid and should not always be transformed by nntplib to: AUTHINFO USER Éric News servers do a byte-string comparison (as specified i
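
To make the byte-level point concrete, here is a small sketch showing that the same user name yields different bytes under Latin-1 and UTF-8, so a server doing a byte-string comparison would treat them as two different names. The choice of Latin-1 for the "raw" form is an assumption made for the example; the variable names are illustrative.

```python
username = "\xC9ric"  # the text "Éric"

as_latin1 = username.encode("latin-1")  # single byte 0xC9 for "É"
as_utf8 = username.encode("utf-8")      # two bytes 0xC3 0x89 for "É"

latin1_command = b"AUTHINFO USER " + as_latin1
utf8_command = b"AUTHINFO USER " + as_utf8

# A byte-for-byte comparison sees two different commands.
same_bytes = latin1_command == utf8_command
print(latin1_command, utf8_command, same_bytes)
```

If the account was registered with the single-byte form, a client that always re-encodes as UTF-8 would send a name the server does not recognize.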

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-01 Thread Éric Araujo
Éric Araujo added the comment: FTR, a UTF-8 string *is* a byte string. -- nosy: +eric.araujo

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-01 Thread Julien ÉLIE
Julien ÉLIE added the comment:
Traceback (most recent call last):
  File "nntplib-test.py", line 10, in <module>
    print(s.descriptions('*'))
  File "C:\Program Files\Python32\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeE

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-01 Thread R. David Murray
R. David Murray added the comment: What's the exception? If there were any escaped bytes in the string returned by descriptions, you would get an error when you try to print them. This could be a design problem. -- nosy: +r.david.murray
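
The failure mode described above can be reproduced without a news server. The sketch below assumes a description containing a non-UTF-8 byte and a cp850 console, as in the reported traceback; the sample bytes are made up. It also shows one possible workaround, encoding with "backslashreplace" before display.

```python
raw_description = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8 (made-up sample)

# Decoding with surrogateescape keeps the bad byte as a lone surrogate.
text = raw_description.decode("utf-8", errors="surrogateescape")

# A console codec such as cp850 cannot encode the lone surrogate.
try:
    text.encode("cp850")
    print_would_fail = False
except UnicodeEncodeError:
    print_would_fail = True

# One way to display the string anyway: escape what cannot be encoded.
printable = text.encode("cp850", errors="backslashreplace").decode("cp850")
print(printable)
```

The escaped form is only for display; the original bytes remain recoverable from `text` itself.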

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-01 Thread Julien ÉLIE
Changes by Julien ÉLIE: components: +Unicode

[issue10284] Exception raised when decoding NNTP newsgroup descriptions

2010-11-01 Thread Julien ÉLIE
New submission from Julien ÉLIE:
> +# - all commands are encoded as UTF-8 data (using the "surrogateescape"
> +#   error handler), except for raw message data (POST, IHAVE)
> +# - all responses are decoded as UTF-8 data (using the "surrogateescape"
> +#   error handler), except for raw message d
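
As a rough illustration of what the quoted comment describes, the following sketch decodes a response line and encodes a command line with UTF-8 and the "surrogateescape" error handler. The helper names `decode_response` and `encode_command` and the sample newsgroup line are assumptions for the example, not nntplib's actual internals.

```python
def decode_response(line: bytes) -> str:
    """Decode a server line; undecodable bytes become lone surrogates."""
    return line.decode("utf-8", errors="surrogateescape")

def encode_command(line: str) -> bytes:
    """Encode a command; lone surrogates turn back into the original bytes."""
    return line.encode("utf-8", errors="surrogateescape")

# A made-up description line containing a Latin-1 byte (0xE7).
raw_line = b"fr.test Groupe de test fran\xe7ais"

decoded = decode_response(raw_line)
round_tripped = encode_command(decoded)
```

The round trip is lossless at the byte level, but the decoded string is not the text the author intended, which is the concern raised in this report.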