On Jan 27, 7:04 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Jan 27, 9:18 pm, glacier <[EMAIL PROTECTED]> wrote:
>
> > On Jan 24, 4:44 pm, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
>
> > > On Wed, 23 Jan 2008 19:49:01 -0800, glacier wrote:
> > > > My second question is: is there any one who has tested very long mbcs
> > > > decode? I tried to decode a long(20+MB) xml yesterday, which turns out
> > > > to be very strange and cause SAX fail to parse the decoded string.
>
> > > That's because SAX wants bytes, not a decoded string. Don't decode it
> > > yourself.
>
> > > > However, I use another text editor to convert the file to utf-8 and
> > > > SAX will parse the content successfully.
>
> > > Because now you feed SAX with bytes instead of a unicode string.
>
> > > Ciao,
> > > Marc 'BlackJack' Rintsch
>
> > Yepp. I feed SAX with the unicode string since SAX didn't support my
> > encoding system(GBK).
>
> Let's go back to the beginning. What is "SAX"? Show us exactly what
> command or code you used.
>

SAX is the package 'xml.sax' distributed with Python 2.5 :)

1. I read the text from a GBK-encoded XML file, then skipped the first
   line, which declares the encoding.
2. I converted the string to unicode by calling decode('mbcs').
3. I used xml.sax.parseString to parse the string.

########################################################################
import xml.sax

# 'handler' is my ContentHandler/ErrorHandler instance (defined elsewhere)
f = file('e:/temp/456.xml','rb')
s = f.read()
f.close()

# skip the first line (the XML declaration that names the encoding)
n = 0
for i in xrange(len(s)):
    if s[i]=='\n':
        n += 1
    if n == 1:
        s = s[i+1:]
        break

s = '<root>'+s+'</root>'
s = s.decode('mbcs')    # byte string -> unicode, using the Windows ANSI code page
xml.sax.parseString(s,handler,handler)
########################################################################

> How did you let this SAX know that the file was encoded in GBK? An
> argument to SAX? An encoding declaration in the first few lines of the
> file? Some other method? ... precise answer please. Or did you expect
> that this SAX would guess correctly what the encoding was without
> being told?

I didn't tell SAX that the file is encoded in GBK, since I used the
'parseString' method.

>
> What does "didn't support my encoding system" mean? Have you actually
> tried pushing raw undecoded GBK at SAX using a suitable documented
> method of telling SAX that the file is in fact encoded in GBK? If so,
> what was the error message that you got?

I mean that SAX only supports a limited number of encodings, such as
utf-8, utf-16, etc., which don't include GBK.

>
> How do you know that it's GBK, anyway? Have you considered these
> possible scenarios:
> (1) It's GBK but you are telling SAX that it's GB2312
> (2) It's GB18030 but you are telling SAX it's GBK
>

Frankly speaking, I cannot tell whether the file contains any GB18030
characters... ^______^

> HTH,
> John

--
http://mail.python.org/mailman/listinfo/python-list
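
For reference, the workaround mentioned earlier in the thread (convert the file to utf-8, then hand SAX bytes rather than a decoded string) can be done in Python itself instead of in a text editor. What follows is only a rough sketch under several assumptions: it is written for Python 3 rather than the 2.5 used above, it assumes the file really is GBK (the codec name 'gbk' could be swapped for 'gb18030' if that turns out to be the case), and the names gbk_xml_to_utf8 and TextCollector are made up for illustration. It drops the XML declaration so that the parser falls back to its default of UTF-8.

```python
import re
import xml.sax
from xml.sax.handler import ContentHandler

# Matches a leading XML declaration such as <?xml version="1.0" encoding="GBK"?>
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>')


def gbk_xml_to_utf8(raw, source_encoding='gbk'):
    """Transcode raw XML bytes to UTF-8 bytes (hypothetical helper).

    The declaration is removed because it would still name the old
    encoding; without one, the parser assumes UTF-8.
    """
    text = raw.decode(source_encoding)       # raises UnicodeDecodeError on bad bytes
    text = _XML_DECL.sub('', text, count=1)  # drop the stale encoding declaration
    return text.encode('utf-8')


class TextCollector(ContentHandler):
    """Illustrative handler that just gathers character data."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # May be called several times for one text node, so collect pieces.
        self.chunks.append(content)
```

Usage would be along the lines of reading the file in 'rb' mode, then calling xml.sax.parseString(gbk_xml_to_utf8(data), TextCollector()). If the decode step raises UnicodeDecodeError, that is a hint that the file is not what the chosen codec expects, which bears on the GBK / GB2312 / GB18030 question above.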