Edward K Ream <[EMAIL PROTECTED]> added the comment:
On Mon, Aug 18, 2008 at 1:51 PM, Antoine Pitrou <[EMAIL PROTECTED]> wrote:
>
> Antoine Pitrou <[EMAIL PROTECTED]> added the comment:
>
> From the discussion on the python-3000, it looks like it would be nice
> if sax.parser handled both bytes and unicode streams.
>
> Edward, does your simple fix make sax.parser work entirely well with
> byte streams?
No. The sax.parser seems to have other problems. Here is what I *think* I
know ;-)
1. A smallish .leo file (an XML file) containing a single non-ASCII
(UTF-8 encoded) character appears to have been read correctly with Python 3.0.
2. A larger .leo file fails as follows (it's possible that the duplicate
error messages are a Leo problem):
Traceback (most recent call last):
Traceback (most recent call last):

  File "C:\leo.repo\leo-30\leo\core\leoFileCommands.py", line 1283, in parse_leo_file
    parser.parse(theFile) # expat does not support parseString
  File "C:\leo.repo\leo-30\leo\core\leoFileCommands.py", line 1283, in parse_leo_file
    parser.parse(theFile) # expat does not support parseString

  File "c:\python30\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "c:\python30\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)

  File "c:\python30\lib\xml\sax\xmlreader.py", line 121, in parse
    buffer = file.read(self._bufsize)
  File "c:\python30\lib\xml\sax\xmlreader.py", line 121, in parse
    buffer = file.read(self._bufsize)

  File "C:\Python30\lib\io.py", line 1670, in read
    eof = not self._read_chunk()
  File "C:\Python30\lib\io.py", line 1670, in read
    eof = not self._read_chunk()

  File "C:\Python30\lib\io.py", line 1499, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "C:\Python30\lib\io.py", line 1499, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))

  File "C:\Python30\lib\io.py", line 1236, in decode
    output = self.decoder.decode(input, final=final)
  File "C:\Python30\lib\io.py", line 1236, in decode
    output = self.decoder.decode(input, final=final)

  File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
  File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 74: character maps to <undefined>
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 74: character maps to <undefined>

The same calls to sax read the file correctly on Python 2.5.
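
The traceback shows io decoding the file with the cp1252 codec before sax ever sees the data. One plausible reading, which is an assumption rather than something confirmed above, is that the file holds UTF-8 text whose encoding includes a 0x81 byte. The sketch below reproduces that kind of failure on a made-up document; `raw` and `try_decode` are hypothetical names, not part of Leo or sax.

```python
# Hypothetical stand-in for the failing .leo content: "Á" (U+00C1) is the
# two bytes C3 81 in UTF-8, and 0x81 has no mapping in Python's cp1252 codec.
raw = "<leo>\u00c1</leo>".encode("utf-8")


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# Decoding UTF-8 bytes with the Windows default codec fails on 0x81 ...
text_cp1252, error_cp1252 = try_decode(raw, "cp1252")
# ... while decoding with the encoding the bytes were written in succeeds.
text_utf8, error_utf8 = try_decode(raw, "utf-8")

print(error_cp1252)
print(repr(text_utf8))
```

If that reading is right, it would also fit the Python 2.5 observation: there a file's read() returns undecoded bytes, so no platform codec is involved before the parser runs.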
It would be nice to have a message pinpoint the line and character offset of
the problem.
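
As a rough sketch of how such a location could be derived: UnicodeDecodeError carries a `start` attribute giving the byte offset within the data passed to the decoder, and that can be turned into a line and column by counting newlines. The helper name `locate_decode_error` is hypothetical. It assumes the whole file is decoded in one call (with chunked reads, as in the traceback, the offset is relative to the chunk) and an encoding in which a newline is the single byte 0x0A.

```python
def locate_decode_error(raw, encoding):
    """Return (line, column, bad_byte) for the first undecodable byte, or None.

    line is 1-based; column is the 0-based byte offset within that line.
    """
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # Newlines before the failing byte give the line number.
        line = raw.count(b"\n", 0, exc.start) + 1
        # Distance from the start of the current line gives the column.
        line_start = raw.rfind(b"\n", 0, exc.start) + 1
        return line, exc.start - line_start, raw[exc.start]
    return None


# Hypothetical sample: the undecodable byte sits on the second line.
sample = b"<a>\n<b>\xc3\x81</b>\n</a>\n"
location = locate_decode_error(sample, "cp1252")
print(location)
```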
My vote would be for the code to work on both kinds of input streams.
Having sax do the (tricky) conversions automatically would save users
considerable confusion.
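
For illustration only, this is roughly what "working on both kinds of input streams" looks like from the caller's side. The sketch assumes a recent Python 3 release, where the xml.sax documentation notes that character streams are accepted in addition to byte streams; it does not describe the Python 3.0 behaviour discussed here. `TextCollector`, `parse_text`, and the sample document are hypothetical.

```python
import io
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collect all character data reported by the parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces; keep them all.
        self.chunks.append(content)


def parse_text(stream):
    """Parse an XML stream (bytes or str) and return its character data."""
    handler = TextCollector()
    xml.sax.parse(stream, handler)
    return "".join(handler.chunks)


document = "<leo>caf\u00e9</leo>"

# Byte stream: with no encoding declaration the parser assumes UTF-8.
from_bytes = parse_text(io.BytesIO(document.encode("utf-8")))
# Character stream: the text is already decoded when the parser receives it.
from_text = parse_text(io.StringIO(document))

print(from_bytes == from_text)
```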
Imo, now would be the most convenient time to attempt this--there is a
certain freedom in having everything be partially broken :-)
Edward
--------------------------------------------------------------------
Edward K. Ream email: [EMAIL PROTECTED]
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
Added file: http://bugs.python.org/file11147/unnamed
_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3590>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com