Re: [Python-Dev] XML codec?

2007-11-08 Thread Walter Dörwald
Martin v. Löwis wrote:
>> Any comments?
> 
> -1. First, (as already discussed on the tracker,) "xml" is a bad name
> for an encoding. How would you encode "Hello" "in xml"?

Then how about the suggested "xml-auto-detect"?

> Then, I'd claim that the problem that the codec solves doesn't really
> exist. IOW, most XML parsers implement the auto-detection of encodings,
> anyway, and this is where architecturally this functionality belongs.

But not all XML parsers support all encodings. The XML codec makes it
trivial to add this support to an existing parser.

Furthermore encoding-detection might be part of the responsibility of
the XML parser, but this decoding phase is totally distinct from the
parsing phase, so why not put the decoding into a common library?

> For a text editor, much more useful than a codec would be a routine
> (say, xml.detect_encoding) which performs the auto-detection.

There's a (currently undocumented) codecs.detect_xml_encoding() in the
patch. We could document this function and make it public. But if
there's no codec that uses it, this function IMHO doesn't belong in the
codecs module. Should this function be available from xml/__init__.py or
should we put it into something like xml/utils.py?
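
For illustration, a detection routine of the kind discussed here could look roughly like the sketch below. It is not the code from the patch: the function name is reused from the thread, but the signature, the `default` parameter and the helper tables are assumptions, and the example is written for Python 3. It follows the general approach of Appendix F of the XML specification (check for a byte-order mark, then for the byte pattern of "<?" in UTF-16/UTF-32, then read the encoding declaration).

```python
import codecs
import re

# Byte-order marks. UTF-32 is checked before UTF-16 because the UTF-32-LE
# BOM begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Byte patterns of a leading "<" or "<?" when there is no BOM.
_PREFIXES = [
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"\x00<\x00?", "utf-16-be"),
]

# Encoding declaration in an ASCII-compatible encoding.
_DECL = re.compile(
    br"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def detect_xml_encoding(data, default="utf-8"):
    """Guess the encoding of an XML byte string (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for prefix, name in _PREFIXES:
        if data.startswith(prefix):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```

For input with a BOM the sketch returns codec names that consume the BOM on decoding ("utf-16", "utf-32", "utf-8-sig"), so the result can be passed straight to `bytes.decode()`.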

> Finally, I think the codec is incorrect. When saving XML to a file
> (e.g. in a text editor), there should rarely be encoding errors, since
> one could use character references in many cases.

This requires some intelligent fiddling with the errors attribute of the
encoder.

> Also, the XML
> spec talks about detecting EBCDIC, which I believe your implementation
> doesn't.

Correct, but as long as Python doesn't have an EBCDIC codec, that won't
help much. Adding *detection* of EBCDIC to detect_xml_encoding() is
rather simple though.

Servus,
   Walter

___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] Summary of Tracker Issues

2007-11-08 Thread Tracker

ACTIVITY SUMMARY (11/01/07 - 11/08/07)
Tracker at http://bugs.python.org/

To view or respond to any of the issues listed below, click on the issue 
number.  Do NOT respond to this message.


 1319 open (+18) / 11570 closed (+19) / 12889 total (+37)

Open issues with patches:   419

Average duration of open issues: 686 days.
Median duration of open issues: 785 days.

Open Issues Breakdown
   open  1314 (+18)
pending     5 ( +0)

Issues Created Or Reopened (37)
___

Doc changes left over after mega-merge from trunk  11/01/07
CLOSED http://bugs.python.org/issue1370  created  gvanrossum

Two bsddb tests temporarily commented out in py3k branch  11/01/07
CLOSED http://bugs.python.org/issue1371  created  gvanrossum
       py3k

zlibmodule.c: int overflow in PyZlib_decompress  11/02/07
       http://bugs.python.org/issue1372  created  PeterW
       patch, 64bit

turn off socket timeout in test_xmlrpc  11/02/07
CLOSED http://bugs.python.org/issue1373  created  hupp
       py3k, patch

IDLE - minor FormatParagraph bug fix  11/02/07
       http://bugs.python.org/issue1374  created  taleinat
       patch

hotshot IndexError when loading stats  11/02/07
       http://bugs.python.org/issue1375  created  ratsberg

uu module catches a wrong exception type  11/02/07
CLOSED http://bugs.python.org/issue1376  created  billiejoex

test_import breaks on Linux  11/02/07
CLOSED http://bugs.python.org/issue1377  created  gvanrossum
       py3k

fromfd() and dup() for _socket on WIndows  11/03/07
       http://bugs.python.org/issue1378  created  roudkerk
       patch

reloading imported modules sometimes fail with 'parent not in sy  11/03/07
CLOSED http://bugs.python.org/issue1379  created  _doublep
       py3k, patch

fix for test_asynchat and test_asyncore on pep3137 branch  11/03/07
CLOSED http://bugs.python.org/issue1380  created  hupp
       py3k, patch

cmath is numerically unsound  11/03/07
       http://bugs.python.org/issue1381  created  inducer

py3k-pep3137: patch for test_ctypes  11/04/07
CLOSED http://bugs.python.org/issue1382  created  amaury.forgeotdarc
       py3k, patch

Backport abcoll to 2.6  11/04/07
       http://bugs.python.org/issue1383  created  baranguren
       patch

Windows fix for inspect tests  11/04/07
CLOSED http://bugs.python.org/issue1384  created  tiran
       py3k, patch

hmac module violates RFC for some hash functions, e.g. sha512  11/04/07
CLOSED http://bugs.python.org/issue1385  created  jowagner
       py3k

py3k-pep3137: patch to ensure that all codecs return bytes  11/04/07
CLOSED http://bugs.python.org/issue1386  created  amaury.forgeotdarc
       py3k, patch

py3k-pep3137: patch for hashlib on Windows  11/04/07
CLOSED http://bugs.python.org/issue1387  created  amaury.forgeotdarc
       py3k, patch

py3k-pep3137: possible ref leak in ctypes  11/05/07
CLOSED http://bugs.python.org/issue1388  created  tiran
       py3k

py3k-pep3137: struct module is leaking references  11/05/07

Re: [Python-Dev] XML codec?

2007-11-08 Thread Martin v. Löwis
> Then how about the suggested "xml-auto-detect"?

That is better.

>> Then, I'd claim that the problem that the codec solves doesn't really
>> exist. IOW, most XML parsers implement the auto-detection of encodings,
>> anyway, and this is where architecturally this functionality belongs.
> 
> But not all XML parsers support all encodings. The XML codec makes it
> trivial to add this support to an existing parser.

I would like to question this claim. Can you give an example of a parser
that doesn't support a specific encoding and where adding such a codec
solves that problem?

In particular, why would that parser know how to process Python Unicode
strings?

> Furthermore encoding-detection might be part of the responsibility of
> the XML parser, but this decoding phase is totally distinct from the
> parsing phase, so why not put the decoding into a common library?

I would not object to that - just to expose it as a codec. Adding it
to the XML library is fine, IMO.

> There's a (currently undocumented) codecs.detect_xml_encoding() in the
> patch. We could document this function and make it public. But if
> there's no codec that uses it, this function IMHO doesn't belong in the
> codecs module. Should this function be available from xml/__init__.py or
> should we put it into something like xml/utils.py?

Either - or.

>> Finally, I think the codec is incorrect. When saving XML to a file
>> (e.g. in a text editor), there should rarely be encoding errors, since
>> one could use character references in many cases.
> 
> This requires some intelligent fiddling with the errors attribute of the
> encoder.

Much more than that, I think - you cannot use a character reference
in an XML Name. So the codec would have to parse the output stream
to know whether or not a character reference could be used.
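
The point about XML Names can be checked with the standard "xmlcharrefreplace" error handler. The following is a small sketch for Python 3, not part of the proposed codec; the helper `is_well_formed` and the sample documents are made up for illustration. A character reference is fine in character data, but the same replacement inside an element name produces a document that expat rejects.

```python
from xml.parsers import expat


def is_well_formed(data):
    """Return True if expat parses the byte string without error."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except expat.ExpatError:
        return False
    return True


# Non-ASCII character in character data: a character reference works.
content_doc = "<?xml version='1.0' encoding='us-ascii'?><name>D\xf6rwald</name>"
# Non-ASCII character in an element name: a reference is not allowed there.
name_doc = "<?xml version='1.0' encoding='us-ascii'?><D\xf6rwald/>"

encoded_content = content_doc.encode("ascii", "xmlcharrefreplace")
encoded_name = name_doc.encode("ascii", "xmlcharrefreplace")

print(encoded_content)
print(is_well_formed(encoded_content))
print(encoded_name)
print(is_well_formed(encoded_name))
```

So an error handler alone cannot decide where a reference is legal; something has to know the position of the character within the markup.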

> Correct, but as long as Python doesn't have an EBCDIC codec, that won't
> help much. Adding *detection* of EBCDIC to detect_xml_encoding() is
> rather simple though.

But it does! cp037 is EBCDIC, and supported by Python.
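
As a rough illustration (Python 3, with a hypothetical helper name), the EBCDIC check described in Appendix F of the XML specification amounts to looking for the bytes of "<?xm" in an EBCDIC code page. Using cp037 as the reference code page is an assumption here; other EBCDIC variants may need the declaration to be read before the exact code page is known.

```python
# The first four characters of an XML declaration, encoded with cp037.
EBCDIC_SIGNATURE = "<?xm".encode("cp037")


def looks_like_ebcdic(data):
    """Return True if the byte string starts like an EBCDIC XML declaration."""
    return data.startswith(EBCDIC_SIGNATURE)


ebcdic_doc = "<?xml version='1.0' encoding='cp037'?><foo/>".encode("cp037")
ascii_doc = b"<?xml version='1.0'?><foo/>"

print(looks_like_ebcdic(ebcdic_doc))
print(looks_like_ebcdic(ascii_doc))
```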

Regards,
Martin


Re: [Python-Dev] XML codec?

2007-11-08 Thread Walter Dörwald
Walter Dörwald wrote:

> Martin v. Löwis wrote:
> 
> [...]
>>> Correct, but as long as Python doesn't have an EBCDIC codec, that won't
>>> help much. Adding *detection* of EBCDIC to detect_xml_encoding() is
>>> rather simple though.
>> But it does! cp037 is EBCDIC, and supported by Python.
> 
> I didn't know that. I'm going to update the patch.

Done: http://bugs.python.org/1399

I also renamed the codec to xml_auto_detect.

Servus,
Walter


Re: [Python-Dev] XML codec?

2007-11-08 Thread Martin v. Löwis
> ci = codecs.lookup("xml-auto-detect")
> p = expat.ParserCreate()
> e = "utf-32"
> s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
> s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
> p.Parse(s, True)

So how come the document being parsed is recognized as UTF-8?

> OK, so should I put the C code into a _xml module?

I don't see the need for C code at all.

Regards,
Martin


Re: [Python-Dev] XML codec?

2007-11-08 Thread Walter Dörwald
Martin v. Löwis wrote:

>> Then how about the suggested "xml-auto-detect"?
> 
> That is better.

OK.

>>> Then, I'd claim that the problem that the codec solves doesn't really
>>> exist. IOW, most XML parsers implement the auto-detection of encodings,
>>> anyway, and this is where architecturally this functionality belongs.
>> But not all XML parsers support all encodings. The XML codec makes it
>> trivial to add this support to an existing parser.
> 
> I would like to question this claim. Can you give an example of a parser
> that doesn't support a specific encoding

It seems that e.g. expat doesn't support UTF-32:

from xml.parsers import expat

p = expat.ParserCreate()
e = "utf-32"
s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
p.Parse(s, True)

This fails with:

Traceback (most recent call last):
  File "gurk.py", line 6, in <module>
    p.Parse(s, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1

Replace "utf-32" with "utf-16" and the problem goes away.

> and where adding such a codec
> solves that problem?
> 
> In particular, why would that parser know how to process Python Unicode
> strings?

It doesn't have to. You can use an XML encoder to reencode the unicode 
string into bytes (forcing an encoding that the parser knows):

import codecs
from xml.parsers import expat

ci = codecs.lookup("xml-auto-detect")
p = expat.ParserCreate()
e = "utf-32"
s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
p.Parse(s, True)
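
Since the "xml-auto-detect" codec exists only in the patch, the same re-encoding step can be sketched by hand. The version below is an approximation for Python 3 under stated assumptions: the source encoding is already known, the helper name `transcode_to_utf8` is made up, and the regular expression only handles a declaration at the very start of the document. It decodes the bytes, rewrites the declared encoding to utf-8 so that declaration and byte stream agree, and passes the result to expat.

```python
import re
from xml.parsers import expat

# Matches the encoding pseudo-attribute of a leading XML declaration.
_ENCODING_DECL = re.compile(
    r"""\A(<\?xml[^>]*?encoding\s*=\s*)(["'])[^"']*\2"""
)


def transcode_to_utf8(data, source_encoding):
    """Decode XML bytes and re-encode them as UTF-8 (hypothetical helper)."""
    text = data.decode(source_encoding)
    # Keep the declaration consistent with the new byte encoding.
    text = _ENCODING_DECL.sub(r"\g<1>\g<2>utf-8\g<2>", text, count=1)
    return text.encode("utf-8")


doc = "<?xml version='1.0' encoding='utf-32'?><foo/>".encode("utf-32")
utf8_doc = transcode_to_utf8(doc, "utf-32")

names = []
parser = expat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: names.append(name)
parser.Parse(utf8_doc, True)
print(utf8_doc, names)
```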

>> Furthermore encoding-detection might be part of the responsibility of
>> the XML parser, but this decoding phase is totally distinct from the
>> parsing phase, so why not put the decoding into a common library?
> 
> I would not object to that - just to expose it as a codec. Adding it
> to the XML library is fine, IMO.

But it does make sense as a codec. The decoding phase of an XML parser 
has to turn a byte stream into a unicode stream. That's the job of a codec.

>> There's a (currently undocumented) codecs.detect_xml_encoding() in the
>> patch. We could document this function and make it public. But if
>> there's no codec that uses it, this function IMHO doesn't belong in the
>> codecs module. Should this function be available from xml/__init__.py or
>> should we put it into something like xml/utils.py?
> 
> Either - or.

OK, so should I put the C code into a _xml module?

>>> Finally, I think the codec is incorrect. When saving XML to a file
>>> (e.g. in a text editor), there should rarely be encoding errors, since
>>> one could use character references in many cases.
>> This requires some intelligent fiddling with the errors attribute of the
>> encoder.
> 
> Much more than that, I think - you cannot use a character reference
> in an XML Name. So the codec would have to parse the output stream
> to know whether or not a character reference could be used.

That's what I meant with "intelligent" fiddling. But I agree this is way 
beyond what a text editor should do. AFAIK it is way beyond what 
existing text editors do. However using the XML codec would at least 
guarantee that the encoding specified in the XML declaration and the 
encoding used for encoding the file stay consistent.

>> Correct, but as long as Python doesn't have an EBCDIC codec, that won't
>> help much. Adding *detection* of EBCDIC to detect_xml_encoding() is
>> rather simple though.
> 
> But it does! cp037 is EBCDIC, and supported by Python.

I didn't know that. I'm going to update the patch.

Servus,
Walter


Re: [Python-Dev] XML codec?

2007-11-08 Thread Adam Olsen
On 11/8/07, Walter Dörwald wrote:
> Martin v. Löwis wrote:
>
> >> Then how about the suggested "xml-auto-detect"?
> >
> > That is better.
>
> OK.
>
> >>> Then, I'd claim that the problem that the codec solves doesn't really
> >>> exist. IOW, most XML parsers implement the auto-detection of encodings,
> >>> anyway, and this is where architecturally this functionality belongs.
> >> But not all XML parsers support all encodings. The XML codec makes it
> >> trivial to add this support to an existing parser.
> >
> > I would like to question this claim. Can you give an example of a parser
> > that doesn't support a specific encoding
>
> It seems that e.g. expat doesn't support UTF-32:
>
> from xml.parsers import expat
>
> p = expat.ParserCreate()
> e = "utf-32"
> s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
> p.Parse(s, True)
>
> This fails with:
>
> Traceback (most recent call last):
>   File "gurk.py", line 6, in <module>
>     p.Parse(s, True)
> xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1
>
> Replace "utf-32" with "utf-16" and the problem goes away.
>
> > and where adding such a codec
> > solves that problem?
> >
> > In particular, why would that parser know how to process Python Unicode
> > strings?
>
> It doesn't have to. You can use an XML encoder to reencode the unicode
> string into bytes (forcing an encoding that the parser knows):
>
> import codecs
> from xml.parsers import expat
>
> ci = codecs.lookup("xml-auto-detect")
> p = expat.ParserCreate()
> e = "utf-32"
> s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
> s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
> p.Parse(s, True)
>
> >> Furthermore encoding-detection might be part of the responsibility of
> >> the XML parser, but this decoding phase is totally distinct from the
> >> parsing phase, so why not put the decoding into a common library?
> >
> > I would not object to that - just to expose it as a codec. Adding it
> > to the XML library is fine, IMO.
>
> But it does make sense as a codec. The decoding phase of an XML parser
> has to turn a byte stream into a unicode stream. That's the job of a codec.

Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
codecs to do the encoding.  There's no need to create a magical
mystery codec to pick out which though.  It's not even sufficient for
XML:

1) round-tripping a file should be done in the original encoding.
Containing the auto-detected encoding within a codec doesn't let you
see what it picked.
2) the encoding may be specified externally from the file/stream[1].
The xml parser needs to handle these out-of-band encodings anyway.
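
Both points can be illustrated with a small decoding helper that reports which encoding it used and lets an out-of-band encoding take precedence. This is only a sketch for Python 3: the name `decode_xml` and its `external_encoding` parameter are invented for the example, and detection is reduced to a BOM check (the encoding declaration is ignored) to keep it short.

```python
import codecs


def decode_xml(data, external_encoding=None):
    """Return (text, encoding_used) for an XML byte string.

    An externally supplied encoding (for example from a transport
    header) wins over anything found in the data itself.
    """
    if external_encoding is not None:
        encoding = external_encoding
    elif data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        encoding = "utf-32"
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"
    else:
        encoding = "utf-8"
    return data.decode(encoding), encoding


# Round trip: the caller sees the encoding and can write the file back
# in the encoding it was read in.
original = "<greeting>Gr\xfc\xdfe</greeting>".encode("utf-16")
text, encoding = decode_xml(original)
round_tripped = text.encode(encoding)

# Out-of-band encoding: nothing in the bytes says they are Latin-1.
latin1_text, latin1_encoding = decode_xml(
    "<a>\xe9</a>".encode("iso-8859-1"), external_encoding="iso-8859-1"
)
```

Returning the encoding next to the text is the part a plain codec interface does not offer.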


[1] http://mail.python.org/pipermail/xml-sig/2004-October/010649.html

-- 
Adam Olsen, aka Rhamphoryncus


[Python-Dev] hex() and oct() still include the trailing L - change this in 2.6?

2007-11-08 Thread Gregory P. Smith
I thought the hell of stripping trailing Ls off of stringed numbers was gone
but it appears that the hex() and oct() builtins still leave the trailing
'L' on longs:

Python 2.6a0 (trunk:58846M, Nov  4 2007, 15:44:12)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 0xffffffffc10025be
>>> x
18446744072652596670L
>>> str(x)
'18446744072652596670'
>>> hex(x)
'0xffffffffc10025beL'
>>> '0x%x' % (x)
'0xffffffffc10025be'
>>> oct(x)
'01777777777770100022676L'

This appears to be fixed in py3k (as there is no longer an int/long to
distinguish).  Can we at least get rid of the annoying L in 2.6?
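
In the meantime, code that has to behave the same with and without the suffix can strip it explicitly. The helpers below are a minimal sketch with made-up names; they run unchanged on Python 3, where there is no 'L' to strip and rstrip() is a no-op.

```python
def hex_no_suffix(value):
    """hex() without a trailing 'L', whatever the Python version."""
    return hex(value).rstrip("L")


def oct_no_suffix(value):
    """oct() without a trailing 'L', whatever the Python version."""
    return oct(value).rstrip("L")


x = 0xffffffffc10025be
print(hex_no_suffix(x))
print("0x%x" % x)
print(oct_no_suffix(x))
```

Note that the octal prefix itself differs between versions ('0' in 2.x, '0o' in 3.x), so only the suffix is made uniform here.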

-gps


Re: [Python-Dev] hex() and oct() still include the trailing L - change this in 2.6?

2007-11-08 Thread Brett Cannon
On Nov 8, 2007 6:05 PM, Gregory P. Smith wrote:
> I thought the hell of stripping trailing Ls off of stringed numbers was gone
> but it appears that the hex() and oct() builtins still leave the trailing
> 'L' on longs:
>
> Python 2.6a0 (trunk:58846M, Nov  4 2007, 15:44:12)
> [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> x = 0xffffffffc10025be
> >>> x
> 18446744072652596670L
> >>> str(x)
> '18446744072652596670'
> >>> hex(x)
> '0xffffffffc10025beL'
> >>> '0x%x' % (x)
> '0xffffffffc10025be'
> >>> oct(x)
> '01777777777770100022676L'
>
> This appears to be fixed in py3k (as there is no longer an int/long to
> distinguish).  Can we at least get rid of the annoying L in 2.6?

It will break code, so probably not.  Consider this motivation to move
over to Python 3.0.  =)

-Brett


Re: [Python-Dev] hex() and oct() still include the trailing L - change this in 2.6?

2007-11-08 Thread Guido van Rossum
On Nov 8, 2007 6:18 PM, Brett Cannon wrote:
> On Nov 8, 2007 6:05 PM, Gregory P. Smith wrote:
> > I thought the hell of stripping trailing Ls off of stringed numbers was gone
> > but it appears that the hex() and oct() builtins still leave the trailing
> > 'L' on longs:
> >
> > Python 2.6a0 (trunk:58846M, Nov  4 2007, 15:44:12)
> > [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> x = 0xffffffffc10025be
> > >>> x
> > 18446744072652596670L
> > >>> str(x)
> > '18446744072652596670'
> > >>> hex(x)
> > '0xffffffffc10025beL'
> > >>> '0x%x' % (x)
> > '0xffffffffc10025be'
> > >>> oct(x)
> > '01777777777770100022676L'
> >
> > This appears to be fixed in py3k (as there is no longer an int/long to
> > distinguish).  Can we at least get rid of the annoying L in 2.6?
>
> It will break code, so probably not.  Consider this motivation to move
> over to Python 3.0.  =)

Right. Or perhaps in some kind of forward compatibility mode. A future
import might do: from __future__ import no_long_suffix perhaps.

Reminder (I can't say this enough): Python 2.6 needs to be as close as
possible to 2.5, only adding forward compatibility with 3.0 as an
option (using either a command line flag or a future import depending
on what feature we're talking about).

Additions and improvements are fine of course; but deletions or
changes "in anticipation of 3.0" should not occur by default, only
when a specific forward compatibility feature is requested.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)