Re: [Python-Dev] Matrix product

2008-07-30 Thread Raymond Hettinger

Further, while A**B is not so common, A**n is quite common (for
integral n, in the sense of repeated matrix multiplication). So a
matrix multiplication operator really should come with a power
operator cousin.


Which obviously should be @@ :-)
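
As an illustration of what A**n means for integral n (repeated matrix
multiplication), here is a minimal pure-Python sketch. The helper names
mat_mul and mat_pow are hypothetical, matrices are assumed to be square
nested lists, and nothing here reflects how numpy or any proposed
operator would implement it.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def mat_pow(a, n):
    """Raise a square matrix to a non-negative integer power."""
    if n < 0:
        raise ValueError("negative powers would need a matrix inverse")
    size = len(a)
    # Start from the identity matrix, the neutral element of the product.
    result = [[int(i == j) for j in range(size)] for i in range(size)]
    base = a
    while n:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)  # repeated squaring
        n >>= 1
    return result


fib = mat_pow([[1, 1], [1, 0]], 5)
print(fib)
```

Repeated squaring needs only O(log n) multiplications, which is one
reason a dedicated power operation is attractive next to a product.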


I think much of this thread is a repeat of conversations
that were held for PEP 225:
http://www.python.org/dev/peps/pep-0225/

That PEP is marked as deferred.  Maybe it's time to
bring it back to life.


Raymond
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Matrix product

2008-07-30 Thread Sebastien Loisel
Dear Raymond,

Thank you for your email.

> I think much of this thread is a repeat of conversations
> that were held for PEP 225:
> http://www.python.org/dev/peps/pep-0225/
>
> That PEP is marked as deferred.  Maybe it's time to
> bring it back to life.

This is a much better PEP than the one I had found, and would solve
all of the numpy problems. The PEP is very well thought-out.

Sincerely,

-- 
Sébastien Loisel


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Matt Giuca
Hi folks,

This issue got some attention a few weeks back but it seems to have
fallen quiet, and I haven't had a good chance to sit down and reply
again till now.

As I've said before this is a serious issue which will affect a great
deal of code. However it's obviously not as clear-cut as I originally
believed, since there are lots of conflicting opinions. Let us see if
we can come to a consensus.

(For those who haven't seen the discussion, the thread starts here:
http://mail.python.org/pipermail/python-dev/2008-July/081013.html
continues here for some reason:
http://mail.python.org/pipermail/python-dev/2008-July/081066.html
and I've got a bug report with a fully tested and documented patch here:
http://bugs.python.org/issue3300)

Firstly, it looks like most of the people agree we should add an
optional "encoding" argument which lets the caller customize which
encoding to use. What we tend to disagree about is what the default
encoding should be.

Here I present the various options as I see them (and I'm trying to be
impartial), and the people who've indicated support for that option
(apologies if I've misrepresented anybody's opinion, feel free to
correct):

1. Leave it as it is. quote is Latin-1 if range(0,256), fallback to
UTF-8. unquote is Latin-1.
In favour: Anybody who doesn't reply to this thread
Pros: Already implemented; some existing code depends upon ord values
of string being the same as they were for byte strings; possible to
hack around it.
Cons: unquote is not inverse of quote; quote behaviour
internally-inconsistent; garbage when unquoting UTF-8-encoded URIs.

2. Default to UTF-8.
In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven
Pros: Fully working and tested solution is implemented; recommended by
RFC 3986 for all future schemes; recommended by W3C for use with HTML;
UTF-8 used by all major browsers; supports all characters; most
existing code compatible by default; unquote is inverse of quote.
Cons: By default, URIs may have invalid octet sequences (not possible
to reverse).

3. quote default to UTF-8, unquote default to Latin-1.
In favour: André Malo
Pros: quote able to handle all characters; unquote able to handle all sequences.
Cons: unquote is not inverse of quote; totally inconsistent.

4. quote accepts either bytes or str, unquote default to outputting
bytes unless given an encoding argument.
In favour: Bill Janssen
Pros: Technically does what the spec says, which is treat it as an
octet encoding.
Cons: unquote will break most existing code; almost 100% of the time
people will want it as a string.
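
To make the difference between the defaults concrete, here is a small
sketch using urllib.parse as it exists in current Python 3 releases,
which postdates this thread. It is not the patch attached to issue 3300,
and the sample string is only an illustration.

```python
from urllib.parse import quote, unquote

text = "café"

# quote() encodes str input as UTF-8 by default before percent-encoding.
quoted = quote(text)

# Unquoting with the same encoding restores the original string.
round_trip = unquote(quoted, encoding="utf-8")

# Decoding the same octets as Latin-1 never fails, but yields mojibake.
as_latin1 = unquote(quoted, encoding="latin-1")

print(quoted)
print(round_trip == text)
print(as_latin1 == text)
```

With matching encodings unquote is the inverse of quote; with a UTF-8
quote and a Latin-1 unquote (option 3) the round trip silently changes
the text.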



I'll just comment on #4 since I haven't already. Let's talk about
quote and unquote separately. For quote, I'm all for letting it accept
a bytes as well as a str. That doesn't break anything or surprise
anyone.

For unquote, I think it will break a lot and surprise everyone. I
think that while this may be "purely" the best option, it's pretty
silly. I reckon the vast majority of users will be surprised when they
see it spitting out a bytes object, and all that most people will do
is decode it as UTF-8. Besides, while you're reading the RFCs as "URLs
specify a method for encoding octet sequences", I'm reading them as
"URLs specify a method for encoding strings, and leave the character
encoding unspecified." The second reading supports the idea that
unquote outputs a str.

I'm also recommending we add unquote_to_bytes to do what you suggest
unquote should do. (So either way we'll get both versions of unquote;
I'm just suggesting the one called "unquote" do the thing everybody
expects). But that's less of a priority so I want to commit these
urgent fixes first.
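
For reference, a sketch of the str and bytes variants side by side,
assuming the urllib.parse module of current Python 3 releases (which
does provide unquote_to_bytes). The sample octets are arbitrary and
chosen because they are not valid UTF-8.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# quote() accepts either str or bytes and always returns str.
from_text = quote("a b")
from_octets = quote(b"\xff\xfe")

# unquote() returns str, unquote_to_bytes() returns the raw octets.
as_text = unquote("a%20b")
as_octets = unquote_to_bytes("%FF%FE")

print(from_text, from_octets)
print(repr(as_text), repr(as_octets))
```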

I'm basically saying just two things: 1. The standards are undefined;
2. Therefore we should pick the most useful and/or intuitive default.
IMHO choosing UTF-8 *is* the most useful AND intuitive, and will be
more so in the future when more technologies are hard-coded as UTF-8
(which this RFC recommends they do in the future).

I am also quite adamant that unquote be the inverse of quote.

Are there any more opinions on this matter? It would be good to reach
a consensus. If anyone seriously wants to push a different alternative
to mine, please write a working implementation and attach it to issue
3300.

On the technical side of things, does anybody have time to review my
patch for this issue?
http://bugs.python.org/issue3300
Patch 5.
It's just a patch for unquote, quote, and small related functions, as
well as numerous changes to test cases and documentation.

Cheers
Matt


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Matt Giuca
Arg! Damnit, why do my replies get split off from the main thread?
Sorry about any confusion this may be causing.


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Oleg Broytmann
On Thu, Jul 31, 2008 at 12:11:40AM +1000, Matt Giuca wrote:
> 2. Default to UTF-8.
> In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven

   Count me too: +1. Most sites I use these days use UTF-8 for URL
encoding. Examples:

Wikipedia:
http://ru.wikipedia.org/wiki/%D0%93%D0%B2%D0%B8%D0%B4%D0%BE_%D0%B2%D0%B0%D0%BD_%D0%A0%D0%BE%D1%81%D1%81%D1%83%D0%BC

LingVo (Russian-English dictionary):
http://lingvo.yandex.ru/en?text=%D0%BF%D0%B8%D1%82%D0%BE%D0%BD

>>> print urllib.quote(unicode('питон', 'koi8-r').encode('utf-8'))
%D0%BF%D0%B8%D1%82%D0%BE%D0%BD
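
The session above is Python 2. A rough Python 3 equivalent, assuming
the current urllib.parse module (where str input is encoded as UTF-8
by default), might look like this; it also decodes the path component
of the Wikipedia URL quoted above.

```python
from urllib.parse import quote, unquote

# Python 3 str is already Unicode, so no explicit koi8-r decoding step.
word = quote("питон")

# Path component copied from the Wikipedia URL in this message.
title = unquote(
    "%D0%93%D0%B2%D0%B8%D0%B4%D0%BE_%D0%B2%D0%B0%D0%BD_"
    "%D0%A0%D0%BE%D1%81%D1%81%D1%83%D0%BC"
)

print(word)
print(title)
```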

Oleg.
-- 
 Oleg Broytmann            http://phd.pp.ru/            [EMAIL PROTECTED]
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Facundo Batista
2008/7/30 Matt Giuca <[EMAIL PROTECTED]>:

> 2. Default to UTF-8.
> In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven
> Pros: Fully working and tested solution is implemented; recommended by
> RFC 3986 for all future schemes; recommended by W3C for use with HTML;
> UTF-8 used by all major browsers; supports all characters; most
> existing code compatible by default; unquote is inverse of quote.
> Cons: By default, URIs may have invalid octet sequences (not possible
> to reverse).

+1, assuming that if you have a different encoding in the URI you can
pass it as a parameter.

Regards,

-- 
. Facundo

Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Antoine Pitrou
Facundo Batista writes:

> 
> 2008/7/30 Matt Giuca:
> 
> > 2. Default to UTF-8.
> > In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven
> > Pros: Fully working and tested solution is implemented; recommended by
> > RFC 3986 for all future schemes; recommended by W3C for use with HTML;
> > UTF-8 used by all major browsers; supports all characters; most
> > existing code compatible by default; unquote is inverse of quote.
> > Cons: By default, URIs may have invalid octet sequences (not possible
> > to reverse).
> 
> +1, assuming that if you have a different encoding in the URI you can
> pass it as a parameter.

+1 for me as well, with an optional encoding parameter to override the default.
Also, your "con" is a "pro" to me, since it means errors are reported instead of
silently producing garbage (as would be the case with latin1).
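
A sketch of the difference, assuming the urllib.parse module of current
Python 3 releases. Note that in that module unquote defaults to
errors="replace", so strict reporting has to be requested explicitly;
the input "%FF" is just an example of an octet that is never valid
UTF-8.

```python
from urllib.parse import unquote

bad = "%FF"

# Default error handling substitutes U+FFFD for undecodable octets.
lenient = unquote(bad)

# Latin-1 maps every octet to a character, so it can never complain.
latin1 = unquote(bad, encoding="latin-1")

# Strict UTF-8 decoding reports the problem instead.
try:
    unquote(bad, errors="strict")
    raised = False
except UnicodeDecodeError:
    raised = True

print(repr(lenient), repr(latin1), raised)
```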

Regards

Antoine.




Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread André Malo
[I was pretty busy these days, so sorry for jumping in late again]

* Matt Giuca wrote: 

> 1. Leave it as it is. quote is Latin-1 if range(0,256), fallback to
> UTF-8. unquote is Latin-1.
> In favour: Anybody who doesn't reply to this thread
> Pros: Already implemented; some existing code depends upon ord values
> of string being the same as they were for byte strings; possible to
> hack around it.
> Cons: unquote is not inverse of quote; quote behaviour
> internally-inconsistent; garbage when unquoting UTF-8-encoded URIs.

> 2. Default to UTF-8.
> In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven
> Pros: Fully working and tested solution is implemented; recommended by
> RFC 3986 for all future schemes; recommended by W3C for use with HTML;
> UTF-8 used by all major browsers; supports all characters; most
> existing code compatible by default; unquote is inverse of quote.
> Cons: By default, URIs may have invalid octet sequences (not possible
> to reverse).

Con: URI encoding does not encode characters.

>
> 3. quote default to UTF-8, unquote default to Latin-1.
> In favour: André Malo
> Pros: quote able to handle all characters; unquote able to handle all
> sequences. Cons: unquote is not inverse of quote; totally inconsistent.

I'm not in favour of that. I merely answered a question there ;)

I'm actually in favour of encoding bytes only back and forth. A useful 
extension would be *another* function which wraps quote/unquote and encodes 
and decodes characters.


> 4. quote accepts either bytes or str, unquote default to outputting
> bytes unless given an encoding argument.
> In favour: Bill Janssen
> Pros: Technically does what the spec says, which is treat it as an
> octet encoding.
> Cons: unquote will break most existing code; almost 100% of the time
> people will want it as a string.
>
> 
>
> I'll just comment on #4 since I haven't already. Let's talk about
> quote and unquote separately. For quote, I'm all for letting it accept
> a bytes as well as a str. That doesn't break anything or surprise
> anyone.
>
> For unquote, I think it will break a lot and surprise everyone. I
> think that while this may be "purely" the best option, it's pretty
> silly. I reckon the vast majority of users will be surprised when they
> see it spitting out a bytes object, and all that most people will do
> is decode it as UTF-8. Besides, while you're reading the RFCs as "URLs
> specify a method for encoding octet sequences", I'm reading them as
> "URLs specify a method for encoding strings, and leave the character
> encoding unspecified." The second reading supports the idea that
> unquote outputs a str.
>
> I'm also recommending we add unquote_to_bytes to do what you suggest
> unquote should do. (So either way we'll get both versions of unquote;
> I'm just suggesting the one called "unquote" do the thing everybody
> expects). But that's less of a priority so I want to commit these
> urgent fixes first.
>
> I'm basically saying just two things: 1. The standards are undefined;

That's still disputed...

> 2. Therefore we should pick the most useful and/or intuitive default.
> IMHO choosing UTF-8 *is* the most useful AND intuitive, and will be
> more so in the future when more technologies are hard-coded as UTF-8
> (which this RFC recommends they do in the future).

See my suggestion above.

nd


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Guido van Rossum
On Wed, Jul 30, 2008 at 8:09 AM, André Malo <[EMAIL PROTECTED]> wrote:
> I'm actually in favour of encoding bytes only back and forth. A useful
> extension would be *another* function which wraps quote/unquote and encodes
> and decodes characters.

I'd reverse this. By all means, add a new pair of functions that is
bytes in / bytes out. But keep the existing functions purely string in
/ string out, hardcoded to UTF-8. People wanting another encoding can
use the bytes functions and explicit encode / decode calls.
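
A sketch of that division of labour, using the bytes-oriented helpers
that exist in the urllib.parse module of current Python 3 releases
(quote_from_bytes and unquote_to_bytes). The choice of KOI8-R is only
an example of "another encoding".

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

text = "питон"

# Caller picks the encoding explicitly, then quotes the resulting bytes.
quoted = quote_from_bytes(text.encode("koi8-r"))

# Reverse direction: unquote to bytes, then decode explicitly.
restored = unquote_to_bytes(quoted).decode("koi8-r")

print(quoted)
print(restored == text)
```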

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Bill Janssen
> For unquote, I think it will break a lot and surprise everyone. I
> think that while this may be "purely" the best option, it's pretty
> silly.

I don't mind being silly to do the right thing.  Happens to me a lot :-).

Bill


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Bill Janssen
> On Wed, Jul 30, 2008 at 8:09 AM, André Malo <[EMAIL PROTECTED]> wrote:
> > I'm actually in favour of encoding bytes only back and forth. A useful
> > extension would be *another* function which wraps quote/unquote and encodes
> > and decodes characters.
> 
> I'd reverse this. By all means, add a new pair of functions that is
> bytes in / bytes out. But keep the existing functions purely string in
> / string out, hardcoded to UTF-8. People wanting another encoding can
> use the bytes functions and explicit encode / decode calls.

Actually (as I pointed out before) the existing functions are not
string-in/string-out.  They are something-in and bytes-out.  They just
look like string-in/string-out because of the confusion between byte
strings and Unicode strings in Python 1 and 2.

Look, Matt's suggestion is a degradation of the integrity of the
stdlib, because it enthrones a broken understanding, a misreading of
the RFC, in a very prominent place.  I'd prefer not to have Python
contribute to that breakage.  Keep the functions the way they are now:
bytes-in and bytes-out.

Bill


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Guido van Rossum
On Wed, Jul 30, 2008 at 9:52 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:
>> On Wed, Jul 30, 2008 at 8:09 AM, André Malo <[EMAIL PROTECTED]> wrote:
>> > I'm actually in favour of encoding bytes only back and forth. A useful
>> > extension would be *another* function which wraps quote/unquote and encodes
>> > and decodes characters.
>>
>> I'd reverse this. By all means, add a new pair of functions that is
>> bytes in / bytes out. But keep the existing functions purely string in
>> / string out, hardcoded to UTF-8. People wanting another encoding can
>> use the bytes functions and explicit encode / decode calls.
>
> Actually (as I pointed out before) the existing functions are not
> string-in/string-out.  They are something-in and bytes-out.  They just
> look like string-in/string-out because of the confusion between byte
> strings and Unicode strings in Python 1 and 2.

Actually, we'd need to look at the various other APIs in Py3k before
we can decide whether these should be considered taking or returning
bytes or text. It looks like all other APIs in the Py3k version of
urllib treat URLs as text. I don't think switching these to bytes
would be a good idea; you might as well claim that filenames should be
bytes because that's how the filesystem stores them.

> Look, Matt's suggestion is a degradation of the integrity of the
> stdlib, because it enthrones a broken understanding, a misreading of
> the RFC, in a very prominent place.  I'd prefer not to have Python
> contribute to that breakage.  Keep the functions the way they are now:
> bytes-in and bytes-out.

I think that would break too much code, without a good way to
automatically fix it.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Bill Janssen
> Actually (as I pointed out before) the existing functions are not
> string-in/string-out.  They are something-in and bytes-out.

Sorry, this is wrong.  "quote" is clearly bytes-in and string-out.
"unquote" is clearly string-in and bytes-out.

The whole point of "quote" is to take an arbitrary sequence of bytes
and represent them as an ASCII string, while unquote reverses this
process.  Again, I urge everyone participating in this discussion to
read RFC 3986.  We're not creating in a vacuum here; we're talking
about implementation of a standard.

Bill


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Bill Janssen
> It looks like all other APIs in the Py3k version of
> urllib treat URLs as text.

The URL is text, a string of ASCII characters.  We're just talking
about urllib.quote() and urllib.unquote(), which are there to support
the text-ization of binary values, and the de-text-ization.

> I think that would break too much code, without a good way to
> automatically fix it.

You'd rather break Python?  Somehow I don't think so.

Here's the signature I'm proposing:

  quote() -- takes string or bytes, and produces string.

 If input is a string, looks to optional "encoding" parameter to
 determine character set encoding to use to transform it to byte before
 quoting it.  If "encoding" is not specified, defaults to UTF-8.

  unquote() -- takes string, produces bytes or string

 If optional "encoding" parameter is specified, decodes bytes with
 that encoding and returns string.  Otherwise, returns bytes.
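
The proposed unquote() can be sketched as follows. The function name
unquote_proposed is hypothetical, and it is built on unquote_to_bytes
from the urllib.parse module of current Python 3 releases purely for
illustration; it is not code from any patch in this thread.

```python
from urllib.parse import unquote_to_bytes


def unquote_proposed(s, encoding=None):
    """Hypothetical: return bytes unless an encoding is given."""
    raw = unquote_to_bytes(s)
    if encoding is None:
        return raw                # bytes result
    return raw.decode(encoding)   # str result


print(repr(unquote_proposed("a%25b")))
print(repr(unquote_proposed("a%25b", "utf-8")))
```

Note that the return type depends on whether the optional argument is
supplied, so callers have to know which form they asked for.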

Bill


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Guido van Rossum
On Wed, Jul 30, 2008 at 10:33 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:
>> It looks like all other APIs in the Py3k version of
>> urllib treat URLs as text.
>
> The URL is text, a string of ASCII characters.  We're just talking
> about urllib.quote() and urllib.unquote(), which are there to support
> the text-ization of binary values, and the de-text-ization.
>
>> I think that would break too much code, without a good way to
>> automatically fix it.
>
> You'd rather break Python?  Somehow I don't think so.

Let's stop the rhetoric, or I'll have to beat you over the head with
the Zen of Python. :-)

urllib is not meant as a reference implementation of any RFC; it is
meant as a practical tool for Python users writing web apps (servers
and clients).

> Here's the signature I'm proposing:
>
>  quote() -- takes string or bytes, and produces string.
>
> If input is a string, looks to optional "encoding" parameter to
> determine character set encoding to use to transform it to byte before
> quoting it.  If "encoding" is not specified, defaults to UTF-8.

No contest here, since it supports the common string->string use case.
E.g. quote('a%b') returns 'a%25b'.

>  unquote() -- takes string, produces bytes or string
>
> If optional "encoding" parameter is specified, decodes bytes with
> that encoding and returns string.  Otherwise, returns bytes.

The default of returning bytes will break almost all uses. Most code
will use the unquoted result as a text string, not as bytes -- e.g. a
server has to unquote the values it receives from a form (whether POST
or GET), but almost always the unquoted values are text, e.g.
someone's name or address, or a draft email message.

(Aside: I dislike functions that have a different return type based on
the value of a parameter.)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Fuzzing bugs: most bugs are closed

2008-07-30 Thread Guido van Rossum
On Mon, Jul 21, 2008 at 10:41 AM, A.M. Kuchling <[EMAIL PROTECTED]> wrote:
> On Mon, Jul 21, 2008 at 03:53:18PM +, Antoine Pitrou wrote:
>> The underscore at the beginning of _sre clearly indicates that the module is
>> not recommended for direct consumption, IMO. Even the functions that don't
>> themselves start with an underscore...
>
> Sure, but if someone is trying to break in or DoS your application
> server, they don't care if the module starts with an underscore or
> not.
>
> To answer Victor's original question: the parser & compiler that turn
> a regex into bytecode is written in Python.  I can't think of a way to
> prevent other Python modules from importing _sre or accessing the
> compile() function; if nothing else, code could always do 'import re ;
> re.sre_compile._sre.compile(...)'.

I've written a re-code verifier for the Google App Engine. I have
permission to open source this, hopefully I will get to this before
2.6 beta 3.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Jeff Hall
>
>
> (Aside: I dislike functions that have a different return type based on
> the value of a parameter.)
>
>
I wanted to stay out of the whole discussion as it's largely over my head...
But I did want to express support for this idea which I think almost rises
to the level of a standard... I see more bugs created in our software
because of the above issues than anything else... I have no problem with
functions that accept various input but producing various outputs just seems
to wreak havoc...


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Bill Janssen
> >  unquote() -- takes string, produces bytes or string
> >
> > If optional "encoding" parameter is specified, decodes bytes with
> > that encoding and returns string.  Otherwise, returns bytes.
> 
> The default of returning bytes will break almost all uses. Most code
> will use the unquoted result as a text string, not as bytes -- e.g. a
> server has to unquote the values it receives from a form (whether POST
> or GET), but almost always the unquoted values are text, e.g.
> someone's name or address, or a draft email message.

I actually do know a lot about the uses of this function...

But:  OK, OK, I yield.  Though I still think this is a bad idea, I'll
shut up if we can also add "unquote_as_bytes" which returns a byte
sequence instead of a string.  I'll just change my code to use that.

> (Aside: I dislike functions that have a different return type based on
> the value of a parameter.)

Fair enough.

Bill


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Guido van Rossum
On Wed, Jul 30, 2008 at 12:49 PM, Bill Janssen <[EMAIL PROTECTED]> wrote:
>> >  unquote() -- takes string, produces bytes or string
>> >
>> > If optional "encoding" parameter is specified, decodes bytes with
>> > that encoding and returns string.  Otherwise, returns bytes.
>>
>> The default of returning bytes will break almost all uses. Most code
>> will use the unquoted result as a text string, not as bytes -- e.g. a
>> server has to unquote the values it receives from a form (whether POST
>> or GET), but almost always the unquoted values are text, e.g.
>> someone's name or address, or a draft email message.
>
> I actually do know a lot about the uses of this function...
>
> But:  OK, OK, I yield.  Though I still think this is a bad idea, I'll
> shut up if we can also add "unquote_as_bytes" which returns a byte
> sequence instead of a string.  I'll just change my code to use that.
>
>> (Aside: I dislike functions that have a different return type based on
>> the value of a parameter.)
>
> Fair enough.

I think this is as close to consensus as we can get on this issue. Can
whoever wrote the patch adjust the patch to this outcome? (I think the
only change is to remove the encoding arguments and make separate
functions for bytes.)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Bill Janssen
> I think this is as close to consensus as we can get on this issue. Can
> whoever wrote the patch adjust the patch to this outcome? (I think the
> only change is to remove the encoding arguments and make separate
> functions for bytes.)

This is 2.7/3.1 only, right?  I'm looking at the bales of code I've
got that says something like,

  v = urllib.quote_plus(x.encode("UTF-8", "strict"))

then later on

  x = unicode(urllib.unquote_plus(v), "UTF-8", "strict")
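
For comparison, a sketch of the same round trip with the urllib.parse
module of current Python 3 releases, where the encoding and error
handling are passed as keyword arguments instead of being applied by
hand. The sample value of x is made up.

```python
from urllib.parse import quote_plus, unquote_plus

x = "a b&é"

# Encode to UTF-8 and percent-encode in one call; spaces become "+".
v = quote_plus(x, encoding="utf-8", errors="strict")

# Reverse the process, failing loudly on invalid UTF-8.
restored = unquote_plus(v, encoding="utf-8", errors="strict")

print(v)
print(restored == x)
```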

Bill



[Python-Dev] critical issues for 2.6 and 3.0

2008-07-30 Thread Benjamin Peterson
I just went through the disturbingly long list of 67 open issues with
a "critical" priority pinging and trying to get things moving. There
are ~55 now; I was able to close some, but others I promoted to
release blocker for beta 3. Shouldn't all criticals be resolved by the
final?

I've never been through a Python release before, but I find these
statistics rather worrying if we want to make the October release
date. It doesn't help that we are low on active core developers,
presumably because they are taking full advantage of their summer
vacations. :) (Speaking of which, I'm leaving this Saturday.)

Please focus on getting fixes reviewed, checked in, and their issues
closed so we can bring beta 3 out on time!

-- 
Cheers,
Benjamin Peterson
"There's no place like 127.0.0.1."


Re: [Python-Dev] critical issues for 2.6 and 3.0

2008-07-30 Thread Brett Cannon
On Wed, Jul 30, 2008 at 7:31 PM, Benjamin Peterson
<[EMAIL PROTECTED]> wrote:
> I just went through the disturbingly long list of 67 open issues with
> a "critical" priority pinging and trying to get things moving. There
> are ~55 now; I was able to close some, but others I promoted to
> release blocker for beta 3. Shouldn't all criticals be resolved by the
> final?
>

Probably, but at that point they will be promoted to release blocker
as necessary.

> I've never been through a Python release before, but I find these
> statistics rather worrying if we want to make the October release
> date.

If we don't make the release, then we don't make it. Plus this is one
of the more complicated releases that I have been through thanks to
the release of two simultaneous major revisions, so having a lot to do
is not a shock. But people tend to step up work when a beta release is
coming so when we get closer to b3 more work will probably land.

Another thing to keep in mind beyond the open issues is the code in
2.6 that is not 3.0 compatible when Python is run with -3. I just
finished running regrtest with -3 and have a text file listing all of
the code that has some warning thanks to -3. I will try to open an
issue with those files listed at some point soon, but I will hopefully
be able to plow through them rather quickly since most of them are
minor like dict.has_key(), etc.
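
As a reminder of what such a fix typically looks like: dict.has_key()
is removed in 3.0, while the "in" operator works in both 2.x and 3.x.
The dictionary below is a made-up example.

```python
inventory = {"spam": 3}

# Python 2 only, flagged under -3:  inventory.has_key("spam")
# Portable spelling that works in both 2.x and 3.x:
has_spam = "spam" in inventory
has_eggs = "eggs" in inventory

print(has_spam, has_eggs)
```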

-Brett


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Matt Giuca
> Con: URI encoding does not encode characters.

OK, for all the people who say URI encoding does not encode characters: yes
it does. This is not an encoding for binary data, it's an encoding for
character data, but it's unspecified how the strings map to octets before
being percent-encoded. From RFC 3986, section 1.2.1:

> Percent-encoded octets (Section 2.1) may be used within a URI to represent
> characters outside the range of the US-ASCII coded character set if this
> representation is allowed by the scheme or by the protocol element in which
> the URI is referenced.  Such a definition should specify the character
> encoding used to map those characters to octets prior to being
> percent-encoded for the URI.


So the string->string proposal is actually correct behaviour. I'm all in
favour of a bytes->string version as well, just not with the names "quote"
and "unquote".

I'll prepare a new patch shortly which has bytes->string and string->bytes
versions of the functions as well. (quote will accept either type, while
unquote will output a str, there will be a new function unquote_to_bytes
which outputs a bytes - is everyone happy with that?)

Guido says:

> Actually, we'd need to look at the various other APIs in Py3k before we can
> decide whether these should be considered taking or returning bytes or text.
> It looks like all other APIs in the Py3k version of urllib treat URLs as
> text.


Yes, as I said in the bug tracker, I've groveled over the entire stdlib to
see how my patch affects the behaviour of dependent code. Aside from a few
minor bits which assumed octets (and did their own encoding/decoding) (which
I fixed), all the code assumes strings and is very happy to go on assuming
this, as long as the URIs are encoded with UTF-8, which they almost
certainly are.

Guido says:

> I think the only change is to remove the encoding arguments and ...


You really want me to remove the encoding= named argument? And hard-code
UTF-8 into these functions? It seems like we may as well have the optional
encoding argument, as it does no harm and could be of significant benefit.
I'll post a patch with the unquote_to_bytes function, but leave the encoding
arguments in until this point is clarified.

Matt


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Guido van Rossum
On Wed, Jul 30, 2008 at 8:49 PM, Matt Giuca <[EMAIL PROTECTED]> wrote:
>
>> Con: URI encoding does not encode characters.
>
> OK, for all the people who say URI encoding does not encode characters: yes
> it does. This is not an encoding for binary data, it's an encoding for
> character data, but it's unspecified how the strings map to octets before
> being percent-encoded. From RFC 3986, section 1.2.1:
>
>> Percent-encoded octets (Section 2.1) may be used within a URI to represent
>> characters outside the range of the US-ASCII coded character set if this
>> representation is allowed by the scheme or by the protocol element in which
>> the URI is referenced.  Such a definition should specify the character
>> encoding used to map those characters to octets prior to being
>> percent-encoded for the URI.
>
> So the string->string proposal is actually correct behaviour. I'm all in
> favour of a bytes->string version as well, just not with the names "quote"
> and "unquote".
>
> I'll prepare a new patch shortly which has bytes->string and string->bytes
> versions of the functions as well. (quote will accept either type, while
> unquote will output a str, there will be a new function unquote_to_bytes
> which outputs a bytes - is everyone happy with that?)

I'd rather have two pairs of functions, so that those who want to give
the readers of their code a clue can do so. I'm not opposed to having
redundant functions that accept either string or bytes though, unless
others prefer not to.
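
A sketch of what two explicit pairs could look like, assuming the function
names found in Python 3's urllib.parse today (quote/unquote for text,
quote_from_bytes/unquote_to_bytes for octets). These names are not taken from
this message and are used only for illustration.

```python
from urllib.parse import quote, unquote, quote_from_bytes, unquote_to_bytes

# Text pair: str in, str out.
text_quoted = quote("a b")
text_back = unquote(text_quoted)

# Bytes pair: the octets need not be valid in any character encoding.
octets = b"\xff\x00"
octets_quoted = quote_from_bytes(octets)
octets_back = unquote_to_bytes(octets_quoted)

# The bytes-only quoter refuses text, which gives readers the "clue".
try:
    quote_from_bytes("a b")
    rejected_text = False
except TypeError:
    rejected_text = True
```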

> Guido says:
>>
>> Actually, we'd need to look at the various other APIs in Py3k before we
>> can decide whether these should be considered taking or returning bytes or
>> text. It looks like all other APIs in the Py3k version of urllib treat URLs
>> as text.
>
> Yes, as I said in the bug tracker, I've groveled over the entire stdlib to
> see how my patch affects the behaviour of dependent code. Aside from a few
> minor bits which assumed octets (and did their own encoding/decoding) (which
> I fixed), all the code assumes strings and is very happy to go on assuming
> this, as long as the URIs are encoded with UTF-8, which they almost
> certainly are.

Sorry, I have yet to look at the tracker (only so many minutes in a day...).

> Guido says:
>>
>> I think the only change is to remove the encoding arguments and ...
>
> You really want me to remove the encoding= named argument? And hard-code
> UTF-8 into these functions? It seems like we may as well have the optional
> encoding argument, as it does no harm and could be of significant benefit.
> I'll post a patch with the unquote_to_bytes function, but leave the encoding
> arguments in until this point is clarified.

I don't mind an encoding argument, as long as it isn't used to change
the return type (as Bill was proposing).
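
To make the distinction concrete, a small sketch (again assuming the
urllib.parse API of current Python 3, not the patch itself): the encoding
argument changes which octets are produced or expected, while the return type
stays str in every case.

```python
from urllib.parse import quote, unquote

utf8_form = quote("é")                         # default encoding is UTF-8
latin1_form = quote("é", encoding="latin-1")   # same text, different octets

# Decoding with the matching encoding round-trips the text.
round_trip = unquote(latin1_form, encoding="latin-1")

# Decoding with the wrong encoding still returns a str; with the default
# errors="replace" the undecodable octet becomes U+FFFD.
mismatched = unquote(latin1_form)
```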

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


[Python-Dev] Memory Error while reading large file

2008-07-30 Thread Sumant Gupta
Hi

I have a problem reading a very large text file.
When I call the len function to get the total number of lines in the file, I
get a memory error.
I am reading a list of files in a loop. Two files are read properly, but when
the third file is read, it gives a memory error.
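
The message does not show the code, so the following is only a guess at the
cause: if the line count comes from something like len(f.readlines()), the
whole file is held in memory at once. A minimal sketch of counting lines by
streaming instead (the function name count_lines is hypothetical):

```python
def count_lines(path):
    """Count lines by iterating over the file instead of loading it whole."""
    total = 0
    # Binary mode avoids decoding errors; iteration yields one line at a time,
    # so only the current line is held in memory.
    with open(path, "rb") as f:
        for _ in f:
            total += 1
    return total
```

Calling count_lines on each file in the loop keeps memory use roughly
constant, provided no single line is itself enormous.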

Sumant Gupta
Software Engineer
Ext:5105





Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Stephen J. Turnbull
Matt Giuca writes:

 > OK, for all the people who say URI encoding does not encode characters: yes
 > it does. This is not an encoding for binary data, it's an encoding for
 > character data, but it's unspecified how the strings map to octets before
 > being percent-encoded.

In other words, it's an encoding for binary data, since the octet
sequences that might be encountered are completely unrestricted.  I
have to side with Bill on this.  URIs are sequences of characters, but
the character set used must contain the ASCII repertoire as a subset,
of which the URI delimiters must be mapped to the corresponding ASCII
codes; the rest of the set must be represented as sequences of octets
(which need not even be constant; you could gzip them first for all
URI-encoding cares).

URI-encoding itself is a purely mechanical process which transforms
reserved octets (not used as delimiters) to percent codes.
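
A minimal sketch of that mechanical step, assuming the caller has already
split the URI into components, so every octet outside the RFC 3986 unreserved
set gets encoded. The names UNRESERVED and percent_encode are hypothetical.

```python
# Unreserved characters per RFC 3986 section 2.3, as a set of octet values.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)

def percent_encode(octets: bytes) -> str:
    """Map each octet to itself if unreserved, else to %XX (uppercase hex)."""
    return "".join(
        chr(b) if b in UNRESERVED else "%{:02X}".format(b)
        for b in octets
    )
```

Nothing here knows or cares which character encoding produced the octets,
which is the sense in which the process is purely mechanical.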

 > From RFC 3986, section 1.2.1:

 > > Percent-encoded octets (Section 2.1) may be used within a URI to represent
 > > characters outside the range of the US-ASCII coded character set if this
 > > representation is allowed by the scheme or by the protocol element in which
 > > the URI is referenced.  Such a definition should specify the character
 > > encoding used to map those characters to octets prior to being
 > > percent-encoded for the URI.

This is kinda perverted, but suppose you have bytes which are actually a
Japanese string represented in packed EUC-JP.  AFAICS the paragraph above
does *not* say you can't transcode to UTF-8 before percent-encoding, and
in fact you might be required to by the definition of the scheme.
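
A sketch of that transcoding step, assuming Python 3's urllib.parse.quote and
the standard "euc_jp" codec; the sample text is made up for illustration.

```python
from urllib.parse import quote

# Octets as they might arrive: Japanese text in EUC-JP.
euc_jp_octets = "日本語".encode("euc_jp")

# Transcode to UTF-8 first, then percent-encode the UTF-8 octets.
text = euc_jp_octets.decode("euc_jp")
utf8_quoted = quote(text.encode("utf-8"), safe="")

# Percent-encoding the EUC-JP octets directly gives a different URI component.
raw_quoted = quote(euc_jp_octets, safe="")
```

The two results differ even though they represent the same text, so whoever
decodes the URI has to know which convention the scheme prescribes.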

 > So the string->string proposal is actually correct behaviour.

Ye-e-es, but.  What the RFC clearly envisions is not that the
percent-encoder will be handed an unencoded string that looks like a
URI, but rather a sequence of octets representing one component
(scheme, authority, path, query, etc) of a URI.

In other words, a string->string URI encoder should only be called by
a URI builder, and never with a precomposed URI-like string.

Something like

    from urllib.parse import quote  # stand-in encoder; Python 3 name, an assumption

    def URIBuilder(strings):
        """Return a URI built from a list of strings.
        The first string *must* be the scheme.
        If the URI follows the generic URI syntax of RFC 3986, the
        remaining components should be given in the order authority, path,
        fragment, query part [, query part ...]."""

        def uriencode(s):
            """URI encode a string per RFC 3986 Section 3."""
            # We all know what this does.  As a placeholder, percent-encode
            # everything outside the unreserved set.
            return quote(s, safe="")

        if strings[0] == "http":
            # HTTP scheme, delimiters and authority
            uri = "http://" + uriencode(strings[1]) + "/"
            # path, if present
            if len(strings) > 2 and strings[2]:
                uri = uri + uriencode(strings[2])
            # query, if present
            if len(strings) > 4 and strings[4]:
                uri = uri + "?" + uriencode(strings[4])
                # further query parameters, if present
                for s in strings[5:]:
                    uri = uri + ";" + uriencode(s)
            # fragment, if present
            if len(strings) > 3 and strings[3]:
                uri = uri + "#" + uriencode(strings[3])
        elif strings[0] == "mailto":
            uri = "mailto:" + uriencode(strings[1])
        # etc etc
        else:
            raise ValueError("unsupported scheme: " + strings[0])

        return uri

I think you'd have a much easier time enforcing this pedantically
correct usage with a bytes->bytes encoder.

Of course, it's un-Pythonic to enforce pedantry, and we pedants can
use a string->string encoder correctly.

 > You really want me to remove the encoding= named argument? And hard-code
 > UTF-8 into these functions?

A quoting function that accepts bytes *must* have an encoding
argument.  There's no point to passing the quoter bytes unless the
text is represented in a non-Unicode encoding.
