Re: [Python-Dev] A proposed solution for Issue 502236: Asynchronous exceptions between threads

2008-07-12 Thread Robert Brewer
Josiah Carlson wrote:
> This doesn't need to be an interpreter thing; it's easy to implement
> by the user (I've done it about a dozen times using a single global
> flag).  If you want it to be automatic, it's even possible to make it
> happen automatically using sys.settrace() and friends (you can even
> make it reasonably fast if you use a C callback).
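A rough sketch of the sys.settrace() variant mentioned above, using a pure-Python trace function rather than a C callback. The names StopRequested, stop_flag, tracer and tight_loop are made up for this illustration; nothing here comes from the original post.

```python
import threading
import time


class StopRequested(Exception):
    """Hypothetical exception delivered to a worker thread on shutdown."""


stop_flag = threading.Event()    # the "single global flag"
_delivered = threading.local()   # guards against raising twice in one thread


def tracer(frame, event, arg):
    # The interpreter calls this for 'call' events and, because we return
    # the same function, for 'line' events inside the traced frames.
    if stop_flag.is_set() and not getattr(_delivered, "done", False):
        _delivered.done = True
        raise StopRequested()    # propagates into the traced code
    return tracer


def tight_loop(ready, log):
    counter = 0
    try:
        ready.set()              # tell the main thread we are inside the try
        while True:              # never checks the flag itself
            counter += 1
    except StopRequested:
        log.append("stopped")


def demo():
    log = []
    ready = threading.Event()
    threading.settrace(tracer)   # applies to threads started from now on
    worker = threading.Thread(target=tight_loop, args=(ready, log))
    worker.start()
    ready.wait()
    time.sleep(0.05)
    stop_flag.set()
    worker.join(timeout=5)
    threading.settrace(None)
    return log, worker.is_alive()


log, alive = demo()
print(log, alive)
```

The trace function runs on every line of the worker, so this costs far more than an explicit flag check in the loop; that overhead is presumably why a C callback is suggested.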

Agreed. If someone wants a small library to help do this, especially in
web servers, the latest version of CherryPy includes a 'process'
subpackage under a generous license. It does all the things Andy
describes via a Bus object:

> Andy Scott wrote:
> > 1. Put in place a new function call sys.exitapplication, what this
> > would do is:
> >  a. Mark a flag in t0's data structure saying a request to
> > shutdown has been made

This is bus.exit(), which publishes a 'stop' message to all subscribed
'stop' listeners, and then an 'exit' message to any 'exit' listeners.
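To illustrate the publish/subscribe idea only, here is a minimal stand-in. MiniBus and its methods are hypothetical names for this sketch; it is not CherryPy's code and makes no claim about the real Bus API beyond the stop-then-exit ordering described above.

```python
from collections import defaultdict


class MiniBus:
    """Hypothetical, much-simplified publish/subscribe bus."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, channel, callback):
        self._listeners[channel].append(callback)

    def publish(self, channel):
        # Call every listener on the channel, in subscription order.
        for callback in list(self._listeners[channel]):
            callback()

    def exit(self):
        # Ask everything to stop first, then announce the exit.
        self.publish("stop")
        self.publish("exit")


events = []
bus = MiniBus()
bus.subscribe("exit", lambda: events.append("cleanup done"))
bus.subscribe("stop", lambda: events.append("worker stopped"))
bus.exit()
print(events)
```

Whatever order the listeners were registered in, 'stop' listeners run before 'exit' listeners.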

> >  b. Raise a new exception, SystemShuttingDown, in t1.

That's up to the listener.

> >  2. As the main interpreter executes it checks the "shutting down
> > flag" in the per thread data and follows one of two paths:
> > If it is t0:
> >  a. Stops execution of the current code sequence
> >  b. Iterates over all extant threads ...
> >  c. Enters a timed wait loop where it will allow the other
> > threads time to see the signal. It will iterate this loop
> > a set number of times to avoid being blocked on any given
> > thread.

This is implemented as [t.join() for t in threading.enumerate()] in the
main thread.

> >  d. When all threads have exited, or been forcefully closed,
> > raise the SystemShuttingDown exception

The bus just lets the main thread exit at this point.

> > P1. If the thread is in a tight loop will it see the exception? Or
> > more generally: when should the exception be raised?

That's dependent enough on what work the thread is doing that a
completely generic approach is generally not sufficient. Therefore, the
process.bus sends a 'stop' message, and leaves the implementation of the
receiver up to the author of that thread's logic. Presumably, one
wouldn't register a listener for the 'stop' message unless one knew how
to actually stop.

> > P2. When should the interpreter check this flag?
> >
> > I think the answer to both of these problems is to check the flag,
> > and hence raise the exception, in the following circumstances:
> >   - When the interpreter executes a back loop. So this should catch
> > the jump back to the top of a "while True:" loop
> >   - Just before the interpreter makes a call to a hooked in non-
> > Python system function, e.g. file I/O, networking &c.

This is indeed how most well-written apps do it already.

> > Checking at these points should be the minimal required, I think, to
> > ensure that a given thread can not ignore the exception. It may be
> > possible, or even required, to perform the check every time a Python
> > function call is made.

PLEASE don't make Python function calls slower.

> >  1. The Python interpreter has per thread information.
> >  2. The Python interpreter can tell if the system, t0, thread is
> > running.
> >  3. The Python engine has (or can easily obtain) a list of all
> > threads it created.
> >  4. It is possible to raise exceptions as the byte code is
> > executing.

Replace 'Python interpreter' with 'your application' and those become
relatively simple architectural issues: maintain a list of threads, have
them expose an interface to determine if they're running, and make them
monitor a flag to know when another thread is asking them to stop.
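A minimal sketch of that application-level design, assuming a threading.Event as the flag. StoppableWorker and shutdown are illustrative names only.

```python
import threading


class StoppableWorker(threading.Thread):
    """Hypothetical worker that polls a flag once per loop iteration."""

    def __init__(self):
        super().__init__()
        self.stop_requested = threading.Event()
        self.iterations = 0

    def run(self):
        while not self.stop_requested.is_set():
            self.iterations += 1
            # wait() doubles as a sleep that ends early when stop is requested
            self.stop_requested.wait(0.01)

    def stop(self):
        self.stop_requested.set()


def shutdown(workers, timeout=1.0):
    """Ask every worker to stop, wait for each, and return any stragglers."""
    for worker in workers:
        worker.stop()
    for worker in workers:
        worker.join(timeout)
    return [worker for worker in workers if worker.is_alive()]


workers = [StoppableWorker() for _ in range(3)]   # the application's thread list
for worker in workers:
    worker.start()
stragglers = shutdown(workers)
print(len(stragglers))
```

Thread.is_alive() serves as the "is it running" interface, and the timeout on join() keeps the main thread from blocking forever on a worker that ignores the flag.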


Robert Brewer
[EMAIL PROTECTED]

___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Matt Giuca
Hi all,

My first post to the list. In fact, first time Python hacker, long-time
Python user though. (Melbourne, Australia).

Some of you may have seen for the past week or so my bug report on Roundup,
http://bugs.python.org/issue3300

I've spent a heap of effort on this patch now so I'd really like to get some
more opinions and have this patch considered for Python 3.0.

Basically, urllib.quote and unquote seem not to have been updated since
Python 2.5, and because of this they implicitly perform Latin-1 encoding and
decoding (with respect to percent-encoded characters). I think they should
default to UTF-8 for a number of reasons, including that's what other
software such as web browsers use.

I've submitted a patch which fixes quote and unquote to use UTF-8 by
default. I also added extra arguments allowing the caller to choose the
encoding (after discussion, there was some consensus that this would be
beneficial). I have now completed updating the documentation, writing
extensive test cases, and testing the rest of the standard library for code
breakage - with the result being there wasn't really any, everything seems
to just work nicely with UTF-8. You can read the sordid details of my
investigation in the tracker.

Firstly, it'd be nice to hear if people think this is desirable behaviour.
Secondly, if it's feasible to get this patch in Python 3.0. (I think if it
were delayed to Python 3.1, the code breakage wouldn't justify it). And
thirdly, if the first two are positive, if anyone would like to review this
patch and check it in.

I have extensively tested it, and am now pretty confident that it won't
cause any grief if it's checked in.

Thanks very much,
Matt Giuca


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Brett Cannon
On Sat, Jul 12, 2008 at 10:27 AM, Matt Giuca <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> My first post to the list. In fact, first time Python hacker, long-time
> Python user though. (Melbourne, Australia).
>

Welcome!

> Some of you may have seen for the past week or so my bug report on Roundup,
> http://bugs.python.org/issue3300
>
> I've spent a heap of effort on this patch now so I'd really like to get some
> more opinions and have this patch considered for Python 3.0.
>

Hopefully we can get to it in the near future. Since we are having two
more betas (one of them is next week) hopefully there is enough time
before hitting a release candidate to have this looked at.

> Basically, urllib.quote and unquote seem not to have been updated since
> Python 2.5, and because of this they implicitly perform Latin-1 encoding and
> decoding (with respect to percent-encoded characters). I think they should
> default to UTF-8 for a number of reasons, including that's what other
> software such as web browsers use.
>
> I've submitted a patch which fixes quote and unquote to use UTF-8 by
> default. I also added extra arguments allowing the caller to choose the
> encoding (after discussion, there was some consensus that this would be
> beneficial). I have now completed updating the documentation, writing
> extensive test cases, and testing the rest of the standard library for code
> breakage - with the result being there wasn't really any, everything seems
> to just work nicely with UTF-8. You can read the sordid details of my
> investigation in the tracker.
>
> Firstly, it'd be nice to hear if people think this is desirable behaviour.

Based on what is said in this email, it sounds reasonable.

> Secondly, if it's feasible to get this patch in Python 3.0. (I think if it
> were delayed to Python 3.1, the code breakage wouldn't justify it).

If what you are saying is true, then it can probably go in as a bug
fix (unless someone else knows something about Latin-1 on the Net that
makes this not true).

> And
> thirdly, if the first two are positive, if anyone would like to review this
> patch and check it in.
>

That I can't say I can necessarily do; have my own bug reports to
work through this weekend. =)

-Brett


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Bill Janssen
> Basically, urllib.quote and unquote seem not to have been updated since
> Python 2.5, and because of this they implicitly perform Latin-1 encoding and
> decoding (with respect to percent-encoded characters). I think they should
> default to UTF-8 for a number of reasons, including that's what other
> software such as web browsers use.

The standard here is RFC 3986, from Jan 2005, which says,

  ``When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be
percent-encoded.''

The "unreserved set" consists of the following ASCII characters:

  ``Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved.  These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.

   unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
''

There are a few other wrinkles; it's worth reading section 2.5
carefully.
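The quoted rule can be sketched in a few lines. The function name percent_encode is made up for this illustration, and it ignores the wrinkles in section 2.5 as well as reserved characters that a caller may want to keep unescaped.

```python
import string

# Byte values of the RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-._~").encode("ascii")
)


def percent_encode(text):
    """Encode text as UTF-8, then percent-encode every octet that is not unreserved."""
    pieces = []
    for octet in text.encode("utf-8"):
        if octet in UNRESERVED:
            pieces.append(chr(octet))
        else:
            pieces.append("%{:02X}".format(octet))
    return "".join(pieces)


print(percent_encode("café"))
print(percent_encode("a b~"))
```

The non-ASCII character becomes two escapes because it occupies two octets in UTF-8.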

I'd say, treat the incoming data as either Unicode (if it's a Unicode
string), or some unknown superset of ASCII (which includes both
Latin-1 and UTF-8) if it's a byte-string (and thus in some unknown
encoding), and apply the appropriate transformation.

Bill



Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Jeroen Ruigrok van der Werven
-On [20080712 19:27], Matt Giuca ([EMAIL PROTECTED]) wrote:
>Basically, urllib.quote and unquote seem not to have been updated since Python
>2.5, and because of this they implicitly perform Latin-1 encoding and decoding
>(with respect to percent-encoded characters). I think they should default to
>UTF-8 for a number of reasons, including that's what other software such as web
>browsers use.

Very nice, I had this somewhere on my todo list to work on. I'm very much
in favour, especially since it synchronizes us with the RFCs (for all I
remember reading about it last time).

-- 
Jeroen Ruigrok van der Werven  / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Can you hear the Dolphin's cry..?


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Martin v. Löwis
> Very nice, I had this somewhere on my todo list to work on. I'm very much
> in favour, especially since it synchronizes us with the RFCs (for all I
> remember reading about it last time).

I still think that it doesn't. The RFCs haven't changed, and can't
change for compatibility reasons. The encoding of non-ASCII characters
in URLs remains as underspecified as it always was.

Now, with IRIs, the situation is different, but I don't think the patch
claims to implement IRIs (and if so, it perhaps shouldn't change URL
processing in doing so).

Regards,
Martin



Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Matt Giuca
Thanks for all the replies, and making me feel welcome :)

>
> If what you are saying is true, then it can probably go in as a bug
> fix (unless someone else knows something about Latin-1 on the Net that
> makes this not true).
>

Well from what I've seen, the only time Latin-1 naturally appears on the net
is when you have a web page in Latin-1 (either explicit or inferred; and
note that a browser like Firefox will infer Latin-1 if it sees only ASCII
characters) with a form in it. Submitting the form, the browser will use
Latin-1 to percent-encode the query string.

So if you write a web app and you don't have any non-ASCII characters or
mention the charset, chances are you'll get Latin-1. But I would argue
you're leaving things to chance and you deserve to get funny behaviour. If
you do any of the following:

   - Use a non-ASCII character, encoded as UTF-8 on the page.
   - Send a Content-Type: ; charset=utf-8.
   - In HTML, set a .
   - In the form itself, set .

then the browser will encode the form data as UTF-8. And most "proper" web
pages should get themselves explicitly served as UTF-8.
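To show what the two cases look like on the wire, here is a small illustration using urlencode from the urllib.parse module of current Python 3 releases, which takes an encoding argument. The field name and value are invented for the example, and it only imitates what a browser would send.

```python
from urllib.parse import urlencode

fields = {"q": "café"}   # hypothetical form field

# Query string a Latin-1 page would typically produce
as_latin1 = urlencode(fields, encoding="latin-1")

# Query string a UTF-8 page would typically produce
as_utf8 = urlencode(fields, encoding="utf-8")

print(as_latin1)
print(as_utf8)
```

The same character arrives as one escaped byte in the first case and two in the second, so the server has to know which encoding was used before it can decode.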

> That I can't say I can necessarily do; have my own bug reports to
> work through this weekend. =)


OK well I'm busy for the next few days; after that I can do a patch trade
with someone. (That is if I am allowed to do reviews; not sure since I don't
have developer privileges).


On Sun, Jul 13, 2008 at 5:58 AM, Mark Hammond <[EMAIL PROTECTED]>
wrote:

> > My first post to the list. In fact, first time Python hacker,
> > long-time Python user though. (Melbourne, Australia).
>
> Cool - where exactly?  I'm in Wantirna (although not at this very moment -
> I'm in Lithuania, but home again in a couple of days)


Cool :) Balwyn.


> * Please take Martin with a grain of salt (I would say "ignore him", but
> that is too strong ;)


Lol, he is a hard man to please, but he's given some good feedback.


On Sun, Jul 13, 2008 at 7:07 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:

>
> The standard here is RFC 3986, from Jan 2005, which says,
>
>  ``When a new URI scheme defines a component that represents textual
> data consisting of characters from the Universal Character Set [UCS],
> the data should first be encoded as octets according to the UTF-8
> character encoding [STD63]; then only those octets that do not
> correspond to characters in the unreserved set should be
> percent-encoded.''


Ah yes, I was originally hung up on the idea that "URLs had to be encoded in
UTF-8", till Martin pointed out that it only says "new URI scheme" there.
It's perfectly valid to have non-UTF-8-encoded URIs. However in practice
they're almost always UTF-8. So I think introducing the new encoding
argument and having it default to "utf-8" is quite reasonable.

> I'd say, treat the incoming data as either Unicode (if it's a Unicode
> string), or some unknown superset of ASCII (which includes both
> Latin-1 and UTF-8) if it's a byte-string (and thus in some unknown
> encoding), and apply the appropriate transformation.
>

Ah there may be some confusion here. We're only dealing with str->str
transformations (which in Python 3 means Unicode strings). You can't put a
bytes in or get a bytes out of either of these functions. I suggested a
"quote_raw" and "unquote_raw" function which would let you do this.

The issue is with the percent-encoded characters in the URI string, which
must be interpreted as bytes, not code points. How then do you convert these
into a Unicode string? (Python 2 did not have this problem, since you simply
output a byte string without caring about the encoding).
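The question can be made concrete with the urllib.parse functions found in current Python 3 releases, where unquote accepts an encoding argument and unquote_to_bytes returns the raw octets. This shows present-day behaviour and is not a rendering of the patch under discussion.

```python
from urllib.parse import unquote, unquote_to_bytes

escaped = "caf%C3%A9"

# The percent-escapes denote bytes, not code points.
raw = unquote_to_bytes(escaped)

# The same bytes give different strings depending on the chosen encoding.
as_utf8 = unquote(escaped, encoding="utf-8")
as_latin1 = unquote(escaped, encoding="latin-1")

# A Latin-1 escape is not valid UTF-8; the default errors="replace"
# substitutes U+FFFD instead of raising.
mismatch = unquote("caf%E9", encoding="utf-8")

print(raw, as_utf8, as_latin1, mismatch)
```

Decoding with the wrong encoding does not fail loudly in either direction, which is why the choice of default matters.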

On Sun, Jul 13, 2008 at 9:10 AM, "Martin v. Löwis" <[EMAIL PROTECTED]>
wrote:

> > Very nice, I had this somewhere on my todo list to work on. I'm very much
> > in favour, especially since it synchronizes us with the RFCs (for all I
> > remember reading about it last time).
>
> I still think that it doesn't. The RFCs haven't changed, and can't
> change for compatibility reasons. The encoding of non-ASCII characters
> in URLs remains as underspecified as it always was.


Correct. But my patch brings us in line with that lack of specification. The
unpatched version forces you to use Latin-1. My patch lets you specify the
encoding to use.


> Now, with IRIs, the situation is different, but I don't think the patch
> claims to implement IRIs (and if so, it perhaps shouldn't change URL
> processing in doing so).


True. I don't claim to have implemented IRIs or even know enough about them
to do that. I'll read up on these things in the next few days.

However, this is a URI library, not IRI. From what I've seen, it's
percent-encoded URIs coming in from the browser, not IRIs. We just need to
make sure with this patch that IRIs don't become less-supported than they
were before; don't need to explicitly support them.

Cheers,
Matt Giuca

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread André Malo
* Matt Giuca wrote:

> Well from what I've seen, the only time Latin-1 naturally appears on the
> net is when you have a web page in Latin-1 (either explicit or inferred;
> and note that a browser like Firefox will infer Latin-1 if it sees only
> ASCII characters) with a form in it. Submitting the form, the browser
> will use Latin-1 to percent-encode the query string.

This POV is way too browser-centric...

> So if you write a web app and you don't have any non-ASCII characters or
> mention the charset, chances are you'll get Latin-1. But I would argue
> you're leaving things to chance and you deserve to get funny behaviour.
> If you do any of the following:
>
>    - Use a non-ASCII character, encoded as UTF-8 on the page.
>    - Send a Content-Type: ; charset=utf-8.
>    - In HTML, set a  />.
>    - In the form itself, set .
>
> then the browser will encode the form data as UTF-8. And most "proper"
> web pages should get themselves explicitly served as UTF-8.

... because

1) URL encoding is not limited to web forms at all

2) The web form encoding depends on the browser settings as well (for 
example, try playing around with the internet explorer settings regarding 
query encoding)

3) The process submitting the form may not be a browser at all

4) The web form may not be under your own control (Search engine forms are a 
common example here, e.g. "put this google form snippet onto your webpage")

5) Different cultures do not choose necessarily between latin-1 and utf-8. 
They deal more with things like, say KOI8-R or Big5.

etc pp

Besides all that and without any offense: "most proper" and "should do" and 
the implication that all web browsers behave the same way are not a good 
location to argue from when talking about implementing a standard ;)

nd
-- 
Wenn nur Ingenieure mit Diplom programmieren würden, hätten wir
wahrscheinlich weniger schlechte Software.
Wir hätten allerdings auch weniger gute Software.
   -- Felix von Leitner in dasr


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Matt Giuca
> This POV is way too browser-centric...
>

This is but one example. Note that I found web forms to be the least
clear-cut example of choosing an encoding. Most of the time applications
seem to be using UTF-8, and all the standards I have read are moving towards
specifying UTF-8 (from being unspecified). I've never seen a standard
specify or even recommend Latin-1.

Where web forms are concerned, basically setting the form accept-charset or
the page charset is the *maximum amount* of control you have over the
encoding. As you say, it can be encoded by another page or the user can
override their settings. Then what can you do as the server? Nothing ...
there's no way to predict the encoding. So you just handle the cases you
have control over.

> 5) Different cultures do not choose necessarily between latin-1 and utf-8.
> They deal more with things like, say KOI8-R or Big5.


Exactly. This is exactly my point - Latin-1 is arbitrary from a standards
point of view. It's just one of the many legacy encodings we'd like to
forget. The UTFs are the only options which support all languages, and UTF-8
is the only ASCII-compatible (and therefore URI-compatible) encoding. So we
should aim to support that as the default.
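A quick check of those two claims using Python's built-in codecs. The sample strings and the helper name can_encode are chosen only for illustration.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


russian = "Привет"
chinese = "中文"

# Legacy encodings each cover only part of the world's scripts.
coverage = {
    name: (can_encode(russian, name), can_encode(chinese, name))
    for name in ("latin-1", "koi8-r", "big5", "utf-8")
}

# UTF-8 leaves ASCII bytes untouched; UTF-16 does not.
path = "index.html"
utf8_is_ascii = path.encode("utf-8") == path.encode("ascii")
utf16_bytes = path.encode("utf-16-le")

print(coverage)
print(utf8_is_ascii, utf16_bytes)
```

Only UTF-8 handles both samples while keeping an ASCII path byte-for-byte identical; UTF-16 inserts NUL bytes between the ASCII characters.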

> Besides all that and without any offense: "most proper" and "should do" and
> the implication that all web browsers behave the same way are not a good
> location to argue from when talking about implementing a standard ;)


I agree. However if there *was* a proper standard we wouldn't have to argue!
"Most proper" and "should do" is the most confident we can be when dealing
with this standard, as there is no correct encoding.

Does anyone have a suggestion which will be more compatible with the rest of
the world than allowing the user to select an encoding, and defaulting to
"utf-8"?