threading.Thread vs. signal.signal

2005-09-17 Thread Jack Orenstein
I'd like to create a program that invokes a function once a second,
and terminates when the user types ctrl-c. So I created a signal
handler, created a threading.Thread which does the invocation every
second, and started the thread. The signal handler seems to be
ineffective. Any idea what I'm doing wrong? This is on Fedora FC4 and
Python 2.4.1. The code appears below.

If I do the while ... sleep in the main thread, then the signal
handler works as expected. (This isn't really a satisfactory
implementation because the function called every second might
take a significant fraction of a second to execute.)

Jack Orenstein


import sys
import signal
import threading
import datetime
import time

class metronome(threading.Thread):
    def __init__(self, interval, function):
        threading.Thread.__init__(self)
        self.interval = interval
        self.function = function
        self.done = False

    def cancel(self):
        print '>>> cancel'
        self.done = True

    def run(self):
        while not self.done:
            time.sleep(self.interval)
            if self.done:
                print '>>> break!'
                break
            else:
                self.function()

def ctrl_c_handler(signal, frame):
    print '>>> ctrl c'
    global t
    t.cancel()
    sys.stdout.close()
    sys.stderr.close()
    sys.exit(0)

signal.signal(signal.SIGINT, ctrl_c_handler)

def hello():
    print datetime.datetime.now()

t = metronome(1, hello)
t.start()
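[Editor's note: one common resolution, sketched here in modern Python 3 syntax, is to keep the main thread alive (only the main thread receives SIGINT, where it raises KeyboardInterrupt by default) and use a threading.Event as both the per-tick sleep and the shutdown flag. The names below are illustrative, not from the original post.]

```python
import datetime
import threading

stop = threading.Event()

def metronome(interval, function):
    # Event.wait() doubles as the sleep and the cancellation check:
    # it returns True as soon as stop.set() is called.
    while not stop.wait(interval):
        function()

def hello():
    print(datetime.datetime.now())

def main():
    t = threading.Thread(target=metronome, args=(1, hello))
    t.start()
    # Only the main thread receives SIGINT, so it must stay alive;
    # Ctrl-C raises KeyboardInterrupt here, not in the worker thread.
    try:
        while t.is_alive():
            t.join(timeout=0.5)
    except KeyboardInterrupt:
        stop.set()
        t.join()
```

Calling main() prints a timestamp once a second until Ctrl-C, and the worker exits within one interval of the request.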
-- 
http://mail.python.org/mailman/listinfo/python-list


Threading and consuming output from processes

2005-02-24 Thread Jack Orenstein
I am developing a Python program that submits a command to each node
of a cluster and consumes the stdout and stderr from each. I want all
the processes to run in parallel, so I start a thread for each
node. There could be a lot of output from a node, so I have a thread
reading each stream, for a total of three threads per node. (I could
probably reduce to two threads per node by having the process thread
handle stdout or stderr.)
I've developed some code and have run into problems using the
threading module, and have questions at various levels of detail.
1) How should I solve this problem? I'm an experienced Java programmer
but new to Python, so my solution looks very Java-like (hence the use of
the threading module). Any advice on the right way to approach the
problem in Python would be useful.
2) How many active Python threads is it reasonable to have at one
time? Our clusters have up to 50 nodes -- is 100-150 threads known to
work? (I'm using Python 2.2.2 on RedHat 9.)
3) I've run into a number of problems with the threading module. My
program seems to work about 90% of the time. The remaining 10%, it
looks like notify or notifyAll don't wake up waiting threads; or I
find some other problem that makes me wonder about the stability of
the threading module. I can post details on the problems I'm seeing,
but I thought it would be good to get general feedback
first. (Googling doesn't turn up any signs of trouble.)
Thanks.
Jack Orenstein
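[Editor's note: the per-stream reader-thread layout described above can be sketched with the subprocess module, which postdates this post; the helper names here are illustrative. Draining each pipe in its own thread prevents the child from blocking when one pipe's buffer fills.]

```python
import subprocess
import threading

def drain(stream, chunks):
    # Read a pipe to EOF so the child never stalls on a full pipe buffer.
    for line in stream:
        chunks.append(line)
    stream.close()

def run_on_node(cmd):
    """Run one command, consuming stdout and stderr in parallel."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = [], []
    readers = [threading.Thread(target=drain, args=(proc.stdout, out)),
               threading.Thread(target=drain, args=(proc.stderr, err))]
    for r in readers:
        r.start()
    status = proc.wait()
    for r in readers:
        r.join()
    return status, b''.join(out), b''.join(err)
```

Called once per node from a per-node submitting thread, this gives the three-threads-per-node layout the post describes (or two, if the submitting thread drains one of the streams itself).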


Re: Threading and consuming output from processes

2005-02-26 Thread Jack Orenstein
I asked:
I am developing a Python program that submits a command to each node
of a cluster and consumes the stdout and stderr from each. I want all
the processes to run in parallel, so I start a thread for each
node. There could be a lot of output from a node, so I have a thread
reading each stream, for a total of three threads per node. (I could
probably reduce to two threads per node by having the process thread
handle stdout or stderr.)
Simon Wittber said:
> In the past, I have used the select module to manage asynchronous
> IO operations.
>
> I pass the select.select function a list of file-like objects, and it
> returns a list of file-like objects which are ready for reading and
> writing.
Donn Cave said:
> As I see another followup has already mentioned, the classic
> "pre threads" solution to multiple I/O sources is the select(2)
> function, ...
Thanks for your replies. The streams that I need to read contain
pickled data. The select call returns files that have available input,
and I can use read(file_descriptor, max) to read some of the input
data. But then how can I convert the bytes just read into a stream for
unpickling? I somehow need to take the bytes arriving for a given file
descriptor and buffer them until the unpickler has enough data to
return a complete unpickled object.
(It would be nice to do this without copying the bytes from one place
to another, but I don't even see how to solve the problem with
copying.)
Jack
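[Editor's note: one way to buffer per-descriptor bytes for unpickling, assuming the sending side can be changed, is to length-prefix each pickle so frame boundaries are explicit. This is a sketch, not the approach from the thread; Unframer and frame are illustrative names.]

```python
import pickle
import struct

def frame(obj):
    # Sender side: prefix each pickle with its 4-byte big-endian length.
    payload = pickle.dumps(obj)
    return struct.pack('!I', len(payload)) + payload

class Unframer:
    """Accumulates bytes from one descriptor, yields complete objects."""
    def __init__(self):
        self.buf = b''

    def feed(self, data):
        self.buf += data
        while len(self.buf) >= 4:
            (size,) = struct.unpack('!I', self.buf[:4])
            if len(self.buf) < 4 + size:
                break  # incomplete frame: wait for the next select() hit
            payload = self.buf[4:4 + size]
            self.buf = self.buf[4 + size:]
            yield pickle.loads(payload)
```

With one Unframer per descriptor, each time select() reports readability you would run `for obj in u.feed(os.read(fd, 4096)): ...`. The buffer copying the post worries about is still there, but confined to one place.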


Thread scheduling

2005-02-26 Thread Jack Orenstein
I am using Python 2.2.2 on RH9, and just starting to work with Python
threads.
I started using the threading module and found that 10-20% of the runs
of my test program would hang. I developed smaller and smaller test
cases, finally arriving at the program at the end of this message,
which uses the thread module, not threading. This program seems to
point to problems in Python thread scheduling.
The program is invoked like this:
python threadtest.py THREADS COUNT
THREADS is the number of threads created. Each thread contains a loop
that runs COUNT times, and all threads increment a counter. (The
counter is incremented without locking -- I expect to see a final
count of less than THREADS * COUNT.)
Running with THREADS = 2 and COUNT = 10, most of the time, the
program runs to completion. About 20% of the time however, I see one
thread finish, but the other thread never resumes.
Here is output from a run that completes normally:
[EMAIL PROTECTED] python threadtest.py 2 10
nThreads: 2
nCycles: 10
thread 1: started
thread 1: i = 0, counter = 1
thread 2: started
thread 2: i = 0, counter = 2691
thread 1: i = 1, counter = 13496
thread 2: i = 1, counter = 22526
thread 1: i = 2, counter = 27120
thread 2: i = 2, counter = 40365
thread 1: i = 3, counter = 41264
thread 1: i = 4, counter = 55922
thread 2: i = 3, counter = 58416
thread 2: i = 4, counter = 72647
thread 1: i = 5, counter = 74602
thread 1: i = 6, counter = 88468
thread 2: i = 5, counter = 99319
thread 1: i = 7, counter = 110144
thread 2: i = 6, counter = 110564
thread 2: i = 7, counter = 125306
thread 1: i = 8, counter = 129252
Still waiting, done = 0
thread 2: i = 8, counter = 141375
thread 1: i = 9, counter = 147459
thread 2: i = 9, counter = 155268
thread 1: leaving
thread 2: leaving
Still waiting, done = 2
All threads have finished, counter = 168322
Here is output from a run that hangs. I killed the process using
ctrl-c.
[EMAIL PROTECTED] python threadtest.py 2 10
nThreads: 2
nCycles: 10
thread 1: started
thread 1: i = 0, counter = 1
thread 2: started
thread 2: i = 0, counter = 990
thread 1: i = 1, counter = 11812
thread 2: i = 1, counter = 13580
thread 1: i = 2, counter = 19127
thread 2: i = 2, counter = 25395
thread 1: i = 3, counter = 31457
thread 1: i = 4, counter = 44033
thread 2: i = 3, counter = 48563
thread 1: i = 5, counter = 55131
thread 1: i = 6, counter = 65291
thread 1: i = 7, counter = 78145
thread 2: i = 4, counter = 82715
thread 1: i = 8, counter = 92073
thread 2: i = 5, counter = 101784
thread 1: i = 9, counter = 104294
thread 2: i = 6, counter = 112866
Still waiting, done = 0
thread 1: leaving
Still waiting, done = 1
Still waiting, done = 1
Still waiting, done = 1
Still waiting, done = 1
Still waiting, done = 1
Still waiting, done = 1
Still waiting, done = 1
Still waiting, done = 1
Traceback (most recent call last):
  File "threadtest.py", line 26, in ?
time.sleep(1)
KeyboardInterrupt
[EMAIL PROTECTED] osh]$
In this case, thread 1 finishes but thread 2 never runs again. Is
this a known problem? Any ideas for workarounds? Are threads widely
used in Python?
Jack Orenstein

# threadtest.py
import sys
import thread
import time

nThreads = int(sys.argv[1])
nCycles = int(sys.argv[2])
print 'nThreads: %d' % nThreads
print 'nCycles: %d' % nCycles
counter = 0
done = 0

def run(id):
    global done
    print 'thread %d: started' % id
    global counter
    for i in range(nCycles):
        counter += 1
        if i % 1 == 0:
            print 'thread %d: i = %d, counter = %d' % (id, i, counter)
    print 'thread %d: leaving' % id
    done += 1

for i in range(nThreads):
    thread.start_new_thread(run, (i + 1,))

while done < nThreads:
    time.sleep(1)
    print 'Still waiting, done = %d' % done
print 'All threads have finished, counter = %d' % counter
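[Editor's note: for comparison, the same experiment written against the threading module, with join() replacing the done-counter polling loop and a Lock serializing the increment. Modern Python 3 syntax; a sketch, not the original test program.]

```python
import threading

counter = 0
lock = threading.Lock()

def run(ident, n_cycles):
    global counter
    for _ in range(n_cycles):
        with lock:          # serialize the read-modify-write
            counter += 1
    print('thread %d: leaving' % ident)

def main(n_threads, n_cycles):
    threads = [threading.Thread(target=run, args=(i + 1, n_cycles))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()            # replaces the while/sleep polling loop
    return counter
```

With the lock in place the final count is exactly n_threads * n_cycles, and join() blocks until every worker has actually finished.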


Re: Thread scheduling

2005-02-26 Thread Jack Orenstein
Peter Hansen wrote:
> Jack Orenstein wrote:
>
>> I am using Python 2.2.2 on RH9, and just starting to work with Python
>> threads.
>
>
> Is this also the first time you've worked with threads in general,
> or do you have much experience with them in other situations?
Yes, I've used threading in Java.
> You've got two shared global variables, "done" and "counter".
> Each of these is modified in a manner that is not thread-safe.
> I don't know if "counter" is causing trouble, but it seems
> likely that "done" is.
I understand that. As I said in my posting, "The counter is
incremented without locking -- I expect to see a final count of less
than THREADS * COUNT." This is a test case, and I threw out more and
more code, including synchronization around counter and done, until it
got as simple as possible and still showed the problem.
> Basically, the statement "done += 1" is equivalent to the
> statement "done = done + 1" which, in Python or most other
> languages is not thread-safe.  The "done + 1" part is
> evaluated separately from the assignment, so it's possible
> that two threads will be executing the "done + 1" part
> at the same time and that the following assignment of
> one thread will be overwritten immediately by the assignment
> in the next thread, but with a value that is now one less
> than what you really wanted.
Understood. I was counting on this being unlikely for my test
case. I realize this isn't something to rely on in real software.
> If you really want to increment globals from the thread, you
> should look into locks.  Using the "threading" module (as is
> generally recommended, instead of using "thread"), you would
> use threading.Lock().
As my note said, I did start with the threading module. And variables
updated by different threads were protected by threading.Condition
variables. As I analyzed my test cases, and threading.py, I started
suspecting thread scheduling.  I then wrote the test case in my email,
which does not rely on the threading module at all. The point of the
test is not to maintain counter -- it's to show that sometimes even
after one thread completes, the other thread never is scheduled
again. This seems wrong. Try running the code, and let me know if you
see this behavior.
If you'd like, replace this:
counter += 1
by this:
time.sleep(0.01 * id)
You should see the same problem. So that removes counter from the
picture. And the two increments of done (one by each thread) are still
almost certainly not going to coincide and cause a problem. Also, if
you look at the output from the code on a hang, you will see that
'thread X: leaving' only prints once. This has nothing to do with what
happens with the done variable.
Jack
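[Editor's note: the non-atomicity Peter describes is visible in the bytecode: `done += 1` expands into a load, an add, and a store, and a thread switch can fall between them. A quick way to see this in modern Python:]

```python
import dis

def bump():
    global done
    done += 1

# LOAD_GLOBAL and STORE_GLOBAL appear as separate opcodes, so another
# thread can run between the load and the store and its update be lost.
dis.dis(bump)
```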


Re: Thread scheduling

2005-02-26 Thread Jack Orenstein
> On my machines (one Py2.4 on WinXP, one Py2.3.4 on RH9.0) I don't
> see this behaviour.  Across about fifty runs each.
Thanks for trying this.
> One thing you might try is experimenting with sys.setcheckinterval(),
> just to see what effect it might have, if any.
That does seem to have an impact. At 0, the problem was completely
reproducible. At 100, I couldn't get it to occur.
> It's also possible there were some threading bugs in Py2.2 under
> Linux.  Maybe you could repeat the test with a more recent
> version and see if you get different behaviour.  (Not that that
> proves anything conclusively, but at least it might be a good
> solution for your immediate problem.)
2.3 (on the same machine) does seem better, even with setcheckinterval(0).
Thanks for your suggestions.
Can anyone with knowledge of Python internals comment on these results?
(Look earlier in the thread for details. But basically, a very simple
program with the thread module, running two threads, shows that on
occasion, one thread finishes and the other never runs again. python2.3
seems better, as does python2.2 with  sys.setcheckinterval(100).)
Jack
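[Editor's note: sys.setcheckinterval() controlled how many bytecodes ran between thread-switch opportunities in Python 2; Python 3.2+ replaced it with the time-based sys.setswitchinterval(). A minimal sketch of the modern equivalent:]

```python
import sys

# Python 3.2+: the switch interval is a time in seconds, not a bytecode
# count as with the old sys.setcheckinterval().
default = sys.getswitchinterval()   # typically 0.005
sys.setswitchinterval(0.001)        # allow more frequent thread switches
print(sys.getswitchinterval())
sys.setswitchinterval(default)      # restore the default
```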


distutils setup ignoring scripts

2005-03-14 Thread Jack Orenstein
I'm using Python 2.2 on RH9. I have a set of Python modules organized
into a root package and one other package named foobar. setup.py looks
like this:
from distutils.core import setup
setup(
    name = 'foobar',
    version = '0.3',
    description = 'Foo Bar',
    author = 'Jack Orenstein',
    author_email = '[EMAIL PROTECTED]',
    packages = ['', 'xyz'],
    scripts = ['bin/foobar']
)
The resulting package has everything in the specified directories, but
does not include the script. I've tried making the path bin/foobar
absolute, but that doesn't help. I've googled for known bugs of this
sort but have come up empty. (The first line of bin/foobar is
#!/usr/bin/python.)
I've also tried setting DISTUTILS_DEBUG, which was uninformative
(e.g. no mention of bin/foobar at all).
Can anyone see what I'm doing wrong?
Jack Orenstein
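[Editor's note: the original failure isn't reproducible from the post alone, but for reference a minimal distutils setup.py that installs a script looks like the sketch below, with each packages entry naming an importable package rather than the empty string. Names are illustrative; distutils itself was removed from the standard library in Python 3.12 in favor of setuptools.]

```python
from distutils.core import setup

setup(
    name = 'foobar',
    version = '0.3',
    description = 'Foo Bar',
    packages = ['xyz'],        # importable package directories
    scripts = ['bin/foobar'],  # copied by 'build_scripts' / 'install_scripts'
)
```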


Re: how to remove 50000 elements from a 100000 list?

2006-05-05 Thread Jack Orenstein
On May 5, 2006, at 9:36 AM, Ju Hui wrote:

> >>> a=range(10)
> >>> b=range(5)
> >>> for x in b:
> ...     a.remove(x)
> ...
> it will run very slowly. Shall I change to another data structure and
> choose a better arithmetic?
> any suggestion is welcome.

If removal is an O(n) operation, then removing half the list will take
O(n**2) time, which you don't want. You'd be better off with the
contents of "a" in a hash table (O(1) removal in practice) or a
balanced tree (O(log n) removal).

Another possibility: If the a and b lists are ordered in the same way, 
then you could walk through the lists in order using a merge procedure, 
generating a new list as you go.

After ruling out slow data structures and algorithms, you'll almost 
certainly be better off using something built in to Python rather than 
coding your own data structure in Python.

Jack Orenstein 
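[Editor's note: the hash-table suggestion above amounts to a one-liner with Python's built-in set type; a sketch in modern syntax:]

```python
# Remove every element of b from a in roughly O(len(a) + len(b)) time,
# instead of O(len(a) * len(b)) for repeated list.remove() calls.
a = list(range(100000))
b = list(range(50000))

remove = set(b)                       # O(1) average-case membership tests
a = [x for x in a if x not in remove]

print(len(a))   # 50000
```

Unlike `set(a) - set(b)`, the list comprehension preserves the order of a and keeps any duplicates that aren't in b.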
