In article <[EMAIL PROTECTED]>,
"Paddy" <[EMAIL PROTECTED]> writes:
|>
|> Three to four months before `strange errors`? I'd spend some time
|> correlating logs; not just for your program, but for everything running
|> on the server. Then I'd expect to cut my losses and arrange to safely
|> re-start the program every TWO months.
|> (I'd arrange the re-start after collecting logs but before their
|> analysis. Life is too short).

Forget it. That strategy is fine in general, but is a waste of time
where threading issues are involved (or signal handling, or some types
of communication problem, for that matter).

There are three unrelated killer facts that interact:

    Such failures are usually probabilistic ("Poisson process"), and
    so have no "history".

    The expected number is usually proportional to the square of the
    activity, sometimes a higher power.

    Virtually nothing involved does any routine logging, or even has
    options to log relevant events.

The first means that the strategy of restarting doesn't help. All
three mean that current logs are almost never any use.


Regards,
Nick Maclaren.
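
To illustrate the first point (no "history" means restarting doesn't help), here is a minimal sketch in Python. It assumes failures arrive as an ideal Poisson process, so the time to the next failure is exponentially distributed. The 100-day mean, the 60-day age and the 30-day horizon are made-up figures for illustration only, and `survival_fraction` is a hypothetical helper, not anything from the thread.

```python
import math
import random


def survival_fraction(samples, age, horizon):
    """Among runs still alive at `age`, return the fraction that
    survive a further `horizon` days without failing."""
    alive = [t for t in samples if t > age]
    return sum(t > age + horizon for t in alive) / len(alive)


# Assumption: mean time to failure of 100 days (illustrative only).
MEAN_DAYS_TO_FAILURE = 100.0

rng = random.Random(12345)  # fixed seed so the run is repeatable
samples = [rng.expovariate(1.0 / MEAN_DAYS_TO_FAILURE)
           for _ in range(200_000)]

# A process that has just been restarted (age 0) ...
fresh = survival_fraction(samples, age=0.0, horizon=30.0)
# ... versus one that has already run 60 days without a failure.
aged = survival_fraction(samples, age=60.0, horizon=30.0)
# Theoretical value for an exponential distribution: exp(-horizon / mean).
exact = math.exp(-30.0 / MEAN_DAYS_TO_FAILURE)

print(f"fresh: {fresh:.3f}  aged: {aged:.3f}  exact: {exact:.3f}")
```

Under that assumption the two fractions agree to within sampling noise: a process that has run 60 days is no more likely to fail in the next 30 than one restarted a moment ago. Real failures that do depend on accumulated state (leaks, fragmentation) would not follow this model.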