On Wed, Jun 03, 2015 at 07:05:15AM +0200, Thierry Goubier wrote:
> Hi Dave,
>
> On 03/06/2015 03:15, David T. Lewis wrote:
> >Hi Thierry and Jose,
> >
> >I am reading this thread with interest and will help if I can.
> >
> >I do have one idea that we have not tried before. I have a theory that
> >this may be an intermittent problem caused by SIGCHLD signals (from the
> >external OS process when it exits) being missed by the
> >UnixOSProcessAccessor>>grimReaperProcess that handles them.
> >
> >If this is happening, then I may be able to change grimReaperProcess to
> >work around the problem.
> >
> >When you see the OS deadlock condition, are you able to tell if your Pharo
> >VM process has subprocesses in the zombie state (indicating that
> >grimReaperProcess did not clean them up)? The unix command
> >"ps -axf | less" will let you look at the process tree and that may give
> >us a clue if this is happening.
>
> I found it very easy to reproduce and I do have a zombie child
> process of the pharo process.

Jose confirms this also (thanks).

Can you try filing in the attached UnixOSProcessAccessor>>grimReaperProcess
and see if it helps? I do not know if it will make a difference, but the idea
is to put a timeout on the semaphore that is waiting for signals from SIGCHLD.
I am hoping that if these signals are sometimes being missed, then the timeout
will allow the process to recover from the problem.

>
> Interestingly enough, the lock-up happens in a very specific place, a call
> to git branch, which is a very short command returning just a few
> characters (where all other commands have longer outputs). Reducing the
> frequency of the calls to git branch by a bit of caching reduces the
> chances of a lock-up.
>

This is a good clue, and it may indicate a different kind of problem (so
maybe I am looking in the wrong place). Ben's suggestion of adding a delay
to the external process sounds like a good idea to help troubleshoot it.

Dave

'From Squeak4.5 of 30 May 2015 [latest update: #15039] on 2 June 2015 at 9:35:21 pm'!

!UnixOSProcessAccessor methodsFor: 'initialize - release' stamp: 'dtl 6/2/2015 20:54'!
grimReaperProcess
    "This is a process which waits for the death of a child OSProcess, and
    informs any dependents of the change. Use SIGCHLD events if possible,
    otherwise a Delay to poll for exiting child processes."

    | eventWaiter processSynchronizationDelay |
    ^ self canAccessSystem
        ifTrue:
            [eventWaiter := (self canAccessSystem and: [self canForwardExternalSignals])
                ifTrue: [self sigChldSemaphore "semaphore signaled by SIGCHLD" ]
                ifFalse: [Delay forMilliseconds: 200 "simple polling loop" ].
            processSynchronizationDelay := Delay forMilliseconds: 20.
            grimReaper ifNil:
                [grimReaper :=
                    [[(eventWaiter respondsTo: #waitTimeoutMSecs: )
                        ifTrue: [eventWaiter waitTimeoutMSecs: 1000 "semaphore with timeout"]
                        ifFalse: [eventWaiter wait].
                    processSynchronizationDelay wait. "Avoids lost signals in heavy process switching"
                    self changed: #childProcessStatus] repeat] newProcess.
                grimReaper resume.
                "name selected to look reasonable in the process browser"
                grimReaper name: ((ReadStream on: grimReaper hash asString) next: 5) , ': the child OSProcess watcher']]
        ifFalse: [nil]
! !
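
Regarding the zombie check discussed in the quoted message: scanning the full "ps -axf" tree by eye works, but a small filter may make the result easier to report. The following is only a sketch. The helper name zombie_children is hypothetical and not part of OSProcess, and the usage line assumes a Linux-style ps and pgrep and a VM process named "pharo". It prints the PID and command of every direct child of the given parent whose state starts with Z.

```shell
#!/bin/sh
# Hypothetical helper (not part of OSProcess).
# Reads "PID PPID STAT COMMAND" rows on stdin and prints "PID COMMAND"
# for each direct child of the given parent PID whose state begins
# with Z (zombie / defunct).
zombie_children() {
    awk -v parent="$1" '$2 == parent && $3 ~ /^Z/ { print $1, $4 }'
}

# Usage sketch (assumes the VM process is named "pharo"):
#   ps -eo pid=,ppid=,stat=,comm= | zombie_children "$(pgrep -o pharo)"
```

If this prints a git entry while the image is locked up, the child has exited but has not yet been reaped by the VM, which would be consistent with the theory that grimReaperProcess missed the SIGCHLD.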