Steve Carter commented on Bug JENKINS-9104

Nice work Daniel. Will be interesting to see whether that solves the problem.

For the good of the thread, I'm going to try to summarize this from the top down as there's a lot of talk on here that seems to miss the key points.

1) BUILD_ID is an environment variable, set by Jenkins when it starts a job.

2) Environment variables are inherited when processes start other processes, except when overwritten. For example, in bash you can write

MYVAR=myvalue myscript.sh

and myscript.sh will run with MYVAR set to myvalue.

3) Therefore, all processes started by a Jenkins job have the same BUILD_ID. This is recursive: children of those processes inherit it too.

4) Jenkins, in order to catch rogue processes at job end (i.e. those that have broken ties with their parent process) scans the whole process space for those with the particular BUILD_ID in their environment, and kills them.

This is correct and good behavior by Jenkins.
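To make points 2 and 3 concrete, here is a minimal Python sketch of environment inheritance. The BUILD_ID values are made up for illustration and nothing here talks to Jenkins; the script just sets BUILD_ID in its own environment, as Jenkins would for a job, and asks a child process what it sees.

```python
import os
import subprocess
import sys

# Tiny child program: print the BUILD_ID found in its own environment.
CHILD = "import os; print(os.environ.get('BUILD_ID', '<unset>'))"


def build_id_seen_by_child(env=None):
    """Start a child Python process and return the BUILD_ID it reports.

    env=None means the child inherits the parent's environment unchanged.
    """
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Stand in for Jenkins: set BUILD_ID in this process (hypothetical value).
os.environ["BUILD_ID"] = "2014-01-01_12-00-00"

# The child inherits BUILD_ID without being told about it.
inherited = build_id_seen_by_child()

# The child gets a different BUILD_ID only if the parent overrides it.
overridden = build_id_seen_by_child(dict(os.environ, BUILD_ID="not-a-jenkins-build"))

print(inherited, overridden)
```

The second call is the Python equivalent of the `MYVAR=myvalue myscript.sh` form shown in point 2.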

5) When you start an MSBUILD job, pdbsrv is started, which catches requests from parallel compilations and serializes them to write pdb files. When started from Jenkins, that pdbsrv process inherits BUILD_ID from the job.

6) If you run two MSBUILD builds at once, then they share the same pdbsrv process.

7) When the first job ends, it kills the pdbsrv process – because its BUILD_ID matches the first job's build id. The second job then fails.

8) Solution 1: start pdbsrv with a BUILD_ID that doesn't match the build jobs. Then pdbsrv will not be killed at the end of the job.

9) Solution 2: use Daniel's whitelist feature to not kill pdbsrv at the end of the job.
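Points 4 to 8 can be modelled in a few lines of Python. This is a toy model only, not the ProcessTreeKiller code: the Proc class, the process names, the BUILD_ID values and the idea that the whitelist matches on process name are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Proc:
    pid: int
    name: str
    env: dict = field(default_factory=dict)


def processes_to_kill(procs, build_id, whitelist=()):
    """Select processes whose environment carries build_id, skipping whitelisted names."""
    return [
        p for p in procs
        if p.env.get("BUILD_ID") == build_id and p.name not in whitelist
    ]


# Two concurrent builds. pdbsrv was started by build A, so it inherited
# A's BUILD_ID, but build B is using the same pdbsrv process.
procs = [
    Proc(101, "msbuild", {"BUILD_ID": "A"}),
    Proc(102, "pdbsrv", {"BUILD_ID": "A"}),
    Proc(201, "msbuild", {"BUILD_ID": "B"}),
]

# Build A ends: pdbsrv matches and is killed, so build B then fails.
default = [p.name for p in processes_to_kill(procs, "A")]

# Solution 2: a whitelist keeps pdbsrv alive.
whitelisted = [p.name for p in processes_to_kill(procs, "A", whitelist={"pdbsrv"})]

# Solution 1: pdbsrv was started with a BUILD_ID that matches no build.
procs[1].env["BUILD_ID"] = "keep-me"
relabelled = [p.name for p in processes_to_kill(procs, "A")]

print(default, whitelisted, relabelled)
```

In both solutions only build A's own msbuild process is selected, and pdbsrv survives for build B.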

Casual readers stop here.
=========================

10) The problem with Solutions 1 and 2 is this: pdbsrv still has a timeout, so you will get sporadic failures when the server goes away.

11) My "heavyweight" python fix is trying to deal with that. It wraps pdbsrv with a proper timeout and reference counting, so that pdbsrv is present exactly when needed.

12) pdbsrv's timeout doesn't get a new lease every time you use pdbsrv. I regard this as a bug in pdbsrv.

13) You can't leave pdbsrv running forever because it (allegedly) has memory leaks. I regard this as a bug in pdbsrv.

I really think rolling back Jenkins' ProcessTreeKiller is NOT a solution. The use of BUILD_ID brings the Jenkins machine under better control against rogue processes, and the workaround (for well-behaved servers) is easy: set BUILD_ID before starting the server, or use Daniel's whitelist.

14) Solution 3: restart pdbsrv periodically, e.g. every day with a day-long timeout. That will mitigate the memory leaks. If you use some concurrency control, e.g. Job Weight plugin, you can make sure this "kill and restart pdbsrv" job does not fire during a build.

=========================

Solution 0: Finally, it would be remiss of me not to mention again my python workaround, which has been happily keeping parallel builds working for 54 weeks now without trouble.
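The workaround script itself is not reproduced in this message. Purely to illustrate the reference-counting idea from point 11, here is a minimal sketch under stated assumptions: the class name and the start/stop callables are hypothetical, the counter lives in a single Python process, and the timeout handling is left out. A version that serves separate Jenkins jobs would need to keep the count somewhere shared between processes, which this sketch does not attempt.

```python
import threading


class RefCountedServer:
    """Start a server on first use and stop it when the last user is done."""

    def __init__(self, start, stop):
        self._start = start  # callable that launches the server (hypothetical)
        self._stop = stop    # callable that shuts it down (hypothetical)
        self._users = 0
        self._lock = threading.Lock()

    def __enter__(self):
        with self._lock:
            if self._users == 0:
                self._start()
            self._users += 1
        return self

    def __exit__(self, *exc):
        with self._lock:
            self._users -= 1
            if self._users == 0:
                self._stop()
        return False


# Record what happens instead of launching a real process.
events = []
server = RefCountedServer(
    start=lambda: events.append("start"),
    stop=lambda: events.append("stop"),
)

with server:                      # first build: server is started
    with server:                  # second, overlapping build: reuses it
        events.append("build")
    events.append("still-running")  # first build still holds a reference

print(events)
```

The server is started once, stays up while any build holds a reference, and is stopped only after the last one finishes.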
