I'm doing a heap-dump analysis now and I think I might know what the issue was. The start of this whole problem was the disk-usage plugin hanging our attempts to view a job in Jenkins (see https://issues.jenkins-ci.org/browse/JENKINS-20876), so we disabled that plugin. After disabling it, Jenkins complained about data in an older/unreadable format:
    You have data stored in an older format and/or unreadable data.

If I click the "Manage" button to delete it, it takes a _long_ time to display all the disk-usage plugin data - there must be thousands of rows, but it does display them all eventually. The error shown in each row is:

    CannotResolveClassException: hudson.plugins.disk_usage.BuildDiskUsageAction

If I click "Discard Unreadable Data" at the bottom of the page, I quickly get a stack trace:

javax.servlet.ServletException: java.util.ConcurrentModificationException
    at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:735)
    at org.kohsuke.stapler.Stapler.invoke(Stapler.java:799)
    at org.kohsuke.stapler.MetaClass$6.doDispatch(MetaClass.java:239)
    at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:53)
    at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:685)
    at org.kohsuke.stapler.Stapler.invoke(Stapler.java:799)
    at org.kohsuke.stapler.Stapler.invoke(Stapler.java:587)
    at org.kohsuke.stapler.Stapler.service(Stapler.java:218)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:45)
    at winstone.ServletConfiguration.execute(ServletConfiguration.java:248)
    at winstone.RequestDispatcher.forward(RequestDispatcher.java:333)
    at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:376)
    at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:96)
    at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:203)
    at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:181)
    at net.bull.javamelody.PluginMonitoringFilter.doFilter(PluginMonitoringFilter.java:86)

and it fails to discard the data.

Older data isn't usually a problem, so I brushed off this error. However, here is the dominator_tree of the heap dump:

Class Name                                                                     | Shallow Heap | Retained Heap | Percentage
--------------------------------------------------------------------------------------------------------------------------
hudson.diagnosis.OldDataMonitor @ 0x6f9f2c4a0                                  |           24 | 3,278,466,984 |     88.69%
com.thoughtworks.xstream.converters.SingleValueConverterWrapper @ 0x6f9da8780  |           16 |    13,825,616 |      0.37%
hudson.model.Hudson @ 0x6f9b8b8e8                                              |          272 |     3,572,400 |      0.10%
org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6f9a73598                       |           88 |     2,308,760 |      0.06%
org.apache.commons.jexl.util.introspection.Introspector @ 0x6fbb74710          |           32 |     1,842,392 |      0.05%
org.kohsuke.stapler.WebApp @ 0x6f9c0ff10                                       |           64 |     1,127,480 |      0.03%
java.lang.Thread @ 0x7d5c2d138 Handling GET /view/Alle/job/common-translation-main/ : RequestHandlerThread[#105] Thread | 112 | 971,336 | 0.03%
--------------------------------------------------------------------------------------------------------------------------

What is hudson.diagnosis.OldDataMonitor? Could the disk-usage plugin data be the cause of all my recent OOM errors? If so, how do I get rid of it?
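If the disk-usage data does turn out to be the culprit, the workaround I'm considering is an offline cleanup of the build records themselves. This is only a sketch, not something I've run yet: it assumes the stale entries sit in each job's build.xml as elements whose tag name contains BuildDiskUsageAction (I believe XStream writes hudson.plugins.disk_usage.BuildDiskUsageAction with the underscore doubled, but the substring match below sidesteps that detail), and that it's safe to edit those files with Jenkins stopped and JENKINS_HOME backed up:

#!/usr/bin/env python
# Sketch only -- untested against a real JENKINS_HOME. Stop Jenkins and take
# a backup before running anything like this.
import glob
import os
import xml.etree.ElementTree as ET

# Assumes JENKINS_HOME is exported in the environment.
jenkins_home = os.environ["JENKINS_HOME"]

pattern = os.path.join(jenkins_home, "jobs", "*", "builds", "*", "build.xml")
for build_xml in glob.glob(pattern):
    try:
        tree = ET.parse(build_xml)
    except ET.ParseError:
        print("skipping unparsable file: %s" % build_xml)
        continue
    changed = False
    # ElementTree has no parent pointers, so visit every element and drop any
    # matching children from it (assumption: the stale disk-usage records are
    # elements whose tag contains "BuildDiskUsageAction").
    for parent in tree.iter():
        for child in list(parent):
            if "BuildDiskUsageAction" in child.tag:
                parent.remove(child)
                changed = True
    if changed:
        tree.write(build_xml, encoding="UTF-8", xml_declaration=True)
        print("cleaned %s" % build_xml)

Obvious caveats: ElementTree rewrites the files, so XML comments and original formatting are not preserved, and anything stored outside jobs/*/builds/*/build.xml (Maven module builds, for example) would not be touched.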
-tim

On Monday, December 9, 2013 9:41:25 AM UTC-5, Tim Drury wrote:
>
> I intended to install 1.532 on Friday, but mistakenly installed 1.539. It
> gave us the same OOM exceptions. I'm installing 1.532 now and will -
> hopefully - know tomorrow whether it's stable or not. I'm not exactly sure
> what's going to happen with our plugins though. Hopefully Jenkins will
> tell me if they must be downgraded too.
>
> -tim
>
> On Monday, December 9, 2013 7:45:28 AM UTC-5, Stephen Connolly wrote:
>>
>> How does the current LTS (1.532.1) hold up?
>>
>> On 6 December 2013 13:33, Tim Drury <tdr...@gmail.com> wrote:
>>
>>> We updated Jenkins to 1.542 two days ago (from 1.514) and we're getting
>>> a lot of OOM errors. (info: Windows server 2008 R2, Jenkins JVM is
>>> jdk-x64-1.6.0_26)
>>>
>>> At first I did the simplest thing and increased the heap from 3G to 4.2G
>>> (and bumped up permgen). This didn't help so I started looking at threads
>>> via the Jenkins monitoring tool. It indicated the disk-usage plugin was
>>> hung. When you tried to view a page for a particularly large job, the page
>>> would "hang" and the stack trace showed the disk-usage plugin was to blame
>>> (or so I thought). Jira report with thread dump here:
>>> https://issues.jenkins-ci.org/browse/JENKINS-20876
>>>
>>> We disabled the disk-usage plugin and restarted and now we can visit
>>> that job page. However, we still get OOM and lots of GCs in the logs at
>>> least once a day. The stack trace looks frighteningly similar to that from
>>> the disk-usage plugin. Here is an edited stack trace showing the methods
>>> common between the two OOM incidents: one during the disk-usage plugin and
>>> one after it was disabled:
>>>
>>> [lots of xstream methods snipped]
>>> hudson.XmlFile.unmarshal(XmlFile.java:165)
>>> hudson.model.Run.reload(Run.java:323)
>>> hudson.model.Run.<init>(Run.java:312)
>>> hudson.model.AbstractBuild.<init>(AbstractBuild.java:185)
>>> hudson.maven.AbstractMavenBuild.<init>(AbstractMavenBuild.java:54)
>>> hudson.maven.MavenModuleSetBuild.<init>(MavenModuleSetBuild.java:146)
>>> ... [JVM methods snipped]
>>> hudson.model.AbstractProject.loadBuild(AbstractProject.java:1155)
>>> hudson.model.AbstractProject$1.create(AbstractProject.java:342)
>>> hudson.model.AbstractProject$1.create(AbstractProject.java:340)
>>> hudson.model.RunMap.retrieve(RunMap.java:225)
>>> hudson.model.RunMap.retrieve(RunMap.java:59)
>>> jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:677)
>>> jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:660)
>>> jenkins.model.lazy.AbstractLazyLoadRunMap.search(AbstractLazyLoadRunMap.java:502)
>>> jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:536)
>>> hudson.model.AbstractProject.getBuildByNumber(AbstractProject.java:1077)
>>> hudson.maven.MavenBuild.getParentBuild(MavenBuild.java:165)
>>> hudson.maven.MavenBuild.getWhyKeepLog(MavenBuild.java:273)
>>> hudson.model.Run.isKeepLog(Run.java:572)
>>> ...
>>>
>>> It seems something in "core" Jenkins has changed and not for the better.
>>> Anyone seeing these issues?
>>>
>>> -tim

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.