On Tue, Jul 3, 2012 at 8:16 AM, Jan Stary <h...@stare.cz> wrote:
> This is 5.1-beta/i386 on an ALIX about five years old,
> running as my home server.
>
> Recently, processes started to die for reasons unknown, as in
>
> pid 20260 (postgres): user write of 118784@0x28052000 at 159088 failed: 14
> pid 1872 (cron): user write of 118784@0x2b1e3000 at 30224 failed: 14
>
> 14 is EFAULT as per sys/errno.h
> - what can be causing it?
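
To tie the quoted messages to the errno value, here is a minimal Python sketch that pulls the fields out of such a line and decodes the error number. The regular expression is derived only from the two lines quoted above, and the field names (size, addr, offset) are my reading of the message, not something taken from the kernel source; treat both as assumptions.

```python
import errno
import os
import re

# Pattern built from the two quoted log lines only (an assumption about the format).
LOG_RE = re.compile(
    r"pid (?P<pid>\d+) \((?P<name>[^)]+)\): user write of "
    r"(?P<size>\d+)@(?P<addr>0x[0-9a-fA-F]+) at (?P<offset>\d+) "
    r"failed: (?P<err>\d+)"
)


def parse_failure(line):
    """Return the fields of one 'user write ... failed' line, or None if no match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"))
    return {
        "pid": int(m.group("pid")),
        "name": m.group("name"),
        "size": int(m.group("size")),        # hypothetical field name
        "addr": int(m.group("addr"), 16),    # hypothetical field name
        "offset": int(m.group("offset")),    # hypothetical field name
        "errno": err,
        # Symbolic name and message as known to the platform running this script.
        "errname": errno.errorcode.get(err, "unknown"),
        "message": os.strerror(err),
    }
```

Feeding it every matching line from the system log gives a quick list of which processes died and when, which can then be lined up against resource measurements.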
Having briefly looked at this, I suspect there are (at least) two cases
where that can occur:
1) a race in the code that prevents writing to a file that is mapped
   for execution, and
2) the system runs out of memory when trying to do copy-on-write; the
   process is killed because of that, and then the coredump logic hits
   the missing page when trying to write out the process's memory and
   logs that failure.

I think (1) is what sthen@ described; I think you're hitting (2).

> It usually happens under a stress (the machine is pretty scarce
> on resources, so 'stress' can be accepting a batch of DNS queries).
> For example, the postgres EFAULT above happened exactly when a batch
> of emails arrived.

When you have a reproducible test case, adding monitoring of the system
resources and seeing what shows up when the failure occurs is a good
idea. It seems to me that this case is likely a result of the system
running out of total memory, and that adding swap would alleviate the
problem. But "likely" is not certainty. Monitor your system's
resources, make a hypothesis about the cause based on what you see,
figure out a way to test your hypothesis, then do it and check the
result.

Philip Guenther
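
As one possible way to do the monitoring suggested above, here is a minimal Python sketch that runs a few status commands at a fixed interval and appends their timestamped output to a log file. The choice of commands (`vmstat` and `swapctl -l`), the interval, and the function names are assumptions for illustration; substitute whatever tools you prefer.

```python
import subprocess
import time

# Assumed commands for memory and swap status; adjust to taste.
COMMANDS = [["vmstat"], ["swapctl", "-l"]]


def sample(cmd):
    """Run one command and return its output lines, each prefixed with a timestamp."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        out = subprocess.run(
            cmd, capture_output=True, text=True, timeout=30
        ).stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Record the failure instead of stopping the monitor.
        out = "failed: %s" % exc
    return ["%s %s: %s" % (stamp, cmd[0], line) for line in out.splitlines()]


def monitor(path, interval=10, rounds=None):
    """Append samples to `path` every `interval` seconds (forever if rounds is None)."""
    done = 0
    while rounds is None or done < rounds:
        with open(path, "a") as log:
            for cmd in COMMANDS:
                for line in sample(cmd):
                    log.write(line + "\n")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Calling `monitor("resources.log")` and leaving it running until the next failure gives a record to compare against the time of the kernel message; if free memory and swap are exhausted just before it, that supports hypothesis (2).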