On Tue, Jul 3, 2012 at 8:16 AM, Jan Stary <h...@stare.cz> wrote:
> This is 5.1-beta/i386 on an ALIX about five years old,
> running as my home server.
>
> Recently, processes started to die for reasons unknown, as in
>
> pid 20260 (postgres): user write of 118784@0x28052000 at 159088 failed: 14
> pid 1872 (cron): user write of 118784@0x2b1e3000 at 30224 failed: 14
>
> 14 is EFAULT as per sys/errno.h
> - what can be causing it?

Having briefly looked at this, I suspect there are (at least) two
cases where that can occur:
1) a race in the code that prevents writing to a file that is mapped
   for execution, and
2) system runs out of memory when trying to do copy-on-write; the
   process is killed because of that and then the coredump logic hits
   the missing page when trying to write out the process's memory and
   logs that failure.

I think (1) is what sthen@ described; I think you're hitting (2).
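
If it helps when going through the logs, the messages quoted above can
be pulled apart mechanically.  The following is only a rough sketch in
Python: the regular expression is derived from the two sample lines in
this thread, and the field names (length, addr, offset) are my own
labels, not kernel terminology.  The errno name is looked up on
whatever machine runs the script; 14 is EFAULT on OpenBSD.

```python
import errno
import re

# Pattern built from the two sample lines in this thread, e.g.
#   pid 20260 (postgres): user write of 118784@0x28052000 at 159088 failed: 14
# The group names are illustrative labels, not kernel terminology.
LINE_RE = re.compile(
    r"pid (?P<pid>\d+) \((?P<comm>[^)]+)\): user write of "
    r"(?P<length>\d+)@(?P<addr>0x[0-9a-fA-F]+) at (?P<offset>\d+) "
    r"failed: (?P<err>\d+)"
)


def parse_failure(line):
    """Return a dict describing one failed-write message, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"))
    return {
        "pid": int(m.group("pid")),
        "comm": m.group("comm"),
        "length": int(m.group("length")),
        "addr": int(m.group("addr"), 16),
        "offset": int(m.group("offset")),
        "errno": err,
        # Symbolic name as known to the local system's errno table.
        "errname": errno.errorcode.get(err, "unknown"),
    }


if __name__ == "__main__":
    sample = ("pid 20260 (postgres): user write of "
              "118784@0x28052000 at 159088 failed: 14")
    print(parse_failure(sample))
```

Collecting the process names and times of these messages makes it
easier to line them up against whatever resource monitoring you add.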


> It usually happens under stress (the machine is pretty scarce
> on resources, so 'stress' can be accepting a batch of DNS queries).
> For example, the postgres EFAULT above happened exactly when a batch
> of emails arrived.

When you have a reproducible test case, adding monitoring of the
system resources and seeing what shows up when it occurs is a good
idea.  It seems to me that it's likely this case is a result of the
system running out of total memory and that adding swap would
alleviate the problem.

But "likely" is not certainty.  Monitor your system's resources, based
on that make a hypothesis about the cause, figure out a way to test
your hypothesis, then do it and check the result.
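
As one possible way to do the monitoring, here is a minimal sketch in
Python that runs a few status commands at a fixed interval and writes
their output, timestamped, to a log file, so the samples can be
compared against the time of the next failure.  The choice of vmstat
and "swapctl -l", the log path, and the interval are all assumptions
for illustration; a shell loop would do just as well and may be kinder
to a small machine.

```python
import subprocess
import time


def sample(cmd):
    """Run cmd once and return its stdout, each line timestamped."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    return "".join(f"{stamp} {line}\n" for line in out.splitlines())


def monitor(cmds, logfile, interval=10, rounds=None):
    """Append timestamped output of each command to logfile.

    Runs forever when rounds is None, otherwise for that many rounds.
    """
    done = 0
    with open(logfile, "a") as log:
        while rounds is None or done < rounds:
            for cmd in cmds:
                log.write(sample(cmd))
            log.flush()  # keep the log current if the box falls over
            time.sleep(interval)
            done += 1


if __name__ == "__main__":
    # Commands and path are examples only; adjust to taste.
    monitor([["vmstat"], ["swapctl", "-l"]], "/var/tmp/resources.log")
```

Note that subprocess.run raises FileNotFoundError if a command is not
installed, so trim the command list to what the system actually has.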


Philip Guenther
