On 12/07/10 21:03, Andras Fabian wrote:
> This STDOU issue gets even weirder. Now I have set up our two new servers 
> (identical hw/sw) as I would have needed to do so anyways. After having PG 
> running, I also set up the same test scenario as I have it on our problematic 
> servers, and started the COPY-to-STDOUT experiment. And you know what? Both 
> new servers are performing well. No hanging, and the 3 GByte test dump was 
> written in around 3 minutes (as expected). To make things even more 
> complicated ... I went back to our production servers. Now, the first one - 
> which I froze up with oprofile this morning and needed a REBOOT - is 
> performing well too! It needed 3 minutes for the test case ... WTF? BUT, the 
> second production server, which did not have a reboot, is still behaving 
> badly.
> Now I tried to dig deeper (without killing a production server again) ... and 
> came to comparing the outputs of PS (with '-fax' parameter then, '-axl'). Now 
> I have found something interesting:
> - all fast servers show the COPY process as being in the state Rs ("runnable 
> (on run queue)")
> - on the still slow server, this process is in 9 out of 10 samples in Ds 
> ("uninterruptible sleep (usually IO)") 
> 
> Now, this "Ds" state seems to be something unhealthy - especially if it is 
> there almost all the time - as far as my first reeds on google show (and 
> although it points to IO, there is seemingly only very little IO, and IO-wait 
> is minimal too). I have also done "-axl" with PS, which brings the following 
> line for our process:
> F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
> 1  5551  2819  4201  20   0 5941068 201192 conges Ds ?          2:05 
> postgres: postgres musicload_cache [local] COPY"


Your wchan column isn't wide enough to show the full function name, but
I'd say it's related to some form of throttling or congestion control.

Get a wider view of that column to find out what the full function name
is. Grepping the kernel source for it can then tell you a lot about
where in the kernel it is and what might be going on.

Try:

    ps ax -O wchan:40

to get a decently wide view of that col.

-- 
Craig Ringer

Tech-related writing: http://soapyfrogs.blogspot.com/

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Reply via email to