This STDOUT issue gets even weirder. I have now set up our two new servers 
(identical hw/sw), which I would have needed to do anyway. Once PG was 
running, I set up the same test scenario as on our problematic servers and 
started the COPY-to-STDOUT experiment. And you know what? Both new servers 
perform well. No hanging, and the 3 GByte test dump was written in around 
3 minutes (as expected). To make things even more complicated ... I went back 
to our production servers. The first one - which I froze with oprofile this 
morning and which needed a REBOOT - is now performing well too! It needed 
3 minutes for the test case ... WTF? BUT the second production server, which 
was not rebooted, is still behaving badly.
Then I tried to dig deeper (without killing a production server again) ... and 
ended up comparing the output of ps (first with '-fax', then with '-axl'). I 
have found something interesting:
- on all fast servers, the COPY process is in state Rs ("runnable (on run 
queue)")
- on the still-slow server, this process is in state Ds ("uninterruptible 
sleep (usually IO)") in 9 out of 10 samples

This "Ds" state seems to be something unhealthy - especially if it is there 
almost all the time - as far as my first reads on Google show (and although it 
points to IO, there is seemingly very little IO, and IO-wait is minimal too). 
I have also run ps with "-axl", which gives the following line for our 
process:
F   UID   PID  PPID PRI  NI     VSZ    RSS WCHAN  STAT TTY        TIME COMMAND
1  5551  2819  4201  20   0 5941068 201192 conges Ds   ?          2:05 postgres: postgres musicload_cache [local] COPY

As far as I understood from my Google searches, the WCHAN column shows where 
in the kernel my process is waiting. Here it says "conges". Can somebody tell 
me what "conges" means? Or do I have other options for getting more 
information out of the system (preferably without oprofile, as I have already 
burned my hand on it :-)?
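
One lower-risk way to gather this kind of information without oprofile might 
be to sample the process state and wait channel directly from /proc. The 
following is only a minimal sketch, assuming Linux and permission to read 
/proc/<pid>/stat and /proc/<pid>/wchan; the function names are illustrative 
and not part of any existing tool. Unlike the WCHAN column of "ps -axl", which 
is cut to a few characters, /proc/<pid>/wchan should hold the full kernel 
symbol name, so "conges" is probably just the start of a longer function name.

```python
import collections
import sys
import time


def parse_state(stat_text):
    """Return the one-letter process state from the text of /proc/<pid>/stat."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split only after the LAST ')'.
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    return rest[0]


def read_sample(pid):
    """Read (state, wchan) for one moment in time. Assumes Linux /proc."""
    with open(f"/proc/{pid}/stat") as f:
        state = parse_state(f.read())
    with open(f"/proc/{pid}/wchan") as f:
        # Contains a kernel symbol name, or "0" when the process is not waiting.
        wchan = f.read().strip() or "-"
    return state, wchan


def tally(samples):
    """Count how often each (state, wchan) pair was observed."""
    return collections.Counter(samples)


def sample_process(pid, count=10, interval=1.0):
    """Take `count` samples, `interval` seconds apart, and tally them."""
    samples = []
    for _ in range(count):
        samples.append(read_sample(pid))
        time.sleep(interval)
    return tally(samples)


if __name__ == "__main__":
    # Hypothetical usage: python sample_state.py 2819
    if len(sys.argv) > 1:
        counts = sample_process(int(sys.argv[1]))
        for (state, wchan), n in counts.most_common():
            print(f"{n:3d}  {state}  {wchan}")
```

Run against the PID of the COPY backend on a fast and on the slow server, the 
tallies would give the same "9 out of 10 samples in D" comparison as the 
repeated ps calls. Note that /proc reports only the state letter (D, R, S, 
...); the trailing "s" in "Ds" is added by ps for session leaders.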

And yes, I now see a reboot as a possible "fix", but that would not assure me 
that the problem will not resurface. So, for the time being, I will leave my 
second production server as it is ... so I can further narrow down the 
potential causes of this strange STDOUT slowdown (especially if someone has a 
tip for me :-)

Andras Fabian

(In the meantime my "slow" server finished the COPY ... it took 46 minutes 
instead of the 3 minutes on the fast machines ... a slowdown by a factor of 
about 15.)




-----Original Message-----
From: Andras Fabian 
Sent: Monday, 12 July 2010 10:45
To: 'Tom Lane'
Cc: pgsql-general@postgresql.org
Subject: AW: [GENERAL] PG_DUMP very slow because of STDOUT ?? 

Hi Tom (or others),

are there any recommended settings or ways of using oprofile in a situation 
like this? I got it working and have seen a first profile report, but then 
managed to completely freeze the server on a second try with different 
oprofile settings (the next tests will run against the newly installed, 
identical servers). 

Andras Fabian

-----Original Message-----
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Friday, 9 July 2010 15:39
To: Andras Fabian
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] PG_DUMP very slow because of STDOUT ?? 

Andras Fabian <fab...@atrada.net> writes:
> Now I ask, what's going on here ???? Why is COPY via STDOUT so much slower on 
> our new machine?

Something weird about the network stack on the new machine, maybe.
Have you compared the transfer speeds for Unix-socket and TCP connections?

On a Red Hat box I would try using oprofile to see where the bottleneck
is ... don't know if that's available for Ubuntu.

                        regards, tom lane
