Hi Craig I am able to reproduce the issue now . I have postgres-8.1.2 installed in cluster setup.
I have started the below query from one system let say A to system B in cluster . psql -U<dbname> -h<ip of system B> -c "select sleep(300);" while this command is going on , system B is stopped abruptly by taking out the power cable from it . This caused the above query on system A to hang. This is still showing in 'ps -eaf' output after one day. I think the tcp keepalive mechanism which has been set at system level should have closed this connection. But it didnt . Following keepalive values have been set on system A : net.ipv4.tcp_keepalive_intvl = 75 net.ipv4.tcp_keepalive_probes = 9 net.ipv4.tcp_keepalive_time = 7200 Why system level keepalive is not working in this case. Well, I learnt , from the link you have provided, that programs must request keepalive control for their sockets using the setsockopt interface. I wonder if postgres8.1.2 supports / request for system level keepalive control ?? If not, then which release/version of postgres supports that ?? Thanks... Tamanna On Tue, Nov 29, 2011 at 4:56 PM, tamanna madaan < tamanna.mad...@globallogic.com> wrote: > well, one question : Is tcp-keep-alive enabled by default in postgres-8.1.2 . > > I am using postgres on linux platform . > > > > On Tue, Nov 29, 2011 at 8:51 AM, tamanna madaan < > tamanna.mad...@globallogic.com> wrote: > >> Hi Craig >> >> Thanks for your reply . But unfortunately I dont have that process >> running right now. I have already killed that process . But I have seen >> this problem sometimes on my setup. >> It generally happens when the remote system is going slow for some reason >> (CPU utilization high etc.) . But whatever is the reason , I would assume >> that the query should return with some error or so >> in case the system, the query is running on , is rebooted . But it >> doesn't return and remain stuck. Moreover, the same query sometimes hangs >> even if it is run on local postgres database so I dont think >> network issues have any role in that . Please help. >> >> Thanks.... >> >> Regards >> Tamanna >> >> >> On Tue, Nov 29, 2011 at 7:58 AM, Craig Ringer <ring...@ringerc.id.au>wrote: >> >>> On 11/28/2011 05:30 PM, tamanna madaan wrote: >>> >>>> Hi All >>>> I have postgres installed in cluster setup. My system has a script >>>> which executes the below query on remote system in cluster. >>>> psql -t -q -Uslon -h<hostip> -d<dbname> -c"select 1;" >>>> But somehow this query got stuck. It didnt return even after the remote >>>> system( on which this query was supposed to execute) is rebooted . What >>>> could be the reason ?? >>>> >>> >>> I relised just after sending my last message: >>> >>> You should use ps to find out what exactly psql is doing and which >>> system call it's blocked in in the kernel (if it's waiting on a syscall). >>> As you didn't mention your OS I'll assume you're on Linux, where you'd use: >>> >>> ps -C psql -o wchan:80= >>> >>> or >>> >>> ps -p 1234 -o wchan:80= >>> >>> ... where "1234" is the pid of the stuck psql process. In a psql waiting >>> for command line input I see it blocked in the kernel routine "n_tty_read" >>> for example. >>> >>> >>> If you really want to know what it's doing you can also attach gdb and >>> get a backtrace to see what code it's paused in inside psql: >>> >>> gdb -q -p 1234 <<__END__ >>> bt >>> q >>> __END__ >>> >>> If you get a message about "missing debuginfos", lots of lines reading >>> "no debugging symbols found" or lots of lines ending in "?? ()" then you >>> need to install debug symbols. How to do that depends on your OS/distro so >>> I won't go into that; it's documented on the PostgreSQL wiki under "how to >>> get a stack trace" but you probably won't want to bother if this is just >>> for curiosity's sake. >>> >>> You're looking for output that looks like: >>> >>> #1 0x000000369d22a131 in rl_getc () from /lib64/libreadline.so.6 >>> #2 0x000000369d22a8e9 in rl_read_key () from /lib64/libreadline.so.6 >>> #3 0x000000369d215b11 in readline_internal_char () from >>> /lib64/libreadline.so.6 >>> #4 0x000000369d216065 in readline () from /lib64/libreadline.so.6 >>> >>> ... etc ... >>> >>> >>> -- >>> Craig Ringer >>> >> >> >> >> -- >> Tamanna Madaan | Associate Consultant | GlobalLogic Inc. >> Leaders in Software R&D Services >> ARGENTINA | CHILE | CHINA | GERMANY | INDIA | ISRAEL | UKRAINE | UK | USA >> >> Office: +0-120-406-2000 x 2971 >> >> www.globallogic.com >> >> >> > > > -- > Tamanna Madaan | Associate Consultant | GlobalLogic Inc. > Leaders in Software R&D Services > ARGENTINA | CHILE | CHINA | GERMANY | INDIA | ISRAEL | UKRAINE | UK | USA > > Office: +0-120-406-2000 x 2971 > > www.globallogic.com > > > -- Tamanna Madaan | Associate Consultant | GlobalLogic Inc. Leaders in Software R&D Services ARGENTINA | CHILE | CHINA | GERMANY | INDIA | ISRAEL | UKRAINE | UK | USA Office: +0-120-406-2000 x 2971 www.globallogic.com