[GENERAL] Postmaster Out of Memory

2005-06-23 Thread Jeff Gold
Joe Maldonado and I have a vexing problem with PostgreSQL 7.4.5 (we are 
using the PGDG RPMs on a RedHat 9 system).  He posted about this briefly 
with the subject "info on strange error messages on postgresql" on this 
list, but I'll start from the beginning.  About once per week the 
database enters some pathological state where the parent postmaster -- 
NOT one of the child connections -- appears to run out of memory.  This 
causes it to dump megabytes of diagnostic data into the log, none of 
which I know how to analyze.  Then it does it all over again.  And again 
and again and... well, eventually the log grows until it fills as many 
gigabytes as the disk has available.


Here is an example from the logfile:

TopMemoryContext: 87552048 total in 10683 blocks; 181280 free (60 chunks); 87370768 used
TopTransactionContext: 57344 total in 3 blocks; 648 free (5 chunks); 56696 used

DeferredTriggerXact: 0 total in 0 blocks; 0 free (0 chunks); 0 used
SPI Plan: 3072 total in 2 blocks; 64 free (0 chunks); 3008 used
SPI Plan: 3072 total in 2 blocks; 64 free (0 chunks); 3008 used
SPI Plan: 3072 total in 2 blocks; 64 free (0 chunks); 3008 used
MessageContext: 8192 total in 1 blocks; 7248 free (1 chunks); 944 used
PortalMemory: 8192 total in 1 blocks; 8040 free (0 chunks); 152 used
PortalHeapMemory: 3072 total in 2 blocks; 1136 free (12 chunks); 1936 used
CacheMemoryContext: 421519360 total in 60 blocks; 830096 free (457 chunks); 420689264 used
current_hostpairs2_idx: 1024 total in 1 blocks; 640 free (0 chunks); 384 used
current_hostpair2_eid_key: 2048 total in 1 blocks; 704 free (0 chunks); 1344 used

pg_temp_260223: 2048 total in 1 blocks; 768 free (0 chunks); 1280 used
pg_temp_260225: 1024 total in 1 blocks; 640 free (0 chunks); 384 used
pg_temp_260226: 1024 total in 1 blocks; 640 free (0 chunks); 384 used
pg_toast_888405987_index: 1024 total in 1 blocks; 320 free (0 chunks); 704 used
pg_toast_888405987_index: 1024 total in 1 blocks; 320 free (0 chunks); 704 used

pg_temp_260196: 1024 total in 1 blocks; 640 free (0 chunks); 384 used
pg_toast_888382258_index: 1024 total in 1 blocks; 320 free (0 chunks); 704 used
pg_toast_888382258_index: 1024 total in 1 blocks; 320 free (0 chunks); 704 used

[...]
DynaHashTable: 0 total in 0 blocks; 0 free (0 chunks); 0 used
DynaHashTable: 0 total in 0 blocks; 0 free (0 chunks); 0 used
DynaHashTable: 0 total in 0 blocks; 0 free (0 chunks); 0 used
DynaHashTable: 0 total in 0 blocks; 0 free (0 chunks); 0 used
DynaHashTable: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ErrorContext: 8192 total in 1 blocks; 8176 free (0 chunks); 16 used

Eventually the following is reported in the log:

[1587] ERROR:  out of memory
DETAIL:  Failed on request of size 268.

A quick look at the code shows that this comes from the AllocSetAlloc() 
routine at line 502 of src/backend/utils/mmgr/aset.c from the 7.4.5 
release.  This doesn't tell me much about what happened beyond the fact 
that malloc() failed.  I see no evidence in /var/log/messages of 
system-wide memory exhaustion, and periodic sampling of the available 
swap space shows that there is always at least a gigabyte of it free.  
This suggests that a postmaster process has simply run out of address 
space somehow.
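
For reference, the sampling amounts to nothing fancier than something 
like this (the interval and the PID are illustrative; 1587 is the 
process reporting the errors):

  # sample free swap and one backend's address space every five minutes
  PID=1587
  while true; do
    date
    free | grep Swap:                          # total/used/free swap (KB)
    grep -E 'VmSize|VmRSS' /proc/$PID/status   # address space vs. resident
    sleep 300
  done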


In each case the process identifier is the same (1587 in this instance). 
 Other processes are unable to establish new connections, but the log 
shows that existing connections continue to issue queries.  This leads 
us to believe that it is the parent process which is running out of 
memory rather than any child forked for a specific connection.  Is this 
possible?  What might lead to such a case?


There is some history here.  We originally discovered that the database 
would develop long-term problems because system tables weren't being 
vacuumed properly, so we decided to deploy pg_autovacuum.  This didn't 
work out because we have temporary tables that get periodically deleted. 
Every so often pg_autovacuum would notice that a table was missing and 
throw up its hands in despair.  We attempted to address this by 
implementing a wrapper that restarts it whenever it terminates.


After this change the problem described above began.  We have 
subsequently added a delay to the wrapper that doubles in duration each 
time pg_autovacuum crashes to avoid hammering the system when something 
is wrong.  That might make the problem less catastrophic, but of course 
it won't really solve it.  After all, the memory problem is in the 
parent postmaster, not pg_autovacuum.
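
For completeness, the wrapper is essentially a loop like the following 
(the initial delay is illustrative, and the actual pg_autovacuum 
arguments are omitted):

  # restart pg_autovacuum whenever it exits, doubling the delay each
  # time so that a persistent failure doesn't hammer the system
  delay=60
  while true; do
    pg_autovacuum                   # real connection options omitted
    echo "pg_autovacuum exited; sleeping $delay seconds" >&2
    sleep $delay
    delay=$(($delay * 2))
  done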


Why should ordinary vacuuming run the system out of memory?  What should 
we do to make things work reliably?


Jeff



Re: [GENERAL] Postmaster Out of Memory

2005-06-24 Thread Jeff Gold

Tom Lane wrote:

I can absolutely, positively say that that dump is not from the parent
postmaster.  It's a backend.


That makes sense.  I'm still a bit puzzled about why new clients can't 
connect when the problem happens, though.  Does the parent postmaster 
need some resource from one of the backends in order to establish a new 
connection?  Perhaps this will allow us to understand which one is 
producing these dumps.



I can't really think what would blow out CacheMemoryContext like that.


I have some new information here.  In at least one case it turns out that 
the statements provoking the out-of-memory error are these:


  TRUNCATE TABLE events.current_hostpair1
  Query failed with message ERROR:  out of memory
  DETAIL:  Failed on request of size 268.

One of the things we did around the time we introduced the wrapper 
around pg_autovacuum was to change the way we process certain tables: 
instead of dropping a table and re-creating it, we now issue TRUNCATE 
statements.  This proved more efficient in our testing.
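
Concretely, where each cycle used to do something like the first pair of 
commands below, it now does the single TRUNCATE (the DDL file name is 
made up; the table name is the one from the failing statement):

  # old approach: drop and re-create the table on every processing cycle
  psql mazu postgres -c "DROP TABLE events.current_hostpair1"
  psql mazu postgres -f create_current_hostpair1.sql

  # new approach: keep the table and its indexes, just empty it
  psql mazu postgres -c "TRUNCATE TABLE events.current_hostpair1"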


Could this practice be running one of the backends out of memory 
somehow?  Is there a less cache-hungry alternative we could try?



This is quite uninformative.  Are you suggesting that there are many
many lines about indexes?  Many many lines about DynaHashTable?  Both?


I presented the start and the end of what seemed to my uninformed eye to 
be the relevant error messages, since posting all 46.7 megabytes seemed 
impolite.  :-)  According to grep there are 122034 lines that include 
the word "index" in any combination of case.  "DynaHashTable" appears a 
mere eleven times, all at the bottom.  There are 640582 lines in total.
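
For what it's worth, the counts come from nothing more sophisticated 
than this (crash_dump is simply the file we saved the log excerpt into):

  grep -ci index crash_dump           # lines mentioning "index": 122034
  grep -c  DynaHashTable crash_dump   # DynaHashTable entries: 11
  wc -l crash_dump                    # total lines: 640582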


In other words, I'm not sure which of these messages is important or 
what I should be looking for.  Any suggestions?



How many tables/indexes do you have in your database, anyway?


According to a quick-and-dirty script I hacked together we have 484 
tables and 141 indexes outside of the system resources created by 
PostgreSQL itself.  Here's the script:


  # Count user tables and indexes by walking every non-system schema.
  tables=0
  indexes=0
  for schema in $(echo '\dn' | psql mazu postgres | grep '|' | \
                  grep -v Name | sed 's/|.*//' | grep -v pg_ | \
                  grep -v information_schema); do
    # \dt/\di end with a "(N rows)" line; pull out N and accumulate it.
    tables=$(($tables + $(echo '\dt' $schema.\* | \
                          psql mazu postgres | grep rows | \
                          sed 's/.*(//;s/ rows).*//') ))
    indexes=$(($indexes + $(echo '\di' $schema.\* | \
                            psql mazu postgres | grep rows | \
                            sed 's/.*(//;s/ rows).*//') ))
  done
  echo tables = $tables
  echo indexes = $indexes
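
(In hindsight a single query against the system catalogs would probably 
have been simpler; something along these lines should count the same 
things, though I haven't verified it against the script:)

  # count user relations by kind: 'r' = tables, 'i' = indexes
  psql mazu postgres -c "
    SELECT c.relkind, count(*)
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE c.relkind IN ('r', 'i')
       AND n.nspname !~ '^pg_'
       AND n.nspname <> 'information_schema'
     GROUP BY c.relkind;"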


Could this process be running pg_dump, or something else that would try to
touch them all?


We only use pg_dump for manual backup procedures.  There is no evidence 
in our logs that anyone attempted that during the time that the problem 
occurred.  The process that issues the TRUNCATE command works with only 
one of our schemas as far as I can tell.


Thanks for taking the time to respond.

Jeff



Re: [GENERAL] Postmaster Out of Memory

2005-06-24 Thread Jeff Gold

Tom Lane wrote:

I was sort of expecting you to come back and say that you
thought the process might have done 640K TRUNCATEs over its lifespan,
but I guess not?


That's possible.  The process does twelve TRUNCATEs every minute.  The 
problem we're talking about seems to occur only when the system has been 
running for quite some time.  Still, that seems like an awfully slow 
rate: at twelve per minute it would take more than a month of uptime to 
accumulate 640K of them.  Is it really likely that a postmaster backend 
would accumulate relcache entries in this case?



What about temporary tables?


We don't create any tables AS TEMP, but we have something like thirty 
tables that we create, use, drop, and eventually re-create.  This gets 
done about twice a day or so.



(Note: you have the list
of tables involved, in the form of that dump you showed --- does
eyeballing the list suggest any pattern to you?  Are there duplicate
table names in the list?)


What stands out is that nearly all of the entries seem to be pg_* 
tables.  A "grep '^pg_' crash_dump | wc -l" gives 640499 entries, which 
is pretty nearly the entire file.  We are careful not to name our tables 
with the "pg_" prefix, so I believe these are system tables.  This is 
part of what originally made us believe pg_autovacuum was responsible.



Given that you are showing "SHARE" as 180M, I think
it's pretty likely you've got only 2MB of actual local storage in that
backend.


Ah, good point.  I was assuming that memory was being allocated from a 
shared pool, but I have no evidence to support that.



If it's a true leak, then letting it run until RSS is well beyond shared
memory will prove it.


We'll give that a try over the weekend and see where it goes.  (We have 
since stopped the script because Joe is improving the instrumentation of 
the code so that we don't overflow the logs again.)


  Jeff



Re: [GENERAL] Postmaster Out of Memory

2005-06-24 Thread Jeff Gold

Tom Lane wrote:

I suppose what we are looking at here is some operation that is
invalidating a relcache entry but failing to clear it.


Hm.  After discussing this with people here we have a hypothesis.  The 
process that issues the TRUNCATE command does something a little 
peculiar: every minute or so twelve distinct functions are overwritten 
using CREATE OR REPLACE FUNCTION.  Perhaps this is causing the 
postmaster backend to fill its relcache.
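
Joe's test is a small C program, but the same churn can be approximated 
from the shell on a single connection, something like this (the function 
name and body here are invented; the real functions point at different 
tables at different times):

  # feed CREATE OR REPLACE FUNCTION statements to one long-lived backend
  # and watch that backend's memory usage in top
  while true; do
    echo "CREATE OR REPLACE FUNCTION lookup_current() RETURNS integer AS '
            SELECT count(*)::integer FROM events.current_hostpair1;
          ' LANGUAGE sql;"
    sleep 1
  done | psql mazu postgres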


To test this, Joe created a simple C program that performs the same 
operation on slightly changed functions (changed only enough so as not 
to interfere with the real process).  After running this for a little 
over an hour, here is what we see from top:


  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
17033 postgres  16   0  182M 182M  180M R    35.7  4.6  22:33   1 postmaster
17032 root       9   0 63056  61M  1336 S     0.5  1.5   0:21   0 test_function


The RSS value has been steadily growing the entire time, so I think we 
have our cause.  Is this a postmaster memory leak, or is this behavior 
intentional?  Either way, what would be the best way to work around it? 
 These functions are created in this way because we want them to point 
to different tables at different times.


Thanks again for pointing us in the right direction on this.

   Jeff



Re: [GENERAL] Postmaster Out of Memory

2005-06-27 Thread Jeff Gold

Tom Lane wrote:

Found it --- the actual leak is in index_create, not in TRUNCATE or
CLUSTER at all; and it's been there a really long time.
Patch for 7.4 branch attached.


Excellent!  Does index_create refer to something that is invoked as a 
consequence of CREATE INDEX?  I'm looking through the code on our side 
and can't find any obvious places where we recreate indexes, but I might 
just be missing something.  Might it be a consequence of repopulating a 
recently truncated table instead?  I'll confer with Joe on this new 
information tomorrow.  Thanks for taking the time to look into this.


Jeff





Re: [GENERAL] Postmaster Out of Memory

2005-06-27 Thread Jeff Gold

Tom Lane wrote:

TRUNCATE and CLUSTER both rebuild indexes, so they'd also trigger the
leak.


Sorry to bug you again, but I have two quick followup questions: (1) is 
the leak you discovered fixed on the 8.0 branch? and (2) would closing 
the database connection once per day be a reasonable way to work around 
the problem in the absence of the patch you forwarded?


Jeff
