Timo, I reinstalled all our servers with -O0 and full debug info
("-gdwarf-2 -g3"), so I should now be able to look up anything you need.
Here are some more examples with the data you asked for last time, now
without optimization. I've also noticed that a few users seem to be
regulars with this problem.
#4 0x00000000004d8fd7 in i_panic (format=0x501688 "file %s: line %d
(%s): assertion failed: (%s)") at failures.c:207
args = {{gp_offset = 40, fp_offset = 48, overflow_arg_area =
0x7fffcf52b5a0, reg_save_area = 0x7fffcf52b4e0}}
#5 0x000000000044f34e in maildir_uidlist_records_array_delete
(uidlist=0x1b612330, rec=0x1b61f1f8) at maildir-uidlist.c:405
recs = (struct maildir_uidlist_rec * const *) 0x1b607f60
pos = (struct maildir_uidlist_rec * const *) 0x0
idx = 0
count = 2
__PRETTY_FUNCTION__ = "maildir_uidlist_records_array_delete"
#6 0x000000000044f8fc in maildir_uidlist_next (uidlist=0x1b612330,
line=0x1b60426a
"1249562340.M302085P7316V04240006I00DB2CF1_0.gehenna7.rutgers.edu,S=19277:2,S")
at maildir-uidlist.c:552
rec = (struct maildir_uidlist_rec *) 0x1b6209f8
old_rec = (struct maildir_uidlist_rec *) 0x1b61f1f8
recs = (struct maildir_uidlist_rec * const *) 0x1b604264
count = 0
uid = 2807
--------------------
(gdb) fr 5
#5 0x000000000044f34e in maildir_uidlist_records_array_delete
(uidlist=0x1b612330, rec=0x1b61f1f8) at maildir-uidlist.c:405
405 i_assert(pos != NULL);
(gdb) p *recs
$1 = (struct maildir_uidlist_rec * const) 0x1b61fa60
(gdb) p count
$2 = 2
(gdb) p *recs[0]
$3 = {uid = 2735, flags = 6, filename = 0x1b61fa78
"1262739815.M913182P4280V04240006I014D0F92_0.gehenna7.rutgers.edu,S=11791:2,",
extensions = 0x1b620980 "W12036"}
(gdb) p *recs[1]
$4 = {uid = 2736, flags = 0, filename = 0x1b6209a0
"1262739645.M307617P13458V04240006I014D0F91_0.gehenna10.rutgers.edu,S=5005:2,S",
extensions = 0x1b6209f0 "W5124"}
(gdb)
Another user for comparison:
(gdb) fr 5
#5 0x000000000044f34e in maildir_uidlist_records_array_delete
(uidlist=0x6bd7680, rec=0x6bc93d0) at maildir-uidlist.c:405
405 i_assert(pos != NULL);
(gdb) p *recs
$1 = (struct maildir_uidlist_rec * const) 0x6d5c418
(gdb) p count
$2 = 7
(gdb) p *recs[0]
$3 = {uid = 52566, flags = 4, filename = 0x6d5c438
"1262781360.M836147P9323V04240007I0148DC53_0.gehenna5.rutgers.edu,S=2157:2,a",
extensions = 0x6d5c430 "W2201"}
(gdb) p *recs[1]
$4 = {uid = 52568, flags = 0, filename = 0x6cd4590
"1262628888.M94168P6001V04240006I003C41C2_0.gehenna10.rutgers.edu,S=22946:2,Sa",
extensions = 0x6cd4588 "W23481"}
(gdb) p *recs[2]
$5 = {uid = 52569, flags = 0, filename = 0x6bc9328
"1041233872.000006.mbox:2,S", extensions = 0x6bc9318 "W6334"}
(gdb) p *recs[3]
$6 = {uid = 52570, flags = 0, filename = 0x6bc9370
"1041233872.000013.mbox:2,S", extensions = 0x6bc9360 "W1193"}
(gdb) p *recs[4]
$7 = {uid = 52571, flags = 0, filename = 0x6bc92e0
"1041233872.000014.mbox:2,RS", extensions = 0x6bc92d0 "W3095"}
(gdb) p *recs[5]
$8 = {uid = 52572, flags = 4, filename = 0x6d58ec8
"1262782141.M730129P15749V03E80006I0148DC54_0.gehenna9.rutgers.edu,S=7454:2,",
extensions = 0x6d62160 "W7626"}
(gdb) p *recs[6]
$9 = {uid = 52573, flags = 0, filename = 0x6d62140
"1041233872.000016.mbox:2,S", extensions = 0x0}
(gdb)
And one more user:
(gdb) fr 5
#5 0x000000000044f34e in maildir_uidlist_records_array_delete
(uidlist=0x10f44680, rec=0x10f67120) at maildir-uidlist.c:405
405 i_assert(pos != NULL);
(gdb) p *recs
$1 = (struct maildir_uidlist_rec * const) 0x11093760
(gdb) p count
$2 = 2
(gdb) p *recs[0]
$3 = {uid = 55108, flags = 0, filename = 0x11093778
"1262788462.M201250P7938V04240006I01EF76E0_0.gehenna10.rutgers.edu,S=1731278:2,",
extensions = 0x1109d240 "W1753846"}
(gdb) p *recs[1]
$4 = {uid = 55109, flags = 4, filename = 0x1109d268
"1262788793.M851477P3866V045C0007I01EF76E3_0.gehenna8.rutgers.edu,S=19990",
extensions = 0x0}
On 12/30/2009 12:10 PM, David Halik wrote:
Ok, I think I've got some more info and a more accurate timeline for
you. I tried this on two different dumps from two different users. The
count was 4 in the first example and 0 in the second; I'm guessing
that's considered "small"? The links to my gdb sessions for both are
below, with some of the info you were looking for. The corresponding
logs are also there so you can see how each failed. I put everything on
pastebin so it's a little easier to read.
By the way, I also found that the stale NFS file handle message does
appear first in each instance; it was just farther back in the logs.
The "Lowering uid" message also appears immediately after every stale
NFS message, which in turn causes all of this some amount of time later
(sometimes 5 minutes, sometimes 20), when a user performs a new action.
The "file reappeared" message only occurs some of the time. Here's the
chain of events in every case so far that I can see:
1) fdatasync(/rci/nqu/rci/u8/user/dovecot/.INBOX/dovecot-uidlist)
failed: Stale NFS file handle
2) /rci/nqu/rci/u8/user/dovecot/.INBOX/dovecot-uidlist: next_uid was
lowered (n -> n-1, hdr=n-1)
...a few minutes later...
(may or may not be a "message reappeared" warning at this point)
3) /rci/nqu/rci/u8/user/dovecot/.INBOX/dovecot-uidlist: Duplicate file
entry at line 3:
1261057547.M378185P17303V03E80002I0197FB4A_0.gehenna9.rutgers.edu,S=7174:2,RS
(uid i -> n+1,2,3 )
4) Panic: file maildir-uidlist.c: line 405
(maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
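The chain above can be pulled out of the logs per incident so the time
gap between the stale handle and the panic is visible (the log path here
is an assumption; the patterns come from the messages quoted above):

```shell
# Print the four stages of the failure chain in log order, with line
# numbers, so the delay between stage 1 and stage 4 is easy to see.
grep -nE 'Stale NFS file handle|next_uid was lowered|Duplicate file entry|assertion failed: \(pos != NULL\)' \
    /var/log/maillog
```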
One thing to note: after the "Expunged message reappeared, giving a
new UID" warning, the user died quickly, and on more than one server
simultaneously. The gdb output is from server gehenna11 for that log
file. The uid in *recs[0] is also the number you can see in the logs
being lowered from 719 -> 718.
First user log: http://pastebin.com/m1718f07b
First user gdb: http://pastebin.com/m40088dc8
The second user also died on more than one server. The output is also
from gehenna11.
Second user log: http://pastebin.com/f3a1756f2
Second user gdb: http://pastebin.com/m59aacde4
On 12/29/2009 7:50 PM, Timo Sirainen wrote:
On 29.12.2009, at 19.09, David Halik wrote:
I'll definitely get back to you on this. Right now we're closed
until after New Year's and I don't want to go updating the dovecot
package on all of our servers until we're all back at work. I did do
some quick poking around and the count is optimized out, so I'll
have the package rebuilt without optimization and let you know what
the values are at the beginning of next week. Thanks again.
Well, you can probably also get the values in a slightly more difficult way:

p count = p uidlist.records.arr.buffer.used / uidlist.records.arr.element_size
p recs[n] = p *(*uidlist.records.v)[n]
--
================================
David Halik
System Administrator
OIT-CSS Rutgers University
dha...@jla.rutgers.edu
================================