> [karl - Sun Aug 21 21:01:41 2011]:
>
> Hello sysadmins,
>
> Unfortunately, it seems that messages are occasionally going missing
> from the archives on lists.gnu.org, even though they are being
> delivered.
>
> Here is the example I have the most data for:
> on Wed 10 Aug 2011 03:28:00 AM PDT, Shailesh posted comment #8 on a
> savannah ticket, https://savannah.gnu.org/support/?107667#comment8.
> This was received by me (and others) in email -- I will attach the
> text. It was Message-Id:
> <20110810-102801.sv81343....@savannah.gnu.org>.
>
> However, looking at the thread index:
> https://lists.gnu.org/archive/html/savannah-hackers/2011-08/threads.html
> it does not appear (the thread, "CLISP: Permission denied"), is about
> halfway down the page. All other comments in the thread are there,
> including several from Shailesh.
>
> Looking at the date index, for August 10:
> https://lists.gnu.org/archive/html/savannah-hackers/2011-08/index.html
> Shailesh's message is also not there. (The message from him on August
> 11 is the next one, comment #9.)
>
> Bizarrely, it apparently did not reach mail-archive.com, either.
> It would be the next message from comment #7,
> http://www.mail-archive.com/savannah-hackers@gnu.org/msg17425.html
> but it jumps to comment #9, like our thread index, even though
> arch...@mail-archive.com is subscribed to savannah-hackers.
> I don't get that.
>
> It is also not in the mbox archive,
> /var/lib/mailman/archives/private/savannah-hackers.mbox.
>
> Shailesh reposted his comment exactly, as comment #11, to test if it
> would reach the archives this time. It did. So, as one might guess, it
> is not about the content but about something happening at the time of
> the mail processing.
>
> Now, the worst part: a Google search shows thousands of missing
> messages, past and present, even discounting google's overcounting of
> results and the likelihood that some of them are just threading
> computations going wrong.
> http://www.google.com/search?hl=en&safe=off&q=site%3Alists.gnu.org+threads.html+"message+not+available"&oq=site%3Alists.gnu.org+threads.html+"message+not+available"&aq=f&aqi=&aql=&gs_sm=e&gs_upl=19750l21000l0l21202l13l9l0l0l0l6l222l1231l2.6.1l9l0
>
> Help?

It took me a while because the logs for 20110810 had already been
rotated, but I finally figured out what happened: the post was marked
as spam on eggs and then ditched (note the take_sa_hint_router in the
delivery line):

  2011-08-10 06:26:53 [3630] 1Qr5zt-0000wY-Gp <= www-d...@savannah.gnu.org H=eggs.gnu.org [140.186.70.92]:54778 I=[140.186.70.17]:25 P=esmtp S=5383 id=20110810-102801.sv81343....@savannah.gnu.org T="[sr #107667] CLISP: Permission denied" from <www-d...@savannah.gnu.org> for savannah-hackers@gnu.org
  2011-08-10 06:26:53 [3632] cwd=/spool/exim4 3 args: /usr/sbin/exim4 -Mc 1Qr5zt-0000wY-Gp
  2011-08-10 06:26:53 [3630] SMTP connection from eggs.gnu.org [140.186.70.92]:54778 I=[140.186.70.17]:25 closed by QUIT
  2011-08-10 06:26:53 [3632] 1Qr5zt-0000wY-Gp => savannah-hackers <savannah-hackers@gnu.org> F=<www-d...@savannah.gnu.org> P=<www-d...@savannah.gnu.org> R=take_sa_hint_router T=spam_archive S=5383 QT=4s DT=0s
  2011-08-10 06:26:53 [3632] 1Qr5zt-0000wY-Gp Completed QT=4s

So mailman *never* saw the message. mharc is probably smart enough to
notice the missing message ID from the next reply in the thread. This
explains the "message not available" lines in the archives.

The exim routing proceeds like this:

  # We run spamassassin on the host that feeds mail to lists
  take_sa_hint_router:
    verify = false
    condition = ${if eq{${length_3:$h_X-Spam-Flag:}}{YES} {1} {0}}
    driver = accept
    transport = spam_archive

  spam_archive:
    driver = appendfile
    directory = /spam/$local_part/
    create_directory = true
    maildir_format = true

The lost message was at
/spam/savannah-hackers/new/1312972013.H459966P3633.lists.gnu.org.
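
As a rough illustration of the routing above, here is a minimal Python sketch of the decision that take_sa_hint_router makes. It is not how exim implements it: `is_flagged_spam`, `route` and `SPAM_ROOT` are names made up for this example, and returning `None` simply stands for "the message falls through to the later routers", which is an assumption about the rest of the configuration.

```python
from email import message_from_string
from email.message import Message
from pathlib import PurePosixPath
from typing import Optional

# Root of the quarantine maildirs, taken from "directory = /spam/$local_part/".
SPAM_ROOT = PurePosixPath("/spam")


def is_flagged_spam(msg: Message) -> bool:
    """Mirror the router condition: first three characters of X-Spam-Flag are "YES".

    Simplification: only the first X-Spam-Flag header is considered.
    """
    flag = str(msg.get("X-Spam-Flag", "")).strip()
    return flag[:3] == "YES"  # case-sensitive, like exim's eq{}


def route(msg: Message, local_part: str) -> Optional[str]:
    """Return the quarantine maildir for flagged mail, or None if it passes through."""
    if is_flagged_spam(msg):
        return str(SPAM_ROOT / local_part)
    return None  # hypothetical: later routers (e.g. the Mailman one) would handle it


if __name__ == "__main__":
    flagged = message_from_string("X-Spam-Flag: YES\nSubject: test\n\nbody\n")
    print(route(flagged, "savannah-hackers"))
```

With maildir_format enabled, a quarantined message ends up under the `new/` subdirectory of that path, which matches the location of the lost message.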

The reason why SpamAssassin marked it as spam is:

  X-Spam-Report:
    *  3.3 RCVD_IN_PBL RBL: Received via a relay in Spamhaus PBL
    *      [120.62.160.64 listed in zen.spamhaus.org]
    *  0.8 RCVD_IN_SORBS_WEB RBL: SORBS: sender is an abusable web server
    *      [120.62.160.64 listed in dnsbl.sorbs.net]
    *  1.4 RCVD_IN_BRBL_LASTEXT RBL: RCVD_IN_BRBL_LASTEXT
    *      [120.62.160.64 listed in bb.barracudacentral.org]
    *  0.6 HS_INDEX_PARAM URI: Link contains a common tracker pattern.
    * -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1%
    *      [score: 0.0000]
    *  0.8 RDNS_NONE Delivered to internal network by a host with no rDNS
    *  0.0 HELO_NO_DOMAIN Relay reports its domain incorrectly

The user apparently posted the comment from 120.62.160.64, which seems
to belong to a dynamic block of an Indian ISP and is blacklisted in
several places.

I'm not sure how we could reduce the number of miscategorized posts.
The listhelper mechanism is for posts blocked by mailman. The posts
blocked by SpamAssassin currently go to quarantine maildirs that nobody
ever looks at. (I'm not suggesting that someone should; it would
require a huge amount of time.)

-- 
Bernie Innocenti
Systems Administrator, Free Software Foundation
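
For reference, the rule scores in the X-Spam-Report above can be totaled with a short Python sketch. The report lines are copied from this message; `rule_scores` and `RULE_LINE` are names made up for the example. The total comes to exactly 5.0, which is SpamAssassin's stock default for required_score. Whether eggs actually runs with that default is an assumption, not something the logs show.

```python
import re
from decimal import Decimal

# Rule lines copied from the X-Spam-Report above (detail lines omitted).
REPORT = """\
* 3.3 RCVD_IN_PBL RBL: Received via a relay in Spamhaus PBL
* 0.8 RCVD_IN_SORBS_WEB RBL: SORBS: sender is an abusable web server
* 1.4 RCVD_IN_BRBL_LASTEXT RBL: RCVD_IN_BRBL_LASTEXT
* 0.6 HS_INDEX_PARAM URI: Link contains a common tracker pattern.
* -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1%
* 0.8 RDNS_NONE Delivered to internal network by a host with no rDNS
* 0.0 HELO_NO_DOMAIN Relay reports its domain incorrectly
"""

# "* <score> <RULE_NAME> ..." at the start of a line.
RULE_LINE = re.compile(r"^\*\s+(-?\d+\.\d+)\s+([A-Z0-9_]+)\b", re.MULTILINE)


def rule_scores(report: str) -> dict:
    """Map each rule name to its score; Decimal avoids float rounding noise."""
    return {name: Decimal(score) for score, name in RULE_LINE.findall(report)}


scores = rule_scores(REPORT)
total = sum(scores.values(), Decimal("0"))
# Points contributed by the three DNS blacklist hits on the poster's IP.
dnsbl = sum((v for k, v in scores.items() if k.startswith("RCVD_IN_")), Decimal("0"))

if __name__ == "__main__":
    print("total:", total)                  # total: 5.0
    print("DNSBL share:", dnsbl)            # DNSBL share: 5.5
    print("without DNSBL:", total - dnsbl)  # without DNSBL: -0.5
```

The three blacklist hits on the poster's address account for 5.5 points, so without them the same message would have scored -0.5.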