I'm in the process of building a new set of mail servers, running on Ubuntu 18.04 with the latest of everything. The old systems were using AWL for whitelisting, while the new ones will be using the bigger/better/brighter/now-with-AI! txrep ...

I have a large curated set of ham and spam in various folders, with email going back a long time. Once everything is moved to the new systems, I'll train SA by running each email through:

cat /var/mail/..../Maildir/INBOX/cur/154....:2,S | spamc -4 -d localhost -L ham

I can't just use sa-learn as the mailstore is encrypted with dovecot mail_crypt. So it will be a for loop decrypting each mailfile and pumping it through. That part all works fine.

But one thing I noticed was that the txrep last_hit field gets populated with the current date/time that the email is pushed through spamc. Here's an example of a single spam from about a month ago - I've chopped it a bit to help it fit ...

MariaDB [spamassassin]> select * from txrep where totscore > 0;
+----------+---------------------------------+--------+-------+----------+------------+---------------------+
| username | email                           | ip     | count | totscore | signedby   | last_hit            |
+----------+---------------------------------+--------+-------+----------+------------+---------------------+
| debian-sp| 194.25.134.19                   | none   |     1 |       20 |            | 2019-11-02 12:29:23 |
| debian-sp| 51a57833f1a45ef1cf@sa_generated | none   |     2 |       40 | 1572413195 | 2019-11-02 12:29:23 |
| debian-sp| bosch_siemens_haus              | 194.25 |     1 |       20 |            | 2019-11-02 12:29:23 |
| debian-sp| bosch_siemens_haus              | none   |     1 |       20 |            | 2019-11-02 12:29:23 |
| debian-sp| t-online.de                     | 194.25 |     1 |       20 |            | 2019-11-02 12:29:23 |
| debian-sp| _185.118.165.202_               | none   |     1 |       20 | helo       | 2019-11-02 12:29:23 |
+----------+---------------------------------+--------+-------+----------+------------+---------------------+
6 rows in set (0.001 sec)

This was a clean database, and I ran a single spam email from about a month ago through spamc.

The last_hit field holds the time the email was run through spamc, not the time it was originally received. I think this will mess up the expiration etc., won't it? These records will hang around for an extra 60 days as things are expired out with the usual mysql query:

DELETE FROM txrep WHERE last_hit <= (now() - INTERVAL 120 day);
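To put a number on the concern, here is a rough sketch of the arithmetic. The 120-day window comes from the query above; the function name and the example dates are made up for illustration.

```python
from datetime import datetime, timedelta

# Expiry window, taken from the DELETE query above.
EXPIRY = timedelta(days=120)


def extra_retention(received, trained):
    """Return how much longer a record survives when last_hit is set to
    the training time instead of the time the mail was received."""
    intended_expiry = received + EXPIRY   # if last_hit were the receipt time
    actual_expiry = trained + EXPIRY      # last_hit is the training time
    return actual_expiry - intended_expiry


# Hypothetical dates: mail received a month before it is trained.
print(extra_retention(datetime(2019, 10, 2), datetime(2019, 11, 2)))
```

So each record outlives its intended expiry by exactly the age of the mail at training time, which for an archive going back a long way could be considerable.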

I'm wondering if there's a way to pick up the date from the topmost Received: header and use that.
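For what it's worth, pulling that date out of a message is simple enough outside of SA. Below is a minimal Python sketch using only the standard library; the function name is my own, and it assumes the usual Received: layout where the timestamp follows the last semicolon. It says nothing about whether txrep itself can be told to use that date.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def topmost_received_date(raw_message):
    """Return the timestamp of the topmost Received: header as a
    datetime, or None if there is no usable header."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if not received:
        return None
    # Headers are returned in message order, so index 0 is the topmost
    # one. The date-time is the part after the last ';'.
    _, sep, date_part = received[0].rpartition(";")
    if not sep:
        return None
    try:
        return parsedate_to_datetime(date_part.strip())
    except (TypeError, ValueError):
        # Unparseable date; the exception type varies by Python version.
        return None


# Made-up sample message for illustration.
sample = (
    "Received: from a.example by b.example; Wed, 02 Oct 2019 10:15:00 +0200\n"
    "Received: from x.example by a.example; Tue, 01 Oct 2019 09:00:00 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
print(topmost_received_date(sample))
```

The topmost Received: header is the one added by the final receiving server, so it is the least likely to have been forged by the sender.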

Hmm, come to think of it, bayes_seen has the same issue: it has a lastupdate field. bayes_token has an atime field, but I'm not sure whether that one is affected.

Is this going to cause issues?
