I'm in the process of building up a new set of mail servers, running on
Ubuntu 18.04 with the latest of everything. The old systems were using AWL
for whitelisting, while the new ones will be using the
bigger/better/brighter/now-with-AI! txrep ...
I have a large curated set of ham and spam in various folders, with
email going back a long time. Once everything is moved to the new
systems, I'll train SA by running each email through:
cat /var/mail/..../Maildir/INBOX/cur/154....:2,S | spamc -4 -d localhost -L ham
I can't just use sa-learn as the mailstore is encrypted with dovecot
mail_crypt. So it will be a for loop decrypting each mailfile and
pumping it through. That part all works fine.
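For reference, a rough sketch of that loop in Python rather than shell. The names `spamc_argv`, `train_folder`, `decrypt` and `feed` are made up for illustration. `decrypt` stands in for whatever turns one mail_crypt-encrypted file into plain message bytes; that step is not shown here. The spamc options are the same ones as in the command above.

```python
import subprocess
from pathlib import Path


def spamc_argv(msg_class):
    """Build the spamc command line for learning one message as ham or spam."""
    if msg_class not in ("ham", "spam"):
        raise ValueError("msg_class must be 'ham' or 'spam'")
    return ["spamc", "-4", "-d", "localhost", "-L", msg_class]


def run_spamc(argv, raw):
    """Pipe the raw message into spamc on stdin and return its exit code."""
    return subprocess.run(argv, input=raw, stdout=subprocess.DEVNULL).returncode


def train_folder(cur_dir, msg_class, decrypt, feed=run_spamc):
    """Feed every file in a Maildir 'cur' directory through spamc.

    decrypt: hypothetical callable taking a Path and returning the decrypted
             message as bytes (placeholder for the mail_crypt step).
    feed:    callable(argv, raw_bytes) -> exit code; replaceable for dry runs.
    Returns a dict mapping file name to the exit code reported by feed.
    """
    argv = spamc_argv(msg_class)
    results = {}
    for path in sorted(Path(cur_dir).iterdir()):
        if path.is_file():
            results[path.name] = feed(argv, decrypt(path))
    return results
```

On an unencrypted test folder, `decrypt` could simply be `lambda p: p.read_bytes()`. Passing a different `feed` allows a dry run that never contacts spamd.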
But one thing I noticed was that the txrep last_hit field gets populated
with the current date/time that the email is pushed through spamc.
Here's an example of a single spam from about a month ago - I've chopped
it a bit to help it fit ...
MariaDB [spamassassin]> select * from txrep where totscore > 0;
+----------+---------------------------------+--------+-------+----------+------------+---------------------+
| username | email                           | ip     | count | totscore | signedby   | last_hit            |
+----------+---------------------------------+--------+-------+----------+------------+---------------------+
| debian-sp| 194.25.134.19                   | none   |     1 |       20 |            | 2019-11-02 12:29:23 |
| debian-sp| 51a57833f1a45ef1cf@sa_generated | none   |     2 |       40 | 1572413195 | 2019-11-02 12:29:23 |
| debian-sp| bosch_siemens_haus              | 194.25 |     1 |       20 |            | 2019-11-02 12:29:23 |
| debian-sp| bosch_siemens_haus              | none   |     1 |       20 |            | 2019-11-02 12:29:23 |
| debian-sp| t-online.de                     | 194.25 |     1 |       20 |            | 2019-11-02 12:29:23 |
| debian-sp| _185.118.165.202_               | none   |     1 |       20 | helo       | 2019-11-02 12:29:23 |
+----------+---------------------------------+--------+-------+----------+------------+---------------------+
6 rows in set (0.001 sec)
This was a clean database, and I ran a single spam email from about a
month ago through spamc.
The last_hit field holds the time the email was run through spamc, not
the date of the email itself. I think this will mess up the expiration
etc., won't it? These records will hang around for an extra 60 days, as
things are expired out with the usual MySQL query:
DELETE FROM txrep WHERE last_hit <= (now() - INTERVAL 120 day);
I'm wondering if there's a way to pick up the date from the topmost
Received: header and use that.
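To make that concrete, here is a minimal sketch of pulling the timestamp out of the topmost Received: header with Python's standard `email` package. The function name `topmost_received_date` is made up. It assumes the header ends with "; <date>" in the usual RFC 5322 form, and it returns None when there is no Received: header or the date cannot be parsed. It only extracts the date. It does not touch txrep.

```python
from email import message_from_bytes
from email.utils import parsedate_to_datetime


def topmost_received_date(raw):
    """Return the date of the first (topmost) Received: header, or None."""
    msg = message_from_bytes(raw)
    received = msg.get_all("Received")
    if not received:
        return None
    # The timestamp is the part after the last semicolon of the header.
    _, sep, date_part = received[0].rpartition(";")
    if not sep:
        return None
    try:
        return parsedate_to_datetime(date_part.strip())
    except (TypeError, ValueError):
        # Unparsable dates raise TypeError or ValueError depending on
        # the Python version.
        return None
```

The returned datetime carries the timezone offset from the header. It could in principle be written into last_hit after training, but whether overriding that field by hand is safe is exactly the open question here.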
Hmm, come to think of it, bayes_seen has the same issue: it has a
lastupdate field. bayes_token has an atime field; I'm not sure about that one.
Is this going to cause issues?