For anyone interested, a little MySQL server tuning largely resolved my performance issues with sa-learn training when using txrep. As a reference point, training on ~6400 messages (most of which had already been learned) took about 14 minutes with both txrep and bayes enabled, and about 3.5 minutes less with txrep disabled. (You could do much better with better hardware.)
For those interested in improving txrep training performance, I wonder
if it couldn't be improved tremendously. I'm a little unclear on what
it does and doesn't track, particularly given this statement:

https://apache.googlesource.com/spamassassin/+/trunk/lib/Mail/SpamAssassin/Plugin/TxRep.pm#1805

    The TxRep plugin currently does track each message individually,
    hence it does not detect when you learn the message repeatedly. It
    will add/subtract the penalty/bonus score each time the message is
    fed to the spam learner.

Is that a typo? If it does track individual messages, it seems obvious
that it *should* detect learning a message repeatedly, and do nothing
when you try to re-learn a message as the same type. (I have some
queries saved and such, but e.g. learning a message the first time
issued 19 queries, while relearning the same message as the same type
issued 41 queries.)

My guess is that the current state of things is "could be improved";
maybe file an RFE?

Thanks...

On Wed, 2017-07-12 at 17:40 -0600, Jesse Norell wrote:
> One thing pointing to maybe a need for reworking the training logic is
> that I have txrep_track_messages at the default (1), and almost every
> message in my corpus has already been trained; each run brings in only a
> handful of new messages (usually 10-20, but often 0, and always < 100).
> It sure seems like a quick check to find out if it has already learned
> this message as the same type (ham/spam) would take a single query, then
> move on to the next message for those already seen; but I see sa-learn
> doing many INSERTS (usually failing with 'Duplicate entry') and UPDATEs
> of the txrep table.
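To make the single-query check described above concrete, here is a minimal Python sketch. It is not TxRep's code or schema: the table seen_messages, its columns, and the functions open_db and learn are all hypothetical names made up for illustration, and an in-memory SQLite database stands in for MySQL so the example runs anywhere. It only shows the control flow (one SELECT, then skip if the message was already learned as the same type).

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical tracking table."""
    # SQLite stands in for MySQL here purely so the sketch is runnable.
    conn = sqlite3.connect(":memory:")
    # Hypothetical table; this is NOT the real txrep schema.
    conn.execute(
        "CREATE TABLE seen_messages ("
        " msgid TEXT PRIMARY KEY,"
        " kind  TEXT NOT NULL)"  # 'ham' or 'spam'
    )
    return conn


def learn(conn, msgid, kind):
    """Return 'skipped', 'learned', or 'relearned' for one message."""
    # The single lookup query: has this message been learned before?
    row = conn.execute(
        "SELECT kind FROM seen_messages WHERE msgid = ?", (msgid,)
    ).fetchone()

    # Already learned as the same type: do nothing further.
    if row is not None and row[0] == kind:
        return "skipped"

    # New message, or learned before as the other type.
    # (Any reputation score adjustments would happen here.)
    conn.execute(
        "INSERT OR REPLACE INTO seen_messages (msgid, kind) VALUES (?, ?)",
        (msgid, kind),
    )
    conn.commit()
    return "learned" if row is None else "relearned"
```

With a corpus where nearly every message has been seen before, a loop calling learn() would issue one SELECT per already-learned message and no INSERT or UPDATE for it.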
>
>
> On Wed, 2017-07-12 at 09:59 -0600, Jesse Norell wrote:
> > Hello,
> >
> >   I have txrep data in a mysql database, and am working on a training
> > script to run sa-learn; with bayes also in MySQL and a corpus size of
> > 5279 nspam and 849 nham, sa-learn takes a full 2 hours to run with txrep
> > enabled (use_txrep 1), but only 13 minutes with txrep disabled
> > (use_txrep 0).  One of my main gripes with the old AWL was that it
> > didn't learn/correct when training messages, so I love that txrep does
> > that, but does anyone have any tips to improve txrep training
> > performance?  Either tweaks/improvements on my end, or even a little
> > thought on logic redesign in that area?
> >
> > Thanks,
> >
>
-- 
Jesse Norell
Kentec Communications, Inc.
970-522-8107 - www.kci.net