Dear experts, I have a question about the performance of spam/ham learning. I store spam in a mail folder accessible via IMAP, and I want to feed it into Bayes. For this, I do:
fetchmail -asnp IMAP --folder autolearn --user $username \
    -m "formail -s | spamassassin -d >> /tmp/x" $mailserver

# now learn as user
formail </tmp/x -s spamc -u $user -L spam
# now feed to bayes
formail </tmp/x -n 3 -s spamc -u $user -C report
# we could also do this: spamassassin -r --mbox

Question 1: Do I need to call spamc twice, once with "-L spam" and once with "-C report"? Do I understand correctly that -L trains my Bayes database, while -C reports to SpamCop etc.?

Question 2: Is calling spamassassin better than spamc for such an mbox?

Question 3, my main question: The fetchmail command above takes *ages* -- when I call it as shown it runs for *hours*, whereas replacing the "-m" parameter with "cat >>/tmp/x" finishes in 7 minutes. I can see spamassassin using 100% CPU. Why is it so extremely slow and CPU-consuming just to remove any existing markup? I would like to remove existing markup, and I also need the resulting mbox for other things. Is there a way to make it fast enough to be usable?

mfg zmi
-- 
// Michael Monnerie, Ing.BSc ----- http://it-management.at
// Tel: 0660 / 415 65 31 .network.your.ideas.
// PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38 500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net Key-ID: 1C1209B4
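For what it's worth, a possible restructuring of the pipeline above -- a sketch, not tested against a live mail setup, assuming the same $username/$mailserver variables and that the mbox lands in /tmp/x. The idea is that "spamassassin -d" pays the full Perl interpreter startup cost once per message when run as the fetchmail MDA, so fetching raw first and batch-processing afterwards may avoid most of the overhead; sa-learn reads an mbox directly and, per its documentation, removes existing SpamAssassin markup itself before learning:

```shell
# 1. Fetch raw messages without per-message spamassassin (this was the
#    path that took 7 minutes instead of hours):
fetchmail -asnp IMAP --folder autolearn --user "$username" \
    -m "cat >> /tmp/x" "$mailserver"

# 2. Train Bayes straight from the mbox: sa-learn starts Perl once for
#    the whole run, and is documented to strip existing SpamAssassin
#    markup from messages before learning them:
sa-learn --spam --mbox /tmp/x

# 3. Only if a markup-free mbox is still needed for other tools: strip
#    markup per message with formail. This still starts spamassassin
#    once per mail, but without the fetchmail round-trip in between:
formail -s spamassassin -d < /tmp/x > /tmp/x.clean
```

Whether step 3 is acceptable depends on how many messages are in the folder; if the clean mbox is not strictly required, steps 1 and 2 alone should cover the learning part.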