However, many of the tokens in even Forbes and WP newsletters may also occur in other, spammy newsletters, so be careful when training even these.
On 15.02.23 09:51, Alex wrote:
This is exactly what I was thinking. When going through the quarantine, it's very difficult not only to identify which newsletters may have been miscategorized or trained incorrectly, but also to correct an improperly trained newsletter (or email in general) afterwards.
This is why I recommend not doing any training on newsletters, or at least no ham training unless they are known.
If you get the score down enough that the mail is not classified as spam, you've won and should not continue (unless you are willing to check all BAYES_00 mail for suspicious newsletters and train those as spam, watching how much that affects the mentioned Forbes and WP newsletters).
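For anyone who does want to review low-Bayes newsletters, here is a minimal Python sketch of one way to list candidates. It assumes SpamAssassin has added an X-Spam-Status header naming the rules hit, that the lowest Bayes bucket appears there as BAYES_00, and that newsletters can be roughly recognised by a List-Unsubscribe or List-Id header. The function names are made up for illustration; this only lists candidates and trains nothing.

```python
import mailbox
import re

# Assumption: the rules hit are listed in X-Spam-Status, e.g. "tests=BAYES_00,..."
BAYES_LOW = re.compile(r"\bBAYES_00\b")


def is_list_mail(msg):
    """Rough heuristic: bulk/newsletter mail usually carries one of these headers."""
    return any(msg.get(h) for h in ("List-Unsubscribe", "List-Id"))


def low_bayes_newsletters(messages):
    """Yield (sender, subject) for list mail that hit the BAYES_00 rule."""
    for msg in messages:
        status = str(msg.get("X-Spam-Status", ""))
        if BAYES_LOW.search(status) and is_list_mail(msg):
            yield str(msg.get("From", "")), str(msg.get("Subject", ""))


def from_mbox(path):
    """Convenience wrapper: scan an mbox file (the path is up to the caller)."""
    return list(low_bayes_newsletters(mailbox.mbox(path)))
```

The output is only a review list: each sender still has to be looked at by hand before anything is trained as spam.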
Too bad it isn't possible to build a shared list of trusted newsletters/senders to compensate for these mistakes.
I wouldn't trust such a list; too many organizations send their newsletters to anyone they (n)ever communicated with...
On a related note, how about emails with only an image attachment? People use email to send pictures, screenshots and other messages with nothing in the body and sometimes not even a subject, and these aren't spam. The ones I see in the quarantine are almost always ham, and despite training them as ham (even with --max-size 0), they continue to be tagged as spam.
There are a few rules that are supposed to catch short/empty messages with an image attachment.
There is the ExtractText plugin, which supports OCR scanning with tesseract and should be able to extract any text in those images. But note that OCR takes time.
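To make "short/empty message with an image attachment" concrete, here is a rough Python sketch of that kind of check. It is not how the SpamAssassin rules are implemented; the function names and the max_text_chars threshold are assumptions made up for illustration.

```python
from email import message_from_binary_file, policy


def is_image_only(msg, max_text_chars=0):
    """True if the message has at least one image part and (almost) no text.

    Simplification: HTML markup counts as text, so an HTML part that holds
    only tags still makes the message look "non-empty" here.
    """
    text_len = 0
    images = 0
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.get_content_maintype() == "image":
            images += 1
        elif part.get_content_type() in ("text/plain", "text/html"):
            payload = part.get_payload(decode=True) or b""
            text_len += len(payload.strip())
        else:
            return False  # some other attachment type is present
    return images > 0 and text_len <= max_text_chars


def load_message(path):
    """Parse a single message file from disk (the path is up to the caller)."""
    with open(path, "rb") as fh:
        return message_from_binary_file(fh, policy=policy.default)
```

Running is_image_only(load_message(path)) over the quarantine would show how many of the tagged messages really are "picture and nothing else".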
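To illustrate the "OCR takes time" point, here is a small Python sketch that runs the tesseract command-line tool on one image with a time budget. This is not the ExtractText plugin and says nothing about how the plugin works internally; it assumes a tesseract binary is on the PATH, and the function name and the 10 second default are made up for illustration.

```python
import subprocess


def ocr_text(image_path, timeout=10.0, binary="tesseract"):
    """Return the text recognised in image_path, or None if OCR fails or times out."""
    try:
        result = subprocess.run(
            [binary, image_path, "stdout"],  # "stdout" makes tesseract print the text
            capture_output=True,
            encoding="utf-8",
            errors="replace",
            timeout=timeout,  # give up instead of stalling the mail flow
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # binary missing, not executable, or too slow
    if result.returncode != 0:
        return None
    return result.stdout.strip()
```

Timing a few typical screenshots this way gives a feel for how much delay OCR would add per message.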
I've also always had difficulty marking them so that DCC ignores them.
Yes, from DCC's point of view they are empty messages, and it's hard to score anything besides EMPTY_MESSAGE and the rules I mentioned above.
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.