Re: Suggestion: OCR

2 Mar 2005 19:13:06 -0000

 --- On Wed 03/02, Matt Kettler < [EMAIL PROTECTED] > wrote:

That part is definitely NOT safe in the context of spamassassin... Nonsense 
looks a lot like bugs in spam mailers, and very little like legitimate email to 
SA.


If nothing else, consider the tripwire rules, which look for letter 
combinations that don't exist in normal English...
-----------

Thanks! If so, then it's a bit more work to implement. For example, a trivial 
idea is to exempt text extracted from image attachments from the rules that 
search for nonsense.

I meant 'safe' in the following sense: if the tool outputs some meaningful word 
(e.g. one present in the English wordlist up to a small misspelling), then this 
word is surely present in the image, up to a small misspelling. So if some spam 
rule sees 'viagra' or 'click here to get removed' after OCRing, then it is 
'safe' to give a hit for it.
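
To make the 'safe hit' idea concrete, here is a minimal sketch in Python (not 
SA code). It assumes the OCR output is already available as a plain string; the 
names edit_distance and phrase_hits are made up for illustration, and 'a small 
misspelling' is read as at most one character edit per word.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def phrase_hits(ocr_text, phrase, max_errors_per_word=1):
    """Return True if phrase occurs in ocr_text as consecutive words,
    allowing up to max_errors_per_word edits in each word."""
    words = ocr_text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        if all(edit_distance(w, t) <= max_errors_per_word
               for w, t in zip(window, target)):
            return True
    return False


if __name__ == "__main__":
    print(phrase_hits("Buy v1agra now", "viagra"))
    print(phrase_hits("cl1ck here to get rem0ved",
                      "click here to get removed"))
```

Short words are the weak spot of this approach: with one error allowed, 'to' 
also matches 'go', so single short keywords probably need an exact match.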

Another, more work-intensive method could be as follows (corrections are 
welcome):
1. OCR.
2. Throw out all the words which are not in the English (German, Russian, 
etc.) dictionary up to a misspelling, e.g. tolerate at most one error per word. 
Correct the misspelled words. (Fast dictionary search is required, e.g. 
represent wordlists as balanced binary trees.)
3. Run the other text-based rules.
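
Step 2 above could be sketched roughly as follows, in Python rather than in 
SA's own code. Everything here is an assumption for illustration: the function 
names are made up, the wordlist is a tiny hand-written set, and a hash set 
stands in for the balanced tree (it gives the same fast lookup with less code). 
The OCR step itself is not shown; the input is whatever text the OCR tool 
produced.

```python
import string

ALPHABET = string.ascii_lowercase


def one_edit_variants(word):
    """All strings reachable from word by one deletion, one
    substitution or one insertion of a lowercase ASCII letter."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | replaces | inserts


def clean_ocr_text(ocr_text, dictionary):
    """Keep dictionary words, correct words that are one edit away
    from a dictionary word, and drop everything else."""
    kept = []
    for raw in ocr_text.lower().split():
        word = raw.strip(string.punctuation)
        if not word:
            continue
        if word in dictionary:
            kept.append(word)
            continue
        candidates = sorted(one_edit_variants(word) & dictionary)
        if candidates:
            # Several corrections may fit; take the alphabetically
            # first one so the result is deterministic.
            kept.append(candidates[0])
    return " ".join(kept)


if __name__ == "__main__":
    wordlist = {"click", "here", "to", "get", "removed", "viagra"}
    print(clean_ocr_text("Cl1ck here to get rem0ved xqzt", wordlist))
```

The cleaned string is what would then be handed to the text-based rules in 
step 3. Generating the variants of each OCR word and intersecting them with the 
wordlist avoids comparing every OCR word against the whole dictionary.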

Actually, I posted because I get too much image spam (which passes through SA 
undetected) and wanted to find out whether it can be caught with the present 
tools. Sometimes I get photos and image smileys, so I'm very reluctant to block 
all mail containing images without inspecting the images.

My strong belief is that tools such as gocr can really help. The other question 
is how to integrate it into SA and who would do it. I'm afraid I cannot dig into 
the SA code myself, so this is a suggestion to the advanced users and developers.

Regards,
sasha.

