Theo Van Dinter wrote:

the convolution is a
fingerprint that you could write a rule for and then you don't care
what the content actually is.  For example, you'd render something
like "doc_pdf_jpg", which would make an obvious Bayes token.  In the
same way for a zip file, you could do "zip_pdf zip_jpg zip_txt", etc,
and they'd all be different tokens.

That's a really good idea. Put the chain of extraction in a pseudoheader that can be tested in rules and seen as a token by Bayes.

I'm putting that in the todo for the plugin.
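
As a rough illustration of the chain idea (a sketch only, not the plugin's code): assuming each extracted part is represented as a (type, children) pair, one token per innermost part can be built by joining the container types that led to it. The function name chain_tokens and the tree layout are made up for this example.

```python
def chain_tokens(node, prefix=()):
    """Yield one token per innermost part, naming the chain of containers."""
    kind, children = node
    path = prefix + (kind,)
    if not children:
        # Nothing further was extracted from this part, so emit its chain.
        yield "_".join(path)
        return
    for child in children:
        yield from chain_tokens(child, path)


# Hypothetical extraction result: a zip file holding a PDF, a JPEG and a text file.
attachment = ("zip", [("pdf", []), ("jpg", []), ("txt", [])])

# Value that could be placed in a pseudoheader (the header itself is hypothetical).
header_value = " ".join(chain_tokens(attachment))
print(header_value)
```

A Word document containing a PDF that in turn contains a JPEG would come out as the single token "doc_pdf_jpg", so each distinct nesting gives a distinct token.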

The most common thing to extract apart from text will most likely be images.
Any OCR text extractor tied into my plugin would get to see those images,
but any OCR SA plugins run after my plugin won't. It might be good to make
extracted images available to those, and other image handling plugins.

But yours already ran, so who cares about the others?

Because they work very differently?

An OCR plugin that adds the rendered text to the message for Bayes and text rules is very different from one that does its own scoring based on the OCRed text.

If you're expending the resources to OCR the same image in an email
multiple times ...  You clearly either have a lot of hardware or not a
lot of mail.

*I* don't use any OCR at all. We don't have the resources for that (being a small non-profit NGO), and so far I haven't seen any need for OCR either, since we never had much image spam slip through anyway.

So I will not implement an OCR extractor for my plugin. I'll leave that for others. This is actually one of the reasons I'd like to let existing OCR plugins have access to any images extracted by my plugin, so that those who already use OCR can benefit from the extraction.

I'm not going to spend much time on it though. I'm happy just extracting text. :-) And it does extract text (currently from Word, OpenXML, OpenDocument and RTF documents). :-)

I actually hadn't even thought about this image/OCR etc stuff before Matus suggested it.

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/
