> Matus UHLAR - fantomas wrote:
>
>>> I'm currently working on a modular plugin for extracting text and adding
>>> it to SA message parts.
>>
>> if possible, extract images too, so the fuzzyocr and similar plugins would
>> be able to look at that too.
>
> You mean extract images and add them as parts to the message?
>
> I guess that should be doable. I know that "unrtf" can extract images
> from RTF files. I'll probably implement support for this, but I probably
> won't implement the actual extraction right away.
>
>> IIRC spammers even put PDFs into .doc files to make the stuff harder, but
>> if you manage the above, it shouldn't be hard to extract PDFs too :)

On 25.06.09 14:44, Jonas Eckerman wrote:
> This I don't understand. Do they put PDFs inside .doc files as if the
> .doc was an archive?

I am not sure, but I think something like that was done. What I mean is to
have a generic chain of format converters, with a plain image or plain text
at the end, which could then be processed by classic rules like bayes,
replacetags etc.
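
To make the idea concrete, here is a rough sketch of such a chain. It is
written in Python purely for illustration (a real SA plugin would be Perl),
and everything in it is an assumption: the MIME type names, the toy
converters and the "|" separator are invented and do not correspond to any
existing plugin or real file format.

```python
from typing import Callable, Dict, List, Tuple

Part = Tuple[str, bytes]                  # (MIME type, raw payload)
Converter = Callable[[bytes], List[Part]]


def is_terminal(mime: str) -> bool:
    """Plain text (for bayes-style rules) and images (for OCR plugins) end the chain."""
    return mime == "text/plain" or mime.startswith("image/")


def run_chain(part: Part, converters: Dict[str, Converter],
              max_depth: int = 5) -> List[Part]:
    """Apply converters recursively until only terminal parts remain."""
    mime, data = part
    if is_terminal(mime):
        return [part]
    convert = converters.get(mime)
    if convert is None or max_depth == 0:
        return []                         # unknown format or nesting too deep
    result: List[Part] = []
    for child in convert(data):
        result.extend(run_chain(child, converters, max_depth - 1))
    return result


# Hypothetical converters: a fake ".doc" holding body text plus an embedded
# fake "PDF", separated by "|". Real converters would call external tools.
def demo_doc(data: bytes) -> List[Part]:
    text, _, embedded = data.partition(b"|")
    return [("text/plain", text), ("application/x-demo-pdf", embedded)]


def demo_pdf(data: bytes) -> List[Part]:
    return [("text/plain", data)]


CONVERTERS: Dict[str, Converter] = {
    "application/x-demo-doc": demo_doc,
    "application/x-demo-pdf": demo_pdf,
}

if __name__ == "__main__":
    for mime, payload in run_chain(("application/x-demo-doc", b"hello|buy pills"),
                                   CONVERTERS):
        print(mime, payload)
```

The depth limit is only there so that a maliciously nested attachment cannot
keep the chain running forever; the value 5 is arbitrary.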

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety. -- Benjamin Franklin, 1759
