> Matus UHLAR - fantomas wrote:
>
>>> I'm currently working on a modular plugin for extracting text and adding
>>> it to SA message parts.
>>
>> if possible, extract images too, so the fuzzyocr and similar plugins would
>> be able to look at that too.
>
> You mean extract images and add them as parts to the message?
>
> I guess that should be doable. I know that "unrtf" can extract images
> from RTF files. I'll probably implement support for this, but I'll
> probably not implement actually doing it right away.
>
>> IIRC spammers did even put PDFs into .doc files to make the stuff harder, but
>> if you manage the above, it shouldn't be hard to extract PDFs too :)
On 25.06.09 14:44, Jonas Eckerman wrote:
> This I don't understand. Do they put PDFs inside .doc files as if the
> .doc was an archive?

I am not sure, but I think something like that was done.

What I mean is to have a generic chain of format converters, where at the end
there would be a plain image or even text that could be processed by classic
rules like bayes, replacetags etc.

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
They that can give up essential liberty to obtain a little temporary safety
deserve neither liberty nor safety. -- Benjamin Franklin, 1759
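
To make the converter-chain idea concrete, here is a rough sketch in Python (not SpamAssassin's Perl, and not the plugin under discussion). Everything in it is an assumption made for illustration: the CONVERTERS registry, the flatten() function, the MAX_DEPTH limit, and the two stand-in converters (zip archives and HTML), which take the place of real ones for .doc, RTF or PDF. Each converter turns one part into zero or more parts, and the chain repeats until only plain text or images are left.

```python
import io
import mimetypes
import zipfile
from html.parser import HTMLParser

# A "part" is a (mime_type, payload_bytes) tuple. All names here are
# hypothetical; nothing below is SpamAssassin API.
TERMINAL_PREFIXES = ("text/plain", "image/")  # what classic rules / OCR can use
MAX_DEPTH = 5  # stop runaway nesting (archive inside archive inside ...)


class _TextExtractor(HTMLParser):
    """Naive HTML-to-text helper: collects every text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(payload):
    """Converter: text/html -> one text/plain part."""
    parser = _TextExtractor()
    parser.feed(payload.decode("utf-8", errors="replace"))
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())  # collapse whitespace
    return [("text/plain", text.encode("utf-8"))]


def unzip(payload):
    """Converter: application/zip -> one part per archive member."""
    parts = []
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            mime, _ = mimetypes.guess_type(name)
            parts.append((mime or "application/octet-stream", archive.read(name)))
    return parts


# Registry mapping an input type to the converter that handles it.
CONVERTERS = {
    "text/html": html_to_text,
    "application/zip": unzip,
}


def flatten(part, depth=0):
    """Run a part through the chain until only text or images remain."""
    mime, payload = part
    if mime.startswith(TERMINAL_PREFIXES):
        return [part]
    converter = CONVERTERS.get(mime)
    if converter is None or depth >= MAX_DEPTH:
        return []  # nothing known to convert it, or nested too deeply
    results = []
    for child in converter(payload):
        results.extend(flatten(child, depth + 1))
    return results
```

Calling flatten(("application/zip", data)) on an archive that holds an HTML page and a PNG would hand back one text/plain part and one image/png part. Supporting another container format would then only mean registering one more converter, and the depth limit is there because a chain like this will happily follow whatever nesting a spammer builds.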