Theo Van Dinter wrote:
I would comment that plugins should probably skip parts they want to render that already have rendered text available.
Ah. That's a good idea. Now I'll have to search for a nice way to check that. :-)
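One possible shape for that check, as a rough Python sketch rather than anything from SpamAssassin itself: the `Part` class, its `rendered` field, and `render_if_needed` are hypothetical stand-ins for illustration, not the real Node API. The idea is simply to leave a part alone if an earlier plugin has already supplied text for it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Part:
    """Hypothetical stand-in for a MIME part (not SpamAssassin's Node class)."""
    content_type: str
    raw: bytes
    rendered: Optional[str] = None  # text supplied by an earlier plugin, if any


def render_if_needed(part: Part, extract: Callable[[bytes], str]) -> bool:
    """Run `extract` only when no rendered text exists yet.

    Returns True if this call rendered the part, False if it was skipped.
    """
    if part.rendered is not None:
        return False  # someone else already rendered this part; leave it alone
    part.rendered = extract(part.raw)
    return True


part = Part("application/msword", b"hello")
first = render_if_needed(part, lambda raw: raw.decode().upper())
second = render_if_needed(part, lambda raw: "something else")
```

Here the second call is skipped, so the text from the first extractor is kept.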
I can't see how "set_rendered" would help in creating a functioning chain where one converter could put an arbitrary extracted object (image, PDF, whatever) where another converter could have a go at it.
If a plugin wants to get image/* parts and do something with the contents, they can do that already.
Not if the image/* parts are actually inside a document.
If you want to have a plugin do some work on a part's contents, then store that result and let another plugin pick up and continue doing other work ... There's no official method to do that.
I guessed as much. This, however, is what Matus and I were talking about.
You can store data as part of the Node object.
But what would be a use case for that?
Matus' example was a Word document that contained a PDF (which might in turn contain an image). A plugin that knows how to read Word documents could extract the text of the Word document and then use "set_rendered" to make that available to SA. It cannot currently extract the PDF and make it available to any plugins that know how to read PDFs, though.
Matus' idea about chains would be that, in this example, the plugin reading the Word document would store any other objects somehow. In this case, a PDF. After that, any plugin that knows how to handle PDFs would get to look at the PDF and extract text and other stuff from it. If it extracts an image, it would then store it the same way, and any image handling plugins would find it.
I really don't know how common that is. I have never seen a Word document with a PDF inside it myself.
I have however seen many documents that contain images, and I think it would be a good idea to make those images available to things like FuzzyOCR and ImageInfo.
Arguably, there could be multiple people developing plugins for different types, but you'd need some coordination for the register_method_priority calls to figure out who goes in what order.
For some stuff coordination would be needed, yes. But not for what I'm thinking of.
The text extraction plugin I'm working on (which started this) itself has simple extractor plugins. These extractors will be able to return arbitrary objects as well as text, and my plugin will check the returned objects the same way it checks the original message parts. This way, all the extractors that are tied into my plugin will be able to extract stuff from objects extracted by other extractors. So far so good.
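To make that loop concrete, here is a minimal Python sketch of the idea, not the actual plugin (which is a Perl SpamAssassin plugin). Everything in it is a hypothetical name for illustration: `Obj`, `Result`, `extract_all`, the toy extractor, and the `max_objects` safety limit.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Obj:
    """Hypothetical extracted object: a type plus some payload."""
    mime_type: str
    data: dict


@dataclass
class Result:
    """What a (hypothetical) extractor returns: text plus any embedded objects."""
    text: str = ""
    objects: list = field(default_factory=list)


def extract_all(parts, extractors, max_objects=100):
    """Feed original parts and every returned object through the extractors."""
    queue = deque(parts)
    texts = []
    handled = 0
    while queue and handled < max_objects:  # cap guards against runaway nesting
        obj = queue.popleft()
        handled += 1
        extractor = extractors.get(obj.mime_type)
        if extractor is None:
            continue  # nobody knows this type; skip it
        result = extractor(obj)
        if result.text:
            texts.append(result.text)
        queue.extend(result.objects)  # returned objects get the same treatment
    return texts


def toy_extractor(obj):
    """Toy extractor: the payload already lists its text and embedded objects."""
    return Result(obj.data["text"], obj.data["embedded"])


image = Obj("image/png", {"text": "ocr text", "embedded": []})
pdf = Obj("application/pdf", {"text": "pdf text", "embedded": [image]})
word = Obj("application/msword", {"text": "word text", "embedded": [pdf]})

extractors = {t: toy_extractor for t in ("application/msword", "application/pdf", "image/png")}
texts = extract_all([word], extractors)
```

With the Word-contains-PDF-contains-image example from earlier, `texts` collects the text of all three layers, because each returned object is queued and checked like an original part.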
The most common thing to extract apart from text will most likely be images. Any OCR text extractor tied into my plugin would get to see those images, but any OCR SA plugins run after my plugin won't. It might be good to make extracted images available to those, and other image handling plugins.
My plugin is called after the message is parsed, which is very good for a text extractor. FuzzyOCR (as an example), however, works by scoring OCR output (which may well be very different from the text in the image as we see it), and therefore has to be called at a later stage. The same goes for ImageInfo.
It might therefore be a good idea to make the extracted images and other objects available to scoring plugins as well.
> I just found the register_method_priority() method. \o/
It's nice, isn't it? :-) I'm using it in my URLRedirect plugin.
Note: Do not try to add or remove parts in the tree. The tree is meant to represent the mime structure of the mail, and each node relates to that specific mime part. The tree is not meant to be a temporary data storage mechanism.
Ok. That makes things both easier and less easy for me. I know that I'll have to implement my own list of stuff to loop through when extractors return additional parts in my plugin. That's the easy part.
The difficult part is how to make extracted stuff available to other plugins in a way they understand. I see two main ways to do this:
1: Invent a new way. This would require modifications of any plugins that should check the extracted objects.
2: Add a container part somewhere that "find_parts" would find, but which is not actually a member of the message tree, and then add a simple way to add parts to that container. This would require modification of Mail::SpamAssassin::Message, but not of the plugins.
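Option 2 could look roughly like the following Python sketch. It is an illustration of the idea only, not Mail::SpamAssassin::Message: the `Node` and `Message` classes, the `extracted` list, `add_extracted`, and this simplified one-argument `find_parts` are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical MIME node with a content type and child nodes."""
    content_type: str
    children: list = field(default_factory=list)


@dataclass
class Message:
    """Hypothetical message: the real MIME tree plus a side container."""
    root: Node
    extracted: list = field(default_factory=list)  # never part of the tree

    def add_extracted(self, node):
        """Simple way for a plugin to publish an extracted object."""
        self.extracted.append(node)

    def _walk(self, node):
        yield node
        for child in node.children:
            yield from self._walk(child)

    def find_parts(self, pattern):
        """Return matching tree nodes first, then matching extracted objects."""
        rx = re.compile(pattern)
        in_tree = [n for n in self._walk(self.root) if rx.search(n.content_type)]
        extras = [n for n in self.extracted if rx.search(n.content_type)]
        return in_tree + extras


msg = Message(Node("multipart/mixed", [Node("text/plain"), Node("application/msword")]))
msg.add_extracted(Node("image/png"))  # e.g. an image pulled out of the Word document
images = [n.content_type for n in msg.find_parts(r"^image/")]
```

An image plugin calling `find_parts` would see the extracted image without being changed, while the tree itself still only describes the MIME structure of the mail.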
Regards /Jonas -- Jonas Eckerman Fruktträdet & Förbundet Sveriges Dövblinda http://www.fsdb.org/ http://www.frukt.org/ http://whatever.frukt.org/