Re: Plugin extracting text from docs

Theo Van Dinter Thu, 25 Jun 2009 11:29:13 -0700

On Thu, Jun 25, 2009 at 1:12 PM, Jonas Eckerman<jonas_li...@frukt.org> wrote:
>> Already exists, check recent list history for "set_rendered".
>
> I thought that was for text only.


It is only for text.

> In any case, any plugin looking for images, or a PDF, will most likely look
> at MIME type and/or file name, and then use the "decode" method to get the
> data, and AFAICT the "set_rendered" method doesn't have any impact on any of
> that.

Of course.  There are three states for the data in a Message::Node object:
  - raw: whatever the email had originally.  may be encoded, etc.
  - decoded: the raw content, decoded (ie: base64 or
    quoted-printable).  may be binary.
  - rendered: the text content.  if it was a text part, it's the same
    as decoded.  if it was an HTML part, the decoded data gets
    "rendered" into text.  if it's anything else, the rendered text is
    blank because nothing else is supported.
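
To make the three states concrete, here is a rough sketch in Python.
It is only a toy model: the Node class, its fields, and the tag-stripping
HTML "renderer" are assumptions made up for illustration, not the actual
Message::Node implementation.

```python
import base64
import quopri
from dataclasses import dataclass
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and drops the tags (a crude HTML render)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


@dataclass
class Node:
    """Hypothetical stand-in for one MIME part."""
    mime_type: str
    encoding: str   # "base64", "quoted-printable", or anything else
    raw: bytes      # state 1: exactly what the email carried

    def decode(self) -> bytes:
        """State 2: transfer encoding undone; may be binary."""
        if self.encoding == "base64":
            return base64.b64decode(self.raw)
        if self.encoding == "quoted-printable":
            return quopri.decodestring(self.raw)
        return self.raw

    def rendered(self) -> str:
        """State 3: text only; blank for unsupported types."""
        if self.mime_type == "text/plain":
            return self.decode().decode("utf-8", errors="replace")
        if self.mime_type == "text/html":
            parser = _TextOnly()
            parser.feed(self.decode().decode("utf-8", errors="replace"))
            parser.close()
            return "".join(parser.chunks)
        return ""
```

In this model an image part still has useful decoded bytes, but its
rendered text stays empty until something fills it in.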

The goal with the plugin calls and set_rendered is to allow other
plugins to find parts that they understand how to convert into text,
and set the rendered version of the part to whatever as appropriate.
So if you want to do OCR on image/*, you can do that.  If you want to
convert PDF/DOC/whatever to text, you can do that.

I would add that plugins should probably skip any part that already
has rendered text available, even if it is a type they want to render.

Rules, Bayes, etc, then take all the rendered parts and use them.
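
A minimal sketch of that flow, again as a Python model rather than real
plugin code.  Part, render_parts, fake_ocr and rendered_text are all
hypothetical names; only the idea of a set_rendered call and the
"skip parts that already have text" rule come from the discussion above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Part:
    """Hypothetical stand-in for a MIME part."""
    mime_type: str
    decoded: bytes
    rendered: str = ""

    def set_rendered(self, text: str) -> None:
        self.rendered = text


def render_parts(parts: List[Part], type_prefix: str,
                 to_text: Callable[[bytes], str]) -> int:
    """Render matching parts that have no text yet; return how many."""
    touched = 0
    for part in parts:
        if not part.mime_type.startswith(type_prefix):
            continue          # not a type this converter understands
        if part.rendered:
            continue          # already rendered by someone else
        part.set_rendered(to_text(part.decoded))
        touched += 1
    return touched


def fake_ocr(data: bytes) -> str:
    """Placeholder for a real OCR call (assumption for the example)."""
    return f"[{len(data)} bytes of image text]"


def rendered_text(parts: List[Part]) -> List[str]:
    """What rules and Bayes would see: every non-empty rendering."""
    return [p.rendered for p in parts if p.rendered]
```

Calling render_parts(parts, "image/", fake_ocr) would fill in only the
image parts that are still blank and leave everything else alone.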

> I can't see how "set_rendered" would help in creating a functioning chain
> where one converter could put an arbitrary extracted object (image, pdf,
> whatever) where another converter could have a go at it.

Well, you wouldn't do that because there's no point. ;)   (feel free
to disagree with me though)
If a plugin wants to get image/* parts and do something with the
contents, they can do that already.
If a plugin wants to get application/octet-stream w/ filename "*.pdf"
and do something with the contents, they can do that already.

If you want to have a plugin do some work on a part's contents, then
store that result and let another plugin pick up and continue doing
other work ...  There's no official method to do that.  You can store
data as part of the Node object.  You could potentially also write a
tempfile, though you'll want to be careful to clean up the tempfile as
necessary.
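
Both unofficial options might look roughly like the following Python
sketch.  The scratch dict, the "intermediate" key, and every function
name here are assumptions for illustration; nothing like them is an
official interface.

```python
import os
import tempfile
from contextlib import contextmanager


class Part:
    """Hypothetical stand-in for a MIME part."""

    def __init__(self, decoded: bytes):
        self.decoded = decoded
        self.scratch = {}   # made-up per-part storage for plugin data


def extract_stage(part):
    """First plugin: stash an intermediate result on the part itself."""
    part.scratch["intermediate"] = part.decoded.upper()


def finish_stage(part):
    """Second plugin: pick up the stashed result, if there is one."""
    data = part.scratch.get("intermediate")
    return None if data is None else data.decode("ascii", errors="replace")


@contextmanager
def part_tempfile(part, suffix=""):
    """Write the decoded bytes to a temp file and always remove it."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(part.decoded)
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)
```

The context manager is there so the tempfile gets cleaned up even if
the code using it raises.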

But what would be a use case for that?  I guess something like
converting a PDF to a TIFF, then OCR the TIFF?
I'd probably implement that as a single plugin w/ "ocr" as a function
that gets called from both the PDF and TIFF handlers.
Arguably, there could be multiple people developing plugins for
different types, but you'd need some coordination for the
register_method_priority calls to figure out who goes in what order.
(btw: I just found the register_method_priority() method. \o/)
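
The single-plugin layout suggested above could be sketched like this in
Python.  The conversion and OCR functions are placeholders, and the
handler table keyed on MIME type is just one assumed way to wire it up.

```python
def ocr(tiff_bytes: bytes) -> str:
    """Placeholder OCR step; a real one would call an OCR engine."""
    return f"text from {len(tiff_bytes)}-byte image"


def pdf_to_tiff(pdf_bytes: bytes) -> bytes:
    """Placeholder conversion; a real one would rasterize the PDF."""
    return b"TIFF:" + pdf_bytes


def handle_tiff(data: bytes) -> str:
    """TIFF parts go straight to OCR."""
    return ocr(data)


def handle_pdf(data: bytes) -> str:
    """PDF parts are converted first, then share the same OCR step."""
    return ocr(pdf_to_tiff(data))


# One plugin, two entry points, one shared ocr() function.
HANDLERS = {
    "image/tiff": handle_tiff,
    "application/pdf": handle_pdf,
}


def render(mime_type: str, data: bytes) -> str:
    """Return rendered text, or blank for types without a handler."""
    handler = HANDLERS.get(mime_type)
    return handler(data) if handler else ""
```

Because both handlers live in one place, there is no ordering between
separate plugins to coordinate.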

Note: Do not try to add or remove parts in the tree.  The tree is
meant to represent the mime structure of the mail, and each node
relates to that specific mime part.  The tree is not meant to be a
temporary data storage mechanism.


Hope this helps.
