Re: FuzzyOcr 2.3b released, fixes bugs and improves stability

Logan Shaw Fri, 25 Aug 2006 16:39:12 -0700

On Fri, 25 Aug 2006, Theo Van Dinter wrote:

On Fri, Aug 25, 2006 at 11:43:47PM +0200, decoder wrote:

a) It is VERY hard to realize. To preserve the message, you would need
two plugins, one that runs as first rule, converts the message to text
only, and another one that runs as last rule and puts the image back
into the message (so the message stays unchanged).

The main thing to do is make sure that the image is rendered into
text before the message body text array is cached -- and that's solved
(generally speaking) by doing the rendering in check_start().

Heck, this may be worth having a new plugin call in M::SA::parse()
which happens right after the normal parsing run, called render_parts or
something, where plugins get called with the message and main SA objects,
and are expected to only generate renderings for the non-standard types.


I was pondering this a few weeks ago, and I started thinking
about how some print spoolers (like the old System V stuff
and also CUPS) do format conversions.  Basically, they have
a directed, acyclic graph of formats and converters.  Just as
an example, you might have edges like this:

        text -> postscript ; cmd='enscript'
        postscript -> PCL ; cmd='gs -DDEVICE=SomePclDriver'
        jpeg -> pnm ; cmd='djpeg'
        gif -> pnm ; cmd='giftopnm'
        pnm -> postscript ; cmd='pnmtops'

So then you declare, "hey, my printer takes PCL input."
Then when someone enqueues a jpeg to be printed, the spooler
pieces together a converter pipeline by going backwards through
the graph:

        postscript -> PCL ; cmd='gs -DDEVICE=SomePclDriver'
        pnm -> postscript ; cmd='pnmtops'
        jpeg -> pnm ; cmd='djpeg'

which tells it that it needs to do something like the following to
convert the input format into what a printer understands:

        djpeg | pnmtops | gs -DDEVICE=SomePclDriver
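
To make the backwards walk concrete, here is a minimal sketch in
Python. It is not how any real spooler is implemented; the edge table
just copies the example edges above (commands included, verbatim), and
build_pipeline is a made-up name. It does a breadth-first search from
the printer's format back towards the input format.

```python
from collections import deque

# Hypothetical converter table mirroring the example edges:
# (source format, target format, command).
EDGES = [
    ("text", "postscript", "enscript"),
    ("postscript", "PCL", "gs -DDEVICE=SomePclDriver"),
    ("jpeg", "pnm", "djpeg"),
    ("gif", "pnm", "giftopnm"),
    ("pnm", "postscript", "pnmtops"),
]

def build_pipeline(edges, source, target):
    """Return the list of commands converting source to target, or None."""
    # Index edges by target so we can walk backwards from the printer format.
    by_target = {}
    for src, dst, cmd in edges:
        by_target.setdefault(dst, []).append((src, cmd))

    # Breadth-first search from the target towards the source.
    queue = deque([(target, [])])
    seen = {target}
    while queue:
        fmt, cmds = queue.popleft()
        if fmt == source:
            return cmds
        for src, cmd in by_target.get(fmt, []):
            if src not in seen:
                seen.add(src)
                # Prepend: this converter runs before the ones found so far.
                queue.append((src, [cmd] + cmds))
    return None

print(" | ".join(build_pipeline(EDGES, "jpeg", "PCL")))
```

Because the search is breadth-first, it returns a shortest chain of
converters when more than one route exists.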

It struck me that this isn't entirely different from what
you might want for spam detection via deep content scanning.
The plugins are analogous to printers (they would register what
MIME types they can handle), and the spam is like a print job.
Basically, you want to start with a message and a set of
enabled plugins, then convert the message to all formats that
the plugins can recognize.

There are some differences, though.  With the printer,
you have no interest in printing the intermediate formats.
With a spam detector, you can never rule out that scanning
the intermediate data might help, because it may carry some
tell-tale sign that it's spam.

The point is, though, it could be interesting to have a general
method for allowing spam to be converted into anything that
any plugin can understand, rather than having each plugin do
this itself.
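
A rough sketch of what that could look like, in Python rather than
anything SpamAssassin actually provides. Everything here is
hypothetical: the converter table, the plugin names, and the string
stand-ins for real conversions (which would shell out to djpeg, an OCR
tool, and so on). The difference from the spooler case is that every
rendering, intermediates included, is handed to whichever plugins
registered for that format.

```python
# Hypothetical converters: each maps data in one format to another.
# The stand-ins just tag the data so the flow is visible.
CONVERTERS = {
    ("jpeg", "pnm"): lambda data: "pnm(" + data + ")",
    ("gif", "pnm"): lambda data: "pnm(" + data + ")",
    ("pnm", "text"): lambda data: "ocr(" + data + ")",
}

def all_renderings(fmt, data):
    """Return {format: data} for the input and everything derivable from it."""
    found = {fmt: data}
    todo = [fmt]
    while todo:
        current = todo.pop()
        for (src, dst), convert in CONVERTERS.items():
            if src == current and dst not in found:
                found[dst] = convert(found[current])
                todo.append(dst)
    return found

def scan(fmt, data, plugins):
    """Hand every rendering, intermediates included, to interested plugins."""
    hits = []
    for rendered_fmt, rendered in all_renderings(fmt, data).items():
        for name, accepts in plugins:
            if rendered_fmt in accepts:
                hits.append((name, rendered_fmt, rendered))
    return hits

# Plugins register the formats they can handle (names are made up).
PLUGINS = [("image_stats", {"pnm"}), ("word_list", {"text"})]
for hit in scan("jpeg", "<bytes>", PLUGINS):
    print(hit)
```

Here the pnm intermediate goes to image_stats even though the final
goal is text for word_list, which is exactly the case a print spooler
never needs to care about.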

For example, let's suppose you have a Word document with an
image in it, and that image contains spammy words that can be
recognized via OCR.  Wouldn't it be nice if the Word document
scanner could feed the images it finds back into some framework
so that anything which can scan images can scan things from
inside the Word doc?  Similarly with zip files (although I
doubt spammers will use them since everyone is too lazy to
open them up) and a million other things.
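
The "feed it back into the framework" part can be sketched with zip
files, since Python's standard zipfile module makes that easy to try;
a Word document unpacker would plug in the same way but is not shown.
The names walk, UNPACKERS, guess_type and the MAX_DEPTH limit are all
assumptions for illustration, and the extension-based type guess is
deliberately crude.

```python
import io
import zipfile

MAX_DEPTH = 5  # assumed guard against deeply nested archives

def guess_type(name):
    """Crude, hypothetical type detection by file extension."""
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    known = {"zip": "zip", "jpg": "jpeg", "jpeg": "jpeg",
             "gif": "gif", "txt": "text"}
    return known.get(ext, "unknown")

def unpack_zip(data):
    """Container handler: yield (type, bytes) for each member."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if not info.is_dir():
                yield guess_type(info.filename), archive.read(info)

# Container formats register an unpacker; other formats have none.
UNPACKERS = {"zip": unpack_zip}

def walk(fmt, data, depth=0):
    """Yield every (type, bytes) part, feeding container contents back in."""
    yield fmt, data
    if fmt in UNPACKERS and depth < MAX_DEPTH:
        for child_fmt, child in UNPACKERS[fmt](data):
            yield from walk(child_fmt, child, depth + 1)

def make_zip(members):
    """Build an in-memory zip from {name: bytes}, just for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buf.getvalue()

inner = make_zip({"pills.gif": b"GIF89a..."})
outer = make_zip({"readme.txt": b"hello", "more.zip": inner})
print([fmt for fmt, _ in walk("zip", outer)])
```

Each part that walk yields could then go through whatever scanners
understand its type, so an image scanner sees the gif even though it
was two archives deep.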

Of course, this is just an idea, and it's a little bit of an
"out there" idea, but as long as the conversion topic is being
discussed, I thought I'd bring it up so the idea is on the table.

  - Logan
