On Aug 14, 2006, at 12:01 PM, decoder wrote:
Theo Van Dinter wrote:
On Mon, Aug 14, 2006 at 08:46:51PM +0200, decoder wrote:
gocr features a nice parameter called -d. It is able to remove
smaller particles before scanning, compare these results:
So my problem with the OCR idea is that it inevitably gets to the
point where we'd need to programmatically solve the same graphics as
used in CAPTCHAs, and then I don't think we're really focused on
addressing the core issue any longer.
It's mostly the same way in non-graphic spams -- catching the text
may or may not be difficult with all the obfuscation and such that
goes on. However, catching the fact that there's obfuscation is a
good indication of spam.
Just a thought.
You are absolutely right, this COULD get to a point where it gets
really pointless to scan for text in an image. But for an image it is
even harder to detect obfuscation than it is for text.
For text, I had the idea earlier to utilize a method to detect
obfuscations with approximate matching and then scoring the
obfuscation itself and not the content. But this can easily lead to
false positives, so one must pay attention to what goes on the
wordlist.
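
To make the approximate-matching idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the wordlist, the 0.8 threshold, the one-point-per-hit scoring, and the function names are hypothetical, and difflib's similarity ratio merely stands in for whatever matching method one would really use. It scores tokens that are close to a listed word without being identical to it, so it is the obfuscation that is scored and not the content.

```python
from difflib import SequenceMatcher

# Hypothetical wordlist, for illustration only.
WORDLIST = ("viagra", "mortgage", "pharmacy")


def obfuscation_hits(text, wordlist=WORDLIST, threshold=0.8):
    """Return (token, word, similarity) for tokens that nearly match a listed word."""
    hits = []
    for raw in text.split():
        # Strip surrounding punctuation and fold case before comparing.
        token = raw.strip(".,;:!?\"'()").lower()
        if not token or token in wordlist:
            continue  # an exact match is plain content, not obfuscation
        for word in wordlist:
            similarity = SequenceMatcher(None, token, word).ratio()
            if similarity >= threshold:
                hits.append((token, word, similarity))
                break  # count each token at most once
    return hits


def obfuscation_score(text, points_per_hit=1.0):
    """Score the message by how many near-miss tokens it contains."""
    return points_per_hit * len(obfuscation_hits(text))


if __name__ == "__main__":
    print(obfuscation_hits("Buy v1agra now"))
    print(obfuscation_score("Buy viagra now"))
    print(obfuscation_score("We refinance mortgages"))
```

The last call shows the false-positive risk: an ordinary plural such as "mortgages" is also a near miss for "mortgage" and gets scored, which is why the choice of wordlist and threshold matters.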
For images, this is even harder: how would one try to recognize an
attempt to mislead OCR?
Exactly: how do you know whether the OCR software found no text because
it wasn't there, or because it was sufficiently obfuscated?
I don't mind an arms race for this area of spam fighting. It's a race
the spammers will lose, because at some point the image will become so
unclear as to be like a CAPTCHA system, at which point: who will
bother trying to read the image? In essence, when it comes to this
little part of the spam arms race, we are the Plains Indians and they
are the buffalo. All we have to do is keep herding them toward the
cliff of "images so obfuscated as to be unreadable by humans".
Their only way out of this particular race is to just stop. It's a
lose-lose proposition for them.