I once came up with a partial solution to this problem.
I ran a bunch of dictionaries to find letter pairs that never show up in them.
Granted, these can cause FPs on uncommon abbreviations
and other oddities, but they work well for me!

Maybe the regexes could be modified to check whether these pairs show up after the
</html> tag?
Or even in the bottom 5% of the message?
I use them on the subject only (fewer false positives).
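
A rough Python sketch of that "tail of the message" idea, outside SpamAssassin. The names tail_region and odd_pair_in_tail are made up for illustration, the pattern only borrows the pairs from the J and OTHER rules below, and falling back to the last 5% when no </html> tag is present is an assumption about how the two suggestions might be combined.

```python
import re

# Pairs borrowed from the FVGT_s_OBFU_J and FVGT_s_OBFU_OTHER rules (illustrative subset).
ODD_PAIRS = re.compile(r"j[bcfgw]|vj|vk|xj|xk|yy|zf|zj", re.IGNORECASE)


def tail_region(body, percent=5):
    """Return the text after the last </html> tag, or the last `percent`% of the body."""
    idx = body.lower().rfind("</html>")
    if idx != -1:
        return body[idx + len("</html>"):]
    # Integer arithmetic avoids float rounding; always keep at least one character.
    tail_len = max(1, len(body) * percent // 100)
    return body[-tail_len:]


def odd_pair_in_tail(body):
    """True if an odd letter pair appears in the tail region of the message body."""
    return ODD_PAIRS.search(tail_region(body)) is not None
```

Restricting the search to the tail should skip most legitimate text, at the cost of missing gibberish placed elsewhere in the message.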


"Odd Letter Combinations": using a few word lists, I tried to find letter
pairs that do not occur in any word.

  Last updated: 8/11/2003
  header FVGT_s_OBFU_J Subject =~ /j(b|c|f|g|w)/i
  describe FVGT_s_OBFU_J FVGT - subject contains odd letter combination with J
  score FVGT_s_OBFU_J 0.1

  header FVGT_s_OBFU_OTHER Subject =~ /(vj|vk|xj|xk|yy|zf|zj)/i
  describe FVGT_s_OBFU_OTHER FVGT - subject contains odd letter combinations
  score FVGT_s_OBFU_OTHER 0.1

  header FVGT_s_OBFU_Q0 Subject =~ /(j|k|p|q|t|v|w|z)q/i
  describe FVGT_s_OBFU_Q0 FVGT - subject contains odd letter combination with Q
  score FVGT_s_OBFU_Q0 0.1

  header FVGT_s_OBFU_Q1 Subject =~ /q(a|f|h|j|k|m|n|s|y)/i
  describe FVGT_s_OBFU_Q1 FVGT - subject contains odd letter combination with Q (2)
  score FVGT_s_OBFU_Q1 0.1

  header FVGT_s_OBFU_V Subject =~ /(f|g|q|w)v/i
  describe FVGT_s_OBFU_V FVGT - subject contains odd letter combination with V
  score FVGT_s_OBFU_V 0.1

  header FVGT_s_OBFU_X Subject =~ /(c|g|j|k|q|s|v|z)x/i
  describe FVGT_s_OBFU_X FVGT - subject contains odd letter combination with X
  score FVGT_s_OBFU_X 0.1

  header FVGT_s_OBFU_Z Subject =~ /(f|j|k|p|q|x)z/i
  describe FVGT_s_OBFU_Z FVGT - subject contains odd letter combination with Z
  score FVGT_s_OBFU_Z 0.1
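
For anyone who wants to regenerate a pair list from their own dictionaries, here is a minimal Python sketch of the general approach. It is not the script used to build the rules above; the function name absent_pairs and the tiny sample word list are purely illustrative, and a real run would load one or more full word lists.

```python
from itertools import product
from string import ascii_lowercase


def absent_pairs(words):
    """Return the two-letter combinations that appear in none of the words."""
    seen = set()
    for word in words:
        w = word.lower()
        # Record every adjacent pair of ASCII letters in the word.
        for a, b in zip(w, w[1:]):
            if a in ascii_lowercase and b in ascii_lowercase:
                seen.add(a + b)
    every_pair = {a + b for a, b in product(ascii_lowercase, repeat=2)}
    return sorted(every_pair - seen)


# Hypothetical sample; replace with words read from real dictionary files.
sample_words = ["quick", "jazz", "subject", "vex", "rhythm"]
missing = absent_pairs(sample_words)
print(len(missing), "pairs never occur in the sample list")
```

With small word lists nearly every pair is "missing", so the output only becomes useful once the lists are large and cover the languages you expect in legitimate mail.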




Frederic Tarasevicius
Internet Information Services, Inc.



William Stearns wrote:
> Good evening, all,
>
> On Wed, 8 Oct 2003, Daniel Quinlan wrote:
>
>> Scott A Crosby <[EMAIL PROTECTED]> writes:
>>
>>> The thing is that a gibberish token (not-with the statistics of
>>> $LANG, not-dictionary) should, as a new token, be given a different
>>> bayes category than one that is in a dictionary, etc.
>>
>> Perhaps.  It would probably be somewhat expensive to test every word
>> for gibberish.
>
> I'm almost _certain_ I'm about to look incredibly stupid here, but
> might I suggest:
> Could we simply test for letter frequency?  For a given language,
> it would seem that the frequency would stay predictable; random
> strings of characters would show up with different histograms.
> Note that I handwave over the fact that we probably don't know the
> intended language beforehand.  :-(
> As I said, my apologies for a one-half^Wone-quarter^Wone-eighth
> baked idea.
> Cheers,
> - Bill
>
> --------------------------------------------------------------------------
> "``Threads are like salt.  You like salt, I like salt, but we eat a
> lot more pasta than salt.''  The thread guys are trying to tell you
> that diet of salt is a good idea.  They are wrong, don't listen, eat
> more pasta and be happy."
> -- Larry McVoy <[EMAIL PROTECTED]>
> --------------------------------------------------------------------------
> William Stearns ([EMAIL PROTECTED]).  Mason, Buildkernel, freedups,
> p0f, rsync-backup, ssh-keyinstall, dns-check, more at:
> http://www.stearns.org Linux articles at:
> http://www.opensourcedigest.com
> --------------------------------------------------------------------------
>
>
>
> -------------------------------------------------------
> This SF.net email is sponsored by: SF.net Giveback Program.
> SourceForge.net hosts over 70,000 Open Source Projects.
> See the people who have HELPED US provide better services:
> Click here: http://sourceforge.net/supporters.php
> _______________________________________________
> Spamassassin-talk mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/spamassassin-talk

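Bill's letter-frequency suggestion in the quoted message could be sketched roughly as below. This is only an illustration under stated assumptions: the function names are made up, the distance measure (sum of absolute differences) is one arbitrary choice among many, and the reference histogram is built from a single toy sentence rather than a real corpus for a known language.

```python
from collections import Counter
from string import ascii_lowercase


def letter_histogram(text):
    """Relative frequency of each ASCII letter in the text (all zeros if no letters)."""
    counts = Counter(c for c in text.lower() if c in ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in ascii_lowercase}
    return {c: counts[c] / total for c in ascii_lowercase}


def histogram_distance(text, reference):
    """Sum of absolute frequency differences; 0 means identical, 2 means disjoint."""
    hist = letter_histogram(text)
    return sum(abs(hist[c] - reference[c]) for c in ascii_lowercase)


# Toy reference; a real one would come from a large sample of the expected language.
reference = letter_histogram(
    "the quick brown fox jumps over the lazy dog and then rests in the warm sun"
)
print(histogram_distance("please find the report attached", reference))
print(histogram_distance("xqzjv kqwzx vjqxz", reference))
```

Short strings such as subjects give noisy histograms, and the language still has to be guessed somehow, so any cutoff on the distance would need tuning against real mail.
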

