i've seen a lot of junk lately that is severely mis-spelled in the
subject...

Subject: cheeap  sooftware avaailable ! lpvapvcijv
Subject: Dallase would you pllease just listten to me

So...  i hacked up an eval test to call pspell on the subject line of
each message.... here are the results running against my corpus.  as you
can see, there is nothing overwhelming about these results.....

############################################################
# Tue Dec 30 10:06:17 CST 2003 -- beginning test of testrule.SPELLING.txt:
header SUBJ_SPELLING_00         eval:spell_check_subject('0','10')
describe SUBJ_SPELLING_00       0-9% mis-spelled words in subject

header SUBJ_SPELLING_10         eval:spell_check_subject('10','20')
describe SUBJ_SPELLING_10       10-19% mis-spelled words in subject

header SUBJ_SPELLING_20         eval:spell_check_subject('20','30')
describe SUBJ_SPELLING_20       20-29% mis-spelled words in subject

header SUBJ_SPELLING_30         eval:spell_check_subject('30','40')
describe SUBJ_SPELLING_30       30-39% mis-spelled words in subject

header SUBJ_SPELLING_40         eval:spell_check_subject('40','50')
describe SUBJ_SPELLING_40       40-49% mis-spelled words in subject

header SUBJ_SPELLING_50         eval:spell_check_subject('50','60')
describe SUBJ_SPELLING_50       50-59% mis-spelled words in subject

header SUBJ_SPELLING_60         eval:spell_check_subject('60','70')
describe SUBJ_SPELLING_60       60-69% mis-spelled words in subject

header SUBJ_SPELLING_70         eval:spell_check_subject('70','80')
describe SUBJ_SPELLING_70       70-79% mis-spelled words in subject

header SUBJ_SPELLING_80         eval:spell_check_subject('80','90')
describe SUBJ_SPELLING_80       80-89% mis-spelled words in subject

header SUBJ_SPELLING_90         eval:spell_check_subject('90','100')
describe SUBJ_SPELLING_90       90-99% mis-spelled words in subject

header SUBJ_SPELLING_100        eval:spell_check_subject('100','100')
describe SUBJ_SPELLING_100      100% mis-spelled words in subject


############################################################
# SUBJ_SPELLING_00 -- 2283s/1850h of 10971 corpus, 2003-12-30
#
############################################################
score SUBJ_SPELLING_00 0.5

############################################################
# SUBJ_SPELLING_10 -- 1654s/1084h of 10971 corpus, 2003-12-30
#
############################################################
score SUBJ_SPELLING_10 0.5

############################################################
# SUBJ_SPELLING_20 -- 1274s/1002h of 10971 corpus, 2003-12-30
#
############################################################
score SUBJ_SPELLING_20 0.5

############################################################
# SUBJ_SPELLING_30 -- 495s/484h of 10971 corpus, 2003-12-30
#
############################################################
score SUBJ_SPELLING_30 0.5

############################################################
# SUBJ_SPELLING_40 -- 181s/263h of 10971 corpus, 2003-12-30
#
############################################################
score SUBJ_SPELLING_40 0.5

############################################################
# SUBJ_SPELLING_50 -- 118s/136h of 10971 corpus, 2003-12-30
#
############################################################
score SUBJ_SPELLING_50 0.5

############################################################
# SUBJ_SPELLING_60 -- 51s/61h of 10971 corpus, 2003-12-30
#
############################################################
score SUBJ_SPELLING_60 0.5

############################################################
# SUBJ_SPELLING_70 -- 20s/7h of 10971 corpus, 2003-12-30
#
############################################################
score SUBJ_SPELLING_70 0.5

############################################################
# SUBJ_SPELLING_80 -- 7s/1h of 10971 corpus, 2003-12-30
#
############################################################
score SUBJ_SPELLING_80 0.5

############################################################
# SUBJ_SPELLING_90 -- 0s/0h of 10971 corpus, 2003-12-30
#
############################################################
score SUBJ_SPELLING_90 0.5

############################################################
# SUBJ_SPELLING_100 -- 43s/35h of 10971 corpus, 2003-12-30
#
############################################################
score SUBJ_SPELLING_100 0.5
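
To put the hit counts above in perspective, here is a small sketch that works out what share of each rule's hits were spam. The counts are copied straight from the output above; the helper name spam_share is made up for this illustration and is not part of SpamAssassin.

```perl
use strict;
use warnings;

# Hit counts copied from the mass-check output above: [rule, spam hits, ham hits].
my @hits = (
    [ 'SUBJ_SPELLING_00',  2283, 1850 ],
    [ 'SUBJ_SPELLING_10',  1654, 1084 ],
    [ 'SUBJ_SPELLING_20',  1274, 1002 ],
    [ 'SUBJ_SPELLING_30',   495,  484 ],
    [ 'SUBJ_SPELLING_40',   181,  263 ],
    [ 'SUBJ_SPELLING_50',   118,  136 ],
    [ 'SUBJ_SPELLING_60',    51,   61 ],
    [ 'SUBJ_SPELLING_70',    20,    7 ],
    [ 'SUBJ_SPELLING_80',     7,    1 ],
    [ 'SUBJ_SPELLING_90',     0,    0 ],
    [ 'SUBJ_SPELLING_100',   43,   35 ],
);

# Hypothetical helper: fraction of a rule's hits that were spam,
# or undef when the rule never fired.
sub spam_share {
    my ($spam, $ham) = @_;
    my $total = $spam + $ham;
    return $total ? $spam / $total : undef;
}

for my $row (@hits) {
    my ($rule, $spam, $ham) = @$row;
    my $share = spam_share($spam, $ham);
    printf "%-18s %5ds %5dh  %s\n", $rule, $spam, $ham,
        defined $share ? sprintf('%.1f%% spam', 100 * $share) : 'no hits';
}
```

For example, SUBJ_SPELLING_80 comes out at 87.5% spam, but on only 8 messages in total. Note that this raw share does not correct for the overall spam/ham mix of the corpus, which is not given here.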


if you want to try it for yourself.... here is the sub you can add to
EvalTests.pm.  Make sure you install Text::Pspell from CPAN, and have
aspell/ispell installed....  this is just a quick hack to get an idea of
how well spell checking would help.
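
Before adding the sub, it may be worth confirming that the module actually loads. This is only a sketch: the helper name have_module is made up here, and it does nothing more than try to require the module inside an eval.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the named module can be loaded, 0 otherwise.
sub have_module {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # Text::Pspell -> Text/Pspell.pm
    return eval { require $file; 1 } ? 1 : 0;
}

if (have_module('Text::Pspell')) {
    print "Text::Pspell is installed\n";
}
else {
    print "Text::Pspell is missing, install it from CPAN first\n";
}
```

This only checks that the Perl module loads. It does not check that a usable dictionary is installed for the language you set.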

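# spell_check_subject()
#
# spell_check_subject runs on the Subject: header by default unless
# you pass it a string.  it works out the percentage of words that
# are mis-spelled (ie. words not found in the dict) and returns 1 if
# that percentage is >= min and < max (or equal to max when
# min == max), otherwise 0.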

sub spell_check_subject {

  use Text::Pspell;

  my ($self, $min, $max, $subject) = @_;
  local ($_);
  local $/ = "\n";              # Ensure $/ is set appropriately
  my ($result, @words, $word, $found, $notfound, $total, $notfound_perc);
  if (!defined $subject) {
    $subject = $self->get ('Subject');
    $subject = '' if (!defined $subject);   # no Subject: header at all
    $subject = lc $subject;
  }

  # remove leading and trailing space
  $subject =~ s/^\s+//;
  $subject =~ s/\s+$//;
  # convert extra space to single space
  $subject =~ s/\s+/ /g;

  my $speller = Text::Pspell->new;
  return unless $speller;

  # Set some options
  $speller->set_option('language-tag','en_US');
  $speller->set_option('sug-mode','fast');

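  # split on runs of non-word chars and drop empty strings, so that
  # punctuation like " ! " does not get counted as extra words
  @words = grep { length } split(/\W+/,$subject);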
  $found = 0;
  $notfound = 0;
  $total = 0;
  foreach $word (@words) {
    $result = $speller->check( $word );
    if ($result) {
        $found++;
    }
    else {
        $notfound++;
    }
    $total++;
  }

  if ($total > 0) {
   $notfound_perc = sprintf ("%3.4f", ($notfound / $total) * 100 );
  }
  else {
   $notfound_perc = 0;
  }
  if (($min == $max) && ($notfound_perc == $max)) {
   return(1);
  }

  if (($notfound_perc >= $min) && ($notfound_perc < $max)) {
   return(1);
  }
  else {
   return(0);
  }
}


and here are the rules to add to make use of this eval test....

header SUBJ_SPELLING_00         eval:spell_check_subject('0','10')
describe SUBJ_SPELLING_00       0-9% mis-spelled words in subject

header SUBJ_SPELLING_10         eval:spell_check_subject('10','20')
describe SUBJ_SPELLING_10       10-19% mis-spelled words in subject

header SUBJ_SPELLING_20         eval:spell_check_subject('20','30')
describe SUBJ_SPELLING_20       20-29% mis-spelled words in subject

header SUBJ_SPELLING_30         eval:spell_check_subject('30','40')
describe SUBJ_SPELLING_30       30-39% mis-spelled words in subject

header SUBJ_SPELLING_40         eval:spell_check_subject('40','50')
describe SUBJ_SPELLING_40       40-49% mis-spelled words in subject

header SUBJ_SPELLING_50         eval:spell_check_subject('50','60')
describe SUBJ_SPELLING_50       50-59% mis-spelled words in subject

header SUBJ_SPELLING_60         eval:spell_check_subject('60','70')
describe SUBJ_SPELLING_60       60-69% mis-spelled words in subject

header SUBJ_SPELLING_70         eval:spell_check_subject('70','80')
describe SUBJ_SPELLING_70       70-79% mis-spelled words in subject

header SUBJ_SPELLING_80         eval:spell_check_subject('80','90')
describe SUBJ_SPELLING_80       80-89% mis-spelled words in subject

header SUBJ_SPELLING_90         eval:spell_check_subject('90','100')
describe SUBJ_SPELLING_90       90-99% mis-spelled words in subject

header SUBJ_SPELLING_100        eval:spell_check_subject('100','100')
describe SUBJ_SPELLING_100      100% mis-spelled words in subject

score SUBJ_SPELLING_00 0.5
score SUBJ_SPELLING_10 0.5
score SUBJ_SPELLING_20 0.5
score SUBJ_SPELLING_30 0.5
score SUBJ_SPELLING_40 0.5
score SUBJ_SPELLING_50 0.5
score SUBJ_SPELLING_60 0.5
score SUBJ_SPELLING_70 0.5
score SUBJ_SPELLING_80 0.5
score SUBJ_SPELLING_90 0.5
score SUBJ_SPELLING_100 0.5


now, i know there is a better way to do this, because the way i did it
calls an eval test 11 times for each message (once per rule), which
causes pspell to tie and scan the appropriate language dictionary in
/usr/share/dict/ 11 times...   running the mass-check takes quite long :)

it really should be done in PerMsgStatus.pm, with the percent
calculated once per message and stored in a global var, so that the
eval sub spell_check_subject would just compare min and max against
the stored value.
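
A rough sketch of that idea, under some loud assumptions: MockStatus is a made-up stand-in for the per-message object, %DICT is a tiny word list standing in for the pspell lookup, and the cached value lives in a hash key on the object (subj_misspelled_perc, also made up) rather than a true global. None of this is SpamAssassin's real API. It only shows the shape of compute once, compare many times.

```perl
use strict;
use warnings;

package MockStatus;    # hypothetical stand-in for the per-message status object

# Tiny word list standing in for the pspell dictionary (assumption).
my %DICT = map { $_ => 1 }
    qw(cheap software available would you please just listen to me);

our $lookups = 0;      # counts how often a subject is actually spell checked

sub new {
    my ($class, $subject) = @_;
    return bless { subject => $subject }, $class;
}

sub get {
    my ($self, $header) = @_;
    return $header eq 'Subject' ? $self->{subject} : undef;
}

# Work out the mis-spelled percentage once per message and remember it.
sub subject_misspelled_perc {
    my ($self) = @_;
    return $self->{subj_misspelled_perc}
        if defined $self->{subj_misspelled_perc};

    $lookups++;
    my $subject = $self->get('Subject');
    $subject = '' unless defined $subject;
    my @words = grep { length } split /\W+/, lc $subject;
    my $bad   = grep { !$DICT{$_} } @words;    # count of unknown words
    return $self->{subj_misspelled_perc} = @words ? 100 * $bad / @words : 0;
}

# The eval test now only compares the cached value against the range.
sub spell_check_subject {
    my ($self, $min, $max) = @_;
    my $perc = $self->subject_misspelled_perc;
    return ($perc == $max ? 1 : 0) if $min == $max;
    return ($perc >= $min && $perc < $max) ? 1 : 0;
}

package main;

my $msg = MockStatus->new('cheeap  sooftware avaailable ! lpvapvcijv');

# Run all 11 range checks (0-10, 10-20, ..., 90-100, 100-100) on one message.
my @fired = grep { $msg->spell_check_subject($_, $_ == 100 ? 100 : $_ + 10) }
            map  { $_ * 10 } 0 .. 10;

print "buckets fired: @fired, spell-check passes: $MockStatus::lookups\n";
```

With the real speller swapped in for %DICT, the dictionary would be consulted once per message instead of once per rule, which is where the time in the mass-check is going.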

but seeing the results of these rules, i wasn't really compelled to put
any more work into it...

feedback is welcome...

dallas

