I've seen a lot of junk lately that is severely misspelled in the subject...

Subject: cheeap sooftware avaailable ! lpvapvcijv
Subject: Dallase would you pllease just listten to me

So... I hacked up an eval test to call pspell on the subject line of each message. Here are the results from running it against my corpus (s = spam hits, h = ham hits). As you can see, there is nothing overwhelming about these results...

############################################################
# Tue Dec 30 10:06:17 CST 2003 -- beginning test of testrule.SPELLING.txt:

header SUBJ_SPELLING_00 eval:spell_check_subject('0','10')
describe SUBJ_SPELLING_00 0-9% mis-spelled words in subject
header SUBJ_SPELLING_10 eval:spell_check_subject('10','20')
describe SUBJ_SPELLING_10 10-19% mis-spelled words in subject
header SUBJ_SPELLING_20 eval:spell_check_subject('20','30')
describe SUBJ_SPELLING_20 20-29% mis-spelled words in subject
header SUBJ_SPELLING_30 eval:spell_check_subject('30','40')
describe SUBJ_SPELLING_30 30-39% mis-spelled words in subject
header SUBJ_SPELLING_40 eval:spell_check_subject('40','50')
describe SUBJ_SPELLING_40 40-49% mis-spelled words in subject
header SUBJ_SPELLING_50 eval:spell_check_subject('50','60')
describe SUBJ_SPELLING_50 50-59% mis-spelled words in subject
header SUBJ_SPELLING_60 eval:spell_check_subject('60','70')
describe SUBJ_SPELLING_60 60-69% mis-spelled words in subject
header SUBJ_SPELLING_70 eval:spell_check_subject('70','80')
describe SUBJ_SPELLING_70 70-79% mis-spelled words in subject
header SUBJ_SPELLING_80 eval:spell_check_subject('80','90')
describe SUBJ_SPELLING_80 80-89% mis-spelled words in subject
header SUBJ_SPELLING_90 eval:spell_check_subject('90','100')
describe SUBJ_SPELLING_90 90-99% mis-spelled words in subject
header SUBJ_SPELLING_100 eval:spell_check_subject('100','100')
describe SUBJ_SPELLING_100 100% mis-spelled words in subject

############################################################
# SUBJ_SPELLING_00 -- 2283s/1850h of 10971 corpus, 2003-12-30 #
############################################################
score SUBJ_SPELLING_00 0.5
############################################################
# SUBJ_SPELLING_10 -- 1654s/1084h of 10971 corpus, 2003-12-30 #
############################################################
score SUBJ_SPELLING_10 0.5
############################################################
# SUBJ_SPELLING_20 -- 1274s/1002h of 10971 corpus, 2003-12-30 #
############################################################
score SUBJ_SPELLING_20 0.5
############################################################
# SUBJ_SPELLING_30 -- 495s/484h of 10971 corpus, 2003-12-30 #
############################################################
score SUBJ_SPELLING_30 0.5
############################################################
# SUBJ_SPELLING_40 -- 181s/263h of 10971 corpus, 2003-12-30 #
############################################################
score SUBJ_SPELLING_40 0.5
############################################################
# SUBJ_SPELLING_50 -- 118s/136h of 10971 corpus, 2003-12-30 #
############################################################
score SUBJ_SPELLING_50 0.5
############################################################
# SUBJ_SPELLING_60 -- 51s/61h of 10971 corpus, 2003-12-30 #
############################################################
score SUBJ_SPELLING_60 0.5
############################################################
# SUBJ_SPELLING_70 -- 20s/7h of 10971 corpus, 2003-12-30 #
############################################################
score SUBJ_SPELLING_70 0.5
############################################################
# SUBJ_SPELLING_80 -- 7s/1h of 10971 corpus, 2003-12-30 #
############################################################
score SUBJ_SPELLING_80 0.5
############################################################
# SUBJ_SPELLING_90 -- 0s/0h of 10971 corpus, 2003-12-30 #
############################################################
score SUBJ_SPELLING_90 0.5
############################################################
# SUBJ_SPELLING_100 -- 43s/35h of 10971 corpus, 2003-12-30 #
############################################################
score SUBJ_SPELLING_100 0.5

If you want to try it for yourself, here is the sub you can add to EvalTests.pm. Make sure you install Text::Pspell from CPAN and have aspell/ispell installed. This is just a quick hack to get an idea of how well spell checking would help.

# spell_check_subject()
#
# Runs on the Subject: header by default unless you pass it a
# string. Computes the percentage of words that are mis-spelled
# (i.e. words not found in the dict) and returns 1 if that
# percentage is >= $min and < $max, otherwise 0. When $min == $max,
# it returns 1 only on an exact match (used for the 100% rule).
sub spell_check_subject {
  use Text::Pspell;
  my ($self, $min, $max, $subject) = @_;
  local ($_);
  local $/ = "\n";   # Ensure $/ is set appropriately

  if (!defined $subject) {
    $subject = $self->get ('Subject');
    $subject = '' if (!defined $subject);   # message has no Subject: header
    $subject = lc $subject;
  }

  # remove leading and trailing space
  $subject =~ s/^\s+//;
  $subject =~ s/\s+$//;
  # convert extra space to single space
  $subject =~ s/\s+/ /g;

  my $speller = Text::Pspell->new;
  return 0 unless $speller;

  # Set some options
  $speller->set_option('language-tag','en_US');
  $speller->set_option('sug-mode','fast');

  # Split on runs of non-word characters and drop empty strings, so that
  # punctuation such as " ! " does not get counted as extra "words".
  my @words = grep { length } split(/\W+/, $subject);

  my $notfound = 0;
  my $total = 0;
  foreach my $word (@words) {
    $notfound++ unless $speller->check($word);
    $total++;
  }

  my $notfound_perc = 0;
  if ($total > 0) {
    $notfound_perc = sprintf ("%3.4f", ($notfound / $total) * 100);
  }

  if (($min == $max) && ($notfound_perc == $max)) { return(1); }
  if (($notfound_perc >= $min) && ($notfound_perc < $max)) { return(1); }
  return(0);
}

And here are the rules to add to make use of this eval test...

header SUBJ_SPELLING_00 eval:spell_check_subject('0','10')
describe SUBJ_SPELLING_00 0-9% mis-spelled words in subject
header SUBJ_SPELLING_10 eval:spell_check_subject('10','20')
describe SUBJ_SPELLING_10 10-19% mis-spelled words in subject
header SUBJ_SPELLING_20 eval:spell_check_subject('20','30')
describe SUBJ_SPELLING_20 20-29% mis-spelled words in subject
header SUBJ_SPELLING_30 eval:spell_check_subject('30','40')
describe SUBJ_SPELLING_30 30-39% mis-spelled words in subject
header SUBJ_SPELLING_40 eval:spell_check_subject('40','50')
describe SUBJ_SPELLING_40 40-49% mis-spelled words in subject
header SUBJ_SPELLING_50 eval:spell_check_subject('50','60')
describe SUBJ_SPELLING_50 50-59% mis-spelled words in subject
header SUBJ_SPELLING_60 eval:spell_check_subject('60','70')
describe SUBJ_SPELLING_60 60-69% mis-spelled words in subject
header SUBJ_SPELLING_70 eval:spell_check_subject('70','80')
describe SUBJ_SPELLING_70 70-79% mis-spelled words in subject
header SUBJ_SPELLING_80 eval:spell_check_subject('80','90')
describe SUBJ_SPELLING_80 80-89% mis-spelled words in subject
header SUBJ_SPELLING_90 eval:spell_check_subject('90','100')
describe SUBJ_SPELLING_90 90-99% mis-spelled words in subject
header SUBJ_SPELLING_100 eval:spell_check_subject('100','100')
describe SUBJ_SPELLING_100 100% mis-spelled words in subject

score SUBJ_SPELLING_00 0.5
score SUBJ_SPELLING_10 0.5
score SUBJ_SPELLING_20 0.5
score SUBJ_SPELLING_30 0.5
score SUBJ_SPELLING_40 0.5
score SUBJ_SPELLING_50 0.5
score SUBJ_SPELLING_60 0.5
score SUBJ_SPELLING_70 0.5
score SUBJ_SPELLING_80 0.5
score SUBJ_SPELLING_90 0.5
score SUBJ_SPELLING_100 0.5

Now, I know there is a better way to do this: the way I did it, the eval test is called once per rule (11 times) for each message, which causes pspell to tie and scan the appropriate language dictionary in /usr/share/dict/ every time...
Running the mass-check takes quite a long time :) It really should be done in PerMsgStatus.pm, with the percentage calculated once and stored in a global var; the eval sub spell_check_subject would then just compare min and max against that stored value. But seeing the results of these rules, I wasn't really compelled to put any more work into it...

Feedback is welcome...

dallas

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
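
A rough sketch of the "calculate once, compare many times" idea from the last paragraph of the post, not tested against SpamAssassin itself. To keep it runnable without aspell, a tiny hard-coded word list (%DICT) stands in for Text::Pspell, and a plain hash ref ($msg) stands in for the per-message status object. Those two, plus the names misspelled_percent, subj_misspelled_pct and $checks_run, are assumptions made up for illustration and are not part of SpamAssassin.

```perl
use strict;
use warnings;

# ASSUMPTION: tiny stand-in word list instead of a real Text::Pspell speller.
my %DICT = map { $_ => 1 }
    qw(would you please just listen to me cheap software available);

my $checks_run = 0;    # how many times the expensive word-by-word pass ran

# Percentage of words in $subject that are not in the dictionary.
sub misspelled_percent {
    my ($subject) = @_;
    $checks_run++;
    my @words = grep { length } split /\W+/, lc $subject;
    return 0 unless @words;
    my $unknown = grep { !$DICT{$_} } @words;    # count in scalar context
    return 100 * $unknown / @words;
}

# Eval-style test: the percentage is computed on the first call for a
# message and cached on $msg (hypothetical key), so later calls only compare.
sub spell_check_subject {
    my ($msg, $min, $max) = @_;
    $msg->{subj_misspelled_pct} = misspelled_percent($msg->{subject})
        unless defined $msg->{subj_misspelled_pct};
    my $pct = $msg->{subj_misspelled_pct};
    return 1 if $min == $max && $pct == $max;        # exact match (100% rule)
    return ($pct >= $min && $pct < $max) ? 1 : 0;    # half-open range
}

# The same 11 ranges as the rules in the post: 0-10, 10-20, ..., 100-100.
my @ranges = ((map { [$_ * 10, $_ * 10 + 10] } 0 .. 9), [100, 100]);

my $msg  = { subject => 'Dallase would you pllease just listten to me' };
my @hits = grep { spell_check_subject($msg, @$_) } @ranges;

printf "matched %d-%d; word pass ran %d time(s)\n",
    $hits[0][0], $hits[0][1], $checks_run;
```

With this stand-in dictionary, 3 of the 8 words in the sample subject are unknown (37.5%), so only the 30-40 range matches, and the word-by-word pass runs once even though all 11 ranges are tested. In a real version the cached value would have to live on whatever per-message object SpamAssassin hands to the eval test, so that it is reset for each new message.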