Running "spamassassin -D bayes" will tell you; you should see a line like:

bayes: skipped token 'from' because it's in stopword list for language 'en'

Giovanni

On 12/28/23 15:45, Jimmy wrote:
The pattern has passed the test script, but I still need to check whether Bayes learning will recognize the words matching this pattern and skip them.

Thank you.

On Thu, Dec 28, 2023 at 9:22 PM <giova...@paclan.it> wrote:

On 12/28/23 12:59, Jimmy wrote:
> Hi,
>
> I'm seeking assistance in incorporating a stopword list for Asian languages in Unicode. Although I have comprehensive word lists, my attempts to generate a regex pattern and test it have been unsuccessful; the pattern fails to match or skip tokens in the newly added stopword list.
>
> I created the regex pattern using the following code:
>
> Regexp::Assemble->new->add(@words)->reduce(0)->as_string
>
> Afterward, I converted it to UTF-8 hex.
>
> I'm wondering if there are any tools available to facilitate the creation of these regex patterns.
>

I have used Regexp::Trie to create Bayes stopwords in the past; the code is similar to:

-----------------------------------------------------------------------------------------------------------
use strict;
use warnings;

use Encode;
use Regexp::Trie;

# Read one stopword per line from standard input.
my @input = <STDIN>;

# Build a trie-based regex from all the words.
my $rt = Regexp::Trie->new;
for my $w ( @input ) {
    chomp($w);
    $rt->add($w);
}
my $regexp = $rt->regexp;

# Walk the regex one character at a time (raw bytes when STDIN has no encoding layer).
my @reg = split //, $regexp;
for my $c ( @reg ) {
    my $char = $c;
    my $test;
    # A lone non-ASCII byte is not valid UTF-8, so decoding it croaks.
    eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
    if( $@ ) {
        # Print the byte as a \xNN escape (the backslash appears to have been lost in the archived copy).
        print '\x' . sprintf("%x", ord($c));
    } else {
        # Plain ASCII is printed unchanged.
        print $char;
    }
}
-----------------------------------------------------------------------------------------------------------

Giovanni
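
For anyone who wants to check the escaping step without installing Regexp::Trie, here is a minimal sketch using only core modules. It assumes the stopword regex is matched against raw UTF-8 bytes, as the \xNN output of the script above suggests. The helper name stopword_pattern and the sample words are hypothetical, and it builds a plain alternation rather than an optimized trie.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of SpamAssassin): build one alternation
# in which every non-ASCII byte of each word is written as a \xNN escape.
sub stopword_pattern {
    my @words = @_;
    my @alts;
    for my $word (@words) {
        my $bytes = encode('UTF-8', $word);   # character string -> UTF-8 bytes
        my $alt   = '';
        for my $byte (unpack 'C*', $bytes) {
            if ($byte >= 0x80) {
                $alt .= sprintf('\\x%02x', $byte);   # high byte -> \xNN
            } else {
                $alt .= quotemeta(chr $byte);        # ASCII, escaped if special
            }
        }
        push @alts, $alt;
    }
    return '(?:' . join('|', @alts) . ')';
}

# Sample words, written as code points so the source file stays ASCII:
# U+7684 (Chinese "de") and U+3053 U+308C (Japanese "kore").
my @words   = ("\x{7684}", "\x{3053}\x{308C}");
my $pattern = stopword_pattern(@words);
print "$pattern\n";

# Sanity check: the pattern should match the UTF-8 bytes of each listed word.
for my $word (@words) {
    my $token = encode('UTF-8', $word);
    print $token =~ /^$pattern$/ ? "match\n" : "no match\n";
}
```

If the check prints "no match" for a word that is in the list, the problem is in the pattern itself rather than in Bayes, which narrows things down before looking at the "spamassassin -D bayes" output.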