I do not speak Thai, but I cannot see any word in the sample email that should 
match that list.
Which word do you think should match the regexp?
 Giovanni
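
One way to answer that question mechanically is to check each word of the list against the sample text. The following is only a minimal sketch: `classify_stopwords` is a hypothetical helper, and splitting on whitespace is an assumption that does not reproduce SpamAssassin's own tokenizer. It separates words that appear as a whole token from words that only occur inside a longer run of characters, which is the case suspected earlier in the thread for text written without spaces.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin): for each stopword, report
# whether it occurs in the text as a whole whitespace-delimited token, only
# as part of a longer token, or not at all.
sub classify_stopwords {
    my ($text, @words) = @_;
    my %tokens = map { $_ => 1 } split ' ', $text;
    my %result;
    for my $w (@words) {
        if    ($tokens{$w})           { $result{$w} = 'token'; }
        elsif (index($text, $w) >= 0) { $result{$w} = 'substring'; }
        else                          { $result{$w} = 'absent'; }
    }
    return %result;
}

binmode(STDOUT, ':encoding(UTF-8)');

# The Thai word for "baht", written with code point escapes so that this
# file stays plain ASCII.
my $baht = "\x{0e1a}\x{0e32}\x{0e17}";

# Illustrative sample strings (assumptions): one with the word standing
# alone, one with the word glued to the preceding characters.
my %samples = (
    spaced   => "100 $baht",
    unspaced => "\x{0e23}\x{0e32}\x{0e04}\x{0e32}100$baht",
);

for my $name (sort keys %samples) {
    my %r = classify_stopwords($samples{$name}, $baht);
    print "$name: $baht => $r{$baht}\n";
}
```

If a word is only ever reported as a substring, a pattern that is matched against whole tokens would not be expected to skip it.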

On 12/29/23 10:08, Jimmy wrote:
You can use this word list

https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt 

Jimmy

On Fri, Dec 29, 2023 at 3:59 PM <giova...@paclan.it> wrote:

    To create the stopwords regexp I used the script I shared in a previous email and a list of words, one per line.
    Could you share the list you are using?

        Giovanni

    On 12/29/23 09:22, Jimmy wrote:
     > I use SpamAssassin 4.0.0 (2022-12-14)
     >
     > $ spamassassin -D --lint 2>&1 | grep bayes:
     > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
     > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
     > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
     > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
     > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
     > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
     > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
     > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
     > Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
     >
     >
     > $ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
     > Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'
     >
     > You can use "บาท", which is listed in the regexp pattern, but I don't know why it does not show up as a skipped token in the Bayes debug output.
     >
     > Jimmy
     >
     >
     > On Fri, Dec 29, 2023 at 2:59 PM <giova...@paclan.it> wrote:
     >
     >     Config line produces a syntax error for me:
     >     config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1): bayes_stopword_th
     >
     >     Could you share the word list in UTF-8?
     >     I tried adding "บาท" to https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt and it produces a working regexp.
     >     Bayes stopword languages must also be enabled using the "bayes_stopword_languages" config keyword; by default only English is enabled.
     >        Giovanni
     >
     >     On 12/28/23 17:06, Jimmy wrote:
     >      > bayes_stopword_th https://pastebin.pl/view/0838138d
     >      > Sample mail https://pastebin.pl/view/e5a2c5b8
     >      >
     >      > Jimmy
     >      >
     >      >
     >      > On Thu, Dec 28, 2023 at 10:59 PM <giova...@paclan.it> wrote:
     >      >
     >      >     Could you share the config line and a sample message you are using?
     >      >        Giovanni
     >      >
     >      >     On 12/28/23 16:26, Jimmy wrote:
     >      >      > Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate why it is not being skipped. I suspect that if words are not separated by spaces, longer words may not match those patterns.
     >      >      >
     >      >      > Jimmy
     >      >      >
     >      >      > On Thu, Dec 28, 2023 at 10:13 PM <giova...@paclan.it> wrote:
     >      >      >
     >      >      >     "spamassassin -D bayes" will tell you; you should see a line like:
     >      >      >     bayes: skipped token 'from' because it's in stopword list for language 'en'
     >      >      >
     >      >      >        Giovanni
     >      >      >
     >      >      >     On 12/28/23 15:45, Jimmy wrote:
     >      >      >      > The pattern has passed the test script, but I still need to check whether Bayes learning will recognize and skip the words that match this pattern.
     >      >      >      >
     >      >      >      > Thank you.
     >      >      >      >
     >      >      >      >
     >      >      >      > On Thu, Dec 28, 2023 at 9:22 PM <giova...@paclan.it> wrote:
     >      >      >      >
     >      >      >      >     On 12/28/23 12:59, Jimmy wrote:
     >      >      >      >      > Hi,
     >      >      >      >      >
     >      >      >      >      > I'm seeking assistance in adding a stopword list for Asian languages in Unicode. Although I have comprehensive word lists, my attempts to generate a regex pattern and test it have been unsuccessful; the pattern fails to match, so tokens in the newly added stopword list are not skipped.
     >      >      >      >      >
     >      >      >      >      > I created the regex pattern using the following code:
     >      >      >      >      >
     >      >      >      >      > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
     >      >      >      >      >
     >      >      >      >      > Afterward, I converted it to UTF-8 hex.
     >      >      >      >      >
     >      >      >      >      > I'm wondering if there are any tools available to facilitate the creation of these regex patterns.
     >      >      >      >      >
     >      >      >      >     I have used Regexp::Trie to create Bayes stopwords in the past; the code is similar to:
     >      >      >      >     
-----------------------------------------------------------------------------------------------------------
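     >      >      >      >     use strict;
     >      >      >      >     use warnings;
     >      >      >      >
     >      >      >      >     use Encode;
     >      >      >      >     use Regexp::Trie;
     >      >      >      >
     >      >      >      >     # Read the word list from STDIN, one word per line (raw bytes).
     >      >      >      >     my @input = <STDIN>;
     >      >      >      >     my $rt = Regexp::Trie->new;
     >      >      >      >     for my $w ( @input ) {
     >      >      >      >         chomp($w);
     >      >      >      >         $rt->add($w);
     >      >      >      >     }
     >      >      >      >     my $regexp = $rt->regexp;
     >      >      >      >     # Walk the regexp one byte at a time.
     >      >      >      >     my @reg = split //, $regexp;
     >      >      >      >     for my $c ( @reg ) {
     >      >      >      >         # Keep a copy: decode() with a CHECK value may modify its argument.
     >      >      >      >         my $char = $c;
     >      >      >      >         # A lone non-ASCII byte is not valid UTF-8, so decode() croaks on it.
     >      >      >      >         my $ok = eval { decode( 'utf8', $c, Encode::FB_CROAK ); 1 };
     >      >      >      >         if( !$ok ) {
     >      >      >      >           # Print the byte as a \xNN hex escape.
     >      >      >      >           print '\x' . sprintf("%x", ord($char));
     >      >      >      >         } else {
     >      >      >      >           print $char;
     >      >      >      >         }
     >      >      >      >     }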
     >      >      >      >     
-----------------------------------------------------------------------------------------------------------
     >      >      >      >
     >      >      >      >        Giovanni
     >      >      >      >
     >      >      >
     >      >
     >
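
The "convert it to UTF-8 hex" step discussed above can also be checked in isolation. This is a minimal sketch, not the code used by anyone in the thread: `escape_utf8_bytes` is a hypothetical helper, and matching the escaped pattern against the raw bytes with `^...$` anchors is an assumption made for the test, not a description of how SpamAssassin applies its stopword patterns.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: escape every non-ASCII byte of a UTF-8 encoded byte
# string as \xNN and leave ASCII bytes untouched.
sub escape_utf8_bytes {
    my ($bytes) = @_;
    $bytes =~ s/([\x80-\xff])/sprintf('\\x%02x', ord($1))/ge;
    return $bytes;
}

# The Thai word for "baht", written with code point escapes so that this
# file stays plain ASCII.
my $word    = "\x{0e1a}\x{0e32}\x{0e17}";
my $bytes   = encode('UTF-8', $word);
my $escaped = escape_utf8_bytes($bytes);
print "$escaped\n";

# Compile the escaped form as a regexp and confirm that it matches the raw
# UTF-8 bytes of the same word.
my $re = qr/^(?:$escaped)$/;
print "matches raw bytes: ", ( $bytes =~ $re ? "yes" : "no" ), "\n";
```

Running a check like this on a single word before assembling the full list makes it easier to tell an escaping problem apart from a tokenization problem.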

