"ทุก" is not considered a word because it is part of the token
"ทุกวันพุธเล่นชนะรับเพิ่ม".
Words must be separated by spaces; otherwise we would have to skip the word "theme" just because
"the" is in the English stopword list.
I have no idea whether this makes sense for Asian languages.
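
A minimal sketch of that whole-token behaviour, assuming tokens are split on
whitespace and a stopword only counts when it matches the entire token. The
stopword patterns and the function name skipped_tokens are made up for
illustration; this is not the Bayes.pm code.

```python
import re

# Hypothetical, tiny stopword patterns used only for this illustration.
STOPWORDS = {
    "en": re.compile(r"(?:the|from)"),
    "th": re.compile(r"(?:ทุก|ถูก|เลย)"),
}

def skipped_tokens(text, lang):
    """Return the whitespace-delimited tokens that are, in full, a stopword."""
    pattern = STOPWORDS[lang]
    # fullmatch() requires the whole token to match, so "theme" is kept
    # even though it starts with the stopword "the".
    return [tok for tok in text.split() if pattern.fullmatch(tok)]

print(skipped_tokens("the theme from", "en"))            # ['the', 'from']
print(skipped_tokens("ทุกวันพุธเล่นชนะรับเพิ่ม", "th"))     # []
print(skipped_tokens("ทุก วันพุธ", "th"))                 # ['ทุก']
```

The Thai sentence has no spaces, so it arrives as a single token and "ทุก"
never gets the chance to match on its own.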

 Giovanni

On 12/29/23 11:04, Jimmy wrote:

The sample email and word list should contain at least these words.

ถูก
เลย
ทุก

Jimmy

On Fri, Dec 29, 2023 at 4:47 PM <giova...@paclan.it> wrote:

    I do not speak Thai, but I cannot see any word in the sample email that should match that list.
    Which word do you think should match the regexp?
       Giovanni

    On 12/29/23 10:08, Jimmy wrote:
     > You can use this word list
     >
     > https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
     >
     > Jimmy
     >
     > On Fri, Dec 29, 2023 at 3:59 PM <giova...@paclan.it> wrote:
     >
     >     To create the stopwords regexp I used the script I shared in a previous email and a list of words, one per line.
     >     Could you share the list you are using?
     >
     >         Giovanni
     >
     >     On 12/29/23 09:22, Jimmy wrote:
     >      > I use SpamAssassin 4.0.0 (2022-12-14)
     >      >
     >      > $ spamassassin -D --lint 2>&1 | grep bayes:
     >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
     >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
     >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
     >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
     >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
     >      > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
     >      > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
     >      > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
     >      > Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
     >      >
     >      >
     >      > $ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
     >      > Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'
     >      >
     >      > You can use "บาท", which is listed in the regexp pattern, but I don't know why bayes does not show it as a skipped token.
     >      >
     >      > Jimmy
     >      >
     >      >
     >      > On Fri, Dec 29, 2023 at 2:59 PM <giova...@paclan.it> wrote:
     >      >
     >      >     The config line produces a syntax error for me:
     >      >     config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1): bayes_stopword_th
     >      >
     >      >     Could you share the word list in UTF-8?
     >      >     I tried adding "บาท" to
     >      >     https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
     >      >     and it produces a working regexp.
     >      >     Bayes stopword languages must also be enabled using the "bayes_stopword_languages" config keyword; by default only English is enabled.
     >      >        Giovanni
     >      >
     >      >     On 12/28/23 17:06, Jimmy wrote:
     >      >      > bayes_stopword_th https://pastebin.pl/view/0838138d
     >      >      > Sample mail https://pastebin.pl/view/e5a2c5b8
     >      >      >
     >      >      > Jimmy
     >      >      >
     >      >      >
     >      >      > On Thu, Dec 28, 2023 at 10:59 PM <giova...@paclan.it> wrote:
     >      >      >
     >      >      >     Could you share the config line and a sample you are using?
     >      >      >        Giovanni
     >      >      >
     >      >      >     On 12/28/23 16:26, Jimmy wrote:
     >      >      >      > Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate why it is not being skipped. I suspect that if words are not separated by spaces, longer words may not match those patterns.
     >      >      >      >
     >      >      >      > Jimmy
     >      >      >      >
     >      >      >      > On Thu, Dec 28, 2023 at 10:13 PM <giova...@paclan.it> wrote:
     >      >      >      >
     >      >      >      >     "spamassassin -D bayes" will tell you; you should see a line like:
     >      >      >      >     bayes: skipped token 'from' because it's in stopword list for language 'en'
     >      >      >      >
     >      >      >      >        Giovanni
     >      >      >      >
     >      >      >      >     On 12/28/23 15:45, Jimmy wrote:
     >      >      >      >      > The pattern passes the test script, but I still need to check whether Bayes learning will identify and skip the words that match this pattern.
     >      >      >      >      >
     >      >      >      >      > Thank you.
     >      >      >      >      >
     >      >      >      >      >
     >      >      >      >      > On Thu, Dec 28, 2023 at 9:22 PM <giova...@paclan.it> wrote:
     >      >      >      >      >
     >      >      >      >      >     On 12/28/23 12:59, Jimmy wrote:
     >      >      >      >      >      > Hi,
     >      >      >      >      >      >
     >      >      >      >      >      > I'm seeking assistance in adding stopwords for Asian languages in Unicode. Although I have comprehensive word lists, my attempts to generate a regex pattern and test it have been unsuccessful: the pattern fails to match, so tokens in the newly added stopword list are not skipped.
     >      >      >      >      >      >
     >      >      >      >      >      > I created the regex pattern using the following code:
     >      >      >      >      >      >
     >      >      >      >      >      > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
     >      >      >      >      >      >
     >      >      >      >      >      > Afterward, I converted it to UTF-8 hex.
     >      >      >      >      >      >
     >      >      >      >      >      > I'm wondering if there are any tools available to facilitate the creation of these regex patterns.
     >      >      >      >      >      >
     >      >      >      >      >     I have used Regexp::Trie to create Bayes stopwords in the past; the code is similar to:
     >      >      >      >      >     -----------------------------------------------------------------------------------------------------------
     >      >      >      >      >     use strict;
     >      >      >      >      >     use warnings;
     >      >      >      >      >
     >      >      >      >      >     use Encode;
     >      >      >      >      >     use Regexp::Trie;
     >      >      >      >      >
     >      >      >      >      >     my @input = <STDIN>;
     >      >      >      >      >     my $rt = Regexp::Trie->new;
     >      >      >      >      >     for my $w ( @input ) {
     >      >      >      >      >         chomp($w);
     >      >      >      >      >         $rt->add($w);
     >      >      >      >      >     }
     >      >      >      >      >     my $regexp = $rt->regexp;
     >      >      >      >      >     my @reg = split //, $regexp;
     >      >      >      >      >     for my $c ( @reg ) {
     >      >      >      >      >         my $char = $c;
     >      >      >      >      >         my $test;
     >      >      >      >      >         eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
     >      >      >      >      >         if( $@ ) {
     >      >      >      >      >           # byte is not valid UTF-8 on its own: emit it as a \xNN escape
     >      >      >      >      >           print '\x' . sprintf("%02x", ord($c));
     >      >      >      >      >         } else {
     >      >      >      >      >           print $char;
     >      >      >      >      >         }
     >      >      >      >      >     }
     >      >      >      >      >     -----------------------------------------------------------------------------------------------------------
     >      >      >      >      >
     >      >      >      >      >        Giovanni
     >      >      >      >      >
     >      >      >      >
     >      >      >
     >      >
     >
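
For reference, the conversion step discussed in the thread (a word list, one
per line, turned into a single regexp with non-ASCII bytes written as hex
escapes) can be sketched in Python. This is not the Regexp::Trie script: it
builds a plain alternation instead of a trie, and the function name
build_stopword_regex is made up for illustration.

```python
import re

def build_stopword_regex(words):
    """Join words into one alternation; non-ASCII UTF-8 bytes become \\xNN."""
    alternatives = []
    for word in words:
        escaped = "".join(
            # ASCII bytes are kept (regex-escaped); other bytes are hex-escaped.
            re.escape(chr(b)) if b < 0x80 else "\\x%02x" % b
            for b in word.encode("utf-8")
        )
        alternatives.append(escaped)
    return "(?:" + "|".join(alternatives) + ")"

pattern = build_stopword_regex(["บาท", "the"])
print(pattern)

# The escaped pattern is pure ASCII and matches the UTF-8 bytes of the word.
compiled = re.compile(pattern.encode("ascii"))
print(bool(compiled.fullmatch("บาท".encode("utf-8"))))
```

A trie-based builder produces a more compact pattern for long lists, so this
plain alternation is only meant to show the escaping step.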

