"ทุก" is not considered a word because it's part of the token "ทุกวันพุธเล่นชนะรับเพิ่ม". Words must be separated by spaces, otherwise we should skip the word "theme" just because "the" is in english stopword list. No idea if this makes sense for asian languages.
Giovanni

On 12/29/23 11:04, Jimmy wrote:
The sample email and word list should contain at least these words.

ถูก เลย ทุก

Jimmy

On Fri, Dec 29, 2023 at 4:47 PM <giova...@paclan.it> wrote:

I do not speak Thai, but I cannot see any word in the sample email that should match that list.
Which word do you think should match the regexp?

Giovanni

On 12/29/23 10:08, Jimmy wrote:
> You can use this word list
>
> https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
>
> Jimmy
>
> On Fri, Dec 29, 2023 at 3:59 PM <giova...@paclan.it> wrote:
>
> To create the stopwords regexp I used the script I shared in a previous email and a list of words, one per line.
> Could you share the list you are using?
>
> Giovanni
>
> On 12/29/23 09:22, Jimmy wrote:
> > I use SpamAssassin 4.0.0 (2022-12-14)
> >
> > $ spamassassin -D --lint 2>&1 | grep bayes:
> > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
> > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
> > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
> > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
> > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
> > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
> > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
> > Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
> >
> > $ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
> > Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'
> >
> > You can use "บาท", which is listed in the regexp pattern, but I don't know why Bayes does not report it as a skipped token.
> >
> > Jimmy
> >
> > On Fri, Dec 29, 2023 at 2:59 PM <giova...@paclan.it> wrote:
> >
> > Config line produces a syntax error for me:
> > config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1): bayes_stopword_th
> >
> > Could you share the word list in utf8?
> > I tried adding "บาท" to https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt and it produces a working regexp.
> > Bayes stopword languages must also be enabled using the "bayes_stopword_languages" config keyword; by default only English is enabled.
> >
> > Giovanni
> >
> > On 12/28/23 17:06, Jimmy wrote:
> > > bayes_stopword_th https://pastebin.pl/view/0838138d
> > > Sample mail https://pastebin.pl/view/e5a2c5b8
> > >
> > > Jimmy
> > >
> > > On Thu, Dec 28, 2023 at 10:59 PM <giova...@paclan.it> wrote:
> > >
> > > Could you share a config line and a sample you are using?
> > > Giovanni
> > >
> > > On 12/28/23 16:26, Jimmy wrote:
> > > > Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate why it is not being skipped. I suspect that if words are not separated by spaces, longer words may not match those patterns.
> > > >
> > > > Jimmy
> > > >
> > > > On Thu, Dec 28, 2023 at 10:13 PM <giova...@paclan.it> wrote:
> > > >
> > > > "spamassassin -D bayes" will tell you, you should see a line like:
> > > > bayes: skipped token 'from' because it's in stopword list for language 'en'
> > > >
> > > > Giovanni
> > > >
> > > > On 12/28/23 15:45, Jimmy wrote:
> > > > > The pattern has successfully passed the test script, but I still need to check whether Bayes learning will identify and exclude words matching this pattern.
> > > > >
> > > > > Thank you.
> > > > >
> > > > > On Thu, Dec 28, 2023 at 9:22 PM <giova...@paclan.it> wrote:
> > > > >
> > > > > On 12/28/23 12:59, Jimmy wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I'm seeking assistance in incorporating a stopword list for Asian languages in Unicode. Although I possess comprehensive word lists, my attempts to generate a regex pattern and test it have been unsuccessful; the pattern fails to match or skip tokens in the newly added stopword list.
> > > > > >
> > > > > > I created the regex pattern using the following code:
> > > > > >
> > > > > > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
> > > > > >
> > > > > > Afterward, I converted it to UTF-8 hex.
> > > > > >
> > > > > > I'm wondering if there are any tools available to facilitate the creation of these regex patterns.
> > > > > >
> > > > > I have used Regexp::Trie to create Bayes stopwords in the past, code is similar to:
> > > > > -----------------------------------------------------------------------------------------------------------
> > > > > use strict;
> > > > > use warnings;
> > > > >
> > > > > use Encode;
> > > > > use Regexp::Trie;
> > > > >
> > > > > # Read the word list (one word per line) as raw bytes.
> > > > > my @input = <STDIN>;
> > > > > my $rt = Regexp::Trie->new;
> > > > > for my $w ( @input ) {
> > > > >   chomp($w);
> > > > >   $rt->add($w);
> > > > > }
> > > > > my $regexp = $rt->regexp;
> > > > > my @reg = split //, $regexp;
> > > > > for my $c ( @reg ) {
> > > > >   my $char = $c;   # keep a copy, decode() may modify its argument
> > > > >   my $test;
> > > > >   eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
> > > > >   if( $@ ) {
> > > > >     # Byte is not valid UTF-8 on its own: print it as a \x hex escape.
> > > > >     # (The archived copy shows 'x' here; the backslash was presumably lost.)
> > > > >     print '\x' . sprintf("%x", ord($c));
> > > > >   } else {
> > > > >     print $char;
> > > > >   }
> > > > > }
> > > > > -----------------------------------------------------------------------------------------------------------
> > > > >
> > > > > Giovanni
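
The "UTF-8 hex" conversion mentioned in the thread can be sanity-checked in isolation. The sketch below uses only core modules and is not the script quoted above; `to_hex_pattern` is a hypothetical helper, and comparing tokens as raw UTF-8 bytes is an assumption about how the body text reaches the tokenizer.

```perl
use strict;
use warnings;
use utf8;                        # the Thai literal below is UTF-8 source text
use Encode qw(encode);

# Turn one word into a byte-level pattern: ASCII bytes stay literal
# (regex-quoted), every non-ASCII UTF-8 byte becomes a \xNN escape.
sub to_hex_pattern {
    my ($word) = @_;
    my $bytes = encode('UTF-8', $word);
    return join '', map {
        ord($_) < 0x80 ? quotemeta($_) : sprintf('\\x%02x', ord($_))
    } split //, $bytes;
}

my $pattern = to_hex_pattern('บาท');
print "$pattern\n";

# Anchored match against the UTF-8 bytes of a candidate token
# (assumption: tokens are compared as bytes, not decoded characters).
my $re    = qr/^(?:$pattern)$/;
my $token = encode('UTF-8', 'บาท');
print $token =~ $re ? "match\n" : "no match\n";
```

For "บาท" this prints the escape sequence `\xe0\xb8\x9a\xe0\xb8\xb2\xe0\xb8\x97` followed by "match". A longer token that merely contains the word would not match, since the pattern is anchored at both ends.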