On 12/28/23 12:59, Jimmy wrote:
Hi,

I'm seeking assistance in incorporating stopwords for Asian languages in Unicode. Although I have comprehensive word lists, my attempts to generate a regex pattern and test it have been unsuccessful: the pattern fails to match or skips tokens in the newly added stopword list. I created the regex pattern using the following code:

    Regexp::Assemble->new->add(@words)->reduce(0)->as_string

Afterward, I converted it to UTF-8 hex. Are there any tools available to facilitate the creation of these regex patterns?
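I have used Regexp::Trie to create Bayes stopwords in the past; the code is similar to:

-----------------------------------------------------------------------------------------------------------
use strict;
use warnings;

use Encode;
use Regexp::Trie;

# Read one stopword per line from STDIN (raw bytes, not decoded).
my @input = <STDIN>;

# Build a trie-based regexp from the word list.
my $rt = Regexp::Trie->new;
for my $w ( @input ) {
    chomp($w);
    $rt->add($w);
}
my $regexp = $rt->regexp;

# Walk the pattern one byte at a time: a byte that decodes as UTF-8 on
# its own (plain ASCII) is printed as-is, anything else as a hex escape.
my @reg = split //, $regexp;
for my $c ( @reg ) {
    my $char = $c;    # keep a copy, since decode() may modify its argument
    my $test;
    eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
    if( $@ ) {
        # Assumption: the backslash was lost when this was posted;
        # a bare 'x' would not form a \xNN regex escape.
        print '\x' . sprintf("%x", ord($c));
    } else {
        print $char;
    }
}
-----------------------------------------------------------------------------------------------------------

  Giovanni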
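
One way to check that a byte-escaped pattern of this kind really matches is to build it from a few known words and test it against encoded tokens. The following is only a rough sketch under some assumptions: the sample words and the helper name escape_bytes are hypothetical, it uses a plain alternation instead of Regexp::Trie or Regexp::Assemble so that it runs with core modules only, and it assumes the tokens reach the regex as undecoded UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample stopwords, given as decoded character strings.
my @words = ("\x{7684}", "\x{4e86}", "\x{662f}");

# Turn one word into its UTF-8 byte sequence, writing every byte
# above 0x7f as a \xNN escape and quoting ASCII bytes literally.
sub escape_bytes {
    my ($word) = @_;
    my $bytes = encode('UTF-8', $word);
    return join '',
        map { $_ < 0x80 ? quotemeta(chr $_) : sprintf('\x%02x', $_) }
        unpack('C*', $bytes);
}

# Simple alternation of the escaped words (no trie optimisation).
my $pattern = join '|', map { escape_bytes($_) } @words;
print "$pattern\n";

# Compile the pattern and try it on UTF-8 encoded tokens.
my $re = qr/\A(?:$pattern)\z/;
for my $token ("\x{7684}", "\x{4e00}") {
    my $bytes = encode('UTF-8', $token);
    printf "U+%04X %s\n", ord($token),
        $bytes =~ $re ? 'matches' : 'does not match';
}
```

If the text being tested has already been decoded to characters, a byte-level pattern like this will not match it, which is one possible cause of a pattern that appears to skip tokens.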