On 12/28/23 12:59, Jimmy wrote:
Hi,

I'm seeking assistance in incorporating a stopword list for Asian languages in 
Unicode. Although I possess comprehensive word lists, my attempts to generate a 
regex pattern and test it have been unsuccessful; the pattern fails to match or 
skips tokens in the newly added stopword list.

I created the regex pattern using the following code:

Regexp::Assemble->new->add(@words)->reduce(0)->as_string

Afterward, I converted it to UTF-8 hex.

I'm wondering if there are any tools available to facilitate the creation of 
these regex patterns.

I have used Regexp::Trie to create Bayes stopwords in the past; the code is 
similar to:
-----------------------------------------------------------------------------------------------------------
use strict;
use warnings;

use Encode;
use Regexp::Trie;

my @input = <STDIN>;
my $rt = Regexp::Trie->new;
for my $w ( @input ) {
  chomp($w);
  $rt->add($w);
}
my $regexp = $rt->regexp;
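# Walk the pattern one byte at a time: ASCII is printed as-is,
# any byte that is not valid UTF-8 on its own is printed as \xNN.
my @reg = split //, $regexp;
for my $c ( @reg ) {
  my $char = $c;   # keep a copy: decode() with a CHECK value may modify $c
  my $test = eval { decode( 'utf8', $c, Encode::FB_CROAK ) };
  if( $@ ) {
    print '\x' . sprintf("%x", ord($c));
  } else {
    print $char;
  }
}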
-----------------------------------------------------------------------------------------------------------
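
One way to check a generated pattern before adding it to the stopword list is to match it against the UTF-8 bytes of a few known words. The following is only a minimal sketch and not part of the script above: hex_escape_bytes, build_pattern and matches_token are hypothetical helper names, a plain alternation stands in for the trie-optimized regex, and the two sample words are purely illustrative. It assumes the pattern will be matched against raw UTF-8 bytes rather than decoded characters.

```perl
use strict;
use warnings;

use Encode qw(encode);

# Hypothetical helper: turn a byte string into a regex fragment in which
# every byte other than an ASCII letter, digit or underscore becomes \xNN.
# The argument must already be encoded bytes, not decoded characters.
sub hex_escape_bytes {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9_])/sprintf('\\x%02x', ord($1))/ge;
    return $bytes;
}

# Hypothetical helper: build a simple alternation from decoded words.
# This is a stand-in for the optimized pattern Regexp::Trie would produce.
sub build_pattern {
    my (@words) = @_;
    my @alts = map { hex_escape_bytes( encode( 'UTF-8', $_ ) ) } @words;
    return '(?:' . join( '|', @alts ) . ')';
}

# Hypothetical helper: true if the whole token, encoded as UTF-8 bytes,
# is matched by the pattern.
sub matches_token {
    my ($pattern, $token) = @_;
    my $bytes = encode( 'UTF-8', $token );
    return $bytes =~ /\A$pattern\z/ ? 1 : 0;
}

# Illustrative words only (U+7684, and U+3053 U+308C).
my $pattern = build_pattern( "\x{7684}", "\x{3053}\x{308c}" );
print "$pattern\n";
print matches_token( $pattern, "\x{7684}" ) ? "match\n" : "no match\n";
```

If a word from the list does not match here, the likely causes are that the input was decoded to characters before escaping, or that the text being tested is not in the same byte encoding as the pattern.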

 Giovanni
