RE: Stripping out Unicode combining characters (diacritics)

Doran, Michael D Wed, 07 May 2008 06:51:15 -0700

I received a number of helpful suggestions and solutions.  The approach I 
decided to adopt in my larger script is to decode, as UTF-8, both the incoming 
form input and the database input that I'll be matching it against.  Once the 
strings are decoded into Perl characters, the '\p{M}' syntax works as expected 
in a Perl regexp.  In my test.cgi script for form input it would look like this:


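#!/usr/local/bin/perl
use strict;
use CGI;
use Encode;

my $query = CGI->new;
# Decode the raw UTF-8 octets into Perl characters so \p{M} can match
my $search_term = decode('UTF-8', scalar $query->param('text'));
my $sans_diacritics = $search_term;
$sans_diacritics =~ s/\p{M}+//g;    # remove every combining mark

# Encode the output back to UTF-8 to match the declared charset
binmode(STDOUT, ':encoding(UTF-8)');
print qq(Content-type: text/plain; charset=utf-8

search_term     is $search_term
sans_diacritics is $sans_diacritics
);
exit(0);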
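
To show why the decode step matters, here is a minimal stand-alone sketch that needs no web server.  The helper name strip_marks and the hard-coded sample strings are made up for illustration and are not part of my larger script.  It runs the same substitution on the raw octets and on the decoded string, then compares the stripped form value with a stand-in for a database value that has gone through the same treatment:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode UTF-8 octets into characters,
# then remove all combining marks.
sub strip_marks {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);
    $chars =~ s/\p{M}+//g;
    return $chars;
}

# "Bartok" with a combining acute accent (U+0301) after the "o",
# written as the raw UTF-8 octets a form submission would deliver.
my $raw = "Barto\xCC\x81k";

# Without decoding, the regex sees the two octets 0xCC and 0x81 as
# separate characters, neither of which is a mark, so nothing is removed.
(my $undecoded = $raw) =~ s/\p{M}+//g;

# With decoding, the two octets become the single character U+0301,
# which \p{M} matches.
my $form_value = strip_marks($raw);

# Stand-in for a value read from the database (an assumption for this
# sketch), run through the same helper so both sides are comparable.
my $db_value = strip_marks("Bartok");

print "undecoded length: ", length($undecoded), "\n";
print "decoded and stripped: ", encode('UTF-8', $form_value), "\n";
print "matches database value: ", ($form_value eq $db_value ? "yes" : "no"), "\n";
```

Running both the form input and the database input through the same helper is what keeps the later comparison consistent.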

I'm slowly figuring out how to work with Unicode in my web scripts, but still 
have a lot to learn.  Thanks for all the help. :-)

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Doran, Michael D [mailto:[EMAIL PROTECTED] 
> Sent: Monday, May 05, 2008 7:27 PM
> To: [EMAIL PROTECTED]
> Cc: Perl4lib
> Subject: Stripping out Unicode combining characters (diacritics)
> 
> I'm trying to strip out combining diacritics from some form 
> input using this code:
> 
> <html>
> <head>
>     <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
> </head>
> <body>
>   <form action="test.cgi" accept-charset="UTF-8" method="get">
>     <input type="text" name="text" value="" size="10">
>     <input type="submit" value="submit">
>   </form>
> </body>
> </html>
> 
> #!/usr/local/bin/perl
> use CGI;
> $query = CGI::new();
> $search_term = $query->param('text');
> $sans_diacritics  = $search_term;
> $sans_diacritics  =~ s/\p{M}*//g;
> #$sans_diacritics  =~ s/o//g;
> print qq(Content-type: text/plain; charset=utf-8
> 
> $sans_diacritics
> );
> exit(0);
> 
> 
> In the form, I'm inputting the string "Bartók" with the 
> accented character being a base character (small Latin letter 
> "o") followed by a combining acute accent.  However, when I 
> print (to the web) $sans_diacritics, I get my input with no 
> change -- the combining diacritic is still there.  I know 
> that my input is not a precomposed accented character, 
> because I can strip out the base "o" and the combining accent 
> either stands alone or jumps to another character [2].
> 
> The "\p{M}" is a Unicode class name for the character class 
> of Unicode 'marks', for example accent marks [1].  I've tried 
> these variations (and many others) and none seem to be doing 
> what I want:
> 
>        $sans_diacritics =~ s#[\p{Mark}]*##g;
>        $sans_diacritics =~ tr#[\p{InCombiningDiacriticalMarks}]##;
>        $sans_diacritics =~ tr#[\p{M}]##;
>        $sans_diacritics =~ s/\p{M}*//g;
>        $sans_diacritics =~ s#[\p{M}]##g;
>        $sans_diacritics =~ s#\x{0301}##g;
>        $sans_diacritics =~ s#\x{006F}\x{0301}##g;
>        $sans_diacritics =~ s#[\x{0300}-\x{036F}]*##g;
> 
> I'm pulling my hair out on this... so any help would be 
> appreciated.  If there's any other info I can provide, let me know.
> 
> My Perl version is 5.8.8 and the script is running on a 
> server running Solaris 9.
> 
> -- Michael
> 
> [1] per http://perldoc.perl.org/perlretut.html and other documentation
> 
> [2] using $sans_diacritics  =~ s/o//g;
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
>
