RE: Stripping out Unicode combining characters (diacritics)

Doran, Michael D Tue, 06 May 2008 07:26:52 -0700

Hi Leif,

> This is what I do. You can try that.
> See if it helps:
> 
> Encode::_utf8_on($str);  # <<<
> $str =~ s/\pM*//g;


That works!  I will gladly buy the beers, Leif, should we ever meet in person.
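
A note for anyone reading this in the archive: Encode::_utf8_on() only flips Perl's internal UTF-8 flag and does not check that the bytes are valid UTF-8. A minimal sketch of a more conservative variant is below, assuming the input arrives as raw UTF-8 bytes (as CGI input typically does). The helper name strip_marks_from_utf8_bytes is made up for illustration and does not come from this thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): takes raw UTF-8 bytes,
# removes all combining marks, and returns UTF-8 bytes again.
sub strip_marks_from_utf8_bytes {
    my ($bytes) = @_;

    # decode() checks the input while converting it to a character string;
    # by default, malformed sequences are replaced with U+FFFD.
    my $chars = decode('UTF-8', $bytes);

    # \p{M} matches any Unicode mark (categories Mn, Mc, Me).
    $chars =~ s/\p{M}+//g;

    return encode('UTF-8', $chars);
}

# "e" followed by U+0301 COMBINING ACUTE ACCENT, written as UTF-8 bytes.
my $input = "Cafe\xCC\x81";
print strip_marks_from_utf8_bytes($input), "\n";    # prints: Cafe
```

The regex is the same idea as Leif's; the only change is that decode() replaces the flag flip, so bad input is handled predictably.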

> I mean - have you for instance tried running your cgi scripts 
> in tainted mode (-T)?

No, I do not run my CGI scripts in tainted mode (although I realize that I 
probably should).  

Thanks (once again) for your help.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Leif Andersson [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, May 06, 2008 3:33 AM
> To: Doran, Michael D
> Subject: Re: Stripping out Unicode combining characters (diacritics)
> 
> Oh, now I see your REAL question.
> 
> This is what I do. You can try that.
> See if it helps:
> 
> Encode::_utf8_on($str);  # <<<
> $str =~ s/\pM*//g;
> 
> You are not the only one having problems with Unicode.
> Esp. in web programming it can be very confusing.
> 
> I am quite surprised that there are not more discussions of this kind.
> Not even in the "official" channels.
> 
> I mean - have you for instance tried running your cgi scripts 
> in tainted mode (-T)?
> 
> I had all my scripts set up that way. Before Unicode.
> But basic Unicode stuff became broken with -T enabled.
> Have they fixed that now?
> At least, I have seen no mention of it.
> 
> And screen scraping. If you want to mess around with 
> javascript embedded in an HTML page, you may find that the 
> content encoding is mixed. And Perl gets very confused 
> getting mixed character encodings.
> And so do I.
> 
> You may also have to deal with mixed encodings doing SQL 
> against the Voyager database.
> 
> What would we do if we could not fall back on "use bytes"
> every now and then! ;-)
> 
> Leif
> 
> ======================================
> Leif Andersson, Systems Librarian
> Stockholm University Library
> SE-106 91 Stockholm
> SWEDEN
> Phone : +46 8 162769
> Mobile: +46 70 6904281
> 
> 
> -----Original Message-----
> From: Doran, Michael D [mailto:[EMAIL PROTECTED]
> Sent: 6 May 2008 04:13
> To: Mike Rylander
> Cc: [EMAIL PROTECTED]; Perl4lib
> Subject: RE: Stripping out Unicode combining characters (diacritics)
> 
> Hi Mike,
> 
> I appreciate the quick reply.  I am familiar with the 
> Unicode::Normalize module (and will also be using that), but 
> I left it out of this question because it's not relevant to 
> the problem I'm currently trying to solve.  The text I'm 
> trying to strip diacritics out of does not have precomposed 
> accented characters.
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 cell
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
> 
> 
> 
> -----Original Message-----
> From: Mike Rylander [mailto:[EMAIL PROTECTED]
> Sent: Mon 5/5/2008 8:52 PM
> To: Doran, Michael D
> Cc: [EMAIL PROTECTED]; Perl4lib
> Subject: Re: Stripping out Unicode combining characters (diacritics)
>  
> On Mon, May 5, 2008 at 8:26 PM, Doran, Michael D 
> <[EMAIL PROTECTED]> wrote:
> [snip]
> >
> >  I'm pulling my hair out on this... so any help would be
> >  appreciated.  If there's any other info I can provide, let me know.
> >
> 
> You'll want to transform the text to NFD (nominally, 
> base characters plus combining marks) instead of NFC (precomposed
> characters) using Unicode::Normalize:
> 
>  use Unicode::Normalize;
> 
>  my $text = NFD($original);
>  $text =~ s/\pM+//g;
> 
> Hope that helps.
> 
> --
> Mike Rylander
>  | VP, Research and Design
>  | Equinox Software, Inc. / The Evergreen Experts
>  | phone: 1-877-OPEN-ILS (673-6457)
>  | email: [EMAIL PROTECTED]
>  | web: http://www.esilibrary.com
> 
>
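
A note for the archive on the two suggestions in this thread: they combine naturally. Decomposing with NFD first means the same mark-stripping regex handles both precomposed and already-decomposed text. The sketch below assumes the input is a decoded character string rather than raw bytes; the helper name strip_diacritics is hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not from the thread): expects a character string.
sub strip_diacritics {
    my ($text) = @_;

    # NFD splits precomposed characters into base character + combining marks;
    # text that is already decomposed passes through unchanged.
    my $decomposed = NFD($text);

    # Remove every combining mark, leaving only the base characters.
    $decomposed =~ s/\p{M}+//g;

    return $decomposed;
}

my $precomposed = "\x{E9}t\x{E9}";        # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $combining   = "e\x{301}te\x{301}";    # "e" + U+0301 COMBINING ACUTE ACCENT

# Both spellings reduce to the same base string.
print strip_diacritics($precomposed) eq strip_diacritics($combining)
    ? "same\n"
    : "different\n";
```

If the text is known to contain no precomposed characters, as in Michael's case, the NFD step does no harm and can be left in as a safeguard.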
