Hi Leif,

> This is what I do. You can try that.
> See if it helps:
>
>   Encode::_utf8_on($str); # <<<
>   $str =~ s/\pM*//g;

That works! I will gladly buy the beers Leif, should we ever meet in person.

> I mean - have you for instance tried running your cgi scripts
> in tainted mode (-T)?

No, I do not run my CGI scripts in tainted mode (although I realize
that I probably should).

Thanks (once again) for your help.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Leif Andersson [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, May 06, 2008 3:33 AM
> To: Doran, Michael D
> Subject: Re: Stripping out Unicode combining characters (diacritics)
>
> Oh, now I see your REAL question.
>
> This is what I do. You can try that.
> See if it helps:
>
>   Encode::_utf8_on($str); # <<<
>   $str =~ s/\pM*//g;
>
> You are not the only one having problems with Unicode.
> Esp. in web programming it can be very confusing.
>
> I am quite surprised that there are not more discussions of this kind.
> Not even in the "official" channels.
>
> I mean - have you for instance tried running your cgi scripts
> in tainted mode (-T)?
>
> I had all my scripts set up that way. Before Unicode.
> But basic Unicode stuff became broken with -T enabled.
> Have they fixed that now?
> I have at least seen no mentioning of it.
>
> And screen scraping. If you want to mess around with
> javascript embedded in an HTML page, you may find that the
> content encoding is mixed. And Perl gets very confused
> getting mixed character encodings.
> And so do I.
>
> You may also have to deal with mixed encodings doing SQL
> against the Voyager database.
>
> What would we do if we could not fall back on "use bytes"
> every now and then!
> ;-)
>
> Leif
>
> ======================================
> Leif Andersson, Systems Librarian
> Stockholm University Library
> SE-106 91 Stockholm
> SWEDEN
> Phone : +46 8 162769
> Mobile: +46 70 6904281
>
>
> -----Original Message-----
> From: Doran, Michael D [mailto:[EMAIL PROTECTED]
> Sent: 6 May 2008 04:13
> To: Mike Rylander
> Cc: [EMAIL PROTECTED]; Perl4lib
> Subject: RE: Stripping out Unicode combining characters (diacritics)
>
> Hi Mike,
>
> I appreciate the quick reply. I am familiar with the
> Unicode::Normalize module (and will also be using that), but
> I left it out of this question because it's not relevant to
> the problem I'm currently trying to solve. The text I'm
> trying to strip diacritics out of does not have precomposed
> accented characters.
>
> -- Michael
>
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 cell
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
>
>
>
> -----Original Message-----
> From: Mike Rylander [mailto:[EMAIL PROTECTED]
> Sent: Mon 5/5/2008 8:52 PM
> To: Doran, Michael D
> Cc: [EMAIL PROTECTED]; Perl4lib
> Subject: Re: Stripping out Unicode combining characters (diacritics)
>
> On Mon, May 5, 2008 at 8:26 PM, Doran, Michael D
> <[EMAIL PROTECTED]> wrote:
> [snip]
> >
> > I'm pulling my hair out on this... so any help would be
> > appreciated. If there's any other info I can provide, let me know.
> >
>
> You'll want to transform the text to NFD format (nominally,
> base characters plus combining marks) instead of NFC (precomposed
> characters) using Unicode::Normalize:
>
>   use Unicode::Normalize;
>
>   my $text = NFD($original);
>   $text =~ s/\pM+//go;
>
> Hope that helps.
>
> --
> Mike Rylander
> | VP, Research and Design
> | Equinox Software, Inc. / The Evergreen Experts
> | phone: 1-877-OPEN-ILS (673-6457)
> | email: [EMAIL PROTECTED]
> | web: http://www.esilibrary.com
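
For anyone finding this thread in the archive, here is a minimal sketch
that combines the two suggestions quoted above: Mike's NFD normalization
and Leif's step of making Perl treat the string as characters before
stripping \pM. It is an illustration only, not code posted in the thread.
The helper name strip_diacritics and the sample strings are hypothetical.
It also swaps in Encode::decode for Encode::_utf8_on, on the assumption
that the input arrives as UTF-8 encoded bytes (for example from CGI input
or a database query).

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not from the thread): takes UTF-8 encoded bytes
# and returns UTF-8 encoded bytes with all combining marks removed.
sub strip_diacritics {
    my ($bytes) = @_;

    # Turn the raw bytes into a Perl character string. Unlike
    # Encode::_utf8_on, which only flips the internal flag, decode()
    # actually interprets the bytes as UTF-8.
    my $str = decode('UTF-8', $bytes);

    # Decompose precomposed characters into base character + combining marks.
    $str = NFD($str);

    # Remove every character that has the Unicode Mark property.
    $str =~ s/\pM+//g;

    # Return bytes again, ready for output.
    return encode('UTF-8', $str);
}

# Sample input (assumed): "cafe" with an acute accent, once precomposed
# (U+00E9) and once decomposed ("e" followed by U+0301), as UTF-8 bytes.
my $precomposed = "caf\xC3\xA9";
my $decomposed  = "cafe\xCC\x81";

print strip_diacritics($precomposed), "\n";   # prints "cafe"
print strip_diacritics($decomposed),  "\n";   # prints "cafe"
```

With the NFD step included, the same routine covers both precomposed and
already decomposed input. If the data is known to be fully decomposed, as
in Michael's case, the NFD call changes nothing and could be dropped.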