On Sat, Jan 12, 2013 at 12:56 PM, Charles DeRykus <dery...@gmail.com> wrote:
> On Fri, Jan 11, 2013 at 2:01 PM, Christer Palm <b...@bredband.net> wrote:
>> Hi!
>>
>> I have a perl script that parses RSS streams from different news sources and
>> experience problems with national characters in a regexp function used for
>> matching a keyword list with the RSS data.
>>
>> Everything works fine with a simple regexp for plain English, i.e. words
>> containing the letters A-Z, a-z, 0-9.
>>
>> if ( $description =~ m/\b$key/i ) {….}
>>
>> Keywords or RSS data with national characters don’t work at all. I’m not
>> really surprised; this was expected, as the character sets used in the
>> different RSS streams are outside my control.
>>
>> I have the ”use utf8;” pragma activated but I’m not really sure if it
>> is needed. I can’t see any difference whether it is used or not.
>>
>> If I convert all the national characters used in the keyword list to html
>> type ”å” and so on, and a character parser changes every occurrence of
>> octal and unicode characters (i.e. decimal and hex) in the RSS data to
>> html type, everything works fine, but it takes time that I want to
>> avoid.
>>
>> Do you have suggestions on this character issue? Is it possible to determine
>> the character set of a text efficiently? Are there other ways to solve the
>> problem?
>> ...
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> binmode(STDOUT, ":utf8");
> my $cosa = "my \x{263a}";
> print "cosa=$cosa\n";
>
> print "found smiley at \\b\n" if $cosa =~ /\b\x{263a}/;
> print "found smiley (no \\b)" if $cosa =~ /\x{263a}/;
>
> The output:
> cosa=my ☺
> found smiley (no \b)
>

From: http://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
-----------------------------------------------------------------------------------
Most regular expression engines allow a test for word boundaries (such as
by "\b" in Perl). They generally use a very simple mechanism for
determining word boundaries: one example of that would be having word
boundaries between any pair of characters where one is a <word_character>
and the other is not, or at the start and end of a string. This is not
adequate for Unicode regular expressions.
-----------------------------------------------------------------------------------

Based on the above, Perl's \b semantics appear to be "not adequate for
Unicode regular expressions", since \b is defined purely in terms of \w:
it matches only between a \w character (letters, digits, underscore) and
a non-\w character. U+263A is a symbol, not a \w character, so there is no
boundary between the space and the smiley, and the \b test above fails.
As far as I can tell, on decoded character strings \w does include
letters such as å, so \b should still work for keywords that start with
an ordinary national character.

For keywords that start with a symbol, you may possibly want to try a
preceding space to delimit the keyword:

    print "match" if "my \x{263a}" =~ / \x{263a}/;      # matches!
    #print "match" if "my \x{263a}" =~ /\b\x{263a}/;    # would not match

--
Charles DeRykus
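
P.S. To make the \b behaviour in this thread concrete, here is a minimal sketch, not a definitive fix. It assumes Perl 5.14 or later (for the /u modifier) and that the feed is UTF-8; the helper name has_keyword and the sample strings are made up for illustration. It decodes the raw bytes with Encode before matching, then tries a keyword that starts with a letter (å), a keyword that starts with a symbol (the smiley), and the same letter keyword against undecoded bytes. The keyword is written with \x{...} escapes, so "use utf8;" is not needed here; that pragma only affects literals typed into the script source.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if $key occurs in $text starting at a
# word boundary, else 0. /u asks for Unicode rules for \b, \w and /i
# (Perl 5.14+); \Q...\E keeps regex metacharacters in $key literal.
sub has_keyword {
    my ($text, $key) = @_;
    return $text =~ /\b\Q$key\E/iu ? 1 : 0;
}

# Raw bytes as they might arrive from a feed. UTF-8 is an assumption;
# a real feed should be decoded with whatever encoding it declares.
my $bytes = "Nyheter fr\xc3\xa5n \xc3\x85re";    # "Nyheter från Åre"
my $text  = decode('UTF-8', $bytes);             # bytes -> characters

my $key = "\x{e5}re";                            # "åre"

print "letter keyword, decoded text: ", has_keyword($text, $key), "\n";
print "symbol keyword:               ", has_keyword("my \x{263a}", "\x{263a}"), "\n";
print "letter keyword, raw bytes:    ", has_keyword($bytes, $key), "\n";
```

On my reading, the first line should print 1 and the other two 0: the letter keyword is found once the text has been decoded, the smiley is never preceded by a \b because it is not a \w character, and matching against undecoded bytes fails because the two UTF-8 bytes of "Å" are compared as two separate characters.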