Re: character sets in a regexp

Charles DeRykus Mon, 14 Jan 2013 14:40:55 -0800

On Sat, Jan 12, 2013 at 12:56 PM, Charles DeRykus <dery...@gmail.com> wrote:
> On Fri, Jan 11, 2013 at 2:01 PM, Christer Palm <b...@bredband.net> wrote:
>> Hi!
>>
>> I have a Perl script that parses RSS streams from different news sources, and
>> I am having problems with national characters in a regexp used for
>> matching a keyword list against the RSS data.
>>
>> Everything works fine with a simple regexp for plain English, i.e. words
>> containing the letters A-Z, a-z, 0-9.
>>
>> if ( $description =~ m/\b$key/i ) {….}
>>
>> Keywords or RSS data with national characters don't work at all. I'm not
>> really surprised; this was expected, as the character sets used in the
>> different RSS streams are outside my control.
>>
>> I have the "use utf8;" pragma enabled, but I'm not really sure if it
>> is needed. I can't see any difference with or without it.
>>
>> If I convert all the national characters in the keyword list to HTML
>> entities such as "&aring;", and a character parser converts every octal,
>> decimal and hex character reference in the RSS data to the same HTML
>> form, everything works fine, but it takes time that I would like to
>> avoid.
>>
>> Do you have suggestions on this character issue? Is it possible to determine
>> the character set of a text efficiently? Are there other ways to solve the
>> problem?
>>
...


> #!/usr/bin/perl
> use strict;
> use warnings;
>
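> binmode(STDOUT, ":utf8");
> my $cosa = "my \x{263a}";    # "my" is required under "use strict"
> print "cosa=$cosa\n";
>
> # \b needs a \w character on one side; neither space nor smiley is one
> print "found smiley at \\b\n"   if $cosa =~ /\b\x{263a}/;
> print "found smiley (no \\b)\n" if $cosa =~ /\x{263a}/;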
>
> The output:
> cosa=my ☺
> found smiley (no \b)
>

From: http://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
-----------------------------------------------------------------------------------
Most regular expression engines allow a test for word boundaries (such
as by "\b" in Perl). They generally use a very simple mechanism for
determining word boundaries: one example of that would be having word
boundaries between any pair of characters where one is a
<word_character> and the other is not, or at the start and end of a
string. This is not adequate for Unicode regular expressions.
-------------------------------------------------------------------------------------

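Based on the above, Perl's \b semantics appear to be "not adequate
for Unicode regular expressions": \b only tests for a transition
between a \w character and a non-\w character. The smiley (U+263A)
is a symbol, not a word character, so there is no boundary between
the space and the smiley and /\b\x{263a}/ fails. Accented letters
such as å do generally count as \w, but only once the text has been
decoded into characters.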
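
Letters are a somewhat different case from the smiley. As a rough
sketch (assuming the feed bytes are UTF-8, which a real feed may not
be; the helper name and the sample keyword are made up for
illustration), decoding the bytes with the core Encode module before
matching lets \b and /i treat a letter such as Å as a word character:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode both arguments from UTF-8 bytes (an
# assumption; a real feed declares its own encoding) and test whether
# the keyword starts at a word boundary, ignoring case.
sub keyword_in_bytes {
    my ($bytes, $key_bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    my $key  = decode('UTF-8', $key_bytes);
    return $text =~ /\b\Q$key\E/i ? 1 : 0;
}

# UTF-8 bytes for "Nyheter från Åland" and for the keyword "åland".
my $description = "Nyheter fr\xc3\xa5n \xc3\x85land";
my $keyword     = "\xc3\xa5land";

# Compare matching on decoded characters with matching on the raw bytes.
print "decoded: ",   keyword_in_bytes($description, $keyword), "\n";
print "raw bytes: ", ($description =~ /\b\Q$keyword\E/i ? 1 : 0), "\n";
```

Decoding once on input is usually cheaper than rewriting every
character as an HTML entity before each match.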

So you may want to try a preceding space to delimit the
keyword:

print "match" if "my \x{263a}"  =~ / \x{263a}/;     # matches!
#print "match" if "my \x{263a}" =~ /\b\x{263a}/;    # would not match
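
A literal space will not match a keyword at the very start of the
text or right after punctuation. One possible alternative, offered
only as a sketch (the helper name is hypothetical, not something from
the original script), is a negative lookbehind that only requires
that the keyword is not preceded by a word character:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $key occurs in $text and is not
# immediately preceded by a word character. \Q...\E keeps any regex
# metacharacters in the keyword literal.
sub starts_at_boundary {
    my ($text, $key) = @_;
    return $text =~ /(?<!\w)\Q$key\E/i ? 1 : 0;
}

binmode(STDOUT, ":encoding(UTF-8)");
print "match after space\n"      if starts_at_boundary("my \x{263a}", "\x{263a}");
print "match at string start\n"  if starts_at_boundary("\x{263a} day", "\x{263a}");
print "no match inside a word\n" unless starts_at_boundary("rebar", "bar");
```

Unlike \b, the lookbehind does not care whether the keyword itself
begins with a word character, so it behaves the same for symbols and
letters.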

-- 
Charles DeRykus

