On Fri, 26 Sep 2014, dar...@chaosreigns.com wrote:

I wrote a script that takes a list of words with UTF-8 characters, and
generates rules matching them:

http://chaosreigns.com/code/dl/sawordrule.pl

For example:

$ echo "análisis" | perl ./sawordrule.pl SPANISH_
body SPANISH_ANALISIS /\ban[\x{C1}\x{E1}]lisis\b/i # análisis

How do you get a one-byte match for a two-byte UTF-8-encoded accented character? Shouldn't it generate this:

   /\ban[\xc3][\xa1]lisis\b/i

I didn't think normalization had been implemented yet.

Your rule doesn't hit in my test environment (though I just pasted that word into an existing message to test...)
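
To make the byte-level point concrete, here is a rough sketch outside SpamAssassin. It uses Python's re module on raw bytes purely for illustration, and it assumes the message body reaches the rule as undecoded UTF-8 octets (no charset normalization), which is my guess about my test environment rather than a statement about how every install behaves. The variable names are made up for the example.

```python
import re

word = "análisis"

# The same word as raw octets in two encodings.
utf8_bytes = word.encode("utf-8")      # "á" becomes two bytes: C3 A1
latin1_bytes = word.encode("latin-1")  # "á" becomes one byte: E1

# Byte-oriented stand-ins for the two rules discussed above.
# Assumption: matching happens against raw bytes, not decoded characters.
one_byte_rule = re.compile(rb"\ban[\xc1\xe1]lisis\b", re.IGNORECASE)
two_byte_rule = re.compile(rb"\ban\xc3\xa1lisis\b", re.IGNORECASE)

utf8_hex = utf8_bytes.hex()

hits = {
    "one_byte_vs_utf8": bool(one_byte_rule.search(utf8_bytes)),
    "one_byte_vs_latin1": bool(one_byte_rule.search(latin1_bytes)),
    "two_byte_vs_utf8": bool(two_byte_rule.search(utf8_bytes)),
    "two_byte_vs_latin1": bool(two_byte_rule.search(latin1_bytes)),
}

print(utf8_hex)
for name, hit in hits.items():
    print(name, hit)
```

Under that assumption the one-byte class only hits ISO-8859-1 text, and the two-byte sequence only hits UTF-8 text, which would be consistent with the generated rule not firing on a UTF-8 message.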

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  How do you argue with people to whom math is an opinion? -- Unknown
-----------------------------------------------------------------------
 848 days since the first successful private support mission to ISS (SpaceX)
