Re: UTF-8 rule generator script Re: UTF-8 rules, what am I missing?

John Hardin Fri, 26 Sep 2014 13:57:12 -0700

On Fri, 26 Sep 2014, dar...@chaosreigns.com wrote:

> I wrote a script that takes a list of words with UTF-8 characters, and
> generates rules matching them:
>
> http://chaosreigns.com/code/dl/sawordrule.pl
>
> For example:
>
> $ echo "análisis" | perl ./sawordrule.pl SPANISH_
> body SPANISH_ANALISIS /\ban[\x{C1}\x{E1}]lisis\b/i # análisis

How do you get a one-byte match for two-byte UTF-8-encoded accented characters? Shouldn't it generate this:


   /\ban[\xc3][\xa1]lisis\b/i

I didn't think normalization had been implemented yet.

Your rule doesn't hit in my test environment (though I just pasted that word into an existing message to test...)
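
To make the byte-versus-character point concrete, here is a minimal Python sketch. It is not part of sawordrule.pl or SpamAssassin, and the variable names are only illustrative. It assumes the body is matched as raw UTF-8 bytes, which is what I would expect without charset normalization.

```python
import re

word = "an\u00e1lisis"      # "análisis", with U+00E1 written as an escape
raw = word.encode("utf-8")  # the bytes as they would sit in a UTF-8 body

# Generated rule: a single byte 0xC1 or 0xE1 where the accented letter is.
one_byte = re.compile(rb"\ban[\xc1\xe1]lisis\b", re.I)
# Suggested rule: the two UTF-8 bytes of U+00E1, 0xC3 followed by 0xA1.
two_byte = re.compile(rb"\ban[\xc3][\xa1]lisis\b", re.I)

one_byte_hit = one_byte.search(raw)
two_byte_hit = two_byte.search(raw)

# The same character class applied to decoded text (code points, not bytes).
decoded_hit = re.search(r"\ban[\xc1\xe1]lisis\b", raw.decode("utf-8"), re.I)

print(raw)  # b'an\xc3\xa1lisis'
print(one_byte_hit is not None,  # False: no lone 0xE1 byte in the data
      two_byte_hit is not None,  # True: 0xC3 0xA1 is present
      decoded_hit is not None)   # True: U+00E1 is a single code point
```

So the one-byte class can only hit if the text has been decoded to characters (or is Latin-1) before matching. Note also that the two-byte form as written covers only lowercase á, since Á encodes as 0xC3 0x81 and /i does not fold raw bytes above ASCII.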


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  How do you argue with people to whom math is an opinion? -- Unknown
-----------------------------------------------------------------------
 848 days since the first successful private support mission to ISS (SpaceX)
