Re: __DRUG_MUSCLE1 false-positives

2012-05-18 Thread Matus UHLAR - fantomas
On 18/05/12 03:18, David F. Skoll wrote:
I looked at the regex and it seems that Perl treats är as having a word boundary in the \b sense between the "ä" and the "r"
On 18.05.12 07:26, Jason Haar wrote:
A bit OT, but is it because your perl is running under "C" locale instead of se? i.e. would

Re: __DRUG_MUSCLE1 false-positives

2012-05-17 Thread David F. Skoll
On Fri, 18 May 2012 08:37:07 +1200 Jason Haar wrote:
> I'm no linguist but this is probably an extremely hard problem to
> solve. An email can have mixtures of languages, so in a perfect world
> we should be able to change locale per word (or per char? - eeek!).
The only sane solution is to re-e

Re: __DRUG_MUSCLE1 false-positives

2012-05-17 Thread Jason Haar
On 18/05/12 07:54, dar...@chaosreigns.com wrote:
> Locale handling is a known problem in SA:
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062
bug opened in 2004 :-(
I'm no linguist but this is probably an extremely hard problem to solve. An email can have mixtures of languages, so i

Re: __DRUG_MUSCLE1 false-positives

2012-05-17 Thread David F. Skoll
On Fri, 18 May 2012 07:26:56 +1200 Jason Haar wrote:
> > I looked at the regex and it seems that Perl treats är as having a
> > word boundary in the \b sense between the "ä" and the "r"
> A bit OT, but is it because your perl is running under "C" locale
> instead of se?
Ah... could be. Hmm, ok.

Re: __DRUG_MUSCLE1 false-positives

2012-05-17 Thread darxus
On 05/18, Jason Haar wrote:
> A bit OT, but is it because your perl is running under "C" locale
> instead of se? i.e. would the word boundary definition change under
> different localization contexts? Doesn't help solve the problem for you,
> but it certainly flags a potential issue with a tonne of

Re: __DRUG_MUSCLE1 false-positives

2012-05-17 Thread Jason Haar
On 18/05/12 03:18, David F. Skoll wrote:
> I looked at the regex and it seems that Perl treats är as having a
> word boundary in the \b sense between the "ä" and the "r"
A bit OT, but is it because your perl is running under "C" locale instead of se? i.e. would the word boundary definition change
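
For readers following the thread, here is a rough sketch of the effect being described: a \b word boundary appearing between "ä" and "r". It is written in Python rather than Perl, so it is only an analogy and not the __DRUG_MUSCLE1 rule or SpamAssassin's code. The sample sentence and variable names are made up for illustration. The assumption is that the message body is matched as raw UTF-8 bytes with ASCII-only word characters, which is roughly what matching under the "C" locale would amount to.

```python
import re

# Hypothetical sample text (Swedish for "it is good"); not from any real message.
text = "det är bra"

# Byte semantics: "ä" is encoded as the two bytes 0xC3 0xA4 in UTF-8.
# With a bytes pattern, \w only covers ASCII [a-zA-Z0-9_], so neither byte
# counts as a word character and \b matches between "ä" and "r".
raw = text.encode("utf-8")
byte_hits = re.findall(rb"\br\b", raw)

# Character semantics: with a str pattern, "ä" is a word character,
# so "är" is one word and a standalone "r" is not found.
char_hits = re.findall(r"\br\b", text)

print(byte_hits)  # [b'r']
print(char_hits)  # []
```

Under byte semantics the "r" in "är" looks like a separate one-letter word, which is the kind of spurious boundary that could let a rule written for obfuscated drug names fire on ordinary Swedish text.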