Darn it. I came in too late on this subject to make any sense out of
it.
But I have a similar issue that I ran into and I'm wondering if you
might be able to shed some light on it -- like am I going in the
right direction.
I have a Bayesian spam filter that I wrote in Perl and have been
developing into something useful for about a month now.
The features that make it different from SA are its speed and the
overall approach to managing tokens and data.
So far it's extremely fast, but I suspect I will lose most or all of
that in the near future.
The problem that I have run into is how to parse tokens in
non-ASCII text.
Example:
I want to break up every email based on a token defined as /(\w\w\w+)/g;
This will give me every "word" of three or more letters.
But when I'm getting mail that is in UTF-8 format this doesn't work
the way I want it to, as I can't get an umlaut (or similar) to
match a '\w'.
My data storage can handle these characters, but I can't.
Any suggestions?
(not sure how SA does this either)
Someone suggested 'use encoding "utf8"' but I actually couldn't find
much on that either.
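
One possible direction, offered only as a minimal sketch and not as a
description of how SA does it: decode the raw UTF-8 bytes into a Perl
character string with the core Encode module before applying the
regex, so that \w is matched against characters and not against
individual bytes. The helper name tokens_from_utf8 and the sample
text are made up for illustration, and the sketch assumes the message
body really is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode raw UTF-8 bytes into characters, then
# pull out every run of three or more word characters.
sub tokens_from_utf8 {
    my ($raw_bytes) = @_;

    # After decoding, the string holds characters, so \w follows
    # Unicode rules and can match letters such as u-umlaut (U+00FC).
    my $text = decode('UTF-8', $raw_bytes);

    # Same token pattern as in the question above.
    return $text =~ /(\w\w\w+)/g;
}

# Sample input written as raw UTF-8 bytes (assumed, for illustration).
my $raw = "f\xC3\xBCr M\xC3\xA4dchen und Jungen";
my @tokens = tokens_from_utf8($raw);

# Encode back to UTF-8 bytes before printing or storing.
print encode('UTF-8', $_), "\n" for @tokens;
```

Mail in other charsets would need the charset named in its MIME
headers passed to decode() in place of 'UTF-8'.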
So this seems to be a good list to ask these sorts of questions.
On May 10, 2007, at 4:36 AM, Justin Mason wrote:
Kevin W. Gagel writes:
Thanks for straightening me out on that Vincent.
Folks - for completeness, here are some instructions for the
WORKAROUND.
Locate your Message.pm module and edit the section at the beginning as
indicated below.
I have been running this now for a couple of hours with no adverse
effects (that I can see at the moment).
PS
Thanks beginners@perl.org for your help. I'm up and running
without any further errors.
----- Forwarded Message -----
Vincent,
Where in the Message.pm module do I put "use bytes"? Right here
(below), and do I just add it below the warnings line with a ";"
ending it?
Yes, you are right, after "use warnings;". I ran SA3.2 on my site
with "use bytes;" added, no problem so far. But it seems SA developers
did not mention this, they might have their reasons (break
normalize_charset for one reason).
Yes, exactly -- breaking one of the major 3.2.0 features is not a good
thing. :(
--j.
---paste---
package Mail::SpamAssassin::Message;
use strict;
use warnings;
use Mail::SpamAssassin;
use Mail::SpamAssassin::Message::Node;
use Mail::SpamAssassin::Message::Metadata;
use Mail::SpamAssassin::Constants qw(:sa);
use Mail::SpamAssassin::Logger;
use vars qw(@ISA);
---end paste---
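
As a rough illustration of what the "use bytes;" pragma changes (a
standalone sketch, not code from Message.pm): inside its lexical scope,
built-ins such as length() work on the internal bytes of a string
instead of its characters. That may help explain why applying it to a
whole module could interfere with charset handling, as noted above.
The helper names char_length and byte_length are made up for this
example.

```perl
use strict;
use warnings;

# A four-character string whose first character is U+00FC (u-umlaut).
my $text = "\x{FC}ber";
utf8::upgrade($text);    # force the internal UTF-8 representation

# Default semantics: length() counts characters.
sub char_length {
    my ($s) = @_;
    return length $s;
}

# With the pragma in scope, length() counts internal bytes.
sub byte_length {
    my ($s) = @_;
    use bytes;           # lexically scoped to this sub body
    return length $s;
}

print "characters: ", char_length($text), "\n";   # 4
print "bytes:      ", byte_length($text), "\n";   # 5
```

Placing the pragma after "use warnings;" at the top of a file, as in
the workaround, puts every statement in the rest of that file inside
its scope.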
=================================
Vincent Li
http://bl0g.blogdns.com
=================================
Kevin W. Gagel
Network Administrator
Information Technology Services
(250) 562-2131 local 448
My Blog:
http://mail.cnc.bc.ca/blogs/gagel
-------------------------------------------------------------------
The College of New Caledonia, Visit us at http://www.cnc.bc.ca
Virus scanning is done on all incoming and outgoing email.
Anti-spam information for CNC can be found at http://avas.cnc.bc.ca
-------------------------------------------------------------------
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/