Darn it. I came in too late on this subject to make any sense out of it. But I ran into a similar issue, and I'm wondering if you might be able to shed some light on it and tell me whether I'm going in the right direction.

I have a Bayesian spam filter that I wrote in Perl and have been developing into something useful for about a month now. The features that make it different from SA are its speed and its overall approach to managing tokens and data. So far it's extremely fast, but I suspect I will lose most or all of that in the near future.

The problem I have run into is how to tokenize non-ASCII text.

Example:
I want to break up every email based on a token defined as /(\w\w\w+)/g;
This will give me every "word" of three or more word characters (letters, digits, or underscore).
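
To make the example concrete, here is a minimal sketch of that tokenizer on plain ASCII input. The sub name tokenize and the sample text are only placeholders for illustration, not names from my actual filter.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of three or more word characters.
sub tokenize {
    my ($text) = @_;
    return $text =~ /(\w\w\w+)/g;   # list context: all captured matches
}

my @tokens = tokenize("Buy cheap meds at my shop now");
print join(",", @tokens), "\n";     # prints: Buy,cheap,meds,shop,now
```

The two-letter words "at" and "my" are dropped because the pattern needs at least three word characters.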

But when I get mail that is in UTF-8 format, this doesn't work the way I want it to, because an umlaut (or similar character) doesn't match '\w'.

My data storage can handle these characters, but I can't.

Any suggestions?
(not sure how SA does this either)

Someone suggested 'use encoding "utf8"' but I actually couldn't find much on that either.
So this seems to be a good list to ask these sort of questions.
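
One direction that might work is to decode the raw bytes into Perl character strings with the core Encode module before tokenizing, so that '\w' sees letters instead of individual bytes. This is only a sketch: it assumes the body really is valid UTF-8 (in real mail the charset would have to come from the MIME headers), and the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they might arrive: "Grüße aus München" encoded as UTF-8.
my $raw = "Gr\xC3\xBC\xC3\x9Fe aus M\xC3\xBCnchen";

# Matching against undecoded bytes: with default settings the bytes above
# 0x7F are not treated as word characters, so the accented words get cut up.
my @byte_tokens = $raw =~ /(\w\w\w+)/g;

# Decode bytes into characters first; \w then follows Unicode rules.
my $chars       = decode('UTF-8', $raw);
my @char_tokens = $chars =~ /(\w\w\w+)/g;

# Encode back to UTF-8 bytes before printing or storing the tokens.
print join(",", map { encode('UTF-8', $_) } @char_tokens), "\n";
```

With the decoded string, the three words come out whole. The tokens need to be encoded again on the way out so the data store receives bytes rather than Perl's internal character strings.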

On May 10, 2007, at 4:36 AM, Justin Mason wrote:


Kevin W. Gagel writes:
Thanks for straightening me out on that Vincent.
Folks - for completeness here are some instructions for the WORKAROUND.

Locate your Message.pm module and edit the section at the beginning as
indicated below.

I have been running this now for a couple of hours with no adverse effects
(that I can see at the moment).

PS
Thanks beginners@perl.org for your help. I'm up and running without any
further errors.
----- Forwarded Message -----
Vincent,

Where in the Message.pm module do I put "use bytes"? Right here (below)?
And do I just add it below the warnings line with a ";" ending it?

Yes, you are right, after "use warnings;". I ran SA 3.2 on my site with "use bytes;" added, no problem so far. But it seems the SA developers did not mention this; they might have their reasons (it breaks normalize_charset, for one).

Yes, exactly -- breaking one of the major 3.2.0 features is not a good
thing. :(

--j.

---paste---
package Mail::SpamAssassin::Message;

use strict;
use warnings;

use Mail::SpamAssassin;
use Mail::SpamAssassin::Message::Node;
use Mail::SpamAssassin::Message::Metadata;
use Mail::SpamAssassin::Constants qw(:sa);
use Mail::SpamAssassin::Logger;

use vars qw(@ISA);
---end paste---
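
For readers wondering why "use bytes" can interfere with charset handling, here is a small sketch of the pragma's general effect. It is not taken from SpamAssassin; the sample word is made up, and the snippet only shows that string operations inside the pragma's scope count bytes rather than characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "München" decoded from UTF-8: 7 characters, stored internally as 8 bytes.
my $word = decode('UTF-8', "M\xC3\xBCnchen");

my $char_len = length($word);      # character semantics

my $byte_len;
{
    use bytes;                     # lexically scoped: affects only this block
    $byte_len = length($word);     # byte semantics
}

print "characters: $char_len, bytes: $byte_len\n";   # characters: 7, bytes: 8
```

Placed at the top of a module, the pragma applies to the whole file, so any code there that relies on character semantics would presumably be affected as well.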

=================================

Vincent Li
http://bl0g.blogdns.com

=================================
Kevin W. Gagel
Network Administrator
Information Technology Services
(250) 562-2131 local 448
My Blog:
http://mail.cnc.bc.ca/blogs/gagel

-------------------------------------------------------------------
The College of New Caledonia, Visit us at http://www.cnc.bc.ca
Virus scanning is done on all incoming and outgoing email.
Anti-spam information for CNC can be found at http://avas.cnc.bc.ca
-------------------------------------------------------------------

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/



