Also an interesting concept, adding to spamdb that is.  You haven't found
that spam changes over time enough that it makes sense to start with a new
db for every rebuild?

Some thoughts:

 MaintBayesCollection says: "ASSP will delete the oldest files until the
number of files per folder reaches MaxFiles."  However, MaxBayesFileAge says
that it only runs if MaintBayesCollection is checked.  Which runs first?
It might be good to have MaxBayesFileAge and the other cleanup functions run
if set, regardless of whether MaintBayesCollection is checked.

Come to think of it, we should pull the deleting out of MaintBayesCollection
altogether, and have that checkbox simply turn all of the maintenance
functions on or off.

I'd like to see several new or modified settings:

change *MaintBayesCollection*: simple on and off.  It only decides whether
the functions below apply.

add *TrimCollectionMethod*: dropdown with these options:
a) By Date - this will delete the oldest files until the number of files per
folder reaches MaxFiles
b) Randomly - this will delete files randomly until the number of files per
folder reaches MaxFiles (better for Bayesian filtering)
c) Off
add *TrimCollectionMinFileAge*: when deleting files using
TrimCollectionMethod, do not ever delete files newer than this many days.
If set to 0, all files are eligible for deletion.  Setting to a value higher
than 0 will protect newer files from deletion, ensuring that BlockReporting
will not be broken by a deleted file.

add *TrimCollectionByDuplicateSubject*: If checked and if
UseSubjectsAsMaillogNames is checked, this will delete files with the exact
same subject filename.  3 files of each subject are maintained to account
for variance in the body.  This runs before the TrimCollectionMethod random
delete or delete by date.  This will break block reporting, but if we're
deleting messages with the same subject, chances are pretty high that it's
spam and won't need to be resent.

keep *MaxBayesFileAge* as is.  Make sure this runs before the
TrimCollectionMethod and that the admin interface says so.  If set to
0, it doesn't run.

modify *MaxCorrectedDays*.  Change to *MaxCorrectedFileAge* to be more
consistent.  Again, have this run before TrimCollectionMethod and say
so. If set to 0, it doesn't run.
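
To make the TrimCollectionByDuplicateSubject idea concrete, here is a rough
sketch in Python (ASSP itself is Perl, so this is only an illustration, not
a patch).  It assumes the subject can be recovered from the filename;
subject_key is a hypothetical helper passed in by the caller, and the
"--N.eml" naming used in the usage note is made up for the example.

```python
from collections import defaultdict

def pick_duplicate_subjects(filenames, subject_key, keep=3):
    """Return the filenames to delete so at most `keep` remain per subject.

    `subject_key` is a hypothetical caller-supplied function that maps a
    filename to its subject; how ASSP really names files is not assumed.
    """
    by_subject = defaultdict(list)
    for name in filenames:
        by_subject[subject_key(name)].append(name)

    doomed = []
    for names in by_subject.values():
        # Keep the first `keep` files of each subject, mark the rest.
        doomed.extend(names[keep:])
    return doomed
```

With five files named "viagra--1.eml" through "viagra--5.eml" and a key of
lambda n: n.rsplit("--", 1)[0], the last two would be marked for deletion
and the first three kept.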


As for my deleting code, here's the revised concept algorithm if you want to
code it yourself (or I'm happy to do it if you want).

1) Make sure that MaintBayesCollection doesn't delete based on date
automatically and that this setting just turns maintenance functions on or
off
2) Delete files based on age in the collections based on MaxBayesFileAge and
MaxCorrectedDays (changing to MaxCorrectedFileAge hopefully)
3) Delete same subject messages
4) Delete files based on the method in TrimCollectionMethod.  If by date,
use the functionality that you already had built into MaintBayesCollection.
If random,
a) count the files and calculate the percentage that we need to remove.
b) cycle through each file and run rand(100).  If rand(100) is under the
percentage that we calculated, delete the file.
For example, we have MaxFiles set to 2000.  There are 2500 files in a
folder.  We're 25% over, so we need to reduce the 2500 files by 20%
(500/2500), or 500 files.  Look at each file; if rand(100) < 20, delete the
file.  Since rand(100) should in theory be under 20 about 20% of the time,
this will reduce the corpus to approximately 2000.
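
Step 4b could look roughly like this.  It's a Python sketch rather than Perl,
and the function name and the rng parameter are mine, not anything in ASSP.
It only picks the files; the actual unlink is left out.

```python
import random

def pick_random_overage(filenames, max_files, rng=random):
    """Select each file for deletion with probability overage / total."""
    files = list(filenames)
    total = len(files)
    if total <= max_files:
        return []  # folder is not over the limit, nothing to trim

    # Fraction to remove, e.g. (2500 - 2000) / 2500 = 0.20
    fraction = (total - max_files) / total

    # rng.random() is uniform on [0, 1), so each file is picked with
    # probability `fraction`.  The result is only approximately
    # total - max_files files, never guaranteed to be exact.
    return [f for f in files if rng.random() < fraction]
```

Because every file gets an independent coin flip, a folder of 2500 will
usually land near 2000 but can end up a little over or under.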

4b takes some time, since we need to go through each file.  I'm not aware of
a way of cycling through files in a random order, not by date or internal
id.  If there is a way, we could simply cycle through that way for the first
500 and delete them.
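
For what it's worth, most languages can do this once the directory listing
is in memory: shuffle the list (or sample from it) and take the first N.
A Python sketch, with the same caveats as before (illustrative names, not
ASSP code):

```python
import random

def pick_exact_overage(filenames, max_files, rng=random):
    """Pick exactly (total - max_files) files at random, ignoring file date."""
    files = list(filenames)
    excess = len(files) - max_files
    if excess <= 0:
        return []  # already at or under the limit
    # sample() draws `excess` distinct entries without replacement,
    # which is equivalent to shuffling and taking the first `excess`.
    return rng.sample(files, excess)
```

This still needs one pass to list the folder, but it hits MaxFiles exactly
instead of approximately.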

Just my 1 1/2 cents.
On Tue, Sep 15, 2009 at 2:17 AM, Thomas Eckardt/eck <
thomas.ecka...@thockar.com> wrote:

> OK - doing 1) and 2) would be no problem - but there is an option you can
> use to get rid of the problem.
>
> - try to get a good corpus (spamdb), ignoring that there are more files than
> maxfiles.
> - if your spamdb is good enough, set the age (maintenance) values to your
> needs to reduce the number of files in the corpus
> - and set 'ReplaceOldSpamdb' to 0 - now the remaining messages in the corpus
> will be processed and the resulting records will be added to the (good)
> spamdb
>
> If you use any DB for the spamdb and the result of one of the last
> rebuilds is bad for whatever reason, you can easily go back (up to) 10
> versions by importing any older spamdb backup.
>
> However, I will try to add an option to maintain the corpus strictly to
> maxfiles per collection folder (your points 1 and 2)!
>
> Thomas
>
>
>
>
>
>
> K Post <nntp.p...@gmail.com>
> 15.09.2009 04:19
> Please reply to
> ASSP development mailing list <assp-test@lists.sourceforge.net>
>
>
> To
> ASSP development mailing list <assp-test@lists.sourceforge.net>
> Cc
>
> Subject
> Re: [Assp-test] Antwort: Re: Antwort: Re: fixes and news in 2.0.1_RC0.4.12
>
>
>
>
>
>
> I wish we could just use move2numb or straight number logging, but they want
> us to review the found spam periodically, and that would just be impossible
> if we couldn't use the subjects as a start...  Plus, move2numb would break
> the block reporting.
> Very good points on the fact that bayesian is last and that it'll never be
> perfect.  Most of our catches are happening before that too, but the
> bayesian filters are still catching a bunch.
>
> We do not have a well developed bombre, nor the time to keep one updated,
> which is a big problem.  This is for a charity, and there's just not the
> resources.
>
> As for duplicate messages not affecting the corpus, I agree, but with this
> limitation of needing subject filenames, I can't think of another way of
> limiting the number of messages in the file system other than randomly
> deleting messages.  If I'm going to do that, might as well first get rid of
> the duplicates to help keep the most diverse corpus, no?
>
> We don't have the luxury of another filter on the real smtp server either.
>
> We do get a lot of messages with the same subject.  The spammers will send
> the same message to 100 or more users.  If it makes it to the collection
> point of the code, we'll get the same message stored 100 times.  Our regexes
> catch a lot, but there's plenty of times that an email with a subject like
> "V__F_iGAra" will get past everything but bayesian filters.
>
> Sure, there's a strong possibility of the same subject having a different
> body, but deleting 98 of the 100 messages leaves room for 98 other messages
> which will most likely be more diverse.
>
> Since we definitely need to delete in some manner (which we want to be
> random) and move2numb isn't an option, my plan is to edit the maintenance
> functionality if subject logging is on and move2numb is off to:
> 1) First find messages with the same subject and delete all in excess of 3
> then
> 2) calculate if we're over the maxfiles for each folder and then randomly
> delete files based on the percent that we're over, in NO way based on file
> date.
> I've been doing this for 5 or so years with various 1.x installations, but
> in a scheduled script, not in ASSP.
>
> This change will honor the MaxBayesFileAge, MaxCorrectedDays and
> maxNoBayesFileAge (you might want to make MaxCorrectedDays use a naming
> convention more consistent like MaxCorrectedFileAge).
>
> Would you be interested in this code?  Doesn't it make sense for people like
> me who need to stick to subject logging?  Again, I'd much prefer an
> alternative from an expert like yourself, that still affords the ability to
> leave subject logging in place.
>
>
> - it was my pleasure to help clean up the admin descriptions.  I intend to
> do more as soon as there's a new version with the cleanup in place.
>
> On Mon, Sep 14, 2009 at 1:37 PM, Thomas Eckardt/eck <
> thomas.ecka...@thockar.com> wrote:
>
> > >What method do you recommend using to keep the number of files down? Simply
> > >deleting by age isn't going to cut it if you want a diverse corpus is it?
> >
> > The best way is:
> >
> > - do not use subject for filenames or use move2num
> > - never delete any file by age
> > - set maxfiles high enough to get a good corpus
> >
> > This is the way assp has worked for years. The new features are for special
> > usage - for example: I do not use the bayes engine (only for testing the
> > code). I have a very good spamdb, which has not changed for 2 years (doing
> > some small scoring) - the main way to detect spam is done by PB and
> > IP-tests (mx,ptr,helo,FBMVT......). Because some mails bypass assp (going
> > other ways), I use a bayes engine in Lotus Domino - just for fun -
> > there are 1-5 messages blocked per week.  But I love to see the subject in
> > Blockreports.
> >
> > And keep in mind - bayesian checks should be only a small part of spam
> > detection - because the simple (???) mathematics is not (could never be)
> > perfect.
> >
> > Setting up the bomb regexes and all PB-valence values the right way helps
> > much more than the bayes check. It will take some time (and some work) to
> > find out the best way (values) for you.
> > I think Fritz and I (and possibly some others) have found that point
> > - our ASSP detects 99.99% (or even more) of spam. I have not seen a blocked
> > good email for over one year.
> >
> > And in the end you have to weigh - accept that out of 200 users 10 are
> > getting one spam per day (having 30,000 or more connections) - or
> > analysing tons of mails and logs to get rid of the 10.
> >
> >
> > >So what are your thoughts, Thomas, of my remove same subjects,
> >
> > same subject - same email  => ignored by assp, only one mail is processed
> > in rebuildspamdb
> >
> > but
> >
> > same subject - different body ......  oh, the subject is only one part of
> > the header (from to msg-id ip forwarder .............) - where is the
> > end????
> > Just a joke! This will help to reduce the number of mails (files) for a
> > while, but not to increase the quality of the corpus (->spamdb) - what is
> > more important for you?
> > If you get tons of spams with the same subject, do not use bayes to block
> > them - use subjectRe or headerRe or blackRe or ........    .
> >
> > The bayesian check is one of the last checks of assp - so try to detect
> > spams before.
> >
> > Thank you for your help fixing the mistakes in config descriptions.
> >
> >
> > Thomas
> >
> >
> >
> >
> >
> >
> > K Post <nntp.p...@gmail.com>
> > 14.09.2009 16:53
> > Please reply to
> > ASSP development mailing list <assp-test@lists.sourceforge.net>
> >
> >
> > To
> > ASSP development mailing list <assp-test@lists.sourceforge.net>
> > Cc
> >
> > Subject
> > Re: [Assp-test] Antwort: Re: fixes and news in 2.0.1_RC0.4.12
> >
> >
> >
> >
> >
> >
> > Right, BUT if we're limiting the total number of files in the directory,
> > wouldn't it be better to delete these duplicates to give a more diverse
> > corpus?
> >
> > We can look at their subject name and then delete them.  This leaves room
> > for other files.
> >
> > Using subject logging and NOT move2numb, I guess I need some more
> > clarification at this point.
> >
> > What method do you recommend using to keep the number of files down? Simply
> > deleting by age isn't going to cut it if you want a diverse corpus, is it?
> >
> > So what are your thoughts, Thomas, of my remove same subjects, remove really
> > old by date, then remove a percentage (selecting randomly) based on the
> > overage in each folder method?
> >
> >
> >  Note: in the MaxBayesFileAge, you've got:
> > Do not define this option, if you use the bayesian engine of ASSP. Deleting
> > files because of there age, is wrong in this case!!!!! It should be "their
> > age."  There's a bunch of other errors like this which I privately emailed
> > to Fritz, on request.  Should I send you that email too?
> >
> >
> > THANKS
> >
> > On Sun, Sep 13, 2009 at 2:00 AM, Thomas Eckardt/eck <
> > thomas.ecka...@thockar.com> wrote:
> >
> > > >What do you think about deleting redundant corpus emails
> > > >based on the subject?
> > >
> > > Redundant corpus emails are skipped/deleted based on their content (md5
> > > hash).
> > >
> > > Thomas
> > >
> > >
> > >
> > >
> > > K Post <nntp.p...@gmail.com>
> > > 13.09.2009 03:28
> > > Please reply to
> > > ASSP development mailing list <assp-test@lists.sourceforge.net>
> > >
> > >
> > > To
> > > ASSP development mailing list <assp-test@lists.sourceforge.net>
> > > Cc
> > >
> > > Subject
> > > Re: [Assp-test] fixes and news in 2.0.1_RC0.4.12
> > >
> > >
> > >
> > >
> > >
> > >
> > >  This is great, and thanks SO much for adding my idea of the max days for
> > > corrected spam.  What do you think about deleting redundant corpus emails
> > > based on the subject?
> > >
> > > On Sat, Sep 12, 2009 at 1:18 PM, Thomas Eckardt/eck <
> > > thomas.ecka...@thockar.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'm back.
> > > >
> > > > fixed in 4.12:
> > > >
> > > > - for some messages the mail header was transferred two times
> > > > - changing the display language was not working in all cases
> > > > - the hintbox in the config part of the GUI showed wrong
> > > > updated/changed values
> > > >
> > > > added in 4.12
> > > >
> > > > MaxCorrectedDays
> > > > msg008590=Max Corrected File Age
> > > > msg008591=This is the number of days an error report will be kept in the
> > > > correctednotspam and correctedspam folders. These folders are the longterm
> > > > memory of ASSP, therefore the default is 1000 days.
> > > >
> > > > changed in 4.12:
> > > >
> > > > - the change language part is moved to the main config form !
> > > >
> > > >
> > > >
> > > > Thomas
> > > >
> > > > DISCLAIMER:
> > > > *******************************************************
> > > > This email and any files transmitted with it may be confidential,
> > > legally
> > > > privileged and protected in law and are intended solely for the use
> of
> > > the
> > > >
> > > > individual to whom it is addressed.
> > > > This email was multiple times scanned for viruses. There should be
> no
> > > > known virus in this email!
> > > > *******************************************************
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
> ------------------------------------------------------------------------------
> > > > Let Crystal Reports handle the reporting - Free Crystal Reports 2008
> > > 30-Day
> > > > trial. Simplify your report design, integration and deployment - and
> > > focus
> > > > on
> > > > what you do best, core application coding. Discover what's new with
> > > > Crystal Reports now.  http://p.sf.net/sfu/bobj-july
> > > > _______________________________________________
> > > > Assp-test mailing list
> > > > Assp-test@lists.sourceforge.net
> > > > https://lists.sourceforge.net/lists/listinfo/assp-test
> > > >
> > >
> > >
> >
> >
>
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
>
> > >
> >
> >
>
> >
> >
> >
> >
> >
> >
> >
>
> ------------------------------------------------------------------------------
> > Come build with us! The BlackBerry® Developer Conference in SF, CA
> > is the only developer event you need to attend this year. Jumpstart your
> > developing skills, take BlackBerry mobile applications to market and stay
> > ahead of the curve. Join us from November 9-12, 2009. Register now!
> > http://p.sf.net/sfu/devconf
> >  _______________________________________________
> > Assp-test mailing list
> > Assp-test@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/assp-test
> >
>
>
>
>
>
>
>
>
