>If we do this, we can remove what's probably the same message

rebuildspamdb parses the content of all files (except those marked for
deletion) and computes an MD5 hash over the MIME-decoded and cleaned-up
body, so we know exactly (not probably!!) whether the same message content
was processed before.
If so, all other identical messages are ignored. If the content is not
identical but similar, processing such files makes the spamdb more
accurate - which is the goal!
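
To illustrate the duplicate check described above, here is a minimal sketch in Python (not the Perl code of rebuildspamdb). The function names and the cleanup step (collapsing whitespace, lowercasing) are assumptions made for the example; the thread does not say which cleanup rebuildspamdb really applies.

```python
import hashlib
from email import message_from_bytes


def body_digest(raw: bytes) -> str:
    """MD5 over the MIME-decoded, cleaned-up text parts of one message."""
    msg = message_from_bytes(raw)
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        # Assumed cleanup: collapse whitespace and lowercase the body.
        chunks.append(b" ".join(payload.split()).lower())
    return hashlib.md5(b"\n".join(chunks)).hexdigest()


def unique_messages(raw_messages):
    """Yield only the first message seen for each body digest."""
    seen = set()
    for raw in raw_messages:
        digest = body_digest(raw)
        if digest in seen:
            continue  # identical body already processed, headers do not matter
        seen.add(digest)
        yield raw
```

Because only the decoded body goes into the hash, two mails with different subjects or transfer encodings but the same content produce the same digest, while similar but not identical bodies are both kept.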

>since duplicates aren't considered by assp

This is not and was never the case!!! Where did you get this 'knowledge'
from?

>and then delete randomly

If we do anything randomly, we lose control - I prefer an intelligent
approach, even if it takes more time.


But you are right - it is possible that many messages with the same
subject get collected. This is because assp (V2) adds a unique number
(counter 1 - 999.999) to the filename (since RC0.2.xx) if a file is
collected in the discarded folder or 'UseSubjectsAsMaillogNames' is selected.
(In the early versions this was not a problem: the next message with the
same subject simply overwrote the file of the previous one.)
This is done for the following reasons:
- More than one SMTP-worker may receive a message with the same subject at
  the same time, and all of them would try to open/write the same collection
  file at the same time, which could cause stuck workers or unexpected
  restarts.
- If a file was overwritten and a user requested a resend of a blocked mail,
  it was possible that he got the wrong mail.
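
The naming problem described above can be sketched as follows, in Python rather than assp's Perl. The counter range comes from the text; the filename pattern, the `.eml` extension, the `safe_subject` helper and the use of an exclusive create are assumptions for illustration, not a description of what assp does internally.

```python
import os
import re

MAX_COUNTER = 999_999  # counter range mentioned in the text


def safe_subject(subject: str) -> str:
    """Hypothetical sanitizer: keep only filename-friendly characters."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", subject).strip("_")
    return cleaned[:60] or "no_subject"


def create_collection_file(folder: str, subject: str) -> str:
    """Create a new, never-before-used file for this subject and return its path."""
    base = safe_subject(subject)
    for counter in range(1, MAX_COUNTER + 1):
        path = os.path.join(folder, f"{base}--{counter}.eml")
        try:
            # O_EXCL makes creation fail if the file already exists, so two
            # workers can never end up writing to the same collection file.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            continue  # name taken (possibly by another worker), try the next one
        os.close(fd)
        return path
    raise RuntimeError("no free counter value left for this subject")
```

With a unique name per message nothing is overwritten, which is what keeps a later resend request pointing at the right mail. The price is that the folder grows with every message that shares a subject.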

There is a good reason to delete the oldest files first: as I've told you
before, random deletion of files (completely ignoring the file age) will
possibly break the BlockReport - resend function.
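
A minimal sketch of the deletion rule, based on the description quoted further down in this thread (age limit of 0: delete the oldest files until MaxFiles is reached; age limit not 0: delete everything older than the limit). It is written in Python for illustration only; the function name and the (name, mtime) input format are assumptions.

```python
import time

DAY = 86400  # seconds per day


def files_to_delete(files, max_files, max_age_days, now=None):
    """Return the names that should be removed from one collection folder.

    files        -- list of (name, mtime) tuples, mtime in epoch seconds
    max_files    -- target number of files per folder (MaxFiles)
    max_age_days -- age limit in days, 0 disables age-based deletion
    """
    now = time.time() if now is None else now
    if max_age_days > 0:
        # Age-based: delete every file older than the limit.
        cutoff = now - max_age_days * DAY
        return [name for name, mtime in files if mtime < cutoff]
    # Count-based: delete the oldest files until max_files remain.
    excess = len(files) - max_files
    if excess <= 0:
        return []
    oldest_first = sorted(files, key=lambda item: item[1])
    return [name for name, _ in oldest_first[:excess]]
```

In both branches the most recent files always survive, so a recently blocked mail is still on disk when a user asks for a resend. A random choice gives no such guarantee.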

But this discussion misses the concept of rebuildspamdb in V2.

Here is that concept - I hope it is clear enough:

What should you do?
1)
Collect a good corpus, ignoring the file age (MaintBayesCollection == 0),
and maintain the correction folders (do your very best !!!). This will
result in a good spamdb. You may also use a good spamdb from a friend.
2)
From this point on, there is no need to keep old files in the corpus, because
the result of those files is already in our spamdb. Now set ReplaceOldSpamdb
to 0, so that in future only new and corrected word pairs (we have to pay
attention to the correction folders) are written into the spamdb. To keep the
number of files in the corpus at the level you need, set MaintBayesCollection,
MaxBayesFileAge, MaxCorrectedDays, MaxNoBayesFileAge and
MaxFileAgeSchedule to values that suit you. Keep an eye on the correction
folders to prevent bad corrections!

More details:
As you can see, the concept has changed substantially: we no longer use only
the corpus, we use the existing spamdb together with the corpus, which is
much more exact. Large parts of the long-term memory of our corpus have moved
into the spamdb. Or better, our long-term memory has grown enormously. So
even if the corpus gets bad or corrupt because of wrong
collection/correction, new spammer behavior or any other coincidence,
our spamdb will stay in a consistent, current state. We leave nothing to
chance, as we would by doing anything randomly!
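
The difference between the two modes can be sketched like this, in Python for illustration. The thread only says that word pairs are written into the spamdb; storing a [spam count, ham count] per word pair and adding new counts to old ones is an assumption of this example, as are the names `count_pairs` and `rebuild`. The real spamdb format and update rule may differ.

```python
from collections import Counter


def count_pairs(text: str) -> Counter:
    """Count adjacent word pairs in one (already cleaned) message body."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def rebuild(old_db, spam_texts, ham_texts, replace_old):
    """Return a new db mapping word pair -> [spam_count, ham_count].

    replace_old=True  -- start from scratch, only the current corpus counts
    replace_old=False -- start from the existing db and add the new corpus
    """
    db = {} if replace_old else {pair: list(c) for pair, c in old_db.items()}
    for column, texts in enumerate((spam_texts, ham_texts)):
        for text in texts:
            for pair, n in count_pairs(text).items():
                db.setdefault(pair, [0, 0])[column] += n
    return db
```

With `replace_old=False` a word pair learned from a file that has since been deleted keeps its weight, which is the "long-term memory" effect described above. With `replace_old=True` everything not present in the current corpus is forgotten.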

If someone wants to use the old concept (build a completely new spamdb
from the currently existing, more or less randomly built corpus), this is
possible by leaving ReplaceOldSpamdb at the default value (1) and
deselecting MaintBayesCollection. But in this case the old collection
concept should also be used (deselect UseSubjectsAsMaillogNames or use
doMove2Num), and you have to accept that some of the new features will not
work as expected (for example: BlockReports - resend).

Maybe the config description is not clear enough to understand the 
concepts - but a description made by a developer is never the best.

hope this helps

Thomas







K Post <nntp.p...@gmail.com>
15.09.2009 23:55
Please reply to
ASSP development mailing list <assp-test@lists.sourceforge.net>


To
ASSP development mailing list <assp-test@lists.sourceforge.net>
Cc

Subject
Re: [Assp-test] Antwort: Re: Antwort: Re: Antwort: Re: Antwort:
Re: fixes and news in 2.0.1_RC0.4.12






I'll try to simplify my discussion a bit

1) It's my understanding that currently files are only deleted when subject
logging is on and move2numb is off, and then by date.  Yes?  I want to see
random deletion in 0.4.14

2) We agree that deletion by date isn't the best for bayesian filtering,
yes?  If so, then I want to keep the number of files closer to maxfile by
first removing what is probably a duplicate email.  Easiest way to do this
that I've thought of: delete based on subject names.  If we do this, we can
remove what's probably the same message, and then delete randomly to get
down to the maxfiles number of files.  That'll leave more unique messages
which is important since duplicates aren't considered by assp.

3) I'm confused by the MaintBayesCollection option.  I use bayesian, I do
NOT want the folders to have files removed automatically, oldest first to
get to maxfiles.  I want to do it by subject trimming first, then randomly.
My point previously is that the description in admin for
MaintBayesCollection suggests that files will be deleted by date.  This
doesn't have anything to do with MaxNoBayesFileAge, etc. does it?  The max
file age options say things like "A value of 0 disables this feature and no
file will be deleted because of its age" but does this override the
processing that the admin server says will happen if maintbayescollection
is checked? (deleting based on age to get to maxfiles)

4) You don't have the min option in ASSP now do you?  I think that Brett and
I are basically saying the same thing here.  I like the TTL language, though
min would be more consistent IMO.

On Tue, Sep 15, 2009 at 1:31 PM, Thomas Eckardt/eck <
thomas.ecka...@thockar.com> wrote:

> I do not understand the discussion !
>
> All of these wishes are already built into assp, except removing mails with
> the same subject - I do not love this idea, because the subject is ignored
> by rebuildspamdb - only the body is used and mails with the same body are
> ignored (except one) and will be deleted 60 days later.
>
> -------------------------------------------
> ['MaintBayesCollection','Maintenance for Bayesian
> Collection',0,\&checkbox,'','(.*)',undef,
>  'Set this to on, if you want ASSP to run a maintenance tasks on the
> bayesian collection folders ( spamlog , notspamlog , correctedspam ,
> correctednotspam ). ASSP will delete the oldest files until the number of
> files per folder reaches MaxFiles. If you want ASSP to delete files
> because of their age instead of the number of files ( MaxFiles ), setup
> MaxBayesFileAge and/or MaxCorrectedDays to your needs.<br />
>  This option is usefull, if UseSubjectsAsMaillogNames is set to on and
> doMove2Num is set to off, because in this case the number of files in
> every collection folder will grow
> infinite.',undef,undef,'msg006140','msg006141'],
>
> ['MaxBayesFileAge','Max Age of Bayes
> Files',10,\&textinput,0,'(\d+)',undef,
>  'The maximum file age in days of every file in every bayesian collection
> folder ( spamlog , notspamlog ). If MaintBayesCollection is set to on and
> a file is older than this number in days, the file will be deleted.
> Default is 0. A value of 0 disables this feature and no file will be
> deleted because of its age.<br />
>  <span class = "negative">Do not define this option, if you use the
> bayesian engine of ASSP. Deleting files because of there age, is wrong in
> this case!!!!!</span>',undef,undef,'msg006150','msg006151'],
>
> ['MaxCorrectedDays','Max Corrected File
> Age',5,\&textinput,'1000','(\d+)',undef,'This is the number of days a
> error report will be kept in the correctednotspam and correctedspam
> folders. These folders are the longterm memory of ASSP, therefore the
> default is 1000 days. ',undef,undef,'msg008590','msg008591'],
>
> ['MaxNoBayesFileAge','Max Age of non Bayes
> Files',10,\&textinput,0,'(\d+)',undef,
>  'The maximum file age in days of every file in every non bayesian
> collection folder ( incomingOkMail , discarded , viruslog ). If defined
> and a file is older than this number in days, the file will be deleted.
> Default is 0. A value of 0 disables this feature and no file will be
> deleted because of its age.',undef,undef,'msg006160','msg006161'],
> ---------------------------------------------
>
> If MaintBayesCollection is set to on - it is your choice to set the rest to
> your needs.
>
> - MaxBayesFileAge/MaxNoBayesFileAge   ==   0       - reduce the number of
> files to maxfiles by deleting the oldest
> - MaxBayesFileAge/MaxNoBayesFileAge   !=   0       - reduce the number of
> files by deleting all that are older than XX
>
> - MaxCorrectedDays - these files should never be deleted (use 1000000)
>
> And keep in mind - if the number of files per folder is reduced to
> maxfiles at 1:00 AM and rebuildspamdb is running at 11:00 PM -
> rebuildspamdb has to process possibly much more than maxfiles!
>
> Currently there is a mistake in this maint-task: the files with the
> filedate set to 60 days in future, are the last files that will be deleted
> - this will be fixed in 4.14
>
> Thomas
>
>
>
>
>
>
> "GrayHat" <gray...@gmx.net>
> 15.09.2009 18:35
> Please reply to
> GrayHat <gray...@gmx.net>; Please reply to
> ASSP development mailing list <assp-test@lists.sourceforge.net>
>
>
> To
> "ASSP development mailing list" <assp-test@lists.sourceforge.net>
> Cc
>
> Subject
> Re: [Assp-test] Antwort: Re: Antwort: Re: Antwort: Re: fixes and news in
> 2.0.1_RC0.4.12
>
>
>
>
>
>
>  >> Hmm... that sounds like an idea which was brought on some
> >> time ago (John was still the dev for ASSP at the time); that
> >> is, set up some kind of TTL parameter for corpus files so
> >> that the spamdb rebuild should check the file date/time and
> >> if over the TTL (say "n" days) it should then delete the file.
>
> > My thought is that the "TTL" would only be in effect for the purpose
> > of keeping BlockReporting working (for however many days or
> > weeks you wish the emails to be guaranteed resendable).
> > After that time, the TTL is null and the files are game for
> > replacement.  I thought it a simple idea for working around
> > the BlockReporting problem Thomas mentioned.
>
> I see, but there's no need to store something along with files,
> the regular filesystem timestamp for each file will just work
> fine, just remove all files if "(today - filetime) > TTL"
>
> > On a low-to-medium traffic box, though, this would not be a
> >  problem. We already deal with bunches of identical
> > messages from time-to-time (nothing new).
>
> there may be a solution for that too, assuming the spam and
> notspam folders gets cleaned up using the TTL, the files may
> be saved using (e.g.) an MD5 hash (or the like) as the name
> so that identical messages won't be stored more than one
> time; by the way that may have some side effects and may
> need some more thinking but...
>
> >> Bottom line; the bayes filter should work by /learning/ this
> >> means that it should NOT discard the previous data, but
> >> rather REFINE them from further data coming in; so maybe the
> >> whole bayes approach used inside ASSP should be revised NOT
> >> to deal just with the latest data but to learn/improve during time
>
> > Just an idea, but how do you "NOT" discard data while keeping
> > rebuild times low and maintaining free hard drive space
> > (realistically)?
>
> Using some kind of "digest" of the previous bases stored in a
> more compact format
>
>
>
>
> ------------------------------------------------------------------------------
> Come build with us! The BlackBerry® Developer Conference in SF, CA
> is the only developer event you need to attend this year. Jumpstart your
> developing skills, take BlackBerry mobile applications to market and stay
> ahead of the curve. Join us from November 9-12, 2009. Register
> now!
> http://p.sf.net/sfu/devconf
> _______________________________________________
> Assp-test mailing list
> Assp-test@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/assp-test
>
>
>
>
> DISCLAIMER:
> *******************************************************
> This email and any files transmitted with it may be confidential, legally
> privileged and protected in law and are intended solely for the use of the
> individual to whom it is addressed.
> This email was multiple times scanned for viruses. There should be no
> known virus in this email!
> *******************************************************
>
>