Re: Bitcoin ransom mail

2019-12-11 Thread Olivier
Mauricio Tavares writes: > On Wed, Dec 11, 2019 at 2:05 PM Giovanni Bechis wrote: >> >> On 12/11/19 8:00 PM, Mauricio Tavares wrote: >> >> > I asked the project owner if I could put fuzzyocr on github. He said >> > go for it, so it is now at https://github.com/raubvogel/FuzzyOcr. >> > >> Cool, >

Re: Bitcoin ransom mail

2019-12-11 Thread Mauricio Tavares
On Wed, Dec 11, 2019 at 2:05 PM Giovanni Bechis wrote: > > On 12/11/19 8:00 PM, Mauricio Tavares wrote: > > On Wed, Dec 11, 2019 at 1:58 PM Giovanni Bechis wrote: > >> > >> On 12/11/19 3:17 PM, Bill Cole wrote: > >>> On 11 Dec 2019, at 2:39, Giovanni Bechis wrote: > >>> > On 12/11/19 6:21 AM

Re: Bitcoin ransom mail

2019-12-11 Thread Giovanni Bechis
On 12/11/19 8:00 PM, Mauricio Tavares wrote: > On Wed, Dec 11, 2019 at 1:58 PM Giovanni Bechis wrote: >> >> On 12/11/19 3:17 PM, Bill Cole wrote: >>> On 11 Dec 2019, at 2:39, Giovanni Bechis wrote: >>> On 12/11/19 6:21 AM, KADAM, SIDDHESH wrote: > Hi PFA... > > On 12/11/2019 12:36

Re: Bitcoin ransom mail

2019-12-11 Thread Mauricio Tavares
On Wed, Dec 11, 2019 at 1:58 PM Giovanni Bechis wrote: > > On 12/11/19 3:17 PM, Bill Cole wrote: > > On 11 Dec 2019, at 2:39, Giovanni Bechis wrote: > > > >> On 12/11/19 6:21 AM, KADAM, SIDDHESH wrote: > >>> Hi PFA... > >>> > >>> On 12/11/2019 12:36 AM, Giovanni Bechis wrote: > On 12/10/19 7:

Re: Bitcoin ransom mail

2019-12-11 Thread Giovanni Bechis
On 12/11/19 3:17 PM, Bill Cole wrote: > On 11 Dec 2019, at 2:39, Giovanni Bechis wrote: > >> On 12/11/19 6:21 AM, KADAM, SIDDHESH wrote: >>> Hi PFA... >>> >>> On 12/11/2019 12:36 AM, Giovanni Bechis wrote: On 12/10/19 7:49 PM, Michael Storz wrote: [...] > My copy hit > > BODY

Re: Bitcoin ransom mail

2019-12-11 Thread Henrik K
On Wed, Dec 11, 2019 at 04:05:50PM +0100, Benny Pedersen wrote: > Bill Cole skrev den 2019-12-11 15:17: > > >FuzzyOcr is unmaintained and doesn't even have an authoritative > >repository as far as I can tell. It is computationally very expensive, > >to the degree that it isn't safe to just add it

Re: Bitcoin ransom mail

2019-12-11 Thread Benny Pedersen
Bill Cole skrev den 2019-12-11 15:17: FuzzyOcr is unmaintained and doesn't even have an authoritative repository as far as I can tell. It is computationally very expensive, to the degree that it isn't safe to just add it to an existing mail system which does not have a lot of idle CPU and memory

Re: Bitcoin ransom mail

2019-12-11 Thread Bill Cole
On 11 Dec 2019, at 2:39, Giovanni Bechis wrote: On 12/11/19 6:21 AM, KADAM, SIDDHESH wrote: Hi PFA... On 12/11/2019 12:36 AM, Giovanni Bechis wrote: On 12/10/19 7:49 PM, Michael Storz wrote: [...] My copy hit BODY_SINGLE_WORD=1.347, HTML_IMAGE_ONLY_04=1.172, MPART_ALT_DIFF=0.79 not enoug

Re: SA memory (Re: ".*" in body rules)

2019-12-11 Thread Bill Cole
On 11 Dec 2019, at 7:58, Henrik K wrote: SA does not and should not do any kind of content decoding/mangling for text/plain contents. Minor point: SA does (as it should) decode Base64 or Quoted-Printable text/* MIME parts to a canonical binary form of whatever character set is being used,

Re: SA memory (Re: ".*" in body rules)

2019-12-11 Thread Henrik K
On Wed, Dec 11, 2019 at 01:58:03PM +0100, Matus UHLAR - fantomas wrote: > > My question was, if there's a bug in the bayes code, causing it to eat too > much of memory. Both ~750B per token with file-based bayes or ~600B per > token in redis-based BAYES looks like too much for me. Not so much a b

Re: SA memory (Re: ".*" in body rules)

2019-12-11 Thread Henrik K
On Wed, Dec 11, 2019 at 01:44:35PM +0100, Matus UHLAR - fantomas wrote: > > old school "attachment" no Content-Type, plaintext, uuencode inline. > I don't think SA decodes that. > I don't know if it should, but at least detection should be OK. When Content-Type is missing, part is assumed to be t

Re: SA memory (Re: ".*" in body rules)

2019-12-11 Thread Matus UHLAR - fantomas
>On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote: >>On 11.12.19 11:43, Henrik K wrote: >>>Wow 6 million tokens.. :-) >>> >>>I assume the big uuencoded blob content-type is text/* since it's tokenized? >>yes, I mentioned that in previous mails. ~15M file, uuencoded in ~20M m

Re: SA memory (Re: ".*" in body rules)

2019-12-11 Thread Matus UHLAR - fantomas
On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote: will it prefer text parts and try to avoid uuencoded or base64 parts? (or maybe decode them?) On 11.12.19 14:35, Henrik K wrote: To clarify, of course SA decodes base64 parts. Base64 is standard MIME transfer encoding. I

Re: SA memory (Re: ".*" in body rules)

2019-12-11 Thread Henrik K
On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote: > > will it prefer text parts and try to avoid uuencoded or base64 parts? > (or maybe decode them?) To clarify, of course SA decodes base64 parts. Base64 is standard MIME transfer encoding. It's decoded to reveal the actual

Re: SA memory (Re: ".*" in body rules)

2019-12-11 Thread Henrik K
On Wed, Dec 11, 2019 at 01:12:46PM +0100, Matus UHLAR - fantomas wrote: > >On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote: > >>On 11.12.19 11:43, Henrik K wrote: > >>>Wow 6 million tokens.. :-) > >>> > >>>I assume the big uuencoded blob content-type is text/* since it's > >

Re: SA memory (Re: ".*" in body rules)

2019-12-11 Thread Matus UHLAR - fantomas
On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote: On 11.12.19 11:43, Henrik K wrote: >Wow 6 million tokens.. :-) > >I assume the big uuencoded blob content-type is text/* since it's tokenized? yes, I mentioned that in previous mails. ~15M file, uuencoded in ~20M mail. gr

Re: SA memory (Re: ".*" in body rules)

2019-12-11 Thread Henrik K
On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote: > On 11.12.19 11:43, Henrik K wrote: > >Wow 6 million tokens.. :-) > > > >I assume the big uuencoded blob content-type is text/* since it's tokenized? > > yes, I mentioned that in previous mails. ~15M file, uuencoded in ~20M m

Re: SA memory (Re: ".*" in body rules)

2019-12-11 Thread Matus UHLAR - fantomas
On Wed, Dec 11, 2019 at 10:04:56AM +0100, Matus UHLAR - fantomas wrote: >>hmmm, the machine has 4G of RAM and SA now takes 4.5. >>The check ran out of time but produces ~450K debug file. >> >>This is where it hangs: >> >>Dec 10 17:43:51.727 [9721] dbg: bayes: tokenized header: 211 tokens On 10.

Re: SA memory (Re: ".*" in body rules)

2019-12-11 Thread Henrik K
On Wed, Dec 11, 2019 at 10:04:56AM +0100, Matus UHLAR - fantomas wrote: > >>hmmm, the machine has 4G of RAM and SA now takes 4.5. > >>The check ran out of time but produces ~450K debug file. > >> > >>This is where it hangs: > >> > >>Dec 10 17:43:51.727 [9721] dbg: bayes: tokenized header: 211 toke

Re: SA memory (Re: ".*" in body rules)

2019-12-11 Thread Matus UHLAR - fantomas
hmmm, the machine has 4G of RAM and SA now takes 4.5. The check ran out of time but produces ~450K debug file. This is where it hangs: Dec 10 17:43:51.727 [9721] dbg: bayes: tokenized header: 211 tokens On 10.12.19 22:52, RW wrote: What are the full counts if you put it through 'grep tokeniz