Mauricio Tavares writes:
> On Wed, Dec 11, 2019 at 2:05 PM Giovanni Bechis wrote:
>> On 12/11/19 8:00 PM, Mauricio Tavares wrote:
>> > I asked the project owner if I could put FuzzyOcr on GitHub. He said
>> > go for it, so it is now at https://github.com/raubvogel/FuzzyOcr.
>> Cool,
Bill Cole wrote on 2019-12-11 15:17:
> FuzzyOcr is unmaintained and doesn't even have an authoritative
> repository as far as I can tell. It is computationally very expensive,
> to the degree that it isn't safe to just add it to an existing mail
> system which does not have a lot of idle CPU and memory
On 11 Dec 2019, at 2:39, Giovanni Bechis wrote:
On 12/11/19 6:21 AM, KADAM, SIDDHESH wrote:
Hi PFA...
On 12/11/2019 12:36 AM, Giovanni Bechis wrote:
On 12/10/19 7:49 PM, Michael Storz wrote:
[...]
My copy hit
BODY_SINGLE_WORD=1.347, HTML_IMAGE_ONLY_04=1.172,
MPART_ALT_DIFF=0.79
not enough ...
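For context (assuming SpamAssassin's stock required_score of 5.0, which is not stated in the thread), those three hits alone do not cross the spam threshold:

# Quick arithmetic on the rule hits quoted above (scores as given in the thread).
hits = {
    "BODY_SINGLE_WORD": 1.347,
    "HTML_IMAGE_ONLY_04": 1.172,
    "MPART_ALT_DIFF": 0.79,
}
total = sum(hits.values())
print(f"total score: {total:.3f}")   # 3.309, below the default required_score of 5.0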
On 11 Dec 2019, at 7:58, Henrik K wrote:
> SA does not and should not do any kind of content decoding/mangling
> for text/plain contents.
Minor point:
SA does (as it should) decode Base64 or Quoted-Printable text/* MIME
parts to a canonical binary form of whatever character set is being
used, ...
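As an illustration of the decoding Bill describes (a sketch using Python's standard email module, not SpamAssassin's own code; the sample message is made up), a Base64 text/* part is first decoded back to bytes and only then interpreted in its declared charset:

# Minimal sketch: undo the Base64/Quoted-Printable transfer encoding of a
# text/* part, then decode the resulting bytes using the declared charset.
import email
from email import policy

raw = b"""\
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

SGVsbG8sIHRoaXMgd2FzIGJhc2U2NC1lbmNvZGVkIHRleHQu
"""

msg = email.message_from_bytes(raw, policy=policy.default)
part = msg  # single-part message here; use msg.walk() for multiparts

payload_bytes = part.get_payload(decode=True)   # undo base64/quoted-printable
charset = part.get_content_charset("utf-8")     # declared charset, fallback assumed
print(payload_bytes.decode(charset, errors="replace"))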
On Wed, Dec 11, 2019 at 01:58:03PM +0100, Matus UHLAR - fantomas wrote:
> My question was whether there's a bug in the bayes code causing it to eat
> too much memory. Both ~750B per token with file-based bayes and ~600B per
> token in redis-based BAYES look like too much to me.
Not so much a bug ...
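A rough sanity check on those numbers (plain arithmetic on the figures quoted in this thread, nothing measured here): at ~750B or ~600B per token, the ~6 million tokens mentioned later in the thread do land in the multi-gigabyte range, consistent with a 4G machine swelling to 4.5G:

# Sketch: estimate the in-memory cost of tokenizing one huge message,
# using the per-token figures quoted in this thread (assumptions, not internals).
tokens = 6_000_000           # tokens reported for the ~20M uuencoded mail
per_token = {
    "file-based bayes": 750,   # ~bytes per token, as observed in the thread
    "redis-based bayes": 600,  # ~bytes per token, as observed in the thread
}

for backend, cost in per_token.items():
    gib = tokens * cost / 2**30
    print(f"{backend}: ~{gib:.1f} GiB")   # ~4.2 GiB and ~3.4 GiB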
On Wed, Dec 11, 2019 at 01:44:35PM +0100, Matus UHLAR - fantomas wrote:
> old school "attachment": no Content-Type, plaintext, uuencode inline.
> I don't think SA decodes that.
> I don't know if it should, but at least detection should be OK.
When Content-Type is missing, the part is assumed to be text/plain ...
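That matches the MIME default (RFC 2045: a body with no Content-Type is treated as text/plain in US-ASCII). A small check with Python's email module, purely for illustration (not SA's parser; the message content is made up):

import email
from email import policy

# A made-up old-school message: no Content-Type header, uuencoded blob inline.
raw = b"""\
From: someone@example.com
Subject: old-school inline uuencode

begin 644 data.bin
M(placeholder uuencoded data)
end
"""

msg = email.message_from_bytes(raw, policy=policy.default)
print(msg.get_content_type())                   # text/plain  (the MIME default)
print(msg.get_content_charset() or "us-ascii")  # no charset declared either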
On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote:
> will it prefer text parts and try to avoid uuencoded or base64 parts?
> (or maybe decode them?)
To clarify, of course SA decodes base64 parts. Base64 is standard MIME
transfer encoding. It's decoded to reveal the actual ...
On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote:
> On 11.12.19 11:43, Henrik K wrote:
> > Wow 6 million tokens.. :-)
> >
> > I assume the big uuencoded blob content-type is text/* since it's tokenized?
> yes, I mentioned that in previous mails. ~15M file, uuencoded in ~20M mail.
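As a sanity check on those sizes (simple arithmetic, not anything SA-specific): uuencode turns every 45 input bytes into a 62-character line (length byte + 60 encoded characters + newline), so a ~15M file encodes to roughly 20M:

# uuencode overhead: 45 raw bytes -> 1 length char + 60 encoded chars + newline
raw_size = 15 * 1024**2
encoded_size = raw_size * 62 / 45
print(f"~{encoded_size / 1024**2:.1f} MB encoded")   # ~20.7 MB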
hmmm, the machine has 4G of RAM and SA now takes 4.5G.
The check runs out of time but produces a ~450K debug file.
This is where it hangs:
Dec 10 17:43:51.727 [9721] dbg: bayes: tokenized header: 211 tokens
On 10.12.19 22:52, RW wrote:
> What are the full counts if you put it through 'grep tokeniz
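One way to get the per-section counts RW is asking about (an illustrative sketch; the debug file name is assumed, and the pattern follows the dbg line shown above):

import re
from collections import Counter

# Tally "bayes: tokenized <section>: N tokens" lines from the debug output.
counts = Counter()
with open("sa-debug.log") as fh:   # assumed name for the ~450K debug file
    for line in fh:
        m = re.search(r"bayes: tokenized (\w+): (\d+) tokens", line)
        if m:
            counts[m.group(1)] += int(m.group(2))

for section, total in counts.items():
    print(section, total)          # e.g. header 211, plus whatever other sections appear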