Hi,

On Sat, Aug 17, 2024 at 12:14 PM <giova...@paclan.it> wrote:

> On 8/16/24 2:03 PM, Alex wrote:
> > The body was empty with a PDF attachment. It's too big for pastebin.
> >
> https://drive.google.com/file/d/1FzBgTKoBgRp7TWkqjWqSqqESYmCGH0G2/view?usp=sharing
> >
> > Any success stories with setting up zbar for QR code spam would also be
> > appreciated :-)
>
> With this rule the QR-code is extracted correctly.
>
> extracttext_external    zbar            /usr/local/bin/zbarimg -q -D {}
> extracttext_use         zbar            .jpg .png .pdf .webp image/(?:jpeg|png) application/pdf
> add_header              all             ExtractText-Uris _EXTRACTTEXTURIS_
>

Is it possible that zbar is competing with pdftotext over the same
attachment? It looks like zbar is either unable to identify the image or
unable to extract the link, perhaps because pdftotext is processing the PDF
instead. These are the headers I get:

X-Spam-ExtractText-Uris:
X-Spam-ExtractText-Chars: 323
X-Spam-ExtractText-Words: 35
X-Spam-ExtractText-Tools: pdftotext
X-Spam-ExtractText-Types: application/pdf
X-Spam-ExtractText-Extensions: pdf
X-Spam-ExtractText-Flags:

Here's my ExtractText.cf. I've verified that all the paths exist. Hopefully
Gmail doesn't truncate the lines. The message does hit EXTRACTTEXT.

extracttext_external  pdftotext  /usr/bin/pdftotext -nopgbrk -layout -enc UTF-8 {} -
extracttext_use       pdftotext  .pdf application/pdf

# http://docx2txt.sourceforge.net
extracttext_external  docx2txt   /usr/local/bin/docx2txt.pl {} -
extracttext_use       docx2txt   .docx application/docx

extracttext_external  antiword   /usr/bin/antiword -t -w 0 -m UTF-8.txt {}
extracttext_use       antiword   .doc application/(?:vnd\.?)?ms-?word.*

extracttext_external  unrtf      /usr/bin/unrtf --nopict {}
extracttext_use       unrtf      .doc .rtf application/rtf text/rtf

extracttext_external  odt2txt    /usr/bin/odt2txt --encoding=UTF-8 {}
extracttext_use       odt2txt    .odt .ott application/.*?opendocument.*text
extracttext_use       odt2txt    .sdw .stw application/(?:x-)?soffice application/(?:x-)?starwriter

extracttext_external  tesseract  {OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -
extracttext_use       tesseract  .jpg .png .bmp .tif .tiff image/(?:jpeg|png|x-ms-bmp|tiff)

# QR-code decoder
extracttext_external    zbar            /usr/bin/zbarimg -q -D {}
extracttext_use         zbar            .jpg .png .pdf .webp image/(?:jpeg|png) application/pdf
add_header              all             ExtractText-Uris _EXTRACTTEXTURIS_

add_header   all          ExtractText-Flags _EXTRACTTEXTFLAGS_
header       PDF_NO_TEXT  X-ExtractText-Flags =~ /\bpdftotext_NoText\b/
describe     PDF_NO_TEXT  PDF without text
score        PDF_NO_TEXT  0.001

header       DOC_NO_TEXT  X-ExtractText-Flags =~ /\b(?:antiword|openxml|unrtf|odt2txt)_NoText\b/
describe     DOC_NO_TEXT  Document without text
score        DOC_NO_TEXT  0.001

header       EXTRACTTEXT  exists:X-ExtractText-Flags
describe     EXTRACTTEXT  Email processed by extracttext plugin
score        EXTRACTTEXT  0.001
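
One way I could test the competition theory is to temporarily take PDF away
from pdftotext, so that zbar is the only tool registered for .pdf and
application/pdf. This is only a diagnostic sketch, not a fix. It assumes the
plugin hands each attachment to a single tool, which the
"X-Spam-ExtractText-Tools: pdftotext" header suggests but which I haven't
confirmed. The lines are the same ones from my config, with one commented out:

# TEST ONLY: disable the pdftotext mapping so it cannot claim the PDF
# extracttext_use       pdftotext  .pdf application/pdf

# zbar mapping left unchanged; it is now the only tool listed for PDF
extracttext_external    zbar            /usr/bin/zbarimg -q -D {}
extracttext_use         zbar            .jpg .png .pdf .webp image/(?:jpeg|png) application/pdf

If X-Spam-ExtractText-Tools then shows zbar and the Uris header is filled in,
the two tools are competing for the PDF. If the Uris header stays empty, the
problem is more likely in zbarimg's handling of this particular file.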
