https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #21 from tagwer...@innerjoin.org ---
Created attachment 143869
  --> https://bugs.kde.org/attachment.cgi?id=143869&action=edit
pdftotext results from
https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC

(In reply to Adam Fontenot from comment #20)
> ... The file, in their view, is pathological ...
Applying a modicum of patience, running:

    nice -19 pdftotext QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf

took 37 hours on a machine with 16 GB of memory 8-]

The process gradually ate memory, reaching 10 GB. There wasn't an obvious
impact on performance - but I would expect you'd see that bite when reaching
the limits/starting to swap.

Attaching the output file - just in case anyone else wants to see the result.

When I moved the source file to an indexed folder, it was picked up by baloo and
indexed by baloo_file_extractor. That similarly took 37 hours and 10.1 GB.

Alas, I wasn't quick enough to notice what happened to the baloo_file_extractor
memory usage when the indexing finished - the process terminated (and released
its memory) when it had nothing more to do.
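
For anyone repeating the experiment, a small sampling loop would avoid having to
catch that moment by eye. This is only a sketch: watch_rss is a hypothetical
helper name, it reads VmRSS from /proc (so it is Linux-only), and it stops as
soon as the process no longer reports any resident memory.

```shell
# Hypothetical helper: print "timestamp pid rss_kB" every INTERVAL seconds
# for as long as the process PID reports a VmRSS value in /proc.
watch_rss() {
    pid="$1"
    interval="${2:-60}"   # seconds between samples, default 60
    # awk fails once /proc/PID is gone; an exited-but-unreaped process
    # has no VmRSS line, so rss comes back empty. Either case ends the loop.
    while rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status" 2>/dev/null) \
            && [ -n "$rss" ]; do
        echo "$(date +%FT%T) $pid $rss"
        sleep "$interval"
    done
}
```

Called as "watch_rss PID 60 >> rss.log", with PID being the process ID of the
running baloo_file_extractor, the last lines of the log would show the memory
use just before the process went away.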

The details of the index records:

    $ balooshow -x Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf
    1546b20000fc01 64513 1394354 Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf [/home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf]
            Mtime: 1637335759 2021-11-19T16:29:19
            Ctime: 1637335813 2021-11-19T16:30:13
            Cached properties:
                    Title: R Graphics Output
                    Document Generated By: R 3.6.0
                    Page Count: 1
                    Creation Date: 2019-09-13T11:01:30.000Z

    Internal Info
    Terms: 0 100000 150000 200000 50000 Mapplication Mpdf T5 X15-graphics X15-output X15-r X17-3.6.0 X17-r X18-1 X24-2019-09-13T11:01:30Z a1 a2 b1 b2 c graphics output qagr qchr qkel qpal r vcf − ●
    File Name Terms: Fpdf Fqmvqwhpuqke7retn5f9tisea7
    XAttr Terms:
    generator: 3.6.0 r
    pageCount: 1
    title: graphics output r
    creationDate: 2019-09-13T11:01:30Z

and...

    $ balooshow -x Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt
    140a610000fc01 64513 1313377 Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt [/home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt]
            Mtime: 1637519014 2021-11-21T19:23:34
            Ctime: 1637519014 2021-11-21T19:23:34
            Cached properties:
                    Line Count: 4352

    Internal Info
    Terms: 0 100000 150000 200000 50000 Mplain Mtext T5 T8 X20-4352 a1 a2 b1 b2 c qagr qchr qkel qpal vcf − ●
    File Name Terms: Fqmvqwhpuqke7retn5f9tisea7 Ftxt
    XAttr Terms:
    lineCount: 4352

So, for this instance, there was not a lot of indexable text, but the metadata
was recognised (for the PDF; it was not extracted to the text file) and it was
possible to search for the title:

    $ baloosearch "R Graphics Output"

or...

    $ baloosearch title:"R Graphics Output"
    /home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf

I think with enough RAM and patience baloo can cope with even this pathological
test case, but the requirement definitely _is_ "enough RAM and patience". It
would certainly make sense to be able to say to baloo_file_extractor "give up
after 10 minutes" and flag the file as failed.
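
As far as this report goes, baloo_file_extractor has no such option; the idea
can only be illustrated outside baloo. A minimal sketch, assuming GNU coreutils
"timeout" is available (extract_with_limit is a hypothetical helper name, not
part of baloo or poppler):

```shell
# Hypothetical helper: run a command at low priority and give up after LIMIT.
# GNU timeout exits with status 124 when the limit is reached, which a caller
# could use to flag the file as failed.
extract_with_limit() {
    limit="$1"
    shift
    timeout "$limit" nice -n 19 "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "gave up after $limit: $*" >&2
    fi
    return "$status"
}
```

With this, "extract_with_limit 10m pdftotext file.pdf" would return 124 for the
file discussed here instead of running for 37 hours.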

I'll update Bug 400704, which has become a collection point for these
misbehavin' reports. See:

    https://bugs.kde.org/show_bug.cgi?id=400704#c31

and onwards.
