https://bugs.kde.org/show_bug.cgi?id=380456
--- Comment #21 from tagwer...@innerjoin.org ---
Created attachment 143869
  --> https://bugs.kde.org/attachment.cgi?id=143869&action=edit
pdftotext results from https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC

(In reply to Adam Fontenot from comment #20)
> ... The file, in their view, is pathological ...

Applying a modicum of patience, running:

    nice -19 pdftotext QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf

took 37 hours on a machine with 16GB memory 8-]

The process gradually ate memory, reaching 10 GB. There wasn't an obvious
impact on performance - but I would expect you'd see that bite when reaching
the limits/starting to swap.

Attaching the output file - just in case anyone else wants to see the result.

When moving the source file to an indexed folder it was picked up by baloo and
indexed by baloo_file_extractor. Similarly 37hrs and 10.1 GB.

Alas wasn't quick enough to notice what happened to the baloo_file_extractor
memory usage when the indexing finished - the process terminated (and released
memory) when it had nothing more to do.

The details of the index records:

$ balooshow -x Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf
1546b20000fc01 64513 1394354 Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf [/home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf]
        Mtime: 1637335759 2021-11-19T16:29:19
        Ctime: 1637335813 2021-11-19T16:30:13
        Cached properties:
                Title: R Graphics Output
                Document Generated By: R 3.6.0
                Page Count: 1
                Creation Date: 2019-09-13T11:01:30.000Z

        Internal Info
        Terms: 0 100000 150000 200000 50000 Mapplication Mpdf T5 X15-graphics X15-output X15-r X17-3.6.0 X17-r X18-1 X24-2019-09-13T11:01:30Z a1 a2 b1 b2 c graphics output qagr qchr qkel qpal r vcf − ●
        File Name Terms: Fpdf Fqmvqwhpuqke7retn5f9tisea7
        XAttr Terms: generator: 3.6.0 r pageCount: 1 title: graphics output r creationDate: 2019-09-13T11:01:30Z

and...
$ balooshow -x Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt
140a610000fc01 64513 1313377 Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt [/home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt]
        Mtime: 1637519014 2021-11-21T19:23:34
        Ctime: 1637519014 2021-11-21T19:23:34
        Cached properties:
                Line Count: 4352

        Internal Info
        Terms: 0 100000 150000 200000 50000 Mplain Mtext T5 T8 X20-4352 a1 a2 b1 b2 c qagr qchr qkel qpal vcf − ●
        File Name Terms: Fqmvqwhpuqke7retn5f9tisea7 Ftxt
        XAttr Terms: lineCount: 4352

So, for this instance, there was not a lot of indexable text, but the metadata
was recognised (in the PDF; it was not extracted to the text) and it was
possible to search for the title:

    $ baloosearch "R Graphics Output"

or...

    $ baloosearch title:"R Graphics Output"
    /home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf

I think with enough RAM and patience baloo can cope with even this pathological
test case, but the requirement definitely _is_ "enough RAM and patience".

It would certainly make sense to be able to say to baloo_file_extractor "give
up after 10 minutes" and flag the file as failed.

I'll update Bug 400704, which has become a collection point for these
misbehavin' reports. See:

    https://bugs.kde.org/show_bug.cgi?id=400704#c31

and onwards.
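
As a rough illustration of the "give up after 10 minutes" idea, here is a minimal shell sketch. It is not how baloo_file_extractor works and baloo has no such option that I know of; the function name extract_with_limit and the "failed list" file are hypothetical, made up for this example. It assumes GNU coreutils timeout, which exits with status 124 when the limit is hit.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of baloo: run an extraction command under a
# wall-clock limit and note the file in a "failed" list if the limit is hit.
# Usage: extract_with_limit LIMIT FAILED_LIST FILE COMMAND [ARGS...]
extract_with_limit() {
    limit=$1
    failed_list=$2
    file=$3
    shift 3

    status=0
    # Assumes GNU coreutils timeout, which exits with 124 on expiry.
    # nice -n 19 keeps the extraction at the lowest CPU priority.
    timeout "$limit" nice -n 19 "$@" || status=$?

    if [ "$status" -eq 124 ]; then
        # Flag the file as failed so it can be skipped next time round.
        printf '%s\n' "$file" >> "$failed_list"
    fi
    return "$status"
}
```

With that in place, something like the following would stop pdftotext after ten minutes and record the file name (both file names here are placeholders):

    extract_with_limit 10m failed.txt big.pdf pdftotext big.pdf big.txt

A wall-clock limit alone would not cap the memory growth described above; that would need a separate limit.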