Re: How can i judge a PDF is a Scanned PDF?

2024-11-20 Thread Lachezar Dobrev
Modern(-ish) scanners have an option to perform OCR on scanned documents. I've seen such PDF files that have one big image of the scanned page as a background, with lots of transparent text on top. That lets the user copy and paste the OCR-ed text from such scanned documents. I vaguel

Re: How can i judge a PDF is a Scanned PDF?

2024-11-20 Thread Ulf Dittmer
I'm not quite sure what you mean by "scanned PDF", but if each page basically consists of one image and no text, that might be a strong indication.

On Wed, 20 Nov 2024, 11:05 achilles, <1743702...@qq.com.invalid> wrote:
> hi:
>   How can i judge a PDF is a Scanned PDF use pdfbox?
>   i don't fin

Re: How can i judge a PDF is a Scanned PDF?

2024-11-20 Thread Constantine Dokolas
A sure sign that the text is the product of OCR is that it is rendered in mode 3 (operator "3 Tr"), i.e. invisible. See the PDF 1.7 specification (ISO 32000-1:2008), section 9.3.6. Unless the PDF producer adds some kind of visible watermark using text, all text in such a file will be set to render in this mode.
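
To make the "3 Tr" check concrete, here is a minimal sketch in plain Java. It is not PDFBox API: the class InvisibleTextCheck, its Result type and the scan method are hypothetical names made up for illustration. It assumes you have already obtained the decoded (unfiltered) content stream of a page as a String, and it only tracks the Tr operator, q/Q state save and restore, and the text-showing operators Tj, TJ, ' and ". It does not descend into form XObjects and does not handle inline image data, so treat it as a rough heuristic rather than a complete parser.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of PDFBox: scans ONE already-decoded content stream.
class InvisibleTextCheck {

    /** Counts of text-showing operators seen in visible modes and in mode 3. */
    static final class Result {
        final int visibleShows;
        final int invisibleShows;

        Result(int visibleShows, int invisibleShows) {
            this.visibleShows = visibleShows;
            this.invisibleShows = invisibleShows;
        }

        /** True when text is shown and every show happened in mode 3. */
        boolean looksLikeOcrLayer() {
            return invisibleShows > 0 && visibleShows == 0;
        }
    }

    private static boolean endsToken(char c) {
        return Character.isWhitespace(c) || "()<>[]{}/%".indexOf(c) >= 0;
    }

    static Result scan(String content) {
        int mode = 0;                               // default rendering mode is 0 (fill)
        Deque<Integer> saved = new ArrayDeque<>();  // q/Q save and restore the mode
        String lastOperand = null;
        int visible = 0, invisible = 0;
        int i = 0, n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (c == '%') {                         // comment: skip to end of line
                while (i < n && content.charAt(i) != '\n' && content.charAt(i) != '\r') i++;
                continue;
            }
            if (c == '(') {                         // literal string: skip it whole
                int depth = 0;
                do {
                    char d = content.charAt(i);
                    if (d == '\\') i++;             // skip the escaped character too
                    else if (d == '(') depth++;
                    else if (d == ')') depth--;
                    i++;
                } while (i < n && depth > 0);
                lastOperand = null;
                continue;
            }
            if ("[]<>{})".indexOf(c) >= 0) { i++; continue; }  // other delimiters
            int start = i;
            if (c == '/') i++;                      // a name keeps its leading slash
            while (i < n && !endsToken(content.charAt(i))) i++;
            String tok = content.substring(start, i);
            switch (tok) {
                case "Tr":
                    try {
                        if (lastOperand != null) mode = Integer.parseInt(lastOperand);
                    } catch (NumberFormatException e) {
                        // malformed operand: keep the current mode
                    }
                    break;
                case "q":
                    saved.push(mode);
                    break;
                case "Q":
                    if (!saved.isEmpty()) mode = saved.pop();
                    break;
                case "Tj": case "TJ": case "'": case "\"":
                    if (mode == 3) invisible++; else visible++;
                    break;
                default:
                    break;
            }
            lastOperand = tok;
        }
        return new Result(visible, invisible);
    }

    public static void main(String[] args) {
        Result r = scan("BT 3 Tr /F1 12 Tf 10 10 Td (scanned words) Tj ET");
        System.out.println("invisible=" + r.invisibleShows + " visible=" + r.visibleShows
                + " ocrLayer=" + r.looksLikeOcrLayer());
    }
}
```

A page for which looksLikeOcrLayer() returns true fits the pattern described above. A page that mixes invisible and visible text (for example, a text watermark added by the producer) returns false and needs a closer look, for instance by comparing the two counts.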