On 2022-01-20, Siard <shi...@mailbox.org> wrote:
> Bob Bernstein wrote:
>> Executing 'apt-cache search tesseract' brings up a multitude of
>> packages.
>>
>> My need is simple enough, I think: I like to scan (using an
>> Epson scanner) pages of printed books -- almost one hundred per
>> cent text -- and then use OCR to produce pages from which I can
>> copy 'n paste snippets of text for note-taking purposes.
>>
>> What do the assembled multitudes suggest for a tesseract package
>> (that's the OCR I've been encouraged to use) on my bullseye
>> system, ...
>
> Once you have a PDF containing the images (img2pdf may be used for
> that), I think the cleverest way is to use ocrmypdf.
> It adds an OCR text layer to the PDF file, so the PDF text becomes
> selectable and can be copied.
> It uses the Tesseract OCR engine.
>
> $ ocrmypdf -f inputfile.pdf outputfile.pdf
>
ocrmypdf pulls in quite a few dependencies on my machine. The multitude of packages that 'apt-cache search tesseract' turns up corresponds more or less to the multiple languages of the human multitude: most of them are per-language data packages. I guess the OP's working in English, in which case the one that matters is 'tesseract-ocr-eng' (pulled in here, with all the others, when installing the above).
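
For what it's worth, here is a rough sketch of the whole workflow as a small shell function, assuming English text and that img2pdf and ocrmypdf are installed. The names ocr_pages, run and scans.pdf are made up for illustration, and DRY_RUN=1 only prints the commands instead of running them. I've left out -f (--force-ocr); as far as I know it only matters when the PDF already contains text, which a PDF fresh out of img2pdf does not.

```shell
#!/bin/sh
# Sketch only: scanned images -> one PDF -> PDF with an OCR text layer.
# ocr_pages, run and scans.pdf are hypothetical names, not part of any tool.

# Run a command, or just print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: ocr_pages OUTPUT.pdf IMAGE...
ocr_pages() {
    out=$1
    shift
    # Combine the scanned page images into a single PDF.
    run img2pdf "$@" -o scans.pdf
    # Add a selectable text layer; -l eng selects the English language data.
    run ocrmypdf -l eng scans.pdf "$out"
}
```

Something like 'ocr_pages book.pdf page-*.png' would then produce book.pdf with selectable text, leaving the intermediate scans.pdf behind.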