Hi Artman, I'm working on a similar project that converts PDFs to images, 
then to text, then to an editable PDF for ML. Could you please share your 
GitHub code?

On Friday, 14 January 2022 at 23:51:46 UTC+5:30 ArtmanDC wrote:

> In my project I am scanning images on microfilm, then using Tesseract (v. 
> 5.0.0) to create a PDF including the OCR'ed text layer.
>
> The input images are text (monospaced typewriter), and I combine several 
> (typically 2-8) images into a multipage TIFF.
>
> I use the following command in Windows 10:
>
> tesseract multipage.tif output --psm 4 pdf
>
> This works as expected, producing a multi-page output.pdf. (I added the 
> <--psm 4> after I discovered that when several consecutive lines had word 
> spaces directly above one another, the program interpreted this as a gap 
> between columns, leading to unwanted results.)
>
> As a check in my workflow, I select everything in the PDF (CTRL-A) and 
> copy/paste into my editor (Notepad++). This pastes the OCR text from all 
> pages in the document.
>
> The result is reasonably good except that paragraph and page breaks are 
> not indicated. Line breaks are.
>
> If I replace <pdf> with <txt> in the command, the resulting text file 
> has a blank line between paragraphs <LF LF> (Linux style, even though 
> I'm using Windows) and a page break <FF> at the end of each page.
>
> I would like my PDF text layer to have the more user-friendly display that 
> tesseract deploys in a text file. 
>
> Is this possible?  If so, how?
>
> Thanks!
>
>
>
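
For anyone reading along: the <txt> output described above marks the end of 
each page with a form feed and separates paragraphs with a blank line, so it 
can be split back into pages and paragraphs with a few lines of Python. This 
is only a minimal sketch, not a fix for the PDF text layer. The function name 
split_pages and the sample string are made up for illustration, and it 
assumes the text was produced with <txt> in place of <pdf> as described.

```python
def split_pages(text):
    """Split Tesseract-style text output into pages, each a list of paragraphs.

    Assumes a form feed (\\f) ends each page and a blank line (\\n\\n)
    separates paragraphs, as described in the quoted message.
    """
    pages = []
    for raw_page in text.split("\f"):
        # Skip the empty chunk that follows the final form feed.
        if not raw_page.strip():
            continue
        paragraphs = [p.strip() for p in raw_page.split("\n\n") if p.strip()]
        pages.append(paragraphs)
    return pages


# Hypothetical two-page sample standing in for the contents of output.txt.
sample = "First paragraph,\nsecond line.\n\nSecond paragraph.\n\fPage two text.\n\f"
pages = split_pages(sample)
for number, paragraphs in enumerate(pages, start=1):
    print(f"Page {number}: {len(paragraphs)} paragraph(s)")
```

On a real file you would read output.txt (opened with newline="" if you want 
to keep the line endings exactly as written) and pass its contents to 
split_pages.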
