I have an OCR program that reads and interprets many documents of varying composition. Some of the documents are PDFs whose first page is an image, with the text on the second (or later) pages. When the first page is an image with little or no text on it, the GetText() call can take several minutes or more just to get past that page. PDF pages with lots of text work fine. The application is a .NET WinForms application.
The relevant C# code is:

    var ocr = new TesseractEngine(..."tessdata5.2", "eng", EngineMode.LstmOnly);
    using var img = PixConverter.ToPix(save_bitmap);
    using var page = ocr.Process(img, PageSegMode.AutoOsd);
    ocrtext = page.GetText(); /* long time here */

I do need to collect the text from the subsequent pages so that the documents can be indexed. Thanks in advance for any comments you may have.