[tesseract-ocr] Combining output from multiple jobs into one hOCR file

Vidar Thu, 04 Feb 2021 12:20:10 -0800

Hi,

I'm running some processing on a Windows machine using the recent Mannheim 
5.0 alpha builds, outputting to hOCR. When I run it on a job with a few 
hundred pages, the CPU usage constantly hovers around 10% (1 thread), and 
memory/GPU usage doesn't seem to change much.


Now, I could split the job by pages, run the pieces in parallel (or 
across multiple machines), and then write a little script to combine 
the different hOCR outputs, but I can't help wondering whether there is a 
better way. Is there some intermediate format that I can get from 
tesseract and then feed into one hOCR file directly?

Thanks!
