Hi Vidar,

On 04/02/2021 21:11, Vidar wrote:
> Hi,
>
> I'm running some processing on a Windows machine using the recent Mannheim
> 5.0 alpha builds, outputting to hOCR. When I run it on a job with a few
> hundred pages, the CPU usage constantly hovers around 10% (1 thread), and
> memory/GPU usage doesn't seem to change much.
>
> Now, while I could split the jobs by pages, and run them in parallel (or
> split across multiple machines), and then write a little script to combine
> the different hOCR outputs together, I can't help but wonder if there is a
> better way to do this? Is there some intermediate format from tesseract
> that I can get, and then feed them all into one hOCR file directly?
I ran into this exact problem and used hocr-combine from hocr-tools [1] to
solve it. However, that program doesn't read or write in a streaming manner,
so it runs out of memory. I wrote a streaming replacement [2], which uses
very little RAM.

Cheers,
Merlijn

[1] https://github.com/ocropus/hocr-tools
[2] https://git.archive.org/merlijn/archive-hocr-tools/-/blob/master/bin/hocr-combine-stream
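
To illustrate what "streaming" means here, below is a minimal sketch in Python. It is not the code behind [2], just an illustration of the idea, with several assumptions: the inputs are well-formed XHTML hOCR files, each page is a div with class "ocr_page", and the names iter_pages and combine_hocr as well as the minimal header are made up for this example (the per-file metadata in each input's head is dropped). Only one page is held in memory at a time.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without an "ns0:" prefix.
ET.register_namespace("", XHTML_NS)

# Assumption: a minimal header; real hOCR files carry more metadata.
HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<html xmlns="%s">\n<head>\n<title></title>\n'
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n'
    "</head>\n<body>\n" % XHTML_NS
)
FOOTER = "</body>\n</html>\n"


def iter_pages(path):
    """Yield each ocr_page element of one hOCR file as a serialized string."""
    for _event, elem in ET.iterparse(path, events=("end",)):
        local_name = elem.tag.rsplit("}", 1)[-1]  # strip the namespace, if any
        if local_name == "div" and elem.get("class") == "ocr_page":
            elem.tail = None  # do not carry trailing whitespace along
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()  # free the page's subtree before parsing the next one


def combine_hocr(paths, out):
    """Write one combined hOCR document to the text stream `out`."""
    out.write(HEADER)
    for path in paths:
        for page in iter_pages(path):
            out.write(page)
            out.write("\n")
    out.write(FOOTER)
```

Usage would be something like combine_hocr(["part1.hocr", "part2.hocr"], open("all.hocr", "w", encoding="utf-8")). Note that this sketch does not renumber element ids, so ids such as page_1 will repeat across the combined parts.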