Re: [tesseract-ocr] Combining output from multiple jobs into one hOCR file

Vidar Thu, 04 Feb 2021 16:25:16 -0800

Thanks a million, both of these seem like excellent options! :D

On Thursday, February 4, 2021 at 8:36:27 PM UTC Merlijn Wajer wrote:


> Hi Vidar,
>
> On 04/02/2021 21:11, Vidar wrote:
> > 
> > Hi,
> > 
> > I'm running some processing on a Windows machine using the recent Mannheim
> > 5.0 alpha builds, outputting to hOCR. When I run it on a job with a few
> > hundred pages, the CPU usage constantly hovers around 10% (1 thread), and
> > memory/GPU usage doesn't seem to change much.
> > 
> > Now, while I could split the jobs by pages, and run them in parallel (or
> > split across multiple machines), and then write a little script to combine
> > the different hOCR outputs together, I can't help but wonder if there is a
> > better way to do this? Is there some intermediate format from tesseract
> > that I can get, and then feed them all into one hOCR file directly?
>
> I ran into this exact problem and used hocr-combine from hocr-tools [1]
> to solve it. But I ran into a limitation of that program: it doesn't
> read/write in a streaming manner, and runs out of memory.
>
> I wrote a streaming replacement here [2], which will not use a lot of RAM.
>
> Cheers,
> Merlijn
>
> [1] https://github.com/ocropus/hocr-tools
> [2] https://git.archive.org/merlijn/archive-hocr-tools/-/blob/master/bin/hocr-combine-stream
>
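
For anyone who wants to see the streaming idea in code, here is a minimal
sketch in Python. It is not the hocr-combine-stream script linked in [2],
and it is not how hocr-tools works; it only illustrates the general approach
of copying one ocr_page element at a time instead of loading every file into
memory. The names combine_hocr and iter_pages are made up for this example,
and the header it writes is a bare placeholder (an assumption on my part)
that drops the ocr-system and ocr-capabilities meta tags from the inputs.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without an "ns0:" prefix.
ET.register_namespace("", XHTML_NS)

# Placeholder document wrapper (assumption: real hOCR headers carry more
# metadata, which this sketch does not copy over).
HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    f'<html xmlns="{XHTML_NS}">\n'
    "<head><title></title></head>\n"
    "<body>\n"
)
FOOTER = "</body>\n</html>\n"


def iter_pages(source):
    """Yield each ocr_page element of one hOCR file as a serialized string.

    `source` is a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if "ocr_page" in elem.get("class", "").split():
            # Drop trailing whitespace so the output does not depend on
            # how much of the file the parser has already consumed.
            elem.tail = None
            yield ET.tostring(elem, encoding="unicode")
            # Free the page's children once it has been written out.
            elem.clear()


def combine_hocr(sources, out):
    """Write the pages of all `sources`, in order, to text stream `out`."""
    out.write(HEADER)
    for source in sources:
        for page in iter_pages(source):
            out.write(page)
            out.write("\n")
    out.write(FOOTER)
```

It could be called as combine_hocr(["part1.hocr", "part2.hocr"], out), where
the file names are hypothetical and out is a file opened for writing as
UTF-8 text. Only one page is held in memory at a time. The sketch does not
renumber element ids, so if each job numbers its pages independently the
combined file may contain repeated ids.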

