Hi,

On 20/10/2021 11:31, juan carlos hernández wrote:
> Hi all
> 
> I'm managing a project that needs to OCR documents in real time. We
> expect to have multiple users scanning and OCRing documents in the order
> of tens of users simultaneously, maybe 100 users at a time or more. We
> need to get OCR done for documents with about 50 pages in less than 20
> seconds. Our documents will be scanned at 300 dpi.
> As we are in a huge organization in a public administration, we can
> afford to buy very powerful servers to run tesseract.
> 
> Do you have any advice on what HW is best suited for tesseract? 
> I've reviewed the Intel Xeon family of processors, and I think that
> choosing the Xeon Platinum processors would be a good option. 
> Apart from having fast processors, what other components affect the
> performance of tesseract, amount and speed of memory, having SSD or a
> RamDisk?

Just a few vague suggestions based on experience running it on a cluster
(ymmv):

* A ramdisk could help reduce wear on SSDs, but I don't think it will
matter much for processing speed: if you use SSDs, the majority of the
time is not spent in I/O
* Run tesseract with only one thread to get the most out of your CPUs
(disable OpenMP) - this will maximise your throughput
* The average peak RAM (max in the process lifetime) of Tesseract (at
archive.org) is about 100MB, with occasional spikes up to 2GB of RAM
(likely for big images/newspapers) - the 90th percentile is about
200MB-300MB.
* The average OCR time per page (at archive.org) is about 7.5 seconds,
but we have a lot of old CPU cores mixed in (some only have SSE2).
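
A rough sketch of the "one thread per process" idea, assuming a Linux-like
host and a Tesseract build that honours the OMP_THREAD_LIMIT environment
variable (worth checking against your build). The helper names
single_thread_env and ocr_pages are made up for illustration; the caller
supplies the actual command lines.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def single_thread_env():
    """Return a copy of the environment with OpenMP limited to one thread."""
    return dict(os.environ, OMP_THREAD_LIMIT="1")


def ocr_pages(commands, workers=None):
    """Run one command per page, at most `workers` at a time.

    Returns the exit codes in the same order as `commands`.
    """
    # Assumption: one single-threaded process per CPU core.
    workers = workers or os.cpu_count() or 1
    env = single_thread_env()

    def run(cmd):
        return subprocess.run(cmd, env=env, capture_output=True).returncode

    # Threads only wait on child processes here, so the GIL is not a bottleneck.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))
```

With hypothetical file names, the per-page commands could be built as
[["tesseract", f"page-{i:03d}.png", f"page-{i:03d}"] for i in range(50)]
and passed to ocr_pages.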

Maybe with the average runtime & RAM usage you can figure out what you
need.
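
As a back-of-the-envelope sketch of that calculation: the function names
are made up, and the inputs are the archive.org averages above plus the
numbers from your mail (50 pages, 20 seconds, 100 users). It assumes the
worst case where every user submits at the same moment, that pages can be
OCRed independently in parallel, and one single-threaded process per core.

```python
import math


def cores_needed(pages_per_doc, secs_per_page, deadline_secs, concurrent_docs):
    """Cores required to finish all documents within the deadline."""
    core_seconds = pages_per_doc * secs_per_page * concurrent_docs
    return math.ceil(core_seconds / deadline_secs)


def ram_needed_gb(processes, mb_per_process):
    """Total RAM in GB if every process uses `mb_per_process` MB."""
    return processes * mb_per_process / 1024


if __name__ == "__main__":
    cores = cores_needed(50, 7.5, 20, 100)
    print("cores:", cores)
    print("RAM (GB) at 300MB per process:", round(ram_needed_gb(cores, 300), 1))
```

With those inputs this gives 19 cores for a single document and 1875
cores for 100 simultaneous documents. The 7.5 seconds includes our old
cores, so modern CPUs should need fewer - measure on your own documents
before sizing anything.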

Finally, keep in mind that in some cases Tesseract can run for many
minutes or hours and slowly consume RAM - this happens very rarely, but
does happen on some inputs, so be sure to cap the running time / RAM if
you run it on a big cluster.
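
One way to cap both, sketched with the Python standard library on a
Unix-like system (the resource module is not available on Windows). The
name run_capped is made up, and the limits are placeholders to tune.
Note that RLIMIT_AS caps the virtual address space, which is only a
rough proxy for resident RAM; cgroups or your job scheduler may be a
better fit on a real cluster.

```python
import resource
import subprocess


def run_capped(cmd, timeout_secs, max_ram_bytes):
    """Run `cmd` with a wall-clock timeout and an address-space cap.

    Returns the exit code, or None if the process was killed on timeout.
    """
    def limit_ram():
        # Runs in the child just before exec; applies only to that process.
        resource.setrlimit(resource.RLIMIT_AS, (max_ram_bytes, max_ram_bytes))

    try:
        proc = subprocess.run(
            cmd,
            preexec_fn=limit_ram,
            timeout=timeout_secs,  # subprocess.run kills the child on expiry
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.returncode
```

For example, run_capped(["tesseract", "page-001.png", "page-001"], 120,
4 * 1024**3) would give one page (hypothetical file name) two minutes
and 4GB before giving up on it.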

Cheers,
Merlijn
