Hi,

On 20/10/2021 11:31, juan carlos hernández wrote:
> Hi all
>
> I'm managing a project that needs to OCR documents in real time. We
> expect to have multiple users scanning and OCRing documents in the order
> of tens of users simultaneously, maybe 100 users at a time or more. We
> need to get OCR done for documents with about 50 pages in less than 20
> seconds. Our documents will be scanned at 300dpi.
> As we are in a huge organization in a public administration, we can
> afford to buy very powerful servers to run tesseract.
>
> Do you have any advice on what HW is best suited for tesseract?
> I've reviewed the Intel Xeon family of processors, and I think that
> choosing the Xeon Platinum processors would be a good option.
> Apart from having fast processors, what other components affect the
> performance of tesseract: amount and speed of memory, having an SSD or a
> RamDisk?

Just a few vague suggestions based on experience running it on a cluster
(ymmv):

* A ramdisk could help reduce wear on SSDs, but I don't think it will
  matter much for processing speed; the majority of the time is not spent
  in I/O if you use SSDs.
* Run Tesseract with only one thread (disable OpenMP) to get the most out
  of your CPUs - this will maximise your throughput.
* The average peak RAM (max in the process lifetime) of Tesseract (at
  archive.org) is about 100MB, with occasional spikes up to 2GB (likely
  for big images/newspapers) - the 90th percentile is about 200MB-300MB.
* The average OCR time per page (at archive.org) is about 7.5 seconds, but
  we have a lot of old CPU cores mixed in (some only have SSE2).

Maybe with the average runtime & RAM usage you can figure out what you
need.

Finally, keep in mind that in some cases Tesseract can run for many
minutes or hours and slowly consume RAM - this happens very rarely, but it
does happen on some inputs, so be sure to cap the running time / RAM if
you run it on a big cluster.

Cheers,
Merlijn
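
To make the sizing and the capping advice a bit more concrete, here is a rough Python sketch (not what archive.org actually runs). `cores_needed` and `run_capped` are hypothetical helper names. The sizing assumes the worst case where every user submits a document at the same moment and each page keeps one core busy for the average page time. The wrapper assumes a Unix host and an OpenMP-enabled Tesseract build that honours the `OMP_THREAD_LIMIT` environment variable; the limits you pass in are yours to choose.

```python
import math
import os
import resource
import subprocess


def cores_needed(users, pages_per_doc, deadline_s, sec_per_page):
    """Worst-case core count: all users submit one document at once,
    and every page occupies one core for sec_per_page seconds."""
    total_cpu_seconds = users * pages_per_doc * sec_per_page
    return math.ceil(total_cpu_seconds / deadline_s)


def run_capped(cmd, timeout_s, max_ram_bytes):
    """Run cmd single-threaded with a wall-clock and memory cap.

    Returns the CompletedProcess, or None if the timeout was hit
    (subprocess.run kills the child before raising TimeoutExpired).
    """
    # Assumption: the OpenMP runtime reads OMP_THREAD_LIMIT, so this keeps
    # each process on one thread without rebuilding Tesseract.
    env = dict(os.environ, OMP_THREAD_LIMIT="1")

    def limit_ram():
        # Runs in the child just before exec. RLIMIT_AS caps virtual
        # address space, which is stricter than resident memory.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard == resource.RLIM_INFINITY:
            cap = max_ram_bytes
        else:
            cap = min(max_ram_bytes, hard)
        # Only the soft limit is lowered; the hard limit is left alone.
        resource.setrlimit(resource.RLIMIT_AS, (cap, hard))

    try:
        return subprocess.run(
            cmd,
            env=env,
            preexec_fn=limit_ram,
            timeout=timeout_s,
            capture_output=True,
            text=True,
        )
    except subprocess.TimeoutExpired:
        return None
```

With the numbers from this thread, `cores_needed(100, 50, 20, 7.5)` gives 1875 cores. Treat that as a pessimistic upper bound, since the 7.5 s average includes old hardware and not all users will submit at the same instant. A single page could then be processed with something like `run_capped(["tesseract", "page.png", "page"], timeout_s=120, max_ram_bytes=2 * 1024**3)`, where the file names, the 120 s timeout and the 2GB cap are placeholders.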