[tesseract-ocr] New release for tessdata_{fast,best}?

2021-01-27 Thread Merlijn B.W. Wajer
Hi, With Tesseract now switching to regular (alpha) releases of 5.0.0; does it make sense to consider some versioning for language files as well? The Internet Archive has switched to using Tesseract for all our OCR, and I'm hoping that we can record exactly what version of language files was used

Re: [tesseract-ocr] New release for tessdata_{fast,best}?

2021-01-27 Thread Merlijn B.W. Wajer
Hi, On 27/01/2021 12:42, Shree Devi Kumar wrote: >> The Internet Archive has switched to using Tesseract for all our OCR, > > I am so happy to hear this. It will be great to have the Indic languages > that were marked as non-ocrable so far be converted to text correctly on > Internet Archive. Ri

Re: [tesseract-ocr] Re: New release for tessdata_{fast,best}?

2021-02-01 Thread Merlijn B.W. Wajer
Hi Tom, On 30/01/2021 21:25, Tom Morris wrote: > On Wednesday, January 27, 2021 at 5:28:27 AM UTC-5 Merlijn Wajer wrote: > > > The Internet Archive has switched to using Tesseract for all our OCR, > > > That's great to hear! It's certainly been a long time coming. Nick White > & I tried to

Re: [tesseract-ocr] Combining output from multiple jobs into one hOCR file

2021-02-04 Thread Merlijn B.W. Wajer
Hi Vidar, On 04/02/2021 21:11, Vidar wrote: > > Hi, > > I'm running some processing on a Windows machine using the recent Mannheim > 5.0 alpha builds, outputting to hOCR. When I run it on a job with a few > hundred pages, the CPU usage constantly hovers around 10% (1 thread), and > memory/GPU

Re: [tesseract-ocr] Output layout

2021-02-20 Thread Merlijn B.W. Wajer
Hi, On 20/02/2021 17:38, karim abed el hadi wrote: > Hello, > > Is there a way to let the OCR output have the same text layout as the input > image? > if yes, can you please tell me how? You could output hOCR and use this tool to visualise the output: https://github.com/kba/hocrjs#demo Cheers,
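As a rough illustration of why hOCR helps with layout: each recognised word in an hOCR file carries a `bbox` (left, top, right, bottom, in pixels) in its `title` attribute, so the positions on the page can be recovered. The sketch below uses only the Python standard library. The sample fragment, `WordBoxParser` and `extract_word_boxes` are made up for illustration and are not part of Tesseract or hocrjs.

```python
import re
from html.parser import HTMLParser

# Hypothetical hOCR fragment, written by hand for this example. A real file
# would come from a run such as: tesseract page.png page hocr
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 36 92 618 130'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 96'>Hello</span>
  <span class='ocrx_word' title='bbox 109 92 235 116; x_wconf 93'>world</span>
</span>
"""

# Matches the four integer coordinates that follow "bbox" in a title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collects (text, (left, top, right, bottom)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._pending_box = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("class") == "ocrx_word":
            match = BBOX_RE.search(attributes.get("title") or "")
            if match:
                self._pending_box = tuple(int(v) for v in match.groups())

    def handle_data(self, data):
        text = data.strip()
        # Attach the first non-empty text after a word tag to that word's box.
        if self._pending_box is not None and text:
            self.words.append((text, self._pending_box))
            self._pending_box = None


def extract_word_boxes(hocr):
    """Return a list of (word, bbox) pairs found in an hOCR string."""
    parser = WordBoxParser()
    parser.feed(hocr)
    return parser.words


if __name__ == "__main__":
    for word, box in extract_word_boxes(SAMPLE_HOCR):
        print(word, box)
```

With the word boxes in hand, a viewer (or your own code) can place each word at its original position on the page image.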

Re: [tesseract-ocr] Detecting language automatically

2021-03-20 Thread Merlijn B.W. Wajer
Hi, On 19/03/2021 10:11, Charles Cho wrote: > Hello, > I'm working on an OCR Android app based on Tesseract. > I want to add a feature that detects the language automatically and recognizes > at least 2 languages at once. > I have investigated this for a while, so I know that I have to specify > languag

Re: [tesseract-ocr] Detecting language automatically

2021-03-25 Thread Merlijn B.W. Wajer
Hi, On 25/03/2021 19:04, Charles Cho wrote: > Hi. > > Thank you very much for your kind help, shree. > I tried to detect script by your help and it worked. Great. > > I have some questions. > 1. If the image contains texts of different languages in a page, is there > any way to detect all of th

Re: [tesseract-ocr] How to reduce the size of a OCRed pdf file using Tesseract OCR APIs.

2021-04-14 Thread Merlijn B.W. Wajer
Hi, On 14/04/2021 13:52, Sharp Subbu wrote: > Dear friends, > > Kindly guide/help us to find a solution for the below point: > > How to reduce the size of an OCRed PDF file using Tesseract OCR APIs. Not sure exactly what use case you h

Re: [tesseract-ocr] How to reduce the size of a OCRed pdf file using Tesseract OCR APIs.

2021-04-19 Thread Merlijn B.W. Wajer
Hi, On 20/04/2021 01:04, Sharp Subbu wrote: > Dear Merlijn, > > Kindly reply to my previous mail. I will reply tomorrow -- off-list so that we don't bother others on this list. Regards, Merlijn > Thanks and Regards, > Subramanyam > > On Saturday, April 17, 2021 at 11:32:31 PM UTC+5:30 Sharp S

Re: [tesseract-ocr] Diagnosing and fixing poor precision on mixed Greek-English text

2021-05-10 Thread Merlijn B.W. Wajer
Hi Ben, On 09/05/2021 21:33, Ben Crowell wrote: > I'm trying to OCR a book that is written in interspersed Greek and English: > > https://archive.org/details/odysseyofhomerco01gile/page/n5/mode/2up > > Here is a sample of text from the first page: > > [image: a.jpg] > > I'm running tesseract 4

Re: [tesseract-ocr] Diagnosing and fixing poor precision on mixed Greek-English text

2021-05-10 Thread Merlijn B.W. Wajer
Hi Ben, On 10/05/2021 15:09, Ben Crowell wrote: > Hi Merlijn, > > Thanks very much for your reply. It's encouraging that you were able to get > somewhat better results. However, I'm not able to reproduce them. When I > use -l eng+ell, the results are still very poor: > > 1. Evverre declare wot
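For readers unfamiliar with the `-l eng+ell` form quoted above: the `-l` option takes one or more language codes joined with `+`. Here is a small sketch of assembling such a command line in Python. The helper name `build_ocr_command` and the file names are hypothetical; only the `-l` syntax comes from the thread.

```python
import shlex


def build_ocr_command(image_path, output_base, languages):
    """Build a tesseract argument list for one or more language models.

    `languages` is a list of traineddata codes, e.g. ["eng", "ell"].
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract expects multiple languages as a single "+"-joined argument.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    command = build_ocr_command("page.png", "page", ["eng", "ell"])
    # Print the command as it would be typed in a shell.
    print(shlex.join(command))
```

The list form can be passed straight to `subprocess.run` without going through a shell.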

Re: [tesseract-ocr] Diagnosing and fixing poor precision on mixed Greek-English text

2021-05-14 Thread Merlijn B.W. Wajer
Hi Ben, On 13/05/2021 02:34, Ben Crowell wrote: > > Only 68% of Greek words are correctly recognized as Greek, and even of > those, some are misread. Extremely common words like μοι, ὁς, and και are > not recognized, although they are mostly recognized when I OCR the text > with the language

Re: [tesseract-ocr] Improve text extraction when some text is inverted

2021-07-02 Thread Merlijn B.W. Wajer
Hi, On 01/07/2021 18:39, 'Chris' via tesseract-ocr wrote: > I am experimenting with Tesseract 4.1.1 using C# to extract text from black > and white or greyscale TIF images of semi structured forms that are 300 > dpi. > > The results are really promising except when some of the text is inverted

Re: [tesseract-ocr] Quality of Fraktur OCR too bad, any mistake on my side?

2021-08-23 Thread Merlijn B.W. Wajer
Hi Andreas, Using a newer data file and a newer Tesseract might help - see inline. On 28/07/2021 18:17, Andreas Groß wrote: > I work on Kubuntu 20.04 with gImageReader 3.3.1 and tesseract 4.1.1 > and had installed the Fraktur model with this command > > sudo apt-get install tesseract-ocr-script-

Re: [tesseract-ocr] Doubt about using 5.0.0-beta-20210916 before release version is available

2021-10-19 Thread Merlijn B.W. Wajer
Hi, On 19/10/2021 11:08, juan carlos hernández wrote: > Hello > I'm working on a project that needs OCR and we have chosen to use > Tesseract. We would like to use v5.0.0, but our IT Infrastructure team > doesn't want to install it because it is a beta version. I've read in > another conversation

Re: [tesseract-ocr] Doubt about using 5.0.0-beta-20210916 before release version is available

2021-10-19 Thread Merlijn B.W. Wajer
Hi, On 19/10/2021 16:47, Lorenzo Bolzani wrote: > Hi Merlijn, > out of curiosity, did you notice an improvement over the previous version? Yes. Speed and stability are better, and accuracy is also up (IMHO). See (for example) this link: https://github.com/tesseract-ocr/tesseract/pull/3141 Regards, M

Re: [tesseract-ocr] Recommended HW for Tesseract

2021-10-20 Thread Merlijn B.W. Wajer
Hi, On 20/10/2021 11:31, juan carlos hernández wrote: > Hi all > > I'm managing a project that needs to OCR documents in real time. We > expect to have multiple users scanning and OCRing documents in the order > of tens of users simultaneously, maybe 100 users at a time or more. We > need to get

Re: [tesseract-ocr] Microscopy label, poor recognition

2021-12-21 Thread Merlijn B.W. Wajer
Hi Martin, Some of the advice below applies to Tesseract 5 only... On 21/12/2021 09:38, 'Martin Weihrauch' via tesseract-ocr wrote: > > > I have an image (label of a microscopy slide), which I thought would be > easy to OCR, because it is easily readable for humans. I am using the > latest T

Re: [tesseract-ocr] Too many diacritics can make process die?

2022-02-13 Thread Merlijn B.W. Wajer
Hi, On 12/02/2022 22:13, Alberto Simoes wrote: Hi I am OCRing a lot of documents. I have a document with very poor quality, and surely nothing will be recognized. But I need a stable pipeline, and while I was expecting tesseract just to return an empty document, I am getting this error: De

Re: [tesseract-ocr] Incorrect OCR of 4-digit number

2022-02-27 Thread Merlijn B.W. Wajer
Hi, On 27/02/2022 08:55, Zdenko Podobny wrote: tesseract fix_size.png - 0326 0939 1552 2206 See doc for explaining: https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling Thanks for th

Re: [tesseract-ocr] "Line separator regions" capabilities?

2022-04-27 Thread Merlijn B.W. Wajer
Hi, On 27/04/2022 19:07, Brad wrote: For V5.10.0 of Tesseract, one of the changes is: (correction: version 5.1.0) Handle image and line separator regions in ALTO, hOCR and text output formats. I'm curious about what this means. Can Tesseract be used to identify rectangles and such on an im

Re: [tesseract-ocr] Recognition contains an extra letter

2022-06-13 Thread Merlijn B.W. Wajer
Hi, On 13/06/2022 10:21, 'Yunlong Liu' via tesseract-ocr wrote: Dear developers, I had read carefully the online material about how to use Tesseract for OCR tasks. It works well for most of the data on my side. However, I found one weird thing which confuses me quite a lot. Here are the detai

Re: [tesseract-ocr] Tesseract OCR on PDF without converting into images

2022-08-12 Thread Merlijn B.W. Wajer
Hi Banti, On 11/08/2022 12:11, Banti Kumar wrote: Can I use tesseract on pdf without converting pages into images? I have some pdf pages with digital text and Images with text, I just want to apply ocr on images but not on the digital text regions so I can get better accuracy for searchable pd

Re: [tesseract-ocr] Re: Possibility to call the pdf creation

2022-10-30 Thread Merlijn B.W. Wajer
Hi, On 26/09/2022 18:45, Max Rehberg wrote: I would like to do that as well. Is it possible? D schrieb am Dienstag, 8. Dezember 2020 um 19:19:30 UTC+1: Hey guys, I produce a .hocr file with Google Cloud Vision and gcv2hocr. I would like to know if there is an easy method to call

Re: [tesseract-ocr] sending image data directly to Tesseract

2023-02-13 Thread Merlijn B.W. Wajer
Hi, On 13/02/2023 20:40, Flávio. wrote: Hi, I'm a beginner, I wonder if there is a way to send image byte data directly to Tesseract from memory without having to write a file to disk . I'm building a program that uses Dart, but I can also write Python, Java or JS. If you write/echo an ima

Re: [tesseract-ocr] sending image data directly to Tesseract

2023-02-14 Thread Merlijn B.W. Wajer
Hi, On 14/02/2023 19:10, Flávio. wrote: Sorry, how can I do that?  I'm trying to send image binary data, not a path. The goal is to not write a file to disk and use only memory. Could you please write a code that sends the data (binary) to the stdin of tesseract? it can be in Python, Dart or

Re: [tesseract-ocr] sending image data directly to Tesseract

2023-02-14 Thread Merlijn B.W. Wajer
Hi, On 14/02/2023 21:16, Flávio. wrote: Thanks, i'm still trying to figure it out. It seems when the PIL image is saved, unfortunately it saves a temp file to disk. My goal is to not write to disk, because this application will read a lot of files and I want to spare my SSD. My code receives b

Re: [tesseract-ocr] sending image data directly to Tesseract

2023-02-14 Thread Merlijn B.W. Wajer
Hi, On 14/02/2023 21:59, Flávio. wrote: I'll look into that Linux option :)  as for the save method, I used the show method on the object and it had a path in the temp directory. So I asked ChatGPT how the file could be on disk and it told me that the save method created it. If you're right th
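A minimal sketch of the in-memory approach discussed in this thread, assuming the `tesseract` binary is on the PATH: the command-line tool accepts `stdin` as the input name and `stdout` as the output base, so encoded image bytes can be piped in without touching the disk. The function names below are hypothetical. If the image starts out as a PIL object, it can first be encoded into an `io.BytesIO` buffer (PIL's `save` accepts a file-like object plus a `format`), and the buffer's bytes passed in.

```python
import subprocess


def build_stdin_command(lang="eng"):
    """Argument list asking tesseract to read from stdin and write to stdout."""
    return ["tesseract", "stdin", "stdout", "-l", lang]


def ocr_image_bytes(image_bytes, lang="eng"):
    """Pipe encoded image bytes (e.g. PNG or TIFF) to tesseract, return the text.

    Raises subprocess.CalledProcessError if tesseract exits with an error.
    """
    result = subprocess.run(
        build_stdin_command(lang),
        input=image_bytes,          # sent to the child's stdin, no temp file
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")


if __name__ == "__main__":
    import sys

    # Example: python ocr_stdin.py < page.png
    print(ocr_image_bytes(sys.stdin.buffer.read()))
```

The same idea carries over to Dart or Java: start the process, write the bytes to its standard input, close it, and read standard output.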

Re: [tesseract-ocr] Any success story?

2023-11-14 Thread Merlijn B.W. Wajer
Hi, On 14/11/2023 06:55, Des Bw wrote: It looks like every one is having issues with tesseract. I am not able to find any one who has a great success with this software. It would be really encouraging to hear any success story from any language. Here's one for you: https://blog.archive.org/2

Re: [tesseract-ocr] Re: Post OCR Verification and Editing

2024-03-08 Thread Merlijn B.W. Wajer
Hi Mark, On 07/03/2024 20:53, Mark Pellegrino wrote: I found more info here: https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277 Glyphless appears to be an 'invisible font' and all that Tesseract supports. It seems like the solution is to use Tesseract to generate hO

Re: [tesseract-ocr] Re: Post OCR Verification and Editing

2024-03-08 Thread Merlijn B.W. Wajer
Hi Mark, On 08/03/2024 20:24, Mark Pellegrino wrote: Thank you Merlijn, this is very helpful. I'm very interested in IA's process so I'll have a deep dive through those tools.  This confirms my suspicions that there's no way to use an off-the-shelf text editor with a glyphless font. I'll explo