[tesseract-ocr] Re: Tesseract mixing up digits in a low-res font

Michał Śmielak Mon, 29 Mar 2021 18:01:28 -0700

OK, so if anyone is interested: I ended up creating a custom font based on 
the actual digits that I extracted from the photo, then used this custom 
font for training, and it worked 100%. It took me a couple of days of tweaking.
I described it in detail here: 
https://msmielak.github.io/post/2021-03-29-extracting-date-and-time-from-photo-using-ocr-engine-tesseract/
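
For anyone checking similar output in bulk, a cheap sanity test is to parse each recognised string as a real date or time, which catches impossible readings such as the 16:66:04 quoted below. This is only a minimal sketch, not part of the workflow described in the post: it is in Python rather than R, the function name `is_plausible` is made up for illustration, and the two format strings are assumptions inferred from the readings quoted in this thread.

```python
from datetime import datetime

# Assumed stamp formats, inferred from the two readings quoted in the thread.
DATE_FORMAT = "%m/%d/%Y"  # e.g. 10/16/2015
TIME_FORMAT = "%H:%M:%S"  # e.g. 16:06:04


def is_plausible(text: str, fmt: str) -> bool:
    """Return True if `text` parses as a real date or time in format `fmt`."""
    try:
        datetime.strptime(text.strip(), fmt)
    except ValueError:
        # strptime rejects out-of-range fields such as minute 66 or month 13.
        return False
    return True


if __name__ == "__main__":
    for reading, fmt in [("10/16/2015", DATE_FORMAT), ("16:66:04", TIME_FORMAT)]:
        status = "ok" if is_plausible(reading, fmt) else "impossible value"
        print(f"{reading}: {status}")
```

A reading that passes this check is only well-formed, not necessarily correct: a misread digit can still produce a valid date, so this flags a subset of errors rather than proving accuracy.
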
On Sunday, 28 March 2021 at 20:44:50 UTC+3, Michał Śmielak wrote:


> Hi everyone, new user here.
> I have an issue with tesseract (run as an R library if that makes any 
> difference).
> I am trying to read data from a camera trap photo - an example:
> [image: pic (1).jpg]
> The photo is low-res (640x480 px), but as you can see the numbers are 
> easily readable by a human. I managed to tune it a bit by clipping parts of 
> the picture, reversing the image, upscaling, etc., and now I have something like this:
>
> [image: date_49.jpg]
>
> You would think it is an easy thing to read, so I created a subset of these 
> outputs, merged them into a TIFF, and manually fixed the automatic detection. 
> Even so, Tesseract keeps misreading some digits. For instance, the image 
> above is read as 10/16/2015, while this time:
> [image: time_55.jpg]
> is read as 16:66:04.
> I find this very weird, as these numbers are embedded into the photo by 
> the camera trap itself and are very consistent. The size is always the same 
> and the digits are identical, yet the software reads the same digit in 
> different ways and sometimes does not read it at all. And 8 is always read 
> as something else.
>
> I would appreciate any advice on how to fix that. My training data was 140 
> dates and 140 times. Still, when I generated boxes (I used jTessBoxEditor 
> for that), one line would sometimes be read fine, and then the next one 
> would be read as letters that are not even similar. Could the "pixelated" 
> type of font be the issue? The digits are originally 8 px high.
>
> Alternatively, can you advise me on a method to read these values 
> correctly?
>
> Thanks in advance everyone.
> Michal
>
>
>
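
The preprocessing mentioned in the quoted message (clipping, reversing, upscaling) can be sketched roughly as follows. This is a dependency-free illustration in Python, not the R code actually used: the image is a plain nested list of 8-bit grayscale values, "reversing" is assumed to mean inverting brightness, nearest-neighbour scaling is an assumption chosen because it keeps pixelated digits sharp, and the function names, crop box, and scale factor are all hypothetical.

```python
from typing import List, Tuple

# Rows of 8-bit grayscale pixels: 0 = black, 255 = white.
Image = List[List[int]]


def crop(img: Image, left: int, top: int, width: int, height: int) -> Image:
    """Keep only the rectangle that contains the date or time stamp."""
    return [row[left:left + width] for row in img[top:top + height]]


def invert(img: Image) -> Image:
    """Swap light and dark (assumed meaning of 'reversing' the image)."""
    return [[255 - p for p in row] for row in img]


def upscale_nearest(img: Image, factor: int) -> Image:
    """Enlarge by an integer factor, repeating each pixel factor x factor times."""
    out: Image = []
    for row in img:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def preprocess(img: Image, box: Tuple[int, int, int, int], factor: int = 4) -> Image:
    """Crop the stamp, invert it, then upscale it."""
    left, top, width, height = box
    return upscale_nearest(invert(crop(img, left, top, width, height)), factor)
```

In practice an imaging library would do the same steps on real photos; the point here is only the order of operations and that integer nearest-neighbour scaling leaves the 8 px digit shapes unblurred.
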
