[tesseract-ocr] Tesseract mixing up digits in a low-res font

Michał Śmielak Sun, 28 Mar 2021 10:44:50 -0700

Hi everyone, new user here.
I have an issue with tesseract (run as an R library if that makes any 
difference).
I am trying to read data from a camera trap photo - an example:
[image: pic (1).jpg]
The photo is low-res (640x480 px), but as you can see the numbers are easily 
readable by a human. I managed to tune it a bit by cropping parts of the 
picture, inverting the image, upscaling, etc., and I now have something like this:


[image: date_49.jpg]
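
For readers who want to reproduce the kind of preprocessing described above (crop, invert, upscale), here is a minimal sketch in plain Python. It is not the R code actually used for these images; the function names, the crop box, and the scale factor are all hypothetical, and the image is assumed to be a simple list of rows of 0-255 grayscale values. Nearest-neighbour scaling is chosen here only because it keeps the blocky edges of a pixel font intact; whether that helps Tesseract is an open question.

```python
# Illustrative sketch only: an "image" is a list of rows of 0-255 grayscale ints.

def crop(img, left, top, width, height):
    """Keep only the rectangle that contains the date or time stamp."""
    return [row[left:left + width] for row in img[top:top + height]]

def invert(img):
    """Swap light and dark so light-on-dark digits become dark-on-light."""
    return [[255 - p for p in row] for row in img]

def upscale_nearest(img, factor):
    """Enlarge by an integer factor, repeating each pixel (no smoothing)."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def preprocess(img, box, factor=4):
    """Crop to box = (left, top, width, height), invert, then upscale."""
    left, top, width, height = box
    return upscale_nearest(invert(crop(img, left, top, width, height)), factor)
```

With 8 px high digits, a factor of 4 would give 32 px high glyphs; that factor is an assumption to experiment with, not a tested recommendation.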

You would think this is an easy thing to read, so I created a subset of these 
outputs, merged them into a TIFF, and manually fixed the automatic box 
detection, but Tesseract keeps misreading some digits. For instance, the 
image above is read as 10/16/2015, while this time:
[image: time_55.jpg]
is read as 16:66:04.
I find this very weird, as these numbers are embedded into the photo by the 
camera trap itself and are very consistent. The size is always the same, 
the digits are identical, yet the same digit is read by the software in 
different ways, and sometimes not read at all. And 8 is always read as 
something else.
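
A reading such as 16:66:04 cannot be a real time, so impossible values like this can at least be flagged automatically. The sketch below is only a sanity check under assumptions: the function name is hypothetical, and the formats (HH:MM:SS for times, MM/DD/YYYY for dates) are guessed from the examples in this message. It does not fix the recognition itself, and a misread that still forms a valid date or time will pass unnoticed.

```python
from datetime import datetime

def is_plausible(text, fmt):
    """Return True if text parses as a real date/time in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

# Assumed formats, inferred from the examples in this thread.
TIME_FMT = "%H:%M:%S"
DATE_FMT = "%m/%d/%Y"

print(is_plausible("16:66:04", TIME_FMT))    # minutes cannot be 66
print(is_plausible("10/16/2015", DATE_FMT))  # valid date, may still be a misread
```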

I would appreciate any advice on how to fix that. My training data was 140 
dates and 140 times, and still, when I generated boxes (I used jTessBoxEditor 
for that), one image would sometimes be read fine and then the next one would 
be read as letters that are not even similar. Could the "pixelated" type of 
font be the issue? The digits are originally 8 px high.

Alternatively, can you advise me on a method to read these values correctly?

Thanks in advance everyone.
Michal

