OK, in case anyone is interested: I ended up creating a custom font based on the actual digits that I extracted from the photo, then used this custom font for training, and it worked 100%. It took me a couple of days of tweaking. I described it in detail here: https://msmielak.github.io/post/2021-03-29-extracting-date-and-time-from-photo-using-ocr-engine-tesseract/

On Sunday, 28 March 2021 at 20:44:50 UTC+3, Michał Śmielak wrote:
> Hi everyone, new user here.
>
> I have an issue with tesseract (run as an R library, if that makes any
> difference). I am trying to read data from a camera trap photo. An example:
>
> [image: pic (1).jpg]
>
> The photo is low-res (640x480 px), but as you can see, the numbers are
> easily readable by a human. I managed to tune it a bit by clipping parts of
> the picture, reversing the image, upscaling etc., and I now have something
> like this:
>
> [image: date_49.jpg]
>
> You would think it is an easy thing to read, so I created a subset of these
> outputs, merged them into a TIFF and manually fixed the automatic detection,
> but tesseract keeps misreading some digits. For instance, this is read as
> 10/16/2015, while this time:
>
> [image: time_55.jpg]
>
> is read as 16:66:04.
>
> I find this very weird, as these numbers are embedded into the photo by
> the camera trap itself and are very consistent. The size is always the same
> and the digits are identical, yet the same digit is read by the software in
> different ways, and sometimes not read at all. And 8 is always read as
> something else.
>
> I would appreciate any advice on how to fix that. My training data was 140
> dates and 140 times, and still, when I generated boxes (I used jTessBoxEditor
> for that), sometimes one would be read fine, and then the next one would be
> read as letters that are not even similar. Could the "pixelated" type of
> font be the issue? The digits are originally 8 px high.
>
> Alternatively, can you advise me on a method to read these values
> correctly?
>
> Thanks in advance, everyone.
> Michal
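
A side note on the 16:66:04 example: a misread like that is at least easy to detect, because no clock shows minute 66. Below is a minimal sketch of such a sanity check in Python (standard library only). The function name `parse_stamp` is made up for illustration, and the stamp layout (MM/DD/YYYY plus HH:MM:SS) is only an assumption inferred from the two examples in the quoted message; it is not part of the font-based fix described in the blog post.

```python
from datetime import datetime
from typing import Optional

# Assumed stamp layout, inferred from the examples in the thread
# ("10/16/2015" and "16:66:04"); adjust for a different camera model.
DATE_FORMAT = "%m/%d/%Y"
TIME_FORMAT = "%H:%M:%S"


def parse_stamp(date_text: str, time_text: str) -> Optional[datetime]:
    """Return a datetime if both OCR strings are plausible, else None.

    Hypothetical helper: it only rejects impossible values, it cannot
    tell whether a valid-looking value was actually read correctly.
    """
    try:
        return datetime.strptime(
            f"{date_text.strip()} {time_text.strip()}",
            f"{DATE_FORMAT} {TIME_FORMAT}",
        )
    except ValueError:
        # Letters instead of digits, or an impossible value such as minute 66.
        return None
```

With this, `parse_stamp("10/16/2015", "16:66:04")` returns `None`, so that photo can be set aside for manual checking. A misread that still yields a valid date or time will pass unnoticed, so this complements better recognition rather than replacing it.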