Hello everyone, this is my first time posting to this group, but I am hoping 
I can get some questions answered here. For the record, I am using 
Tesseract 5.0.1 on an x64 Windows 10 machine. First, let me provide a bit of 
background on the problem I'm having.

I have built a data scraper to collect data that is structured in a 
specific way. The structure is extremely simple: a floating point number 
followed by an integer, with a space in between, e.g. 8.51  800. The 
tessdata_best eng.traineddata file works best for this, but it has certain 
issues with the data I am feeding it. On occasion it omits the decimal 
point from the first number, or inserts a phantom digit that isn't there, 
e.g. 8.561 instead of 8.51. Occasionally it misreads 9 as 0. And last but 
not least, when a value has more than 2 digits after the decimal point, it 
sometimes omits the space between the two numbers, e.g. the output is 
119.9089500 instead of 119.9089  500. I have made every effort to clean up 
the data I am feeding it to mitigate these issues, with decent success: the 
output was vastly worse at the beginning of my experiments. That is the 
background.
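
To make the expected format concrete, here is a minimal sketch of a check on the OCR output. The helper name parse_line and the exact pattern are assumptions made for illustration (they are not part of Tesseract), and the pattern assumes the first number always contains a decimal point.

```python
import re

# Assumed line format, taken from the description above: a decimal number,
# whitespace, then an integer (e.g. "8.51  800"). Hypothetical helper.
LINE_RE = re.compile(r"(\d+\.\d+)\s+(\d+)")

def parse_line(text):
    """Return (float, int) if text matches the expected format, else None."""
    m = LINE_RE.fullmatch(text.strip())
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2))

# Good line, missing decimal point, missing space, stray punctuation.
for sample in ["8.51  800", "851  800", "119.9089500", "8..51  5:0.0"]:
    print(repr(sample), "->", parse_line(sample))
```

A check like this only catches structural errors (missing decimal point, missing space, stray colons). A phantom digit such as 8.561 or a 9 misread as 0 still matches the pattern, so it would pass unnoticed.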

I now need to increase the accuracy of the output, because missing decimal 
points, missing spaces, and misread characters are unacceptable for my 
application. I first tried cleaning up the data further, which, as I said 
above, has had good success but has not eliminated every issue. I have 
attached some sample data files that are being OCR'd by Tesseract and will 
provide others if required. At this point I decided to retrain Tesseract 
with tesstrain to try to increase the accuracy further.

Because I am on Windows 10 and don't have easy access to a Linux box, I 
resolved to make tesstrain work on Windows. After some finagling with the 
Makefile and using Cygwin, I was finally able to start the 
training/fine-tuning process.

I know that Tesseract trains best with randomized data, so at first I 
captured real-world data by hand and saved the captures to convert into 
.tif and .gt.txt files. After collecting 500-600 data lines of different 
types, I ran fine-tuning.

This is where the issues begin. After fine-tuning I no longer had missing 
decimal points. Instead I was getting double decimal points, and even 
colons and decimal points where they shouldn't be, e.g. 8..51  5:0.0. I 
thought this might be because my training data also included clock 
outputs, e.g. 08:09:56, so I tried again with only the other data, but the 
result was about the same, just without the colons. After trying many 
different learning rates, max_iterations values, and so on, I decided to 
try something different.

Because it appeared my data was not randomized enough, given the nature of 
the problem I'm facing, I decided instead to create a charset out of 
different captures from my application and randomly generate synthetic 
training data, complete with .gt.txt transcriptions. I tried fine-tuning 
with this as well, but to no avail, so I am now thinking I may need to 
train from scratch. My reasoning is that training from scratch might work 
better because this is such a constrained problem: no structure of data 
exists other than what is shown above and in the attached sample files.
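
For readers who want to reproduce the synthetic-data step, the transcription side can be sketched roughly as follows. This is not the generator actually used. The value ranges, the file-name prefix, and the function names are all assumptions for illustration, and the matching .tif line images (same base name as each .gt.txt) would still have to be rendered separately.

```python
import random
from pathlib import Path

def random_line(rng):
    """Build one ground-truth string: decimal number, two spaces, integer."""
    int_part = rng.randint(0, 999)   # assumed range of the integer part
    decimals = rng.randint(2, 4)     # assumed 2 to 4 fractional digits
    frac = "".join(rng.choice("0123456789") for _ in range(decimals))
    count = rng.randint(1, 9999)     # assumed range of the trailing integer
    return f"{int_part}.{frac}  {count}"

def write_ground_truth(out_dir, n, seed=0):
    """Write n .gt.txt files and return the generated lines."""
    rng = random.Random(seed)        # seeded so a run can be reproduced
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    lines = []
    for i in range(n):
        line = random_line(rng)
        # Hypothetical naming scheme; the image would be synthetic_00000.tif etc.
        (out / f"synthetic_{i:05d}.gt.txt").write_text(line + "\n", encoding="utf-8")
        lines.append(line)
    return lines
```

Calling write_ground_truth("ground-truth", 1000) would produce 1000 transcription files. Drawing every digit uniformly at random is one way to avoid the training set over-representing a few values.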

I am stumped at this point. I've read through much of the documentation on 
GitHub as well as other sources I've found across the Internet. I've tried 
different versions and different page segmentation modes (psm), and at one 
point even the legacy engine, which gave somewhat better results but still 
missed decimal points and was considerably slower, which is also 
unacceptable.
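
A comparison like this can be organized as a small sweep over the command-line options. The sketch below only builds the tesseract commands and does not run them. The image name and the particular psm values are assumptions, not the settings actually tested. Note that --oem 0 (legacy engine) needs a traineddata file that contains the legacy model, which the tessdata_best files do not.

```python
import itertools

def build_commands(image, lang="eng", psms=(6, 7, 13), oems=(0, 1)):
    """Return one tesseract command (as an argument list) per psm/oem pair."""
    cmds = []
    for psm, oem in itertools.product(psms, oems):
        cmds.append([
            "tesseract", image, "stdout",  # print the recognized text to stdout
            "-l", lang,
            "--psm", str(psm),             # page segmentation mode
            "--oem", str(oem),             # 0 = legacy engine, 1 = LSTM engine
        ])
    return cmds

# "sample.tif" is a placeholder file name.
for cmd in build_commands("sample.tif"):
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, capture_output=True, text=True) to collect the output for every combination and compare it against the known transcription.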

Can anyone please help me figure out what is going on here? I have also 
attached the tesstrain Makefile, since there may be something missing there 
that I don't know about. There may be quite a few things like that, because 
this is my first exposure to Tesseract. I've done a lot of research into AI, 
as I'm trying to break into that area with my career, but my experience is 
still somewhat limited and there are definitely things I'm not 100% sure 
about. Thanks in advance :)

Tim Prishtina

P.S. I almost forgot to say that each time I trained, I let the BCER and 
BWER outputs get down to 0.34% or so. With fine-tuning it took only about 
100K iterations to reach that, but from scratch it took almost 400K 
iterations. Just thought I'd include a bit more data. Thanks.

Attachment: Makefile