I do not know the internal algorithms used by tesseract. If you are having accuracy issues with certain letters and digits, I suggest that you fine-tune the model, as in the wiki's fine-tuning example for the Impact font, using your images or a similar font.
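As a rough sketch of what such a fine-tuning run can look like (not a tested recipe): the working directory `./finetune`, the output names `impact` and `eng_impact.traineddata`, and the `run` helper are assumptions made for illustration. It also assumes the training list file and its .lstmf files have already been generated from your training text, and that the base eng.traineddata is the float model from tessdata_best, since the integer models cannot be used as a base for training. By default the script only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Sketch of a fine-tuning run. All paths and output names are assumptions.
BASE=eng.traineddata     # base model; assumed to be the tessdata_best version
WORK=./finetune          # hypothetical working directory
MAX_ITER=400             # upper end of the 300-400 iterations suggested

# Hypothetical helper: print the command unless RUN=1 is set, then execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

finetune() {
    # 1. Extract the LSTM model from the base traineddata.
    run combine_tessdata -e "$BASE" "$WORK/eng.lstm"

    # 2. Continue training from that model on the new training lines.
    run lstmtraining \
        --model_output "$WORK/impact" \
        --continue_from "$WORK/eng.lstm" \
        --traineddata "$BASE" \
        --train_listfile "$WORK/eng.training_files.txt" \
        --max_iterations "$MAX_ITER"

    # 3. Convert the resulting checkpoint into a usable traineddata file.
    run lstmtraining --stop_training \
        --continue_from "$WORK/impact_checkpoint" \
        --traineddata "$BASE" \
        --model_output "$WORK/eng_impact.traineddata"
}

finetune
```

The resulting file would then be copied into the tessdata directory and selected with -l like any other language.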
Please see the wiki page on training Tesseract 4.0 for the command; look for the section on fine-tuning for a new font (the Impact example). Use eng.traineddata as the base, 50-100 lines of training text, and 300-400 iterations at most.

On Fri 10 Aug, 2018, 8:39 PM , <da...@maxcommunications.co.uk> wrote:
> Hi Shree, just a quick update.
>
> I've now looked into this output tesseract.log further and now understand
> how it works: it goes through different choices and eventually decides on
> a "best choice". However, the output doesn't explain how it then decides
> what has overriding priority in giving the best outcome. Even after it
> scours through the "fo" dictionary and decides on a best choice for that
> dictionary, it immediately moves on to the eng dictionary and seems to
> use the eng dictionary output because (I'm guessing) it regards it as
> more accurate. This means your theory about our custom "fo" dictionary
> not being trained to a high enough accuracy level seems to be correct.
> Is there any possible way I can train either eng or fo to improve its
> accuracy, or override another dictionary on specific characters it's
> getting wrong? For example, in our case, the eng.traineddata dictionary
> sometimes gets 3's and 5's mixed up and it has a lot of trouble with 4's.
>
> Your help on this would be greatly appreciated!
>
> Kind Regards,
>
> Damon
>
> On Thursday, 9 August 2018 11:29:11 UTC+1, shree wrote:
>>
>> The output tesseract.log file should be produced in the directory from
>> where you are running the command, usually where your OCR output is
>> created.
>>
>> On Thu, Aug 9, 2018 at 3:48 PM <da...@maxcommunications.co.uk> wrote:
>>
>>> Hello Shree, thank you for your prompt reply.
>>>
>>> I have now changed the logfile as instructed. Where can I find the
>>> output tesseract.log file? Will it be produced in the same location as
>>> the logfile, in C:\Program Files (x86)\Tesseract-OCR\tessdata\configs ?
>>> I'm guessing the tesseract.log file will be produced once I've used
>>> logfile in the commands.
>>>
>>> Kind Regards,
>>>
>>> Damon
>>>
>>>
>>> On Wednesday, 8 August 2018 19:07:02 UTC+1, shree wrote:
>>>>
>>>> I think this could be because your new traineddata is not trained to
>>>> as high an accuracy level as the eng traineddata.
>>>>
>>>> You can set up a debug log to verify this. See
>>>> https://github.com/tesseract-ocr/tesseract/issues/1275#issuecomment-360367865
>>>> for details.
>>>>
>>>> On Wed, Aug 8, 2018 at 6:04 PM <da...@maxcommunications.co.uk> wrote:
>>>>
>>>>> I'm trying to use a combination of two traineddata dictionaries
>>>>> together, because one of them recognises specific numbers better
>>>>> than the other.
>>>>>
>>>>> Here is an example of the code line.
>>>>>
>>>>> $codeLine .= '<br>magick convert "'.$filePath.'" -quality 90 -density 300x300 -units PixelsPerInch "'.$output.'.jpg"'; //
>>>>> $codeLine .= '<br>tesseract "'.$output.'.jpg" "'.$output.'" -l fo+eng txt pdf';
>>>>>
>>>>> Despite the fact that I put "fo" in front (this is the one that
>>>>> recognises the number 4 better), it still gives me an output text
>>>>> file that is identical to the "eng" dictionary output when I run
>>>>> that on its own.
>>>>>
>>>>> For some reason, it not only prioritises eng but also ignores the
>>>>> fo traineddata file completely.
>>>>>
>>>>> The "fo" file definitely works, as I've tested it on its own.
>>>>>
>>>>> I have attached an image example of the text I'd like to OCR and the
>>>>> two relevant traineddata files.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a5a6768-baeb-4ba9-9cbd-adda6cba957c%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>> --
>>>>
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduX5krDDD4epswLwLHnWLrNVLLLf-H2uLnZbGzR-iEUPqw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.