I do not know about the internal algorithms used by tesseract.

If you are having accuracy issues with certain letters and digits, I
suggest that you fine-tune "for impact", using your images or a similar font.

Please see the wiki page on training 4.0 for the command; look for the
section on fine tuning for a new font/impact. Use eng.traineddata as the
base, with 50-100 lines of training text and 300-400 iterations at most.
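
As a rough sketch of what that wiki section describes, the script below
prints the sequence of commands rather than running them. The font name,
directories and training text file are placeholders I made up for
illustration, not values from your setup, and I am assuming the base
eng.traineddata comes from the tessdata_best repository, since as far as I
know the integer "fast" models cannot be used as a starting point for
fine-tuning.

```shell
#!/bin/sh
# Dry-run sketch: prints fine-tuning commands instead of executing them.
# Every path and the font name are placeholders (assumptions); adjust them.
FONT="Impact Condensed"          # replace with a font close to your images
TEXT=./digits.training_text      # hypothetical file, 50-100 lines of text
BEST=./tessdata_best             # directory holding the "best" eng.traineddata
OUT=./finetune                   # hypothetical working directory
MAX_ITER=400                     # upper end of the 300-400 suggested above

fine_tune_plan() {
  # Unquoted here-document: variables expand, and "\\" prints one backslash
  # so the emitted commands keep their line continuations.
  cat <<EOF
tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \\
  --noextract_font_properties --langdata_dir ./langdata \\
  --tessdata_dir "$BEST" --fontlist "$FONT" \\
  --training_text "$TEXT" --output_dir "$OUT/train"
combine_tessdata -e "$BEST/eng.traineddata" "$OUT/eng.lstm"
lstmtraining --model_output "$OUT/impact" \\
  --continue_from "$OUT/eng.lstm" \\
  --traineddata "$BEST/eng.traineddata" \\
  --train_listfile "$OUT/train/eng.training_files.txt" \\
  --max_iterations $MAX_ITER
lstmtraining --stop_training \\
  --continue_from "$OUT/impact_checkpoint" \\
  --traineddata "$BEST/eng.traineddata" \\
  --model_output "$OUT/impact.traineddata"
EOF
}

fine_tune_plan
```

The four steps are: generate line data from the training text, extract the
LSTM model from eng.traineddata, continue training from it for a small
number of iterations, and pack the resulting checkpoint into a new
traineddata file. Once the placeholders match your system, the printed
commands can be reviewed and then piped to sh to run them.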

On Fri 10 Aug, 2018, 8:39 PM , <da...@maxcommunications.co.uk> wrote:

> Hi Shree, just a quick update.
>
> I've now looked into the tesseract.log output further and understand how it
> works: it goes through different choices and eventually decides on a "best
> choice". However, the output doesn't explain how it then decides which
> result takes priority. After it searches the "fo" dictionary and decides on
> a best choice for it, it immediately moves on to the eng dictionary and
> seems to use the eng output because (I'm guessing) it regards it as more
> accurate. This means your theory about our custom "fo" dictionary not being
> trained to a high enough accuracy level seems to be correct. Is there any
> way I can train either eng or fo to improve its accuracy, or make one
> dictionary override the other on specific characters it gets wrong? For
> example, in our case the eng.traineddata dictionary sometimes mixes up 3s
> and 5s, and it has a lot of trouble with 4s.
>
> Your help on this would be greatly appreciated!
>
> Kind Regards,
>
> Damon
>
> On Thursday, 9 August 2018 11:29:11 UTC+1, shree wrote:
>>
>> The tesseract.log output file should be produced in the directory from
>> which you run the command, usually where your OCR output is created.
>>
>> On Thu, Aug 9, 2018 at 3:48 PM <da...@maxcommunications.co.uk> wrote:
>>
>>> Hello Shree, thank you for your prompt reply.
>>>
>>> I have now changed the logfile as instructed. Where can I find the
>>> output tesseract.log file? Will it be produced in the same location as the
>>> logfile, in C:\Program Files (x86)\Tesseract-OCR\tessdata\configs? I'm
>>> guessing the tesseract.log file will be produced once I've used logfile in
>>> the commands.
>>>
>>> Kind Regards,
>>>
>>> Damon
>>>
>>>
>>> On Wednesday, 8 August 2018 19:07:02 UTC+1, shree wrote:
>>>>
>>>> I think this could happen if your new traineddata is not trained to as
>>>> high an accuracy level as the eng traineddata.
>>>>
>>>> You can set up a debug log to verify this. See
>>>> https://github.com/tesseract-ocr/tesseract/issues/1275#issuecomment-360367865
>>>> for details
>>>>
>>>> On Wed, Aug 8, 2018 at 6:04 PM <da...@maxcommunications.co.uk> wrote:
>>>>
>>>>> I'm trying to use a combination of two traineddata dictionaries
>>>>> together, because one of them recognises specific numbers better
>>>>> than the other.
>>>>>
>>>>> Here is an example of the code line.
>>>>>
>>>>>     $codeLine .= '<br>magick convert "'.$filePath.'" -quality 90 -density 300x300  -units PixelsPerInch "'.$output.'.jpg"';
>>>>>     $codeLine .= '<br>tesseract "'.$output.'.jpg" "'.$output.'" -l fo+eng txt pdf';
>>>>>
>>>>> Despite the fact that I put "fo" in front (this is the one that
>>>>> recognises the number 4 better), it still gives me an output text file
>>>>> that is identical to the "eng" dictionary output when I run that on its
>>>>> own.
>>>>>
>>>>> For some reason, it not only prioritises eng but also ignores the fo
>>>>> traineddata file completely.
>>>>>
>>>>> The "fo" file definitely works, as I've tested it on its own.
>>>>>
>>>>> I have attached an example image of the text I'd like to OCR and the
>>>>> two relevant traineddata files.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a5a6768-baeb-4ba9-9cbd-adda6cba957c%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a5a6768-baeb-4ba9-9cbd-adda6cba957c%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>
>>
>>
>>
>

