I fine-tuned starting from eng.traineddata, using about 200 text lines from
the training text and 1100 iterations, reaching a CER of 0.01. The resulting
model is small because it does not include the dictionary files and is
compressed to a fast/integer model.
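
For readers unfamiliar with the metric: CER (character error rate) is usually
defined as the character-level edit distance between the recognized text and
the ground truth, divided by the length of the ground truth. The sketch below
is a generic illustration of that definition, not the code Tesseract's
training tools use. The function names are made up for this example, and the
message does not say whether the 0.01 above is a fraction or a percentage.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Ground truth "717" versus the two outputs quoted later in this thread.
    print(cer("717", "V7"))
    print(cer("717", "717"))
```

Averaged over a held-out set of lines, a value like this gives a quick check
of whether a fine-tuned model actually improved on the base model.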

On Wed, Mar 31, 2021, 03:37 marvin thielk <marvin.thi...@gmail.com> wrote:

> Oops, I missed this delivery failure. The ttf file is too large to attach
> because it contains asian characters. I can upload it somewhere if you're
> interested, but I plan on training a model for my own edification. Original
> message below:
>
> This is awesome, thank you so much!
>
> What hyperparameters did you use for training? Number of pages? Epochs?
>
> Which model did you start with? Your file seems smaller than other
> eng.traineddata files.
>
> Thanks,
> ~Marvin
>
> On Sun, Mar 28, 2021 at 10:16 AM Shree Devi Kumar <shreesh...@gmail.com>
> wrote:
>
>> Fine-tuning with the font will help.
>>
>> I retrained using "Oleo Script Swash Caps Bold" font which had
>> numerals similar to the test image. And the numbers get recognized now.
>>
>> (base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 717-300.png -
>> V7
>> (base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 717-300.png -
>> --tessdata-dir /home/ubuntu/tesstrain/data/   -l engtuned
>> Failed to load any lstm-specific dictionaries for lang engtuned!!
>> 717
>>
>> The fine-tuned traineddata file is attached.
>>
>> On Sat, Mar 27, 2021 at 10:14 PM Marvin Thielk <marvin.thi...@gmail.com>
>> wrote:
>>
>>> I do have the font available as a ttf file. It is probably copyright
>>> protected, but I could post it if it would be useful.
>>> No, I need to recognize letters and numbers, and I've been able to
>>> extract text from other regions of the images; it's just this region of
>>> numbers and .%'s that fails.
>>>
>>> Thanks,
>>> ~Marvin
>>>
>>> On Saturday, March 27, 2021 at 9:50:46 AM UTC-4 shree wrote:
>>>
>>>> Do you have the font used in the sample?
>>>> Do you only need to recognise numbers in it?
>>>>
>>>> On Sat, Mar 27, 2021, 16:10 Marvin Thielk <marvin...@gmail.com> wrote:
>>>>
>>>>> I've tried a variety of pre-processing attempts and different configs,
>>>>> but this feels like it should be an easy detection task.
>>>>>
>>>>> I've tried several different psm and oem settings, even
>>>>> restricting to numerical characters. Nothing seems to help.
>>>>>
>>>>> Is the next step to re-train it?
>>>>>
>>>>> version info if it helps:
>>>>> tesseract v5.0.0-alpha.20201127
>>>>>  leptonica-1.78.0
>>>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 :
>>>>> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>>>>  Found AVX2
>>>>>  Found AVX
>>>>>  Found FMA
>>>>>  Found SSE
>>>>>  Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6
>>>>> liblz4/1.7.5
>>>>>  Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN
>>>>> libssh2/1.7.0 nghttp2/1.31.0
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1bb67d51-2bd3-4d4e-9ba1-8b39b7f3ee43n%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1bb67d51-2bd3-4d4e-9ba1-8b39b7f3ee43n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
>>
>
>
> --
> Marvin Thielk
> Neuroscience PhD candidate at UCSD
> 775 964 8726
>
