I fine-tuned starting from eng.traineddata, using about 200 text lines from the training text and 1100 iterations, reaching a CER of 0.01. The resulting model is small because it does not include the dictionary files and has been converted to a fast (integer) model.
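In case CER (character error rate) is unfamiliar: it is the edit distance between the recognized text and the ground truth, divided by the length of the ground truth. Below is a minimal Python sketch of that definition. The function names `levenshtein` and `cer` are mine, for illustration only; this is not how the Tesseract training tools compute or report their error figure, and I am not assuming anything about whether the 0.01 above is a fraction or a percentage.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Outputs quoted earlier in this thread for the 717-300.png sample:
    # the stock model returned "V7", the fine-tuned model returned "717".
    print(cer("717", "V7"))   # 2 edits over 3 reference characters
    print(cer("717", "717"))  # exact match, so no edits
```

On a single three-character sample the figure is very coarse; it only becomes meaningful when averaged over many evaluation lines.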

On Wed, Mar 31, 2021, 03:37 marvin thielk <marvin.thi...@gmail.com> wrote:

> oops, missed this delivery failure. The ttf file is too large to attach
> because it contains asian characters. I can upload it somewhere if you're
> interested, but I plan on training a model for my own edification. Original
> message below:
>
> This is awesome, thank you so much!
>
> What hyperparameters did you use for training? number of pages? epochs?
>
> Which model did you start with? your file seems smaller than other
> eng.traineddata files.
>
> Thanks,
> ~Marvin
>
> On Sun, Mar 28, 2021 at 10:16 AM Shree Devi Kumar <shreesh...@gmail.com>
> wrote:
>
>> Finetuning with font will help.
>>
>> I retrained using "Oleo Script Swash Caps Bold" font which had
>> numerals similar to the test image. And the numbers get recognized now.
>>
>> (base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 717-300.png -
>> V7
>> (base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 717-300.png - --tessdata-dir /home/ubuntu/tesstrain/data/ -l engtuned
>> Failed to load any lstm-specific dictionaries for lang engtuned!!
>> 717
>>
>> Finetuned traineddata File is attached.
>>
>> On Sat, Mar 27, 2021 at 10:14 PM Marvin Thielk <marvin.thi...@gmail.com>
>> wrote:
>>
>>> I do have the font available as a ttf file. It is probably copyright
>>> protected but I could post it if it would be useful.
>>> No I need to recognize letters and numbers, and I've been able to
>>> extract text from other regions of the images, its just this region of
>>> numbers and .%'s
>>>
>>> Thanks,
>>> ~Marvin
>>>
>>> On Saturday, March 27, 2021 at 9:50:46 AM UTC-4 shree wrote:
>>>
>>>> Do you have the font used in the sample?
>>>> Do you only need to recognise numbers in it?
>>>>
>>>> On Sat, Mar 27, 2021, 16:10 Marvin Thielk <marvin...@gmail.com> wrote:
>>>>
>>>>> I've tried a variety of pre-processing attempts and different configs,
>>>>> but this feels like it should be an easy detection task.
>>>>>
>>>>> I've tried with several different psm and oem settings. Even
>>>>> restricting to numerical characters. Nothing seems to help.
>>>>>
>>>>> Is the next step to re-train it?
>>>>>
>>>>> version info if it helps:
>>>>> tesseract v5.0.0-alpha.20201127
>>>>>  leptonica-1.78.0
>>>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 :
>>>>>   libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>>>>  Found AVX2
>>>>>  Found AVX
>>>>>  Found FMA
>>>>>  Found SSE
>>>>>  Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6
>>>>>  liblz4/1.7.5
>>>>>  Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN
>>>>>  libssh2/1.7.0 nghttp2/1.31.0
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1bb67d51-2bd3-4d4e-9ba1-8b39b7f3ee43n%40googlegroups.com
>
> --
> Marvin Thielk
> Neuroscience PhD candidate at UCSD