Oh boy....

Well, there are some steps to follow (again, I worked this out by searching on
Google; if someone knows a better way, please let me know). I'll list them with
a short description. If you need more details, we can talk later.

   1. Prepare the training set: you'll need some examples to work with. The
   more, the merrier. After that, you need to standardize the training set. I
   got better results with 300 dpi images in TIFF format.
   2. Process the training set: one of the mistakes I made was applying
   some filters to the images I wanted to recognize and not applying the same
   filters to the training set. If you use any processing or filter (I used
   binarization and noise removal), you need to apply it to the training set
   as well.
   3. Create the truth files: the training will be based on these
   truth files. In early versions of tesseract, you had to cut up the images
   and provide text files. It's easier now: you can create .box files from your
   images using tesseract itself. The command is *tesseract <image>.tiff
   <output_name> -l <language> wordstrbox*
   4. Correct the .box files: go through the truth files (these .box files)
   and fix them. They will be the basis for the fine-tuning. If the output was
   an "a" and it should be an "s", change it in these files.
   5. Create the training files: after correcting every box file you have
   for the training set, create the training files. The command is *tesseract
   <image>.tiff <output_name> lstm.train*
   6. Generate the training list file: no mystery here, the training
   requires a file with the paths of ALL .lstmf files created in the previous
   step. On Linux, you can do this with the command *ls -1 *.lstmf >
   all_lstmf.txt*
   7. Tuning: now comes the real training. The command is:

*lstmtraining \*
*--model_output <path_output> \*
*--continue_from <path_language_lstm> \*
*--traineddata <path_traineddata> \*
*--train_listfile <path_all_lstmf.txt> \*
*--max_iterations <max_iterations>*
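
The data preparation in steps 3, 5 and 6 lends itself to a small script. Below is a minimal sketch, assuming bash, that all preprocessed images sit in one directory with a .tiff extension, and that each .box file shares its image's base name. The function names and the directory layout are made up for illustration and do not come from tesseract.

```shell
#!/usr/bin/env bash
# Sketch of steps 3, 5 and 6 for a directory of already-preprocessed TIFFs.
# Function names and layout are hypothetical; adjust to your setup.

# Step 3: write a draft <name>.box next to each image, unless one already
# exists (so hand-corrected files from step 4 are never overwritten).
make_boxes() {
    local dir=$1 lang=$2 img base
    for img in "$dir"/*.tiff; do
        [ -e "$img" ] || continue          # no .tiff files at all
        base=${img%.tiff}
        [ -e "$base.box" ] || tesseract "$img" "$base" -l "$lang" wordstrbox
    done
}

# Step 5: create <name>.lstmf from each image and its corrected .box file.
make_lstmf() {
    local dir=$1 img base
    for img in "$dir"/*.tiff; do
        [ -e "$img" ] || continue
        base=${img%.tiff}
        tesseract "$img" "$base" lstm.train
    done
}

# Step 6: list every .lstmf file in the directory, one absolute path per line.
make_listfile() {
    local dir=$1 out=$2
    find "$(cd "$dir" && pwd)" -maxdepth 1 -name '*.lstmf' | sort > "$out"
}
```

A possible order of use: run make_boxes once, correct the .box files by hand (step 4), then run make_lstmf and make_listfile. Writing absolute paths into the list file means lstmtraining can be started from any directory.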

Some notes on the lstmtraining command in step 7: you'll need the lstm file for
the language you are fine-tuning. You can get it from the tesseract GitHub
(ALWAYS USE THE BEST MODELS, i.e. tessdata_best). You also need the traineddata
of this language. Again, use the BEST one.
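
As far as the official training documentation goes, the .lstm file is not a separate download: it is extracted from the "best" traineddata with combine_tessdata -e. Below is a minimal sketch, assuming bash and that combine_tessdata from the tesseract training tools is installed. The function name and the example paths are hypothetical.

```shell
#!/usr/bin/env bash
# Extract the LSTM component of a "best" traineddata file so it can be
# passed to lstmtraining via --continue_from.
# extract_lstm is a hypothetical helper name, not part of tesseract.
extract_lstm() {
    local traineddata=$1 out_dir=$2 lang
    lang=$(basename "$traineddata" .traineddata)   # e.g. "por"
    mkdir -p "$out_dir"
    combine_tessdata -e "$traineddata" "$out_dir/$lang.lstm"
}

# Example call (paths are placeholders):
# extract_lstm tessdata_best/por.traineddata work
```

The resulting <language>.lstm is what goes in <path_language_lstm>, and the same traineddata file is what goes in <path_traineddata>.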

After the training finishes, create the traineddata for the new fine-tuned
language:
*lstmtraining \*
*--stop_training \*
*--continue_from <path_output>_checkpoint \*
*--traineddata <path_traineddata> \*
*--model_output <path_output_new_language>.traineddata*

With these steps, you'll have a new .traineddata file. Put it in your
tessdata directory and you're ready to go.
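
A quick way to confirm that tesseract picks up the new model is to copy it into place and look for it in the language list. Below is a minimal sketch, assuming bash and a tesseract build that supports --tessdata-dir and --list-langs. The function name is hypothetical, and the language name is simply taken from the file name.

```shell
#!/usr/bin/env bash
# Copy a fine-tuned model into a tessdata directory and check that
# tesseract lists it. install_and_check is a hypothetical helper name.
install_and_check() {
    local model=$1 tessdata_dir=$2 lang
    lang=$(basename "$model" .traineddata)
    cp "$model" "$tessdata_dir/" || return 1
    # Depending on the version, the list may go to stdout or stderr, so both
    # are searched for a line that is exactly the language name.
    tesseract --tessdata-dir "$tessdata_dir" --list-langs 2>&1 | grep -qx "$lang"
}
```

If the check passes, the model can be tried with *tesseract <image>.tiff stdout -l <new_language>*, remembering to apply the same preprocessing to the image as was applied to the training set.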

I could've missed something, since I'm writing this from memory, but I'm almost
sure that's all I did.

Hope this helps.

Best regards.


On Sun, Apr 11, 2021 at 14:46, Winston Shaji Jacob <
technofrea...@gmail.com> wrote:

> How did you fine tune?
>
> On Friday, March 12, 2021 at 2:01:48 AM UTC-5 pron...@gmail.com wrote:
>
>> I really don't know if it's the correct way, but I achieved this with
>> fine-tuning.
>>
>> If there is a better way, I would be happy to know.
>>
>>
>>
>> On Thursday, March 11, 2021 at 16:56:40 UTC-3,
>> techno...@gmail.com wrote:
>>
>>> I'm surprised there's no easy way to extract marked and unmarked
>>> checkboxes (ballot boxes),
>>> basically the U+2610  ☐   and U+2612 ☒
>>> I can't figure out how to make tesseract recognize these.
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/8b7cee3d-f413-4738-ab84-21f42281f85fn%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/8b7cee3d-f413-4738-ab84-21f42281f85fn%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>


-- 
Netão

“*The trouble with being punctual is that nobody's there to appreciate it*.”
Franklin P. Jones

