Oh boy... Well, there are a few steps involved (again, I pieced this together from what I found on Google; if someone knows a better way, please let me know). I'll list them with a short description of each; if you need more details, we can talk later.
1. Prepare the training set: you'll need some examples to work with. The more, the merrier. After that, you need to standardize the training set. I got better results with 300 dpi images in TIFF format.
2. Process the training set: one of the mistakes I made was applying some filters to the images I wanted to recognize and not applying the same filters to the training set. If you use any preprocessing or filter (I used binarization and noise removal), you need to apply it to the training set as well.
3. Create the truth files: the training is based on these truth files. In early versions of tesseract, you had to cut the images up and provide text files. It's easier now: you can create .box files from your images using tesseract itself. The command is
*tesseract <image>.tiff <output_name> -l <language> wordstrbox*
4. Correct the .box files: go through the truth files (these .box files) and fix them. These files will be the basis for the fine-tuning. If the output was an "a" and it should be an "s", change it in these files.
5. Create the training files: after correcting every box file in the training set, create the training files. The command is
*tesseract <image>.tiff <output_name> lstm.train*
6. Generate the training list file: no mystery here, the training requires a file with the paths of ALL the lstmf files created in the previous step. On Linux, you can do this with the command
*ls -1 *.lstmf > all_lstmf.txt*
7. Tuning: now comes the real training. The command is:
*lstmtraining \*
*--model_output <path_output> \*
*--continue_from <path_language_lstm> \*
*--traineddata <path_traineddata> \*
*--train_listfile <path_all_lstmf.txt> \*
*--max_iterations <max_iterations>*
Some notes on the command above: you'll need the lstm file of the language you are fine-tuning. You can get it from the tesseract GitHub (ALWAYS USE THE BEST FOLDER, i.e. the tessdata_best models). You need the traineddata of this language too. Again, use the BEST.
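Steps 5 and 6 above are easy to script once the box files are corrected. Here is a rough sketch, not something I have run against a real training set: the function names make_lstmf_files and make_listfile are made up for this example, and I'm assuming each <name>.tiff has its corrected <name>.box sitting next to it in the same directory. The tesseract call is the same one as in step 5.

```shell
#!/usr/bin/env bash
# Sketch of steps 5 and 6. The function names are hypothetical helpers,
# not part of tesseract.

# Step 5: run tesseract with the lstm.train config on every .tiff in a directory.
# Assumption: each <name>.tiff has a corrected <name>.box next to it.
make_lstmf_files() {
  local dir="$1" img
  for img in "$dir"/*.tiff; do
    [ -e "$img" ] || continue   # skip the unexpanded pattern when no .tiff exists
    tesseract "$img" "${img%.tiff}" lstm.train
  done
}

# Step 6: write the path of every .lstmf file in the directory, one per line.
make_listfile() {
  local dir="$1" out="$2" f
  : > "$out"                    # create or truncate the list file
  for f in "$dir"/*.lstmf; do
    [ -e "$f" ] || continue
    printf '%s\n' "$f" >> "$out"
  done
}
```

You would call them as make_lstmf_files /path/to/training_set and then make_listfile /path/to/training_set all_lstmf.txt (the directory is just a placeholder). Passing an absolute directory keeps the paths in the list file valid no matter where lstmtraining is started from.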
After the training finishes, create the traineddata for the new fine-tuned language:
*lstmtraining \*
*--stop_training \*
*--continue_from <path_output>_checkpoint \*
*--traineddata <path_traineddata> \*
*--model_output <path_output_new_language>.traineddata*
With these steps, you'll have a new .traineddata file. Put it in your tessdata directory and you're ready to go.
I may have missed something, since I'm doing this from memory, but I'm almost sure that's all I did.
Hope this helps.
Best regards.

On Sun, Apr 11, 2021 at 14:46, Winston Shaji Jacob <technofrea...@gmail.com> wrote:

> How did you fine tune?
>
> On Friday, March 12, 2021 at 2:01:48 AM UTC-5 pron...@gmail.com wrote:
>
>> I really don't know if it's the correct way, but I achieved this with
>> fine-tuning.
>>
>> If there is a better way, I would be happy to know.
>>
>> On Thursday, March 11, 2021 at 16:56:40 UTC-3, techno...@gmail.com wrote:
>>
>>> I'm surprised there's no easy way to extract marked and unmarked checkboxes
>>> (ballot boxes),
>>> basically the U+2610 ☐ and U+2612 ☒
>>> I can't figure out how to make tesseract recognize this

--
Netão
“*The trouble with being punctual is that nobody's there to appreciate it*.”
Franklin P. Jones

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAPNp8DR7mUxnbrivFnd7dNpEM%3Dfb50hrzZxhfAEgkDv7tVBb_Q%40mail.gmail.com.