Because I'm getting encoding errors. I checked the unicharset that it generated, and it did not contain all the characters I need, so I would like to create my own unicharset with all of them.
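One way to confirm that the generated unicharset really is missing characters is to compare it against the training text directly. The sketch below is only an illustration, not part of tesstrain. It assumes the usual unicharset layout (first line is the entry count, each following line starts with the unichar, and the space character is stored as `NULL`). It compares single code points only, so multi-codepoint unichars and any normalization Tesseract applies to the text are not accounted for. The function names are hypothetical.

```python
from pathlib import Path


def parse_unicharset(content):
    """Return the set of unichars listed in the text of a unicharset file.

    Assumption: line 1 holds the entry count and every later line begins
    with the unichar, followed by a space and its properties.
    """
    chars = set()
    for line in content.splitlines()[1:]:
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]
        # Assumption: the space character is written as the token NULL.
        chars.add(" " if token == "NULL" else token)
    return chars


def missing_characters(text, unichars):
    """Return the sorted non-whitespace characters of text absent from unichars."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in unichars})


def report(training_text_path, unicharset_path):
    """Print every training-text character that the unicharset lacks."""
    unichars = parse_unicharset(Path(unicharset_path).read_text(encoding="utf-8"))
    text = Path(training_text_path).read_text(encoding="utf-8")
    for ch in missing_characters(text, unichars):
        print(f"U+{ord(ch):04X} {ch}")
```

Calling `report()` with the paths of the training text and the generated unicharset lists the characters that could trigger "Encoding of string failed!". An empty result would suggest the cause lies elsewhere.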

On Wednesday, January 13, 2021 at 12:48:52 AM UTC-6 shree wrote:
> Unicharset is extracted from training text, because those are the samples
> that will be used for training.
>
> Why do you want to use a different unicharset?
>
> On Tue, Jan 12, 2021, 23:47 Kamui 7 <qntmm...@gmail.com> wrote:
>> Great! The PR that you submitted fixed issue #3. All that's left is the
>> encoding string problem. I wonder if it's a problem with the unicharset
>> extractor?
>>
>> On Monday, January 11, 2021 at 11:30:39 AM UTC-6 shree wrote:
>>> Please see https://github.com/tesseract-ocr/tesseract/issues/3001 for
>>> updates.
>>>
>>> On Saturday, January 9, 2021 at 10:19:02 PM UTC+5:30 qntmm...@gmail.com wrote:
>>>> How do I create my own custom unicharset file? The tesstrain script
>>>> seems to be generating one based on the training text, but I want to
>>>> pass in my own unicharset file.
>>>>
>>>> On Friday, January 8, 2021 at 12:58:27 AM UTC-6 shree wrote:
>>>>> Are any of these vertical fonts?
>>>>>
>>>>> Encoding errors can occur if characters in the training text are not
>>>>> in the unicharset.
>>>>>
>>>>> On Fri, Jan 8, 2021, 00:46 Kamui 7 <qntmm...@gmail.com> wrote:
>>>>>> Looks like that fixed bug #1. Now it is able to successfully create
>>>>>> 400 pages. Do you have any ideas as to why the other 2 errors are
>>>>>> occurring?
>>>>>>
>>>>>> On Thursday, January 7, 2021 at 11:28:12 AM UTC-6 shree wrote:
>>>>>>> Your training text file is only 175 lines, so the rendered image
>>>>>>> fits in 4 pages. You need to use a larger text if you want more pages.
>>>>>>>
>>>>>>> Also check that your fonts support both English and Japanese, as the
>>>>>>> text seems to have samples of both languages.
>>>>>>>
>>>>>>> On Thu, Jan 7, 2021, 22:40 Kamui 7 <qntmm...@gmail.com> wrote:
>>>>>>>> I ran a find command in the root directory and searched for the
>>>>>>>> tesstrain script. It could only find the script that I pulled from
>>>>>>>> the latest tesseract git repo. My training script calls that specific
>>>>>>>> tesstrain script using a relative path, so it couldn't be an older
>>>>>>>> version.
>>>>>>>>
>>>>>>>> On Thursday, January 7, 2021 at 11:01:55 AM UTC-6 shree wrote:
>>>>>>>>> Old versions of tesstrain.sh used to limit training to 3 pages.
>>>>>>>>> Looks like you may have an old version in the path somewhere.
>>>>>>>>>
>>>>>>>>> On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 <qntmm...@gmail.com> wrote:
>>>>>>>>>> I have a script to train tesseract and I ran it on Arch Linux,
>>>>>>>>>> Debian, and even a docker container, and they all produce the same
>>>>>>>>>> errors. I checked to make sure the script is correct as well.
>>>>>>>>>>
>>>>>>>>>> Bug 1:
>>>>>>>>>> This happens when tesstrain runs text2image. The max pages
>>>>>>>>>> parameter does not work at all. It ends up only rendering 4 pages
>>>>>>>>>> regardless of what I pass in for the maxpages parameter. I even
>>>>>>>>>> tried hardcoding it into the tesstrain_utils.sh file and it still
>>>>>>>>>> does the same thing.
>>>>>>>>>>
>>>>>>>>>> Bug 2:
>>>>>>>>>> After it finishes producing those 4 pages, I finetune it with
>>>>>>>>>> lstmtraining and the resulting output is full of "Encoding of
>>>>>>>>>> string failed!" errors.
>>>>>>>>>>
>>>>>>>>>> Bug 3:
>>>>>>>>>> Along with those encoding errors, it also outputs the following
>>>>>>>>>> text:
>>>>>>>>>>
>>>>>>>>>> "Image too small to scale!! (2x48 vs min width of 3)
>>>>>>>>>> Line cannot be recognized!!
>>>>>>>>>> Image not trainable"
>>>>>>>>>>
>>>>>>>>>> I will upload my script along with the Dockerfile if anyone wants
>>>>>>>>>> to take a look.
>>>>>>>>>>
>>>>>>>>>> https://drive.google.com/file/d/1FkW1q1cXwOxY6Yi1A1cMzInbtJa9L01M/view?usp=sharing

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e8bfeca9-d94e-4a5c-b810-29fef48217c4n%40googlegroups.com.