Because I'm getting encoding errors. I checked the unicharset that was 
generated and it did not contain all the characters I need, so I would 
like to create my own unicharset that includes them. 
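
A minimal sketch of how the gap between the training text and the generated unicharset could be checked before building a custom one. It assumes the plain-text unicharset layout in which the first line holds the entry count and every later line begins with the symbol followed by space-separated properties, and that the space character is stored as the token NULL. The function names are hypothetical and are not part of Tesseract. The check is per code point, so multi-character unicharset entries and any text normalization applied during training are not accounted for.

```python
def read_unicharset_symbols(lines):
    """Return the set of symbols listed in a unicharset file.

    `lines` is the file content split into lines. Assumes line 1 is the
    entry count and each later line starts with the symbol itself.
    """
    symbols = set()
    for line in lines[1:]:  # skip the count on the first line
        fields = line.split(" ")
        if not fields[0]:
            continue  # ignore empty lines
        # Assumption: the space character is written as the token NULL.
        symbols.add(" " if fields[0] == "NULL" else fields[0])
    return symbols


def missing_characters(training_text, symbols):
    """Return the non-whitespace characters of training_text absent from symbols."""
    wanted = {ch for ch in training_text if not ch.isspace()}
    return sorted(wanted - symbols)
```

To use it, read both files as UTF-8, pass `text.splitlines()` of the unicharset to `read_unicharset_symbols`, and print each result of `missing_characters` with its code point (for example `f"U+{ord(ch):04X} {ch}"`). Any character it reports is a candidate cause of an "Encoding of string failed!" message.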

On Wednesday, January 13, 2021 at 12:48:52 AM UTC-6 shree wrote:

> Unicharset is extracted from training text, because those are the samples 
> that will be used for training.
>
> Why do you want to use a different unicharset?
>
>
> On Tue, Jan 12, 2021, 23:47 Kamui 7 <qntmm...@gmail.com> wrote:
>
>>
>>
>> Great! The PR that you submitted fixed issue #3. All that's left is the 
>> encoding string problem. I wonder if it's a problem with the unicharset 
>> extractor?
>> On Monday, January 11, 2021 at 11:30:39 AM UTC-6 shree wrote:
>>
>>> Please see https://github.com/tesseract-ocr/tesseract/issues/3001 for 
>>> updates
>>>
>>> On Saturday, January 9, 2021 at 10:19:02 PM UTC+5:30 qntmm...@gmail.com 
>>> wrote:
>>>
>>>>
>>>> How do I create my own custom unicharset file? The tesstrain script 
>>>> seems to generate one from the training text, but I want to pass in 
>>>> my own unicharset file. 
>>>> On Friday, January 8, 2021 at 12:58:27 AM UTC-6 shree wrote:
>>>>
>>>>> Are any of these vertical fonts?
>>>>>
>>>>> Encoding errors can occur if characters in the training text are not 
>>>>> in the unicharset.
>>>>>
>>>>> On Fri, Jan 8, 2021, 00:46 Kamui 7 <qntmm...@gmail.com> wrote:
>>>>>
>>>>>> Looks like that fixed bug #1. Now it is able to successfully create 
>>>>>> 400 pages. Do you have any ideas as to why the other 2 errors are 
>>>>>> occurring?
>>>>>> On Thursday, January 7, 2021 at 11:28:12 AM UTC-6 shree wrote:
>>>>>>
>>>>>>> Your training text file is only 175 lines, so the rendered image 
>>>>>>> fits in 4 pages. You need to use a larger text if you want more pages.
>>>>>>>
>>>>>>> Also check that your fonts support both English and Japanese as the 
>>>>>>> text seems to have samples of both languages.
>>>>>>>
>>>>>>> On Thu, Jan 7, 2021, 22:40 Kamui 7 <qntmm...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I ran a find command in the root directory and searched for the 
>>>>>>>> tesstrain script. It found only the script that I pulled from the 
>>>>>>>> latest tesseract git repo. My training script calls that specific 
>>>>>>>> tesstrain script using a relative path, so it couldn't be an older 
>>>>>>>> version.
>>>>>>>>
>>>>>>>> On Thursday, January 7, 2021 at 11:01:55 AM UTC-6 shree wrote:
>>>>>>>>
>>>>>>>>> Old versions of tesstrain.sh used to limit training to 3 pages. 
>>>>>>>>> Looks like you may have an old version in the path somewhere.
>>>>>>>>>
>>>>>>>>> On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 <qntmm...@gmail.com> 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I have a script to train Tesseract. I ran it on Arch Linux, 
>>>>>>>>>> Debian, and even in a Docker container, and they all produce the 
>>>>>>>>>> same errors. I also checked that the script is correct. 
>>>>>>>>>>
>>>>>>>>>> Bug 1:
>>>>>>>>>> This happens when tesstrain runs text2image. The max pages 
>>>>>>>>>> parameter has no effect: it renders only 4 pages regardless of 
>>>>>>>>>> what I pass in for the maxpages parameter. I even tried 
>>>>>>>>>> hardcoding it in the tesstrain_utils.sh file and it still does 
>>>>>>>>>> the same thing. 
>>>>>>>>>>
>>>>>>>>>> Bug 2:
>>>>>>>>>> After it finishes producing those 4 pages, I fine-tune with 
>>>>>>>>>> lstmtraining and the resulting output is full of "Encoding of 
>>>>>>>>>> string failed!" errors.
>>>>>>>>>>
>>>>>>>>>> Bug 3:
>>>>>>>>>> Along with those encoding errors, it also outputs the following 
>>>>>>>>>> text:
>>>>>>>>>>
>>>>>>>>>> "Image too small to scale!! (2x48 vs min width of 3)
>>>>>>>>>> Line cannot be recognized!!
>>>>>>>>>> Image not trainable"
>>>>>>>>>>
>>>>>>>>>> I will upload my script along with the Dockerfile if anyone wants 
>>>>>>>>>> to take a look. 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://drive.google.com/file/d/1FkW1q1cXwOxY6Yi1A1cMzInbtJa9L01M/view?usp=sharing
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/7a9415d6-4d0c-4333-98c0-2628720661ebn%40googlegroups.com
>>>>>>>>>>  
>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/7a9415d6-4d0c-4333-98c0-2628720661ebn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>

