Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-13 Thread Kamui 7
Because I'm getting encoding errors. I checked the unicharset that it generated and it did not contain all the characters, so I would like to create my own unicharset with all of them. On Wednesday, January 13, 2021 at 12:48:52 AM UTC-6 shree wrote: > Unicharset is extracted from training te

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-12 Thread Shree Devi Kumar
Unicharset is extracted from training text, because those are the samples that will be used for training. Why do you want to use a different unicharset? On Tue, Jan 12, 2021, 23:47 Kamui 7 wrote: > > > Great! The PR that you submitted fixed issue #3. All that's left is the > encoding string pr

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-12 Thread Kamui 7
Great! The PR that you submitted fixed issue #3. All that's left is the encoding string problem. I wonder if it's a problem with the unicharset extractor? On Monday, January 11, 2021 at 11:30:39 AM UTC-6 shree wrote: > Please see https://github.com/tesseract-ocr/tesseract/issues/3001 for > up

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-11 Thread shree
Please see https://github.com/tesseract-ocr/tesseract/issues/3001 for updates On Saturday, January 9, 2021 at 10:19:02 PM UTC+5:30 qntmm...@gmail.com wrote: > > How do I create my own custom unicharset file? The tesstrain script seems > to be generating one based on the training text but I wan

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-09 Thread Kamui 7
How do I create my own custom unicharset file? The tesstrain script seems to generate one based on the training text, but I want to pass in my own unicharset file. On Friday, January 8, 2021 at 12:58:27 AM UTC-6 shree wrote: > Are any of these vertical fonts? > > Encoding errors could be i

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
Are any of these vertical fonts? Encoding errors can occur if characters in the training text are not in the unicharset. On Fri, Jan 8, 2021, 00:46 Kamui 7 wrote: > Looks like that fixed bug #1. Now it is able to successfully create 400 > pages. Do you have any ideas as to why the other 2 errors
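
One way to test the suggestion above is to compare the characters in the training text against the entries in the generated unicharset. The following is a minimal Python sketch, not part of tesstrain. It assumes the usual unicharset layout (first line is the entry count, each later line starts with the unichar followed by a space) and assumes that `NULL`, `Joined`, and `|Broken|0|1` are special entries rather than literal text. The function names and file paths are hypothetical, and only single code points are checked, so multi-character entries are not matched.

```python
from pathlib import Path

# Assumption: these entries are placeholders written by Tesseract,
# not literal characters from the training text.
SPECIAL_ENTRIES = {"NULL", "Joined", "|Broken|0|1"}


def read_unicharset(lines):
    """Return the set of unichar strings listed in unicharset lines."""
    entries = set()
    for line in lines[1:]:  # the first line holds the entry count
        line = line.rstrip("\n")
        if not line:
            continue
        unichar = line.split(" ", 1)[0]  # text before the first space
        if unichar not in SPECIAL_ENTRIES:
            entries.add(unichar)
    return entries


def missing_characters(text, unichars):
    """Return the sorted non-whitespace characters of text with no entry."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in unichars})


def check_files(training_text_path, unicharset_path):
    """Load both files (paths are hypothetical) and report missing characters."""
    text = Path(training_text_path).read_text(encoding="utf-8")
    lines = Path(unicharset_path).read_text(encoding="utf-8").splitlines()
    return missing_characters(text, read_unicharset(lines))
```

Calling `check_files("jpn.training_text", "jpn.unicharset")` (both file names are placeholders) would list the characters that appear in the text but have no single-character unicharset entry; an empty list would suggest the encoding errors come from somewhere else.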

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Kamui 7
Looks like that fixed bug #1. Now it is able to successfully create 400 pages. Do you have any ideas as to why the other 2 errors are occurring? On Thursday, January 7, 2021 at 11:28:12 AM UTC-6 shree wrote: > Your training text file is only 175 lines, so the rendered image fits in 4 > pages. Yo

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Kamui 7
I replaced the training text with the one from the official langdata repo and now it seems to only produce 30 pages. Is there any place to get the training text that the official jpn.traineddata was trained on? I have also checked to make sure the fonts support English and Japanese as well. On

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
Your training text file is only 175 lines, so the rendered image fits in 4 pages. You need to use a larger text if you want more pages. Also check that your fonts support both English and Japanese as the text seems to have samples of both languages. On Thu, Jan 7, 2021, 22:40 Kamui 7 wrote: > I

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Kamui 7
I ran a find command from the root directory to search for the tesstrain script. It could only find the script that I pulled from the latest tesseract git repo. My training script calls that specific tesstrain script using a relative path, so it couldn't be an older version. On Thursday, January

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
Old versions of tesstrain.sh used to limit training to 3 pages. Looks like you may have an old version in the path somewhere. On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 wrote: > I have a script to train tesseract and I ran it on Arch Linux, Debian, and > even a docker container and they all produce
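
To check whether an older copy is being picked up from the path, every `PATH` directory can be scanned for a file with the script's name. This is a minimal Python sketch; the helper name `find_on_path` is hypothetical, and it only reports where copies live, not which version each one is.

```python
import os


def find_on_path(name, path_value=None):
    """Return every file called `name` found in PATH directories, in search order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH components
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and candidate not in hits:
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for hit in find_on_path("tesstrain.sh"):
        print(hit)
```

If more than one path is printed, the first one is the copy a bare `tesstrain.sh` call would normally resolve to. A script invoked through an explicit relative path bypasses this lookup entirely.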

[tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Kamui 7
I have a script to train Tesseract and I ran it on Arch Linux, Debian, and even a Docker container, and they all produce the same errors. I checked to make sure the script is correct as well. Bug 1: This happens when tesstrain runs text2image. The max pages parameter does not work at all. It en