OK, I'll do so. I need to reorganize the naming and so on a little bit. It will be out there soon.
On Wednesday, August 7, 2019 at 21:11:01 UTC+9, Mamadou wrote:

On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:

Hi,

I'm thinking of sharing it, of course. What is the best way to do it? After all this, my own contribution is only how I prepared the training text, and even that consists of Shree's text and mine. The instructions and tools I used already exist.

Mamadou: If you have a GitHub account, just create a repo and publish the data and instructions.

ElMagoElGato

On Wednesday, August 7, 2019 at 8:20:02 UTC+9, Mamadou wrote:

Hello,

Are you planning to release the dataset or models? I'm working on the same subject and planning to share both under BSD terms.

On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:

Hi,

FWIW, I got to the point where I can feel happy with the accuracy. As the images in the previous post show, the symbols, especially the on-us symbol and the amount symbol, were being confused with each other or with other characters.

I added many more symbols to the training text and formed words that start with a symbol. One example is as follows:

9;:;=;<;< <0<1<3<4;6;8;9;:;=;

I randomly made 8,000 lines like this. In fine-tuning from eng, 5,000 iterations were almost enough. The amount symbol is still confused a little when it's followed by a 0. Fine-tuning tends to be dragged around by small particles. I'll have to think of something to make further improvements.

Training from scratch produced somewhat more stable traineddata. It doesn't get confused by symbols so often, but it tends to generate extra spaces. By 10,000 iterations, those spaces are gone and recognition became very solid.

I thought I might have to do image and box file training, but I guess it's not needed this time.
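The line-generation step described above can be sketched in Python. This is a guess at the procedure, not ElMago's actual script; the placeholder characters : ; < = for the four E13B symbols are taken from the sample line.

```python
import random

DIGITS = "0123456789"
SYMBOLS = ":;<="  # placeholder chars for the four E13B symbols, per the sample line

def make_word(rng, length=8):
    # Start every word with a symbol, as described above,
    # then mix digits and symbols randomly.
    chars = [rng.choice(SYMBOLS)]
    chars += [rng.choice(SYMBOLS + DIGITS) for _ in range(length - 1)]
    return "".join(chars)

def make_training_text(n_lines=8000, words_per_line=3, seed=0):
    rng = random.Random(seed)
    return [" ".join(make_word(rng) for _ in range(words_per_line))
            for _ in range(n_lines)]

for line in make_training_text(n_lines=3):
    print(line)
```

Word length and words per line are arbitrary knobs here; the resulting file is used as a normal --training_text input.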
ElMagoElGato

On Friday, July 26, 2019 at 14:08:06 UTC+9, ElGato ElMago wrote:

Hi,

Well, I read the description of ScrollView (https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it says:

To show the characters, deselect DISPLAY/Bounding Boxes, select DISPLAY/Polygonal Approx and then select OTHER/Uniform display.

It basically works, but for some reason it doesn't work on my e13b image and ends up with a blue screen. Anyway, it shows each box separately when a character consists of multiple boxes. I'd like to show the box for the whole character. ScrollView doesn't do that, at least yet. I'll do it on my own.

ElMagoElGato

On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:

Hi,

I got this result from hocr. This is where one of the phantom characters comes from.

<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>

The first character is the phantom. It starts at the same x position as the second character, which does exist. The first character is only 3 points wide. I attach ScrollView screenshots that visualize this.

[image: 2019-07-24-132643_854x707_scrot.png][image: 2019-07-24-132800_854x707_scrot.png]

There seem to be some more cases that cause phantom characters. I'll look into them. But I have a trivial question now. I made ScrollView show these displays by accidentally clicking the Display->Blamer menu. There is a Bounding Boxes menu below it, but it ends up showing a blue screen, though it briefly shows boxes on the way. Can I use this menu at all? It would be very useful.
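The two ocrx_cinfo spans above already contain enough information to spot this phantom mechanically: the box is only a few pixels wide and starts at the same x position as the next, real character. A hedged sketch of such a filter; the regex and thresholds are mine, not anything built into Tesseract.

```python
import html
import re

# Pull character, bbox and confidence out of hocr ocrx_cinfo spans.
SPAN_RE = re.compile(
    r"x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)'>(.*?)</span>")

def parse_cinfo(hocr):
    chars = []
    for m in SPAN_RE.finditer(hocr):
        x0, y0, x1, y1 = map(int, m.group(1, 2, 3, 4))
        chars.append({"char": html.unescape(m.group(6)),
                      "box": (x0, y0, x1, y1),
                      "conf": float(m.group(5))})
    return chars

def drop_phantoms(chars, min_width=5):
    kept = []
    for i, c in enumerate(chars):
        x0, _, x1, _ = c["box"]
        nxt = chars[i + 1] if i + 1 < len(chars) else None
        too_thin = (x1 - x0) < min_width
        # Phantom signature: a sliver starting exactly where the next char starts.
        if too_thin and nxt is not None and x0 == nxt["box"][0]:
            continue
        kept.append(c)
    return kept

hocr = """
<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'>&lt;</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>
"""
print([c["char"] for c in drop_phantoms(parse_cinfo(hocr))])  # the 3-px '<' is dropped
```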
[image: 2019-07-24-140739_854x707_scrot.png]

On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:

It's great! Perfect! Thanks a lot!

On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:

See https://github.com/tesseract-ocr/tesseract/issues/2580

On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> wrote:

Hi,

I read the output of hocr with lstm_choice_mode = 4, as in pull request 2554. It shows the candidates for each character but doesn't show the bounding box of each character. It only shows the box for a whole word.

I see bounding boxes of individual characters in the comments of pull request 2576. How can I do that? Do I have to look in the source code and produce such output on my own?

On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:

Lorenzo,

I haven't been checking psm too much. I will turn to those options after I see how it goes with bounding boxes.

Shree,

I see the merges in the git log and also see that the new option lstm_choice_amount works now. I guess my executable is the latest, though I still see the phantom character. Hocr produces huge and complex output. I'll take some time to read it.

On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:

Is there any way to pass bounding boxes to the LSTM? We have an algorithm that cleanly gets bounding boxes of MRZ characters. However, the results using psm 10 are worse than passing the whole line in. Yet when we pass the whole line in, we get these phantom characters.

Should PSM 10 mode work? It often returns "no character" where there clearly is one. I can supply a test case if it is expected to work well.

On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <elmago...@gmail.com> wrote:

Lorenzo,

We both have the same case. It seems a solution to this problem would help a lot of people.

Shree,

I pulled the current head of the master branch, but it doesn't seem to contain the merges you pointed to, which were merged 3 to 4 days ago. How can I get them?

ElMagoElGato

On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:

PSM 7 was a partial solution for my specific case; it improved the situation but did not solve it. Also, I could not use it in some other cases.

The proper solution is very likely doing more training with more data; some data augmentation might help if data is scarce. Also, doing less training might help if the training is not done correctly.

There are also similar issues on GitHub:

https://github.com/tesseract-ocr/tesseract/issues/1465
...
The LSTM engine works like this: it scans the image, and for each "pixel column" it emits something like:

M M M M N M M M [BLANK] F F F F

(here I report only the highest-probability characters)

In the example above, an M is partially seen as an N. This is normal, and another step of the algorithm (beam search, I think) tries to aggregate the correct characters back together.

I think cases like this:

M M M N N N M M

are what give the phantom characters. More training should reduce the source of the problem, or a painful analysis of the bounding boxes might fix some cases.

I used the attached script for the boxes.

Lorenzo

On Fri, Jul 19, 2019 at 07:25, ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

Let's call them phantom characters then.

Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I see different output.

I use tesseract 5.0-alpha mostly, but 4.1 showed the same results anyway. How did you get a bounding box for each character? Alto and lstmbox only show a bbox for a group of characters.

ElMagoElGato

On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:

Phantom characters here for me too:

https://github.com/tesseract-ocr/tesseract/issues/1778

Are you using 4.1?
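Lorenzo's per-column picture can be reproduced with a few lines of Python. Note that a plain best-path collapse (merge adjacent repeats, drop [BLANK]) already turns a sustained flicker into a phantom; the real decoder's beam search usually rescues the brief single-column case, but the `M M M N N N M M` case defeats it too.

```python
BLANK = "[BLANK]"

def collapse(columns):
    # Naive best-path (greedy CTC-style) decoding:
    # merge adjacent repeats, then drop blanks.
    out, prev = [], None
    for c in columns:
        if c != prev and c != BLANK:
            out.append(c)
        prev = c
    return "".join(out)

print(collapse("M M M M N M M M [BLANK] F F F F".split()))  # "MNMF"
print(collapse("M M M N N N M M".split()))                  # "MNM" - phantom N
```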
Bounding boxes were fixed in 4.1; maybe this was also improved.

I wrote some code that uses the symbols iterator to discard symbols that are clearly duplicated: too small, overlapping, etc. But it was not easy to make it work decently, and it is not 100% reliable, with false negatives and positives. I cannot share the code, and it is quite ugly anyway.

Here is another MRZ model with training data:

https://github.com/DoubangoTelecom/tesseractMRZ

Lorenzo

On Wed, Jul 17, 2019 at 11:26, Claudiu <csaf...@gmail.com> wrote:

I'm getting the "phantom character" issue as well, using the OCRB that Shree trained on MRZ lines. For example, for a 0 it will sometimes add both a 0 and an O to the output, thus outputting 45 characters total instead of 44. I haven't looked at the bounding box output yet, but I suspect a phantom thin character is added somewhere that I can discard, or maybe two chars will have the same bounding box. If anyone else has fixed this issue further up (e.g. so the output doesn't contain the phantom characters in the first place), I'd be interested.

On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

I'll go back to more training later.
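Lorenzo's discard-duplicates idea and Claudiu's 0/O case can be sketched without the Tesseract API: given (char, bbox, confidence) tuples such as the symbol iterator yields, drop slivers and keep the more confident of two heavily overlapping symbols. The thresholds here are illustrative assumptions, not Lorenzo's values.

```python
def overlap_frac(a, b):
    """Horizontal overlap of box b with box a, as a fraction of b's width."""
    inter = min(a[2], b[2]) - max(a[0], b[0])
    width = b[2] - b[0]
    return max(0, inter) / width if width else 1.0

def filter_symbols(symbols, min_width=4, max_overlap=0.6):
    # symbols: list of (char, (x0, y0, x1, y1), confidence)
    kept = []
    for char, box, conf in symbols:
        if box[2] - box[0] < min_width:
            continue  # too thin to be a real character
        if kept and overlap_frac(kept[-1][1], box) > max_overlap:
            # Mostly on top of the previous symbol: keep the more confident one.
            if conf > kept[-1][2]:
                kept[-1] = (char, box, conf)
            continue
        kept.append((char, box, conf))
    return kept

symbols = [
    ("0", (10, 0, 30, 40), 96.0),
    ("O", (12, 0, 28, 40), 55.0),  # duplicate reading of the same glyph
    ("4", (35, 0, 55, 40), 97.0),
]
print("".join(c for c, _, _ in filter_symbols(symbols)))  # "04"
```

As Lorenzo says, this kind of post-filter is fiddly and produces both false positives and negatives; more training attacks the cause rather than the symptom.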
Before doing so, I'd like to investigate the results a little bit. The hocr and lstmbox options give some details of the positions of characters. The results show positions that perfectly correspond to letters in the image. But the text output contains a character that obviously does not exist.

Then I found a config file, 'lstmdebug', that generates far more information. I hope it explains what happened with each character. I'm yet to read the debug output, but I'd appreciate it if someone could tell me how to read it, because it's really complex.

Regards,
ElMagoElGato

On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_MICR

I have uploaded my files there.

https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash script that runs the training.

You can modify it as needed. Please note this is for legacy/base tesseract --oem 0.

On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <elmago...@gmail.com> wrote:

Thanks a lot, Shree. It seems you know everything.

I tried the MICR0.traineddata and the first two mcr.traineddata. The last one was blocked by the browser. Each of the traineddata files had mixed results. All of them get the symbols fairly well but insert spaces randomly and read some numbers wrong.

MICR0 seems the best among them. Did you suggest that you'd be able to update it? It gets triple D very often where there's only one, and so on.

Also, I tried to fine-tune from MICR0, but I found that I need to change language-specific.sh. It specifies some parameters for each language. Do you have any guidance for it?

On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:

See
http://www.devscope.net/Content/ocrchecks.aspx
https://github.com/BigPino67/Tesseract-MICR-OCR
https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ

On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <elmago...@gmail.com> wrote:

It would be nice if there were traineddata out there, but I didn't find any. I see free fonts and commercial OCR software but no traineddata. The tessdata repository obviously doesn't have one, either.

On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:

Please also search for existing MICR traineddata files.

On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmago...@gmail.com> wrote:

So I did several tests from scratch.
In the last attempt, I made a training text with 4,000 lines in the following format:

110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;

and combined it with eng.digits.training_text, in which the symbols are converted to E13B symbols. This makes about 12,000 lines of training text. It's amazing that this thing generates a good reader out of nowhere. But then, it is not very good. For example:

<01 :1901=1386:021= 1111001<10001< ;0000090134;

is the result on the attached image. It's close, but the last '<' in the result text doesn't exist in the image. It's a small failure, but it causes greater trouble in parsing.

What would you suggest from here to increase accuracy?

- Increase the number of lines in the training text
- Mix more variations into the training text
- Increase the number of iterations
- Investigate wrong reads one by one
- Or something else?

Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method. It seems a bit faster to get to the same level, but it also stops at a 'good' level. I can go either way if it takes me to the bright future.

Regards,
ElMagoElGato

On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:

Thanks a lot, Shree. I'll look into it.

On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_shreetest

Look at the files engrestrict*.* and also
https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text

Create a training text of about 100 lines and finetune for 400 lines.

On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com> wrote:

I had about 14 lines, as attached. How many lines would you recommend?

Fine-tuning gives a much better result, but it tends to pick characters other than those in E13B, which only has 14 characters: 0 through 9 and 4 symbols. I thought training from scratch would eliminate such confusion.
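The "symbols converted to E13B symbols" step on eng.digits.training_text can be done with a simple character translation. The punctuation-to-MICR mapping below is a guess for illustration, not the mapping ElMago actually used.

```python
# Hypothetical mapping from digits-text punctuation to the four
# E13B placeholder characters used in this thread.
E13B_MAP = str.maketrans({",": ":", ".": ";", "/": "<", "-": "="})
E13B_SET = set("0123456789:;<= ")

def to_e13b(line):
    converted = line.translate(E13B_MAP)
    # Keep only the 14-character E13B set (plus spaces).
    return "".join(c for c in converted if c in E13B_SET)

print(to_e13b("402.00 4,018/19-81"))  # "402;00 4:018<19=81"
```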
On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:

For training from scratch, a large training text and hundreds of thousands of iterations are recommended.

If you are just fine-tuning for a font, try to follow the instructions for training for impact, with your font.

On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> wrote:

Thanks, Shree.

Yes, I saw the instructions. The steps I took are as follows:

Using tesstrain.sh:

    src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
      --noextract_font_properties --langdata_dir ../langdata \
      --tessdata_dir ./tessdata \
      --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
      --training_text ../langdata/eng/eng.training_e13b_text

Training from scratch:

    mkdir -p ~/tesstutorial/e13boutput
    src/training/lstmtraining --debug_interval 100 \
      --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
      --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
      --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
      --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
      --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
      --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log

Test with base_checkpoint:

    src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
      --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
      --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt

Combining output files:

    src/training/lstmtraining --stop_training \
      --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
      --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
      --model_output ~/tesstutorial/e13boutput/eng.traineddata

Test with eng.traineddata:

    tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput

The training from scratch ended with:

    At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.

The test with base_checkpoint returns nothing but:

    At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0

The test with eng.traineddata and e13b.png returns out.txt. Both files are attached.

Training seems to have worked fine. I don't know how to interpret the test result from base_checkpoint. The generated eng.traineddata obviously doesn't work well. I suspect the choice of --traineddata when combining the output files is bad, but I have no clue.

Regards,
ElMagoElGato

BTW, I referred to your tess4training in the process. It helped a lot.
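The lstmtraining progress lines quoted above have a fixed shape, so a tiny parser makes it easy to watch whether the char error keeps falling across a long log. The regex and field names here are mine, informal conveniences rather than anything from the Tesseract tools.

```python
import re

LOG_RE = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+), Mean rms=([\d.]+)%.*?char train=([\d.]+)%")

def parse_log(text):
    return [{"learning_iter": int(m.group(1)),
             "training_iter": int(m.group(3)),
             "mean_rms": float(m.group(4)),
             "char_train_err": float(m.group(5))}
            for m in LOG_RE.finditer(text)]

sample = ("At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, "
          "char train=0%, word train=0%, skip ratio=0%,")
print(parse_log(sample))
```

Usage: parse_log(open("basetrain.log").read()) over the log written by the training run above.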
On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files

On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

I wish to make traineddata for the E13B font.

I read the training tutorial and made a base_checkpoint file according to the method in Training From Scratch. Now, how can I make traineddata from the base_checkpoint file?