Re: [tesseract-ocr] Trained data for E13B font

ElGato ElMago Thu, 08 Aug 2019 22:31:38 -0700

Here's what I've shared on GitHub.  I hope it's of some use to somebody.

https://github.com/ElMagoElGato/tess_e13b_training


On Thursday, August 8, 2019 at 9:35:17 AM UTC+9, ElGato ElMago wrote:
>
> OK, I'll do so.  I need to reorganize naming and so on a little bit.  Will 
> be out there soon.
>
> On Wednesday, August 7, 2019 at 9:11:01 PM UTC+9, Mamadou wrote:
>>
>>
>>
>> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>>>
>>> Hi,
>>>
>>> I'm thinking of sharing it, of course.  What is the best way to do it?
>>> After all this, my own contribution is only how I prepared the
>>> training text.  Even that consists of Shree's text and mine.  The
>>> instructions and tools I used already exist.
>>>
>> If you have a Github account just create a repo and publish the data and 
>> instructions. 
>>
>>>
>>> ElMagoElGato
>>>
>>> On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:
>>>
>>>> Hello,
>>>> Are you planning to release the dataset or models?
>>>> I'm working on the same subject and planning to share both under BSD 
>>>> terms
>>>>
>>>> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> FWIW, I got to the point where I'm happy with the accuracy.  As
>>>>> the images in the previous post show, the symbols, especially the
>>>>> on-us symbol and the amount symbol, were being confused with each
>>>>> other or with another character.  I added many more symbols to the
>>>>> training text and formed words that start with a symbol.  One
>>>>> example is as follows:
>>>>>
>>>>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>>>>
>>>>>
>>>>> I randomly generated 8,000 lines like this.  When fine-tuning from eng,
>>>>> 5,000 iterations were almost enough.  The amount symbol is still
>>>>> occasionally misread when it's followed by 0.  Fine-tuning tends to be
>>>>> thrown off by small particles.  I'll have to think of something to
>>>>> improve it further.
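One way to produce lines like the example above is sketched below in Python. The thread does not describe the actual generation rule, so the word lengths, the number of words per line, and the function names are assumptions made for illustration. Only the character set (digits plus `:;<=` as ASCII stand-ins for the four E13B symbols) and the idea of words that start with a symbol come from the message.

```python
import random

DIGITS = "0123456789"
SYMBOLS = ":;<="  # ASCII stand-ins for the four E13B symbols, as used in this thread


def make_word(rng, min_len=4, max_len=20):
    """Build one 'word' that starts with a symbol, then mixes digits and symbols."""
    length = rng.randint(min_len, max_len)
    first = rng.choice(SYMBOLS)
    rest = "".join(rng.choice(DIGITS + SYMBOLS) for _ in range(length - 1))
    return first + rest


def make_line(rng, min_words=1, max_words=3):
    """Join a random number of words with single spaces."""
    count = rng.randint(min_words, max_words)
    return " ".join(make_word(rng) for _ in range(count))


def make_training_text(n_lines=8000, seed=0):
    """Return n_lines random lines; a fixed seed keeps the output reproducible."""
    rng = random.Random(seed)
    return [make_line(rng) for _ in range(n_lines)]


if __name__ == "__main__":
    # Print a few sample lines; a real run would write all of them to a file.
    for line in make_training_text(n_lines=3):
        print(line)
```

Seeding the generator makes it possible to regenerate exactly the same training text later, which helps when comparing fine-tuning against training from scratch.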
>>>>>
>>>>> Training from scratch produced somewhat more stable traineddata.  It
>>>>> doesn't confuse the symbols as often but tends to generate extra
>>>>> spaces.  By 10,000 iterations, those spaces were gone and recognition
>>>>> became very solid.
>>>>>
>>>>> I thought I might have to do image and box file training but I guess 
>>>>> it's not needed this time.
>>>>>
>>>>> ElMagoElGato
>>>>>
>>>>> On Friday, July 26, 2019 at 2:08:06 PM UTC+9, ElGato ElMago wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Well, I read the description of ScrollView (
>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and 
>>>>>> it says:
>>>>>>
>>>>>> To show the characters, deselect DISPLAY/Bounding Boxes, select 
>>>>>> DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>>>>>>
>>>>>>
>>>>>> It basically works.  But for some reason, it doesn't work on my e13b
>>>>>> image and ends up with a blue screen.  Anyway, it shows each box
>>>>>> separately when a character consists of multiple boxes.  I'd like to
>>>>>> show the box for the whole character.  ScrollView doesn't do that, at
>>>>>> least not yet.  I'll do it on my own.
>>>>>>
>>>>>> ElMagoElGato
>>>>>>
>>>>>> On Wednesday, July 24, 2019 at 2:10:46 PM UTC+9, ElGato ElMago wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>> I got this result from hocr.  This is where one of the phantom 
>>>>>>> characters comes from.
>>>>>>>
>>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'>&lt;</span>
>>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>
>>>>>>>
>>>>>>>
>>>>>>> The first character is the phantom.  It starts at the same x
>>>>>>> position as the second character, which is the one that really
>>>>>>> exists.  The first character is only 3 pixels wide.  I attach
>>>>>>> ScrollView screenshots that visualize this.
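The observation above suggests a simple post-filter: drop any character whose box is very thin and overlaps a neighbouring box on the x axis. The following Python sketch applies that idea to the two spans quoted in the message. It is only an illustration, not part of Tesseract: the function names and the 5-pixel threshold are assumptions, and a regular expression is used for brevity where a real HTML parser would be more robust.

```python
import html
import re

# Matches per-character spans of the form shown in the message above.
CINFO_RE = re.compile(
    r"<span class='ocrx_cinfo'\s+title='x_bboxes\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+);"
    r"\s+x_conf\s+([\d.]+)'>(.*?)</span>",
    re.S,
)


def parse_cinfo(hocr):
    """Return a list of (char, (x0, y0, x1, y1), conf) tuples."""
    chars = []
    for x0, y0, x1, y1, conf, text in CINFO_RE.findall(hocr):
        box = (int(x0), int(y0), int(x1), int(y1))
        chars.append((html.unescape(text), box, float(conf)))
    return chars


def overlaps_x(a, b):
    """True if boxes a and b share any horizontal extent."""
    return a[0] < b[2] and b[0] < a[2]


def drop_phantoms(chars, min_width=5):
    """Drop characters that are thinner than min_width (an assumed
    threshold) and overlap the previous or next character horizontally."""
    kept = []
    for i, (ch, box, conf) in enumerate(chars):
        neighbours = chars[max(i - 1, 0):i] + chars[i + 1:i + 2]
        too_thin = box[2] - box[0] < min_width
        if too_thin and any(overlaps_x(box, nb[1]) for nb in neighbours):
            continue
        kept.append((ch, box, conf))
    return kept


SAMPLE = (
    "<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'>&lt;</span>\n"
    "<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>\n"
)

chars = parse_cinfo(SAMPLE)
print([c for c, _, _ in chars])                 # ['<', ';']
print([c for c, _, _ in drop_phantoms(chars)])  # [';']
```

Note that the confidence value is of no help here: the phantom has a confidence of about 98.9, so the filter relies on geometry alone.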
>>>>>>>
>>>>>>> [image: 2019-07-24-132643_854x707_scrot.png][image: 
>>>>>>> 2019-07-24-132800_854x707_scrot.png]
>>>>>>>
>>>>>>>
>>>>>>> There seem to be more cases that cause phantom characters.  I'll
>>>>>>> look into them.  But I have a trivial question now.  I made ScrollView
>>>>>>> show these displays by accidentally clicking the Display->Blamer menu.
>>>>>>> There is a Bounding Boxes menu below it, but it ends up showing a blue
>>>>>>> screen, though it briefly shows boxes on the way.  Can I use this menu
>>>>>>> at all?  It would be very useful.
>>>>>>>
>>>>>>> [image: 2019-07-24-140739_854x707_scrot.png]
>>>>>>>
>>>>>>>
>>>>>>> On Tuesday, July 23, 2019 at 5:10:36 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>
>>>>>>>> It's great! Perfect!  Thanks a lot!
>>>>>>>>
>>>>>>>> On Tuesday, July 23, 2019 at 10:56:58 AM UTC+9, shree wrote:
>>>>>>>>>
>>>>>>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>>>>>>
>>>>>>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I read the hocr output with lstm_choice_mode = 4, as described in
>>>>>>>>>> pull request 2554.  It shows the candidates for each character but
>>>>>>>>>> doesn't show the bounding box of each character.  It only shows the
>>>>>>>>>> box for a whole word.
>>>>>>>>>>
>>>>>>>>>> I see bounding boxes for each character in the comments of pull
>>>>>>>>>> request 2576.  How can I do that?  Do I have to look in the source
>>>>>>>>>> code and produce such output on my own?
>>>>>>>>>>
>>>>>>>>>> On Friday, July 19, 2019 at 6:40:49 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>>>
>>>>>>>>>>> Lorenzo,
>>>>>>>>>>>
>>>>>>>>>>> I haven't been checking psm too much.  Will turn to those 
>>>>>>>>>>> options after I see how it goes with bounding boxes.
>>>>>>>>>>>
>>>>>>>>>>> Shree,
>>>>>>>>>>>
>>>>>>>>>>> I see the merges in the git log and also see that the new
>>>>>>>>>>> option lstm_choice_amount works now.  I guess my executable is the
>>>>>>>>>>> latest, though I still see the phantom character.  Hocr produces huge
>>>>>>>>>>> and complex output.  I'll take some time to read it.
>>>>>>>>>>>
>>>>>>>>>>> On Friday, July 19, 2019 at 6:20:55 PM UTC+9, Claudiu wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Is there any way to pass bounding boxes to use to the LSTM? We 
>>>>>>>>>>>> have an algorithm that cleanly gets bounding boxes of MRZ 
>>>>>>>>>>>> characters. 
>>>>>>>>>>>> However the results using psm 10 are worse than passing the whole 
>>>>>>>>>>>> line in. 
>>>>>>>>>>>> Yet when we pass the whole line in we get these phantom 
>>>>>>>>>>>> characters. 
>>>>>>>>>>>>
>>>>>>>>>>>> Should PSM 10 mode work? It often returns “no character” where 
>>>>>>>>>>>> there clearly is one. I can supply a test case if it is expected 
>>>>>>>>>>>> to work 
>>>>>>>>>>>> well. 
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <
>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Lorenzo,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We both have got the same case.  It seems a solution to this 
>>>>>>>>>>>>> problem would save a lot of people.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shree,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I pulled the current head of the master branch but it doesn't seem
>>>>>>>>>>>>> to contain the merges you pointed to, which were merged 3 to 4
>>>>>>>>>>>>> days ago.  How can I get them?
>>>>>>>>>>>>>
>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Friday, July 19, 2019 at 5:02:53 PM UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PSM 7 was a partial solution for my specific case, it 
>>>>>>>>>>>>>> improved the situation but did not solve it. Also I could not 
>>>>>>>>>>>>>> use it in 
>>>>>>>>>>>>>> some other cases.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The proper solution is very likely doing more training with
>>>>>>>>>>>>>> more data; some data augmentation might help if data is scarce.
>>>>>>>>>>>>>> Doing less training might also help if the training is not
>>>>>>>>>>>>>> done correctly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are also similar issues on github:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The LSTM engine works like this: it scans the image and for 
>>>>>>>>>>>>>> each "pixel column" does this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (here I report only the highest-probability characters)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the example above an M is partially seen as an N, this is 
>>>>>>>>>>>>>> normal, and another step of the algorithm (beam search I think) 
>>>>>>>>>>>>>> tries to 
>>>>>>>>>>>>>> aggregate back the correct characters.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think cases like this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> M M M N N N M M
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> are what give the phantom characters. More training should
>>>>>>>>>>>>>> reduce the source of the problem, or a painful analysis of the
>>>>>>>>>>>>>> bounding boxes might fix some cases.
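The per-column labels above can be turned into text with plain best-path (greedy) CTC decoding: merge repeated labels, then remove blanks. The Python sketch below does that on the two label sequences from the message. It is not Tesseract's decoder, which per the message uses something like a beam search. The `min_run` filter is a crude stand-in added here only to show why a single stray `N` is easy to discard while a run of three is not; the function names and that parameter are assumptions.

```python
from itertools import groupby

BLANK = "[BLANK]"


def runs(frames):
    """Group consecutive identical labels into (label, run_length) pairs."""
    return [(label, len(list(group))) for label, group in groupby(frames)]


def collapse(frames, min_run=1):
    """Best-path CTC collapse with an optional short-run filter.

    min_run=1 is the plain rule: merge repeats, then drop blanks.
    min_run>1 first discards non-blank runs shorter than min_run
    (a heuristic for illustration, not part of CTC itself).
    """
    kept = []
    for label, length in runs(frames):
        if label == BLANK or length >= min_run:
            kept.extend([label] * length)
    return "".join(label for label, _ in runs(kept) if label != BLANK)


normal = "M M M M N M M M [BLANK] F F F F".split()
phantom = "M M M N N N M M".split()

print(collapse(normal))              # MNMF
print(collapse(normal, min_run=2))   # MF
print(collapse(phantom, min_run=2))  # MNM
```

In the first sequence the lone `N` disappears once short runs are ignored, whereas in the second the `N` persists for three columns and survives as an extra character between two `M`s.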
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I used the attached script for the boxes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 7:25 AM ElGato ElMago <
>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Let's call them phantom characters then.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Was psm 7 the solution for issue 1778?  None of the psm
>>>>>>>>>>>>>>> options solved my problem, though I see different output.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I use tesseract 5.0-alpha mostly but 4.1 showed the same
>>>>>>>>>>>>>>> results anyway.  How did you get the bounding box for each
>>>>>>>>>>>>>>> character?  Alto and lstmbox only show the bbox for a group of
>>>>>>>>>>>>>>> characters.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wednesday, July 17, 2019 at 6:58:31 PM UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Phantom characters here for me too:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are you using 4.1? Bounding boxes were fixed in 4.1 maybe 
>>>>>>>>>>>>>>>> this was also improved.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I wrote some code that uses symbols iterator to discard 
>>>>>>>>>>>>>>>> symbols that are clearly duplicated: too small, overlapping, 
>>>>>>>>>>>>>>>> etc. But it 
>>>>>>>>>>>>>>>> was not easy to make it work decently and it is not 100% 
>>>>>>>>>>>>>>>> reliable with 
>>>>>>>>>>>>>>>> false negatives and positives. I cannot share the code and it 
>>>>>>>>>>>>>>>> is quite ugly 
>>>>>>>>>>>>>>>> anyway.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here is another MRZ model with training data:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Claudiu <
>>>>>>>>>>>>>>>> csaf...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I’m getting the “phantom character” issue as well using
>>>>>>>>>>>>>>>>> the OCRB that Shree trained on MRZ lines. For example, for a 0 it will
>>>>>>>>>>>>>>>>> sometimes add both a 0 and an O to the output, thus outputting 45
>>>>>>>>>>>>>>>>> characters total instead of 44. I haven’t looked at the bounding box
>>>>>>>>>>>>>>>>> output yet but I suspect a phantom thin character is added somewhere
>>>>>>>>>>>>>>>>> that I can discard, or maybe two chars will have the same bounding
>>>>>>>>>>>>>>>>> box. If anyone else has fixed this issue further up (e.g. so the output
>>>>>>>>>>>>>>>>> doesn’t contain the phantom characters in the first place) I’d be
>>>>>>>>>>>>>>>>> interested.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <
>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'll go back to more training later.  Before doing so,
>>>>>>>>>>>>>>>>>> I'd like to investigate the results a little bit.  The hocr and
>>>>>>>>>>>>>>>>>> lstmbox options 
>>>>>>>>>>>>>>>>>> give some details of positions of characters.  The results 
>>>>>>>>>>>>>>>>>> show positions 
>>>>>>>>>>>>>>>>>> that perfectly correspond to letters in the image.  But the 
>>>>>>>>>>>>>>>>>> text output 
>>>>>>>>>>>>>>>>>> contains a character that obviously does not exist.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Then I found a config file 'lstmdebug' that generates far 
>>>>>>>>>>>>>>>>>> more information.  I hope it explains what happened with 
>>>>>>>>>>>>>>>>>> each character.  
>>>>>>>>>>>>>>>>>> I'm yet to read the debug output but I'd appreciate it if 
>>>>>>>>>>>>>>>>>> someone could 
>>>>>>>>>>>>>>>>>> tell me how to read it because it's really complex.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 7:58:49 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I have uploaded my files there. 
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>>>>>>>>>>>>>>>>>>> is the bash script that runs the training.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> You can modify as needed. Please note this is for 
>>>>>>>>>>>>>>>>>>> legacy/base tesseract --oem 0.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <
>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks a lot, shree.  It seems you know everything.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two
>>>>>>>>>>>>>>>>>>>> mcr.traineddata.  The last one was blocked by the browser.
>>>>>>>>>>>>>>>>>>>> Each of the traineddata had mixed results.  All of them get the
>>>>>>>>>>>>>>>>>>>> symbols fairly well but insert spaces randomly and read some
>>>>>>>>>>>>>>>>>>>> numbers wrong.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> MICR0 seems the best among them.  Did you suggest that
>>>>>>>>>>>>>>>>>>>> you'd be able to update it?  It very often outputs a triple D where
>>>>>>>>>>>>>>>>>>>> there's only one, and so on.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Also, I tried to fine tune from MICR0 but I found that 
>>>>>>>>>>>>>>>>>>>> I need to change the language-specific.sh.  It specifies 
>>>>>>>>>>>>>>>>>>>> some parameters 
>>>>>>>>>>>>>>>>>>>> for each language.  Do you have any guidance for it?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 1:48:40 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> see 
>>>>>>>>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx 
>>>>>>>>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> It would be nice if there were traineddata out there,
>>>>>>>>>>>>>>>>>>>>>> but I didn't find any.  I see free fonts and commercial
>>>>>>>>>>>>>>>>>>>>>> OCR software but not traineddata.  The tessdata repository
>>>>>>>>>>>>>>>>>>>>>> obviously doesn't have one, either.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Saturday, June 8, 2019 at 1:52:10 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata 
>>>>>>>>>>>>>>>>>>>>>>> files.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> So I did several tests from scratch.  In the last 
>>>>>>>>>>>>>>>>>>>>>>>> attempt, I made a training text with 4,000 lines in 
>>>>>>>>>>>>>>>>>>>>>>>> the following format,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 
>>>>>>>>>>>>>>>>>>>>>>>> ;0000001000;
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text in 
>>>>>>>>>>>>>>>>>>>>>>>> which symbols are converted to E13B symbols.  This 
>>>>>>>>>>>>>>>>>>>>>>>> makes about 12,000 lines 
>>>>>>>>>>>>>>>>>>>>>>>> of training text.  It's amazing that this thing 
>>>>>>>>>>>>>>>>>>>>>>>> generates a good reader out 
>>>>>>>>>>>>>>>>>>>>>>>> of nowhere.  But then it is not very good.  For 
>>>>>>>>>>>>>>>>>>>>>>>> example:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> is the result for the attached image.  It's close but
>>>>>>>>>>>>>>>>>>>>>>>> the last '<' in the result text doesn't exist in the
>>>>>>>>>>>>>>>>>>>>>>>> image.  It's a small failure but it causes greater
>>>>>>>>>>>>>>>>>>>>>>>> trouble in parsing.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> What would you suggest from here to increase 
>>>>>>>>>>>>>>>>>>>>>>>> accuracy?  
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>    - Increase the number of lines in the training 
>>>>>>>>>>>>>>>>>>>>>>>>    text
>>>>>>>>>>>>>>>>>>>>>>>>    - Mix up more variations in the training text
>>>>>>>>>>>>>>>>>>>>>>>>    - Increase the number of iterations
>>>>>>>>>>>>>>>>>>>>>>>>    - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>>>>>>>>    - Or else?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could
>>>>>>>>>>>>>>>>>>>>>>>> generate a similar result with the fine-tuning-from-full
>>>>>>>>>>>>>>>>>>>>>>>> method.  It seems a bit faster to get to the same level
>>>>>>>>>>>>>>>>>>>>>>>> but it also stops at a 'good' level.  I can go either way
>>>>>>>>>>>>>>>>>>>>>>>> if it takes me to the bright future.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 3:56:02 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 2:39:52 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> See 
>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also 
>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Create training text of about 100 lines and 
>>>>>>>>>>>>>>>>>>>>>>>>>> finetune for 400 lines 
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I had about 14 lines as attached.  How many
>>>>>>>>>>>>>>>>>>>>>>>>>>> lines would you recommend?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Fine-tuning gives a much better result but it
>>>>>>>>>>>>>>>>>>>>>>>>>>> tends to pick characters outside E13B, which
>>>>>>>>>>>>>>>>>>>>>>>>>>> only has 14 characters: 0 through 9 and 4 symbols.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I thought training from scratch would eliminate
>>>>>>>>>>>>>>>>>>>>>>>>>>> such confusion.
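Because the character set is so small, stray characters picked up from the base model are easy to detect after recognition. The Python sketch below lists every character in an OCR result that falls outside the 14-character set. It assumes, as the sample lines elsewhere in this thread suggest, that `:`, `;`, `<` and `=` are the ASCII stand-ins for the four E13B symbols; the function name is hypothetical.

```python
# Digits plus the four ASCII stand-ins for the E13B symbols (an assumption
# based on the sample lines in this thread).
E13B_ALPHABET = set("0123456789:;<=")


def foreign_chars(text):
    """Return the sorted characters in text that are not part of the
    14-character set.  Whitespace is ignored."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in E13B_ALPHABET})


print(foreign_chars("<01 :1901=1386:021= 1111001<10001< ;0000090134;"))  # []
print(foreign_chars("<O1 :19O1=1386:021="))                              # ['O']
```

A check like this only flags out-of-alphabet output; it cannot catch a wrong digit or a phantom symbol that is itself a valid E13B character.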
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> For training from scratch a large training text 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> and hundreds of thousands of iterations are 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> recommended. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> If you are just fine tuning for a font try to 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> follow instructions for training for impact, with 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> your font.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <
>>>>>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instructions.  The steps I took
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%,  New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The test with base_checkpoint returns nothing 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> as:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The test with eng.traineddata and e13b.png 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> returns out.txt.  Both files are attached.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Training seems to have worked fine.  I don't
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> know how to interpret the test result from
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> base_checkpoint.  The generated eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> obviously doesn't work well.  I suspect the choice of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --traineddata when combining the output files is bad,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> but I have no clue.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> BTW, I referred to your tess4training in the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> process.  It helped a lot.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, May 29, 2019 at 7:14:08 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
