I suggest renaming the traineddata file from eng to e13b or another similarly descriptive name, and also adding a link to it on the data file contributions wiki page.
On Fri, 9 Aug 2019, 20:08, 'Mamadou' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote:

> On Friday, August 9, 2019 at 10:40:15 AM UTC+2, ElGato ElMago wrote:
>> I added eng.traineddata and LICENSE. I used my account name in the license file. I don't know if it's appropriate or not. Please tell me if it's not.

It's ok.
Thanks. I'll share our dataset (real-life samples) in the coming days.

On Friday, August 9, 2019 at 16:17:41 UTC+9, Mamadou wrote:

> On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote:
>> Here's my sharing on GitHub. Hope it's of some use to somebody.
>> https://github.com/ElMagoElGato/tess_e13b_training

Thanks for sharing your experience with us.
Is it possible to share your Tesseract model (xxx.traineddata)? We're building a dataset using real-life images, like what we have already done for MRZ (https://github.com/DoubangoTelecom/tesseractMRZ/tree/master/dataset). Your model would help us automate the annotation and speed up our development. Of course we'll have to correct the annotations manually, but it will be faster for us.
Also, please add a license to your repo so that we know whether we have the right to use it.

On Thursday, August 8, 2019 at 9:35:17 UTC+9, ElGato ElMago wrote:

OK, I'll do so. I need to reorganize the naming and so on a little bit. It will be out there soon.

On Wednesday, August 7, 2019 at 21:11:01 UTC+9, Mamadou wrote:

> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>> Hi,
>>
>> I'm thinking of sharing it, of course. What is the best way to do it? After all this, my contribution is only how I prepared the training text, and even that consists of Shree's text and mine. The instructions and tools I used already exist.

If you have a GitHub account, just create a repo and publish the data and instructions.
ElMagoElGato

On Wednesday, August 7, 2019 at 8:20:02 UTC+9, Mamadou wrote:

Hello,
Are you planning to release the dataset or the models? I'm working on the same subject and planning to share both under BSD terms.

On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:

Hi,

FWIW, I got to the point where I can feel happy with the accuracy. As the images in the previous post show, the symbols, especially the on-us symbol and the amount symbol, were being confused with each other or with other characters. I added many more symbols to the training text and formed words that start with a symbol. One example is as follows:

9;:;=;<;< <0<1<3<4;6;8;9;:;=;

I randomly made 8,000 lines like this. When fine-tuning from eng, 5,000 iterations were almost enough. The amount symbol is still confused a little when it's followed by 0. Fine-tuning tends to be dragged around by small details. I'll have to think of something to make further improvements.

Training from scratch produced a somewhat more stable traineddata. It doesn't confuse the symbols as often, but it tends to generate extra spaces. By 10,000 iterations those spaces were gone and recognition became very solid.

I thought I might have to do image and box file training, but I guess it's not needed this time.
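[Editor's note] The 8,000 random symbol-led lines described above can be produced with a short script. This is an illustrative sketch, not the script actually used in the thread: the 14-glyph E13B set transliterated as digits plus ';', ':', '<', '=' follows the examples quoted here, and the word-length range is an assumption.

```python
import random

# E13B has 14 glyphs: digits 0-9 plus four symbols, transliterated in this
# thread as ';' ':' '<' '='.
E13B_CHARS = "0123456789;:<="
E13B_SYMBOLS = ";:<="

def make_line(n_words=6, rng=random):
    """Build one training-text line of symbol-led 'words'."""
    words = []
    for _ in range(n_words):
        length = rng.randint(2, 12)  # word-length range is a guess
        # Start each word with a symbol so the model sees symbols word-initially.
        word = rng.choice(E13B_SYMBOLS) + "".join(
            rng.choice(E13B_CHARS) for _ in range(length - 1))
        words.append(word)
    return " ".join(words)

rng = random.Random(42)  # fixed seed so the output is reproducible
lines = [make_line(rng=rng) for _ in range(8000)]
print(lines[0])
```

The point of starting every word with a symbol is the same as in the message above: the trainer otherwise rarely sees symbols in word-initial position.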
ElMagoElGato

On Friday, July 26, 2019 at 14:08:06 UTC+9, ElGato ElMago wrote:

Hi,

Well, I read the description of ScrollView (https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it says:

To show the characters, deselect DISPLAY/Bounding Boxes, select DISPLAY/Polygonal Approx and then select OTHER/Uniform display.

It basically works, but for some reason it doesn't work on my e13b image and ends up with a blue screen. Anyway, it shows each box separately when a character consists of multiple boxes. I'd like to show the box for the whole character. ScrollView doesn't do that, at least not yet. I'll do it on my own.

ElMagoElGato

On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:

Hi,

I got this result from hocr. This is where one of the phantom characters comes from.

<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>

The first character is the phantom. It starts at the same x position as the second character, which really exists, and it is only 3 pixels wide. I attach ScrollView screenshots that visualize this.

[image: 2019-07-24-132643_854x707_scrot.png][image: 2019-07-24-132800_854x707_scrot.png]

There seem to be some more cases that cause phantom characters. I'll look into them. But I have a trivial question now.
I made ScrollView show these displays by accidentally clicking the Display->Blamer menu. There is a Bounding Boxes menu below it, but that ends up showing a blue screen, though it briefly shows boxes on the way. Can I use this menu at all? It would be very useful.

[image: 2019-07-24-140739_854x707_scrot.png]

On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:

It's great! Perfect! Thanks a lot!

On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:

See https://github.com/tesseract-ocr/tesseract/issues/2580

On Tue, 23 Jul 2019, 06:23, ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

I read the output of hocr with lstm_choice_mode = 4, per pull request 2554. It shows the candidates for each character, but it doesn't show the bounding box of each character. It only shows the box for a whole word.

I see bounding boxes for each character in the comments of pull request 2576. How can I do that? Do I have to look into the source code and produce such an output on my own?

On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:

Lorenzo,

I haven't been checking psm too much. I will turn to those options after I see how it goes with bounding boxes.

Shree,

I see the merges in the git log and also see that the new option lstm_choice_amount works now. I guess my executable is the latest, though I still see the phantom character.
Hocr produces huge and complex output. It will take me some time to read it.

On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:

Is there any way to pass bounding boxes to the LSTM? We have an algorithm that cleanly gets bounding boxes of MRZ characters. However, the results using psm 10 are worse than passing the whole line in. Yet when we pass the whole line in, we get these phantom characters.

Should PSM 10 mode work? It often returns "no character" where there clearly is one. I can supply a test case if it is expected to work well.

On Fri, Jul 19, 2019 at 11:06 AM, ElGato ElMago <elmago...@gmail.com> wrote:

Lorenzo,

We both have the same case. It seems a solution to this problem would help a lot of people.

Shree,

I pulled the current head of the master branch, but it doesn't seem to contain the merges you pointed to that were merged 3 to 4 days ago. How can I get them?

ElMagoElGato

On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:

PSM 7 was a partial solution for my specific case; it improved the situation but did not solve it. Also, I could not use it in some other cases.
The proper solution is very likely doing more training with more data; some data augmentation would probably help if data is scarce. Also, doing less training might help if the training is not being done correctly.

There are also similar issues on GitHub:

https://github.com/tesseract-ocr/tesseract/issues/1465
...

The LSTM engine works like this: it scans the image and for each "pixel column" does this:

M M M M N M M M [BLANK] F F F F

(here I report only the highest-probability characters)

In the example above an M is partially seen as an N. This is normal, and another step of the algorithm (beam search, I think) tries to aggregate the correct characters back together.

I think cases like this:

M M M N N N M M

are what give the phantom characters. More training should reduce the source of the problem, or a painful analysis of the bounding boxes might fix some cases.

I used the attached script for the boxes.

Lorenzo

On Friday, July 19, 2019 at 07:25, ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

Let's call them phantom characters then.
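[Editor's note] Lorenzo's per-column picture corresponds to CTC-style decoding: consecutive identical labels are merged and blanks are dropped. A toy version of the greedy collapse rule (not Tesseract's actual beam search) makes the phantom mechanism concrete:

```python
BLANK = None  # stands for the [BLANK] label in the example above

def ctc_collapse(columns):
    """Greedy CTC decoding: merge consecutive identical labels, drop blanks."""
    out = []
    prev = object()  # sentinel that compares unequal to every label
    for label in columns:
        if label != prev and label is not BLANK:
            out.append(label)
        prev = label
    return "".join(out)

print(ctc_collapse(list("MMMMNMMM") + [BLANK] + list("FFFF")))  # MNMF
print(ctc_collapse(list("MMMNNNMM")))                           # MNM
```

Under this naive rule even the brief N in the first pattern survives as a phantom; the real beam search sees the full per-column probability distributions and can usually heal that case, but a sustained run like M M M N N N M M decodes to MNM either way, which matches Lorenzo's diagnosis.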
Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I do see different output.

I use tesseract 5.0-alpha mostly, but 4.1 showed the same results anyway. How did you get a bounding box for each character? Alto and lstmbox only show a bbox for a group of characters.

ElMagoElGato

On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:

Phantom characters here for me too:

https://github.com/tesseract-ocr/tesseract/issues/1778

Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this was also improved.

I wrote some code that uses the symbol iterator to discard symbols that are clearly duplicated: too small, overlapping, etc. But it was not easy to make it work decently, and it is not 100% reliable, with both false negatives and false positives. I cannot share the code, and it is quite ugly anyway.
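[Editor's note] Lorenzo cannot share his filter, but the idea he describes (discard symbols whose boxes are too small or mostly overlapped by a neighbor's) can be sketched. The thresholds and function names here are invented for illustration and would need tuning against real boxes:

```python
def overlap_ratio(a, b):
    """Fraction of box a's width covered by box b; boxes are (x0, y0, x1, y1)."""
    left, right = max(a[0], b[0]), min(a[2], b[2])
    width = a[2] - a[0]
    return max(0, right - left) / width if width else 1.0

def filter_phantoms(symbols, min_width=5, max_overlap=0.8):
    """Discard symbols whose box is implausibly thin or almost entirely
    covered by another symbol's box. symbols: list of (char, box) pairs."""
    kept = []
    for i, (ch, box) in enumerate(symbols):
        if box[2] - box[0] < min_width:
            continue  # too thin: likely a phantom
        others = [b for j, (_, b) in enumerate(symbols) if j != i]
        if any(overlap_ratio(box, b) > max_overlap for b in others):
            continue  # nearly contained in a neighbor's box
        kept.append((ch, box))
    return kept

# A duplicated-glyph case: the 'O' sits almost entirely inside the '0' box.
sample = [("0", (10, 0, 20, 30)), ("O", (11, 0, 19, 30)), ("4", (25, 0, 35, 30))]
print(filter_phantoms(sample))  # the overlapping 'O' is dropped
```

The same width test also catches the 3-pixel-wide phantom '<' from the hocr sample earlier in the thread; as Lorenzo warns, any fixed thresholds will produce some false positives and negatives.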
Here is another MRZ model with training data:

https://github.com/DoubangoTelecom/tesseractMRZ

Lorenzo

On Wednesday, July 17, 2019 at 11:26, Claudiu <csaf...@gmail.com> wrote:

I'm getting the "phantom character" issue as well, using the OCRB model that Shree trained on MRZ lines. For example, for a 0 it will sometimes add both a 0 and an O to the output, thus outputting 45 characters total instead of 44. I haven't looked at the bounding box output yet, but I suspect a phantom thin character is added somewhere that I can discard, or maybe two chars will have the same bounding box. If anyone else has fixed this issue further up (e.g. so the output doesn't contain the phantom characters in the first place), I'd be interested.

On Wed, Jul 17, 2019 at 10:01 AM, ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

I'll go back to more training later. Before doing so, I'd like to investigate the results a little bit. The hocr and lstmbox options give some details of the positions of characters. The results show positions that perfectly correspond to the letters in the image.
But the text output contains a character that obviously does not exist.

Then I found a config file, 'lstmdebug', that generates far more information. I hope it explains what happened with each character. I have yet to read the debug output, but I'd appreciate it if someone could tell me how to read it, because it's really complex.

Regards,
ElMagoElGato

On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_MICR

I have uploaded my files there.

https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash script that runs the training.

You can modify it as needed. Please note this is for legacy/base tesseract --oem 0.

On Fri, Jun 14, 2019 at 1:26 PM, ElGato ElMago <elmago...@gmail.com> wrote:

Thanks a lot, shree. It seems you know everything.

I tried the MICR0.traineddata and the first two mcr.traineddata. The last one was blocked by the browser. Each of the traineddata files had mixed results. All of them get the symbols fairly well, but they insert spaces randomly and read some numbers wrong.
MICR0 seems the best among them. Did you suggest that you'd be able to update it? It reads triple D very often where there's only one, and so on.

Also, I tried to fine-tune from MICR0, but I found that I need to change language-specific.sh. It specifies some parameters for each language. Do you have any guidance for it?

On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:

see
http://www.devscope.net/Content/ocrchecks.aspx
https://github.com/BigPino67/Tesseract-MICR-OCR
https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ

On Mon, Jun 10, 2019 at 11:21 AM, ElGato ElMago <elmago...@gmail.com> wrote:

It would be nice if there were traineddata out there, but I didn't find any. I see free fonts and commercial OCR software, but no traineddata. The tessdata repository obviously doesn't have one, either.

On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:

Please also search for existing MICR traineddata files.
On Thu, Jun 6, 2019 at 1:09 PM, ElGato ElMago <elmago...@gmail.com> wrote:

So I did several tests from scratch. In the last attempt, I made a training text with 4,000 lines in the following format:

110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;

and combined it with eng.digits.training_text, in which the symbols are converted to E13B symbols. This makes about 12,000 lines of training text. It's amazing that this thing generates a good reader out of nowhere. But then, it is not very good. For example:

<01 :1901=1386:021= 1111001<10001< ;0000090134;

is the result on the attached image. It's close, but the last '<' in the result text doesn't exist in the image. It's a small failure, but it causes greater trouble in parsing.

What would you suggest from here to increase accuracy?
- Increase the number of lines in the training text
- Mix more variations into the training text
- Increase the number of iterations
- Investigate wrong reads one by one
- Or something else?

Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method. It seems a bit faster to get to the same level, but it also stops at a merely 'good' level. I can go either way if it takes me to the bright future.

Regards,
ElMagoElGato

On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:

Thanks a lot, Shree. I'll look into it.
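[Editor's note] The "mix more variations" option can be mechanized. A sketch that emits synthetic lines shaped like the 4,000-line sample quoted earlier; the field widths and ordering are copied from that one example, not from any MICR specification, and micr_line is a name invented here:

```python
import random

def micr_line(rng):
    """Compose one synthetic line shaped like the sample above:
    '110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;'."""
    d = lambda n: "".join(rng.choice("0123456789") for _ in range(n))
    return (f"{d(12)}< <{d(2)} :{d(4)}={d(4)}:{d(3)}= "
            f"{d(7)} <{d(5)} ;{d(10)};")

rng = random.Random(0)  # fixed seed so the corpus is reproducible
training_lines = [micr_line(rng) for _ in range(4000)]
print(training_lines[0])
```

Randomizing the field widths and field order as well, rather than only the digits, would be one concrete way to add the variation discussed above.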
On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_shreetest

Look at the files engrestrict*.* and also
https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text

Create a training text of about 100 lines and fine-tune for 400 lines.

On Thu, May 30, 2019 at 9:38 AM, ElGato ElMago <elmago...@gmail.com> wrote:

I had about 14 lines, as attached. How many lines would you recommend?

Fine-tuning gives a much better result, but it tends to pick characters outside E13B, which has only 14 characters: 0 through 9 and 4 symbols. I thought training from scratch would eliminate such confusion.

On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:

For training from scratch, a large training text and hundreds of thousands of iterations are recommended.
If you are just fine-tuning for a font, try to follow the instructions for training for impact, with your font.

On Thu, 30 May 2019, 06:05, ElGato ElMago <elmago...@gmail.com> wrote:

Thanks, Shree.

Yes, I saw the instructions. The steps I took are as follows.

Using tesstrain.sh:

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
  --training_text ../langdata/eng/eng.training_e13b_text

Training from scratch:

mkdir -p ~/tesstutorial/e13boutput
src/training/lstmtraining --debug_interval 100 \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output \
~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
  --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log

Test with base_checkpoint:

src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt

Combining output files:

src/training/lstmtraining --stop_training \
  --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --model_output ~/tesstutorial/e13boutput/eng.traineddata

Test with eng.traineddata:

tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput

The training from scratch ended as:
At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWo3%3DyZ4LOy9cRiDk-VWVWWaDA35-t6T94GdHEgY3RAHw%40mail.gmail.com.