Re: [tesseract-ocr] Editing Box files

2019-04-28 Thread Shree Devi Kumar
I assumed that you used text2image to generate the box/tiff pairs using a font for your `language`. On Mon, Apr 29, 2019 at 12:14 PM Shree Devi Kumar wrote: > It means that the font you are using has mapped English letters to these > symbols. If you view the box file in that same fo

Re: [tesseract-ocr] Editing Box files

2019-04-29 Thread Shree Devi Kumar
Tesseract generates Unicode output after recognizing. Are there any Unicode code points for the symbols that you have used? How do you type out those symbols? On Mon, 29 Apr 2019, 13:42 anne, wrote: > I used this line > > tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] > batch.noch

Re: [tesseract-ocr] Simple image FAIL fails

2019-04-29 Thread Shree Devi Kumar
ubuntu@tesseract-ocr:~/TEST$ tesseract fixed.png - --psm 6 --dpi 300 --tessdata-dir ~/tessdata_fast PASS wee ubuntu@tesseract-ocr:~/TEST$ tesseract fixed.png - --psm 6 --dpi 300 --tessdata-dir ~/tessdata_best PASS AYE ubuntu@tesseract-ocr:~/TEST$ tesseract fixed.png - --psm 6 --dpi 300 --tessdata-d

Re: [tesseract-ocr] Editing Box files

2019-04-30 Thread Shree Devi Kumar
I found a couple of Unicode fonts that can display the Tagalog range: "Quivira" and "Noto Sans Tagalog". Using these it will be possible to train for Baybayin. Does the language use any punctuation and numbers? On Tue, Apr 30, 2019 at 11:39 AM anne wrote: > I found the unicode for Baybayi

Re: [tesseract-ocr] Editing Box files

2019-04-30 Thread Shree Devi Kumar
Check out https://github.com/Shreeshrii/tessdata_tagalog/tree/master/tglglegacy for box/tiff pairs. On Tue, Apr 30, 2019 at 5:42 PM Shree Devi Kumar wrote: > I found couple of unicode fonts that can display the tagalog range - > "Quivira" \ > "Noto Sans Tagalog"

Re: [tesseract-ocr] Fine tuning existing model

2019-05-03 Thread Shree Devi Kumar
>There are three model sizes: best, normal and fast. Each of these can also be converted to an integer model. Only `best` can be converted to integer and in fact the LSTM models in `tessdata` are the integer versions of best along with the base/legacy models. `fast` models have been trained with

Re: [tesseract-ocr] After fine tunning training, how do i run on the new model?

2019-05-04 Thread Shree Devi Kumar
Depends on where you keep the new traineddata file. If you copy it to the path specified by your TESSDATA_PREFIX you can use it with `-l ds_10k`. If it is in a different location, you will need to specify that, `-l ds_10k --tessdata-dir /prefix/path/to/your/file/`. On Sat, May 4, 2019 at 12:26
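The two cases above can be sketched as a small bash helper. This is only an illustration: `tess_model_args` is a hypothetical function name, not part of tesseract, and it assumes TESSDATA_PREFIX points at the directory that holds the .traineddata files.

```shell
#!/usr/bin/env bash
# Print the tesseract options needed to load <lang>.traineddata.
# tess_model_args is a hypothetical helper, not a tesseract command.
# $1 = model name (e.g. ds_10k), $2 = optional directory holding the file.
tess_model_args() {
  local lang="$1" dir="${2:-}"
  if [ -n "${TESSDATA_PREFIX:-}" ] && [ -f "${TESSDATA_PREFIX%/}/${lang}.traineddata" ]; then
    # Model is already where tesseract looks by default.
    printf '%s\n' "-l ${lang}"
  elif [ -n "$dir" ] && [ -f "${dir%/}/${lang}.traineddata" ]; then
    # Model lives elsewhere, so the directory has to be named explicitly.
    printf '%s\n' "-l ${lang} --tessdata-dir ${dir%/}"
  else
    echo "no ${lang}.traineddata found" >&2
    return 1
  fi
}
```

It could be used as `tesseract image.png out $(tess_model_args ds_10k /prefix/path/to/your/file/)`, where image.png and out are placeholders.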

Re: [tesseract-ocr] After fine tunning training, how do i run on the new model?

2019-05-04 Thread Shree Devi Kumar
t make sense. will it work if I put this new model inside the folder of > Tessdata that is situated in the program files folder? > > On Sat, May 4, 2019 at 2:44 PM Shree Devi Kumar > wrote: > >> Depends on where you keep the new traineddata file. >> >> If yo

Re: [tesseract-ocr] How to increase tesseract model accuracy

2019-05-05 Thread Shree Devi Kumar
S) for about 17 times in* *eng.training_text (attached)* > > On Sunday, May 5, 2019 at 3:17:55 PM UTC+2, shree wrote: >> >> Share an image for testing. >> >> How did you try to finetune? >> >> >> On Sunday, May 5, 2019 at 5:40:39 PM UTC+5:30, fady taher

Re: [tesseract-ocr] How to increase tesseract model accuracy

2019-05-05 Thread Shree Devi Kumar
Try with --max_iterations 400. On Sun, May 5, 2019 at 7:33 PM fady taher wrote: > *I used option --fontlist "Calibri" and --max_iterations 3600* > > > On Sunday, May 5, 2019 at 4:02:05 PM UTC+2, shree wrote: >> >> Which font did you use? Hopefully it w

Re: [tesseract-ocr] How to increase tesseract model accuracy

2019-05-05 Thread Shree Devi Kumar
nday, May 5, 2019 at 4:02:05 PM UTC+2, shree wrote: >> >> Which font did you use? Hopefully it was similar to your image. How many >> iterations? >> >> On Sun, May 5, 2019 at 6:58 PM fady taher wrote: >> >>> *I followed the instructions* >>> ht

Re: [tesseract-ocr] Multiple jpg files into 1 editable pdf

2019-05-08 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-process-multiple-images-in-a-single-run On Thu, May 9, 2019 at 9:34 AM Brian Lallo wrote: > Thanks for reading. > > I will be taking 200 pages and scanning them into a folder on my computer. > > I need to take those 200 pages (that
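A sketch of the list-file approach from that FAQ entry, assuming bash, that the scans are .jpg files in one folder, and that the file names sort in page order (for example zero-padded page numbers). `make_image_list` and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Write one image path per line, sorted by file name, into a list file.
# make_image_list is a hypothetical helper; $1 = scan folder, $2 = list file.
make_image_list() {
  local src_dir="$1" list_file="$2"
  find "$src_dir" -maxdepth 1 -type f -name '*.jpg' | LC_ALL=C sort > "$list_file"
}
```

After `make_image_list ./scans pages.txt`, a command such as `tesseract pages.txt output pdf` should write a single output.pdf covering every listed image.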

Re: [tesseract-ocr] Tesseract Multipage tiff to multipage pdf

2019-05-15 Thread Shree Devi Kumar
tesseract In\SPTest.tif Out\Test --psm 3 -l rus+eng pdf This should be enough to create a multi page pdf from a multi page tiff. On Wed, May 15, 2019 at 7:27 PM András Jeszenkovits wrote: > Here: tesseract In\SPTest.tif Out\Test --psm 3 -l rus+eng *-c > tessedit_page_number=-1* pdf > > 2019. m

Re: [tesseract-ocr] Tesseract Multipage tiff to multipage pdf

2019-05-16 Thread Shree Devi Kumar
What is your version of tesseract? Which O/S? Have you tried it with just one language? On Thu, May 16, 2019 at 1:32 PM András Jeszenkovits wrote: > I thought that too, but the Tesseract create a one page pdf > > 2019. május 15., szerda 17:29:36 UTC+2 időpontban shree a követk

Re: [tesseract-ocr] Tesseract Multipage tiff to multipage pdf

2019-05-16 Thread Shree Devi Kumar
ied > english, russian, hungarian. I tried 32bit/64bit version, i tried a jpg > file too, same result (1 page pdf) > > 2019. május 16., csütörtök 10:31:33 UTC+2 időpontban shree a következőt > írta: >> >> What is your version of tesseract? Which O/S? >> >>

Re: [tesseract-ocr] Re: Training with a large number of LSTMF files

2019-05-18 Thread Shree Devi Kumar
No, I have not done handwriting training. Others who have tried can share if they had success. On Sat, 18 May 2019, 22:59 vikram sareen, wrote: > hi shree, > did you manage to crack this... > we are also trying to get handwritten working for english but no luck. > truly appreciate y

Re: [tesseract-ocr] unicharset_extractor error

2019-05-20 Thread Shree Devi Kumar
You need to make sure that you build/install tesseract as well as training_tools, otherwise they may get out of sync. How are you reinstalling it? On Mon, May 20, 2019 at 4:50 PM anne wrote: > Hi, I got this error while running the unicharset_extractor command > "ERROR: shared library version m

Re: [tesseract-ocr] unicharset_extractor error

2019-05-20 Thread Shree Devi Kumar
What is your Ubuntu version? Are you using the ppa for installing? On Mon, 20 May 2019, 18:02 anne, wrote: > i did "sudo apt-get install tesseract-ocr" > > -- > You received this message because you are subscribed to the Google Groups > "tesseract-ocr" group. > To unsubscribe from this group an

Re: [tesseract-ocr] unicharset_extractor error

2019-05-20 Thread Shree Devi Kumar
Also check whether you have two versions of the program installed. On Mon, 20 May 2019, 19:14 Shree Devi Kumar, wrote: > What is your Ubuntu version? > > Are you using the ppa for installing? > > On Mon, 20 May 2019, 18:02 anne, wrote: > >> i did "sudo apt-get install t
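One way to look for a second install is to list every executable of the same name on PATH. The helper below is a sketch only, assuming bash; `list_installs` is a hypothetical name. The bash builtin `type -a tesseract` gives similar information.

```shell
#!/usr/bin/env bash
# Print every executable called $1 found in a colon-separated directory list.
# list_installs is a hypothetical helper; $2 defaults to the current PATH.
list_installs() {
  local name="$1" path_list="${2:-$PATH}" dir
  local IFS=':'
  for dir in $path_list; do
    [ -n "$dir" ] || continue            # skip empty PATH entries
    if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
    fi
  done
}
```

More than one line of output from `list_installs tesseract` suggests that two copies are installed and that the first one listed is the one being run.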

Re: [tesseract-ocr] Facing some problem in understanding fine tuning

2019-05-22 Thread Shree Devi Kumar
> I want to add only one character in the already existing ben.traindata model. What character do you want to add? You should be able to do the same process as the plus-minus training for one character as shown in example for English. On Wed, May 22, 2019 at 1:51 PM Jennil Thiyam wrote: > I am

Re: [tesseract-ocr] Facing some problem in understanding fine tuning

2019-05-22 Thread Shree Devi Kumar
r in the ben_training.txt >> like they did in plus-minus training >> >> On Wed, May 22, 2019 at 5:24 PM Shree Devi Kumar >> wrote: >> >>> > I want to add only one character in the already existing >>> ben.traindata model. >>> >>>

Re: [tesseract-ocr] unicharset_extractor error

2019-05-23 Thread Shree Devi Kumar
What's the output of:
which tesseract
which text2image
which unicharset_extractor
tesseract -v
text2image -v
unicharset_extractor -v
On Thu, May 23, 2019 at 3:45 PM anne wrote: > I'm using ubuntu version 18.04 > and if I check for tesseract's version this is what I get > tesseract 4.1.0-rc1-170
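The same checks can be run in one go. A minimal sketch, assuming bash; `check_training_tools` is a hypothetical name, and the `-v` calls are the ones suggested in the message.

```shell
#!/usr/bin/env bash
# Report where each tool resolves on PATH and the first line of its -v output.
# check_training_tools is a hypothetical helper.
check_training_tools() {
  local tool path
  for tool in tesseract text2image unicharset_extractor; do
    if path=$(command -v "$tool" 2>/dev/null); then
      printf '%s: %s\n' "$tool" "$path"
      # Version banner; may go to stdout or stderr depending on the build.
      "$tool" -v 2>&1 | head -n 1 | sed 's/^/  /' || true
    else
      printf '%s: not found on PATH\n' "$tool"
    fi
  done
}
```

If the three tools report different versions, or resolve to different install prefixes, that would be consistent with the library mismatch discussed in this thread.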

Re: [tesseract-ocr] unicharset_extractor error

2019-05-23 Thread Shree Devi Kumar
sudo apt-get purge --auto-remove tesseract-ocr

Also, in your tesseract directory:
make clean
make uninstall

Then reinstall again by:
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt-get install -y \
  libleptonica-dev \
  libtesseract4 \
  libtesseract-dev \
  tesse

Re: [tesseract-ocr] unicharset_extractor error

2019-05-23 Thread Shree Devi Kumar
You would have had tesseract directory in case you had built it from source. >I still have the same error are you still getting same version numbers? On Thu, May 23, 2019 at 6:50 PM anne wrote: > I'm sorry if this may sound dumb but uh, where exactly is the tesseract > directory located? > > -

Re: [tesseract-ocr] unicharset_extractor error

2019-05-23 Thread Shree Devi Kumar
sudo apt-get purge --auto-remove libtesseract4
sudo apt-get purge --auto-remove libtesseract-dev
sudo apt-get purge --auto-remove tesseract-ocr
Share the console output log from the above. On Thu, May 23, 2019 at 9:20 PM Shree Devi Kumar wrote: > You would have had tesseract direct

Re: [tesseract-ocr] unicharset_extractor error

2019-05-24 Thread Shree Devi Kumar
Please run the following commands again, just to check. What's the output of:
which tesseract
which text2image
which unicharset_extractor
tesseract -v
text2image -v
unicharset_extractor -v
On Fri, May 24, 2019 at 1:52 PM anne wrote: > *sudo apt-get purge --autore-remove libtesseract4* > > Reading p

Re: [tesseract-ocr] unicharset_extractor error

2019-05-24 Thread Shree Devi Kumar
After that, run the following and post the console output:
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt-get install -y \
  libleptonica-dev \
  libtesseract4 \
  libtesseract-dev \
  tesseract-ocr
On Fri, May 24, 2019 at 2:37 PM Shree Devi Kumar wrote

Re: [tesseract-ocr] unicharset_extractor error

2019-05-24 Thread Shree Devi Kumar
r-osd (1:4.00~git30-7274cfa-1ppa1~bionic1) ... Processing triggers for libc-bin (2.27-3ubuntu1) ... On Fri, May 24, 2019 at 2:38 PM Shree Devi Kumar wrote: > After that, run the following and post the console output > > sudo add-apt-repository ppa:alex-p/tesseract-ocr > sudo apt-get

Re: [tesseract-ocr] unicharset_extractor error

2019-05-24 Thread Shree Devi Kumar
which tesseract: /snap/bin/tesseract
which text2image: no output
which unicharset_extractor: no output
This shows that you have a version of tesseract installed in /snap/bin/tesseract. This needs to be removed. On Fri, May 24, 2019 at 3:06 PM anne wrote: > To be honest, I am very confused

Re: [tesseract-ocr] unicharset_extractor error

2019-05-24 Thread Shree Devi Kumar
I am at the limit of my Linux knowledge now :-( Someone else will need to help you fix the library mismatch. @zdenop @amitdo @stweil .. On Fri, May 24, 2019 at 3:35 PM anne wrote: > I removed it, checked *which* commands to which no outputs are shown. > Checked versions of tesseract, tex

Re: [tesseract-ocr] tesseract dont read text from image

2019-05-24 Thread Shree Devi Kumar
tesseract Player5.png - TO10EH54 On Fri, May 24, 2019 at 7:14 PM Тимур Михайлов wrote: > hello guys i use last version tesseract on c# > i cant read text from this image, i try some settings > and PageSegmentationMode but result is ""(empty) > I think it is necessary to configure that would rea

Re: [tesseract-ocr] Re: tesseract dont read text from image

2019-05-25 Thread Shree Devi Kumar
I used tesseract from command line with traineddata from tessdata_best. On Sat, May 25, 2019 at 5:07 PM Тимур Михайлов wrote: > what are you do for this result?)

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-27 Thread Shree Devi Kumar
Has /ben_extract/ben.lstm been extracted from /usr/share/tesseract-ocr/4.00/tessdata/ben.traineddata ? On Mon, May 27, 2019 at 2:55 PM Jennil Thiyam wrote: > I got error whie trying to perform fine tuning, the command i used is > below: > > lstmtraining --model_output /model \ > --continue_fr

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-27 Thread Shree Devi Kumar
Is /usr/share/tesseract-ocr/4.00/tessdata/ben.traineddata from tessdata_best repo? Only those models can be used for finetuning. On Mon, May 27, 2019 at 4:25 PM Jennil Thiyam wrote: > yes...i extracted with the command combine_tessdata > > On Mon 27 May, 2019, 4:23 PM Shree Devi Kuma

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-27 Thread Shree Devi Kumar
om git repository,will this thing work?? > > On Mon 27 May, 2019, 5:43 PM Shree Devi Kumar >> Is /usr/share/tesseract-ocr/4.00/tessdata/ben.traineddata from >> tessdata_best repo? Only those models can be used for finetuning. >> >> On Mon, May 27, 2019 at 4:25 PM Jen

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-27 Thread Shree Devi Kumar
have any idea about the estimated time it will take for 1500 > iterations? > > Thank you > > On Mon, May 27, 2019 at 10:20 PM Shree Devi Kumar > wrote: > >> You can download ben.traineddata from tessdata_best in a different >> location and use that as part of ls

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-28 Thread Shree Devi Kumar
l/araeval > > can anyone tell me why do we need to create this eval data, i meant it is > also going to same as training data. > > > On Tue, May 28, 2019 at 10:46 AM Jennil Thiyam > wrote: > >> okay, thank you >> >> On Tue, May 28, 2019 at 10:30 AM Shree

Re: [tesseract-ocr] Trained data for E13B font

2019-05-29 Thread Shree Devi Kumar
see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files On Wed, May 29, 2019 at 3:18 PM ElGato ElMago wrote: > Hi, > > I wish to make a trained data for E13B font. > > I read the training tutorial and made a base_checkpoint file according to > the me

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-29 Thread Shree Devi Kumar
Check that the training text you used is normalized correctly, also check the Bengali normalization/validation rules https://github.com/tesseract-ocr/tesseract/issues/1038

Re: [tesseract-ocr] Trained data for E13B font

2019-05-29 Thread Shree Devi Kumar
For training from scratch, a large training text and hundreds of thousands of iterations are recommended. If you are just fine-tuning for a font, try to follow the instructions for fine-tuning for Impact, using your font instead. On Thu, 30 May 2019, 06:05 ElGato ElMago, wrote: > Thanks, Shree. > > Y

Re: [tesseract-ocr] Trained data for E13B font

2019-05-29 Thread Shree Devi Kumar
such confusion. > > 2019年5月30日木曜日 10時43分08秒 UTC+9 shree: >> >> For training from scratch a large training text and hundreds of thousands >> of iterations are recommended. >> >> If you are just fine tuning for a font try to follow instructions for >> training

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110 Illegal instruction (core dumped)

2019-05-30 Thread Shree Devi Kumar
~/tesstitorial/train_wa/ben/ben.traineddata \ > > --old_traineddata tessdata/best/ben.traineddata \ > > --train_listfile ~/tesstutorial/train_wa/ben.training_files.txt \ > > --max_iterations 3600 > > > As you, shree Devi suggested, i download ben.traindata from the source an

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110 Illegal instruction (core dumped)

2019-05-30 Thread Shree Devi Kumar
You have to convert the checkpoint to traineddata - run lstmtraining with --stop_training flag On Thu, May 30, 2019 at 3:44 PM Jennil Thiyam wrote: > thanks shree, it was my silly mistake, but > > lstmtraining --model_output ~/tesstutorial/train_wa/wa > --continue_from ~/tesstutor

Re: [tesseract-ocr] MRZ/MRP (Machine-readable zone/passport) dataset for tesseract v4

2019-05-30 Thread Shree Devi Kumar
Thanks. Added links in https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-Contributions On Mon, May 27, 2019 at 11:08 AM Mamadou wrote: > Hello, > > We have open sourced (BSD license) MRZ/MRP (Machine-readable > zone/passport) dataset and models for Tesseract v4. > The dataset contains

Re: [tesseract-ocr] The extra character is not recognized after fine tuning training

2019-05-31 Thread Shree Devi Kumar
What is the new character you want to add? On Fri, May 31, 2019 at 3:22 PM Jennil Thiyam wrote: > I have followed the procedure (that is described in training tesseract 4 > for fine tuning for putting plus-minus sign in eng.traineddata) to train > ben.traineddata (by adding one character which

Re: [tesseract-ocr] The extra character is not recognized after fine tuning training

2019-05-31 Thread Shree Devi Kumar
Is your new character included in https://github.com/tesseract-ocr/langdata_lstm/blob/master/ben/ben.unicharset On Fri, May 31, 2019 at 3:22 PM Jennil Thiyam wrote: > I have followed the procedure (that is described in training tesseract 4 > for fine tuning for putting plus-minus sign in eng.t

Re: [tesseract-ocr] The extra character is not recognized after fine tuning training

2019-05-31 Thread Shree Devi Kumar
or in https://github.com/tesseract-ocr/langdata_lstm/blob/master/asm/asm.unicharset On Fri, May 31, 2019 at 3:45 PM Shree Devi Kumar wrote: > Is your new character included in > > > https://github.com/tesseract-ocr/langdata_lstm/blob/master/ben/ben.unicharset > > > On Fri,

Re: [tesseract-ocr] The extra character is not recognized after fine tuning training

2019-05-31 Thread Shree Devi Kumar
after adding this new character in ben.training_text) > The character is in line no.35(in wa.png) and 79(in wa_11.png) > > Please help me out > > On Fri, May 31, 2019 at 3:47 PM Shree Devi Kumar > wrote: > >> or in >> >> https://github.com/tesseract-ocr/langdat

Re: [tesseract-ocr] The extra character is not recognized after fine tuning training

2019-05-31 Thread Shree Devi Kumar
script/Bengali.traineddata is another option On Fri, 31 May 2019, 16:58 Shree Devi Kumar, wrote: > Please try the asm.traineddata which is for Assamese which is written in > Bengali script. > > On Fri, 31 May 2019, 16:55 Jennil Thiyam, wrote: > >> How come this character

Re: [tesseract-ocr] Tesseract

2019-05-31 Thread Shree Devi Kumar
For handwriting you can also look at Transkribus. On Fri, 31 May 2019, 17:20 Naga Raju Kusa, wrote: > Hii amulya, > I have tried it with the pytesseract library but it's working only for > text detection on a pdf and it's not working for handwritten scripts.. > Can you refer me any open source ap

Re: [tesseract-ocr] How to use trained data

2019-05-31 Thread Shree Devi Kumar
It depends on the o/s you are using. Look at the tesseract wiki home page. On Fri, 31 May 2019, 19:06 Mirror, wrote: > I have a bunch of files of trained data (gotten from tessdata_best > ), can someone provide me > a brief help in how to use them?

Re: [tesseract-ocr] have the width, height, of each character of an image pdf file

2019-05-31 Thread Shree Devi Kumar
I think the hocr output has an option to output bounding info per character also. On Fri, 31 May 2019, 19:07 G. S., wrote: > Dear all, > > i have a pdf image file, (in Greek language) > > i would appreciate if you could help me on how i could > > a) have an output similar to what pdf alto does,

Re: [tesseract-ocr] The extra character is not recognized after fine tuning training

2019-05-31 Thread Shree Devi Kumar
stead. Regarding normalization you should look at the text to make sure that it is ok. I don't know the script but my guess is that the vowel maatraa that go on both sides of consonants may have been encoded as separate rather than one. On Fri, 31 May 2019, 22:40 Jennil Thiyam, wrote: &g

Re: [tesseract-ocr] The extra character is not recognized after fine tuning training

2019-05-31 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tessdata_best/tree/master/script On Fri, 31 May 2019, 23:01 Jennil Thiyam, wrote: > What is this script/bengali traineddata??? > Is it not the ben,traineddata? > > On Fri, May 31, 2019 at 10:55 PM Shree Devi Kumar > wrote: > >> Di

Re: [tesseract-ocr] have the width, height, of each character of an image pdf file

2019-06-01 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/commit/06b7a7b188b2ed21a101cd179b4dd3cfc13aaf30 On Fri, May 31, 2019 at 9:00 PM Shree Devi Kumar wrote: > I think the hocr output has an option to output bounding info per > character also. > > On Fri, 31 May 2019, 19:07 G. S., wrot

Re: [tesseract-ocr] Japanese - Problems with vertical words

2019-06-03 Thread Shree Devi Kumar
tesseract 4 has been trained on line images and hence gives better results for lines, as far as I have seen. On Sun, Jun 2, 2019 at 2:52 PM Jorge Castrillo wrote: > Hi everyone. I'm making a program on that uses tesseract to get a word > from a manga with a snipping-tool like program, and transl

Re: [tesseract-ocr] ben.traineddata & Bengali.traineddata

2019-06-04 Thread Shree Devi Kumar
`ben` is trained on Bengali; `Bengali` (the script model) is trained with ben, asm and English. See https://github.com/tesseract-ocr/langdata_lstm/blob/master/script/Bengali.langs.txt On Tue, 4 Jun 2019, 17:11 Jennil Thiyam, wrote: > What is the difference between ben.traineddata and Bengali.traineddata, > some character are not reco

Re: [tesseract-ocr] ben.traineddata & Bengali.traineddata

2019-06-04 Thread Shree Devi Kumar
, wrote: > Shree what is the segmentation algorithm used in this bengali ocr, i think > the segmentation algorithm for english characters and bengali character has > to be different. Is it the BB Chaudhury's segmentation algorithm used? > > On Tue, Jun 4, 2019 at 5:41 PM Shree

Re: [tesseract-ocr] ben.traineddata & Bengali.traineddata

2019-06-04 Thread Shree Devi Kumar
at Github (sorry for typo in earlier msg.. autocorrect :-( ) On Wed, 5 Jun 2019, 12:05 Shree Devi Kumar, wrote: > You can extract the files from traineddata with combine_tessdata -u > > Look at the ben.config file for any special layout config in it. > > The LSTM training was do

Re: [tesseract-ocr] error when make training

2019-06-05 Thread Shree Devi Kumar
If training tools are made correctly, you should have all those programs. At least that's how it is on Linux and Windows. On Wed, Jun 5, 2019 at 6:40 PM Jingjing Lin wrote: > Actually I found all the following are not there. Am I missing something? > text2image > unicharset_extractor > set_unich

Re: [tesseract-ocr] error when make training

2019-06-05 Thread Shree Devi Kumar
You are probably missing the last step:
sudo make training-install

Usual build and install instructions:
git clone https://github.com/tesseract-ocr/tesseract/
cd tesseract
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
make training
sudo make training-install
On Wed, Jun 5, 2019

Re: [tesseract-ocr] Scripts are almost same but different language

2019-06-06 Thread Shree Devi Kumar
You need to fine-tune using your language data. These models have been trained on 5 + lines of text. You need to create normalized text for your language, then use good unicode fonts to render them. You can replace top layer in script/Bengali. I did something similar for sanskrit using script/

Re: [tesseract-ocr] Scripts are almost same but different language

2019-06-06 Thread Shree Devi Kumar
Clarifying the above msg: 10-20 samples for each syllable. Total number of lines was in thousands. On Thu, 6 Jun 2019, 16:05 Shree Devi Kumar, wrote: > You need to fine-tune using your language data. These models have been > trained on 5 + lines of text. You need to create normalized te

Re: [tesseract-ocr] Trained data for E13B font

2019-06-07 Thread Shree Devi Kumar
- Or else? > > Also, I referred to engrestrict*.* and could generate similar result with > the fine-tuning-from-full method. It seems a bit faster to get to the same > level but it also stops at a 'good' level. I can go with either way if it > takes me to the bright future. &g

Re: [tesseract-ocr] Changes in Tesseract 4.0 to 4.1 causing loss in precision

2019-06-11 Thread Shree Devi Kumar
Using the latest code from master branch, --oem 1 and --psm 6 I get the following results using the different traineddata files: tessdata -- | =~ 7.2. | BK Medical 1.01<1.50 TIS: 1.2<2.0 _ Res /Hz 1/7 Hz ~ aw > General nz - ¥ Povier > Gan 52 %, PRF 0.5 kHz| “oo, TENN 0.7. { 7.2. oe ¥ [Heas

Re: [tesseract-ocr] Trained data for E13B font

2019-06-12 Thread Shree Devi Kumar
You will get output of A B C D for the MICR symbols. If it works well otherwise, I will update it to generate the Unicode text for the symbols. Trained using font "MICR Encoding" On Wed, Jun 12, 2019 at 9:53 PM Shree Devi Kumar wrote: > Please test the attached file. It is trai

Re: [tesseract-ocr] Re: could not find fonts

2019-06-13 Thread Shree Devi Kumar
FYI Font list used for LSTM training is at https://github.com/tesseract-ocr/langdata_lstm/blob/master/chi_sim/okfonts.txt https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh

Re: [tesseract-ocr] Trained data for E13B font

2019-06-13 Thread Shree Devi Kumar
I didn't find any. I > see free fonts and commercial OCR software but not traineddata. Tessdata > repository obviously doesn't have one, either. > > 2019年6月8日土曜日 1時52分10秒 UTC+9 shree: >> >> Please also search for existing MICR traineddata files. >> >> On

Re: [tesseract-ocr] How to create training data from checkpoint?

2019-06-14 Thread Shree Devi Kumar
Use the --stop_training flag. See the example below:
~/tesseract/bin/src/training/lstmtraining \
  --stop_training \
  --continue_from ~/tesstutorial/tagalog/layer_checkpoint \
  --traineddata ~/tesstutorial/tglgtrain/eng/eng.traineddata \
  --model_output ~/tesstutorial/tagalog/tglg.traineddata
On Fri, Jun 1

Re: [tesseract-ocr] Trained data for E13B font

2019-06-14 Thread Shree Devi Kumar
:26 PM ElGato ElMago wrote: > Thanks a lot, shree. It seems you know everything. > > I tried the MICR0.traineddata and the first two mcr.traineddata. The last > one was blocked by the browser. Each of the traineddata had mixed > results. All of them are getting symbols fairly g

Re: [tesseract-ocr] Re: lstmeval shows good result but visualized result looks bad

2019-06-17 Thread Shree Devi Kumar
I don't think you need training to improve results. You need to pre-process the image, straighten it. Use a separate tool to identify each cell of data and then OCR that. You will get best results like that. On Mon, Jun 17, 2019 at 6:07 PM phucp...@gmail.com wrote: > Thanks shree

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Shree Devi Kumar
s/chi_sim/chi_sim.traineddata \ > --eval_listfile ~/tesstutorial/evalplusminus/chi_sim.training_files.txt > 2>&1 | > grep ± > > to check and ± only shows up in Truth but not in OCR > > > 在 2019年6月17日星期一 UTC-4上午11:31:24,shree写道: >> >> combine_tessdata -u

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Shree Devi Kumar
± is being picked for training'? When I set > --debug_interval -1 I found in every iteration it only outputs one line, > does that mean in every iteration only one line is being used for training?? > > 在 2019年6月17日星期一 UTC-4下午2:16:31,shree写道: >> >> How big was your tra

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Shree Devi Kumar
Jingjing Lin wrote: > I was only using two different fonts and It only achieved lowest error > rate of 11.271 after the training, does this mean I really need to increase > the iterations? > > 在 2019年6月17日星期一 UTC-4下午2:16:31,shree写道: >> >> How big was your training text? H

Re: [tesseract-ocr] Custom Tiff/Box pairs support in tesstrain.sh

2019-06-18 Thread Shree Devi Kumar
It should work if your files follow a similar naming convention:
lang.xxxnnn.exp0.tif
lang.xxxnnn.exp0.box
where lang is your language code (e.g. eng) and xxxnnn is any unique random string (the fontname in files generated by text2image). On Tue, Jun 18, 2019 at 2:54 PM hrishikesh kaulwar wrote: > Greeti
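Before running training on custom pairs, it may help to confirm that every .tif following this naming pattern has a matching .box. A sketch, assuming bash; `check_pairs` is a hypothetical helper and is not part of tesstrain.sh.

```shell
#!/usr/bin/env bash
# Verify that each <lang>.<name>.exp<N>.tif in a directory has a .box partner.
# check_pairs is a hypothetical helper; $1 = directory, $2 = language code.
check_pairs() {
  local dir="$1" lang="$2" tif found=0 status=0
  for tif in "$dir"/"$lang".*.exp[0-9]*.tif; do
    [ -e "$tif" ] || continue            # glob matched nothing
    found=$((found + 1))
    if [ ! -f "${tif%.tif}.box" ]; then
      echo "missing box file for $tif"
      status=1
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "no $lang.*.expN.tif files in $dir"
    status=1
  fi
  return "$status"
}
```

`check_pairs ./mydata eng` returns 0 only when at least one pair exists and no .tif is missing its .box; ./mydata is a placeholder directory.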

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Shree Devi Kumar
ed up by the BEST OCR TEXT at all, it always recognizes ± as something > else. What is happening here? Should I increase the number of ±? Or do I > need to increase the number of fonts? I'm trying increasing iterations. > > 在 2019年6月18日星期二 UTC-4上午12:28:25,shree写道: >> >>

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-06-18 Thread Shree Devi Kumar
Check ~/tesstutorial/trainplusminus Did your earlier training complete correctly? Does ~/tesstutorial/trainplusminus/eng/eng.traineddata exist? On Tue, Jun 18, 2019 at 8:11 PM fady taher wrote: > Am trying to fine tune tesseract > > but I keep getting the error *mgr_.Init(traineddata_path.c_str(

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-06-18 Thread Shree Devi Kumar
n 18, 2019 at 8:46 PM fady taher wrote: > Nop, this file doesn't exist yet > only contains > > *eng.charset_size=110.txt* > *eng.unicharset* > > > On Tue, Jun 18, 2019 at 4:46 PM Shree Devi Kumar > wrote: > >> Check ~/tesstutorial/trainplusminus &g

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-06-18 Thread Shree Devi Kumar
read data from: > /home/sw/repo/langdata/eng/eng.configNull char=2Reducing Trie to > SquishedDawgError during conversion of wordlists to DAWGs!!* > > On Tue, Jun 18, 2019 at 5:18 PM Shree Devi Kumar > wrote: > >> That means >> >> src/training/tess

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Shree Devi Kumar
r adding a few characters to eng. With > such high error rate, I would not be surprised that it could't recognize > some special characters like ±. Is this it for chi_sim? Or can I increase > iterations to make the error rate smaller? > Thanks for your help. > > 在 2019年6月18日星期二 UTC

Re: [tesseract-ocr] Custom Tiff/Box pairs support in tesstrain.sh

2019-06-19 Thread Shree Devi Kumar
> Image is attached above. and two files generated are also attached. > On Tuesday, June 18, 2019 at 3:08:19 PM UTC+5:30, shree wrote: >> >> It should work if your files follow similar naming convention. >> >> lang.xxxnnn.exp0.tif >> lang.xxxnnn.exp0.box >&

Re: [tesseract-ocr] Custom Tiff/Box pairs support in tesstrain.sh

2019-06-19 Thread Shree Devi Kumar
>Also one more doubt is when I use lstm.train command a text file also gets generated with lstmf file You can ignore that txt file. Only lstmf is used for further processing. On Wed, Jun 19, 2019 at 2:44 PM hrishikesh kaulwar wrote: > Hello shree, > I tried again with .tif and l

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-19 Thread Shree Devi Kumar
Can you please test on arrows (↑ > <https://en.wikipedia.org/wiki/%E2%86%91_(disambiguation)> or ↓ > <https://en.wikipedia.org/wiki/%E2%86%93_(disambiguation)>) instead of ± > if it's not inconvenient for you? > > 在 2019年6月18日星期二 UTC-4下午2:21:18,shree写道: >> >

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-19 Thread Shree Devi Kumar
Old thread https://groups.google.com/forum/#!searchin/tesseract-ocr/layer$20chi_sim%7Csort:date/tesseract-ocr/iFMg7Gjczq4/f7_XRop2BAAJ On Wed, Jun 19, 2019 at 9:13 PM Shree Devi Kumar wrote: > Update: > > 1. When using a smaller training_text for chi_sim for plus training, the >

Re: [tesseract-ocr] Re: Custom Tiff/Box pairs support in tesstrain.sh

2019-06-19 Thread Shree Devi Kumar
See tesstrain_utils.sh On Thu, 20 Jun 2019, 10:55 hrishikesh kaulwar, wrote: > > Hey shree could you tell me what line in tesstrain.sh takes care of user > provided tiff box pairs. Like what is the line which creates lstmf files > from those pairs and then puts the name of ls

Re: [tesseract-ocr] Re: Custom Tiff/Box pairs support in tesstrain.sh

2019-06-20 Thread Shree Devi Kumar
utput directory On Thu, Jun 20, 2019 at 10:55 AM hrishikesh kaulwar wrote: > > Hey shree could you tell me what line in tesstrain.sh takes care of user > provided tiff box pairs. Like what is the line which creates lstmf files > from those pairs and then puts the name of lstmf files in t

Re: [tesseract-ocr] Re: What do we inherit from tessdata_best when doing fine tuning?

2019-06-20 Thread Shree Devi Kumar
There are different types of fine-tuning. This is my understanding: When you fine-tune for Impact (a new font), the unicharset and dawgs remain the same. The lstm is modified for the font. Iterations should stay between 300 and 400 only. With plus-minus training for adding a character, the lstm (language mod

Re: [tesseract-ocr] Re: FontAwesome and Tesseract

2019-06-20 Thread Shree Devi Kumar
See https://github.com/Shreeshrii/tessdata_emoji Font Awesome uses PUA Unicode range for the icons. So it did not work with text2image. I used other emoji fonts. The script and training data used are also in the repo. On Tue, Jun 18, 2019 at 12:04 AM Jason wrote: > Can I "bump" this? > > Even
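A small sketch of how one might check a training text for Private Use Area characters before feeding it to text2image, since the reply above points to the PUA range as the reason Font Awesome did not work. It relies only on Python's `unicodedata` module; the function name is hypothetical.

```python
import unicodedata


def private_use_chars(text):
    """Return the characters of `text` whose Unicode category is 'Co' (private use)."""
    return [ch for ch in text if unicodedata.category(ch) == "Co"]
```

An empty result means the text contains no private-use code points; it says nothing about whether a given font can actually render the remaining characters.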

Re: [tesseract-ocr] Re: Divergence in Trained data

2019-06-21 Thread Shree Devi Kumar
LSTM training can take days depending on the amount of training data. Be patient and wait. On Fri, Jun 21, 2019 at 3:37 PM Pooja Kamra wrote: > Please help. > > On Thursday, June 20, 2019 at 5:19:19 PM UTC+5:30, Pooja Kamra wrote: >> >> Hi, >> For training I have provided target_error_rate 4. >> But

Re: [tesseract-ocr] Re: lstmtraining generates only checkpoint file, how can i get traineddata?

2019-06-21 Thread Shree Devi Kumar
Run from the main tesseract directory. The directory structure has been changed. It will be src/training/lstmtraining if you built like that. If you installed it, it should be accessible as `lstmtraining`. On Fri, Jun 21, 2019 at 8:36 PM sai sumanth Kalluri < saisumanthkall...@gmail.com> wrote: > Ma
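A hedged sketch of the two points in this thread: finding the binary (installed on PATH, otherwise `src/training/lstmtraining` under the source tree) and assembling the command that converts a checkpoint into a traineddata file. The flags `--stop_training`, `--continue_from`, `--traineddata` and `--model_output` are the ones described in the Tesseract 4 training documentation; the helper names and all file paths are placeholders.

```python
import shutil
from pathlib import Path


def locate_lstmtraining(source_root="."):
    """Prefer an installed lstmtraining; otherwise assume the in-tree build location."""
    installed = shutil.which("lstmtraining")
    if installed:
        return installed
    return str(Path(source_root) / "src" / "training" / "lstmtraining")


def stop_training_command(binary, checkpoint, starter_traineddata, output):
    """Build the argument list that writes a traineddata file from a checkpoint."""
    return [
        binary,
        "--stop_training",
        "--continue_from", checkpoint,
        "--traineddata", starter_traineddata,
        "--model_output", output,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the placeholder paths are replaced with real ones.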

Re: [tesseract-ocr] Re training with Tesseract 3.x

2019-06-21 Thread Shree Devi Kumar
see https://github.com/tesseract-ocr/langdata https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05 On Sat, Jun 22, 2019 at 2:18 AM Sarasi Lalithsena wrote: > Hi, > > Is t

Re: [tesseract-ocr] problem creating box file for tesseract 4 lstm training

2019-06-23 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh https://github.com/tesseract-ocr/tesseract/issues/2357 On Sun, Jun 23, 2019 at 3:28 PM madhav barthwal wrote: > hello all, > I am currently unable to generate the box file for the respective text > fil

Re: [tesseract-ocr] understading lstmeval and use it on pretrained models for comparison

2019-06-27 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/blob/master/doc/lstmeval.1.asc When using a checkpoint you also need to use the starter traineddata file used for training. Or give the final traineddata file as the model. So, if after training you have converted the checkpoint to a traineddata, you can use th
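The two cases in the reply above can be sketched as a small helper that builds the lstmeval argument list. `--model`, `--traineddata` and `--eval_listfile` are the options listed in the lstmeval man page linked above; the function name, the `.checkpoint` suffix test and the file names are assumptions for illustration.

```python
def lstmeval_command(model, eval_listfile, starter_traineddata=None):
    """Build an lstmeval command for either a checkpoint or a final traineddata."""
    cmd = ["lstmeval", "--model", model, "--eval_listfile", eval_listfile]
    if model.endswith(".checkpoint"):
        # A checkpoint carries only the network, so the starter traineddata
        # used for training has to be supplied as well.
        if starter_traineddata is None:
            raise ValueError("a checkpoint model needs the starter traineddata")
        cmd += ["--traineddata", starter_traineddata]
    return cmd
```

With a final traineddata as the model, the third argument can be left out.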

Re: [tesseract-ocr] understading lstmeval and use it on pretrained models for comparison

2019-06-27 Thread Shree Devi Kumar
/eng.training_files.txt On Thu, 27 Jun 2019, 22:47 Shree Devi Kumar, wrote: > See > https://github.com/tesseract-ocr/tesseract/blob/master/doc/lstmeval.1.asc > > When using checkpoint you need to also use the starter traineddata file > used for training. > > Or give final trai

Re: [tesseract-ocr] Tesseract fails for super clear text string

2019-06-28 Thread Shree Devi Kumar
I reduced size by half, added a white border and changed to 300 dpi. Works perfectly now. ubuntu@tesseract-ocr:~/TEST$ tesseract date_final4.tiff - --dpi 300 Page 1 Empty page!! Empty page!! ubuntu@tesseract-ocr:~/TEST$ tesseract date_final4.tif - --dpi 300 Page 1 20:08:00 Modified image attached

Re: [tesseract-ocr] Tesseract fails for super clear text string

2019-06-28 Thread Shree Devi Kumar
Interactively changed in IrfanView. On Fri, 28 Jun 2019, 18:20 JH Sundberg, wrote: > Thank you so much - did you do this in imagemagick and if so, do you have > the code for the image manipulation? > Jonas > > > On Friday, 28 June 2019 14:29:14 UTC+2, shree wrote: >>

Re: [tesseract-ocr] understading lstmeval and use it on pretrained models for comparison

2019-06-28 Thread Shree Devi Kumar
ining would be to try a > lstmeval on each checkpoint, but I think there must be a better way? > Otherwise the *--eval_listfile* argument would be useless in > lstmtraining, but I can't find out how it is used. > > Thank you :) > > On Thursday, 27 June 2019 at 19:17:46 UTC+2, shree wrote

Re: [tesseract-ocr] understading lstmeval and use it on pretrained models for comparison

2019-06-28 Thread Shree Devi Kumar
=3.8059549, Word error rate=12.294499 wrote checkpoint. On Fri, Jun 28, 2019 at 9:09 PM Shree Devi Kumar wrote: > Your best source for documentation is the source code. See > > > https://github.com/tesseract-ocr/tesseract/blob/f522b039a52ae0094fb928ac60a66c4ae0f6c5b9/src/training/lstmtrai
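A minimal sketch for pulling the word error rate out of saved training output like the fragment above. The label in front of the first number is cut off in the snippet, so only the `Word error rate=` part is matched; the function name is made up for illustration.

```python
import re

# Matches only the label that is fully visible in the log fragment.
WORD_ERROR_RE = re.compile(r"Word error rate=([0-9]+(?:\.[0-9]+)?)")


def word_error_rates(log_lines):
    """Return every word error rate found in the given log lines, in order."""
    rates = []
    for line in log_lines:
        for value in WORD_ERROR_RE.findall(line):
            rates.append(float(value))
    return rates
```

Collecting these values over a run gives a rough picture of how the evaluation error moves between checkpoints.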

Re: [tesseract-ocr] Invalid resolution 0, using 70dpi instead

2019-06-28 Thread Shree Devi Kumar
See discussion at https://github.com/tesseract-ocr/tesseract/issues/756#issuecomment-285786671 On Sat, Jun 29, 2019 at 1:51 AM Mox Betex wrote: > When I run tesseract on file F26_line-004.png I get message Invalid > resolution 0, using 70 dpi instead. > > Can someone explain me why? > > -- > You

Re: [tesseract-ocr] Choice Iterator only shows one choice for each character

2019-07-01 Thread Shree Devi Kumar
Take a look at https://github.com/tesseract-ocr/tesseract/blob/ab09b09da66f458002f01d0bc4ffeee8eff58f6e/src/ccmain/tesseractclass.cpp#L524 On Mon, Jul 1, 2019 at 2:45 PM Jochen Naumann wrote: > Hi, I am using the official api example for iterating over the choices for > characters and getting th
