Re: [tesseract-ocr] Can I mix tiff/box files generated by ocrd-train with original training data used to train specific language in tesseract4 (from langdata direcotry)

2018-09-04 Thread Shree Devi Kumar
fine tune the dan.traineddata for my specific use case. > However, the model is over fitting on my data and seems to be forgetting > the original data it was trained on. I remember I have read somewhere that > this can be solved by showing the original training data to the network so >

Re: [tesseract-ocr] Can I mix tiff/box files generated by ocrd-train with original training data used to train specific language in tesseract4 (from langdata direcotry)

2018-09-04 Thread Shree Devi Kumar
My earlier suggestion of mixing the two kinds of images - scanned pages and text2image created synthetic ones - was from before ocrd-train was available. ocrd-train works on single line images, while tesstrain.sh works on multipage tifs. By mixing these the single line images will get more iterati

Re: [tesseract-ocr] Re: Error when executing combine_lang_model script

2018-09-04 Thread Shree Devi Kumar
3:25 AM, Shandigutt wrote: > Thank you very much for sorting things out Shree. But I have one more > question > > When I run tesstrain.sh I don't pass my words list, punctuation and > numbers as input parameters. But I keep those files in the langdata folder.

Re: [tesseract-ocr] Making custom traineddata

2018-09-05 Thread Shree Devi Kumar
I think fine-tuning will be a better option than training from scratch. Using a small training/test text (40 lines), I get: + lstmeval --verbosity 0 --model /home/ubuntu/tessdata_best/script/Latin.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.tra

Re: [tesseract-ocr] Making custom traineddata

2018-09-05 Thread Shree Devi Kumar
See https://github.com/Shreeshrii/tessdata_ocrb for the files and traineddata. On Wed, Sep 5, 2018 at 8:51 PM, Shree Devi Kumar wrote: > I think finetune will be a better option than training from scratch. > > Using a small training/test text - 40 lin

Re: [tesseract-ocr] Making custom traineddata

2018-09-06 Thread Shree Devi Kumar
> When it's combining language model I've spotted that it's making some dawg files. Yes, it takes the files from langdata repo specified in the training command. You could change langdata/pol/pol.wordlist to have only the LAST NAMES and GIVEN NAMES, pol.punc to have only < and change number forma

Re: [tesseract-ocr] Error when trying to run lstmtraining: Can't encode transcription

2018-09-08 Thread Shree Devi Kumar
> Warning: given outputs 111 not equal to unicharset of 90. Your starter traineddata has a unicharset of 90 entries, but your --net_spec specifies 111 outputs. > Encoding of string failed! This means that some of the characters in the displayed string are NOT in the unicharset of your
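A minimal sketch of checking for this mismatch before starting a run. It assumes the usual unicharset text layout, where the first line holds the entry count, and a net_spec whose final output layer looks like `O1c111`. The helper names are hypothetical and not part of Tesseract.

```python
import re


def unicharset_size(unicharset_text):
    """Return the entry count stored on the first line of a unicharset file."""
    return int(unicharset_text.splitlines()[0].strip())


def net_spec_outputs(net_spec):
    """Return the number of outputs declared by the final O layer, e.g. O1c111."""
    match = re.search(r"O\d[a-z](\d+)\]?\s*$", net_spec)
    if match is None:
        raise ValueError(f"no output layer found in {net_spec!r}")
    return int(match.group(1))


def outputs_match(unicharset_text, net_spec):
    """True when the net_spec output count equals the unicharset size."""
    return unicharset_size(unicharset_text) == net_spec_outputs(net_spec)
```

With the numbers from the warning above (90 unicharset entries, 111 outputs), `outputs_match` returns False, which is the situation to fix before training.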

Re: [tesseract-ocr] How we can create a Tesseract fast model

2018-09-09 Thread Shree Devi Kumar
see https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc *-c* *.traineddata* *FILE*…: Compacts the LSTM component in the .traineddata file to int. This converts the float model to integer. On Sun, Sep 9, 2018 at 4:27 PM, wrote: > Hi, > > I am working on training

Re: [tesseract-ocr] Training with a large number of LSTMF files

2018-09-11 Thread Shree Devi Kumar
> I assumed that each iteration would be a training pass over all the lstmf files No. Each iteration is just one line of text in one font. Change debug interval to -1 to see details of each iteration. --debug_interval -1 \ Finetuning with 300-400 iterations may not be enough for handwriting.
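Since one iteration consumes one line image, a rough way to judge an iteration budget is to convert it into passes over the data. This is only a back-of-the-envelope sketch; the function name and the sample numbers are made up for illustration.

```python
def passes_over_data(iterations, num_line_samples):
    """Approximate number of passes over the training set.

    One iteration is one line image (one text line in one font), so the
    number of passes is iterations divided by the number of line samples.
    """
    if num_line_samples <= 0:
        raise ValueError("num_line_samples must be positive")
    return iterations / num_line_samples


# Hypothetical numbers: 400 iterations over 2000 line images.
print(passes_over_data(400, 2000))
```

In that hypothetical case the trainer would have seen only a fifth of the lines once.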

Re: [tesseract-ocr] Re: Training with a large number of LSTMF files

2018-09-11 Thread Shree Devi Kumar
11, 2018 at 8:43 PM, ProgressNotPerfection < jimquitten...@gmail.com> wrote: > Thank you Shree > I ran with --debug_interval -1 as you suggested and I can see 1 > iteration showing 1 text line from a given font (lstmf) and then the next > iteration showing 1 text line from t

Re: [tesseract-ocr] Re: Training with a large number of LSTMF files

2018-09-14 Thread Shree Devi Kumar
> Very interesting about having a single box around the whole image though That only works when the whole image is a single line of text. Example of box file created by ocrd for a single line image with groundtruth as "Athāto Gobhiloktānām anyeshāṁ caiva karmaṇām" - note it ends in a line with a
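A rough sketch of how such a single-line box file could be generated. Every character gets the same box covering the whole image. The end-of-line marker used here (a tab entry placed just outside the image) is my assumption about the generator's convention, so please compare against a box file produced by ocrd-train before relying on it. The function name is hypothetical.

```python
import unicodedata


def single_line_box(text, width, height):
    """Build box-file lines for one text-line image of the given pixel size."""
    glyphs = []
    for char in text:
        if glyphs and unicodedata.combining(char):
            # Keep a combining mark together with its base character.
            glyphs[-1] += char
        else:
            glyphs.append(char)
    # Every glyph shares one box spanning the whole line image.
    lines = [f"{glyph} 0 0 {width} {height} 0" for glyph in glyphs]
    # Assumed end-of-line marker: a tab entry just outside the image.
    lines.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return lines
```

This only makes sense when the image really contains a single line of text, as noted above.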

Re: [tesseract-ocr] Re: Training with a large number of LSTMF files

2018-09-14 Thread Shree Devi Kumar
> So with say 2000 fonts then (i.e. handwriting samples by 2000 authors), I suppose there's far more variation than the standard sized tesseract model is intended for. I did read that the network spec cannot be changed by finetuning so maybe I should try training from scratch to create a bigger mod

Re: [tesseract-ocr] Documentation related to lang data

2018-09-15 Thread Shree Devi Kumar
*desired_characters* This is used by Google internally when creating the training text. Should I enter all those compound character combinations to this file? No, since this is not used by tesstrain.sh - at least in the open source version in Github. *okfonts.txt* This lists the Unicode fonts

Re: [tesseract-ocr] Documentation related to lang data

2018-09-15 Thread Shree Devi Kumar
>Are they created using the same files we're talking as sin.numbers, sin.punc and sin.wordlist? Yes, the dawg files are created from these and the unicharset. The same unicharset should be used for lstm training. On Sat, 15 Sep 2018, 21:46 Pubudu Tharaka Viswakula, wrote: > Hi Shree

Re: [tesseract-ocr] Documentation related to lang data

2018-09-15 Thread Shree Devi Kumar
_text. Change -m 7 to -m 1 to create file with just one sample of each. Sort unique removes duplicate lines. This can be used to create a smaller training_text useful for finetuning. On Sat, Sep 15, 2018 at 9:23 PM, Shree Devi Kumar wrote: > *desired_characters* > > This is used by
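The de-duplication step can also be sketched in Python, assuming the goal is the same as `sort -u`: keep one copy of each non-empty line. Note that Python sorts by code point, so the order may differ from a locale-aware `sort`. The function name is hypothetical.

```python
def unique_sorted_lines(text):
    """Return the distinct non-empty lines of text in sorted order."""
    return sorted({line for line in text.splitlines() if line.strip()})


sample = "second line\nfirst line\nsecond line\n"
print("\n".join(unique_sorted_lines(sample)))
```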

Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Shree Devi Kumar
I think pdf creation adds a text layer only and there isn't an option to add HOCR to it. @jbreiden can confirm. On Mon, Sep 17, 2018 at 6:10 PM, Monica wrote: > I have tried this, but this is showing the default behaviour. I think the > default output is overlaying on pdf instead of hocr out. >

Re: [tesseract-ocr] combine_lang_model makes no dawg file

2018-09-17 Thread Shree Devi Kumar
I use it as follows and it works. Please check that you are using correct paths for the files. combine_lang_model \ --input_unicharset ./layersan/san.unicharset \ --script_dir ~/langdata \ --words ~/langdata/san/san.wordlist \ --numbers ~/langdata/san/san.numbers \ --puncs ~/langdata/san/san.punc

Re: [tesseract-ocr] Fine tuning existing model

2018-09-18 Thread Shree Devi Kumar
, Sep 18, 2018 at 5:36 PM, Varun Sab wrote: > HI @ Lorenzo Blz, > I am also getting the same segmentation fault error. Can you please > suggest how you solved it. > > > > > On Friday, June 29, 2018 at 9:03:34 PM UTC+5:30, Lorenzo Blz wrote: >> >> Hi Shree, t

Re: [tesseract-ocr] Install 4.0.0-beta.4 on Ubuntu

2018-09-21 Thread Shree Devi Kumar
Try the ppa by Alex, it should have a newer version https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr?field.series_filter=bionic On Fri, 21 Sep 2018, 08:40 , wrote: > Maybe a silly question since I'm not very familiar with Linux. > > I tried upgrading to tesseract 4 today on my Mac (

Re: [tesseract-ocr] Text2image doens't create font list

2018-09-25 Thread Shree Devi Kumar
Are the fonts in /usr/share/fonts? Reduce --min_coverage from 1 to .99 and see if some fonts are found. On Tue, 25 Sep 2018, 07:50 Zohreh Khosrobeygi, wrote: > Hi, > I use > tesseract 4.0.0-beta.4 > leptonica-1.74.4 > libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zl

Re: [tesseract-ocr] Text2image doens't create font list

2018-09-25 Thread Shree Devi Kumar
Also check that your langdata path matches your environment. On Tue, 25 Sep 2018, 08:11 Shree Devi Kumar, wrote: > Are the fonts in /usr/share/fonts ? > > Reduce the > --min_coverage 1 > > to .99 and see if some fonts are found. > > > > On Tue, 25 Sep 2018, 07:5

Re: [tesseract-ocr] Text2image doens't create font list

2018-09-25 Thread Shree Devi Kumar
lPKcz > I have this error. How can I solve this error? > > On Tue, Sep 25, 2018 at 3:41 PM Shree Devi Kumar > wrote: > >> Are the fonts in /usr/share/fonts ? >> >> Reduce the >> --min_coverage 1 >> >> to .99 and see if some fonts are found. >

Re: [tesseract-ocr] Text2image doens't create font list

2018-09-25 Thread Shree Devi Kumar
ernalPKcz > It shows same error. > > On Tue, Sep 25, 2018 at 4:20 PM Shree Devi Kumar > wrote: > >> What's the output for? >> >> which text2image >> >> text2image -v >> >> >> >> >> On Tue, 25 Sep 2018, 08:39 Khosrobei

Re: [tesseract-ocr] Compute CTC targets failed while training

2018-09-25 Thread Shree Devi Kumar
--fontlist "Arial" Does that have good coverage for Farsi? --max_iterations 5000 You are trying to train from scratch with 18000 lines of text and only 5000 iterations. That will not work. Ray has trained on hundreds of thousands of lines of text and millions of iterations. On Tue, 25 Sep 2

Re: [tesseract-ocr] Compute CTC targets failed while training

2018-09-26 Thread Shree Devi Kumar
d T > tosp_old_to_constrain_sp_kn T > tosp_old_sp_kn_th_factor 4.0 > > tosp_only_small_gaps_for_kern T > tosp_use_pre_chopping T > I used all these, but now my model doesn't learn. > Has any thing changed in beta 4 for example text2image? > > On Wed, Sep 26, 2018 at 12:53 AM

Re: [tesseract-ocr] Network specification for tessdata_best files

2018-09-26 Thread Shree Devi Kumar
It is NOT there in ALL traineddata files in tessdata_best. You can view the version string by using combine_tessdata. For tessdata_fast, the network specs are available on a page in the wiki. On Wed, 26 Sep 2018, 00:08 anonynamja, wrote: > I understand that the net_spec used in training is containe

Re: [tesseract-ocr] Cuneiform scripts

2018-09-28 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/issues/1781#issuecomment-422845120 On Fri, 28 Sep 2018, 01:01 'Andreas H.' via tesseract-ocr, < tesseract-ocr@googlegroups.com> wrote: > Hi, > > is there a way to let tesseract work with and recognize cuneiform scripts > like akkadian or sumerian? >

Re: [tesseract-ocr] Failure when creating training data

2018-09-30 Thread Shree Devi Kumar
Looks like your langdata dir does not have the script unicharset files for Signals and Latin scripts. Failed to load script unicharset from:../training/Latin.unicharset Failed to load script unicharset from:../training/Sinhala.unicharset On Sun, 30 Sep 2018, 18:27 Shandigutt, wrote: > Hi, >

Re: [tesseract-ocr] Failure when creating training data

2018-09-30 Thread Shree Devi Kumar
Sinhala script Sorry about the wrong autocorrect on phone On Sun, 30 Sep 2018, 19:33 Shree Devi Kumar, wrote: > Looks like your langdata dir does not have the script unicharset files for > Signals and Latin scripts. > > Failed to load script unicharset from:../training/Lati

Re: [tesseract-ocr] Problem when using custom-trained model with default tesseract 4 model

2018-10-01 Thread Shree Devi Kumar
Have you tried https://github.com/tesseract-ocr/tessdata_fast/blob/master/script/Thai.traineddata which is supposed to support both Thai and English On Mon, Oct 1, 2018 at 5:33 AM Rujrawee K wrote: > Hi, > > After I trained my custom Thai language model to use in my tesseract 4, > it's working

Re: [tesseract-ocr] Problem when using custom-trained model with default tesseract 4 model

2018-10-01 Thread Shree Devi Kumar
1. Have you trained for legacy tesseract engine or for LSTM? 2. Which default traineddata are you using? 3. For us to test, please provide an image and the commands used for testing and the output you got. On Mon, Oct 1, 2018 at 11:08 PM Rujrawee K wrote: > Hi Shree, > Yes we tried th

Re: [tesseract-ocr] Problem when using custom-trained model with default tesseract 4 model

2018-10-02 Thread Shree Devi Kumar
There is an open issue with a similar problem in the issue tracker. It will help to move the discussion there. I will test with your sample image and also post a link to the issue. On Tue, 2 Oct 2018, 01:01 Rujrawee K, wrote: > > ok, Shree, I miscommunicated with my colleague, he said this p

Re: [tesseract-ocr] Problem when using custom-trained model with default tesseract 4 model

2018-10-02 Thread Shree Devi Kumar
Please see https://github.com/tesseract-ocr/tesseract/issues/1579 and continue further discussion there. On Tue, Oct 2, 2018 at 9:52 AM Shree Devi Kumar wrote: > There is an open issue with similar problem in issue tracker. It will help > to move the discussion there. > > I will te

Re: [tesseract-ocr] Need help building Tesseract for OpenCv, Where are the include files?

2018-10-03 Thread Shree Devi Kumar
+Egor Pugin Have you checked https://github.com/tesseract-ocr/tesseract/blob/master/CMakeLists.txt On Wed, Oct 3, 2018 at 3:28 PM Mich Po wrote: > I'm trying to build OpenCV with the Tesseract OCR module to use on a > raspberry pi. > > There is very little information online on how to build th

Re: [tesseract-ocr] Fine tuning the Old traineddat file

2018-10-11 Thread Shree Devi Kumar
No. On Thu, 11 Oct 2018, 03:41 Mugunthan, wrote: > Hi, > > Is there any way to fine to the old trained data files (3.05) using the > new version 4.00 [LSTM]? > > -- > You received this message because you are subscribed to the Google Groups > "tesseract-ocr" group. > To unsubscribe from this gro

Re: [tesseract-ocr] New JPN_VERT traineddata (for 4.0)

2018-10-15 Thread Shree Devi Kumar
Thank you for sharing. It will be helpful if you add this info to the readme file in your github repo also. Please share the training options that you used, number of fonts, iterations etc. It will be useful as a reference. On Mon, 15 Oct 2018, 17:27 Seokbong Choi, wrote: > Hello all, > > Durin

Re: [tesseract-ocr] Multiple Languages

2018-10-16 Thread Shree Devi Kumar
Please try with tessdata_fast

Re: [tesseract-ocr] train tesseract OCR 4.0

2018-10-16 Thread Shree Devi Kumar
. On Tue, 16 Oct 2018, 08:33 kislay bajpai, wrote: > Hello Shree, > > I am confused how to train tesseract 4.0 alpha for new font (E 13B). > Please help me for it. > .

Re: [tesseract-ocr] What do iteration numbers mean in the train logging?

2018-10-19 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/blob/3a7f5e4de459f4c64f36e08b18ce1b66b1fbc876/src/lstm/lstmtrainer.cpp#L410 On Fri, 19 Oct 2018, 09:01 , wrote: > I get the following log lines while training tesseract: > > At iteration *303839/569300/573167*, Mean rms=0.777%, delta=2.588%, char
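As far as I can tell from the linked source, the three numbers are the learning iteration, the training iteration and the sample iteration, in that order. A small sketch for pulling them out of a log line follows; the dictionary keys are my own labels, not names used by Tesseract.

```python
import re

ITERATION_RE = re.compile(r"At iteration (\d+)/(\d+)/(\d+)")


def parse_iterations(log_line):
    """Extract the three iteration counters from an lstmtraining log line."""
    match = ITERATION_RE.search(log_line)
    if match is None:
        return None
    learning, training, sample = (int(value) for value in match.groups())
    return {"learning": learning, "training": training, "sample": sample}


line = "At iteration 303839/569300/573167, Mean rms=0.777%, delta=2.588%"
print(parse_iterations(line))
```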

Re: [tesseract-ocr] Re: Retrain tesseract 4 model from real image (not from text file and tesstrain.sh)

2018-10-19 Thread Shree Devi Kumar
On Fri, Oct 19, 2018 at 10:02 PM Seokbong Choi wrote: > Can you share the content of "eng.training_files.txt" file? that > --train_listfile argument refers to? > Thanks. > > The contents will differ based on the fonts chosen and the output directory. See the following for a sample: /home/ubuntu/t
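The list file is just one .lstmf path per line. If you need to create it yourself, something like this sketch would do. It assumes all .lstmf files sit in one directory and writes absolute paths so the file works from any working directory. The function name is hypothetical.

```python
from pathlib import Path


def write_listfile(lstmf_dir, listfile):
    """Write the absolute path of every .lstmf file in lstmf_dir, one per line."""
    paths = sorted(path.resolve() for path in Path(lstmf_dir).glob("*.lstmf"))
    Path(listfile).write_text(
        "".join(f"{path}\n" for path in paths), encoding="utf-8"
    )
    return paths
```

The resulting file can then be passed to lstmtraining with --train_listfile.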

Re: [tesseract-ocr] Re: Retrain tesseract 4 model from real image (not from text file and tesstrain.sh)

2018-10-19 Thread Shree Devi Kumar
reenshot from 2018-10-20 10-14-07.png] > > > > On Saturday, 20 October 2018 at 09:19:28 UTC+7, shree wrote: >> >> On Fri, Oct 19, 2018 at 10:02 PM Seokbong Choi >> wrote: >> >>> Can you share the content of "eng.training_files.txt" fi

Re: [tesseract-ocr] Re: Retrain tesseract 4 model from real image (not from text file and tesstrain.sh)

2018-10-19 Thread Shree Devi Kumar
Maybe it is not finding your ./eng.training_files.txt. Try giving its full path in the lstmtraining command.

Re: [tesseract-ocr] train tesseract OCR 4.0

2018-10-22 Thread Shree Devi Kumar
etting no > idea, how to train it. > Please help me out. I am in big trouble. > > version - tesseract4.0 alpha > OS - ubuntu16.04 and RHEL 7.3 (any one i can use) > > On Tue, Oct 16, 2018 at 7:10 PM Shree Devi Kumar > wrote: > >> Please do not use tesseract 4.0

Re: [tesseract-ocr] Re: Where can I get a list of papers that can algorithms that Tesseract system implements?

2018-10-24 Thread Shree Devi Kumar
Please see https://github.com/tesseract-ocr/docs?files=1 And https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#documentation On Wed, 24 Oct 2018, 09:10 Nouran S. Ahmad, wrote: > Hi, > I am looking for the same thing, did you find anything useful after all > this time? > > Thanks,

Re: [tesseract-ocr] Re: train more fonts on trained model fas in tesseract

2018-10-24 Thread Shree Devi Kumar
See the wiki page on training 4.0 and follow the tutorial. On Wed, 24 Oct 2018, 08:09 , wrote: > training/lstmtraining --model_output /path/to/output [--max_image_MB 6000] \ > --continue_from /path/to/existing/model \ > --traineddata /path/to/original/traineddata \ > [--perfect_sample_delay 0] [

Re: [tesseract-ocr] Re: Any suggestions for more accurate Text conversion?

2018-10-28 Thread Shree Devi Kumar
The starter traineddata that you have used does not have any dawg files, based on word list, numbers and punctuation, hence the report that dictionaries are not found. On Fri, 26 Oct 2018, 14:38 Abu Anas, wrote: > I am also having similar problem. I have trained KB-JT-NEW from ben > (continue

Re: [tesseract-ocr] How to improve the quality of Training From Scratch

2018-10-29 Thread Shree Devi Kumar
Please look at the langdata_lstm repo, specifically the chi_sim folder. It has the training_text as well as list of fonts used for LSTM training. On Mon, 29 Oct 2018, 05:40 bruce, wrote: > Recently,I'm using tesseract training my chi_sim language. I want to train > a chi_sim.traineddata better t

Re: [tesseract-ocr] How to improve the quality of Training From Scratch

2018-10-29 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/langdata_lstm/tree/master/chi_sim On Mon, 29 Oct 2018, 14:41 Shree Devi Kumar, wrote: > Please look at the langdata_lstm repo, specifically the chi_sim folder. It > has the training_text as well as list of fonts used for LSTM training. > > On Mon,

Re: [tesseract-ocr] tesstrain.sh with hundreds of fonts

2018-10-30 Thread Shree Devi Kumar
Please check the log file in the tmp directory. There might be some font-related errors there. A Pango-related change to font processing was made recently. Please check the change log. On Tue, 30 Oct 2018, 09:10 , wrote: > I would like to train the tesseract with hundreds of my fonts. My

Re: [tesseract-ocr] How to improve the quality of Training From Scratch

2018-10-30 Thread Shree Devi Kumar
Please read the wiki page regarding training 4.0 and the presentation files in docs by Ray Smith. On Tue, 30 Oct 2018, 02:32 bruce, wrote: > thank you for your reply ,shree. > I've seen the training_text and the list of fonts. > I will try again. > Before I start my next S

Re: [tesseract-ocr] How do I train tesseract 4 for the font Comic Sans MS?

2018-10-30 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact Use the Comic Sans font instead of Impact to fine-tune. On Tue, 30 Oct 2018, 12:32 'rely LIVE' via tesseract-ocr, < tesseract-ocr@googlegroups.com> wrote: > Hello, > > I want to train the default eng.tra

Re: [tesseract-ocr] Generated tif/box files has encountered a problem

2018-10-31 Thread Shree Devi Kumar
--strip_unrenderable_words Have you tried the above option? It should cause words with characters not in the font to be ignored. On Wed, 31 Oct 2018, 06:22 bruce I'm using tesseract training my chi_sim language > I used text2image to generate tif/box. > And one of my font is "YouYuan"(a font des

Re: [tesseract-ocr] Convert image to text shows arrow instead of empty string

2018-11-06 Thread Shree Devi Kumar
Probably you are referring to the form feed symbol, which is the new default for the page separator. You can change the setting by using the config variable. That will make it similar to 3.05. Look in the FAQ page in the wiki. @stweil what about not outputting the page separator symbol if output is just a
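If you cannot change the configuration, the separator can also be removed after the fact. This is a small sketch that assumes the unwanted symbol is the form feed character (`\f`). As far as I know the config variable in question is `page_separator`, but please confirm that in the FAQ. The function name is hypothetical.

```python
def strip_page_separator(text, separator="\f"):
    """Remove the page separator character from OCR output text."""
    return text.replace(separator, "")


ocr_output = "recognized text\n\f"
print(repr(strip_page_separator(ocr_output)))
```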

Re: [tesseract-ocr] Reducing output image quality to make PDF smaller

2018-11-13 Thread Shree Devi Kumar
You can try https://pypi.org/project/ocrmypdf/ which uses tesseract. On Tue, 13 Nov 2018, 07:07 Zdenko Podobny Tesseract approach is to not re-compress/change image type of input image > in pdf creation. > So you need to use other tools for creating smaller pdf. > > Zdenko > > > Tue 13. 11. 2018

Re: [tesseract-ocr] Regarding space in punjabi recogniton

2018-11-21 Thread Shree Devi Kumar
Please provide a sample test image and expected ground truth text. Which version of trained data did you use? On Wed, 21 Nov 2018, 01:33 Vaibhav Kumar Hi, > > I was trying to do text recogniton using tesseract on punjabi language. > The recognition is working fine. > > But there is a little issu

Re: [tesseract-ocr] Regarding space in punjabi recogniton

2018-11-21 Thread Shree Devi Kumar
There are three repositories with trained data files tessdata tessdata_best tessdata_fast Please also share the version info and command used... tesseract -v On Wed, 21 Nov 2018, 09:13 Vaibhav Kumar PFA for the image. > > I used the default punjabi traineddata on which tesseract-ocr is trai

Re: [tesseract-ocr] Regarding space in punjabi recogniton

2018-11-21 Thread Shree Devi Kumar
Please try with tesseract 4.0.0. It should give you better recognition. On Wed, 21 Nov 2018, 12:07 Vaibhav Kumar tesseract -v yields > > *tesseract 3.04.01* > * leptonica-1.73* > * libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : > libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : li

Re: [tesseract-ocr] Regarding space in punjabi recogniton

2018-11-21 Thread Shree Devi Kumar
Read the main wiki page. You can install using Alex's ppa on older versions of Ubuntu. Make sure to uninstall the 3.04 version. On Wed, 21 Nov 2018, 12:35 Vaibhav Kumar I read tesseract 4.x works on ubuntu 18 . > I am using ubuntu 16. > > Isn't there any other solution ?

Re: [tesseract-ocr] lt-lstmtraining: genericvector.h:720: T& GenericVector::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.

2018-11-26 Thread Shree Devi Kumar
What is the version of tesseract? tesseract -v On Mon, 26 Nov 2018, 05:51 Zohreh Khosrobeygi Hi, > I have been runnig about 130G data which are 4000 files. My command is > > /home/kddlab/Desktop/tesseract-master/src/training/lstmtraining \ > --traineddata > /home/kddlab/Desktop/tesseract-ma

Re: [tesseract-ocr] lt-lstmtraining: genericvector.h:720: T& GenericVector::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.

2018-11-26 Thread Shree Devi Kumar
> > *Kind regards,* > *Zohreh Khosrobeygi* > > *Student of IT* > > *University of Tehran, 2016* > > *Phone: (+98)9196042887* > > *Email:khosrobeygi.zo...@ut.ac.ir * > > > > On Mon, Nov 26, 2018 at 3:33 PM Shree Devi Kumar > wrote: > >> What is the v

Re: [tesseract-ocr] lt-lstmtraining: genericvector.h:720: T& GenericVector::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.

2018-11-26 Thread Shree Devi Kumar
osrobeygi* >> >> *Student of IT* >> >> *University of Tehran, 2016* >> >> *Phone: (+98)9196042887* >> >> *Email:khosrobeygi.zo...@ut.ac.ir * >> >> >> >> On Mon, Nov 26, 2018 at 5:25 PM Shree Devi Kumar >> wrote: >>

Re: [tesseract-ocr] lt-lstmtraining: genericvector.h:720: T& GenericVector::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.

2018-11-27 Thread Shree Devi Kumar
In my opinion, the assert still needs to be documented as an issue, with LSTM training. On Tue, 27 Nov 2018, 05:03 Zdenko Podobny Shree, > > issue tracker is not for custom training. Simply because there is not > enough people and > it can not be reproduced... > Did you read

Re: [tesseract-ocr] lt-lstmtraining: genericvector.h:720: T& GenericVector::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.

2018-11-27 Thread Shree Devi Kumar
file/data that cause problem > and create minimal input data that demonstrate problem. Creating issue > without testing case (for reproducing problem) is useless and demotivating. > > Zdenko > > > Tue 27. 11. 2018 at 13:23 Shree Devi Kumar > wrote: > >> In my o

Re: [tesseract-ocr] Train just some layer in tesseract

2018-11-28 Thread Shree Devi Kumar
Look in basetrain.log, usually start of training will display the new network spec being used. The version string is user defined and by default just reports tesseract version number. You will have to assign a new string if you want it to be different. On Wed, 28 Nov 2018, 16:17 Zohreh Khosrobeyg

Re: [tesseract-ocr] the length of input to lstm

2018-12-04 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00 On Tue, 4 Dec 2018, 15:14 Zohreh Khosrobeygi I'm training tesseract from scratch for the Persian Language. But I need > to know about the output of TF conventi

Re: [tesseract-ocr] lstmeval give me perfect result but tesseract command failed

2018-12-06 Thread Shree Devi Kumar
What --psm did you use with tesseract command? On Thu, 6 Dec 2018, 03:01 Bredalas Hello, > > I trained a model from scratch : > > I generated .box .tiff files > > I generated lstmf files with .box and .tiff files > for file in *.tiff; do > echo $file > base=`basename $file .tiff` > tesserac

Re: [tesseract-ocr] [/usr/local/bin/language-specific.sh: 줄 1125: FONTS: unbound variable] Error help me!!

2018-12-07 Thread Shree Devi Kumar
You can use a different font list for training vs testing. This way you will have more control over which fonts are being used. On Fri, 7 Dec 2018, 00:02 SEUNGGWANSHIN Thanks. i don't know why but it works > > this way, what are the differences between creating train data and > creating test da

Re: [tesseract-ocr] Failed to read data eng/eng.config

2018-12-12 Thread Shree Devi Kumar
There is no eng.config file. This is only an informational message, not an error, and training will continue normally. On Wed, 12 Dec 2018, 09:57 I am trying to train the lstm model for my own datasets, following these > steps TrainingTesseract-4.00 >

Re: [tesseract-ocr] Errors when numeric and alphabetic data is mixed

2018-12-14 Thread Shree Devi Kumar
Try to include mixed data in your training files and see if that helps. On Fri, 14 Dec 2018, 17:14 'ilochray' via tesseract-ocr < tesseract-ocr@googlegroups.com wrote: > I am using the API to read data from an image. I have created training > files for the fonts I process and I pre-process the im

Re: [tesseract-ocr] Recognition of chemical formulas

2018-12-17 Thread Shree Devi Kumar
Please take a look at related issue regarding subscripts/superscripts (in langdata or tessdata repos). As far as I understand, the currently used normalization routines convert them to regular numbers. Hence, training did not seem to help in my fine tuning trial. However, you can give it a try a
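To illustrate the effect, here is what Unicode compatibility normalization (NFKC) does to subscripts and superscripts in Python. This is only an analogy: I am not claiming Tesseract's own normalization routine is NFKC, only that a comparable folding would explain the subscripts being lost in the training text.

```python
import unicodedata


def fold_sub_superscripts(text):
    """Apply NFKC, which maps sub/superscript digits to regular digits."""
    return unicodedata.normalize("NFKC", text)


print(fold_sub_superscripts("H\u2082O"))  # subscript two in H2O
print(fold_sub_superscripts("x\u00b2"))   # superscript two in x squared
```

Once the ground truth is folded like this, the network never sees a distinct subscript class to learn.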

Re: [tesseract-ocr] Recognition of chemical formulas

2018-12-18 Thread Shree Devi Kumar
ipts not being recognized as numbers. On which data did you try to > fine tune? > > On Monday, 17 December 2018 19:13:47 UTC+1, shree wrote: >> >> Please take a look at related issue regarding subscripts/superscripts (in >> langdata or tessdata repos). >> >

Re: [tesseract-ocr] Error running the BEST Model of "eng.traineddata"

2018-12-19 Thread Shree Devi Kumar
Tesseract Open Source OCR Engine v3.04.01 with Lept The tessdata_best models are for use with tesseract 4 On Wed, 19 Dec 2018, 06:32 I am trying to improve my accuracy to OCR tool which I built using > pytesseract. > As I was not getting good results using default eng.traineddata, I saw a > rep

Re: [tesseract-ocr] Tags Users Unanswered Where i can find fonts Tesseract 4 was trained on?

2019-01-02 Thread Shree Devi Kumar
See the okfonts.txt for the language in the tesseract-ocr/langdata_lstm repo. On Wed, 2 Jan 2019, 09:54 I don't wanna train for new font, I just want use fonts which were chosen > for training Tesseract 4 for more accuracy

Re: [tesseract-ocr] Tesstrain.sh fails when provided > 7 tif/box pairs

2019-01-04 Thread Shree Devi Kumar
tesstrain.sh is set up to process files in batches of 8 simultaneously. Are you allowing the script to run to completion? On Fri, 4 Jan 2019, 11:27 Hey all, > > I'm currently working on a program that explores the handwritten OCR > capabilities of Tesseract. > > I have ~1400 images with ~8 lines

Re: [tesseract-ocr] Tesstrain.sh fails when provided > 7 tif/box pairs

2019-01-04 Thread Shree Devi Kumar
You can also try the ocr-d/train project which can train using scanned images. On Fri, 4 Jan 2019, 12:44 Shree Devi Kumar tesstrain.sh is set up to process files in batches of 8 simultaneously. > Are you allowing the script to run to completion? > > On Fri, 4 Jan 2019, 11:27

Re: [tesseract-ocr] Re: Tesstrain.sh fails when provided > 7 tif/box pairs

2019-01-04 Thread Shree Devi Kumar
That's indeed strange. What's your version of tesseract and o/s? You should not be getting such errors with unmodified tesstrain.sh script. On Fri, Jan 4, 2019 at 1:15 PM wrote: > Disregard my last question. I figured out how to modify the batch size and > found that it will hang indefinitely a

Re: [tesseract-ocr] Expected output of LSTMTRAINING

2019-01-07 Thread Shree Devi Kumar
You need to convert the checkpoint to a traineddata file. Please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files On Mon, 7 Jan 2019, 11:03 Hey all, > > After some wrangling, I've been able to get Tesseract to successfully > train on my datase

[tesseract-ocr] Fwd: Need programming for custom license plate recognition

2019-01-07 Thread Shree Devi Kumar
Forwarding below a request for programming for custom license plate recognition. Those interested can contact *jagdishgg on github.* -- Forwarded message - From: jagdishgg Date: Mon, 7 Jan 2019, 05:14 Subject: [Shreeshrii/imagessan] How to contact you (#1) To: Shreeshrii/imagessa

Re: [tesseract-ocr] why tesseract can't load my model?

2019-01-16 Thread Shree Devi Kumar
>>rename math_checkpoint at mathoutput to math.traineddata That is not the way to convert a checkpoint to traineddata. Use lstmtraining with the --stop_training flag. See the wiki for details. On Wed, Jan 16, 2019 at 5:29 PM De Zero wrote: > I train my math model with trainModel.sh > > eng.training_text.1
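For reference, a sketch of assembling that conversion command from Python. The flags are the ones described in the training wiki (--stop_training, --continue_from, --traineddata, --model_output). All paths in the example are placeholders, and the function name is hypothetical.

```python
def stop_training_command(checkpoint, starter_traineddata, output_traineddata):
    """Build the lstmtraining call that converts a checkpoint to traineddata."""
    return [
        "lstmtraining",
        "--stop_training",
        "--continue_from", checkpoint,
        "--traineddata", starter_traineddata,
        "--model_output", output_traineddata,
    ]


# Placeholder paths for illustration only.
command = stop_training_command(
    "mathoutput/math_checkpoint",
    "starter/math.traineddata",
    "mathoutput/math.traineddata",
)
print(" ".join(command))
```

To actually run it, pass the list to `subprocess.run(command, check=True)` on a machine where lstmtraining is on the PATH.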

Re: [tesseract-ocr] Finetune 4.0 location of new punc and numbers files?

2019-01-19 Thread Shree Devi Kumar
It depends on what you are fine-tuning for. I changed the punc and numbers files so that only punctuation characters present in the unicharset were used, e.g. for a digits traineddata covering 0-9, decimal point, comma and minus sign, I removed all other punctuation marks and ke
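A sketch of that filtering step: keep only those entries of a punc or numbers file whose characters all appear in the target character set. The allowed set below (digits, decimal point, comma, minus sign) follows the digits example above. The function name and the sample entries are hypothetical.

```python
def filter_patterns(lines, allowed_chars):
    """Keep entries made up entirely of characters from allowed_chars."""
    allowed = set(allowed_chars)
    return [line for line in lines if line and set(line) <= allowed]


digits_charset = "0123456789.,-"
punc_entries = ["-", ",", "!", "(", ".", "?"]
print(filter_patterns(punc_entries, digits_charset))
```

The filtered entries can then be written back as the punc or numbers file used for building the dawg files.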

Re: [tesseract-ocr] Re: some questions about lstm training

2019-01-24 Thread Shree Devi Kumar
>currently I am in the tesseract directory, *I can not find training folder under this directory.* All source files were moved to tesseract/src. You will find training folder under it. *src/training/lstmtraining* should work without install. >*mgr_.Init(traineddata_path.c_str()):Error:Assert fail

Re: [tesseract-ocr] Evaluating Tesseract with new domain-specific documents

2019-01-25 Thread Shree Devi Kumar
also see https://github.com/impactcentre/ocrevalUAtion https://github.com/Shreeshrii/ocr-evaluation-tools https://github.com/tesseract-ocr/test/tree/master/unlvtests On Fri, Jan 25, 2019 at 5:17 PM Lorenzo Bolzani wrote: > This is an option if you want to consider missing/extra chars too: >

Re: [tesseract-ocr] Training without font files

2019-01-26 Thread Shree Devi Kumar
Check out https://github.com/OCR-D/ocrd-train On Sat, 26 Jan 2019, 13:36 Hello, > > I’m trying to train Tesseract 4 using images (and associated box files). I > can’t pinpoint the font name and prefer to avoid sourcing the font itself. > > I’m currently trying to train on MacOS High Sierra, but

Re: [tesseract-ocr] Tesseract Crashes for Spanish Language

2019-01-28 Thread Shree Devi Kumar
You have not mentioned which traineddata file you are using. >It works with '-l spa', but when I do '--psm 6', it crashes. Please share the image. Also note the commands used and their output. On Tue, Jan 29, 2019 at 6:33 AM Pablo Andres Araya Melo wrote: > I updated tesseract from github, I n

Re: [tesseract-ocr] How to use fine tuning for training?

2019-01-28 Thread Shree Devi Kumar
combine_tessdata -o ./tessdata/eng_new.traineddata \ ~/tesstutorial/engtuned_from_eng/eng.lstm \ You need to extract eng.lstm from tessdata_best On Tue, 29 Jan 2019, 09:37 易鑫 Hello,everyone: > > Now I want to recognize the character in the table*,y*ou can find > the table sample in the at
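
A minimal sketch of the extraction step mentioned in this reply, assuming the documented `combine_tessdata -e` form (extract a named component from a traineddata file). The tessdata_best location is a hypothetical placeholder; the destination follows the path in the quoted command. The function prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: extract the LSTM model component from a tessdata_best traineddata file.
# $1 = source traineddata (placeholder path), $2 = destination .lstm file.
extract_lstm() {
  set -- combine_tessdata -e "$1" "$2"
  if [ "${DRY_RUN:-1}" = 1 ]; then
    # Default: just show the command that would be run.
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical source path; the destination mirrors the quoted command.
extract_lstm \
  "$HOME/tessdata_best/eng.traineddata" \
  "$HOME/tesstutorial/engtuned_from_eng/eng.lstm"
```

The extracted eng.lstm is then the file to pass to lstmtraining via --continue_from when fine tuning.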

Re: [tesseract-ocr] Tesseract Crashes for Spanish Language

2019-01-29 Thread Shree Devi Kumar
> I am using spa.tessdata >>> What do you mean with commands used and their output? >>> >>> On Mon, Jan 28, 2019 at 10:45 PM Shree Devi Kumar >>> wrote: >>> >>>> You have not mentioned which traineddata file you are using. >>>

Re: [tesseract-ocr] How to training lstm model on this occasion

2019-01-29 Thread Shree Devi Kumar
Please see https://github.com/Shreeshrii/tessdata_shreetest/blob/master/makedata-digits.sh https://github.com/Shreeshrii/tessdata_shreetest/blob/master/finetune-digits.sh https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text On Tue, Jan 29, 2019 at 2:02 PM 易鑫 wro

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-29 Thread Shree Devi Kumar
Finetune with your specific font - see eg. below which uses IMPACT font. #!/bin/bash time ~/tesseract/src/training/tesstrain.sh \ --fonts_dir /usr/share/fonts \ --lang eng --linedata_only \ --noextract_font_properties \ --langdata_dir ~/langdata \ --tessdata_dir ~/tessdata \ --fontlis

Re: [tesseract-ocr] Training without font files

2019-01-29 Thread Shree Devi Kumar
Train previously, and > seem to have an issue running even the training example. I receive issues > with make and also ascii encoding errors (likely from the included python > script). Might you have advice for accomplishing my initial goal without > the helper app? > > On Jan 26,

Re: [tesseract-ocr] Training Tesseract OCR

2019-01-29 Thread Shree Devi Kumar
Have you tried using the amh.traineddata for tesseract 4.0? https://github.com/tesseract-ocr/tesseract/wiki/Data-Files On Wed, Jan 30, 2019 at 12:40 PM Getachew Abebe wrote: > hello my friendsi am trying to train the Tesseract Engine for amharic > language but can't train it > any one pleas

Re: [tesseract-ocr] Tesseract with Thai language

2019-01-30 Thread Shree Devi Kumar
> I am able to extract the Thai characters perfectly on Windows environment whereas when I extract the same on Ubuntu I found spaces between the characters in the extracted text. What are the exact versions of tesseract in both environments? `tesseract -v` Also, which traineddata file are you usi

Re: [tesseract-ocr] My confusion about "Fine Tuning for ± a few characters"

2019-01-30 Thread Shree Devi Kumar
> it says "*Modify**langdata/eng/eng.training_text to include some samples of ±."* *That is part of a training tutorial, where the goal is to add a new character **± to the eng.traineddata so that it can be recognized by the finetuned traineddata.* It is only an example. You have to modify it b

Re: [tesseract-ocr] Box file layout for training tesseract4

2019-01-30 Thread Shree Devi Kumar
AFAIK the textline option for box files (WordStr) has NOT been implemented. The workaround has been to use the bounding box for the whole line for every character on a line. Ref: ocrd-train project Example: च 0 0 1965 128 0 त् 0 0 1965 128 0 व 0 0 1965 128 0 ा 0 0 1965 128 0 र 0 0 1965 128 0 ि 0
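
The workaround above can be sketched as a small shell helper. This is only an illustration, not the ocrd-train script itself: the function name is hypothetical, and it assumes the text has already been split into units (one per input line, with combining marks such as the virama kept attached to their base character) and that the line image's width and height are known.

```shell
#!/bin/sh
# Sketch: write one box entry per unit, each using the bounding box of the
# whole line image (left=0, bottom=0, right=width, top=height, page=0).
# $1 = image width in pixels, $2 = image height in pixels.
# Units are read from stdin, one per line; empty lines are skipped.
line_boxes() {
  width=$1
  height=$2
  while IFS= read -r unit; do
    [ -n "$unit" ] || continue
    printf '%s 0 0 %d %d 0\n' "$unit" "$width" "$height"
  done
}

# Example with the first units and image size from the message above.
printf '%s\n' 'च' 'त्' 'व' | line_boxes 1965 128
```

Every unit gets the same coordinates, which matches the layout in the example: the trainer only needs the line's text in order, not per-character positions.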

Re: [tesseract-ocr] Box file layout for training tesseract4

2019-01-30 Thread Shree Devi Kumar
also see https://github.com/tesseract-ocr/tesseract/blob/cfa787d976007f5866ce25fbd8e2a0223fc40fda/src/ccstruct/boxread.cpp#L165 https://github.com/tesseract-ocr/tesseract/blob/3ac33d59aeb93fc9dab13874a64ab0b73690d5eb/src/ccmain/applybox.cpp#L36 On Wed, Jan 30, 2019 at 5:15 PM Shree Devi Kumar

Re: [tesseract-ocr] Re: How to do lstm training using box/tiff files?

2019-01-31 Thread Shree Devi Kumar
lstm training using box/tiff files is NOT supported. Use tesstrain.sh with a UTF8 training_text and fonts. On Thu, Jan 31, 2019 at 3:04 PM Kristóf Horváth wrote: > Oh i see, but i dont know what you mean by this: you can use the master > branch,latest code. I compiled the latest version on my c

Re: [tesseract-ocr] Question about "Failed loading language"

2019-02-01 Thread Shree Devi Kumar
try with --tessdata-dir /usr/local/share/tessdata/ On Fri, Feb 1, 2019 at 12:29 PM nampyo hong wrote: > [image: tesseract.PNG] > When I was running tesseract 3.0.4, there was no problem. > > I tried to install tesseract 4.0.0 in ubuntu 16.04 by building it from > source, but there was an issue.

Re: [tesseract-ocr] Tesseract Crashes for Spanish Language

2019-02-01 Thread Shree Devi Kumar
AVX2 > Found AVX > Found SSE > > This was installed from github, and tessdata comes from > https://github.com/tesseract-ocr/tessdata/blob/master/spa.traineddata > > Thank you! > > On Tue, Jan 29, 2019 at 1:09 PM Shree Devi Kumar > wrote: > >> >this works ok

Re: [tesseract-ocr] Tesseract Crashes for Spanish Language

2019-02-01 Thread Shree Devi Kumar
desktop. On Fri, Feb 1, 2019 at 6:15 PM PA wrote: > Are those test data for Spanish language? > > Also I can not give error message as tesseract crashes making the desktop > to reboot. Do you know a way to save to text file? > > El vie., 1 de feb. de 2019 09:39, Shree Devi

Re: [tesseract-ocr] normalisation failed for string error

2019-02-01 Thread Shree Devi Kumar
Looks like two maatraas together or a maatraa followed by a vedic accent - does not meet Indic normalization rules. What training text are you using? On Fri, Feb 1, 2019 at 5:58 PM Prabhakar Tayenjam wrote: > What is causing this error and what are the possibles fixes?? > > Normalization failed for

Re: [tesseract-ocr] Re: normalisation failed for string error

2019-02-01 Thread Shree Devi Kumar
Use training_text from langdata_lstm which has larger training text used for LSTM training (for tessdata_best and tessdata_fast). On Fri, Feb 1, 2019 at 7:14 PM Prabhakar Tayenjam wrote: > This happens everytime I use tesstrain.sh. I use a training text combining > the default provided in the la
