Re: [tesseract-ocr] Re: To make traineddata file non-trainable

2021-02-24 Thread Jennil Thiyam
Hi Shree, so by running this command, the model will be in its integer/fast version?
On Wed, Feb 24, 2021 at 10:27 AM shree wrote:
> You can create an integer/fast version of traineddata which cannot be used
> as START_MODEL for further training.
>
> `combine_tessdata -c myfile.traineddata`
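A minimal sketch of the step shree describes, assuming `combine_tessdata` is on the PATH. As far as I know, `-c` rewrites the given file in place, so the hypothetical helper `make_fast_copy` below (not part of Tesseract) works on a copy and leaves the original float model untouched. The file names are placeholders.

```shell
# make_fast_copy SRC DST  (hypothetical helper, not part of Tesseract)
# Copies SRC to DST, then converts the LSTM component of DST to integer.
make_fast_copy() {
    src=$1
    dst=$2
    cp -- "$src" "$dst" || return 1
    # -c compacts the LSTM model inside the traineddata file to int.
    combine_tessdata -c "$dst"
}

# Usage (placeholder paths):
#   make_fast_copy myfile.traineddata myfile_fast.traineddata
```

Keeping the float original around matters if you ever want to continue training yourself, since only the converted copy is meant for distribution.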

[tesseract-ocr] To make traineddata file non-trainable

2021-02-22 Thread Jennil Thiyam
Does anyone have any idea how to make the traineddata file non-trainable, that is, unusable for fine-tuning by others? -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails f

[tesseract-ocr] Training Text size

2020-05-27 Thread Jennil Thiyam
Hi everyone, does anyone know how large the training text actually needs to be (perhaps in number of words)? For example, for the Bengali traineddata (ben), the training text is 34.7 MB (for the Tesseract LSTM version), but for Assamese (asm) the training text is only 140 KB (this is also for tess

[tesseract-ocr] working of Tesseract OCR

2019-11-08 Thread Jennil Thiyam
Does anyone have any links that describe in detail how Tesseract works with LSTM, such as which feature extraction techniques it uses? Please let me know.

[tesseract-ocr] Preprocessing Tools

2019-10-03 Thread Jennil Thiyam
Hi Shree, are there any tools associated with Tesseract that we can use for preprocessing images? Please advise.

Re: [tesseract-ocr] Tesseract OCR 4 paper

2019-09-11 Thread Jennil Thiyam
tutorial2016/6ModernizationEfforts.pdf> >, #7 > > <https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf> > have >information about LSTM integration in Tesseract 4.0. > > > On Wed, Sep 11, 2019 a

Re: [tesseract-ocr] Tesseract OCR 4 paper

2019-09-11 Thread Jennil Thiyam
Shree, do you have any other links that talk about how LSTM works in Tesseract OCR?
On Wed, Sep 11, 2019 at 6:33 PM Shree Devi Kumar wrote:
> https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#documentation
>
> On Wed, Sep 11, 2019 at 6:29 PM Jennil Thiyam

[tesseract-ocr] Tesseract OCR 4 paper

2019-09-11 Thread Jennil Thiyam
Does anyone have a link that describes how Tesseract 4 works? I found a paper that covers the processing steps of Tesseract 3, but could not find any research paper that describes Tesseract 4. Please let me know.

[tesseract-ocr] Can I add new traineddata in the repository, for my language. like officially

2019-08-07 Thread Jennil Thiyam
Is it possible to add new traineddata to the repository, so that everyone who knows the language can use it?

[tesseract-ocr] Is there any option that the lan.training_text is in the form of image?

2019-08-05 Thread Jennil Thiyam
I did fine-tuning by adding some words that contain the new characters I want. What I want to know now: when we OCR a document that is a scanned image rather than computer-generated print, the accuracy drops. So I thought that if we also trained the engine on scanned images, then the accuracy won't be dropp

Re: [tesseract-ocr] Re: Box file for testing data

2019-06-13 Thread Jennil Thiyam
Thanks, I will check it out.
On Thu, Jun 13, 2019 at 9:46 PM Jingjing Lin wrote:
> I think this link might be helpful although I didn't succeed for some
> reason:
> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging
>
> On Thursday, June 13, 2019 at 8:57:43 AM UTC-4, Jennil Thiy

[tesseract-ocr] Box file for testing data

2019-06-13 Thread Jennil Thiyam
Let's say I have a file "test.tiff" that I want to OCR. Can we get the box file for this data? I know we get a box file when creating training data, but what I want is to see how the model performs segmentation on my test data. I want to know this because I have some character w
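One way to get a box file for an arbitrary image is to run recognition with the `makebox` config. The sketch below assumes that config file is present in the install's tessdata/configs directory; `boxes_for` is a hypothetical helper name, and the boxes reflect what the engine recognized rather than a separate segmentation pass.

```shell
# boxes_for IMAGE LANG  (hypothetical helper)
# Runs tesseract with the "makebox" config; output goes to <image base>.box.
boxes_for() {
    image=$1
    lang=$2
    base=${image%.*}   # test.tiff -> test
    tesseract "$image" "$base" -l "$lang" makebox
}

# Usage:
#   boxes_for test.tiff ben    # should write test.box next to the image
```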

Re: [tesseract-ocr] Bounding box

2019-06-09 Thread Jennil Thiyam
> Bye
>
> Lorenzo
>
> On Sun, 9 Jun 2019 at 10:50, Jennil Thiyam <thiyamjen...@gmail.com> wrote:
>
>> ই 110 4657 137 4701 0
>> ম্ফা 131 4660 191 4693 0
>> ল 185 4660 217 4689 0
>> , 217 4654 226 4667 0
>>   226 4650 240 4689

[tesseract-ocr] Bounding box

2019-06-09 Thread Jennil Thiyam
ই 110 4657 137 4701 0
ম্ফা 131 4660 191 4693 0
ল 185 4660 217 4689 0
, 217 4654 226 4667 0
  226 4650 240 4689 0
জু 240 4650 277 4689 0
ন 269 4660 298 4689 0
  298 4660 316 4689 0
১ 316 4660 332 4689 0
৩ঃ 334 4661 376 4688 0
  376 4655 394 4701 0
হৌ 394 4655 441 4701 0
জি 436 4660 482 4701 0
ক 477
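Each box line has the form `glyph left bottom right top page`, with the origin at the bottom-left of the image. A small sketch for inspecting such data, assuming `awk` is available: the hypothetical helper `box_sizes` prints the width and height of every box, counting fields from the right so that lines whose glyph is a space (only five fields after splitting) still parse.

```shell
# box_sizes [FILE...]  (hypothetical helper)
# Prints "glyph width height" per box line; reads stdin when no file is given.
box_sizes() {
    awk 'NF >= 5 {
        n = NF
        # Count fields from the right so a space glyph (5 fields) still parses.
        left = $(n-4); bottom = $(n-3); right = $(n-2); top = $(n-1)
        glyph = (n > 5) ? $1 : "<space>"
        print glyph, right - left, top - bottom
    }' "$@"
}

# Usage:
#   box_sizes test.box
```

For the first line above, `ই 110 4657 137 4701 0`, this prints `ই 27 44`. Neighbouring boxes whose left and right edges overlap, such as ম্ফা (ending at 191) and ল (starting at 185), are easy to spot this way.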

Re: [tesseract-ocr] Scripts are almost same but different language

2019-06-06 Thread Jennil Thiyam
ince sanskrit > training text did not have samples of all letters. I then also added any > new characters that I wanted to add. > > On Thu, 6 Jun 2019, 14:01 Jennil Thiyam, wrote: > >> Manipuri language has been using two scripts, among them one is bengali >> script wit

[tesseract-ocr] Scripts are almost same but different language

2019-06-06 Thread Jennil Thiyam
The Manipuri language uses two scripts, one of which is the Bengali script with some extra characters (these extra characters are also used in the Assamese script). As Tesseract makes it possible to train an already existing model by adding some extra characters, I tried performing fine tuni

Re: [tesseract-ocr] ben.traineddata & Bengali.traineddata

2019-06-04 Thread Jennil Thiyam
ained on bengali, Bengali with ben, asm and English. > > > https://github.com/tesseract-ocr/langdata_lstm/blob/master/script/Bengali.langs.txt > > > On Tue, 4 Jun 2019, 17:11 Jennil Thiyam, wrote: > >> What is the difference between ben.traineddata and Bengali.tra

[tesseract-ocr] ben.traineddata & Bengali.traineddata

2019-06-04 Thread Jennil Thiyam
What is the difference between ben.traineddata and Bengali.traineddata? Some characters are not recognised by ben.traineddata but are recognised by Bengali.traineddata.

Re: [tesseract-ocr] The extra character is not recognized after fine tuning training

2019-05-31 Thread Jennil Thiyam
Thank you so much for all your help.
On Fri, May 31, 2019 at 11:26 PM Jennil Thiyam wrote:
> So, your suggestion is to perform the fine-tuning process on this
> bengali.traineddata?
>
> On Fri, May 31, 2019 at 11:16 PM Shree Devi Kumar wrote:
>
>> https://github.com/tesserac

Re: [tesseract-ocr] The extra character is not recognized after fine tuning training

2019-05-31 Thread Jennil Thiyam
So, your suggestion is to perform the fine-tuning process on this bengali.traineddata?
On Fri, May 31, 2019 at 11:16 PM Shree Devi Kumar wrote:
> https://github.com/tesseract-ocr/tessdata_best/tree/master/script
>
> On Fri, 31 May 2019, 23:01 Jennil Thiyam, wrote:
>
>> Wha

Re: [tesseract-ocr] The extra character is not recognized after fine tuning training

2019-05-31 Thread Jennil Thiyam
my guess is that the vowel maatraa that > go on both sides of consonants may have been encoded as separate rather > than one. > > > > > > > On Fri, 31 May 2019, 22:40 Jennil Thiyam, wrote: > >> SHree Devi, any suggestions? >> >> On Fri, May 31, 2019

Re: [tesseract-ocr] The extra character is not recognized after fine tuning training

2019-05-31 Thread Jennil Thiyam
Shree Devi, any suggestions?
On Fri, May 31, 2019 at 5:45 PM Jennil Thiyam wrote:
> Assamese used some extra characters which are not used in Bengali and our
> language, so I want to modify in ben.traineddata. I tried using
> asm.traineddata, it recognizes the character that I wante

Re: [tesseract-ocr] The extra character is not recognized after fine tuning training

2019-05-31 Thread Jennil Thiyam
; > On Fri, 31 May 2019, 16:58 Shree Devi Kumar, wrote: > >> Please try the asm.traineddata which is for Assamese which is written in >> Bengali script. >> >> On Fri, 31 May 2019, 16:55 Jennil Thiyam, wrote: >> >>> How come this character is in here??? I

[tesseract-ocr] The extra character is not recognized after fine tuning training

2019-05-31 Thread Jennil Thiyam
I followed the procedure (the one described in the Tesseract 4 training guide for fine tuning to add a plus-minus sign to eng.traineddata) to train ben.traineddata (by adding one character that is not in the Bengali alphabet, more than 30 times, to ben.training_text). After creating starter trainin

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110 Illegal instruction (core dumped)

2019-05-30 Thread Jennil Thiyam
The character that I added is still not recognized. Do you have any idea why?
On Thu, May 30, 2019 at 3:56 PM Shree Devi Kumar wrote:
> You have to convert the checkpoint to traineddata - run lstmtraining with
> --stop_training flag
>
> On Thu, May 30, 2019 at 3:44 PM Jennil Thi

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110 Illegal instruction (core dumped)

2019-05-30 Thread Jennil Thiyam
traineddata (that I got as an output of tesstrain.sh) or is it the old traineddata?
On Thu, May 30, 2019 at 3:56 PM Shree Devi Kumar wrote:
> You have to convert the checkpoint to traineddata - run lstmtraining with
> --stop_training flag
>
> On Thu, May 30, 2019 at 3:44 PM Jennil Thi
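The conversion step can be sketched as below. This assumes, as I understand the training tutorial, that `--traineddata` takes the new starter traineddata produced by tesstrain.sh (the one carrying the updated unicharset), not the old one. `finalize_model` is a hypothetical helper, and the checkpoint name in the usage comment is a guess based on the `--model_output` prefix used elsewhere in this thread.

```shell
# finalize_model CHECKPOINT STARTER_TRAINEDDATA OUTPUT  (hypothetical helper)
# Converts a training checkpoint into a usable .traineddata file.
finalize_model() {
    checkpoint=$1
    starter=$2    # new starter traineddata written by tesstrain.sh
    output=$3
    lstmtraining --stop_training \
        --continue_from "$checkpoint" \
        --traineddata "$starter" \
        --model_output "$output"
}

# Usage (paths guessed from this thread; wa_checkpoint is an assumption):
#   finalize_model ~/tesstutorial/train_wa/wa_checkpoint \
#       ~/tesstutorial/train_wa/ben/ben.traineddata \
#       ~/tesstutorial/train_wa/wa.traineddata
```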

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110 Illegal instruction (core dumped)

2019-05-30 Thread Jennil Thiyam
: > --traineddata ~/tesstitorial/train_wa/ben/ben.traineddata \
>
> Typo there: "tutorial", check the spelling.
>
> On Thu, 30 May 2019, 12:05 Jennil Thiyam, wrote:
>
>> lstmtraining --model_output ~/tesstutorial/train_wa/wa \
>> --continue_from ~/tesstutorial/train_wa/ben.lstm \

[tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110 Illegal instruction (core dumped)

2019-05-29 Thread Jennil Thiyam
lstmtraining --model_output ~/tesstutorial/train_wa/wa \
  --continue_from ~/tesstutorial/train_wa/ben.lstm \
  --traineddata ~/tesstitorial/train_wa/ben/ben.traineddata \
  --old_traineddata tessdata/best/ben.traineddata \
  --train_listfile ~/tesstutorial/train_wa/ben.training_files.txt \
  --max

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-29 Thread Jennil Thiyam
I added only one character, about 30 times, to ben.training_text (and only at the end of the original training text), which means I didn't modify the original ben.training_text much. Why am I still getting "normalization failed" for many of the words which are already in the original

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-29 Thread Jennil Thiyam
One simple question that confuses me every time: it is about setting the TESSDATA_PREFIX environment variable. Which path should I set? */usr/local/share/tessdata* (but here I could not find any .traineddata; if this is the path, can I just copy the .traineddata to this folder "tess
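For Tesseract 4, as far as I know, TESSDATA_PREFIX should name the tessdata directory itself, i.e. the folder that directly contains the .traineddata files. A small sketch under that assumption: the hypothetical helper `use_tessdata_dir` only exports the variable after checking that the language file is really there.

```shell
# use_tessdata_dir DIR LANG  (hypothetical helper)
# Exports TESSDATA_PREFIX=DIR only if DIR/LANG.traineddata exists.
use_tessdata_dir() {
    dir=$1
    lang=$2
    if [ -f "$dir/$lang.traineddata" ]; then
        export TESSDATA_PREFIX="$dir"
    else
        echo "no $lang.traineddata in $dir" >&2
        return 1
    fi
}

# Usage:
#   use_tessdata_dir /usr/local/share/tessdata ben && tesseract --list-langs
```

If the check fails, copying the .traineddata file into that folder (or pointing the variable at the folder that already holds it) should be enough.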

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-28 Thread Jennil Thiyam
aineddata for LSTM training of language 'ben' Run 'lstmtraining' command to continue LSTM training for language 'ben' *No error. Will this training data be good? I am asking because I feel lots of things are not happening the way they should, like it say

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-28 Thread Jennil Thiyam
t of > fonts. > > It all depends on what you want to accomplish with training. > > On Tue, May 28, 2019 at 5:59 PM Jennil Thiyam > wrote: > >> training/tesstrain.sh \ >> --fonts_dir /c/Windows/Fonts \ >> --tessdata_dir ./tessdata \ >> --training_tex

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-28 Thread Jennil Thiyam
s \
  --exposures "0" \
  --fontlist "Arial" \
  "Arial Unicode MS" \
  "Calibri" \
  "Courier New" \
  --output_dir ~/tesstutorial/araeval
Can anyone tell me why we need to create this eval data? I mean, it is going to be the same as the training data.
On Tue, Ma

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-27 Thread Jennil Thiyam
Tue, May 28, 2019 at 10:26 AM Jennil Thiyam > wrote: > >> do you mean to change only the path of this old traineddata(in the >> command, that I underlined) to the path of ben.traineddata(that i am going >> to download from tessdata_best)? or do i need to perform the

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-27 Thread Jennil Thiyam
the estimated time it will take for 1500 iterations? Thank you On Mon, May 27, 2019 at 10:20 PM Shree Devi Kumar wrote: > You can download ben.traineddata from tessdata_best in a different > location and use that as part of lstmtraining command > > On Mon, May 27, 2019 at 6:24 PM J

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-27 Thread Jennil Thiyam
els can be used for finetuning. > > On Mon, May 27, 2019 at 4:25 PM Jennil Thiyam > wrote: > >> yes...i extracted with the command combine_tessdata >> >> On Mon 27 May, 2019, 4:23 PM Shree Devi Kumar > wrote: >> >>> Has /ben_extract/ben.lstm be

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-27 Thread Jennil Thiyam
Yes, I extracted it with the combine_tessdata command.
On Mon 27 May, 2019, 4:23 PM Shree Devi Kumar wrote:
> Has /ben_extract/ben.lstm been extracted from
> /usr/share/tesseract-ocr/4.00/tessdata/ben.traineddata ?
>
> On Mon, May 27, 2019 at 2:55 PM Jennil Thiyam wrote:
>
>> I got
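The extraction step looks roughly like this. Since integer/fast models cannot be used as a starting point for further training, the source should be the float model from tessdata_best; the traineddata under /usr/share is probably the fast variate, though that is an assumption about the distro package. `extract_lstm` is a hypothetical wrapper around `combine_tessdata -e`.

```shell
# extract_lstm BEST_TRAINEDDATA OUT_LSTM  (hypothetical helper)
# Pulls the LSTM component out of a traineddata file for use with --continue_from.
extract_lstm() {
    combine_tessdata -e "$1" "$2"
}

# Usage (paths as used elsewhere in this thread):
#   extract_lstm tessdata/best/ben.traineddata ~/tesstutorial/train_wa/ben.lstm
```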

[tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-05-27 Thread Jennil Thiyam
I got an error while trying to perform fine tuning. The command I used is below:
lstmtraining --model_output /model \
  --continue_from /ben_extract/ben.lstm \
  --traineddata /tesstutorial_output/ben/ben.traineddata \
  --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/ben.traineddata \
  --tr

[tesseract-ocr] Fine tuning

2019-05-23 Thread Jennil Thiyam
I want to perform fine tuning on ben.traineddata by adding one character. The documentation says that for fine tuning we only need to add the desired characters to langdata/ben/ben.training_text, but the folder 'ben' also contains other files such as ben.config, ben.params_model,ben.word.bigram,

Re: [tesseract-ocr] Facing some problem in understanding fine tuning

2019-05-22 Thread Jennil Thiyam
> > On Wed, 22 May 2019, 18:16 Jennil Thiyam, wrote: > >> The layout of writing is in some manner in the ben_training.txt, (i have >> attached the sshot). could u please explain how do i put my character in >> this file >> >> On Wed, May 22, 2019 at 5:35 PM Je

Re: [tesseract-ocr] Facing some problem in understanding fine tuning

2019-05-22 Thread Jennil Thiyam
The text in ben_training.txt is laid out in a particular way (I have attached a screenshot). Could you please explain how I should put my character in this file?
On Wed, May 22, 2019 at 5:35 PM Jennil Thiyam wrote:
> we used bengali script, but with one extra character, that is what i want
>

Re: [tesseract-ocr] Facing some problem in understanding fine tuning

2019-05-22 Thread Jennil Thiyam
lready existing ben.traindata > model. > > What character do you want to add? > > You should be able to do the same process as the plus-minus training for > one character as shown in example for English. > > On Wed, May 22, 2019 at 1:51 PM Jennil Thiyam > wrote: > >

[tesseract-ocr] Facing some problem in understanding fine tuning

2019-05-22 Thread Jennil Thiyam
I am planning to perform fine-tuning training on ben.traineddata. The documented procedure says that "The training requires a new unicharset/recoder, optional language models, and the old traineddata file containing the old unicharset/recoder." Here I have the old traineddata, bu

Re: [tesseract-ocr] After fine tunning training, how do i run on the new model?

2019-05-04 Thread Jennil Thiyam
I am new to Tesseract and Ubuntu, so please forgive me if my question does not make sense. Will it work if I put this new model inside the tessdata folder that is in the Program Files folder?
On Sat, May 4, 2019 at 2:44 PM Shree Devi Kumar wrote:
> Depends on where you keep the new tra

Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-23 Thread Jennil Thiyam
hat rather > than the normal ones. > > Are you doing cut and paste from some word processor? This is probably > causing all the errors... > > > > 2018-07-23 9:48 GMT+02:00 Jennil Thiyam : > >> I tried using Lohit Bengali and here is the command >> >> /usr

Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-23 Thread Jennil Thiyam
121: Meera
122: Mitra Mono
...
Lohit Bengali is in the list, so please tell me why the error occurs. Do I need to do something else too?
On Sun, Jul 22, 2018 at 11:00 AM, Shree Devi Kumar wrote:
> See https://github.com/tesseract-ocr/tesseract/wiki/Fonts
>
> On Sun 22 Jul, 2018, 8:20 PM Jenn

Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-22 Thread Jennil Thiyam
it-bengali”.exp0.box does not exist or is not readable
ERROR: /tmp/tmp.pBWa4wRHmt/ben/ben.“lohit-bengali”.exp0.box does not exist or is not readable
So, please tell me: are all the fonts in this FONTS folder already installed for Tesseract or not?
On Sun, Jul 22, 2018 at 7:15 AM, Je

Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-22 Thread Jennil Thiyam
Oh, sorry for the mistake... I put two dashes, but it still says unrecognised.
On Sun 22 Jul, 2018, 4:27 PM Shree Devi Kumar, wrote:
> needs two dashes,
>
> On Sun, Jul 22, 2018 at 12:29 PM wrote:
>
>> hello again, i modified the error in the way you said and there is no
>> error. but now the same e