[tesseract-ocr] How to create training data in Tesseract 5.3.0 the tesstrain.sh way?

2023-10-22 Thread
Hello, everyone: As we know, in Tesseract 5.0 we can use tesstrain.sh to create training data, but in Tesseract 5.3.0 the tesstrain.sh script has been removed. The guide says: "* bash scripts is unsupported/abandoned for Tesseract 5. Please use python scripts from tesstrain repo

[tesseract-ocr] Re: command line and code get different result

2020-06-22 Thread
Does anyone know? Thanks. On Friday, June 19, 2020 at 5:27:55 PM UTC+8, 易鑫 wrote: > > Hello, everyone: > > I use the command line and C++ code to recognize the text in an image, but > get different results. The image file and traineddata file are in the > attachments. > I used this command

[tesseract-ocr] command line and code get different result

2020-06-19 Thread
Hello, everyone: I use the command line and C++ code to recognize the text in an image, but get different results. The image file and traineddata file are in the attachments. I used this command: tesseract test_20200617_1_1854.png stdout -l regular_rgb_layer The result is -6X201, which is correct

[tesseract-ocr] Some questions about ocrd-train

2020-06-03 Thread
Hello, everyone: Currently I use the "https://github.com/tesseract-ocr/tesstrain" project to train on my own dataset. I use this command: make unicharset lists training MODEL_NAME=foo TESSDATA=./tessdata_best GROUND_TRUTH_DIR=./data/foo-ground-truth PSM=7 I encounter one error that tell

[tesseract-ocr] How to use tesseract to recognize verification code

2020-06-01 Thread
Hello, everyone: I want to train Tesseract to recognize verification codes. I have 2 samples and labels. Since I use my own dataset, I use this project: https://github.com/tesseract-ocr/tesstrain . The length of a verification code is always 4. How can I use this information to improve the accuracy
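One way the fixed length could be used is as a post-processing check rather than as a training input. The sketch below is only an illustration under assumptions: `pick_fixed_length` is a hypothetical helper, not part of Tesseract or tesstrain, and the candidate list is assumed to come from running recognition several times (for example on differently preprocessed copies of the image), each with some confidence score.

```python
def pick_fixed_length(candidates, length=4):
    """Return the best candidate text that has exactly `length` characters.

    candidates: list of (text, confidence) pairs; how they are produced is
    an assumption of this sketch (e.g. several OCR runs on the same image).
    Returns None when no candidate has the expected length.
    """
    # Strip surrounding whitespace such as the trailing newline OCR output often has.
    cleaned = [(text.strip(), conf) for text, conf in candidates]
    # Keep only results whose length matches the known code length.
    valid = [(text, conf) for text, conf in cleaned if len(text) == length]
    if not valid:
        return None
    # Among valid results, prefer the one with the highest confidence.
    return max(valid, key=lambda item: item[1])[0]


print(pick_fixed_length([("AB12", 0.7), ("AB1", 0.9), ("XY34\n", 0.8)]))
```

A result of `None` signals that every run produced the wrong number of characters, which can be treated as a rejection instead of returning a code that is known to be wrong.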

Re: [tesseract-ocr] How can I use Tesseract for training with my own image dataset

2020-06-01 Thread
ain > > On Mon, Jun 1, 2020 at 11:16 AM 易鑫 > > wrote: > >> Hello, everyone: >> As we all know, since Tesseract v4.0 it can generate the dataset >> automatically. But for me, the accuracy is not as good as I expected, so I want >> to use my own image dat

[tesseract-ocr] How can I use Tesseract for training with my own image dataset

2020-05-31 Thread
Hello, everyone: As we all know, since Tesseract v4.0 it can generate the dataset automatically. But for me, the accuracy is not as good as I expected, so I want to use my own image dataset for training. Does Tesseract v4.0.0 support this? Thanks in advance. -- You received this message becau

Re: [tesseract-ocr] Can anyone tell the improvement in 5.0.0-alpha

2020-04-03 Thread
mpatibility should > use 4.1. > Those who wish to improve tesseract should use master branch. > > Zdenko > > > On Tue, 31 Mar 2020 at 8:54, 易鑫 wrote: > >> Hello, everyone: >> I can see the 5.0.0-alpha version on GitHub. Can someone >> tell me its improv

[tesseract-ocr] Can anyone tell the improvement in 5.0.0-alpha

2020-03-30 Thread
Hello, everyone: I can see the 5.0.0-alpha version on GitHub. Can someone tell me what it improves? I currently use version 4.0.0; if 5.0.0-alpha is a great improvement, I will switch to the latest version. Thanks in advance.

Re: [tesseract-ocr] Re: Can I use this way for fine tuning?

2019-04-23 Thread
set up your START_MODEL > as chi_sim. > > On Sun, Apr 21, 2019 at 19:34 易鑫 wrote: > >> No, I want to fine-tune using actual images. >> >> suraa syss wrote on Fri, Apr 19, 2019 at 7:07 PM: >> >>> you want to prepare unicharset before lstm training >>> >>>

Re: [tesseract-ocr] Re: Can I use this way for fine tuning?

2019-04-21 Thread
No, I want to fine-tune using actual images. suraa syss wrote on Fri, Apr 19, 2019 at 7:07 PM: > you want to prepare unicharset before lstm training > > On Thursday, 18 April 2019 14:49:20 UTC+5:30, yixinl...@gmail.com wrote: >> >> Hello, everyone: >> I have used Tesseract 4.0 to train a chi_sim model, but

Re: [tesseract-ocr] Can I use this way for fine tuning?

2019-04-18 Thread
Is anybody here? Can someone help me? Thanks a lot. Written on Thu, Apr 18, 2019 at 5:19 PM: > Hello, everyone: > I have used Tesseract 4.0 to train a chi_sim model, but the result is > not as good as I expected, so I thought of one way of fine-tuning. > > 1.src/training/tesstrain.sh --fonts_dir /usr/share/font

Re: [tesseract-ocr] How to choose the stop condition of LSTM training

2019-04-18 Thread
> You can improve the script with an iteration and stop if the improvement > over the best result is below a threshold for a few epochs. I found no real > advantage in doing this as the training is quite fast and I have no problem > in letting it run while I do something else. > > >
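The stopping rule quoted above (stop when the improvement over the best result stays below a threshold for a few epochs) can be sketched as a small check. This is a minimal illustration, not code from the thread: `should_stop`, `min_improvement` and `patience` are hypothetical names, and the error values are assumed to be eval error rates collected by hand, for example one per lstmeval run.

```python
def should_stop(eval_errors, min_improvement=0.1, patience=3):
    """Decide whether training can stop.

    eval_errors: eval error rates, one per checkpoint, oldest first.
    Stops when the last `patience` checkpoints improved on the earlier
    best by less than `min_improvement`.
    """
    # Not enough history yet to compare recent checkpoints with earlier ones.
    if len(eval_errors) <= patience:
        return False
    best_before = min(eval_errors[:-patience])   # best result before the window
    recent_best = min(eval_errors[-patience:])   # best result inside the window
    return best_before - recent_best < min_improvement


print(should_stop([10.0, 8.0, 6.0, 5.95, 5.97, 5.96]))
print(should_stop([10.0, 8.0, 6.0, 5.0, 4.0, 3.0]))
```

The first call reports a plateau and the second reports that the error is still falling. Both default values are arbitrary and would need tuning for a real data set.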

Re: [tesseract-ocr] How to choose the stop condition of LSTM training

2019-04-17 Thread
Thank you very much. >>"Train for a few epochs (100 or 1000 depending on how much data you have), stop it and check with lstmeval if the *eval score* is improving. Restart the training adding 100/1000 to the max_iterations and continue from the previous model and repeat until the eval score stops t

[tesseract-ocr] question about "target_error_rate" parameter

2019-04-11 Thread
Hello, everyone: I want to ask a question about the "*--target_error_rate*" parameter. Is the target error rate based on the training data set or the eval data set? I think "--target_error_rate" should be based on the eval data set, but I find that even if I do not generate eval data, this command can r

Re: [tesseract-ocr] Re: Tesseract on VS

2019-04-11 Thread
1. A few of the images are not very clean, have different sorts of noise, and characters are misread, like Z as 2, B as 8 and S as 5. So I just want to know what you are all using for Tesseract, like: You should clean the noise before you use Tesseract. Please refer to https://github.com/tesse

Re: [tesseract-ocr] Questions about recognize Chinese characters

2019-04-08 Thread
Does someone know the reason? Thanks. 易鑫 wrote on Mon, Apr 8, 2019 at 10:42 AM: > Hello, everyone: > > Good day! I have trained a chi_sim model to recognize > Chinese characters. You can find the sample image in the attached file. > > I find that the two Chinese characters are a little

Re: [tesseract-ocr] How to train tesseract with ancient Greek character

2019-04-08 Thread
I have tried, but it still cannot recognize " Φ ". 易鑫 wrote on Mon, Apr 8, 2019 at 9:44 AM: > Thanks a lot. I will try. > > Shree Devi Kumar wrote on Thu, Apr 4, 2019 at 10:05 PM: > >> You don't need to add *"GFS Artemisia" as it may not have the Chinese >> characters.* >

[tesseract-ocr] Questions about recognize Chinese characters

2019-04-07 Thread
Hello, everyone: Good day! I have trained a chi_sim model to recognize Chinese characters. You can find the sample image in the attached file. I find that the two Chinese characters are slightly connected, and the image is very clear. But Tesseract regards them as one Chinese character, s

Re: [tesseract-ocr] Tesseract on VS

2019-04-07 Thread
I am using Tesseract on Visual Studio 2017. Shobhit Kapil wrote on Fri, Apr 5, 2019 at 10:29 PM: > Hi All, > > Is there anyone who is using Tesseract on Windows with Visual Studio? > > If yes, I have a few questions to ask. > > > Thanks, > Shobhit

Re: [tesseract-ocr] How to train tesseract with ancient Greek character

2019-04-07 Thread
port it. > Verify in generated tif files that it is getting rendered. > > On Thu, Apr 4, 2019 at 7:25 AM 易鑫 wrote: > >> Does anybody know how to solve this problem? Thanks. >> >> 易鑫 wrote on Wed, Apr 3, 2019 at 12:37 PM: >> >>> Hello, everyone: >>> >>>

Re: [tesseract-ocr] How to train tesseract with ancient Greek character

2019-04-03 Thread
Does anybody know how to solve this problem? Thanks. 易鑫 wrote on Wed, Apr 3, 2019 at 12:37 PM: > Hello, everyone: > > I want to recognize the content of the table image. (You can find it > in the attached file.) It contains Chinese characters and some English > letters, the most troublesome

Re: [tesseract-ocr] Getting Unexpected result

2019-04-01 Thread
> [image: Screenshot from 2019-03-29 21-45-24.png] > > > On Friday, March 29, 2019 at 9:14:40 AM UTC+5:30, 易鑫 wrote: >> >> Which traineddata did you use? >> try to use tesseract ***.jpg stdout -l eng_best --psm 7 >> >> >> Heeramani Prasad wrote on 2019

Re: [tesseract-ocr] What does special character "|" mean?

2019-03-28 Thread
My training text does not have the "|" character. Is this character reserved? 易鑫 wrote on Fri, Mar 29, 2019 at 1:34 PM: > I'm sorry, I don't quite understand what you mean. Shall I ignore the "|" > character? > > Shree Devi Kumar wrote on Fri, Mar 29, 2019 at 12:45 PM: >

Re: [tesseract-ocr] What does special character "|" mean?

2019-03-28 Thread
I'm sorry, I don't quite understand what you mean. Shall I ignore the "|" character? Shree Devi Kumar wrote on Fri, Mar 29, 2019 at 12:45 PM: > The training text that I used for replace layer has the | character. > > On Fri, 29 Mar 2019, 08:51 易鑫, wrote: > >> Hello,eve

Re: [tesseract-ocr] Getting Unexpected result

2019-03-28 Thread
Which traineddata did you use? Try: tesseract ***.jpg stdout -l eng_best --psm 7 Heeramani Prasad wrote on Fri, Mar 29, 2019 at 2:31 AM: > I am using tesseract 4.0.0-beta.1. > I am using > > tesseract filename.jpg - --psm 6 > > as command to get output. But I get wrong output. Input file in image in >

Re: [tesseract-ocr] Re: Replace top layers, output class count and recoder (chi_sim)

2019-03-27 Thread
"The number of classes is ignored (only there for compatibility with TensorFlow) as the actual number is taken from the unicharset."* *So the number is ignored I think.* 易鑫 于2019年3月27日周三 下午4:35写道: > Hello, > Did you fix this problem, I am encounter this problem now? I have tried

[tesseract-ocr] Re: Replace top layers, output class count and recoder (chi_sim)

2019-03-27 Thread
Hello, Did you fix this problem? I am encountering it now. I have tried many ways, including your method. Thanks. On Thursday, January 24, 2019 at 10:55:37 PM UTC+8, Shiming He wrote: > > Hi group, > > I'm trying to retrain top layers from the chi_sim tessdata_best model > using Tesseract 4.0.0. Combine_tessdata sa

Re: [tesseract-ocr] The problem of training eng + chi_sim

2019-03-25 Thread
And how many lines of training_text would be best? The total number of my characters is no more than 100. 易鑫 wrote on Tue, Mar 26, 2019 at 9:50 AM: > Okay. Thank you very much. > But will overfitting happen with 36000 iterations? > > Shree Devi Kumar wrote on Mon, Mar 25, 2019 at 11:43 PM: > >> 36000 itera

Re: [tesseract-ocr] The problem of training eng + chi_sim

2019-03-25 Thread
al/chi_sim_train/chi_sim/chi_sim.traineddata \ >> --append_index 5 --net_spec '[Lfx192 O1c1]' \ >> --train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \ >> --max_iterations 3 >> >> On Mon, Mar 25, 2019 at 4:14 PM 易鑫 wrote: >> >>

Re: [tesseract-ocr] The problems about training eng+chinese

2019-03-20 Thread
Thank you very much. Shree Devi Kumar wrote on Wed, Mar 20, 2019 at 2:20 PM: > On Wed, Mar 20, 2019 at 9:57 AM 易鑫 wrote: > >> Thank you very much for your reply, your result is pretty good. >> >> You are right, I want to limit my unicharset. >> I want to ask you a few question

Re: [tesseract-ocr] The problems about training eng+chinese

2019-03-19 Thread
Thank you very much for your reply; your result is pretty good. You are right, I want to limit my unicharset. I want to ask you a few questions: 1. What pre-processing have you done? Only binarisation, rotation and deskewing? 2. Your result, chi_sim_tuned.txt, also contains some characters that

Re: [tesseract-ocr] The problems about training eng+chinese

2019-03-19 Thread
Thanks for your advice, I will try. Shree Devi Kumar wrote on Tue, Mar 19, 2019 at 10:01 PM: > You are using a number of Japanese, Korean and Traditional Chinese fonts > for training. Try without them. > > On Tue, Mar 19, 2019 at 4:19 PM 易鑫 wrote: > >> Hello, everyone: >> I want

Re: [tesseract-ocr] How to use tesseract for these images

2019-03-18 Thread
You mean recognizing the characters on the green background? Shailesh Barve wrote on Tue, Mar 19, 2019 at 4:42 AM: > I would like to read the meter reading from the attached image. > I tried several preprocessing steps, but tesseract is not able to read the meter > readings from the attached images. > > Pls help.

Re: [tesseract-ocr] Improving accuracy on recognition Tesseract 4.3.1

2019-03-15 Thread
The latest Tesseract version is 4.0.0. How did you get version 4.3.1? Alberto Andreotti wrote on Mon, Feb 25, 2019 at 8:38 AM: > Hello, > > You can try the OCR preprocessing in spark NLP, if you are on Python or > Scala. > Try to use the scaling option. > > Alberto. > > On Feb 24, 2019 2:21 PM, "'Nenad Koce

Re: [tesseract-ocr] How to choose a suitable threshold for Binarisation

2019-03-12 Thread
input, especially with scans with poor lighting and bad contrast. >> >> there is a nice comparison and yet another method in >> >> https://www.google.com/url?sa=t&source=web&rct=j&url=https://arxiv.org/pdf/1609.08078&ved=2ahUKEwjasPKLgvLgAhXLJVAKHdrHCV8QFj

Re: [tesseract-ocr] Re: How to choose a suitable threshold for Binarisation

2019-03-06 Thread
Thank you very much. Written on Thu, Mar 7, 2019 at 1:26 AM: > Hey, > I found this thesis that proposes a way to calculate the threshold and > presents other methods: > https://d-nb.info/989481123/34 > It would help. > Best regards > > On Wednesday, 20 February 2019 03:56:33 UTC+1, 易鑫 wrote

Re: [tesseract-ocr] How to train two different language using tesseract 4.0

2019-02-27 Thread
On Wed, 27 Feb 2019 at 9:03, 易鑫 wrote: > >> Hello, everyone: >> Now I want to recognize the text in the images, but the images do not >> contain only one language. They contain English and Chinese, so I want to >> recognize them simultaneously. >> In that case, I wil

[tesseract-ocr] How to train two different language using tesseract 4.0

2019-02-27 Thread
Hello, everyone: Now I want to recognize the text in the images, but the images do not contain only one language. They contain English and Chinese, so I want to recognize them simultaneously. In that case, I should train a model that covers both English and Chinese, right? And how to train the model i

[tesseract-ocr] How to choose a suitable threshold for Binarisation

2019-02-19 Thread
Hello, everyone: The Tesseract wiki "https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality" explains the importance of binarisation to the final recognition result. I have tried many methods for choosing a suitable threshold, but I have not found a perfect one. Does an
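For readers looking for a starting point, one widely used global thresholding method is Otsu's, which picks the threshold that maximises the variance between the two resulting pixel classes. The thread does not say which methods were tried, so the sketch below is only an illustration of that technique in plain Python; `otsu_threshold` is a hypothetical helper operating on a flat list of 8-bit grayscale values, and it is not a claim about what Tesseract does internally.

```python
def otsu_threshold(pixels):
    """Return a global threshold t for 8-bit grayscale values (0..255).

    Pixels <= t form one class and pixels > t the other; t maximises the
    between-class variance (Otsu's method).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break               # bright class empty, nothing left to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Synthetic example: dark "ink" around 20-30, bright "paper" around 200-210.
sample = [20] * 50 + [30] * 50 + [200] * 50 + [210] * 50
t = otsu_threshold(sample)
binary = [0 if p <= t else 255 for p in sample]
```

A single global threshold tends to work on evenly lit scans; for uneven lighting, locally adaptive methods are usually a better fit.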

Re: [tesseract-ocr] Re: How to do lstm training using box/tiff files?

2019-01-31 Thread
output_dir ~/tesstutorial/engtrain >>>>> >>>>> >>>>>> === Starting training for language 'eng' >>>>> >>>>> /usr/bin/language-specific.sh: line 1175: FONTS: unbound variable >>>>> >>>>> >

Re: [tesseract-ocr] Re: How to do lstm training using box/tiff files?

2019-01-31 Thread
, Thursday, at 10:27:17 UTC+1, 易鑫 wrote: >> >> Thanks for your reply. I have already done LSTM training on Ubuntu >> successfully, but the result is not as good as I expected and I did not use >> my tiff/box files, so I want to add more samples, that'

[tesseract-ocr] How to write .unicharambigs file?

2019-01-31 Thread
Hello, everyone: I have trained a new LSTM model in my project, but the result is not as good as I expected. I notice that some characters are often mistaken in my results. I learned that adding some rules to .unicharambigs can reduce the mistakes. Is that right? I extracted eng.traineddata and got the

Re: [tesseract-ocr] Re: How to do lstm training using box/tiff files?

2019-01-31 Thread
>>> --noextract_font_properties --langdata_dir ~/langdata --tessdata_dir >>> ../tessdata --output_dir ~/tesstutorial/engtrain >> >> >>> === Starting training for language 'eng' >> >> /usr/bin/language-specific.sh: line 1175: FONTS: unbound variab

[tesseract-ocr] How to do lstm training using box/tiff files?

2019-01-30 Thread
Hello, everyone: I used the Tesseract 3.05 engine before and have lots of tiff and box files. Now I want to use the Tesseract 4.0.0 engine for LSTM training. How can I train with the tiff/box files in the new engine? Thanks in advance.

[tesseract-ocr] My confusion about "Fine Tuning for ± a few characters"

2019-01-30 Thread
Hello, everyone: I am confused about "Fine Tuning for ± a few characters". The wiki (https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters) says "Modify langdata/eng/eng.training_text to include some samples of ±."

Re: [tesseract-ocr] Tesseract with Thai language

2019-01-29 Thread
Please upload your image file so I can try it in my environment. Written on Mon, Jan 28, 2019 at 3:50 PM: > Hi, > > I am using Tesseract OCR v 4 for extracting text from a Thai language > image file. I am able to extract the Thai characters perfectly on Windows > environment whereas when I extract the same on Ubuntu

Re: [tesseract-ocr] Why can't recognize characters in C++ code?

2019-01-29 Thread
The problem has been solved. The reason is that the binary image should use white as the background and black as the foreground. Maybe there were some issues during my training process. 易鑫 wrote on Wed, Jan 30, 2019 at 10:38 AM: > Hello, everyone: > I have fine-tuned a new LSTM model named engtuned.trainedd
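The fix described above (white background, black foreground) can be applied as a small normalisation step before the image is handed to the recognizer. The sketch below is an illustration under a stated assumption: the background covers at least half of the pixels, so an image that is mostly black is taken to be inverted. `ensure_dark_on_light` is a hypothetical helper working on a binary image stored as rows of 0 and 255 values; it is not part of the Tesseract API.

```python
def ensure_dark_on_light(binary_rows):
    """Return a copy of a binary image with black text on a white background.

    binary_rows: list of rows, each a list of 0 (black) or 255 (white).
    Assumption: the background covers at least half of the pixels.
    """
    flat = [p for row in binary_rows for p in row]
    white = sum(1 for p in flat if p == 255)
    if white * 2 >= len(flat):
        # Already mostly white: treat it as dark-on-light and copy unchanged.
        return [list(row) for row in binary_rows]
    # Mostly black: treat it as light-on-dark and invert every pixel.
    return [[255 - p for p in row] for row in binary_rows]


inverted_input = [[0, 0, 0], [0, 255, 0]]
print(ensure_dark_on_light(inverted_input))
```

The majority-pixel heuristic can fail on images where text fills most of the area, so in that case the polarity should be fixed where the image is produced.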

[tesseract-ocr] Why can't recognize characters in C++ code?

2019-01-29 Thread
Hello, everyone: I have fine-tuned a new LSTM model named engtuned.traineddata. I have tested it on the command line and it works. On the command line: PS D:\git_workspace\TableOcrRecognition\Tesserac-OCR-Train> *tesseract D:\git_workspace\TableOcrRecognition\TableDetect\TableDetect\Data\Tmp\2.pn

[tesseract-ocr] How to training lstm model on this occasion

2019-01-29 Thread
Hello, everyone: Now I want to recognize some characters using Tesseract. The candidate characters *only contain "0123456789.-ABLQX".* Because there are only 17 different characters, I want to get very high accuracy. I read the wiki, but I don't quite understand it. It seems that I should use Fine
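With such a small character set, one simple safeguard is to check every ground-truth line (and, later, every recognition result) against the allowed characters, since a stray character in the training text widens the unicharset. The sketch below is an illustration only; `ALLOWED` is taken from the question, while `unexpected_chars` is a hypothetical helper and not part of any Tesseract tool.

```python
# The 17 candidate characters listed in the question.
ALLOWED = set("0123456789.-ABLQX")


def unexpected_chars(line, allowed=ALLOWED):
    """Return the sorted characters in `line` that are not in the allowed set."""
    return sorted(set(line) - allowed)


for text in ["-6X201", "12O.5", "AB-3.7Q"]:
    bad = unexpected_chars(text)
    print(text, "ok" if not bad else "unexpected: " + "".join(bad))
```

The middle sample contains the letter O rather than the digit 0, which is the kind of labelling slip this check is meant to catch before training.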

Re: [tesseract-ocr] How to use fine tuning for training?

2019-01-28 Thread
> combine_tessdata -o ./tessdata/eng_new.traineddata \ > ~/tesstutorial/engtuned_from_eng/eng.lstm \ > > You need to extract eng.lstm from tessdata_best > > On Tue, 29 Jan 2019, 09:37 易鑫 >> Hello, everyone: >> >> Now I want to recognize the character in the table*,

Re: [tesseract-ocr] Re: some questions about lstm training

2019-01-24 Thread
Thank you so much, I will try. Aodren BARY wrote on Fri, Jan 25, 2019 at 2:03 PM: > Yes you need to install some fonts > You can find a tutorial here > http://www.linuxandubuntu.com/home/how-to-install-microsoft-fonts-in-ubuntu-linux > You can find the fonts that tesseract uses for this command in the > scrip

Re: [tesseract-ocr] Re: some questions about lstm training

2019-01-24 Thread
*I did not run the command:* src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \ --noextract_font_properties --langdata_dir ../langdata \ --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain From the wiki, I thought it was optional. Now I think this co

[tesseract-ocr] some questions about lstm training

2019-01-24 Thread
Hello, everyone: I am a new user of Tesseract 4.0. I am following the instructions (https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00) to train an LSTM model. By the way, my environment is Ubuntu 16.04 and I compiled Tesseract 4.0 myself. I ran into some problems. I follow

Re: [tesseract-ocr] train tesseract 4.0

2019-01-20 Thread
Tesseract v4. Aodren BARY wrote on Fri, Jan 18, 2019 at 10:17 PM: > Thanks for the answer, > I didn't try your solution, I will do it. Do you use Tesseract v3 or v4? > > On Friday, 18 January 2019 10:35:26 UTC+1, 易鑫 wrote: >> >> Hello, I am also a new user of Tesseract. I have train

Re: [tesseract-ocr] Re: what does "batch.nochop" mean?

2019-01-15 Thread
o 2019 12:06:04 UTC+1, 易鑫 wrote: >> >> Hello, >> I use this command to generate a box file: >> >> tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] >> batch.nochop makebox >> >> and if I omit batch.nochop >> tesseract

[tesseract-ocr] what does "batch.nochop" mean?

2019-01-04 Thread
Hello, I use this command to generate a box file: tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox and if I omit batch.nochop: tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] makebox The box file can also be generated, the content is
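When comparing box files produced with and without batch.nochop, it can help to load them into a structure and diff the entries. The sketch below assumes the usual makebox line layout of a symbol followed by left, bottom, right and top pixel coordinates and a page number, separated by spaces. `Box` and `parse_box_line` are hypothetical names and the sample line is made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box-file line of the form 'symbol left bottom right top page'."""
    # Split from the right so the symbol itself is kept whole.
    parts = line.rstrip("\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError("not a box line: %r" % line)
    symbol, *numbers = parts
    return Box(symbol, *(int(n) for n in numbers))


box = parse_box_line("A 10 20 30 40 0")   # made-up sample values
print(box.symbol, box.right - box.left, box.top - box.bottom)
```

Parsing both files this way and comparing the resulting lists shows which symbols or coordinates, if any, differ between the two commands.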

Re: [tesseract-ocr] Unable to recognize image

2019-01-03 Thread
Try this command: tesseract Capture.jpg stdout -l eng --psm 7 --oem 3 prcrAGER Z wrote on Fri, Jan 4, 2019 at 2:31 AM: > I am using version 4.0.0.20181030 to recognize the text on the image > (Capture.jpg). But it only saves a weird symbol in the text file > (Capture.txt). The command line I used is: >

Re: [tesseract-ocr] Really poor performance with decimal numbers

2019-01-02 Thread
Please upload the images so that we can try with them. Written on Thu, Jan 3, 2019 at 1:21 PM: > Hello everybody, > > did anybody get this "solved"? I played a lot with upscaling, gamma > changes, contrast etc. but I keep on getting errors, in particular missing > decimal points even though the point seem to be v

Re: [tesseract-ocr] Re: Shapeclustering Not Responding

2019-01-01 Thread
The issue has been resolved. The reason is that the "font_properties" file must be encoded as UTF-8. 易鑫 wrote on Sat, Dec 29, 2018 at 5:50 PM: > I use tesseract-ocr-w64-setup-v4.0.0.20181030 and jTessBoxEditor-2.2.0 > on Windows 10. I use 3 images for testing; you can find them in

Re: [tesseract-ocr] Re: Shapeclustering Not Responding

2018-12-29 Thread
I use tesseract-ocr-w64-setup-v4.0.0.20181030 and jTessBoxEditor-2.2.0 on Windows 10. I use 3 images for testing; you can find them in the attached file sample.zip. 1. I use jTessBoxEditor to merge the 3 images. The merged file name is "langyp.fontyp.exp0.tif" 2. generate box file tesseract l