Re: [tesseract-ocr] Re: Training Sinhala fonts using Tesseract 4.0 version

2019-10-06 Thread Shree Devi Kumar
tif > and lstmf files. Am I right? so where should I place this script file in > tesseract? or should I directly run this before the generation of the > box,tif and lstmf files? Please correct me if my understanding is wrong. > > Thank you. > > On Sat, Oct 5, 2019 at 10:55 PM Shre

Re: [tesseract-ocr] German lang support

2019-10-08 Thread Shree Devi Kumar
see https://github.com/UB-Mannheim/tesseract/wiki/Install-additional-language-and-script-models On Tue, Oct 8, 2019 at 3:09 PM Leopold Hamminger wrote: > Thank you, Zdenko > > I downloaded tesseract and installed it on my PC running Win 10. tesseract > --version returns: v5.0-0-alpha.20190708. -

Re: [tesseract-ocr] How to effeciently extend the training_text file?

2019-10-10 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951 This was for Devanagari and Indic languages. Also see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-text-requirements On Thu, Oct 10, 2019 at 12:45 PM peter bence wrote: > I'm wor

Re: [tesseract-ocr] Why are the results of lstmeval and tesseract different?

2019-10-10 Thread Shree Devi Kumar
I suggest that you open an issue in the tesstrain repo. The makefile does training from scratch. Is that what you wanted? Do you have a large enough training text - how many lines? How many iterations for training? Eval Char error rate=133.3, Word error rate=96.875 That is a very high error rate.

Re: [tesseract-ocr] CentOS 8 package?

2019-10-10 Thread Shree Devi Kumar
@AlexanderP may be able to build one. On Thu, Oct 10, 2019 at 8:09 PM 'Mario Trojan' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > Dear Community, > > what's the best way to use tesseract under CentOS 8 right now? > > Currently we're using the EPEL package under CentOS 7.7 (3.04.00)

Re: [tesseract-ocr] CentOS 8 package?

2019-10-10 Thread Shree Devi Kumar
opensuse.org/projects/home:Alexander_Pozdnyakov/public_key > dnf install tesseract > dnf install tesseract-langpack-deu > > > чт, 10 окт. 2019 г. в 18:31, Shree Devi Kumar : > >> @AlexanderP maybe able to build one. >> >> On Thu, Oct 10, 2019 at 8:09 PM 'Mario

Re: [tesseract-ocr] Input in Arabic Eastern Numbers and Output in Arabic Western Numbers

2019-10-14 Thread Shree Devi Kumar
Replace AEN in your box files with AWN and rerun training, using the original tif files. On Mon, Oct 14, 2019, 12:16 Mobeen Ali wrote: > Hello everyone! I'm stuck with a problem of creating a traineddata file > that reads numerals in arabic and gives output in english numerals. > >- Input = A
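Editor's sketch (not part of the thread): one possible way to do the box-file replacement suggested above. It assumes that "AEN" refers to the Arabic-Indic digits U+0660 to U+0669 and "AWN" to ASCII 0-9; the function names are made up for illustration. The tif files are left untouched, as the reply advises.

```python
# Assumption: AEN = Arabic-Indic digits U+0660..U+0669, AWN = ASCII digits 0-9.
AEN_TO_AWN = {0x0660 + i: str(i) for i in range(10)}


def convert_box_line(line: str) -> str:
    """Return one box-file line with Arabic-Indic digits replaced by ASCII digits.

    Box coordinates are already ASCII, so only the glyph/text field changes.
    """
    return line.translate(AEN_TO_AWN)


def convert_box_text(text: str) -> str:
    """Apply convert_box_line to every line of a box file, keeping line endings."""
    return "".join(convert_box_line(line) for line in text.splitlines(keepends=True))
```

After rewriting the box files this way, the lstmf files would need to be regenerated from the original images before training is rerun.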

Re: [tesseract-ocr] Re: Training Sinhala fonts using Tesseract 4.0 version

2019-10-14 Thread Shree Devi Kumar
sin) folder. But > still same problem is there by giving warning message and normalization > failed message [1] > > > > On Mon, 14 Oct 2019, 18:34 Shree Devi Kumar, wrote: > >> What about text in langdata_lstm? >> >> On Mon, Oct 14, 2019 at 2:44 PM Isurianurad

Re: [tesseract-ocr] Re: Training Sinhala fonts using Tesseract 4.0 version

2019-10-15 Thread Shree Devi Kumar
ue, 15 Oct 2019, 12:21 Shree Devi Kumar, wrote: > >> Check if you also have an installed version of tesstrain.sh? >> >> >> On Tue, Oct 15, 2019, 11:26 Isurianuradha96 >> wrote: >> >>> I changed as you mentioned but giving the same warning as the

Re: [tesseract-ocr] Re: Please advise

2019-10-16 Thread Shree Devi Kumar
There are also third party GUI interfaces for tesseract. The ones that I have used at times are vietocr and gimagereader. On Wed, Oct 16, 2019, 17:13 Leopold Hamminger wrote: > I was new a few weeks ago and found tesseract quite easy to use. However, > you should know the basics of console inpu

Re: [tesseract-ocr] tesseract data language model sources

2019-10-17 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951 On Fri, Oct 18, 2019 at 9:10 AM 'abram stern' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > Hi tesseract community, > > I'm working on a research project about OCR and I'm wondering where the > includ

Re: [tesseract-ocr] tesseract data language model sources

2019-10-17 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/langdata_lstm has the files used. On Fri, Oct 18, 2019 at 9:39 AM Shree Devi Kumar wrote: > See > https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951 > > > On Fri, Oct 18, 2019 at 9:10 AM 'abram stern' via tesser

Re: [tesseract-ocr] Form Recognizer using Ocr

2019-10-17 Thread Shree Devi Kumar
You can try with uzn files. See https://jsoma.github.io/kull/#/ On Fri, Oct 18, 2019 at 11:03 AM Rahul Dochak wrote: > Hi All, > > I have a task and I could see a way to approach this but i do not know > how to ,what i am trying to do is this: > I want to make a form recogniser and then extr

Re: [tesseract-ocr] Form Recognizer using Ocr

2019-10-17 Thread Shree Devi Kumar
Rahul > > On Friday, October 18, 2019 at 11:16:54 AM UTC+5:30, shree wrote: >> >> You can try with uzn files. See https://jsoma.github.io/kull/#/ >> >> On Fri, Oct 18, 2019 at 11:03 AM Rahul Dochak >> wrote: >> >>> Hi All, >>> >>

Re: [tesseract-ocr] Segmentation Fault Core Dumped during LSTM training

2019-10-18 Thread Shree Devi Kumar
Check your netspec. Does it meet the required VGSL specs? See wiki for details and netspec used for various languages. On Fri, Oct 18, 2019, 15:07 Shubham Gupta wrote: > Hi All > > I am training Tesseract for Perso-Arabic languages using my custom > dataset. I get *Segmentation fault-core dumped

Re: [tesseract-ocr] OCR results are different on different OS (Linux and Windows)

2019-10-22 Thread Shree Devi Kumar
Please check Tesseract version on both with tesseract -v Share an example image and the output you received on Mac OS and Ubuntu. On Wed, Oct 23, 2019, 00:46 Yu Wang wrote: > Hi, I experienced the same as Karan reported. I first installed tesseract > on my macbook pro, then later on an Ubunt

Re: [tesseract-ocr] OCR results are different on different OS (Linux and Windows)

2019-10-23 Thread Shree Devi Kumar
blicly. > > On Wed, Oct 23, 2019 at 3:10 AM Shree Devi Kumar > wrote: > >> Please check Tesseract version on both with >> >> tesseract -v >> >> Share an example image and the output you received on Mac OS and Ubuntu. >> >> >> >>

Re: [tesseract-ocr] WordStr box file format?

2019-10-24 Thread Shree Devi Kumar
Looks ok. The dimensions need to match the bounding box in your tif. You can extract unicharset from the training text also. On Thu, Oct 24, 2019, 15:00 Adam Funk wrote: > Hi, > > I'm a bit confused by some of the comments in the tesseract > documentation, issues, and wiki about the addition of

Re: [tesseract-ocr] Getting the model output from my trained data

2019-10-25 Thread Shree Devi Kumar
You are mixing legacy Tesseract training and LSTM training. The traineddata and other files from jtessboxeditor seem to be for the legacy engine. On Fri, Oct 25, 2019, 11:18 'ZenMaster181' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > Hi, I am new to this training tesseract. > I

Re: [tesseract-ocr] Re: Getting the model output from my trained data

2019-10-25 Thread Shree Devi Kumar
If you have the box and tiff files from jtessboxeditor, you can use https://github.com/tesseract-ocr/tesstrain for training. However, training is needed only in special cases. Have you tried with existing traineddata files? On Fri, Oct 25, 2019 at 1:02 PM 'ZenMaster181' via tesseract-ocr < tesseract

Re: [tesseract-ocr] Re: Getting the model output from my trained data

2019-10-25 Thread Shree Devi Kumar
Please see https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--4alpha--network-specification-for-tessdata_fast Your traineddata file is of legacy format. It will NOT work

Re: [tesseract-ocr] Re: Getting the model output from my trained data

2019-10-25 Thread Shree Devi Kumar
You are mixing many different approaches for training. If you have box/tiff pairs, use the makefile from tesseract-ocr/tesstrain. If you want to train from text and fonts, use tesseract-ocr/tesseract/src/training/tesstrain.sh On Fri, Oct 25, 2019 at 2:57 PM 'ZenMaster181' via tesseract-ocr < tesseract-ocr@g

Re: [tesseract-ocr] Corrupt eng.traineddata output file?

2019-10-27 Thread Shree Devi Kumar
Please open an issue in the tesstrain repository and include test data for your issue to be reproduced eg. a sample of files that fail (text that leads to the 4c lstmf files) On Sun, Oct 27, 2019 at 10:16 PM J Adam Funk wrote: > I have partly figured out what's wrong. From the 9 matching *.

Re: [tesseract-ocr] Force Tesseract to do individual character OCR only

2019-10-28 Thread Shree Devi Kumar
Have you tried to OCR it character by character, using an appropriate psm? On Tue, Oct 29, 2019, 09:42 Dave Wood wrote: > I am trying to use Tesseract to OCR screen shots from various Windows > applications. So essentially the data is a random collection of letters > and numbers, not written words

Re: [tesseract-ocr] Tesstutorial fails to generate lstmf files

2019-11-05 Thread Shree Devi Kumar
It fails with latest code. See https://github.com/tesseract-ocr/tesseract/issues/2748 Try with an older commit. On Tue, Nov 5, 2019, 11:32 Khangaroo wrote: > Hi. I'm trying to create a fine-tuned model for Tesseract, but the > tesstrain.sh script always appears to fail on "Phase E: Generatin

Re: [tesseract-ocr] Tesstutorial fails to generate lstmf files

2019-11-05 Thread Shree Devi Kumar
Google search about uzn, there are utilities to generate them. On Tue, Nov 5, 2019, 14:20 Shree Devi Kumar wrote: > It fails with latest code. > > See https://github.com/tesseract-ocr/tesseract/issues/2748 > > > Try with an older commit. > > On Tue, Nov 5, 2019, 11:32 K

Re: [tesseract-ocr] Re: How to process PDF files line by line with tesseract

2019-11-10 Thread Shree Devi Kumar
See https://stackoverflow.com/questions/34981144/split-text-lines-in-scanned-document On Sat, Nov 9, 2019 at 3:10 AM Aaron Stewart wrote: > If you have any suggestions on how to split input images into individual > text lines, I would appreciate it. I am able to use Python and OpenCV, but > I

Re: [tesseract-ocr] Extracting character locations from image with tesseract

2019-11-13 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/issues/2580#issuecomment-553393800 for an example On Wed, Nov 13, 2019 at 6:17 PM Kljuka Kljucavnicar wrote: > Hi, > I would like to OCR an image with a single word on it and output .hocr > file with coordinates of each character on that image (norm

Re: [tesseract-ocr] Re: tesseract in Windows

2019-11-15 Thread Shree Devi Kumar
Process the same invoice image on both platforms with the tesseract command line and compare those results. Post the results of tesseract --version on each. State which traineddata file you are using - language is eng, but is it best/fast or tessdata… etc. On Fri, Nov 15, 2019 at 5:18 PM MATHANKUMAR m

Re: [tesseract-ocr] Different Outputs on creating my own traineddata

2019-11-16 Thread Shree Devi Kumar
tesseract --version Share output of above command on each platform. Share an image and output on each platform. On Sun, Nov 17, 2019 at 12:54 PM Mobeen Ali wrote: > Hi everyone! > > i have successfully created my own custom traineddata file. I've done the > training on ubuntu OS and it was giv

Re: [tesseract-ocr] Failed to load language 'eng'

2019-11-18 Thread Shree Devi Kumar
You can use --oem 0 and 2 only with the traineddata file from tessdata repo. Those are the only files which also have the legacy models. On Tue, Nov 19, 2019, 11:07 MATHANKUMAR m wrote: > I do facing an issue while using the OCR engine modes 0 & 2. > > > Failed loading language 'eng',Tesseract c

Re: [tesseract-ocr] Failed to load language 'eng'

2019-11-18 Thread Shree Devi Kumar
to work with oem 0,2 values. > what i supposed to do get a response from those values > > On Tuesday, 19 November 2019 12:12:20 UTC+5:30, shree wrote: >> >> You can use --oem 0 and 2 only with the traineddata file from tessdata >> repo. Those are the only files which also ha

Re: [tesseract-ocr] Failed to load language 'eng'

2019-11-18 Thread Shree Devi Kumar
able then provide me. > On Tuesday, 19 November 2019 12:23:37 UTC+5:30, shree wrote: >> >> If you so want, you can copy the legacy model files from the traineddata >> in tessdata repo to another traineddata. >> >> See the combine_tessdata command for unpacking and combini

Re: [tesseract-ocr] Tools required to build ,debug and trace tesseract code on linux

2019-11-19 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/Compiling-%E2%80%93-GitInstallation On Wed, Nov 20, 2019, 12:46 Essam Zaky wrote: > Dears sorry for this basic question > I'm new in Linux world > now i need to build ,debug , and trace tesseract code and see how it's > working step by step in linu

Re: [tesseract-ocr] Re: Tools required to build ,debug and trace tesseract code on linux

2019-11-20 Thread Shree Devi Kumar
https://www.cs.cmu.edu/~gilpin/tutorial/ https://web.eecs.umich.edu/~sugih/pointers/summary.html On Wed, Nov 20, 2019, 16:10 Essam Zaky wrote: > > Thanks Shree > The link describes the build process > ?but what is the IDE will be used to debug and trace the code ,In windows >

Re: [tesseract-ocr] OCR broken characters in images using Tessearact

2019-11-21 Thread Shree Devi Kumar
convert test1.png -despeckle -despeckle -despeckle -despeckle -despeckle -despeckle -despeckle -despeckle -despeckle -despeckle miff:- | textcleaner -f 25 -o 10 - result.png convert -units PixelsPerInch result.png -resample 300 result1.png tesseract result1.png - 27627 uses textcleaner from http:

Re: [tesseract-ocr] OCR broken characters in images using Tessearact

2019-11-22 Thread Shree Devi Kumar
ref: https://imagemagick.org/discourse-server/viewtopic.php?t=33628#p154457 On Sat, Nov 23, 2019 at 11:53 AM lucmaa wrote: > Hi, shree > Why is the option -despeckle repeated so many times in the command > convert? > > On Friday, 22 November 2019 13:17:32 UTC+8, shree wrote: >

Re: [tesseract-ocr] Arabic Text Sort Left to Right

2019-11-23 Thread Shree Devi Kumar
Training for all languages including RTL languages is done in LTR order. See https://github.com/tesseract-ocr/tesseract/issues/2082 and other related issues in github On Sun, Nov 24, 2019 at 1:28 AM Ishak DÖLEK wrote: > Hi; > I create a trainneddata for an Arabic font. > I prepared the ara.train

Re: [tesseract-ocr] Recognizing blurred dots as CJK characters

2019-11-25 Thread Shree Devi Kumar
have you tried `osd` - orientation and script detection? On Mon, Nov 25, 2019 at 8:13 PM Jeetendra Ahuja < jeetendra.ahuja...@gmail.com> wrote: > So before processing a document, we want to rejects ones which are CJK so > I've used Tesseract for this.. It does pretty good job but some times when

Re: [tesseract-ocr] Recognizing blurred dots as CJK characters

2019-11-25 Thread Shree Devi Kumar
Also try with 300 dpi On Mon, Nov 25, 2019 at 9:45 PM Jeetendra Ahuja < jeetendra.ahuja...@gmail.com> wrote: > Nopes, I will do it. Thanks. > > On Monday, November 25, 2019 at 9:48:08 AM UTC-5, shree wrote: >> >> have you tried `osd` - orientation and script detection?

Re: [tesseract-ocr] Re: Tesseract 4.1.0 released

2019-12-07 Thread Shree Devi Kumar
tessdata supports both legacy engine and lstm engine. Tessdata_fast and tessdata_best only support lstm engine. To use tessdata_fast, use oem engine code 1. On command line it is --oem 1. Please look up the corresponding syntax. On Sat, Dec 7, 2019, 14:06 NY C wrote: > Hi, I am using tess-two
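Editor's sketch (not part of the thread): the rule stated above, written as a small lookup. The OEM numbering follows `tesseract --help-oem` (0 legacy only, 1 LSTM only, 2 legacy plus LSTM, 3 default based on what is available); the function name and the repository keys are illustrative only.

```python
# Repositories as described in the reply above.
LEGACY_CAPABLE_REPOS = {"tessdata"}                   # legacy + LSTM models
LSTM_ONLY_REPOS = {"tessdata_fast", "tessdata_best"}  # LSTM models only


def oem_is_usable(repo: str, oem: int) -> bool:
    """Return True if --oem <oem> can work with traineddata taken from <repo>."""
    if oem not in (0, 1, 2, 3):
        raise ValueError(f"unknown OCR engine mode: {oem}")
    if repo in LEGACY_CAPABLE_REPOS:
        return True
    if repo in LSTM_ONLY_REPOS:
        # Modes 0 and 2 need the legacy model, which these files do not contain.
        return oem in (1, 3)
    raise ValueError(f"unknown traineddata repository: {repo}")
```

Wrappers such as tess-two expose the same choice through their own engine-mode setting, so the table applies there as well, assuming the wrapper passes the mode through unchanged.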

Re: [tesseract-ocr] Re: Tesseract 4.1.0 released

2019-12-07 Thread Shree Devi Kumar
ocrEngineMode On Sat, Dec 7, 2019, 14:35 Shree Devi Kumar wrote: > tessdata supports both legacy engine and lstm engine. Tessdata_fast and > tessdata_best only support lstm engine. > > To use tessdata_fast , use oem engine code 1. > > On command line it is --oem 1.

Re: [tesseract-ocr] TrainingTesseract 4.00

2019-12-09 Thread Shree Devi Kumar
text2image is not for use with scanned images. Please see the repo tesseract-ocr/tesstrain for training using images. On Mon, Dec 9, 2019, 15:23 P007 wrote: > Hi, > I want to use tesseract-OCR for Hindi language working with images. after > installation all steps when I tried to execute the com

Re: [tesseract-ocr] Tesseract-OCR giving different results for same image on different systems.

2019-12-16 Thread Shree Devi Kumar
Run tesseract --version on the different systems. Are the traineddata files being used on the different systems the same? Share an image and the different output received in each case. On Mon, Dec 16, 2019, 17:58 adesh gautam wrote: > Hi, > > I am using tesseract-ocr on my images, and i am gett
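Editor's sketch (not part of the thread): a simple way to check whether two traineddata files with the same name are really the same file. It compares size and SHA-256 digest; the helper name is hypothetical and the paths in the usage comment are placeholders.

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """Return (size_in_bytes, sha256_hex) for a file such as eng.traineddata."""
    data = Path(path).read_bytes()
    return len(data), hashlib.sha256(data).hexdigest()


# Usage idea: run this on each system and compare the printed values.
# print(fingerprint("/path/to/tessdata/eng.traineddata"))
```

If the fingerprints differ, the two systems are using different models (for example tessdata versus tessdata_fast), which on its own could explain different OCR output.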

Re: [tesseract-ocr] Retraining Tesseract Word Spotting / Segmentation

2019-12-16 Thread Shree Devi Kumar
Tesseract 4 lstm engine and traineddata work on line images. Character level bounding boxes are not accurate as has been reported in multiple issues. On Mon, Dec 16, 2019, 19:02 Mazzwar wrote: > Supposing I have a dataset of images with bounding boxed words, is it > possible to retrain the word

Re: [tesseract-ocr] Retraining Tesseract Word Spotting / Segmentation

2019-12-16 Thread Shree Devi Kumar
one with an lstm? Thanks > > On Monday, December 16, 2019 at 6:02:51 PM UTC+2, shree wrote: >> >> Tesseract 4 lstm engine and traineddata work on line images. Character >> level bounding boxes are not accurate as has been reported in multiple >> issues. >

Re: [tesseract-ocr] Tesseract-OCR giving different results for same image on different systems.

2019-12-17 Thread Shree Devi Kumar
12:47:28 PM UTC+5:30, shree wrote: >> >> Please check file sizes for eng.traineddata - they maybe different >> versions even though they are called the same. >> >> On Mon, Dec 16, 2019 at 9:06 PM adesh gautam wrote: >> >>> >>> There is the

Re: Ynt: [tesseract-ocr] Re: How to use Tesseract Arabic OCR.

2019-12-18 Thread Shree Devi Kumar
You can try to finetune tessdata_best/script/Arabic.traineddata for Ottoman. If you have line images and their groundtruth transcription, you can use makefile process from tesstrain. See https://github.com/tesseract-ocr/tesstrain/issues/128 Tesseract recognizes images to Unicode code points (UTF8

Re: [tesseract-ocr] Need help in training Tesseract with application images

2019-12-19 Thread Shree Devi Kumar
Please use https://github.com/tesseract-ocr/tesstrain This works on line images and their ground-truth transcription. On Windows, you could install WSL for running the *NIX scripts. On Thu, Dec 19, 2019 at 11:14 AM preeti padalia wrote: > Hi, > > We are using tesseract to perform actions and v

Re: Ynt: [tesseract-ocr] Re: How to use Tesseract Arabic OCR.

2019-12-20 Thread Shree Devi Kumar
Check https://github.com/OpenITI/OCR_GS_Data/tree/master/AzTurkish/kulliyati On Fri, Dec 20, 2019, 03:30 Serkan Taş wrote: > Hi Shree, > > I checked git page you referred and need some time to prepare line images > and their ground-truth transcription. I guess I can but will ta

Re: [tesseract-ocr] Interrupting and restarting lstmtraining

2019-12-23 Thread Shree Devi Kumar
You can create traineddata with the --stop_training flag while lstmtraining continues to run. If you are using the tesstrain makefile, it has a target called traineddata which will generate a traineddata file for each intermediate checkpoint. You can stop and start training but I have a feeling that tra
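Editor's sketch (not part of the thread): how the checkpoint-to-traineddata step could be assembled as a command line. The flags `--stop_training`, `--continue_from`, `--traineddata` and `--model_output` are the ones documented for lstmtraining; the helper name and all paths are placeholders.

```python
def stop_training_command(checkpoint, start_traineddata, model_output):
    """Build an lstmtraining call that converts a checkpoint into a traineddata file.

    This only reads the checkpoint, so it can be issued while another
    lstmtraining process keeps training.
    """
    return [
        "lstmtraining",
        "--stop_training",
        "--continue_from", str(checkpoint),
        "--traineddata", str(start_traineddata),
        "--model_output", str(model_output),
    ]


# Usage idea (paths are hypothetical):
# import subprocess
# subprocess.run(stop_training_command("out/foo_checkpoint",
#                                      "eng/eng.traineddata",
#                                      "out/foo.traineddata"), check=True)
```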

Re: [tesseract-ocr] Re: 18th-century French

2019-12-26 Thread Shree Devi Kumar
Please see the repo tesseract-ocr/tesstrain, specifically wiki pages regarding training for Fraktur. On Fri, Dec 27, 2019, 00:51 Scott M. Sanders wrote: > If you can't see the bad_rep.html, here is a pdf version. > > Le jeudi 26 décembre 2019 14:17:46 UTC-5, Scott M. Sanders a écrit : >> >> >> I

Re: [tesseract-ocr] lstm-unicharset

2019-12-27 Thread Shree Devi Kumar
Run the command combine_tessdata -u eng.traineddata eng. This will unpack all components of the traineddata file, including lstm-unicharset On Fri, Dec 27, 2019, 14:27 Ashwini Nande wrote: > How to generate lstm-unicharset for tesserasct 4? > > -- > You received this message because you are
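Editor's sketch (not part of the thread): the unpack command from the reply above, wrapped so it can be reused for other languages. Note that the trailing `eng.` is the output prefix, so the unicharset ends up in a file named `eng.lstm-unicharset`. The helper names are illustrative.

```python
def unpack_command(traineddata_path, output_prefix):
    """Build the combine_tessdata call that unpacks every component of a traineddata file."""
    return ["combine_tessdata", "-u", str(traineddata_path), str(output_prefix)]


def lstm_unicharset_name(output_prefix):
    """Name of the unpacked LSTM unicharset for a given output prefix (e.g. 'eng.')."""
    return f"{output_prefix}lstm-unicharset"


# Usage idea:
# import subprocess
# subprocess.run(unpack_command("eng.traineddata", "eng."), check=True)
```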

Re: [tesseract-ocr] Re: 18th-century French

2019-12-27 Thread Shree Devi Kumar
Formatting info is not retained in tesseract4. It was available in 3.0x On Fri, Dec 27, 2019, 22:29 Scott M. Sanders wrote: > I added the following code, which has improved the results. I thought that > adding 'alto' would create an xml file with formatting information, but it > didn't work. Is

Re: [tesseract-ocr] Fresh install not recognizing text like before

2020-01-04 Thread Shree Devi Kumar
Please also provide tesseract version information from a machine where it is working. On Sat, Jan 4, 2020 at 1:51 PM Votum V wrote: > I've been using tesseract for a while now to read text from images that I > take with a script for a game I am automating. I recently had to do a fresh > install

Re: [tesseract-ocr] Fresh install not recognizing text like before

2020-01-04 Thread Shree Devi Kumar
2lib/1.0.6 liblz4/1.7.5 > > > > On Saturday, January 4, 2020 at 4:44:39 AM UTC-4, shree wrote: >> >> Please also provide tesseract version information from a machine where it >> is working. >> >> On Sat, Jan 4, 2020 at 1:51 PM Votum V wrote: >> >>>

Re: [tesseract-ocr] Announcement: Python package pytesstrain (Tesseract training helpers)

2020-01-04 Thread Shree Devi Kumar
Thanks for the info. It looks like a helpful set of tools. Please confirm whether this is for training legacy tesseract and which versions of tesseract are compatible with it. On Sun, Jan 5, 2020, 02:22 Wincent Balin wrote: > Hi all, > > I would like to announce pytesstrain, a collection of Tes

Re: [tesseract-ocr] tesseract unable to detect characters in simple two-word image

2020-01-04 Thread Shree Devi Kumar
try --psm 6 ubuntu@tesseract-ocr:~/TEST$ tesseract lao.jpg - Warning: Invalid resolution 0 dpi. Using 70 instead. Estimating resolution as 197 Empty page!! Estimating resolution as 197 Empty page!! ubuntu@tesseract-ocr:~/TEST$ tesseract lao.jpg - --dpi 300 Empty page!! Empty page!! ubuntu@tesserac

Re: [tesseract-ocr] Fresh install not recognizing text like before

2020-01-05 Thread Shree Devi Kumar
ica-1.78.0 >>> libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : >>> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0 >>> Found AVX2 >>> Found AVX >>> Found SSE >>> Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3

Re: [tesseract-ocr] Re: convert a .tiff file to text file

2020-01-06 Thread Shree Devi Kumar
Have you tried OMP_THREAD_LIMIT=1? On Tue, Jan 7, 2020 at 4:18 AM George Varghese wrote: > > reason I want to do this : > > I found that sometime other processes which runs on the same server, gets > an exit code of 255 and does not complete. So If I can limit the usage of > tesseract to 2 core
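Editor's sketch (not part of the thread): setting OMP_THREAD_LIMIT for a single tesseract call from a script, without changing the environment of the calling process. The helper name and the file names in the usage comment are placeholders.

```python
import os


def single_thread_env(base=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set to 1.

    The copy is meant to be passed as env= to subprocess.run so that only
    the tesseract child process is restricted.
    """
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


# Usage idea:
# import subprocess
# subprocess.run(["tesseract", "page.tif", "page"], env=single_thread_env(), check=True)
```

On a shell the equivalent is to prefix the command with the variable assignment.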

Re: [tesseract-ocr] how to use my collected corpus and convert it one line tif

2020-01-07 Thread Shree Devi Kumar
Read your text file line by line and run text2image to create box/tif, similar to the following: text2image --fonts_dir="$unicodefontdir" --text="${linetext}" --strip_unrenderable_words --xsize=2500 --ysize=300 --leading=32 --margin=12 --exposure=0 --font="$fontname" --outputbase="${fontname// /_}.exp0
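Editor's sketch (not part of the thread): one way to script the per-line loop described above. The text2image flags and values are copied from the quoted command. The file naming scheme is invented for illustration because the original output base is cut off in the archive, and the sketch assumes that `--text` expects a file name, so each corpus line is first written to its own small text file.

```python
from pathlib import Path


def text2image_commands(corpus_lines, font_name, fonts_dir, work_dir):
    """Write one text file per non-empty corpus line and build a text2image call for it."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    commands = []
    for index, line in enumerate(corpus_lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines in the corpus
        stem = f"{font_name.replace(' ', '_')}.line_{index:05d}"  # hypothetical naming
        text_file = work / f"{stem}.txt"
        text_file.write_text(line + "\n", encoding="utf-8")
        commands.append([
            "text2image",
            f"--fonts_dir={fonts_dir}",
            f"--text={text_file}",
            "--strip_unrenderable_words",
            "--xsize=2500", "--ysize=300",
            "--leading=32", "--margin=12", "--exposure=0",
            f"--font={font_name}",
            f"--outputbase={work / stem}",
        ])
    return commands
```

Each returned list can be passed to `subprocess.run`; the resulting box/tif pairs would then go through the usual lstmf generation step.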

Re: [tesseract-ocr] how to use my collected corpus and convert it one line tif

2020-01-08 Thread Shree Devi Kumar
't know how to run it and work with it, so please if you can help me to > make a new traindata because I don't wanna use existing traindata! > Thanks > > > On Wednesday, January 8, 2020 at 8:35:56 AM UTC+3:30, shree wrote: >> >> Read your textfile line by line >>

Re: [tesseract-ocr] Interrupting and restarting lstmtraining

2020-01-08 Thread Shree Devi Kumar
r/eng/eng.traineddata \ > > --model_output /data/output/mem.traineddata > > > > The file I'm using in --continue_from always has a fresh timestamp, but > > the other checkpoints (with numbers in the filenames) in the same > > directory are quite old. >

Re: [tesseract-ocr] need more details about tesseract recognize result

2020-01-09 Thread Shree Devi Kumar
try hocr output as follows: tesseract choices.png choices -c lstm_choice_mode=2 hocr On Thu, Jan 9, 2020 at 11:43 AM 叶新舟 wrote: > Hi: >I found that tesseract by default return a recognize result (a single > char for example) with the maxinum confidence, >yet in my case, I want a list (

Re: [tesseract-ocr] Getting Tesseract Output as ANSI Encoding

2020-01-09 Thread Shree Devi Kumar
output is utf-8, how are you opening it? what is your locale? On Thu, Jan 9, 2020 at 5:37 PM Manankumar Bhatt wrote: > > I am running command "Tesseract image.jpg output -l eng -psm 6" which > generates output.txt file. > > On Thursday, 9 January 2020 15:15:29 UTC+5:30, universal reseller wrote:

Re: [tesseract-ocr] Can tesseract be used to read a PDF and OCR it to text?

2020-01-12 Thread Shree Devi Kumar
Tesseract reads only image files, not pdf. You can convert PDF to image (tif, png) and OCR those. Or use wrappers that use tesseract, which take a PDF and convert to text. Look under add-ons in wiki. On Mon, Jan 13, 2020, 00:31 'pjfarley3' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote:
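Editor's sketch (not part of the thread): the "convert PDF to images, then OCR those" route as two command builders. The reply does not name a converter; `pdftoppm` from poppler-utils is the editor's assumption, used here with its `-r` (resolution) and `-png` options. Helper names and paths are placeholders.

```python
from pathlib import Path


def pdf_to_png_command(pdf_path, image_prefix, dpi=300):
    """Build a pdftoppm call that renders every PDF page to <image_prefix>-N.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def ocr_commands(image_paths, lang="eng"):
    """Build one tesseract call per page image; output goes to <image stem>.txt."""
    commands = []
    for image in sorted(Path(p) for p in image_paths):
        output_base = image.with_suffix("")  # tesseract appends .txt itself
        commands.append(["tesseract", str(image), str(output_base), "-l", lang])
    return commands


# Usage idea:
# import subprocess, glob
# subprocess.run(pdf_to_png_command("scan.pdf", "pages/page"), check=True)
# for cmd in ocr_commands(glob.glob("pages/page-*.png")):
#     subprocess.run(cmd, check=True)
```

The per-page text files can then be concatenated in page order.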

Re: [tesseract-ocr] Some spaces are not recognized

2020-01-12 Thread Shree Devi Kumar
i text. > > Library : Tess-Two > > Platform : Android > > How i can fix the problem related to spaces. Hereby, attaching a > screenshot, input and output text. > > Regards > > On Tuesday, May 29, 2018 at 4:33:43 PM UTC+5:30, shree wrote: >> >> set the config vari

Re: [tesseract-ocr] Training Tesseract 5.0.0 to recognize digital handwriting

2020-01-15 Thread Shree Devi Kumar
Take a look at tesseract-ocr/tesstrain On Tue, Jan 14, 2020 at 10:13 PM 'Fabio Lugli' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > Hello everyone, i'm trying to train tesseract on handwriting, knowing that > it's not the best option, using the latest version available for Windows.

Re: [tesseract-ocr] Re: Training Tesseract 5.0.0 to recognize digital handwriting

2020-01-15 Thread Shree Devi Kumar
Please share a couple of lstmf files for testing. On Wed, Jan 15, 2020 at 8:03 PM 'Fabio Lugli' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > After some work i am able to: > - Use the method *lstmbox* of *tesseract.exe* to obtain the *.box* files > of my *.tif* images > - Use the t

Re: [tesseract-ocr] Re: Training Tesseract 5.0.0 to recognize digital handwriting

2020-01-16 Thread Shree Devi Kumar
ooglegroups.com> wrote: > Yes, i forgot to do it in the latest post. I share a couple of the images > and their correspondant .*box *and .*lstmf *files. The others that i > tried until now are very similar to these ones. > > Il giorno mercoledì 15 gennaio 2020 15:38:23 UTC+1, sh

Re: [tesseract-ocr] Re: Training Tesseract 5.0.0 to recognize digital handwriting

2020-01-16 Thread Shree Devi Kumar
ase I can simply copy those file in the folder? > > Il giorno giovedì 16 gennaio 2020 10:45:59 UTC+1, shree ha scritto: >> >> Are you sure you have the files in the right places? It seems to work for >> me... >> >> ubuntu@tesseract-ocr:~/tesseract$ cd ../TEST/

Re: [tesseract-ocr] Re: Training Tesseract 5.0.0 to recognize digital handwriting

2020-01-16 Thread Shree Devi Kumar
at this is not what i should have inside *all-lstmf* > ? > > Il giorno giovedì 16 gennaio 2020 12:04:50 UTC+1, shree ha scritto: >> >> tesseract unpack is a new feature by @stweil - not yet in the master >> branch. I was testing to see that your lstmf files are read corre

Re: [tesseract-ocr] Re: Training Tesseract 5.0.0 to recognize digital handwriting

2020-01-16 Thread Shree Devi Kumar
ittsskell from* > *Iteration 0: BEST OCR TEXT : k MOVE t0 stoe Mr. GarkkeldR Prom* > *File eng.test.pro0.lstmf line 0 :* > > And then nothing. It opens a new terminal prompt. Could it be using > windows the cause of this issue? > > P.S. Thank you for all your time that you pass answ

Re: [tesseract-ocr] Re: Training Tesseract 5.0.0 to recognize digital handwriting

2020-01-16 Thread Shree Devi Kumar
orno giovedì 16 gennaio 2020 14:26:50 UTC+1, shree ha scritto: >> >> I haven't trained on windows. If you want to do training, it will be >> better to use Linux. >> >> On Thu, Jan 16, 2020 at 6:30 PM 'Fabio Lugli' via tesseract-ocr < >> tesser

Re: [tesseract-ocr] Can tesseract be used to read a PDF and OCR it to text?

2020-01-18 Thread Shree Devi Kumar
dd ons" part of the wiki doesn't actually have > a PDF-to-OCR'ed-text wrapper as far as I can see. > > Still searching for a solution, but thanks for trying to help. > > Peter > > On Monday, January 13, 2020 at 1:49:31 AM UTC-5, pjfarley3 wrote: >> >

Re: [tesseract-ocr] tesstrain.sh only generates 2 pages no matter what maxpages I set.

2020-01-18 Thread Shree Devi Kumar
Verify that you don't have an older version of tesstrain.sh Try using tesseract/src/training/tesstrain.sh and see if maxpages takes effect On Sat, Jan 18, 2020 at 1:47 PM Fil wrote: > I'm trying to figure out how to train tesseract from scratch using > auto-generated box/tif/lstm files. I've be

Re: [tesseract-ocr] Re: Training Tesseract 5.0.0 to recognize digital handwriting

2020-01-20 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#ocr-results Sometimes using multiple models (last three) from training gives better results. On Mon, Jan 20, 2020 at 1:52 PM 'Fabio Lugli' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > After working a couple of days on

Re: [tesseract-ocr] tesstrain.sh only generates 2 pages no matter what maxpages I set.

2020-01-21 Thread Shree Devi Kumar
Please share your input files to see if I can replicate this. -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To vie

Re: [tesseract-ocr] tesstrain.sh only generates 2 pages no matter what maxpages I set.

2020-01-22 Thread Shree Devi Kumar
of pages I specified, not just >> generate exactly what's in the eng.training_text file and nothing more/less. >> >> On Tuesday, January 21, 2020 at 10:37:41 PM UTC-8, shree wrote: >>> >>> Please share your input files to see if I can replicate this. >>>

Re: [tesseract-ocr] Adding Modi Script to Tesseract

2020-01-26 Thread Shree Devi Kumar
Is there a Unicode font for modi script? On Sun, Jan 26, 2020, 21:22 'Nilambari Joshi' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > Hi... I want to create Modi script (Marathi language) traineddata in > tesseract for OCR. Can somebody guide what steps should I follow. > I referred

Re: [tesseract-ocr] Adding Modi Script to Tesseract

2020-01-26 Thread Shree Devi Kumar
Thanks for the link to Modi Unicode font. I would convert the Marathi training text to Modi script (use Aksharamukha) and then train using the unicode font. On Sun, Jan 26, 2020 at 10:28 PM Patrick CHEW wrote: > > On Jan 26, 2020, at 08:16, Shree Devi Kumar wrote: > > Is there a

Re: [tesseract-ocr] Re: Why is there no selectable text in the PDF output file?

2020-01-27 Thread Shree Devi Kumar
Not all viewers work alike. Try with the free Adobe Acrobat Reader or the viewer in Chrome. When I last checked most readers/viewers will select and search text in tesseract generated pdfs. Many times the highlighting of selection is incorrect but if you copy and paste all recognized text should b

Re: [tesseract-ocr] Adding Modi Script to Tesseract

2020-01-27 Thread Shree Devi Kumar
ance. > > On Sunday, January 26, 2020 at 12:26:51 PM UTC-5, shree wrote: >> >> Thanks for the link to Modi Unicode font. >> >> I would convert the Marathi training text to Modi script (use >> Aksharamukha) and then train using the unicode font. >> >> O

[tesseract-ocr] Training for Kurdish in Arabic script

2020-01-27 Thread Shree Devi Kumar
Please see https://github.com/Shreeshrii/tesstrain-ckb It uses a modified training text based on what you sent and earlier text that I had from Pewan and other corpora. Currently the training data includes * AWN 0-9 * AEN - Arabic numbers * No Persian numbers since some shapes are similar to Arab

Re: [tesseract-ocr] Re: How to make training for Arabic in Tesseract 4.0

2020-01-28 Thread Shree Devi Kumar
-txt2img.sh https://github.com/Shreeshrii/tesstrain-ckb/blob/master/3-img2lstmf.sh https://github.com/Shreeshrii/tesstrain-ckb/blob/master/4-train-layer.sh On Tue, Jan 28, 2020 at 12:08 PM manu pranay wrote: > shree, > can you please help me out how to perform arabic training on tesse

Re: [tesseract-ocr] Incremental Training Tesseract 4.0+ for fraktur

2020-01-28 Thread Shree Devi Kumar
Please see https://github.com/tesseract-ocr/tesstrain/wiki There are already newly trained models by @stweil for Fraktur. On Tue, Jan 28, 2020, 22:46 Val LNB wrote: > *How to perform incremental training on Tesseract 4.0+?* > > > I want to improve the existing fraktur (frk) model with some 6000

Re: [tesseract-ocr] Adding Modi Script to Tesseract

2020-01-28 Thread Shree Devi Kumar
. Pango suggested font 'MarthiCursiveT Medium'* > > Please advise for both the queries. Thanks in advance > > On Monday, January 27, 2020 at 3:22:17 AM UTC-5, shree wrote: >> >> For LSTM training punc, numbers, wordlist are NOT required. You can add >> them if y

Re: [tesseract-ocr] Adding Modi Script to Tesseract

2020-01-28 Thread Shree Devi Kumar
The default language that tesseract uses when none is specified is eng. Hence you get a box file with English characters. There is currently no `Modi` traineddata, so you can't use that. You could use `-l mar` to use Marathi, but obviously the recognition will not be correct. I suggest that you use
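
To illustrate the point about the default language, here is a minimal Python sketch that builds a tesseract command line with an explicit `-l` value. The helper names, the file names, and the choice of the `makebox` config are assumptions made for this example and are not taken from the original message; the command is only assembled, not executed.

```python
from typing import List, Optional

# tesseract falls back to "eng" when no -l option is passed
DEFAULT_LANG = "eng"


def effective_lang(lang: Optional[str]) -> str:
    """Return the language tesseract would use for the given -l value."""
    return lang if lang else DEFAULT_LANG


def build_tesseract_cmd(image: str, outbase: str,
                        lang: Optional[str] = None,
                        config: Optional[str] = None) -> List[str]:
    """Assemble (but do not run) a tesseract command as an argument list."""
    cmd = ["tesseract", image, outbase]
    if lang is not None:
        cmd += ["-l", lang]      # e.g. "mar" for Marathi
    if config is not None:
        cmd.append(config)       # config name is an assumption, e.g. "makebox"
    return cmd
```

For example, `build_tesseract_cmd("modi_page.png", "modi_page", lang="mar", config="makebox")` gives an argument list that could be passed to `subprocess.run`. Leaving out `lang` leaves out `-l`, which is the case where tesseract uses eng.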

Re: [tesseract-ocr] Incremental Training Tesseract 4.0+ for fraktur

2020-01-29 Thread Shree Devi Kumar
> > Interestingly, .png files are used when running training so I could have > perhaps skipped conversion to .tif since I started with .png! :) > > Now, the big question, how long will it take to run 10,000 epochs on > average 4 core Xeon v3 VM?

Re: [tesseract-ocr] Re: Adding Modi Script to Tesseract

2020-01-31 Thread Shree Devi Kumar
ing tesseract >> with images. Thanks once again >> >> >> >> On Friday, January 31, 2020 at 12:39:31 AM UTC-5, shree wrote: >>> >>> Please see https://github.com/Shreeshrii/tesstrain-modi for finetune >>> training for Modi from Marathi using synt

Re: [tesseract-ocr] Re: Adding Modi Script to Tesseract

2020-01-31 Thread Shree Devi Kumar
If you send a couple of scanned images with their ground truth transcription and box files, I can test with that and suggest next steps. On Sat, Feb 1, 2020, 09:28 Shree Devi Kumar wrote: > tesseract-ocr/tesstrain repo has makefile for training with images. > > See > https:

Re: [tesseract-ocr] Re: Compiling tesseract 4 in Debian

2020-01-31 Thread Shree Devi Kumar
The version of leptonica that you have (leptonica-1.79.0, libpng 1.2.50 : zlib 1.2.8) only has support for png. All other formats will fail. You need to change the leptonica build to include libtiff etc. On Sat, Feb 1, 2020, 05:48 lundissimo wrote: > Thank you for that link. I hadn't retrieved the file f
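
A rough way to run the same check on the library list quoted above is sketched below in Python. It assumes the text comes from the output of `tesseract --version`; the function name and the list of library names to look for are assumptions for illustration, not an official interface.

```python
import re
from typing import Set

# Image libraries that a leptonica build may report (assumed list)
KNOWN_IMAGE_LIBS = ("libgif", "libjpeg", "libpng", "libtiff",
                    "libwebp", "libopenjp2")


def leptonica_image_libs(version_output: str) -> Set[str]:
    """Return the known image libraries named in the version text."""
    found = set()
    for name in KNOWN_IMAGE_LIBS:
        # Match the name as a whole word so "zlib" does not count as a hit
        if re.search(r"\b" + re.escape(name) + r"\b", version_output):
            found.add(name)
    return found


def missing_image_libs(version_output: str) -> Set[str]:
    """Return the known image libraries that are absent from the text."""
    return set(KNOWN_IMAGE_LIBS) - leptonica_image_libs(version_output)
```

With the text quoted in this message, only `libpng` is found, so `libtiff` and `libjpeg` are reported as missing, which matches the advice to rebuild leptonica with those libraries.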

Re: [tesseract-ocr] Training for Kurdish in Arabic script

2020-01-31 Thread Shree Devi Kumar
https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00#training-just-a-few-layers On Sat, Feb 1, 2020 at 11:33 AM manu pranay wrote: > Thank you so much for your help shree. > the links you provided were very helpful for me. > > now i am trying to train lstm training wit

Re: [tesseract-ocr] Training for Kurdish in Arabic script

2020-01-31 Thread Shree Devi Kumar
data/modi/list.eval \ --max_iterations 99 On Sat, Feb 1, 2020 at 11:33 AM manu pranay wrote: > Thank you so much for your help shree. > the links you provided were very helpful for me. > > now i am trying to train lstm training with retraining the top layer. > can you please pro

Re: [tesseract-ocr] Re: Compiling tesseract 4 in Debian

2020-01-31 Thread Shree Devi Kumar
For Debian you can also get the latest packages from https://notesalexp.org/tesseract-ocr/ On Sat, Feb 1, 2020 at 10:56 AM Shree Devi Kumar wrote: > The version of leptonica that you have > > leptonica-1.79.0 > libpng 1.2.50 : zlib 1.2.8 > > Only has support for png. Al

Re: [tesseract-ocr] Training for Kurdish in Arabic script

2020-02-01 Thread Shree Devi Kumar
https://github.com/impactcentre/ocrevalUAtion https://github.com/eddieantonio/ocreval https://github.com/tesseract-ocr/tesstrain/wiki/German-Konzilsprotokolle On Sat, Feb 1, 2020 at 4:31 PM manu pranay wrote: > thank you shree. > I am done with my retraining top layer training with
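
The tools linked above compare OCR output against ground truth. As a minimal sketch of the idea only, and not the implementation used by any of those tools, a character error rate can be computed in Python from the edit distance between the two strings. The function names are assumptions for this example.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion from ref
                           cur[j - 1] + 1,       # insertion into ref
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance divided by the length of the ground truth."""
    if not ref:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because the distance is divided by the ground truth length, the rate can go above 1.0 (100%) when the OCR output contains many extra characters.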

Re: [tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

2020-02-04 Thread Shree Devi Kumar
-modiLayer_1.017_157724_324000/report_modiLayer_1.017_157724_324000-modi-ALL.txt for an example. Do you have a workflow for tesseract training using your tools? If so, I would like to add/refer to it in Tesseract documentation. On Tue, Feb 4, 2020 at 2:06 AM Wincent Balin wrote: > Hi Shree, > > I am

Re: [tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

2020-02-04 Thread Shree Devi Kumar
> > By the way, I added a create_ground_truth utility, which creates .gt.txt > files as well as the associated .tif files for every specified font, to > the package. I think it could be useful for anyone who does not have a > ground truth collection yet. > > Thanks, I tried it with latest tesseract

Re: [tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

2020-02-09 Thread Shree Devi Kumar
Re: max threads, please see https://github.com/tesseract-ocr/tesseract/issues/263#issuecomment-455614504 I will test the new scripts later and report back. On Mon, Feb 10, 2020 at 12:28 AM Wincent Balin wrote: > Hello Shree, > > I just uploaded new version of the package. About the fix

Re: [tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

2020-02-10 Thread Shree Devi Kumar
Hello Wincent, Thanks for the new version of the package. No errors regarding font now and not slow either. Tested on Ubuntu. On Mon, Feb 10, 2020 at 12:28 AM Wincent Balin wrote: > Hello Shree, > > I just uploaded new version of the package. About the fixes: > > 1. --fonts_d
