Re: [tesseract-ocr] Re: normalisation failed for string error

2019-02-01 Thread Shree Devi Kumar
Please run a substitution script to clean up your training text. eg. for Hindi I use the following sed script. s/ / /g s/्‌ं/ं/g s/‌्‌ृ/‌ृ/g s/ा्/ा/g s/ि्/ि/g s/ी्/ी/g s/ु्/ु/g s/े्/े/g s/ै्/ै/g s/ो्/ो/g s/ौ्/ौ/g s/ॊ्/ॊ/g s/ॆ्/ॆ/g s/ॉ्/ॉ/g s/ृ्/ृ/g s/°//g s/²//g s/³//g s/¹//g s//ः/g s//॑/g s//॒
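A clean-up script of this kind can be applied to a training text with `sed`. The following is a minimal sketch that uses only the four plain-symbol deletions quoted in the message (°, ², ³, ¹); the function name `clean_training_text` and the file names are hypothetical, and the Devanagari rules would be added as further `-e` expressions.

```shell
# Apply a small subset of the clean-up substitutions to text on stdin.
# Only the four symbol deletions quoted in the message are included here.
clean_training_text() {
    sed -e 's/°//g' -e 's/²//g' -e 's/³//g' -e 's/¹//g'
}

# Hypothetical file names: write a cleaned copy of the training text.
if [ -f hin.training_text ]; then
    clean_training_text < hin.training_text > hin.training_text.clean
fi
```

Keeping the rules in a separate file and running `sed -f rules.sed` is an equivalent option when the list grows long.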

Re: [tesseract-ocr] Box file layout for training tesseract4

2019-02-03 Thread Shree Devi Kumar
The easiest way to see the box file layout for any language is to run `text2image` on a training text sample of 2-3 lines. On Sun, 3 Feb 2019, 07:42 Li-Chung Chou Hi Shree, > > Thanks for your kindly response! It's very clear. Actually, I'm also > curious about some lang
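As a rough sketch of that suggestion: render a short sample with `text2image` and inspect the resulting `.box` file. The file name `sample.txt`, the font name, and the fonts directory are placeholders, and `box_sizes` is a hypothetical helper, not part of Tesseract. Each box line has the form `<symbol> <left> <bottom> <right> <top> <page>`.

```shell
# Render a 2-3 line sample (if text2image is installed) and show the
# first few box entries. sample.txt, the font name and the fonts
# directory are placeholders; use ones that exist on your system.
if command -v text2image >/dev/null 2>&1 && [ -f sample.txt ]; then
    text2image --text=sample.txt --outputbase=sample \
        --font='DejaVu Sans' --fonts_dir=/usr/share/fonts
    head -n 5 sample.box
fi

# Print "width height" for every box line. Fields are counted from the
# end of the line so a whitespace symbol does not shift the columns.
box_sizes() {
    awk '{ print $(NF-2) - $(NF-4), $(NF-1) - $(NF-3) }' "$@"
}
```

Running `box_sizes sample.box` afterwards gives a quick view of how large each rendered glyph is.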

Re: [tesseract-ocr] Ocr-d train - Tesseract 4.0 Training

2019-02-03 Thread Shree Devi Kumar
see https://github.com/OCR-D/ocrd-train On Mon, Feb 4, 2019 at 1:04 PM wrote: > I am a beginner for OCR training. Can anyone explain how to use Ocr-d > train briefly? > > I have Tesseract and Leptonica library installed in Cygwin > > tesseract 4.0.0 > leptonica-1.77.0 > libgif 5.1.4 : libjpeg

Re: [tesseract-ocr] Same image and commonad giving different results

2019-02-04 Thread Shree Devi Kumar
PM Shree Devi Kumar wrote: > Try your commands with --oem 1 or with default. It works fine > > TESSDATA_PREFIX=/home/ubuntu/tessdata_best > > $ tesseract -v > tesseract 4.0.0-272-g005f > leptonica-1.76.0 > libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54

Re: [tesseract-ocr] Same image and commonad giving different results

2019-02-04 Thread Shree Devi Kumar
Try your commands with --oem 1 or with default. It works fine TESSDATA_PREFIX=/home/ubuntu/tessdata_best $ tesseract -v tesseract 4.0.0-272-g005f leptonica-1.76.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0 $

Re: [tesseract-ocr] Assert failed:in file weightmatrix.cpp, line 249

2019-02-04 Thread Shree Devi Kumar
> kh@DSAD-6 /usr/share/tessdata $ combine_tessdata -e ./eng.traineddata ~/tesstutorial/engoutput/eng.lstm Extracting tessdata components from ./eng.traineddata Wrote /home/kh/tesstutorial/engoutput/eng.lstm You need the traineddata from tessdata_best repo for use with training. On Mon, Feb 4

Re: [tesseract-ocr] Re: Same image and commonad giving different results

2019-02-04 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tessdata_best https://github.com/tesseract-ocr/tessdata On Mon, Feb 4, 2019 at 6:29 PM wrote: > Where can i find the testdata_best or testdata? > > Still i am not able to get the result if i remove --oem 2 or use --oem 1 > > On Monday, February 4, 2019 at 4:45:0

Re: [tesseract-ocr] OCRd gives weird python error

2019-02-05 Thread Shree Devi Kumar
https://github.com/OCR-D/ocrd-train/issues/26 UnicodeEncodeError: 'ascii' codec can't encode character in Python3 #26 On Tue, Feb 5, 2019 at 3:44 PM Kristóf Horváth wrote: > I got ocrd master from github. I set leptonica and tesseract up with it, > then i included test tiff/text pairs in ground-

Re: [tesseract-ocr] Can i include a unicharambigs file in LSTM training?

2019-02-06 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files *NOTE* Tesseract 4.00 will now run happily with a traineddata file that contains *just* lang.lstm, lang.lstm-unicharset and lang.lstm-recoder. The lstm-*-dawgs are optional, and *none of the other com

Re: [tesseract-ocr] Coordinates of the Text on the Mobile screen.

2019-02-06 Thread Shree Devi Kumar
Check the HOCR and TSV outputs or resultiterator api at word level. On Wed, 6 Feb 2019, 17:21 Rakesh Kumar Can any one please look into this? > > On Tue, Feb 5, 2019 at 1:21 AM Rakesh Kumar > wrote: > >> Hi, >> >> >> >> >> >> Recently i have success using Tesseract-ocr in converting PNG file int
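A minimal sketch of the TSV route, assuming the standard TSV column order (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text), where level 5 marks word rows. The image name `screen.png` and the helper `tsv_words` are hypothetical.

```shell
# Keep only word-level rows (level 5) from Tesseract TSV output and
# print left, top, width, height and the recognised text.
tsv_words() {
    awk -F '\t' '$1 == 5 { print $7, $8, $9, $10, $12 }' "$@"
}

# screen.png is a placeholder image name.
if command -v tesseract >/dev/null 2>&1 && [ -f screen.png ]; then
    tesseract screen.png screen tsv    # writes screen.tsv
    tsv_words screen.tsv
fi
```

The hOCR output carries the same word boxes in its `bbox` attributes, so either format can be used to locate text on the screen.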

Re: [tesseract-ocr] Tesseract Guide for newbies (first draft)

2019-02-07 Thread Shree Devi Kumar
You may want to see the following guide (found using Google search) https://www.endpoint.com/blog/2018/07/09/training-tesseract-models-from-scratch On Thu, 7 Feb 2019, 19:44 Kristóf Horváth Dear Lorenzo, > > thank you for your input it is very much appreciated. I will go through > your suggesti

Re: [tesseract-ocr] Tesseract Guide for newbies (first draft)

2019-02-07 Thread Shree Devi Kumar
s error if file is not found for other languages too. On Thu, Feb 7, 2019 at 9:38 PM Kristóf Horváth wrote: > Thx shree. I will check it out tomorrow, but pls can you give a personal > feedback? > Also i left from stratch because it requires serious amount of sample data > and a newb

Re: [tesseract-ocr] ERROR: shared library version mismatch (was 4.0.0-279-gec8f, expected 4.0.0-255-gfc55

2019-02-08 Thread Shree Devi Kumar
You seem to have an older version of the shared library file. make clean and then rebuild tesseract and training tools again. On Fri, Feb 8, 2019 at 10:49 AM 한정협 wrote: > I was try to use /src/training/tesstrain.sh with my own .tif/box files > > my tesseract version is below > tesseract 4.0.0-

Re: [tesseract-ocr] Re: tesseract 4 box files format

2019-02-09 Thread Shree Devi Kumar
This is good to know. What languages did you test this for? On Sat, Feb 9, 2019 at 5:46 PM thebigwasp wrote: > Ok, I managed to fine tune existing model using tiff/box pairs. In box > files i used so called WordStr format that is described here: > https://github.com/tesseract-ocr/tesseract/blo

Re: [tesseract-ocr] Re: tesseract 4 box files format

2019-02-10 Thread Shree Devi Kumar
Thank you! I tested today for both English and Hindi and your suggested format worked perfectly. I had tested earlier for Hindi language with the WordStr format as described in the code and that had not worked well. I will test further with RTL and CJK languages and also make a PR to create WordS

Re: [tesseract-ocr] Tesseract Performing Poorly on a Very Clear Picture of Text

2019-02-10 Thread Shree Devi Kumar
ubuntu@tesseract-ocr:~/TEST$ tesseract thres.png - --psm 6 --oem 0 -l eng --tessdata-dir ~/tessdata
Warning: Invalid resolution 0 dpi. Using 70 instead.
LJN7VT
This uses `base` tesseract (non LSTM) --oem 0 which is only available in the traineddata from tessdata repo. --oem 1 and traineddata from

Re: [tesseract-ocr] Help with recognizing text

2019-02-17 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/langdata/issues/65 On Sun, Feb 17, 2019 at 5:29 PM 'Ml Ml' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > Hello List, > > how can i extract the text from here? > > [image: RSP4_sp1.jpg] > > I tried: > tesseract RSP4_sp1.jpg stdout nobatch digit

Re: [tesseract-ocr] Help with recognizing text

2019-02-17 Thread Shree Devi Kumar
You could try fine tuning with this font https://www.urbanfonts.com/fonts/Atomic_Clock_Radio.htm On Sun, Feb 17, 2019 at 8:31 PM Shree Devi Kumar wrote: > See https://github.com/tesseract-ocr/langdata/issues/65 > > On Sun, Feb 17, 2019 at 5:29 PM 'Ml Ml' via tesseract-oc

[tesseract-ocr] tess4train_impact_from_boxtiff.sh

2019-02-19 Thread Shree Devi Kumar
I have created a simple bash script for LSTM training - finetuning for impact using box/tiff pairs. Change file locations to match your setup and let me know if it works for you. https://github.com/Shreeshrii/tesseract/blob/tess4train/tess4train_impact_from_boxtiff.sh -- You received this messa

Re: [tesseract-ocr] Need help to train me first, that I could train tesseract (Eng/Rus/Hindi)

2019-02-19 Thread Shree Devi Kumar
Please share a couple of scanned pages for testing. You may be able to use existing traineddata files for English and Russian with -l eng+rus or for English and Hindi with -l eng+hin For text with diacritics you can try -l script/Latin This will give you an idea of current state. You can plan tr

Re: [tesseract-ocr] Need help to train me first, that I could train tesseract (Eng/Rus/Hindi)

2019-02-19 Thread Shree Devi Kumar
Actually, for English + Hindi, use `script/Devanagari.traineddata` for English + Bengali, try `eng+ben` or `script/Bengali` Please check the language code for Russian. On Wed, Feb 20, 2019 at 11:02 AM Shree Devi Kumar wrote: > Please share a couple of scanned pages for testing. > > Y

Re: [tesseract-ocr] Symbol lookup error while using tesseract-ocr

2019-02-23 Thread Shree Devi Kumar
You may want to file the issue at https://github.com/AlexanderP/tesseract-debian so that Alex can look at it. Thanks! On Sat, Feb 23, 2019 at 11:53 AM wrote: > I've been working with Tesseract 4.0.0 for the last two months. > > I used the ppa by alexander pozdnyakov to install it. > > But today

Re: [tesseract-ocr] Re: Training Tesseract Arabic/Hindi Digits using JTessBoxEditor in window 10

2019-02-25 Thread Shree Devi Kumar
Please see https://github.com/Shreeshrii/tessdata_arabic You can try the new traineddata from there alongwith the PR https://github.com/tesseract-ocr/tesseract/pull/2266 On Mon, Feb 25, 2019 at 9:27 PM Soufiane Sabiri wrote: > Have you had any luck training tesseract for arabic letters or numb

Re: [tesseract-ocr] make it simple . why this wont work ?

2019-03-04 Thread Shree Devi Kumar
For such images eng.traineddata from tessdata repo with --oem 0 may give better results. On Mon, Mar 4, 2019 at 10:05 PM wrote: > Simple texts > > but tesseract wont work if some thing follows T like T1234567 , it takes T > as 1 or 7 . What am i missing ? > > -- > You received this message becau

Re: [tesseract-ocr] make it simple . why this wont work ?

2019-03-04 Thread Shree Devi Kumar
--oem 0 will only work with traineddata files from https://github.com/tesseract-ocr/tessdata On Tue, Mar 5, 2019 at 1:43 AM Arun KUMAR M wrote: > Thankyou Shree Devi I will try this out . Is this on Tesseract 3.0 or 4.0 ? > > i dont see oem 0 in version 4.00 > > > > On Mon,
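The pairing of engine mode and model repository can be sketched as follows. The directory locations and the helper `tessdata_dir_for_oem` are hypothetical; the mapping assumes that modes 0 (legacy) and 2 (legacy + LSTM) need the legacy data shipped in the tessdata repo, while LSTM-only runs can use tessdata_best.

```shell
# Placeholder locations for local copies of the two model repositories.
LEGACY_TESSDATA="$HOME/tessdata"        # tesseract-ocr/tessdata
BEST_TESSDATA="$HOME/tessdata_best"     # tesseract-ocr/tessdata_best

# Print the tessdata directory to use for a given --oem value.
tessdata_dir_for_oem() {
    case "$1" in
        0|2) printf '%s\n' "$LEGACY_TESSDATA" ;;
        *)   printf '%s\n' "$BEST_TESSDATA" ;;
    esac
}

# image.png is a placeholder input file.
if command -v tesseract >/dev/null 2>&1 && [ -f image.png ]; then
    tesseract image.png stdout --oem 0 -l eng \
        --tessdata-dir "$(tessdata_dir_for_oem 0)"
fi
```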

Re: [tesseract-ocr] Spacing extracted text horizontally

2019-03-05 Thread Shree Devi Kumar
Please see https://github.com/tesseract-ocr/tesseract/issues/781 Using ` -c preserve_interword_spaces=1` with the command may give you the extra spaces. On Tue, Mar 5, 2019 at 10:07 PM wrote: > Hello > It is possible to declare in the Tesseract call to space the extracted > text horizontally in

Re: [tesseract-ocr] Finetuning in ocrd-train

2019-03-09 Thread Shree Devi Kumar
Please see https://github.com/OCR-D/ocrd-train/blob/f89efdd46c01aedea615d35e0561c50d7f86e584/Makefile learning rate does not need to be specified for finetuning. It is automatically determined/reduced - see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#net-mode-and-optimiz

Re: [tesseract-ocr] What would be the best resolution (dpi) and recommended size of each characters to train in Tesseract

2019-03-11 Thread Shree Devi Kumar
1. Which ancient scripts are you trying to train?
2. Are you trying to train base tesseract (3.0x) or LSTM tesseract?
3. Are you using synthetic training data generated with text2image, or scanned images?
On Tue, Mar 12, 2019 at 10:16 AM Naga raja wrote: > Hi All, > > As we are working in some ancient scr

Re: [tesseract-ocr] Why combine_lang_model need ommon.unicharset

2019-03-12 Thread Shree Devi Kumar
Your simple unicharset does not look right. You can make a list of characters needed in a file and create unicharset from it.
unicharset_extractor --output_unicharset cp.unicharset --norm_mode 1 cp.syllables.txt
combine_lang_model \
  --input_unicharset cp.unicharset \
  --script_dir ~/langdata \
  --o

Re: [tesseract-ocr] What would be the best resolution (dpi) and recommended size of each characters to train in Tesseract

2019-03-12 Thread Shree Devi Kumar
at 7:24 AM Naga raja wrote: > Thanks for the response. Following are the response to your questions. > Basically we had doubt on sizes while training a new script sample in LSTM. > > On Tuesday, March 12, 2019 at 2:00:49 PM UTC+9, shree wrote: >> >> 1. Which ancient scripts

Re: [tesseract-ocr] OCR Evaluation Tools

2019-03-19 Thread Shree Devi Kumar
Yes. I have used for testing English, Spanish (unlvtests) and Devanagari (Sanskrit) and Fraktur (German) . On Tue, 19 Mar 2019, 10:54 , wrote: > Do the OCR Evaluation tools here work for Tesseract 4.00? > > https://github.com/Shreeshrii/ocr-evaluation-tools > > If not, could someone please point

Re: [tesseract-ocr] The problems about training eng+chinese

2019-03-19 Thread Shree Devi Kumar
You are using a number of Japanese, Korean and Traditional Chinese fonts for training. Try without them. On Tue, Mar 19, 2019 at 4:19 PM 易鑫 wrote: > Hello,everyone: > I want to recognize the characters in the table(You can see find it in > the attach file).In the past, I only recognize the en

Re: [tesseract-ocr] The problems about training eng+chinese

2019-03-19 Thread Shree Devi Kumar
bin/src/training/lstmtraining \
  --stop_training \
  --continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned_checkpoint \
  --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
  --model_output ~/tessdata_best/chi_sim_tuned.traineddata
On Wed, Mar 20, 2019 at 8:46 A

Re: [tesseract-ocr] How to use tesseract for these images

2019-03-19 Thread Shree Devi Kumar
See https://www.mkompf.com/cplus/emeocv.html OpenCV practice: OCR for the electricity meter https://www.pyimagesearch.com/2018/09/17/opencv-ocr-and-text-recognition-with-tesseract/ OpenCV OCR and text recognition with Tesseract https://github.com/charlesw/tesseract/issues/90 https://stackoverflo

Re: [tesseract-ocr] The problems about training eng+chinese

2019-03-19 Thread Shree Devi Kumar
ing to add a character is 3600. You should check an eval set (different from training set) around these numbers to find the minimum. > > > > > > > > > > > > > > Shree Devi Kumar 于2019年3月20日周三 上午11:18写道: > >> >> ~/tesseract/src/training/tess

Re: [tesseract-ocr] Error by using own model

2019-03-21 Thread Shree Devi Kumar
A checkpoint is NOT a traineddata file. Use --stop_training to build the traineddata, e.g.
echo " stop training "
~/tesseract/bin/src/training/lstmtraining \
  --stop_training \
  --continue_from ./devaplus_z1/plus_checkpoint \
  --traineddata ./santrain_z1

Re: [tesseract-ocr] pytesseract - how to improve quality of text

2019-03-22 Thread Shree Devi Kumar
If the invoices have a fixed format, you can try with uzn. See https://github.com/jsoma/tesseract-uzn https://jsoma.github.io/kull/#/ Or check out OpenCV. See https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/ On Fri, Mar 22, 2019 at 9:35 PM yoganand w

Re: [tesseract-ocr] Re: MICR recognition with tesseract-ocr

2019-03-22 Thread Shree Devi Kumar
See https://github.com/BigPino67/Tesseract-MICR-OCR On Fri, Mar 22, 2019 at 6:10 PM wrote: > i am a student working on this but i don't have much idea about tessseract > will smeone guide me how can i make my own OCR for cheque please > any help is appreciated > > On Friday, March 28, 2008 at 2:

Re: [tesseract-ocr] Re: MICR recognition with tesseract-ocr

2019-03-22 Thread Shree Devi Kumar
Also see http://www.devscope.net/Content/ocrchecks.aspx On Fri, Mar 22, 2019 at 10:59 PM Shree Devi Kumar wrote: > See https://github.com/BigPino67/Tesseract-MICR-OCR > > On Fri, Mar 22, 2019 at 6:10 PM wrote: > >> i am a student working on this but i don't have much i

Re: [tesseract-ocr] Re: Dot Matrix Fonts and Tesseract's Connected Component Analysis

2019-03-22 Thread Shree Devi Kumar
haven't tested the new traineddata with the original image. I will email you the training text and fonts used, if you want. On Sat, 23 Mar 2019, 03:33 , wrote: > Hi Shree, > > Thanks for sending these images and the traineddata file. I confirmed > that they worked. Would you

Re: [tesseract-ocr] Re: Dot Matrix Fonts and Tesseract's Connected Component Analysis

2019-03-22 Thread Shree Devi Kumar
Also changed image to 300 dpi and used --dpi 300. On Sat, 23 Mar 2019, 07:43 Shree Devi Kumar, wrote: > Hi Ameera, > > Please do check with other images too as I tested with only one image that > you sent. > > I had initially tried fine tuning (impact and plus) but those

Re: [tesseract-ocr] General strategies for dealing with problem images

2019-03-23 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tesseract/pull/2294 by @bertsky adds the whitelist/blacklist functionality for Tesseract4. It has not been merged yet. On Sat, Mar 23, 2019 at 2:58 PM Lorenzo Bolzani wrote: > Il giorno mar 19 mar 2019 alle ore 06:03 Jonathan Muller < > jmul...@pukogames.com> ha

Re: [tesseract-ocr] Re: Dot Matrix Fonts and Tesseract's Connected Component Analysis

2019-03-23 Thread Shree Devi Kumar
> > That's interesting that you tried replacing the top layer. I haven't > tried that yet. How many iterations did you use? > >> In this case the unicharset was limited to UPPERCASE letters, 0-9 numbers , : and /. I used a training_text which followed the pattern of the image - lines starting wit

Re: [tesseract-ocr] The problem of training eng + chi_sim

2019-03-25 Thread Shree Devi Kumar
Try replacing a layer - you may need larger training_text and more iterations
lstmtraining --model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_layer \
  --continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm \
  --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.t

Re: [tesseract-ocr] The problem of training eng + chi_sim

2019-03-25 Thread Shree Devi Kumar
36000 iterations, error rate 0.1 OCR output attached On Mon, Mar 25, 2019 at 6:09 PM Shree Devi Kumar wrote: > Try replacing a layer - you may need larger training_text and more > iterations > > lstmtraining --model_output > ~/tesstutorial/chi_sim_tuned_from_chi_si

Re: [tesseract-ocr] Unable to recognise small size image

2019-03-25 Thread Shree Devi Kumar
try --psm 6 --dpi 300 ubuntu@tesseract-ocr:~/TEST$ tesseract small.png - --psm 6 --dpi 300 a) ! b) | c) * d) _ ubuntu@tesseract-ocr:~/TEST$ tesseract small.png - --psm 6 a) ! b) | c) * d) _ On Mon, Mar 25, 2019 at 11:39 PM Heeramani Prasad wrote: > I am trying to recognise various images For s

Re: [tesseract-ocr] How to generate a searchable PDF using some images but executing OCR on their preprocessed version

2019-03-26 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#integrate-original-image-file-and-detected-text-into-pdf On Wed, Mar 27, 2019 at 5:04 AM Nico wrote: > Hi, > I have a bunch of RGB images I need to OCR and put together in a > searchable PDF. I noticed that if I preprocess th

Re: [tesseract-ocr] What does special character "|" mean?

2019-03-28 Thread Shree Devi Kumar
The training text that I used for replace layer has the | character. On Fri, 29 Mar 2019, 08:51 易鑫, wrote: > Hello,everyone: > > I now use tesseract 4.0.0 to recognize the content of table image. The > sample image is in the attach files(5-a.jpg) > > When I use the command: > > *tesseract 5-a.j

Re: [tesseract-ocr] Training a language not in tesseract but almost similar script/ letters with vietnam language

2019-03-28 Thread Shree Devi Kumar
tesseract procssed_image.png stdout -l vie bazaar -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCD EFGHIJKLMNOPQRSTUVWXYZ0123456789àâêî Bazaar should be listed last - see tesseract --help Check your command syntax On Fri, 29 Mar 2019, 00:02 , wrote: > I am trying to train a language c
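A sketch of the reordered command, with options first and the config file name last. It assumes the space inside the whitelist in the quoted command is a line-wrap artifact, and the wrapper name `ocr_vie` is hypothetical. Whether the whitelist takes effect with the LSTM engine depends on the Tesseract build, since the thread elsewhere notes the Tesseract 4 whitelist pull request was not yet merged.

```shell
# Characters to allow; copied from the quoted command with the stray
# space removed (assumed to be a wrapping artifact).
WHITELIST='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789àâêî'

# Options such as -l and -c come first; config file names (here
# "bazaar") go at the end of the command line.
ocr_vie() {
    # $1 is the input image; recognised text goes to stdout.
    tesseract "$1" stdout -l vie \
        -c tessedit_char_whitelist="$WHITELIST" \
        bazaar
}
```

It would be called as `ocr_vie some_image.png` once the image exists.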

Re: [tesseract-ocr] Trainning tesseract for a new language from scratch that does not exist in Tesseract

2019-03-28 Thread Shree Devi Kumar
For tesseract 3, and training language similar to vie, take a look at vietocr and jtessboxeditor. On Fri, 29 Mar 2019, 00:02 , wrote: > The steps mentioned here for [tessercat 3.0-3.02][ > https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02 > ] is not clear nor I

Re: [tesseract-ocr] How to restrict OCR character set.

2019-03-28 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/pull/2294 On Fri, 29 Mar 2019, 11:17 Martin Emmerson, wrote: > Is there a way to restrict the character set that tesseract-ocr will > attempt to identify? I'm scanning USA-based receipts which have a fairly > simple set of monospaced characters but

Re: [tesseract-ocr] Leptonica sometimes mangles images when using PDF output mode

2019-03-29 Thread Shree Devi Kumar
Please also post as an issue for leptonica at https://github.com/DanBloomberg/leptonica/issues. On Fri, Mar 29, 2019 at 12:01 AM Lucas L. wrote: > Environment > >- Tesseract 4.0.0-beta.3-249-g607e >- leptonica-1.76.0 >- Linux (hostname removed) 4.18.0-16-generic #17 >

Re: [tesseract-ocr] Re: Leptonica sometimes mangles images when using PDF output mode

2019-03-29 Thread Shree Devi Kumar
The default page segmentation mode is different for command line and api. Specify it explicitly and test. On Fri, 29 Mar 2019, 22:12 Lucas L., wrote: > OK, I am running up against another issue, and it's getting weirder. Since > Tesseract does not take PDFs as input, this service does the deed o

Re: [tesseract-ocr] Re: Leptonica sometimes mangles images when using PDF output mode

2019-03-29 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/configs/pdf This is different from your installed version. On Fri, 29 Mar 2019, 22:30 Shree Devi Kumar, wrote: > The default page segmentation mode is different for command line and api. > Specify it explicitly and test.

Re: [tesseract-ocr] How to restrict OCR character set.

2019-03-29 Thread Shree Devi Kumar
On Thursday, March 28, 2019 at 11:03:59 PM UTC-7, shree wrote: >> >> See https://github.com/tesseract-ocr/tesseract/pull/2294 >> >> On Fri, 29 Mar 2019, 11:17 Martin Emmerson, wrote: >> >>> Is there a way to restrict the character set that tesseract-ocr will

Re: [tesseract-ocr] How to restrict OCR character set.

2019-03-29 Thread Shree Devi Kumar
/engrestrict_plus0.242_44.checkpoint wrote checkpoint. Finished! Error rate = 0.242 If you know the font used and customize training text to your data, you will get better results. On Sat, Mar 30, 2019 at 11:35 AM Shree Devi Kumar wrote: > try the finetuned traineddata from > > > https:

Re: [tesseract-ocr] Trainning tesseract for a new language from scratch that does not exist in Tesseract

2019-03-30 Thread Shree Devi Kumar
jtessboxeditor offers tesseract training for version 3.0x that's why I mentioned it. For tesseract4, training steps are very different. On Sat, Mar 30, 2019 at 1:14 PM wrote: > Hi, you might have got confused with my other question. I am actually > working on two languages. Neither of them are

Re: [tesseract-ocr] What is the current stable version of tesseract? And how to upgrade it from tesseract 3.04.01?

2019-03-30 Thread Shree Devi Kumar
You can use Alex's PPA https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr?field.series_filter=xenial On Sat, Mar 30, 2019 at 4:43 PM wrote: > I have installed tesseract using the following command: > >*sudo apt install tesseract-ocr* > > on Ubuntu 16.04 LTS. Working with python 2

Re: [tesseract-ocr] What is the current stable version of tesseract? And how to upgrade it from tesseract 3.04.01?

2019-03-31 Thread Shree Devi Kumar
> > Only this command will work: > > sudo add-apt-repository ppa:alex-p/tesseract-ocr > sudo apt-get update > > > > On Saturday, March 30, 2019 at 4:50:27 PM UTC+5:30, shree wrote: >> >> You can use Alex's PPA >> >> >> https://launchpa

Re: [tesseract-ocr] What is the current stable version of tesseract? And how to upgrade it from tesseract 3.04.01?

2019-03-31 Thread Shree Devi Kumar
I have not used OPENCV so can't help regarding that. On Sun, Mar 31, 2019 at 3:37 PM wrote: > Ok, thanks for the info. > > I installed openCV 3.4.2 by following the exact same steps given in this > tutorial > https://www.pyimagesearch.com/2015/07/20/install-opencv-3-0-and-python-3-4-on-ubuntu/

Re: [tesseract-ocr] What is the current stable version of tesseract? And how to upgrade it from tesseract 3.04.01?

2019-03-31 Thread Shree Devi Kumar
tesseract-ocr-eng > sudo apt-get install tesseract-ocr-vie > > How should I unistalled it, before installing tesseract 4 with: > > sudo add-apt-repository ppa:alex-p/tesseract-ocr > sudo apt-get update > > > On Sun, Mar 31, 2019 at 3:42 PM Shree Devi Kumar > wrote: >

Re: [tesseract-ocr] What is the current stable version of tesseract? And how to upgrade it from tesseract 3.04.01?

2019-03-31 Thread Shree Devi Kumar
/usr/bin/tesseract: No such file or directory > > > And here, https://github.com/tesseract-ocr/tesseract/wiki/Compiling > > I see that there is no tesseract 4 support for Ubuntu 16.04 LTS Xenial. > > > > > On Sunday, March 31, 2019 at 3:56:24 PM UTC+5:30, shree wrote:

Re: [tesseract-ocr] Initializing tesseract with TessBaseAPIInit4 from python

2019-04-01 Thread Shree Devi Kumar
See https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!searchin/tesseract-ocr/zdenko%7Csort:date/tesseract-ocr/xvTFjYCDRQU/SI6du-4JBAAJ Example of how to use the tesseract C-API in Python with cffi On Mon, Apr 1, 2019 at 1:23 PM Guru Govindan wrote: > Hi There, > I recently migrated

Re: [tesseract-ocr] Doing OCR on pdfs with embedded CID fonts

2019-04-02 Thread Shree Devi Kumar
Tesseract does not take pdfs as direct input. You have to convert pdf to images and provide that to tesseract. However there are many 3rd party applications which take pdf as input and have tesseract as backend to do OCR. On Tue, Apr 2, 2019 at 5:02 PM Kristóf Horváth wrote: > I just tried to d
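One possible way to do the conversion step is sketched below. It assumes `pdftoppm` from Poppler is installed, which is a tool choice not named in the thread, and the function name `ocr_pdf` is hypothetical. Pages are rendered at 300 dpi and each page image is passed to tesseract in turn.

```shell
# Convert each page of a PDF to a PNG and OCR the pages one by one.
# Usage: ocr_pdf document.pdf
ocr_pdf() {
    pdf=$1
    prefix=${pdf%.pdf}-page
    # Render pages as <prefix>-<n>.png at 300 dpi.
    pdftoppm -r 300 -png "$pdf" "$prefix" || return 1
    for img in "$prefix"-*.png; do
        [ -f "$img" ] || continue
        # Writes <prefix>-<n>.txt next to each page image.
        tesseract "$img" "${img%.png}"
    done
}
```

Adding `pdf` as a final argument to the tesseract call would produce a searchable PDF per page instead of plain text.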

Re: [tesseract-ocr] Does Tesseract Send Information to Google?

2019-04-02 Thread Shree Devi Kumar
Tesseract is a standalone app and can be run locally. On Tue, Apr 2, 2019 at 7:26 PM Dave Walsh wrote: > Hello, > > > My company is using Tesseract for OCR in an internal application. The > information contained may be sensitive in nature and be subject to GDPR > rules. Does anyone know if Tes

Re: [tesseract-ocr] Please help a new member to train tess

2019-04-02 Thread Shree Devi Kumar
Please see https://github.com/tesseract-ocr/tesseract/wiki/Compiling-%E2%80%93-GitInstallation for building tesseract. To install pre-built versions, see https://github.com/tesseract-ocr/tesseract/wiki#installation On Tue, Apr 2, 2019 at 8:50 PM Trong wrote: > Dear friends, > I 'm trying to tr

[tesseract-ocr] Tesseract 4 Training Tutorials

2019-04-02 Thread Shree Devi Kumar
I have setup a github repo with the required files and bash scripts for running Tesseract 4 Training Tutorials. https://github.com/Shreeshrii/tess4training Please give it a try and let me know of any problems.

Re: [tesseract-ocr] confuse whether Otsu Thresholding affects lstm training

2019-04-03 Thread Shree Devi Kumar
Usually for LSTM training we are using synthetic images created by text2image program using training text and fonts using tesstrain.sh or tesstrain.py. Hence there is no question of binarization or dpi as the program creates images as expected by tesseract training process. On Wed, Apr 3, 2019 at

Re: [tesseract-ocr] confuse whether Otsu Thresholding affects lstm training

2019-04-03 Thread Shree Devi Kumar
there is no such problem. > > What about training with our real data. I have enough images for training. > Should I need to do some preprocess like binary or resized dpi and then do > lstm training? > > On Wed, Apr 3, 2019 at 16:36 Shree Devi Kumar > wrote: > >>

Re: [tesseract-ocr] OCRing existing PDF

2019-04-03 Thread Shree Devi Kumar
dienne > Centre d'études acadiennes Anselme-Chiasson > Université de Moncton > Moncton (Nouveau-Brunswick) E1A 3E9 > (506) 858-4724 > > robert.rich...@umoncton.ca > http://www.umoncton.ca/umcm/ > > > Envoyé de mon iPad > > Le 2 avr. 2019 à 12:27, Shree Devi Ku

Re: [tesseract-ocr] confuse whether Otsu Thresholding affects lstm training

2019-04-03 Thread Shree Devi Kumar
> >> What about training with our real data. I have enough images for >> training. Should I need to do some preprocess like binary or resized dpi >> and then do lstm training? >> >> On Wed, Apr 3, 2019 at 16:36 Shree Devi Kumar >> wrote: >> &

Re: [tesseract-ocr] how to improve No block overlapping textline

2019-04-03 Thread Shree Devi Kumar
Try tesseract test.jpg test --psm 6 lstm.train On Thu, Apr 4, 2019 at 10:33 AM 童虎 wrote: > I have an image, I run tesseract on it, can recognise the fist line > although some wrong word. > > But I run `tesseract test.jpg test lstm.train' to make lstf file. It can't > recognise the first line. T

Re: [tesseract-ocr] Training Tesseract 4 from Scratch

2019-04-04 Thread Shree Devi Kumar
#=== CHECK THAT TESSERACT AND TRAINING TOOLS ARE INSTALLED
tesseract -v
text2image -v
unicharset_extractor -v
set_unicharset_properties -v
combine_lang_model -v
lstmtraining -v
lstmeval -v
#=== MAKE DIRECTORIES AND DOWNLOAD REQUIRED FILES
mkdir -p ~/tessscratch
cd ~/tessscratch
wget -O lstm.tra
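The installation check at the start of that script can also be written as a loop that reports only what is missing. This is a small sketch, not part of the original script; the helper name `missing_tools` is hypothetical and the tool names are the ones listed in the message.

```shell
# Print the name of every given tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools tesseract text2image unicharset_extractor \
    set_unicharset_properties combine_lang_model lstmtraining lstmeval)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
fi
```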

Re: [tesseract-ocr] Training Tesseract 4 from Scratch

2019-04-04 Thread Shree Devi Kumar
if you want to use your own images, you don't need to run text2image with the training text and fonts. Instead, supply your list of box and tif files in the next step. On Thu, Apr 4, 2019 at 7:09 PM Shree Devi Kumar wrote: > #=== CHECK THAT TESSERACT AND TRAINING TOOLS ARE I

Re: [tesseract-ocr] How to train tesseract with ancient Greek character

2019-04-04 Thread Shree Devi Kumar
You don't need to add *"GFS Artemisia" as it may not have the Chinese characters.* Just add Greek character "Φ" to your training text. I think all fonts that you are using support it. Verify in generated tif files that it is getting rendered. On Thu, Apr 4, 2019 at 7:25 AM 易鑫 wrote: > Does any

Re: [tesseract-ocr] Tesseract different output on windows then linux

2019-04-06 Thread Shree Devi Kumar
See https://github.com/UB-Mannheim/tesseract/wiki for latest windows installers See https://github.com/tesseract-ocr/tesseract/wiki#tesseract-4-packages-with-lstm-engine-and-related-traineddata for latest linux installers The difference you see could also be because of different version of trained

Re: [tesseract-ocr] Tesseract different output on windows then linux

2019-04-06 Thread Shree Devi Kumar
the > eng.traineddata and now I'm getting the following assertion > > > lstm_recognizer_->DeSerialize(&fp):Error:Assert failed:in file > ../../../../ccmain/tessedit.cpp, line 193 > > > On Sunday, April 7, 2019 at 12:48:54 AM UTC-4, shree wrote: >> >> See https

Re: [tesseract-ocr] Tesseract different output on windows then linux

2019-04-07 Thread Shree Devi Kumar
l 7, 2019 at 1:51:01 AM UTC-4, shree wrote: >> >> How did you get the traineddata? You need to use the `raw` link. >> >> wget -O eng.traineddata >> https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata >> >> >> >> On Sun, Apr

Re: [tesseract-ocr] Training Tesseract 4 from Scratch

2019-04-07 Thread Shree Devi Kumar
*mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110* Your traineddata file path is incorrect or file does not exist On Sun, Apr 7, 2019 at 2:22 PM Trong wrote: > Hi, > I tried to train and got error > > *mgr_.Init(traineddata_path.c_str()):Er

Re: [tesseract-ocr] How to train tesseract with new script?

2019-04-07 Thread Shree Devi Kumar
Tesseract 4 LSTM training is done using tesseract, not TensorFlow. It is easiest to train using synthetic training data generated with training text and fonts. For ancient scripts it may need to be finetuned further using real life images. I have tried training for Brahmi, Akkadian Cuneiform and

Re: [tesseract-ocr] Making custom traineddata

2019-04-08 Thread Shree Devi Kumar
If you can provide another 40-50 lines of training data (text file) I will rerun the training On Mon, 8 Apr 2019, 22:11 Jankees Korstanje, wrote: > Hi Shree, > > We have tried your traineddata file for MRZ and noticed that it does not > detect the character X. > >

Re: [tesseract-ocr] Questions about recognize Chinese characters

2019-04-09 Thread Shree Devi Kumar
I think you will get better results with --oem 1. The legacy models are better only in limited cases. For complex scripts the LSTM engine and models are better, as far as I can tell. On Wed, 10 Apr 2019, 10:23 Aaron Shieh, wrote: > I get '焊接' with the following: > tesseract 67.png o -l chi_tra
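
As a sketch of the `--oem 1` suggestion above, the helper below only assembles the command line and does not run tesseract. `tesseract_cmd` is a hypothetical name; the image, output base and language are taken from the quoted command, whose remaining arguments are cut off in the snippet.

```python
import shlex


def tesseract_cmd(image, outputbase, lang, oem=1):
    """Build the argv for a tesseract run; --oem 1 selects the LSTM engine only."""
    return ["tesseract", image, outputbase, "-l", lang, "--oem", str(oem)]


def as_shell(argv):
    """Render an argv list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

For example, `as_shell(tesseract_cmd("67.png", "o", "chi_tra"))` gives a command string that can be compared against the one that was actually run.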

Re: [tesseract-ocr] Need advice for training_text.txt

2019-04-10 Thread Shree Devi Kumar
where each row in the training text is close to what my final application will see: That would be preferable. On Wed, 10 Apr 2019, 21:07 Aaron Shieh, wrote: > Hi, > > I noticed in the langdata_lstm/chi_tra repo the training text contains > long lines of text, my application requires only identi

Re: [tesseract-ocr] confuse whether Otsu Thresholding affects lstm training

2019-04-10 Thread Shree Devi Kumar
Hi Lorenzo, Thanks for the detailed description of pre-processing steps. I will link from the wiki so that it is available for easy reference. Thank you for sharing. -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and

Re: [tesseract-ocr] Re: seven segment display - 4.0 traineddata

2019-04-13 Thread Shree Devi Kumar
eally like to try it out. > Do you have the download still at hand somewhere? > Thanks! > > On Wednesday, March 29, 2017 at 5:40:32 PM UTC+2, shree wrote: >> >> Hi, >> >> I have built a 4.0 traineddata using some seven segment display fonts. >> Trained

Re: [tesseract-ocr] How to create a box file

2019-04-15 Thread Shree Devi Kumar
see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-training-data On Mon, Apr 15, 2019 at 6:34 PM anne wrote: > Hi, I'm really new to Tesseract and I want to train it to recognize a new > script but until now, I still don't understand how to make a box file. I've

Re: [tesseract-ocr] How to choose the stop condition of LSTM training

2019-04-17 Thread Shree Devi Kumar
>BTW, for anybody: is there a way to query a model or a checkpoint for the net_specs? There is no existing utility to do that. However, Ray had dumped the info for tessdata_fast (and partly for tessdata_best) which has been posted in the wiki at https://github.com/tesseract-ocr/tesseract/wiki/Data

Re: [tesseract-ocr] traning devanagari: »Encoding of string failed!«

2019-04-18 Thread Shree Devi Kumar
> I have images and manually corrected Text with line coordinates. From those, I've generated .box files; What method did you use for generating the .box files? Please provide the image for the box file for test. On Thu, Apr 18, 2019 at 6:09 PM wrote: > Dear reader, > I want to improve devanag

Re: [tesseract-ocr] traning devanagari: »Encoding of string failed!«

2019-04-18 Thread Shree Devi Kumar
Also see https://github.com/OCR-D/ocrd-train/pull/66 https://github.com/tesseract-ocr/tesseract/issues/2357#issuecomment-477239316

Re: [tesseract-ocr] traning devanagari: »Encoding of string failed!«

2019-04-23 Thread Shree Devi Kumar
can test at your end. I am not getting the encoding related errors. Please use `--psm 6` with the lstm.train command. On Tue, Apr 23, 2019 at 1:53 PM Jochen Barth wrote: > Dear Shree, > I've tried it with the format below and combined letter-and-sign-symbols > (see attached fi
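
A sketch of the `--psm 6` advice above, again only building the command and not running it. `lstmf_cmd` is a hypothetical helper, and it assumes the usual arrangement where the .box file sits next to the image with the same base name; any file names passed to it here are illustrative and not taken from the thread.

```python
import os


def lstmf_cmd(image, psm=6):
    """Build the argv that asks tesseract to write an .lstmf file for one image."""
    # The output base is the image path without its extension.
    base, _ext = os.path.splitext(image)
    return ["tesseract", image, base, "--psm", str(psm), "lstm.train"]
```

Running the resulting command once per image is the kind of loop the training scripts in this thread are described as doing.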

Re: [tesseract-ocr] traning devanagari: »Encoding of string failed!«

2019-04-23 Thread Shree Devi Kumar
zip file is too big. Let me do an alternative upload. Training runs ok for me - Warning: LSTMTrainer deserialized an LSTMRecognizer! Continuing from /home/ubuntu/tessdata_best/script/Devanagari.lstm Loaded 13/13 lines (1-13) of document NKP/dp10.lstmf Loaded 13/13 lines (1-13) of document NKP/dp1

Re: [tesseract-ocr] traning devanagari: »Encoding of string failed!«

2019-04-23 Thread Shree Devi Kumar
build/NKP_int-eval.txt: 142 words 106 75% common 0 0% inserted 36 25% changed On Tue, Apr 23, 2019 at 2:52 PM Shree Devi Kumar wrote: > zip file is too big. Let me do an alternative upload. > > Training runs ok for me - > > Warning: LSTMTrainer deserialized an LSTMRecognize

Re: [tesseract-ocr] traning devanagari: »Encoding of string failed!«

2019-04-23 Thread Shree Devi Kumar
e > below »WordStr«! > > Kind regards, > Jochen > > > Am 23.04.19 um 12:02 schrieb Shree Devi Kumar: > > Uploaded the files at https://github.com/Shreeshrii/tessdata_sanskrit > > See NKP.sh and folder NKP > > The first part of the script loops through the image

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Shree Devi Kumar
Which eng.traineddata did you use? There are three options: from tessdata, tessdata_best and tessdata_fast. On Fri, 26 Apr 2019, 09:19 Giriraj Bhojak, wrote: > Hello Shree, > > I realize this post is more than two years old now, but would appreciate > any help. > I tried your

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Shree Devi Kumar
thub.com/zdenop> released this on Jun 1, 2017 · 26 commits <https://github.com/tesseract-ocr/tesseract/compare/3.05.01...3.05> to 3.05 since this release On Fri, Apr 26, 2019 at 9:24 PM Giriraj Bhojak wrote: > Hi Shree, > > Thank you for quick response. > I used the trained data by

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Shree Devi Kumar
since it uses LSTM based OCR > engine. Higher accuracy is one of the essential requirements for my usecase. > Would you know if v4 supports extracting text from a two column text > structure image file at all? > Thank you for your quick response Shree! > > Regards, > Girir

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Shree Devi Kumar
issues with layout analysis. You could try other means of selecting text regions and using tesseract on those. On Sat, 27 Apr 2019, 02:57 Giriraj Bhojak, wrote: > Hi Shree, > > I just tried the v3.05.02 as well for different modes and I still couldn't > produce the output as yo

Re: [tesseract-ocr] Re: Recognition of "5" instead of "S"

2019-04-28 Thread Shree Devi Kumar
Finetuning with the Courier font, using a training text similar to the image you are recognizing and with more samples of 5, will give a better result. On Sun, 28 Apr 2019, 20:19 RangerRick, wrote: > Ok. Now I have tried the "best" traindata file (no difference) and > removing the alpha layer (no difference). I

Re: [tesseract-ocr] Re: Recognition of "5" instead of "S"

2019-04-28 Thread Shree Devi Kumar
@tesseract-ocr:~/TEST$ tesseract morse.jpg - -l engmorse --tessdata-dir ~/tesstutorial Warning: Invalid resolution 0 dpi. Using 70 instead. Estimating resolution as 125 3AMWA DE FASMX QF5MXQ CQ CQ DE F5MXQ F5MXQ CQ DE F5MXQ ENS5MAA I III F5MXQ F5MXQ NHE K On Mon, Apr 29, 2019 at 12:20 AM Shree Devi

Re: [tesseract-ocr] Editing Box files

2019-04-28 Thread Shree Devi Kumar
It means that the font you are using has mapped English letters to these symbols. If you view the box file in that same font the symbols should show. Possibly the numbers for coordinates will also show up as symbols, based on the mapping. On Mon, Apr 29, 2019 at 9:21 AM anne wrote: > Haloo, I wa
