Re: [tesseract-ocr] tesseract-ocr

2018-06-19 Thread Shree Devi Kumar
Which version of tesseract? How did you train the fonts? What was the accuracy level for training? How many iterations? On Tue, Jun 19, 2018 at 3:00 PM Navaneetha Bitla wrote: > Hi, this is Navaneetha > > i'm working in hand written character recognition project. > > I have trained 1300 different

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-20 Thread Shree Devi Kumar
I had done a training for sanskrit for both devanagari and IAST but it does not include cedilla for Sh I will add it and let you know. On Wed 20 Jun, 2018, 1:17 AM yajva, wrote: > I have tried Google OCR for recognizing Sanskrit text in Roman with > diacritics (IAST). It recognizes above macron

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
You will have better control on training if you use tesstrain.sh provided with tesseract. On Wed, Jun 20, 2018 at 8:52 PM Navaneetha Bitla wrote: > http://www.1001fonts.com/handwritten-fonts.html. > > the above link has 1900+ fonts from that site i have downloaded the ttf > files of fonts and co

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-20 Thread Shree Devi Kumar
I am attaching the OCRed text. Please correct it so that I can use as groundtruth for further training and testing. On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar wrote: > I had done a training for sanskrit for both devanagari and IAST but it > does not include cedilla for Sh > >

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
at 9:05 PM Navaneetha Bitla wrote: > can you help us by saying how to train with tesstrain.sh > > It will help all of us, we are thankful to you. > > On Wed, Jun 20, 2018 at 8:59 PM, Shree Devi Kumar > wrote: > >> You will have better control on training if you u

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
data \ --eval_listfile $eval_output_dir/$Lang.training_files.txt \ --verbosity 0 echo "## EVAL FINETUNED MODEL ##" lstmeval \ --model $trained_output_dir/$Lang-finetune-$Lang.traineddata \ --eval_listfile $eval_output_dir/$Lang.training_files.txt \ --verbosity 0 fi On W

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
Here are the bash script files: 1. for finetune for impact training - add a font 2. for finetune plus-minus training - for adding a new character On Thu, Jun 21, 2018 at 1:40 AM Shree Devi Kumar wrote: > Attached is a BASH script for Finetune training for 'Impact' (refer to > R

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
> > Thank you very much sir Ma'am, not Sir. I am Mrs. Kumar. Let me know if you have any questions or need clarification regarding the scripts. I will post them on the wiki after any needed changes. > > -- You received this message because you are subscribed to the Google Groups "tesseract-o

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
Tesseract4 LSTM training is line based. On Thu 21 Jun, 2018, 12:25 PM chandra churh chatterjee, < chandrachurh.chatterje...@gmail.com> wrote: > Excuse me @Shree Devi Kumar can you please tell me whether data for > training tesseract 4.0 would be better if the data has image

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
n 21, 2018 at 12:24 PM, chandra churh chatterjee < > chandrachurh.chatterje...@gmail.com> wrote: > >> Excuse me @Shree Devi Kumar can you please tell me whether data for >> training tesseract 4.0 would be better if the data has images which have >> paragraphed hand writt

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
> Quite a few of these handwriting fonts are uppercase letters only (so lowercase come out as uppercase when typed) . What is the best type of [lang].training_text data to use for training these - is it uppercase only? It would depend on the application where training is being used. If you want s

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
You can use ALL fonts at once. However, I have had errors with box files not being created for some fonts, and the tesstrain_utils.sh script dies only at the end, while checking whether the files are readable. In that case you have to restart the process. On Thu, Jun 21, 2018 at 8:28 PM James Q w

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
PM UTC+3, shree wrote: >> >> Here are the bash script files: >> >> 1. for finetune for impact training - add a font >> 2. for finetune plus-minus training - for adding a new character >> >> On Thu, Jun 21, 2018 at 1:40 AM Shree Devi Kumar >> wrote: >>

Re: [tesseract-ocr] Getting error while creating .lstm files

2018-06-21 Thread Shree Devi Kumar
Look at src/training/language_specific.sh The list of default fonts for English is being picked up from there and you probably don't have them installed. Use fonts that are available. On Fri, Jun 22, 2018 at 9:20 AM Harathi Surya wrote: > Hi, > > I am trying to create .lstm files to finetune t

Re: [tesseract-ocr] Re: Word coordinate for single lines.

2018-06-22 Thread Shree Devi Kumar
Please try with a different psm and see if you get better results. If you share a sample image we can test and respond. On Fri, Jun 22, 2018 at 5:29 PM wrote: > Could someone please try to give me an answer for my language. > > On Friday, June 15, 2018 at 2:42:00 PM UTC+2, ahka.an...@gmail.com w

Re: [tesseract-ocr] Re: Word coordinate for single lines.

2018-06-22 Thread Shree Devi Kumar
Try adding a slight white border to images and see if that helps. On Fri, Jun 22, 2018 at 7:35 PM wrote: > > > > >

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-22 Thread Shree Devi Kumar
done >> >> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote: >>> >>> I am attaching the OCRed text. Please correct it so that I can use as >>> groundtruth for further training and testing. >>> >>> On Wed, Jun 20, 2018 at 3:15 PM

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-22 Thread Shree Devi Kumar
Sorry, there seems to be some regression in the file posted on github. I will upload again later. On Fri, Jun 22, 2018 at 7:56 PM Shree Devi Kumar wrote: > Please try with iast.traineddata model for tesseract.4.0.0-beta posted at > https://github.com/Shreeshrii/tessdata_sanskrit > >

Re: [tesseract-ocr] Re: Not getting correct output even after finetuning tesseract with new character

2018-06-22 Thread Shree Devi Kumar
Did you run the eval as given in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters Did you stop training and create a new traineddata file? Are you using the new traineddata file for testing? On Sat, Jun 23, 2018 at 12:36 AM Harathi Surya

Re: [tesseract-ocr] Re: Getting error while creating .lstm files

2018-06-22 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files On Sat, Jun 23, 2018 at 12:55 AM Harathi Surya wrote: > Hi Shree, > > Thanks for your reply. > I replaced fontlist argument 'Impact Condensed' with 'DejaVu Sans' to > create evalplusminus folder. >

Re: [tesseract-ocr] Re: Getting error while creating .lstm files

2018-06-22 Thread Shree Devi Kumar
The tutorial has been written by Ray Smith. I haven't tested the plus-minus as given. Check whether the fonts you are using have the plus-minus sign. Using one font is for the IMPACT tutorial with 400 iterations. For plus-minus you need to use the larger list of fonts. On Sat, Jun 23, 2018 at 1

Re: [tesseract-ocr] why my hocr file look like this

2018-06-23 Thread Shree Devi Kumar
tesseract test.png result horc You used the wrong config file name. It should be hocr, not horc. On Sat, Jun 23, 2018 at 12:23 PM Ben Zhang wrote: > Hi, All, > I used tesseract 3.05, and type 'tesseract test.png result horc' in > command line, get result.horc, in this file it has: > > *Provider* *Network

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-23 Thread Shree Devi Kumar
dtruth for further training and testing. >>> >>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar >>> wrote: >>> >>>> I had done a training for sanskrit for both devanagari and IAST but it >>>> does not include cedilla for Sh >>>

Re: [tesseract-ocr] "read_params_file: parameter not found: " for hindi

2018-06-25 Thread Shree Devi Kumar
It looks like you are using the wrong version of the traineddata file, i.e. a 3.0x hin.traineddata with code for tesseract 4.0.0. On Mon, Jun 25, 2018 at 1:01 PM Kiran Sonar wrote: > Hi, > I am trying to get Hindi text from attached image. But when i set language > to "hin" i get "read_params_file: parameter no

Re: [tesseract-ocr] Re: "read_params_file: parameter not found: " for hindi

2018-06-25 Thread Shree Devi Kumar
I am not familiar with Tess4J 3.8.4. Have you tried it directly from the command line? It is also possible that you are not using the correct syntax for the command and the language name is being used as the output file name. Try the following: tesseract input.png output -l hin On Mon, Jun 25, 2018 at

Re: [tesseract-ocr] java.lang.UnsatisfiedLinkError: The specified module could not be found.

2018-06-26 Thread Shree Devi Kumar
Please post in https://github.com/nguyenq/tess4j/issues On Tue, Jun 26, 2018 at 1:30 PM Kiran Sonar wrote: > I moved to tess4j_4.00 from tess4J_3.8.4 which is neccesary to use new > trainedData- best files. I am getting this error > Exception in thread "main" java.lang.UnsatisfiedLinkError: The

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-26 Thread Shree Devi Kumar
se for testing. >>>> >>>> >>>> On Thu, Jun 21, 2018 at 11:38 PM yajva wrote: >>>> >>>>> one more correction. >>>>> >>>>> >>>>> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote: >>

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-27 Thread Shree Devi Kumar
<https://www.google.com/url?q=https%3A%2F%2Fgithub.com%2FShreeshrii%2Ftessdata_sanskrit%2Ftree%2Fmaster%2Fiast-plus1&sa=D&sntz=1&usg=AFQjCNHSTndmiJUoozyMRJ7OpHzTKIqYLw> >>>>>> >>>>>> Need to check that is it not overfitted. >>>>>>

Re: [tesseract-ocr] java.lang.UnsatisfiedLinkError: The specified module could not be found.

2018-06-28 Thread Shree Devi Kumar
< chandrachurh.chatterje...@gmail.com> wrote: > @Shree Devi Kumar , > Can I get a complete detailed description of the Neural Network > Architecture of the Tesseract 4 with diagram relating to what the net_spec > command line of lstm training specifies. > > On Tue, Jun 26, 2018 at

Re: [tesseract-ocr] How come tesseract 4.0 misses, what am I missing here?

2018-06-28 Thread Shree Devi Kumar
Rotate your shot to correct orientation and try. On 6/28/18, cohengil...@gmail.com wrote: > I'm quite new to tesseract and would like to use it in a project for OCR > purposes, > I found a tutorial on the web with photos, so I have executed tesseract > (tesseract 4.0.0-beta.2) on it, > and notic

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
I modified the makefile for ocrd-train to do fine-tuning. It is pasted below: export SHELL := /bin/bash LOCAL := $(PWD)/usr PATH := $(LOCAL)/bin:$(PATH) HOME := /home/ubuntu TESSDATA = $(HOME)/tessdata_best LANGDATA = $(HOME)/langdata # Name of the model to be built MODEL_NAME = frk # Name of

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
Vts > File data/train/mueller_waldhornist_1821_0130_010.lstmf page 0 : > !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244 > !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244 > Makefile:113: recipe for target 'data/checkpoints/eng_checkpoint' failed >

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
> ​ The problem was a "-gt.txt" rather than a ".gt.txt" as in my train files. Now I can run your script directly. Oh, I remember now. I had changed that for ease in renaming files for some reason. > In this way can I train a model that, for example, only recognize uppercase characters, or numbers
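
The suffix mismatch mentioned above ("-gt.txt" versus ".gt.txt") can be fixed in bulk rather than by hand. A minimal sketch, assuming all ground-truth files sit in one directory; the function name rename_suffix is made up for illustration and is not part of ocrd-train, and which suffix you need depends on the Makefile you are running:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ocrd-train): rename every file in DIR
# that ends in FROM so that it ends in TO instead.
# usage: rename_suffix DIR FROM TO
rename_suffix() {
    local dir="$1" from="$2" to="$3" f
    for f in "$dir"/*"$from"; do
        [ -e "$f" ] || continue          # the glob matched nothing
        mv -- "$f" "${f%"$from"}$to"     # strip the old suffix, append the new one
    done
}
```

For example, `rename_suffix data/train -gt.txt .gt.txt` converts to the ".gt.txt" naming, and swapping the last two arguments converts back. The directory name is only an example.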

Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-06-30 Thread Shree Devi Kumar
see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training On Sat, Jun 30, 2018 at 3:23 PM john wrote: > Encoding of string failed! Failure bytes: ffc2 ffa9 20 ffd8 > ffa8 ffd8 ffa7 ffd8 ffae ffd8 ffaa ffd9

Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-06-30 Thread Shree Devi Kumar
Then there must be a mismatch between the unicharset you are using and the training text. For example, check whether the copyright symbol is in your unicharset. On Sat, Jun 30, 2018 at 4:48 PM john wrote: > I saw that link. this error occured many times,how can i prevent that? > > On Saturday, June 30, 2
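
One rough way to run that check from the shell is sketched below. It assumes the usual unicharset text layout, where the first line is the entry count and every later line begins with the character followed by a space; the function name in_unicharset is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if CHAR appears as an entry in the
# unicharset FILE, fail (exit 1) otherwise.
# usage: in_unicharset CHAR FILE
in_unicharset() {
    local char="$1" file="$2"
    # Skip line 1 (the entry count) and compare the first field of each entry.
    awk -v c="$char" 'NR > 1 && $1 == c { found = 1 } END { exit !found }' "$file"
}
```

Something like `in_unicharset '©' fas.unicharset || echo "missing"` would then flag a character that the training text uses but the unicharset lacks. The file name here is only an example.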

Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-06-30 Thread Shree Devi Kumar
Also check that there is no tab or other unprintable character in your training text. Which version of tesseract are you using? show output of tesseract -v On Sat, Jun 30, 2018 at 8:04 PM Shree Devi Kumar wrote: > Then there must be a mismatch between the unicharset you are using and
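
A quick way to look for tabs and other control characters in a training text is sketched below. It only catches ASCII control characters (bytes 0-31 and 127), so invisible Unicode characters would need a separate check; the function name find_bad_chars is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "LINE_NUMBER:line" for every line of FILE that
# contains a tab or another ASCII control character.
# LC_ALL=C makes grep work byte-wise, so UTF-8 letters are not flagged.
find_bad_chars() {
    LC_ALL=C grep -n '[[:cntrl:]]' "$1"
}
```

No output (and a non-zero exit status) means no such characters were found.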

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-30 Thread Shree Devi Kumar
ogle.com/url?q=https%3A%2F%2Fgithub.com%2FShreeshrii%2Ftessdata_sanskrit%2Ftree%2Fmaster%2Fiast-plus1&sa=D&sntz=1&usg=AFQjCNHSTndmiJUoozyMRJ7OpHzTKIqYLw> >>>>>>> >>>>>>> Need to check that is it not overfitted. >>>>>>> >>

Re: [tesseract-ocr] parameter not found: tessedit_ocr_psm_mode

2018-07-01 Thread Shree Devi Kumar
What is the output of the following commands? tesseract -v; which tesseract; which tesstrain.sh On Sun, Jul 1, 2018 at 8:39 PM Zohreh Khosrobeygi wrote: > Hi, > when i use the tesstrain.sh, I have been getting this error that is about > my fas.config. My config file is: > > tessedit_ocr_engine_mode 1 > tessedit_ocr_

Re: [tesseract-ocr] parameter not found: tessedit_ocr_psm_mode

2018-07-01 Thread Shree Devi Kumar
The correct variable is tessedit_pageseg_mode. On Sun, Jul 1, 2018 at 8:51 PM Shree Devi Kumar wrote: > what's the output for ? > > tesseract -v > > which tesseract > > which tesstrain.sh > > On Sun, Jul 1, 2018 at 8:39 PM Zohreh Khosrobeygi > wrote: > >>
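
If the wrong key is already sitting in a config file, a one-line rewrite along these lines would print a corrected copy. This is only a sketch: the function name fix_psm_key is made up, and any value after the key is passed through untouched.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print FILE with the unknown key tessedit_ocr_psm_mode
# replaced by tessedit_pageseg_mode. The original file is left unchanged.
fix_psm_key() {
    sed 's/^tessedit_ocr_psm_mode\([[:space:]]\)/tessedit_pageseg_mode\1/' "$1"
}
```

Redirect the output to a new file and compare it with the old one before replacing anything.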

Re: [tesseract-ocr] Train 2 language together

2018-07-01 Thread Shree Devi Kumar
The font being used does not support English. On Sun, Jul 1, 2018 at 10:06 PM Zohreh Khosrobeygi wrote: > Hi, > I have been training the text: > > 272-135031- BECAUSE YOU WERE SLEEPING INSTEAD OWHILE POOR SHAGGY SITS > THERE A COOING DOVE > فیلم و و , منابع سال آگهی آخرين آخرین بود. ساخت و

Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-07-02 Thread Shree Devi Kumar
com/UB-Mannheim/tesseract/tree/v4.0.0-beta.1.20180414> >> >> On Saturday, June 30, 2018 at 7:13:30 PM UTC+4:30, shree wrote: >>> >>> Also check that there is no tab or other unprintable character in your >>> training text. >>> >>> Whi

Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-07-02 Thread Shree Devi Kumar
also see https://github.com/tesseract-ocr/tesseract/issues/549 On Mon, Jul 2, 2018 at 7:45 PM Shree Devi Kumar wrote: > You can use find_fonts with your training_text to locate the fonts to use. > > Modify the following command to match your directory setup and try > > echo "

Re: [tesseract-ocr] A friendly suggestion for the "tesseract-ocr" group members (Concern to all members)

2018-07-03 Thread Shree Devi Kumar
I have added a wiki page at https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-different-versions and updated for 3.04 and 4.0alpha. You can update for older versions. On Tue, Jul 3, 2018 at 1:30 AM wrote: > It seems with all languages and revisions, people (including me) tend to >

Re: [tesseract-ocr] how to improve dot-matrix digits recognize accuracy

2018-07-06 Thread Shree Devi Kumar
You could try finetuning for the dotmatrix font. On Fri, Jul 6, 2018 at 3:43 PM Wenjie Chen wrote: > Hi folks, > > Below is the dot-matrix digits picture, *tesseract *recognize it > uncorrect without any pre-processing. > >

Re: [tesseract-ocr] Really poor performance with decimal numbers

2018-07-06 Thread Shree Devi Kumar
try --psm 6 On Fri, Jul 6, 2018 at 2:23 PM Alberto Andreotti wrote: > Hello, > > I'm having problems with the simplest image possible. > It's a screenshot from GEdit(Ubuntu's text editor), with numbers and > points. This is what I get, > > 23.78 > 15 > 1.6 > 17.6 > 25 > 225 > 2235 > 0.5 > > Albe

Re: [tesseract-ocr] Re: Explanation for training_text and wordlist files

2018-07-06 Thread Shree Devi Kumar
See the following link to comment by Ray regarding building of Training data https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951 On Fri 6 Jul, 2018, 10:38 PM James Q, wrote: > No tool I can think of. What I would do is edit the file in a large text > file editor (such a

Re: [tesseract-ocr] Re: Explanation for training_text and wordlist files

2018-07-06 Thread Shree Devi Kumar
Also see a community contributed perl script for generating langdata in https://github.com/tesseract-ocr/tesseract/tree/master/contrib On Fri 6 Jul, 2018, 10:52 PM Shree Devi Kumar, wrote: > See the following link to comment by Ray regarding building of Training > data > > > htt

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-07-11 Thread Shree Devi Kumar
hub.com/Shreeshrii/tessdata_shreetest/blob/master/iast-layer-18003.traineddata >>> >>> Attached is the OCRed output for pages 13-24 of dark pdf with it. >>> >>> I am still training a different variation. >>> >>> >>> >>> On Wed,

Re: [tesseract-ocr] why tesseract gives junk value for japanese language?

2018-07-12 Thread Shree Devi Kumar
Try traineddata from tessdata_best and tessdata_fast On Thu 12 Jul, 2018, 6:45 PM mahendrag gajera, wrote: > Hello all > > I am try to ocr japanese images via below code. But it give junk character. > My tesseract version is 4.0 > > Please let me know what is missing here. > > void Test(char* im

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-07-12 Thread Shree Devi Kumar
> Will wait for next ver. >>>> >>>> >>>> On Sunday, July 1, 2018 at 12:21:19 AM UTC+5:30, shree wrote: >>>>> >>>>> I have uploaded a new version of traineddata file at >>>>> >>>>> https://github.com/

Re: [tesseract-ocr] How to use tesseract 4 engineMode 2 ( Legacy + LSTM engines)?

2018-07-12 Thread Shree Devi Kumar
The traineddata files can hold both types of models. The OCR Engine mode chooses which ones get used. https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#format-of-traineddata-files On Fri, Jul 13, 2018 at 9:31 AM 于洋 wrote: > Tesseract 4 introduced new LSTM engine. The LSTM engine needs

Re: [tesseract-ocr] using tesseract4 works fine but with oem 0 "couldn't load any languages"

2018-07-14 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/blob/master/unittest/osd_test.cc On Sat 14 Jul, 2018, 8:23 PM simon mackenzie, wrote: > I am using tesseract4 and all working fine with english. However > tesseract4 cannot detect page orientation so I want to use tesseract3 for > this. > > I though

Re: [tesseract-ocr] "lstmtraining" stopped but not finished?

2018-07-15 Thread Shree Devi Kumar
Did you figure out what was causing this? On Thu, Jul 12, 2018 at 8:15 AM Dd U wrote: > Hello guys please help me. > > > I'm trying to training for improve Japanese language. Then I have a > problem now. > > > lstmtraining is stopped but not finished. It does not using CPU anymore > and nothing

Re: [tesseract-ocr] multiple problem with fine tuning

2018-07-16 Thread Shree Devi Kumar
> first of all in some words in tiff files the characters are not joined. Make sure to include ZWNJ and ZWJ in your unicharset. > box file generated is from left to right but it should be RTL According to Ray that is intentional. > is using lstmtraining.exe the next and final step Yes. tesst

Re: [tesseract-ocr] The exposure option for tesstrain.sh

2018-07-16 Thread Shree Devi Kumar
Simply put, the exposure setting is similar to a scanner's setting: -1 and -2 make the text lighter; 1, 2, 3, etc. make it darker and thicker. On Tue 17 Jul, 2018, 6:37 AM 'John Lee Ward' via tesseract-ocr, < tesseract-ocr@googlegroups.com> wrote: > Does anyone know of a document or can someone expl

Re: [tesseract-ocr] Questions about training korean language in tesseract 4.0

2018-07-19 Thread Shree Devi Kumar
Using tesstrain.sh with korean training text. You can see the format of generated box files through that. On Thu, Jul 19, 2018 at 12:06 PM Soumik Ranjan Dasgupta < srd1...@cse.jgec.ac.in> wrote: > 2) For checking the fonts used in generating the traineddata for your > language, you can see trai

Re: [tesseract-ocr] What is the purpose of trained data files present under tessdata/script folder

2018-07-19 Thread Shree Devi Kumar
Files in tessdata are for a particular language eg. Hindi, Sanskrit, Marathi, Nepali. Files in tessdata/script are for a particular script used for writing the languages eg. Devanagari. Also note that most script files also include support for English. So, if you have a document with Hindi+Engli

Re: [tesseract-ocr] How to train by tesseract 4.00

2018-07-20 Thread Shree Devi Kumar
Please ask at https://github.com/OCR-D/ocrd-train/issues for ocr-d related questions. On Fri, Jul 20, 2018 at 11:36 AM Emiliano Isaza Villamizar wrote: > Hi Shree, > > I've been trying to use this repo but I keep getting this error when I run > any target with OCR-D. > > On Sunday, June 3, 2018

Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-21 Thread Shree Devi Kumar
--linedata_only\ You need a space before the continuation mark \ On Sat 21 Jul, 2018, 10:00 PM , wrote: > can u please point out the place where to put the space > > thank you > > On Saturday, July 21, 2018 at 12:12:22 PM UTC-4, thiyam...@gmail.com > wrote: >> >> My command is >> >> >> usr/share/
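
The effect of the missing space is easy to see with a tiny shell experiment. In this sketch, count_args is a throwaway function that just prints how many arguments it received, and the two option names stand in for a real tesstrain.sh command line:

```shell
#!/usr/bin/env bash
# Throwaway helper: print the number of arguments received.
count_args() { echo "$#"; }

# No space before the backslash, and the next line starts in column 1:
# the shell joins both lines into the single word
# "--linedata_only--output_dir".
count_args --linedata_only\
--output_dir

# Space before the backslash: the options stay two separate words.
count_args --linedata_only \
--output_dir
```

The first call prints 1 and the second prints 2, which is why the fused option is reported as unrecognized.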

Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-22 Thread Shree Devi Kumar
needs two dashes, On Sun, Jul 22, 2018 at 12:29 PM wrote: > hello again, i modified the error in the way you said and there is no > error. but now the same error of unrecognised is occured in output_dir. > the error is > ERROR: Unrecognized argument -–output_dir > > my command is > > /usr/share/
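
The quoted error shows a hyphen followed by an en dash, which is the kind of thing a word processor or web page produces when a command is copied through it. A small sketch for spotting such characters in a saved command or script; the function name find_fancy_chars is made up, and the curly-quote patterns are an extra beyond the dash problem discussed here:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "LINE_NUMBER:line" for every line of FILE that
# contains an en dash, an em dash, or a curly double quote.
# -F treats each -e pattern as a fixed string, not a regular expression.
find_fancy_chars() {
    grep -nF -e '–' -e '—' -e '“' -e '”' "$1"
}
```

Any line it reports needs the character retyped as a plain ASCII hyphen or straight quote.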

Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-22 Thread Shree Devi Kumar
able > ERROR: /tmp/tmp.pBWa4wRHmt/ben/ben.“lohit-bengali”.exp0.box does not exist > or is not readable > > SO , please tell is all the fonts which are in this FONTS folder are > already installed to tesseract or not? > > > On Sun, Jul 22, 2018 at 7:15 AM, Jennil Thiyam

Re: [tesseract-ocr] Unnecessary extra space with Japanese.traineddata

2018-07-23 Thread Shree Devi Kumar
Which tessdata repository are you using for your trained data files? tessdata tessdata_best tessdata_fast On Tue 24 Jul, 2018, 9:01 AM Atsuyoshi Suzuki, wrote: > Hi. > > I tried new tesseract and traineddata for Japanese (both jpn.traineddata > and Japanese.traineddata). > > It's very good r

Re: [tesseract-ocr] Assert failed:in file weightmatrix.cpp, line 244

2018-07-23 Thread Shree Devi Kumar
Which version of tesseract are you using? Please post output of tesseract -v On Tue 24 Jul, 2018, 2:26 AM Emiliano Isaza Villamizar, wrote: > Hello everyone, > > > 'm trying to train tesseract to improve the detection of some prices such > as: CN¥2,400.48. I got got to a point that I keep gett

Re: [tesseract-ocr] Unnecessary extra space with Japanese.traineddata

2018-07-24 Thread Shree Devi Kumar
Please see https://github.com/tesseract-ocr/tessdata_fast#example---jpn-and--japanese for Ray's comment regarding the 'script' traineddata. preserve_interword_spaces 1 was added via jpn.config to jpn.traineddata file and other CJK languages to fix this issue - see https://github.com/tesseract

Re: [tesseract-ocr] Re: How to use the "latin sanskrit" language?

2018-07-26 Thread Shree Devi Kumar
There is no official traineddata for san_latn or IAST. I have created some experimental versions but the output is not fully accurate. On Fri 27 Jul, 2018, 12:21 AM John Muccigrosso, wrote: > You're telling tesseract that your text is in Latin. You need the > traineddata for san-lat. > > -- >

Re: [tesseract-ocr] Re: How to use the "latin sanskrit" language?

2018-07-26 Thread Shree Devi Kumar
You can try the IAST ones from https://github.com/Shreeshrii/tessdata_shreetest?files=1 On Fri 27 Jul, 2018, 8:27 AM Shree Devi Kumar, wrote: > There is no official traineddata for san_latn or IAST. I have created some > experimental versions but the output is not fully accurate. > > >

Re: [tesseract-ocr] Can't symlink into tessdata anymore?

2018-07-27 Thread Shree Devi Kumar
@zdenko podobny Please see https://github.com/tesseract-ocr/tessdata/issues/18 ita.special-words missing #18 On Fri, Jul 27, 2018 at 11:55 AM Zdenko Podobny wrote: > symlink is filesystem feature and tesseract use standard C++ function for > reading/writing files from filesystem, so there is n

Re: [tesseract-ocr] tesseract-4.0.0-beta.3 - testing problem

2018-07-28 Thread Shree Devi Kumar
Test related info has been moved to a new repo under tesseract-ocr https://github.com/tesseract-ocr/test You need to update that submodule (similar to googletest) for all files to be available. It's possible that the wiki has not been updated for the same, you can add appropriate instructions to

Re: [tesseract-ocr] combine_tessdata. Failed to read /usr/share/tesseract-ocr/tessdata/foo.traineddata

2018-07-29 Thread Shree Devi Kumar
Continue_from should be used when you want to train a new language based on an existing language or to add some characters to an existing language. There is no existing language called 'foo' - you should replace it with the lang code for the language you are training. On Sun, Jul 29, 2018 at 9:44

Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-02 Thread Shree Devi Kumar
Please use latest scripts from https://github.com/OCR-D/ocrd-train On Fri, Aug 3, 2018 at 4:41 AM May wrote: > > > > > >

Re: [tesseract-ocr] Error on combine_lang_model script; Null char=2 Invalid format in radical table at line 4: 3400 1.4 Creation of encoded unicharset failed!! Error writing recoder!!

2018-08-05 Thread Shree Devi Kumar
You are using an old version of tesseract. Please use the latest version from github. Make sure you remove/uninstall the old version. Your error is related to the radical-stroke file in langdata. Make sure you use the latest version of the langdata repo. >Invalid format in radical table at line 4: 3400 1.4 O

Re: [tesseract-ocr] tesseract-4.0.0-beta.3 - testing problem

2018-08-06 Thread Shree Devi Kumar
ree Devi Kumar: > > Test related info has been moved to a new repo under tesseract-ocr > > https://github.com/tesseract-ocr/test > > > > You need to update that submodule (similar to googletest) for all files > > to be available. > > > > It's possible th

Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-06 Thread Shree Devi Kumar
The OCR-D scripts are geared towards tesseract 4.0.x. You are trying to use them with tesseract 3.05. On Tue 7 Aug, 2018, 10:50 AM May, wrote: > Hey Shree > > I also tried with the orignal script from the github. But faced the same > issue with the process stuck at unicharset_output. > > >

Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-07 Thread Shree Devi Kumar
LSTM training can take weeks, days, or hours, depending on the options chosen. You have given a complete network spec, so that is training from scratch. Please see the following training wiki page for training related info: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 On Tue

Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-07 Thread Shree Devi Kumar
language not fully supported ? Answering these questions will help you decide what training, if any, is required. On Tue, Aug 7, 2018 at 1:59 PM Shree Devi Kumar wrote: > lstm training can take weeks, days, hours depending on the options chosen. > > you have given complete network spec

Re: [tesseract-ocr] tesseract not able to detect handwritten text even after improving image quality

2018-08-07 Thread Shree Devi Kumar
see FAQ https://github.com/tesseract-ocr/tesseract/wiki/FAQ#can-i-use-tesseract-for-handwriting-recognition Recently a lot of people have tried to train 4.0 using handwriting fonts, however, there has been no report as to the level of success they have had doing it. On Tue, Aug 7, 2018 at 3:28

Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-07 Thread Shree Devi Kumar
Re finetuning - see https://github.com/tesseract-ocr/tesseract/issues/1782#issuecomment-411018986 Have you tried to provide each word separately (eg. using opencv ) for recognition?

Re: [tesseract-ocr] tesseract not able to detect handwritten text even after improving image quality

2018-08-08 Thread Shree Devi Kumar
https://groups.google.com/forum/#!searchin/tesseract-ocr/handwriting%7Csort:date On Wed, Aug 8, 2018 at 6:21 PM wrote: > Hi Shree, > > I'm still unable to extract images from given png can you please suggest > me any other links > > Regards > Rahul > > On Wednesday, August 8, 2018 at 11:29:07 AM

Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-08 Thread Shree Devi Kumar
I think this could happen if your new traineddata is not trained to as high an accuracy level as the eng traineddata. You can set up a debug log to verify this. See https://github.com/tesseract-ocr/tesseract/issues/1275#issuecomment-360367865 for details. On Wed, Aug 8, 2018 at 6:04 PM wrote: > i'm tr

Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-09 Thread Shree Devi Kumar
output tesseract.log file should be produced in the directory from where you are running the command, usually where your OCR output is created. On Thu, Aug 9, 2018 at 3:48 PM wrote: > Hello Shree, thank you for your prompt reply. > > I have now changed the logfile as instructed. Where can i find

Re: [tesseract-ocr] Re: tesseract training flags to rtl languages

2018-08-09 Thread Shree Devi Kumar
There is an Urdu traineddata for tesseract 4. Have you tried it? See https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017 You can also check script/Arabic which should also support Urdu. Please provide feedback as to its accuracy for Urd

Re: [tesseract-ocr] Re: tesseract training flags to rtl languages

2018-08-09 Thread Shree Devi Kumar
wrong in training. can you please point out. thank you. > > On Thu, Aug 9, 2018 at 7:39 PM Shree Devi Kumar > wrote: > >> There is an Urdu traineddata for tesseract 4. Have you tried it >> >> See >> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#up

Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-10 Thread Shree Devi Kumar
I do not know about the internal algorithms used by tesseract. If you are having accuracy issues with certain letters and digits, I would suggest that you fine-tune for impact using the images or a similar font. Please see the wiki page on training 4.0 for the command - look for fine tuning for new fon

Re: [tesseract-ocr] cannot install new version, please help me

2018-08-10 Thread Shree Devi Kumar
uninstall all versions of tesseract and libtesseract-dev then install using ppa from https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr On Sat, Aug 11, 2018 at 11:08 AM Kimchi wrote: > Environment > >- Tesseract Version: 3.04 >- Commit Number: 3.04 >- Platform: ubuntu 16.

Re: [tesseract-ocr] Training tools don't get built when building tesseract from souce

2018-08-12 Thread Shree Devi Kumar
sudo apt-get remove tesseract-ocr
sudo apt-get remove libtesseract-dev
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
The above will uninstall and then install the latest version binaries from the PPA. Why are y

Re: [tesseract-ocr] Training tools don't get built when building tesseract from souce

2018-08-14 Thread Shree Devi Kumar
libtool: install: /usr/bin/install -c .libs/combine_lang_model /usr/local/bin/combine_lang_model libtool: install: /usr/bin/install -c .libs/combine_tessdata /usr/local/bin/combine_tessdata libtool: install: /usr/bin/install -c .libs/dawg2wordlist /usr/local/bin/dawg2wordlist libtool: install: /usr

Re: [tesseract-ocr] Training tools don't get built when building tesseract from souce

2018-08-14 Thread Shree Devi Kumar
│ │ │ └── wordlist2dawg.o On Wed, Aug 15, 2018 at 9:35 AM Shree Devi Kumar wrote: > libtool: install: /usr/bin/install -c .libs/combine_lang_model > /usr/local/bin/combine_lang_model > libtool: install: /usr/bin/install -c .libs/combine_tessdata > /usr/local/bin/comb

Re: [tesseract-ocr] How to read texts from a table into arrays with tesseract, given the cooridnates of the column and row boundaries?

2018-08-15 Thread Shree Devi Kumar
Check whether the hOCR or TSV outputs are useful. On Wed, Aug 15, 2018 at 4:24 PM, Bec Zhao wrote: > Hi, > > I want to extract texts from tables into arrays that represents the rows > and columns of the table. > I have already used opensv to obtain the precise boundaries of the table, > now I want t
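A minimal sketch of how the TSV output could be mapped onto a table, assuming the column and row boundaries are already known as sorted pixel coordinates. The function name `tsv_to_table` and the parameters `col_edges` and `row_edges` are hypothetical, not part of tesseract; the column names (`level`, `left`, `top`, `width`, `height`, `text`) are those of tesseract's TSV header, where level 5 rows are words.

```python
import bisect
import csv
import io


def tsv_to_table(tsv_text, col_edges, row_edges):
    """Place each recognized word into the cell that contains its centre.

    col_edges / row_edges are sorted pixel coordinates of the table grid
    (hypothetical inputs, e.g. obtained beforehand from line detection).
    """
    n_rows, n_cols = len(row_edges) - 1, len(col_edges) - 1
    cells = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for rec in reader:
        text = (rec.get("text") or "").strip()
        # Level 5 rows are individual words; skip structural rows and blanks.
        if rec["level"] != "5" or not text:
            continue
        cx = int(rec["left"]) + int(rec["width"]) / 2
        cy = int(rec["top"]) + int(rec["height"]) / 2
        col = bisect.bisect_right(col_edges, cx) - 1
        row = bisect.bisect_right(row_edges, cy) - 1
        # Words whose centre falls outside the grid are ignored.
        if 0 <= row < n_rows and 0 <= col < n_cols:
            cells[row][col].append(text)
    return [[" ".join(words) for words in row] for row in cells]
```

The TSV text itself would come from running tesseract with the tsv config on the page image. Using the word centre rather than the full bounding box is a design choice that tolerates words slightly overlapping a ruling line.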

Re: [tesseract-ocr] Make lstm for some files

2018-08-16 Thread Shree Devi Kumar
You need to make an lstmf file for each of these. For example, tesseract fas.B_Mitra.exp0.tif fas.B_Mitra.exp0 --psm 6 lstm.train will create fas.B_Mitra.exp0.lstmf. On Thu, Aug 16, 2018 at 5:40 PM, Zohreh Khosrobeygi wrote: > I have some tif and box files for each font for example: > fas.B_Mitra.exp
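The per-file command above could be generated for a whole directory roughly as follows. This is only a sketch: the helper name `lstmf_commands` and the directory layout are assumptions, and it only builds the argument lists in the same form as the command quoted in the reply; it does not run tesseract itself.

```python
from pathlib import Path


def lstmf_commands(image_dir, psm=6):
    """Build one 'tesseract <tif> <base> --psm N lstm.train' command per .tif.

    Assumes the matching .box file sits next to each .tif, as in the
    original question.
    """
    cmds = []
    for tif in sorted(Path(image_dir).glob("*.tif")):
        base = tif.with_suffix("")  # fas.B_Mitra.exp0.tif -> fas.B_Mitra.exp0
        cmds.append(
            ["tesseract", str(tif), str(base), "--psm", str(psm), "lstm.train"]
        )
    return cmds
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`; a non-zero exit then raises, which makes it easy to see which image failed.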

Re: [tesseract-ocr] Infinite Loop of Compute CTC targets failed!

2018-08-17 Thread Shree Devi Kumar
Please build the latest code beta.4 and run the same test. On Fri, Aug 17, 2018 at 4:44 PM, wrote: > ### Environment > > * **Tesseract Version**: 4.0.0-beta.1-306-g45b11cd > * **Commit Number**: 4.0.0-beta.1-306-g45b11cd > * **Platform**: Ubuntu x86_64 GNU/Linux > ### Current Behavior: > > Infin

Re: [tesseract-ocr] Make lstm for some files

2018-08-19 Thread Shree Devi Kumar
exit 1 > > Tesseract -v: > tesseract 4.0.0-beta.1 > leptonica-1.74.4 > libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib > 1.2.8 > > Found AVX2 > Found AVX > Found SSE > > > > > > On Thu, Aug 16, 2018 at 6:28 PM Shree Devi Kumar &

Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-20 Thread Shree Devi Kumar
If you want to change parameters, please look at the replace layers option. With fine tuning you cannot change them. On Tue 21 Aug, 2018, 7:27 AM , wrote: > Is it possible to change the parameters when Fine Tuning? > > The documentation says "Fine tuning is the process of training an existing >

Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-21 Thread Shree Devi Kumar
rate = 0.001, momentum=0.5. I have not tried changing the parameters even with replace layers. Do provide feedback on your experience. On Tue, Aug 21, 2018 at 12:00 PM Jacob Biros wrote: > Thank you for the prompt response! > > On Tue, Aug 21, 2018 at 1:14 PM, Shree Devi Kumar > wrote: &

Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-21 Thread Shree Devi Kumar
When you specify the complete network spec as in --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \ It probably treats it as a training from scratch and ignores the continue_from. I haven't looked at how the training command is parsed. I just followed Ray's examples for th
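For comparison, a replace-layers run passes only the new top layers in --net_spec together with --append_index, rather than the complete network spec. The sketch below assembles such a command line; the helper name `replace_layers_cmd`, all file paths, and the default index and layer values are placeholders for illustration, not values confirmed in this thread. The flags themselves are lstmtraining options, and the output size in the spec (the `c111` part) would need to match the unicharset actually in use.

```python
def replace_layers_cmd(continue_from, traineddata, model_output, listfile,
                       append_index=5, new_layers="[Lfx256 O1c111]",
                       max_iterations=3600):
    """Assemble an lstmtraining command that cuts the existing network at
    append_index and appends new_layers (all argument values are placeholders)."""
    return [
        "lstmtraining",
        "--continue_from", continue_from,
        "--traineddata", traineddata,
        "--append_index", str(append_index),
        "--net_spec", new_layers,
        "--model_output", model_output,
        "--train_listfile", listfile,
        "--max_iterations", str(max_iterations),
    ]
```

Building the command as a list avoids shell quoting problems with the square brackets and spaces in the network spec.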

Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-21 Thread Shree Devi Kumar
> I'm not sure if we will move forward with replacing the layers, but I will > post about it if we do. > > Thanks again! > > On Tue, Aug 21, 2018 at 6:08 PM, Shree Devi Kumar > wrote: > >> >lstmtraining --model_output ./layer_from_deva/layer --continue_from >

Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-21 Thread Shree Devi Kumar
On Tue, Aug 21, 2018 at 1:16 PM wrote: > Sorry, one more question. We set up 4 different machines all running the > command below except for minor differences in the momentum and the learning > rate. Changing the momentum and learning rate in this situation, because > it is fine tuning, shouldn

Re: [tesseract-ocr] IF I could make .unicharset by box/tif pairs instead of fonts files by tesstrain.sh?

2018-08-27 Thread Shree Devi Kumar
When using tesstrain.sh, you can add --save_box_tiff to the command line. The original tesstrain.sh did not move box/tiff along with the lstmf files (they remained in the /tmp directory). I first modified it to move box/tiff in all cases along with the lstmf files. This option now gives the user the choice w

Re: [tesseract-ocr] Experiment with Thai language

2018-08-31 Thread Shree Devi Kumar
>Can't encode transcription: 'คุย เดีย ระบบ๑๙ 77 และมี." มิเมือง' in language '' I don't know what causes this kind of warning and how to solve it so I just continue the training. These are related to normalization and validation of the training text. Please see https://github.com/tesseract-ocr/te

Re: [tesseract-ocr] Which repo should I use? langdata_lstm or langdata?

2018-08-31 Thread Shree Devi Kumar
langdata_lstm has the source training data used for the 4.0.0alpha traineddata files. However, the exact usage of the files, or scripts to use them, has not been provided. The training text in langdata_lstm is much larger. If you want only to finetune, you should consider limiting the pages when running te

Re: [tesseract-ocr] Error in creating LSTM training data using tesstrain.sh

2018-09-01 Thread Shree Devi Kumar
> read_params_file: Can't open lstm.train lstm.train is a config file which is not found. It is there in tesseract/tessdata/configs. Make sure it is in your tessdata directory or your path and can be found. On Sun, Sep 2, 2018 at 3:40 AM, Shandigutt wrote: > Hi, > > I was trying to creat
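One way to check where the config file is missing is to look for configs/lstm.train under each candidate tessdata directory. This is a sketch only: the helper names are hypothetical, and the two fallback directories are assumptions about common install locations, so adjust them to the actual installation. TESSDATA_PREFIX is the environment variable tesseract reads to locate its data.

```python
import os
from pathlib import Path


def find_config(name, tessdata_dirs):
    """Return the first <tessdata>/configs/<name> that exists, else None."""
    for d in tessdata_dirs:
        candidate = Path(d) / "configs" / name
        if candidate.is_file():
            return candidate
    return None


def candidate_tessdata_dirs():
    """Directories worth checking; the fixed paths are assumed defaults."""
    dirs = []
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix:
        # Depending on the version, the prefix is the tessdata dir or its parent.
        dirs += [prefix, os.path.join(prefix, "tessdata")]
    dirs += ["/usr/share/tesseract-ocr/4.00/tessdata", "/usr/local/share/tessdata"]
    return dirs
```

If `find_config("lstm.train", candidate_tessdata_dirs())` returns None, copying the configs directory from the tesseract source tree (tesseract/tessdata/configs) into the tessdata directory is the fix described in the reply.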
