Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-20 Thread Shree Devi Kumar
I had done training for Sanskrit in both Devanagari and IAST, but it does not include the cedilla for Sh. I will add it and let you know. On Wed 20 Jun, 2018, 1:17 AM yajva, wrote: > I have tried Google OCR for recognizing Sanskrit text in Roman with > diacritics (IAST). It recognizes above macron

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
You will have better control on training if you use tesstrain.sh provided with tesseract. On Wed, Jun 20, 2018 at 8:52 PM Navaneetha Bitla wrote: > http://www.1001fonts.com/handwritten-fonts.html. > > the above link has 1900+ fonts from that site i have downloaded the ttf > files of fonts and co

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-20 Thread Shree Devi Kumar
I am attaching the OCRed text. Please correct it so that I can use as groundtruth for further training and testing. On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar wrote: > I had done a training for sanskrit for both devanagari and IAST but it > does not include cedilla for Sh > >

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
at 9:05 PM Navaneetha Bitla wrote: > can you help us by saying how to train with tesstrain.sh > > It will help all of us, we are thankful to you. > > On Wed, Jun 20, 2018 at 8:59 PM, Shree Devi Kumar > wrote: > >> You will have better control on training if you u

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
data \
  --eval_listfile $eval_output_dir/$Lang.training_files.txt \
  --verbosity 0
echo "## EVAL FINETUNED MODEL ##"
lstmeval \
  --model $trained_output_dir/$Lang-finetune-$Lang.traineddata \
  --eval_listfile $eval_output_dir/$Lang.training_files.txt \
  --verbosity 0
fi

On W

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
Here are the bash script files: 1. for finetune for impact training - add a font 2. for finetune plus-minus training - for adding a new character On Thu, Jun 21, 2018 at 1:40 AM Shree Devi Kumar wrote: > Attached is a BASH script for Finetune training for 'Impact' (refer to > R

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread Shree Devi Kumar
> > Thank you very much sir Ma'am, not Sir. I am Mrs. Kumar. Let me know if you have any questions or need clarification regarding the scripts. I will post them on the wiki after any needed changes. > > -- You received this message because you are subscribed to the Google Groups "tesseract-o

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
Tesseract4 LSTM training is line based. On Thu 21 Jun, 2018, 12:25 PM chandra churh chatterjee, < chandrachurh.chatterje...@gmail.com> wrote: > Excuse me @Shree Devi Kumar can you please tell me whether data for > training tesseract 4.0 would be better if the data has image

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
n 21, 2018 at 12:24 PM, chandra churh chatterjee < > chandrachurh.chatterje...@gmail.com> wrote: > >> Excuse me @Shree Devi Kumar can you please tell me whether data for >> training tesseract 4.0 would be better if the data has images which have >> paragraphed hand writt

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
> Quite a few of these handwriting fonts are uppercase letters only (so lowercase come out as uppercase when typed) . What is the best type of [lang].training_text data to use for training these - is it uppercase only? It would depend on the application where training is being used. If you want s

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
wrote: > Hi Shree, I'm trying out the script you posted earlier which is great so > thank you! I was wondering how many fonts I can specify at once in the > 'fonts_for_training' list. I have run it with 9 fonts at once and that > seems fine but I would like to do 100s

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread Shree Devi Kumar
, 2018 at 10:03 PM wrote: > @Shree > > Thanks for providing the two bash scripts > I want to ask you about tesstrain.sh and tesstrain_utils.sh, Is there > something that must be edited before running lstmtrain_finetune_impact.sh ? > > On Wednesday, June 20, 2018 at 11:56:27

Re: [tesseract-ocr] Getting error while creating .lstm files

2018-06-21 Thread Shree Devi Kumar
Look at src/training/language_specific.sh. The list of default fonts for English is picked up from there, and you probably don't have them installed. Use fonts that are available. On Fri, Jun 22, 2018 at 9:20 AM Harathi Surya wrote: > Hi, > > I am trying to create .lstm files to finetune t

Re: [tesseract-ocr] Re: Word coordinate for single lines.

2018-06-22 Thread Shree Devi Kumar
Please try with a different psm and see if you get better results. If you share a sample image we can test and respond. On Fri, Jun 22, 2018 at 5:29 PM wrote: > Could someone please try to give me an answer for my language. > > On Friday, June 15, 2018 at 2:42:00 PM UTC+2, ahka.an...@gmail.com w

Re: [tesseract-ocr] Re: Word coordinate for single lines.

2018-06-22 Thread Shree Devi Kumar
h3.googleusercontent.com/-E88ArfnXFP4/Wy0CMbscrVI/AAQ/YUhFh9aYMx0_CiqhK-qBVnX3l5YsyZ6FwCLcBGAs/s1600/24-block-0-L-25.png> > > Thanks for the reply > Those are two line examples. > > On Friday, June 22, 2018 at 3:59:23 PM UTC+2, shree wrote: >> >> Please try with a d

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-22 Thread Shree Devi Kumar
done >> >> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote: >>> >>> I am attaching the OCRed text. Please correct it so that I can use as >>> groundtruth for further training and testing. >>> >>> On Wed, Jun 20, 2018 at 3:15 PM

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-22 Thread Shree Devi Kumar
Sorry, there seems to be some regression in the file posted on github. I will upload again later. On Fri, Jun 22, 2018 at 7:56 PM Shree Devi Kumar wrote: > Please try with iast.traineddata model for tesseract.4.0.0-beta posted at > https://github.com/Shreeshrii/tessdata_sanskrit > >

Re: [tesseract-ocr] Re: Not getting correct output even after finetuning tesseract with new character

2018-06-22 Thread Shree Devi Kumar
Did you run the eval as given in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ? Did you stop training and create a new traineddata file? Are you using the new traineddata file for testing? On Sat, Jun 23, 2018 at 12:36 AM Harathi Surya

Re: [tesseract-ocr] Re: Getting error while creating .lstm files

2018-06-22 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files On Sat, Jun 23, 2018 at 12:55 AM Harathi Surya wrote: > Hi Shree, > > Thanks for your reply. > I replaced fontlist argument 'Impact Condensed' with 'DejaVu Sans'

Re: [tesseract-ocr] Re: Getting error while creating .lstm files

2018-06-22 Thread Shree Devi Kumar
18 at 1:13 AM Harathi Surya wrote: > Sorry by mistake uploaded the wrong file. Please find the attached file > for the output i got. > > Thanks, > Harathi > > On Friday, June 22, 2018 at 12:41:25 PM UTC-7, Harathi Surya wrote: >> >> Thanks Shree, >> >>

Re: [tesseract-ocr] why my hocr file look like this

2018-06-23 Thread Shree Devi Kumar
tesseract test.png result horc You used the wrong config file name. It should be hocr, not horc. On Sat, Jun 23, 2018 at 12:23 PM Ben Zhang wrote: > Hi, All, > I used tesseract 3.05, and type 'tesseract test.png result horc' in > command line, get result.horc, in this file it has: > > *Provider* *Network

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-23 Thread Shree Devi Kumar
> > On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote: >> >> done >> >> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote: >>> >>> I am attaching the OCRed text. Please correct it so that I can use as >>> groun

Re: [tesseract-ocr] "read_params_file: parameter not found: " for hindi

2018-06-25 Thread Shree Devi Kumar
It looks like you are using the wrong version of the traineddata file, i.e. the 3.0x hin.traineddata with code for tesseract 4.0.0. On Mon, Jun 25, 2018 at 1:01 PM Kiran Sonar wrote: > Hi, > I am trying to get Hindi text from attached image. But when i set language > to "hin" i get "read_params_file: parameter no

Re: [tesseract-ocr] Re: "read_params_file: parameter not found: " for hindi

2018-06-25 Thread Shree Devi Kumar
I am not familiar with Tess4J 3.8.4. Have you tried it directly from the command line? It is also possible that you are not using the correct syntax for the command and the language name is being used as the output file name. Try the following: tesseract input.png output -l hin On Mon, Jun 25, 2018 at

Re: [tesseract-ocr] java.lang.UnsatisfiedLinkError: The specified module could not be found.

2018-06-26 Thread Shree Devi Kumar
Please post in https://github.com/nguyenq/tess4j/issues On Tue, Jun 26, 2018 at 1:30 PM Kiran Sonar wrote: > I moved to tess4j_4.00 from tess4J_3.8.4 which is neccesary to use new > trainedData- best files. I am getting this error > Exception in thread "main" java.lang.UnsatisfiedLinkError: The

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-26 Thread Shree Devi Kumar
the same text. Here's the doc used for the first. > png. This is slightly darker, but the one sent earlier is cleaner. Let me > know which is more amenable for OCRing. I use PDF Shaper to extract images > and convert to png using xnview. > > On Tuesday, June 26, 2018 at 7:48

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-27 Thread Shree Devi Kumar
recognized as ṛ > consistently. Can these be addressed ? > I am using tesseract 4 alpha windows build from command line. > > Are the dev files in repos ? > > > On Tuesday, June 26, 2018 at 11:06:06 PM UTC+5:30, shree wrote: >> >> I had used ghostview to convert

Re: [tesseract-ocr] java.lang.UnsatisfiedLinkError: The specified module could not be found.

2018-06-28 Thread Shree Devi Kumar
< chandrachurh.chatterje...@gmail.com> wrote: > @Shree Devi Kumar , > Can I get a complete detailed description of the Neural Network > Architecture of the Tesseract 4 with diagram relating to what the net_spec > command line of lstm training specifies. > > On Tue, Jun 26, 2018 at

Re: [tesseract-ocr] How come tesseract 4.0 misses, what am I missing here?

2018-06-28 Thread Shree Devi Kumar
Rotate your shot to correct orientation and try. On 6/28/18, cohengil...@gmail.com wrote: > I'm quite new to tesseract and would like to use it in a project for OCR > purposes, > I found a tutorial on the web with photos, so I have executed tesseract > (tesseract 4.0.0-beta.2) on it, > and notic

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
I modified the makefile for ocrd-train to do fine-tuning. It is pasted below:

export SHELL := /bin/bash
LOCAL := $(PWD)/usr
PATH := $(LOCAL)/bin:$(PATH)
HOME := /home/ubuntu
TESSDATA = $(HOME)/tessdata_best
LANGDATA = $(HOME)/langdata

# Name of the model to be built
MODEL_NAME = frk

# Name of

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
. tessdata_best/deu.traineddata for German. On Fri, Jun 29, 2018 at 9:03 PM Lorenzo Bolzani wrote: > Hi Shree, thanks for your answer. > > I tried the script setting: > > TESSDATA=extracted # here I have the eng.lstm and > eng.trainedata > LANGDATA=langdata-mast

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
> ​ The problem was a "-gt.txt" rather than a ".gt.txt" as in my train files. Now I can run your script directly. Oh, I remember now. I had changed that for ease in renaming files for some reason. > In this way can I train a model that, for example, only recognize uppercase characters, or numbers

Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-06-30 Thread Shree Devi Kumar
see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training On Sat, Jun 30, 2018 at 3:23 PM john wrote: > Encoding of string failed! Failure bytes: ffc2 ffa9 20 ffd8 > ffa8 ffd8 ffa7 ffd8 ffae ffd8 ffaa ffd9

Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-06-30 Thread Shree Devi Kumar
, June 30, 2018 at 3:17:26 PM UTC+4:30, shree wrote: >> >> see >> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training >> >> On Sat, Jun 30, 2018 at 3:23 PM john wrote: >> >>> Encoding of string failed! Fai

Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-06-30 Thread Shree Devi Kumar
Also check that there is no tab or other unprintable character in your training text. Which version of tesseract are you using? show output of tesseract -v On Sat, Jun 30, 2018 at 8:04 PM Shree Devi Kumar wrote: > Then there must be a mismatch between the unicharset you are using and
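One way to carry out the check for tabs and other unprintable characters is sketched below. This is only an illustration: the function name find_control_chars is made up for this example, and it looks for ASCII control characters only (tab, carriage return and similar), not for every character that a given unicharset may reject.

```shell
# Hypothetical helper: list lines that contain a tab or any other ASCII
# control character, prefixed with their line numbers.
# LC_ALL=C makes grep work on bytes, so multi-byte UTF-8 letters
# (for example Persian text) are not reported as control characters.
find_control_chars() {
  LC_ALL=C grep -n '[[:cntrl:]]' "$1"
}
```

Usage would be something like find_control_chars fas.training_text, where the file name is an assumption about how the training text is stored. No output means no control characters were found.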

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-30 Thread Shree Devi Kumar
I have uploaded a new version of traineddata file at https://github.com/Shreeshrii/tessdata_shreetest/blob/master/iast-layer-18003.traineddata Attached is the OCRed output for pages 13-24 of dark pdf with it. I am still training a different variation. On Wed, Jun 27, 2018 at 6:46 PM Shree

Re: [tesseract-ocr] parameter not found: tessedit_ocr_psm_mode

2018-07-01 Thread Shree Devi Kumar
what's the output for ? tesseract -v which tesseract which tesstrain.sh On Sun, Jul 1, 2018 at 8:39 PM Zohreh Khosrobeygi wrote: > Hi, > when i use the tesstrain.sh, I have been getting this error that is about > my fas.config. My config file is: > > tessedit_ocr_engine_mode 1 > tessedit_ocr_

Re: [tesseract-ocr] parameter not found: tessedit_ocr_psm_mode

2018-07-01 Thread Shree Devi Kumar
The correct variable is tessedit_pageseg_mode. On Sun, Jul 1, 2018 at 8:51 PM Shree Devi Kumar wrote: > what's the output for ? > > tesseract -v > > which tesseract > > which tesstrain.sh > > On Sun, Jul 1, 2018 at 8:39 PM Zohreh Khosrobeygi > wrote: > >>
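A corrected config file might look like the sketch below. The first parameter line is taken from the question; the value 6 for tessedit_pageseg_mode is only an illustration, since the value in the original config is not shown, and the helper name write_config is made up for this example.

```shell
# Hypothetical helper: write a small tesseract config file to the given path.
# tessedit_pageseg_mode replaces the unknown name tessedit_ocr_psm_mode.
# The value 6 is an assumed example, not taken from the thread.
write_config() {
  cat > "$1" <<'EOF'
tessedit_ocr_engine_mode 1
tessedit_pageseg_mode 6
EOF
}
```

Calling write_config fas.config would create the file in the current directory.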

Re: [tesseract-ocr] Train 2 language together

2018-07-01 Thread Shree Devi Kumar
The font being used does not support English. On Sun, Jul 1, 2018 at 10:06 PM Zohreh Khosrobeygi wrote: > Hi, > I have been training the text: > > 272-135031- BECAUSE YOU WERE SLEEPING INSTEAD OWHILE POOR SHAGGY SITS > THERE A COOING DOVE > فیلم و و , منابع سال آگهی آخرين آخرین بود. ساخت و

Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-07-02 Thread Shree Devi Kumar
com/UB-Mannheim/tesseract/tree/v4.0.0-beta.1.20180414> >> >> On Saturday, June 30, 2018 at 7:13:30 PM UTC+4:30, shree wrote: >>> >>> Also check that there is no tab or other unprintable character in your >>> training text. >>> >>> Whi

Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-07-02 Thread Shree Devi Kumar
also see https://github.com/tesseract-ocr/tesseract/issues/549 On Mon, Jul 2, 2018 at 7:45 PM Shree Devi Kumar wrote: > You can use find_fonts with your training_text to locate the fonts to use. > > Modify the following command to match your directory setup and try > > echo &qu

Re: [tesseract-ocr] A friendly suggestion for the "tesseract-ocr" group members (Concern to all members)

2018-07-03 Thread Shree Devi Kumar
I have added a wiki page at https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-different-versions and updated for 3.04 and 4.0alpha. You can update for older versions. On Tue, Jul 3, 2018 at 1:30 AM wrote: > It seems with all languages and revisions, people (including me) tend to >

Re: [tesseract-ocr] how to improve dot-matrix digits recognize accuracy

2018-07-06 Thread Shree Devi Kumar
You could try finetuning for the dotmatrix font. On Fri, Jul 6, 2018 at 3:43 PM Wenjie Chen wrote: > Hi folks, > > Below is the dot-matrix digits picture, *tesseract *recognize it > uncorrect without any pre-processing. > >

Re: [tesseract-ocr] Really poor performance with decimal numbers

2018-07-06 Thread Shree Devi Kumar
try --psm 6 On Fri, Jul 6, 2018 at 2:23 PM Alberto Andreotti wrote: > Hello, > > I'm having problems with the simplest image possible. > It's a screenshot from GEdit(Ubuntu's text editor), with numbers and > points. This is what I get, > > 23.78 > 15 > 1.6 > 17.6 > 25 > 225 > 2235 > 0.5 > > Albe

Re: [tesseract-ocr] Re: Explanation for training_text and wordlist files

2018-07-06 Thread Shree Devi Kumar
See the following link to comment by Ray regarding building of Training data https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951 On Fri 6 Jul, 2018, 10:38 PM James Q, wrote: > No tool I can think of. What I would do is edit the file in a large text > file editor (such a

Re: [tesseract-ocr] Re: Explanation for training_text and wordlist files

2018-07-06 Thread Shree Devi Kumar
Also see a community contributed perl script for generating langdata in https://github.com/tesseract-ocr/tesseract/tree/master/contrib On Fri 6 Jul, 2018, 10:52 PM Shree Devi Kumar, wrote: > See the following link to comment by Ray regarding building of Training > data > > > htt

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-07-11 Thread Shree Devi Kumar
What about OCR with eng+iast? On Wed 11 Jul, 2018, 7:44 PM yajva, wrote: > shree > namaste > > I am trying to OCR the attached image. Getting not so good results. Even > for text which is apparently clear. Eg. in the first line, B is recognized > as H, under dot for 't&

Re: [tesseract-ocr] why tesseract gives junk value for japanese language?

2018-07-12 Thread Shree Devi Kumar
Try traineddata from tessdata_best and tessdata_fast On Thu 12 Jul, 2018, 6:45 PM mahendrag gajera, wrote: > Hello all > > I am try to ocr japanese images via below code. But it give junk character. > My tesseract version is 4.0 > > Please let me know what is missing here. > > void Test(char* im

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-07-12 Thread Shree Devi Kumar
Thank you for your feedback on eng+. I will try training for this and get back. On Thu, Jul 12, 2018 at 2:18 PM yajva wrote: > eng+iast-plus-3600 => no diacritics at all > Latin+iast-plus-3600 => only macrons none other > > > > On Thursday, July 12, 2018 at 1:12:25

Re: [tesseract-ocr] How to use tesseract 4 engineMode 2 ( Legacy + LSTM engines)?

2018-07-12 Thread Shree Devi Kumar
The traineddata files can hold both types of models. The OCR Engine mode chooses which ones get used. https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#format-of-traineddata-files On Fri, Jul 13, 2018 at 9:31 AM 于洋 wrote: > Tesseract 4 introduced new LSTM engine. The LSTM engine needs

Re: [tesseract-ocr] using tesseract4 works fine but with oem 0 "couldn't load any languages"

2018-07-14 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/blob/master/unittest/osd_test.cc On Sat 14 Jul, 2018, 8:23 PM simon mackenzie, wrote: > I am using tesseract4 and all working fine with english. However > tesseract4 cannot detect page orientation so I want to use tesseract3 for > this. > > I though

Re: [tesseract-ocr] "lstmtraining" stopped but not finished?

2018-07-15 Thread Shree Devi Kumar
Did you figure out what was causing this? On Thu, Jul 12, 2018 at 8:15 AM Dd U wrote: > Hello guys please help me. > > > I'm trying to training for improve Japanese language. Then I have a > problem now. > > > lstmtraining is stopped but not finished. It does not using CPU anymore > and nothing

Re: [tesseract-ocr] multiple problem with fine tuning

2018-07-16 Thread Shree Devi Kumar
step Yes. tesstrain.sh process only creates a 'starter traineddata' (unlike for tesseract3). On Mon, Jul 16, 2018 at 2:12 PM Hosein Khoshdel wrote: > hi before asking my question i want to thank shree whose comments are very > helpful both here and in github repo of tesseract

Re: [tesseract-ocr] The exposure option for tesstrain.sh

2018-07-16 Thread Shree Devi Kumar
Simply put, the exposure setting is similar to a scanner's setting: -1 and -2 make the text lighter, while 1, 2, 3, etc. make it darker and thicker. On Tue 17 Jul, 2018, 6:37 AM 'John Lee Ward' via tesseract-ocr, < tesseract-ocr@googlegroups.com> wrote: > Does anyone know of a document or can someone expl

Re: [tesseract-ocr] Questions about training korean language in tesseract 4.0

2018-07-19 Thread Shree Devi Kumar
Using tesstrain.sh with korean training text. You can see the format of generated box files through that. On Thu, Jul 19, 2018 at 12:06 PM Soumik Ranjan Dasgupta < srd1...@cse.jgec.ac.in> wrote: > 2) For checking the fonts used in generating the traineddata for your > language, you can see trai

Re: [tesseract-ocr] What is the purpose of trained data files present under tessdata/script folder

2018-07-19 Thread Shree Devi Kumar
Files in tessdata are for a particular language eg. Hindi, Sanskrit, Marathi, Nepali. Files in tessdata/script are for a particular script used for writing the languages eg. Devanagari. Also note that most script files also include support for English. So, if you have a document with Hindi+Engli

Re: [tesseract-ocr] How to train by tesseract 4.00

2018-07-20 Thread Shree Devi Kumar
Please ask at https://github.com/OCR-D/ocrd-train/issues for ocr-d related questions. On Fri, Jul 20, 2018 at 11:36 AM Emiliano Isaza Villamizar wrote: > Hi Shree, > > I've been trying to use this repo but I keep getting this error when I run > any target with OCR-D. >

Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-21 Thread Shree Devi Kumar
--linedata_only\ You need a space before the continuation mark \ On Sat 21 Jul, 2018, 10:00 PM , wrote: > can u please point out the place where to put the space > > thank you > > On Saturday, July 21, 2018 at 12:12:22 PM UTC-4, thiyam...@gmail.com > wrote: >> >> My command is >> >> >> usr/share/
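The effect of the missing space can be seen without running tesseract at all. In the sketch below, count_args is a made-up helper that only prints how many arguments it received; the two flags stand in for a longer tesstrain.sh command line.

```shell
# Hypothetical helper: print the number of arguments received.
count_args() { echo "$#"; }

# No space before the backslash: the shell deletes the backslash and the
# newline, so the two flags are glued into one word,
# "--linedata_only--lang".
joined=$(count_args --linedata_only\
--lang fas)

# Space before the backslash: the flags stay separate words.
spaced=$(count_args --linedata_only \
--lang fas)

echo "without space: $joined arguments"
echo "with space:    $spaced arguments"
```

The first call sees 2 arguments and the second sees 3, which is why the glued form ends up as an unrecognised argument.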

Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-22 Thread Shree Devi Kumar
/tesseract-ocr/4.00/tessdata \ > > -–output_dir /home/jennil/Desktop/pro/output/ben_output \ > > --fontlist “Lohit Bengali” > > > please do help > > On Saturday, July 21, 2018 at 1:42:41 PM UTC-4, shree wrote: >> >> --linedata_only\ >> >> You need space

Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-22 Thread Shree Devi Kumar
able > ERROR: /tmp/tmp.pBWa4wRHmt/ben/ben.“lohit-bengali”.exp0.box does not exist > or is not readable > > SO , please tell is all the fonts which are in this FONTS folder are > already installed to tesseract or not? > > > On Sun, Jul 22, 2018 at 7:15 AM, Jennil Thiyam

Re: [tesseract-ocr] Unnecessary extra space with Japanese.traineddata

2018-07-23 Thread Shree Devi Kumar
Which tessdata repository are you using for your trained data files? tessdata tessdata_best tessdata_fast On Tue 24 Jul, 2018, 9:01 AM Atsuyoshi Suzuki, wrote: > Hi. > > I tried new tesseract and traineddata for Japanese (both jpn.traineddata > and Japanese.traineddata). > > It's very good r

Re: [tesseract-ocr] Assert failed:in file weightmatrix.cpp, line 244

2018-07-23 Thread Shree Devi Kumar
Which version of tesseract are you using? Please post output of tesseract -v On Tue 24 Jul, 2018, 2:26 AM Emiliano Isaza Villamizar, wrote: > Hello everyone, > > > 'm trying to train tesseract to improve the detection of some prices such > as: CN¥2,400.48. I got got to a point that I keep gett

Re: [tesseract-ocr] Unnecessary extra space with Japanese.traineddata

2018-07-24 Thread Shree Devi Kumar
40 AM Atsuyoshi Suzuki < atuyosi.unloc...@gmail.com> wrote: > Hi Shree. > > I use tessdata_fast. > > > 2018年7月24日火曜日 13時44分40秒 UTC+9 shree: >> >> Which tessdata repository are you using for your trained data files? >> >> tessdata >> tessdata_bes

Re: [tesseract-ocr] Re: How to use the "latin sanskrit" language?

2018-07-26 Thread Shree Devi Kumar
There is no official traineddata for san_latn or IAST. I have created some experimental versions but the output is not fully accurate. On Fri 27 Jul, 2018, 12:21 AM John Muccigrosso, wrote: > You're telling tesseract that your text is in Latin. You need the > traineddata for san-lat. > > -- >

Re: [tesseract-ocr] Re: How to use the "latin sanskrit" language?

2018-07-26 Thread Shree Devi Kumar
You can try the IAST ones from https://github.com/Shreeshrii/tessdata_shreetest?files=1 On Fri 27 Jul, 2018, 8:27 AM Shree Devi Kumar, wrote: > There is no official traineddata for san_latn or IAST. I have created some > experimental versions but the output is not fully accurate. > > &

Re: [tesseract-ocr] Can't symlink into tessdata anymore?

2018-07-27 Thread Shree Devi Kumar
@zdenko podobny Please see https://github.com/tesseract-ocr/tessdata/issues/18 ita.special-words missing #18 On Fri, Jul 27, 2018 at 11:55 AM Zdenko Podobny wrote: > symlink is filesystem feature and tesseract use standard C++ function for > reading/writing files from filesystem, so there is n

Re: [tesseract-ocr] tesseract-4.0.0-beta.3 - testing problem

2018-07-28 Thread Shree Devi Kumar
Test related info has been moved to a new repo under tesseract-ocr https://github.com/tesseract-ocr/test You need to update that submodule (similar to googletest) for all files to be available. It's possible that the wiki has not been updated for the same, you can add appropriate instructions to

Re: [tesseract-ocr] combine_tessdata. Failed to read /usr/share/tesseract-ocr/tessdata/foo.traineddata

2018-07-29 Thread Shree Devi Kumar
Continue_from should be used when you want to train a new language based on an existing language or to add some characters to an existing language. There is no existing language called 'foo' - you should replace it with the lang code for the language you are training. On Sun, Jul 29, 2018 at 9:44

Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-02 Thread Shree Devi Kumar
Please use latest scripts from https://github.com/OCR-D/ocrd-train On Fri, Aug 3, 2018 at 4:41 AM May wrote: > > > > > >

Re: [tesseract-ocr] Error on combine_lang_model script; Null char=2 Invalid format in radical table at line 4: 3400 1.4 Creation of encoded unicharset failed!! Error writing recoder!!

2018-08-05 Thread Shree Devi Kumar
You are using an old version of tesseract. Please use the latest version from GitHub. Make sure you remove/uninstall the old version. Your error is related to the radical-stroke file in langdata. Make sure you use the latest version of the langdata repo. >Invalid format in radical table at line 4: 34001.4 O

Re: [tesseract-ocr] tesseract-4.0.0-beta.3 - testing problem

2018-08-06 Thread Shree Devi Kumar
One of the tests is for developers to verify that all traineddata files are valid and load ok, so it needs the complete repo for tessdata_fast and tessdata_best. The tests have not been setup for users. On Mon 6 Aug, 2018, 1:44 PM Marco Atzeri, wrote: > Am 28.07.2018 um 10:08 schrieb Sh

Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-06 Thread Shree Devi Kumar
The OCR-D scripts are geared towards tesseract 4.0.x. You are trying to use them with tesseract 3.05. On Tue 7 Aug, 2018, 10:50 AM May, wrote: > Hey Shree > > I also tried with the orignal script from the github. But faced the same > issue with the process stuck at unicharset_output.

Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-07 Thread Shree Devi Kumar
Mhl3ypueJggCLcBGAs/s1600/Capture.PNG> > > > On Monday, August 6, 2018 at 11:42:40 PM UTC-7, May wrote: >> >> Thanks a lot Shree. I tried the tesseract 4.0 and the training is working >> well until it reaches the lstm-training step and got stuck there. I am >> total

Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-07 Thread Shree Devi Kumar
language not fully supported ? Answering these questions will help you decide what training, if any, is required. On Tue, Aug 7, 2018 at 1:59 PM Shree Devi Kumar wrote: > lstm training can take weeks, days, hours depending on the options chosen. > > you have given complete network spec

Re: [tesseract-ocr] tesseract not able to detect handwritten text even after improving image quality

2018-08-07 Thread Shree Devi Kumar
see FAQ https://github.com/tesseract-ocr/tesseract/wiki/FAQ#can-i-use-tesseract-for-handwriting-recognition Recently a lot of people have tried to train 4.0 using handwriting fonts, however, there has been no report as to the level of success they have had doing it. On Tue, Aug 7, 2018 at 3:28

Re: [tesseract-ocr] Re: OCR-d failed at Unicharset line -Help!

2018-08-07 Thread Shree Devi Kumar
Re finetuning - see https://github.com/tesseract-ocr/tesseract/issues/1782#issuecomment-411018986 Have you tried to provide each word separately (eg. using opencv ) for recognition? -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscrib

Re: [tesseract-ocr] tesseract not able to detect handwritten text even after improving image quality

2018-08-08 Thread Shree Devi Kumar
https://groups.google.com/forum/#!searchin/tesseract-ocr/handwriting%7Csort:date On Wed, Aug 8, 2018 at 6:21 PM wrote: > Hi Shree, > > I'm still unable to extract images from given png can you please suggest > me any other links > > Regards > Rahul > > On Wedne

Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-08 Thread Shree Devi Kumar
I think this could happen if your new traineddata is not trained to as high an accuracy level as the eng traineddata. You can set up a debug log to verify this. See https://github.com/tesseract-ocr/tesseract/issues/1275#issuecomment-360367865 for details. On Wed, Aug 8, 2018 at 6:04 PM wrote: > i'm tr

Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-09 Thread Shree Devi Kumar
output tesseract.log file should be produced in the directory from where you are running the command, usually where your OCR output is created. On Thu, Aug 9, 2018 at 3:48 PM wrote: > Hello Shree, thank you for your prompt reply. > > I have now changed the logfile as instructed. Wh

Re: [tesseract-ocr] Re: tesseract training flags to rtl languages

2018-08-09 Thread Shree Devi Kumar
There is an Urdu traineddata for tesseract 4. Have you tried it? See https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017 You can also check script/Arabic, which should also support Urdu. Please provide feedback as to its accuracy for Urd

Re: [tesseract-ocr] Re: tesseract training flags to rtl languages

2018-08-09 Thread Shree Devi Kumar
wrong in training. can you please point out. thank you. > > On Thu, Aug 9, 2018 at 7:39 PM Shree Devi Kumar > wrote: > >> There is an Urdu traineddata for tesseract 4. Have you tried it >> >> See >> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#up

Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-10 Thread Shree Devi Kumar
font/impact. Use eng.traineddata as base, 50-100 lines of training text and 300-400 iterations max. On Fri 10 Aug, 2018, 8:39 PM , wrote: > Hi Shree, just a quick update. > > I've now looked into this output tesseract.log further and now understand > how it works and how i

Re: [tesseract-ocr] cannot install new version, please help me

2018-08-10 Thread Shree Devi Kumar
Uninstall all versions of tesseract and libtesseract-dev, then install using the PPA from https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr On Sat, Aug 11, 2018 at 11:08 AM Kimchi wrote: > Environment > >- Tesseract Version: 3.04 >- Commit Number: 3.04 >- Platform: ubuntu 16.

Re: [tesseract-ocr] Training tools don't get built when building tesseract from souce

2018-08-12 Thread Shree Devi Kumar
sudo apt-get remove tesseract-ocr sudo apt-get remove libtesseract-dev sudo add-apt-repository ppa:alex-p/tesseract-ocr sudo apt-get update sudo apt install tesseract-ocr sudo apt install libtesseract-dev The above will uninstall and then install the latest version binaries from ppa. Why are y

Re: [tesseract-ocr] Training tools don't get built when building tesseract from souce

2018-08-14 Thread Shree Devi Kumar
libtool: install: /usr/bin/install -c .libs/combine_lang_model /usr/local/bin/combine_lang_model libtool: install: /usr/bin/install -c .libs/combine_tessdata /usr/local/bin/combine_tessdata libtool: install: /usr/bin/install -c .libs/dawg2wordlist /usr/local/bin/dawg2wordlist libtool: install: /usr

Re: [tesseract-ocr] Training tools don't get built when building tesseract from souce

2018-08-14 Thread Shree Devi Kumar
│ │ │ └── wordlist2dawg.o On Wed, Aug 15, 2018 at 9:35 AM Shree Devi Kumar wrote: > libtool: install: /usr/bin/install -c .libs/combine_lang_model > /usr/local/bin/combine_lang_model > libtool: install: /usr/bin/install -c .libs/combine_tessdata > /usr/local/bin/comb

Re: [tesseract-ocr] How to read texts from a table into arrays with tesseract, given the cooridnates of the column and row boundaries?

2018-08-15 Thread Shree Devi Kumar
check whether HOCR or TSV outputs are useful. On Wed, Aug 15, 2018 at 4:24 PM, Bec Zhao wrote: > Hi, > > I want to extract texts from tables into arrays that represents the rows > and columns of the table. > I have already used opensv to obtain the precise boundaries of the table, > now I want t
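
As a rough illustration of how the TSV output could be mapped onto known table cells, here is a minimal sketch. It assumes the TSV was produced with something like `tesseract table.png out tsv` and has the usual twelve tab-separated columns (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text). The function name `words_in_box` and the rule of assigning a word to a cell by the centre of its bounding box are assumptions made for illustration, not something from the thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract):
#   words_in_box TSV_FILE X_MIN X_MAX Y_MIN Y_MAX
# Prints, space-separated, every word-level entry (level 5) whose
# bounding-box centre lies inside the given cell rectangle.
words_in_box() {
    awk -F '\t' -v x0="$2" -v x1="$3" -v y0="$4" -v y1="$5" '
        NR > 1 && $1 == 5 {            # skip the header row, keep word rows only
            cx = $7 + $9 / 2           # left + width / 2
            cy = $8 + $10 / 2          # top + height / 2
            if (cx >= x0 && cx < x1 && cy >= y0 && cy < y1)
                out = (out == "" ? $12 : out " " $12)
        }
        END { print out }
    ' "$1"
}
```

Calling `words_in_box out.tsv 0 100 40 80` once per cell, with the row and column boundaries already obtained, would fill the array one cell at a time. Text containing literal tabs is not handled by this sketch.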

Re: [tesseract-ocr] Make lstm for some files

2018-08-16 Thread Shree Devi Kumar
You need to make lstmf file for each of these. eg. tesseract fas.B_Mitra.exp0.tif fas.B_Mitra.exp0 --psm 6 lstm.train will create fas.B_Mitra.exp0.lstmf On Thu, Aug 16, 2018 at 5:40 PM, Zohreh Khosrobeygi wrote: > I have some tif and box files for each font for example: > fas.B_Mitra.exp
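
The per-file command above can be repeated for every image with a small loop. The following is only a sketch: the helper names `lstmf_cmd` and `make_all_lstmf` are hypothetical, and it assumes each .tif has its matching .box file next to it and that the `lstm.train` config can be found by tesseract.

```shell
#!/bin/sh
# Hypothetical helper: print the lstmf-generation command for one image,
# mirroring the command quoted in the message above.
lstmf_cmd() {
    img=$1
    base=${img%.tif}                 # strip the .tif extension to get the output base
    printf 'tesseract %s %s --psm 6 lstm.train\n' "$img" "$base"
}

# Hypothetical helper: run that command for every .tif in a directory
# (defaults to the current directory).
make_all_lstmf() {
    dir=${1:-.}
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue    # skip when the glob matches nothing
        tesseract "$img" "${img%.tif}" --psm 6 lstm.train
    done
}
```

Running `lstmf_cmd fas.B_Mitra.exp0.tif` first shows what would be executed; `make_all_lstmf ./fonts` (directory name assumed) then runs it for each image.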

Re: [tesseract-ocr] Infinite Loop of Compute CTC targets failed!

2018-08-17 Thread Shree Devi Kumar
Please build the latest code beta.4 and run the same test. On Fri, Aug 17, 2018 at 4:44 PM, wrote: > ### Environment > > * **Tesseract Version**: 4.0.0-beta.1-306-g45b11cd > * **Commit Number**: 4.0.0-beta.1-306-g45b11cd > * **Platform**: Ubuntu x86_64 GNU/Linux > ### Current Behavior: > > Infin

Re: [tesseract-ocr] Make lstm for some files

2018-08-19 Thread Shree Devi Kumar
exit 1 > > Tesseract -v: > tesseract 4.0.0-beta.1 > leptonica-1.74.4 > libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib > 1.2.8 > > Found AVX2 > Found AVX > Found SSE > > > > > > On Thu, Aug 16, 2018 at 6:28 PM Shree Devi Kumar &

Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-20 Thread Shree Devi Kumar
If you want to change parameters, please look at the replace layers option. With fine tuning you cannot change them. On Tue 21 Aug, 2018, 7:27 AM , wrote: > Is it possible to change the parameters when Fine Tuning? > > The documentation says "Fine tuning is the process of training an existing >

Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-21 Thread Shree Devi Kumar
rate = 0.001, momentum=0.5 I have not tried changing the parameters even with replace layers. Do provide feedback on your experience. On Tue, Aug 21, 2018 at 12:00 PM Jacob Biros wrote: > Thank you for the prompt response! > > On Tue, Aug 21, 2018 at 1:14 PM, Shree Devi Kumar > wrote: &

Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-21 Thread Shree Devi Kumar
When you specify the complete network spec as in --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \ It probably treats it as a training from scratch and ignores the continue_from. I haven't looked at how the training command is parsed. I just followed Ray's examples for th

Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-21 Thread Shree Devi Kumar
t; I'm not sure if we will move forward with replacing the layers, but I will > post about it if we do. > > Thanks again! > > On Tue, Aug 21, 2018 at 6:08 PM, Shree Devi Kumar > wrote: > >> >lstmtraining --model_output ./layer_from_deva/layer --continue_from >

Re: [tesseract-ocr] Changing Parameters when Fine Tuning

2018-08-21 Thread Shree Devi Kumar
On Tue, Aug 21, 2018 at 1:16 PM wrote: > Sorry, one more question. We set up 4 different machines all running the > command below except for minor differences in the momentum and the learning > rate. Changing the momentum and learning rate in this situation, because > it is fine tuning, shouldn

Re: [tesseract-ocr] IF I could make .unicharset by box/tif pairs instead of fonts files by tesstrain.sh?

2018-08-27 Thread Shree Devi Kumar
When using tesstrain.sh, you can add --save_box_tiff to the command line. Original tesstrain.sh did not move box/tiff along with lstmf files (they remained in /tmp directory). I had modified it first to move box/tiff in all cases along with lstmf files. This option now gives the user the choice w

Re: [tesseract-ocr] Experiment with Thai language

2018-08-31 Thread Shree Devi Kumar
>Can't encode transcription: 'คุย เดีย ระบบ๑๙ 77 และมี." มิเมือง' in language '' I don't know what causes this kind of warning and how to solve it so I just continue the training. These are related to normalization and validation of the training text. Please see https://github.com/tesseract-ocr/te

Re: [tesseract-ocr] Which repo should I use? langdata_lstm or langdata?

2018-08-31 Thread Shree Devi Kumar
langdata_lstm has the source training data used for 4.0.0alpha traineddata files. However, the exact usage of files or scripts to use them have not been provided. The training text in langdata_lstm is much larger. If you want only to finetune, you should consider limiting the pages when running te

Re: [tesseract-ocr] Error in creating LSTM training data using tesstrain.sh

2018-09-01 Thread Shree Devi Kumar
> read_params_file: Can't open lstm.train lstm.train is a config file which is not found. It is there in tesseract/tessdata/configs. Make sure it is there in your tessdata directory or your path and can be found. On Sun, Sep 2, 2018 at 3:40 AM, Shandigutt wrote: > Hi, > > I was trying to creat
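
A quick way to confirm the config is where tesseract expects it is to test for the file before starting a training run. This is a minimal sketch: the function name `has_lstm_train_config` is hypothetical, and the tessdata directory is passed in explicitly because its location differs between installations.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if TESSDATA_DIR/configs/lstm.train exists.
has_lstm_train_config() {
    tessdata_dir=$1
    [ -f "$tessdata_dir/configs/lstm.train" ]
}
```

For example, `has_lstm_train_config /usr/share/tesseract-ocr/4.00/tessdata || echo "lstm.train missing"` (path assumed; substitute your own tessdata directory) reports the problem up front instead of failing inside tesstrain.sh.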

Re: [tesseract-ocr] Re: Error when executing combine_lang_model script

2018-09-03 Thread Shree Devi Kumar
> Then I tried to create a starter traineddata file using combine_lang_model script. I used the below command for that, When you run tesstrain.sh, it creates the starter traineddata using combine_lang_model script. See below for messages from a small test run. + /home/ubuntu/tesseract/src/train
