Re: [tesseract-ocr] Tesseract fails to recognize multi column paragraphs

2020-02-11 Thread Shree Devi Kumar
What psm are you using? On Tue, Feb 11, 2020, 20:46 KOLLOL CHOWDHURY wrote: > Hi, > > There are certain pages with multi column and when I try to OCR it, it > doesn't recognise the multi column and takes all the words in a particular > line . > > I am using Tesseract 4.01 and trying to output an

Re: [tesseract-ocr] Help for training Akkadian language for Tesseract 4 needed

2020-02-16 Thread Shree Devi Kumar
Try lstmtraining again for 1000 iterations with --debug_level -1 On Mon, Feb 17, 2020, 01:46 Wincent Balin wrote: > Hello all, > > after preparing ground truth files for Akkadian language, I started the > training using the *tesstrain *Makefile, but over 400 iterations > later, the output

Re: [tesseract-ocr] Help for training Akkadian language for Tesseract 4 needed

2020-02-16 Thread Shree Devi Kumar
I had done a test training for Akkadian sometime back. I will see if I still have the files. On Mon, Feb 17, 2020, 12:53 Shree Devi Kumar wrote: > Try lstmtraining again for 1000 iterations with --debug_level -1 > > > > > On Mon, Feb 17, 2020, 01:46 Wincent Balin wro

Re: [tesseract-ocr] Help for training Akkadian language for Tesseract 4 needed

2020-02-17 Thread Shree Devi Kumar
ter I removed some fonts (this training was done using tesstrain.sh). On Mon, Feb 17, 2020 at 1:23 PM Shree Devi Kumar wrote: > I had done a test training for Akkadian sometime back. I will see if I > still have the files. > > On Mon, Feb 17, 2020, 12:53 Shree Devi Kumar wrote: >

Re: [tesseract-ocr] Re: Using tesseract on browser page insufficient

2020-02-19 Thread Shree Devi Kumar
You are using an old version of software. See https://tesseract-ocr.github.io/tessdoc/Home.html On Wed, Feb 19, 2020 at 10:47 PM Alexander Dietz wrote: > > > On Wednesday, February 19, 2020 at 5:45:09 PM UTC+1, Lakshay Saini wrote: >> >> Hello, >> >> It all depends on the image quality, that's

Re: [tesseract-ocr] How to install current version of tesseract on Ubuntu 16.04.6

2020-02-20 Thread Shree Devi Kumar
If you are on Ubuntu, use Alex's ppa. The link should be on the tessdoc documentation page that I had referred to earlier. On Thu, Feb 20, 2020 at 1:30 PM Alexander Dietz wrote: > How do I install a current version of tesseract on Ubuntu 16.04.6? I did > > sudo apt install tesseract-ocr >

Re: [tesseract-ocr] Help for training Akkadian language for Tesseract 4 needed

2020-02-22 Thread Shree Devi Kumar
Balin wrote: > Hello Shree, > > I tried that. The command was > > lstmtraining --traineddata data/akk/akk.traineddata --old_traineddata > /usr/share/tesseract-ocr/4.00/tessdata/akk-1m.traineddata --continue_from > data/akk-1m/akk.lstm --model_output dat

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-06 Thread Shree Devi Kumar
Language codes recognized for tesseract training are listed in https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh#L21 I will suggest that you use a language similar to your ancient language and do training. You can rename file with your proper language code at

Re: [tesseract-ocr] Re: Trained model works in command line but not bytedeco service?

2020-03-06 Thread Shree Devi Kumar
The files from tessdata_best only support the lstm mode ie --oem 1. Please check what mode your web service is using. On Fri, Mar 6, 2020, 19:27 Adam Funk wrote: > Hi again, > > I've updated the web service to use a newer version: > > compile group: 'org.bytedeco', name: 'tesseract-platform'

Re: [tesseract-ocr] Ban some characters on tessseract ( '/' , '|' , ',' , ...)

2020-03-06 Thread Shree Devi Kumar
Search for whitelist / blacklist in forum for ways to restrict the characters. On Fri, Mar 6, 2020, 18:19 Guillaume de Rybel wrote: > Hi, my work is to recognize license plates, and sometimes, tesseract > recognize some special characters. I need to 'ban' those characters : '/' , > '|' , ',' . >
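
For readers landing on this thread: one approach that comes up in those forum searches is the config variable `tessedit_char_blacklist` (or `tessedit_char_whitelist`) passed with `-c`. The sketch below only builds the command line; the file names `plate.png` and `plate_out` are made up for illustration, and whether the LSTM engine honours these variables has varied between 4.x releases, so please check against your version.

```python
import shlex

def blacklist_command(image, outbase, banned):
    """Build a tesseract argv asking the engine to avoid the given characters."""
    # tessedit_char_blacklist is a config variable set with -c KEY=VALUE.
    return ["tesseract", image, outbase, "-c", f"tessedit_char_blacklist={banned}"]

# plate.png / plate_out are hypothetical names, not from the thread.
cmd = blacklist_command("plate.png", "plate_out", "/|,")
# shlex.join quotes the argument so the shell does not treat | as a pipe.
print(shlex.join(cmd))
```

Passing the list form to `subprocess.run` avoids shell quoting problems with characters like `|`.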

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-06 Thread Shree Devi Kumar
Is the language RTL like Arabic? The language code is used for picking up related files from langdata or langdata_lstm repo. RTL languages have slightly different processing. On Fri, Mar 6, 2020, 23:05 aby tesh wrote: > The character set of the language is new and is not in any way similar to

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-06 Thread Shree Devi Kumar
ere a > proper solution? > > What is the first step should i do? > > On Fri, 6 Mar 2020 at 20:46, Shree Devi Kumar wrote: > >> Is the language RTL like Arabic? >> >> The language code is used for picking up related files from langdata or >>

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-06 Thread Shree Devi Kumar
to change this and > train the system? > > On Fri, 6 Mar 2020 at 20:57, Shree Devi Kumar wrote: > >> If you plan to use ara as the language code, you should change the files >> in --langdata_dir ./tesslang/ara to the files for your language. Eg. >> The traini

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-07 Thread Shree Devi Kumar
I have created an example traineddata for xsa. I will upload later today. You can then modify with a larger training text and run training. On Sat, Mar 7, 2020, 02:58 aby tesh wrote: > I think it is, most likely , Right To Left, it has passed that error now >>> using eng since i only have the tr

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-07 Thread Shree Devi Kumar
Please see https://github.com/Shreeshrii/tesstrain-xsa On Sat, Mar 7, 2020 at 6:54 PM Shree Devi Kumar wrote: > I have created an example traineddata for xsa. I will upload later today. > You can then modify with a larger training text and run training. > > On Sat, Mar 7, 2020, 02

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-09 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tessdoc/blob/master/TrainingTesseract-4.00.md#hardware-software-requirements On Tue, Mar 10, 2020, 03:41 aby tesh wrote: > Hey, > > I followed the steps in the readme file, and i started the lstmtraining, > but it seems my current computer's processor can't handl

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-09 Thread Shree Devi Kumar
If you can share a large enough training text and fonts, I can rerun the training. On Tue, Mar 10, 2020, 03:41 aby tesh wrote: > Hey, > > I followed the steps in the readme file, and i started the lstmtraining, > but it seems my current computer's processor can't handle the training for > a long

Re: [tesseract-ocr] Re: Failed loading language 'eng'

2020-03-11 Thread Shree Devi Kumar
One possibility is that the eng.traineddata file you have is not compatible with the latest tesseract version you are using. The other possibility is that the Java userbot is calling tesseract with the wrong --oem. I have cc:ed Quan for advice regarding tess4j and Java. On Wed, Mar 11, 2020, 17:

Re: [tesseract-ocr] Obtain both PDF and HOCR output from single scan?

2020-03-11 Thread Shree Devi Kumar
Use both at end of command line eg. tesseract image outbase -l foo --oem 1 hocr pdf On Thu, Mar 12, 2020, 03:59 Chris Falter wrote: > Hi, > > My project is using Tesseract 4.x to scan multi-page TIFFs. We need to > obtain HOCR output to perform some analytics, and we need to obtain a > searchab
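
A small sketch of the same idea for anyone scripting this, assuming Python is the calling language: the output config names (`hocr`, `pdf`) go last on the command line and each one produces its own file next to the output base. The names `scan.tif` and `outbase` are placeholders, and `foo` stands for the language code as in the reply above.

```python
def ocr_command(image, outbase, lang, configs=("hocr", "pdf")):
    """Build a tesseract argv that requests several output formats in one run."""
    # Config names are listed after all options; one output file per config.
    return ["tesseract", image, outbase, "-l", lang, "--oem", "1", *configs]

def expected_outputs(outbase, configs=("hocr", "pdf")):
    """File names tesseract is expected to write for the given configs."""
    return [f"{outbase}.{name}" for name in configs]

cmd = ocr_command("scan.tif", "outbase", "foo")
print(" ".join(cmd))
print(expected_outputs("outbase"))
```

The list can be handed to `subprocess.run(cmd, check=True)` once tesseract is on the PATH.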

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-14 Thread Shree Devi Kumar
Are all these Unicode fonts? What about training text in utf-8 Unicode encoding? On Sat, Mar 14, 2020, 22:37 aby tesh wrote: > Hey shree, I have compiled all relevant fonts and attached them below. I > am not sure know how i can generate text data with it. > > On Tuesday, March 1

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-14 Thread Shree Devi Kumar
wrote: > That is what i am not getting, i don't think they all are unicode fonts, i > couldn't get one. Some render on my machine (Linux) some don't. > > On Saturday, March 14, 2020 at 8:45:46 PM UTC+3, shree wrote: >> >> Are all these Unicode fonts? >>

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-15 Thread Shree Devi Kumar
4:32:08 AM UTC+3, shree wrote: >> >> I had used the findfonts feature of text2image and found only two fonts >> that rendered the xsa text. I will check the fonts that you sent. What >> about training text? Unless you have some more text, it will be difficult >> to do

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-15 Thread Shree Devi Kumar
ning text from a range of characters/word list, similar to The tool language_metrics runs Tesseract OCR over images of random word > sequences, which are created out of the supplied wordlist, On Mon, Mar 16, 2020 at 2:32 AM Wincent Balin wrote: > Maybe http://dasi.cnr.it does have somethi

Re: [tesseract-ocr] How to use trained data from folder tessdata_best?

2020-03-18 Thread Shree Devi Kumar
tesseract sample.tiff out -l pol --tessdata-dir \tessdata_best On Wed, Mar 18, 2020, 23:48 Leo Artacho wrote: > Hi All, > > I wonder what´s the correct way to call the *best* trained model files. > > I have downloaded one of the best trained models and created this folder: \ > *tessdata_best* >
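
If the call is made from a script, a rough sketch along these lines can catch a wrong folder before tesseract reports a failed language load. It assumes the model file is named `<lang>.traineddata` directly inside the folder given to `--tessdata-dir`; the helper name is made up for illustration.

```python
from pathlib import Path

def best_model_command(image, outbase, lang, tessdata_dir):
    """Build the argv only if <lang>.traineddata exists in tessdata_dir."""
    model = Path(tessdata_dir) / f"{lang}.traineddata"
    if not model.is_file():
        # Fail early with the path that was actually checked.
        raise FileNotFoundError(model)
    return ["tesseract", image, outbase, "-l", lang,
            "--tessdata-dir", str(tessdata_dir)]

# Example call (paths are placeholders):
# best_model_command("sample.tiff", "out", "pol", r"\tessdata_best")
```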

Re: [tesseract-ocr] What is the difference between script *.traineddata and normal *.traineddata models

2020-03-19 Thread Shree Devi Kumar
Script traineddata have been trained on all languages written in that script plus English. So Script/Arabic would have been trained with ara, fas, urd (etc) + eng. Please check the Readme file in tessdata_best / tessdata_fast repo for explanation by Ray. You have to try both for your use case to

Re: [tesseract-ocr] Best export method

2020-03-20 Thread Shree Devi Kumar
Take a look at gimagereader, which uses tesseract . It has the options you are looking for. On Fri, Mar 20, 2020, 17:55 Dayton wrote: > I have output to hocr and tsv but I still get the all text without hard > return or any separation between paragraphs. > > Is there an HOCR tool which allows to

Re: [tesseract-ocr] Re: What is the difference between script *.traineddata and normal *.traineddata models

2020-03-20 Thread Shree Devi Kumar
Yes, and the result of the two commands could be different. On Fri, Mar 20, 2020, 17:43 Essam Zaky wrote: > Thanks @Shreeshrii > > So the following commands recognize Arabic/English text > tesseract AE.jpg AE1 -l ara+eng > tesseract AE.jpg AE2 -l script/Arabic > > > > On Thursday, 19 March 2020

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-24 Thread Shree Devi Kumar
Please see https://github.com/Shreeshrii/tesstrain-xsa/blob/master/langdata/latin2unicode.sh It has sed substitution commands for going from transliteration to Unicode for xsa, based on mapping shown in Wikipedia and other web pages. On Mon, Mar 23, 2020, 01:58 Wincent Balin wrote: > Hi Sh

Re: [tesseract-ocr] Help for training Akkadian language for Tesseract 4 needed

2020-03-24 Thread Shree Devi Kumar
01, momentum=0.5 null char=2 > > What does null char=374 in the line 93 mean? > I don't know. Please look at the unicharset files, they usually have a line related to NULL right near the top. > > On Sat, 22 Feb 2020 at 10:56, Shree Devi Kumar > wrote: > >> try w

Re: [tesseract-ocr] How to prepare fonts folder to train from scratch

2020-03-24 Thread Shree Devi Kumar
As far as I know no one has replicated the LSTM training done from scratch by Ray. On Wed, Mar 25, 2020, 01:35 Essam Zaky wrote: > Hi Dears , > > I would like to build *.traindata from scratch specially for English and > Arabic > > So lets talk about English as example > my question how to pre

Re: [tesseract-ocr] Re: How to prepare fonts folder to train from scratch

2020-03-24 Thread Shree Devi Kumar
AFAIK Ray is involved in other projects at Google. Unlikely to get a reply from him. See https://github.com/tesseract-ocr/tesstrain/wiki for training done by @stweil on similar scale for Fraktur. The pages list the hardware requirements, time taken etc. Please check that you have enough resources

Re: [tesseract-ocr] Re: How to prepare fonts folder to train from scratch

2020-03-25 Thread Shree Devi Kumar
The issue with Arabic is related to RTL processing and how punctuation and digits are handled. If your training text does not have them, you will have greater success. On Wed, Mar 25, 2020, 15:32 Essam Zaky wrote: > Thanx @Loranzo and @Shree > i will give try to fine tune , and if the

Re: [tesseract-ocr] Help for training Akkadian language for Tesseract 4 needed

2020-03-26 Thread Shree Devi Kumar
Please see https://github.com/Shreeshrii/tesstrain-akk which has the LSTM training input, training steps and resulting traineddata files. You can change the training text and fonts to customize and further finetune the models. -- You received this message because you are subscribed to the Google

Re: [tesseract-ocr] Help for training Akkadian language for Tesseract 4 needed

2020-03-26 Thread Shree Devi Kumar
Wincent, FYI I use a combination of bash script and makefile for running training, since I am not able to control the processing via makefile On Thu, Mar 26, 2020 at 4:20 PM Shree Devi Kumar wrote: > Please see https://github.com/Shreeshrii/tesstrain-akk which has the LSTM > training

Re: [tesseract-ocr] Generate Arabic PLUS traineddat gives error

2020-03-28 Thread Shree Devi Kumar
Please check that you have used the correct path for the traineddata file. Please share the lstmtraining command that you used before this for training. On Sat, Mar 28, 2020, 22:56 Essam Zaky wrote: > Dear @Shreeshrii > I had followed your bash script to add Andalus font in the Arabic > lanagua

Re: [tesseract-ocr] Generate Arabic PLUS traineddat gives error

2020-03-28 Thread Shree Devi Kumar
/eng.traineddata \ --model_output ../tesstutorial/trainplusminus/eng_plusminus.traineddata --traineddata needs to be same in both commands. On Sun, Mar 29, 2020 at 6:45 AM Shree Devi Kumar wrote: > Please check that you have used the correct path for the traineddata file. > > Please

Re: [tesseract-ocr] Generate Arabic PLUS traineddat gives error

2020-03-28 Thread Shree Devi Kumar
r as best 85 ? > > > On Sunday, 29 March 2020 at 5:06:16 AM UTC+2, shree wrote: >> >> See >> https://github.com/Shreeshrii/tess4training/blob/master/6-plusminus.sh >> >> lstmtraining --model_output ../tesstutorial/trainplusminus/plusminus \ >> --conti

Re: [tesseract-ocr] Generate Arabic PLUS traineddat gives error

2020-03-29 Thread Shree Devi Kumar
to show me how to prepare the training text. > > example > what is the recommended text size > how many character instance repeated in the training set > , what about ligatures, how to handle it and how to add it in unicharset > > > On Sunday, 29 March 2020 at 7:50:54 AM

Re: [tesseract-ocr] Generate Arabic PLUS traineddat gives error

2020-03-29 Thread Shree Devi Kumar
On Sun, Mar 29, 2020 at 5:30 PM Essam Zaky wrote: > I read this page but still need more information about how to build > training data set > say i would train the engine to recognize field contain 15 digit > is it enough to give small text file contain the 10 digits from 0 to 9 > or should i pre

Re: [tesseract-ocr] is psm wrong?

2020-03-29 Thread Shree Devi Kumar
If you want to recognise images with exposure of -5 or -10 then you need to train with it. Check your images, I think those images will be too light to be recognised correctly. On Sun, Mar 29, 2020, 19:31 Pndaza wrote: > i finetuned myanmar traineddata and i got accuracy above 95%. > But somethi

Re: [tesseract-ocr] Is possible to get the position of every character as output?

2020-04-02 Thread Shree Devi Kumar
see https://github.com/tesseract-ocr/tessdoc/blob/master/APIExample.md#example-to-get-hocr-output-with-alternative-symbol-choices-per-character-lstm On Thu, Apr 2, 2020 at 11:53 PM Renan Neri Pereira wrote: > I want to know if is possible to have a output with position of every > character tha

Re: [tesseract-ocr] Digit recognition errors / training

2020-04-02 Thread Shree Devi Kumar
Try fine-tuning for impact using your font. On Thu, Apr 2, 2020 at 11:51 PM Suppressed wrote: > Im working on a project in which I need to read digit values from an > image, then do tasks based on the values that get extracted. > Because of this, mistakes arent really acceptable. I attached the pic

Re: [tesseract-ocr] fine tuning from traineddata_best

2020-04-03 Thread Shree Devi Kumar
As per the info given by Ray Smith, lead developer of tesseract, if you just need to fine-tune for a new font face, use fine-tune by impact. His example uses the training text from langdata repo (approx 80 lines) rendered with the font, generating lstmf files and then running lstmtraining on that

Re: [tesseract-ocr] fine tuning from traineddata_best

2020-04-03 Thread Shree Devi Kumar
work. Both will also be quite fast to try, as you only need to run 400 iterations. On Fri, Apr 3, 2020, 16:53 Shree Devi Kumar wrote: > As per the info given by Ray Smith, lead developer of tesseract, if you > just need to fine-tune for a new font face, use fine-tune by impact. > >

Re: [tesseract-ocr] Digit recognition errors / training

2020-04-03 Thread Shree Devi Kumar
wrote: > You got any guides or threads that could help me in the process? Im kinda > lost, not gonna lie. > > On Friday, 3 April 2020 at 4:54:11 UTC+3, shree wrote: >> >> try finetune for impact using your font. >> >> On Thu, Apr 2, 2020 at 1

Re: [tesseract-ocr] How to view lstmf file

2020-04-03 Thread Shree Devi Kumar
Please see https://github.com/tesseract-ocr/tesseract/issues/2669 for related discussions. @stweil has made the feature available in his branch of repo. On Fri, Apr 3, 2020, 17:02 Essam Zaky wrote: > Hi Dears > Is there a tool to view lstmf , i would like to see the input image to > model and wh

Re: [tesseract-ocr] As good as Latin.traineddata (fast integer) but faster

2020-04-08 Thread Shree Devi Kumar
I suggest you fine-tune Latin.traineddata using text of the kind you expect. It will have a smaller unicharset and when you convert to fast integer model, it should be smaller in size. On Wed, Apr 8, 2020, 20:39 O CR wrote: > Hi all, > > I try to read names on images with tesseract LSTM. Names l

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-08 Thread Shree Devi Kumar
Why do you want to fine-tune eng to get to hindi traineddata? You can fine-tune hin.traineddata or script/Devanagari.traineddata. On Wed, Apr 8, 2020, 21:00 Piyush Chandra wrote: > When I downloaded the devenagari.unicharset, Latin.unicharset and > radical-stroke.txt > , it worked. What are the

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-08 Thread Shree Devi Kumar
devenagari.unicharset, Latin.unicharset and radical-stroke.txt The script unicharset files are useful in setting character properties. For most scripts they are already available in langdata_lstm. I don't think they are mandatory for lstm training but by copying them once you can avoid the warning mes

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-09 Thread Shree Devi Kumar
components On Thu, Apr 9, 2020 at 12:15 PM Piyush Chandra wrote: > Thank you Shree for giving the overview. > > Could you please help me understand your last point? Your unicharset > should have Unicode codepoints. what does that mean? any example would be > helpful. I was actually using ak

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-09 Thread Shree Devi Kumar
# Normalization mode - 2, 1 - for unicharset_extractor and Pass through Recoder for combine_lang_model ifeq ($(LANG_TYPE),Indic) NORM_MODE =2 RECODER =--pass_through_recoder On Thu, Apr 9, 2020 at 12:29 PM Shree Devi Kumar wrote: > Unicharset will look like the following: >

Re: [tesseract-ocr] As good as Latin.traineddata (fast integer) but faster

2020-04-10 Thread Shree Devi Kumar
s" *--lang Latin* > --linedata_only --noextract_font_properties --langdata_dir ./langdata > --tessdata_dir ./tessdata --output_dir ./output > > On Wednesday, 8 April 2020 at 18:27:15 UTC+2, shree wrote: >> >> I suggest you fine-tune Latin.traineddata using text of t

Re: [tesseract-ocr] As good as Latin.traineddata (fast integer) but faster

2020-04-10 Thread Shree Devi Kumar
model to integer. But it's still slower than the fast integer Latin > model > Any other ideas to make it faster? > > On Friday, 10 April 2020 at 14:17:55 UTC+2, shree wrote: >> >> The file is probably there as script/Latin.traineddata >> You can copy to wherever yo

Re: [tesseract-ocr] 2 min on 1 page TIFF using Fast trained data

2020-04-13 Thread Shree Devi Kumar
> if a tested app is compiled using Release build it is 30% faster, but still very slow. Debug builds are going to be slower. I tested with command line on linux. The tif file does take long to recognize. Changing file to 300 dpi and smaller size speeded up the time somewhat. If all your images a

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-14 Thread Shree Devi Kumar
eme sequence:D=0x902 > Normalization failed for string 'ं' > Invalid start of grapheme sequence:M=0x93f > Normalization failed for string 'ि' > > > On Tuesday, 14 April 2020 17:01:20 UTC+5:30, Piyush Chandra wrote: >> >> Hi Shree, >> >>
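
Errors like the ones quoted here typically point at strings that begin with a dependent sign (0x902 is the anusvara, 0x93f the vowel sign i) with no base letter in front of it. A rough way to find such tokens in a training text before running the tools is sketched below. It uses Unicode general categories as a proxy and is not Tesseract's own validation logic, so treat hits as candidates to inspect rather than a definitive list.

```python
import unicodedata

def starts_with_mark(token):
    """True if the first character is a Unicode mark (category Mn, Mc or Me)."""
    return bool(token) and unicodedata.category(token[0]).startswith("M")

def suspicious_tokens(lines):
    """Return (line_number, token, code point) for tokens that start with a mark."""
    hits = []
    for number, line in enumerate(lines, 1):
        for token in line.split():
            if starts_with_mark(token):
                hits.append((number, token, f"U+{ord(token[0]):04X}"))
    return hits

# Second line has a vowel sign with no base consonant before it.
sample = ["\u0915\u093f \u0915\u0902", "\u093f\u0915"]
print(suspicious_tokens(sample))
```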

Re: [tesseract-ocr] Re: textline finding fail

2020-04-14 Thread Shree Devi Kumar
I have also noticed the same for Javanese and Balinese scripts. On Tue, Apr 14, 2020, 09:42 Pndaza wrote: > Textline finding fails when base constants and their upper vowel or asat > are separate. > When base constants and their upper vowel or asat are join, it ok > > On Tuesday, 14 April 2020

Re: [tesseract-ocr] lstmeval does not perform eval

2020-04-15 Thread Shree Devi Kumar
lstmeval has different verbosity levels. Which one did you use? On Wed, Apr 15, 2020 at 4:17 PM Usamah Jundi wrote: > Hi, sorry for the brief title, let me explain my situation a bit. > > So what i've done is: > > 1. Use tesstrain.sh to generate the training files (the.lstmf, .txt and > that one

Re: [tesseract-ocr] lstmeval does not perform eval

2020-04-16 Thread Shree Devi Kumar
you the accuracy percentage at end. I am assuming that the model is correctly recognizing the lines it was tr On Thu, Apr 16, 2020 at 7:42 AM Usamah Jundi wrote: > Hello shree, i did not specify any verbosity level. The same exact command > with the eval list argument pointing to other

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-16 Thread Shree Devi Kumar
U+200D ‍ e2 80 8d ZERO WIDTH JOINER

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-16 Thread Shree Devi Kumar
U+0965 ॥ e0 a5 a5 DEVANAGARI DOUBLE DANDA On Thu, Apr 16, 2020, 19:25 Shree Devi Kumar wrote: > U+200D ‍ e2 80 8d ZERO WIDTH JOINER
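
The code point, UTF-8 bytes and character name listed in these two messages can be reproduced for any character with a few lines of Python, which may help when checking what is really in a training text or unicharset. This is just a sketch with the standard `unicodedata` module; the function name is made up.

```python
import unicodedata

def describe(ch):
    """Format one character as 'U+XXXX <utf-8 bytes in hex> <Unicode name>'."""
    code = f"U+{ord(ch):04X}"
    utf8 = " ".join(f"{byte:02x}" for byte in ch.encode("utf-8"))
    return f"{code} {utf8} {unicodedata.name(ch)}"

for ch in ("\u200d", "\u0965"):
    print(describe(ch))
# U+200D e2 80 8d ZERO WIDTH JOINER
# U+0965 e0 a5 a5 DEVANAGARI DOUBLE DANDA
```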

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-16 Thread Shree Devi Kumar
You are training from scratch. It will take thousands of iterations. Try fine-tuning. On Thu, Apr 16, 2020, 19:51 Piyush Chandra wrote: > Hi Shree, > > Thanks for replying. > > So shall I remove them from text file and create a unicharset file after > that or do I have do

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-21 Thread Shree Devi Kumar
Please share a couple of image files and their corresponding text version so that I can see what will work best. On Tue, Apr 21, 2020, 20:17 Peyi Oyelo wrote: > Hello Shree and sorry for reviving an old dead thread. I am currently > trying to train Tesseract to recognize the Akan language.

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-21 Thread Shree Devi Kumar
wrote: > Thank you for replying Shree. I have zipped the entire document into > Akan.zip. > > > I have attached the source training text file (Akan.dejavusans.txt) > containing the text that is to be recognized by Tesseract. I have been able > to generate a tiff f

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-24 Thread Shree Devi Kumar
On Sat, Apr 25, 2020 at 2:13 AM Peyi Oyelo wrote: > @shree hello sir/maam? > Maam :-) > > On Wednesday, April 22, 2020 at 7:23:28 AM UTC-7, Peyi Oyelo wrote: >> >> I created the akan.traineddata using the typical tesseract 3 legacy >> workflow. >> > O

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-25 Thread Shree Devi Kumar
Please check gitHub.com/shreeshrii/tesstrain-akan The data folder has the fine-tuned traineddata file also. Since akan is written in Latin script this was easy to do. On Sat, Apr 25, 2020, 08:40 Shree Devi Kumar wrote: > On Sat, Apr 25, 2020 at 2:13 AM Peyi Oyelo wrote: > >> @shr

Re: [tesseract-ocr] Tesseract OCR Failing to Read Cleaned Numbers. Suggestions Please?

2020-04-30 Thread Shree Devi Kumar
Looks like the image resolution is not set correctly. You can specify dpi while processing. ubuntu@tesseract-ocr:~/TEST$ tesseract 82.png - --dpi 300 82 ubuntu@tesseract-ocr:~/TEST$ tesseract 81.png - --dpi 300 81 On Thu, Apr 30, 2020 at 2:57 PM tristan gordon wrote: > Hello all, > > Could y
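
For batch use, the same call can be assembled in a script. The sketch below assumes 300 dpi is a sensible default for the images at hand, which is a guess and not something the image files confirm; `-` as the output base sends the text to stdout, as in the commands above.

```python
def dpi_command(image, dpi=300):
    """Build a tesseract argv that prints text to stdout with an explicit dpi."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    # "-" as output base means stdout; --dpi overrides the image metadata.
    return ["tesseract", image, "-", "--dpi", str(dpi)]

for name in ("82.png", "81.png"):
    print(" ".join(dpi_command(name)))
```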

Re: [tesseract-ocr] Improving speed of a fine-tuned tessdata-best data file

2020-05-16 Thread Shree Devi Kumar
Convert it to a fast model. combine_tessdata -c compresses the traineddata file. You can also do it when you stop lstmtraining with the --convert_to_int flag. Please check syntax - On Sun, May 17, 2020, 11:13 Kunal Singh wrote: > Hello, > > I am using a fine-tuned traineddata file (from tessdata_
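
The two routes mentioned here can be sketched as argv builders. As far as I know the lstmtraining flag is spelled with underscores (`--convert_to_int`) and is used together with `--stop_training`, but please verify against `lstmtraining --help` for your build. All paths below are hypothetical placeholders.

```python
def compact_command(traineddata):
    """combine_tessdata -c: convert the LSTM model in a traineddata file to integer."""
    return ["combine_tessdata", "-c", traineddata]

def stop_training_command(checkpoint, traineddata, model_output):
    """Finish training from a checkpoint and write an integer (fast) traineddata."""
    return [
        "lstmtraining", "--stop_training", "--convert_to_int",
        "--continue_from", checkpoint,
        "--traineddata", traineddata,
        "--model_output", model_output,
    ]

# Placeholder paths for illustration only.
print(" ".join(compact_command("mylang.traineddata")))
print(" ".join(stop_training_command(
    "output/mylang_checkpoint", "mylang/mylang.traineddata",
    "output/mylang_fast.traineddata")))
```

Speed also depends on the network and unicharset size, so converting to integer may not by itself match the official fast models.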

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-05-28 Thread Shree Devi Kumar
>Create box files: tesseract /path/to/image.tif path/and/nameof/boxfile/imgae lstmbox Alternately you can use wordstrbox config file. In both cases, if you are generating box files from images, the box files need to be corrected before proceeding for training. On Thu, May 28, 2020 at 5:51 PM

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-05-28 Thread Shree Devi Kumar
UTC+3, shree wrote: > > >> Alternately you can use wordstrbox config file. >> >> What is "wordstrbox config file"?

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-05-29 Thread Shree Devi Kumar
aining ls *.lstmf -1 > mylang.traininingfiles_text > > > > > On Thursday, 28 May 2020 at 18:21:31 UTC+3, shree wrote: >> >> lstmbox creates character level box files. >> >> Wordstrbox creates line level box files. >> >> If using wordstrbox, please us
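
The `ls *.lstmf -1 > ...` step in the quoted text writes one .lstmf file name per line into a list file. A comparable sketch in Python is below; unlike the `ls` call it writes each path including the directory, and the function name and arguments are made up for illustration.

```python
from pathlib import Path

def write_listfile(lstmf_dir, listfile):
    """Write the sorted *.lstmf paths found in lstmf_dir, one per line."""
    paths = sorted(str(path) for path in Path(lstmf_dir).glob("*.lstmf"))
    Path(listfile).write_text("".join(p + "\n" for p in paths), encoding="utf-8")
    return paths

# Example call (names are placeholders):
# write_listfile("output", "output/mylang.training_files.txt")
```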

Re: [tesseract-ocr] Re: (Question) UB Mannheims's Windows installer options

2020-05-30 Thread Shree Devi Kumar
I haven't installed the windows version lately, but if you open those sections in the install window you will see the available options. Tesseract has many different types of language traineddata files - see the repos tessdata, tessdata_best and tessdata_fast. The default bundling of languages is

Re: [tesseract-ocr] Re: (Question) UB Mannheims's Windows installer options

2020-05-31 Thread Shree Devi Kumar
o use the OCR to scan image and output to a text > file, and not engage in any training or anything, do I only need to install > the relevant language data? Or do I need both the language data AND the > script data for the relevant language? > > > On Sunday, May 31, 2020 at 4:

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-05-31 Thread Shree Devi Kumar
I still don't understand. > > On Friday, 29 May 2020 at 15:02:22 UTC+3, shree wrote: > >> Input Files >> >> myfile1.png >> myfile1.gt.txt >> >> > Is "myfile1.png" - the picture with training text? > What is "

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-05-31 Thread Shree Devi Kumar
Use tesstrain.sh or tesstrain.py On Sun, May 31, 2020 at 6:45 PM Владимир Калачихин wrote: > Ok, I want to train from training text and fonts. > Whats method must be?

Re: [tesseract-ocr] How can use tessract for training using my own image dataset

2020-06-01 Thread Shree Devi Kumar
If your image dataset and groundtruth is for line images you can use https://github.com/tesseract-ocr/tesstrain On Mon, Jun 1, 2020 at 11:16 AM 易鑫 wrote: > Hello,everyone: > As we all know,after teseract v4.0,it can generate dataset > automatically.But for me ,the accuracy is not as good

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-06-01 Thread Shree Devi Kumar
ata --model_output ./output/mylang.traineddata On Mon, Jun 1, 2020 at 12:11 AM Владимир Калачихин wrote: > On Sunday, 31 May 2020 at 19:16:55 UTC+3, shree wrote: >> >> Use tesstrain.sh or tesstrain.py >> >> On Sun, May 31, 2020 at 6:45 PM Владимир Калачихин

Re: [tesseract-ocr] Where to download the dutch language pack?

2020-06-01 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tessdata_fast https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files.md On Mon, Jun 1, 2020 at 3:31 PM Mike Dewul wrote: > I am trying "(a9t9)FreeOcrWindowsDesktop" which perform OCR of images > (batch) > However, I need the Dutch (NLD) language pack. >

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-06-01 Thread Shree Devi Kumar
Monday, 1 June 2020 at 11:23:39 UTC+3, shree wrote: >> >> >> ### create tif and box using fonts and training text >> text2image --fonts_dir=/home/ubuntu/.fonts >> --outputbase=/mylang.myfont.exp0 --max_pages=0 --font=myfont >> --text=../langdata/m

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-06-01 Thread Shree Devi Kumar
You may find this repo useful https://github.com/UYousafzai/easy_train_tesseract On Mon, Jun 1, 2020 at 10:05 PM Shree Devi Kumar wrote: > >Failed to load script unicharset from:./langdata/Latin.unicharset" > > This is for Latin script not Latin language. > wget t

Re: [tesseract-ocr] Re: Where to download the dutch language pack?

2020-06-03 Thread Shree Devi Kumar
I suggest you use a windows gui front end such as Vietocr or gimagereader. On Wed, Jun 3, 2020 at 12:22 PM Mike Dewul wrote: > Thank you. I installed it, but > a. installing another language requires admin. right > b. only 1 image can scan > c. when performing a bulk OCR on images, ".txt" is ad

Re: [tesseract-ocr] Using tesstrain.sh to produce training data

2020-06-07 Thread Shree Devi Kumar
Try with --exposures "-3 -2 -1" (DEFAULT IS 0) On Sun, Jun 7, 2020 at 3:42 PM Dave wrote: > [image: bad data.png] > > I am using tesstrain.sh to create training data from a font but the data > it creates looks too dark to be good training data, is this because of the > font or am I missing som

Re: [tesseract-ocr] Training for Kurdish in Arabic script

2020-06-14 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesstrain/wiki for links regarding tesseract training for handwriting On Sun, Jun 14, 2020, 23:20 mit wrote: > Hi Shree, > > Can we train tesseract for handwritten date? > > TIA > > On Saturday, February 1, 2020 at 5:13:10 PM

Re: [tesseract-ocr] How to tun Tesseract?

2020-06-20 Thread Shree Devi Kumar
Open a command window first. Then run tesseract from there. On Sun, Jun 21, 2020, 12:17 Edoardo Mori wrote: > I tried both version 32 and 64. When I run tesseract.exe in Windows 10, a > window disappears for a few tenths of seconds. A command line cannot be > written. How to do? Thank you . Edoa

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Shree Devi Kumar
What character are you trying to add? Please share the training data to try and replicate the issue. On Sun, Jul 12, 2020, 15:35 Eliyaz L wrote: > Hi, > > > My use case is on Arabic documents; the pretrained ara.traineddata is > good but not perfect, so I wish to fine-tune ara.traineddata, i

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Shree Devi Kumar
@Eliyaz What version of tesseract are you using? Which traineddata? >Always the letter "لا" is predicted as "ال" . I think this was fixed by Ray Smith in 2017 and should be ok in the traineddata files in tessdata_fast and tessdata_best repos. On Sun, Jul 12, 2020 at 6:45 PM Rainer Verteidiger <

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/issues/758 and other similar issues On Sun, Jul 12, 2020 at 6:52 PM Shree Devi Kumar wrote: > @Eliyaz What version of tesseract are you using? Which traineddata? > > >Always the letter "لا" is predicted as "ال" . &

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Shree Devi Kumar
12, 2020, 20:52 Eliyaz L wrote: > Hi Shree, > > I was using the version below. I guess you are right, it's a 2016 file. Let > me test with the latest traineddata. > https://tesseract-ocr.github.io/tessdoc/Data-Files > https://github.com/tesseract-ocr/tessdata/raw/4.00/ara.traineddata >

Re: [tesseract-ocr] Looking for segmentation algorithm implementations and (G)UIs

2020-07-13 Thread Shree Devi Kumar
Good collection of segmentation algorithms. Dan Bloomberg updated the segmentation algorithms in leptonica some time back. You may want to take a look at those too. Tesseract also uses leptonica, but older algorithms, I think. On Sat, Jul 11, 2020 at 9:19 PM Rainer Verteidiger < materialdefen

Re: [tesseract-ocr] How to exclude some symbols from recognizing?

2020-07-13 Thread Shree Devi Kumar
Search for whitelist / blacklist On Mon, Jul 13, 2020, 17:24 Владимир Калачихин wrote: > Subj > Numbers, for example.
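As a rough illustration of the blacklist approach, the sketch below builds a tesseract command line that sets the `tessedit_char_blacklist` config variable through `-c`. The helper name `blacklist_command` and the file names are made up for the example, and how well white/blacklists are honoured can depend on the Tesseract version and OCR engine in use, so treat it as a starting point rather than a guaranteed recipe.

```python
import shlex


def blacklist_command(image, output_base, excluded_chars):
    """Build (but do not run) a tesseract command that excludes characters.

    `image` and `output_base` are placeholder names supplied by the caller;
    `excluded_chars` is the string of symbols tesseract should not output.
    """
    return [
        "tesseract", image, output_base,
        "-c", f"tessedit_char_blacklist={excluded_chars}",
    ]


# Example: ask tesseract not to emit digits (hypothetical file names).
cmd = blacklist_command("page.png", "page", "0123456789")
print(shlex.join(cmd))
```

To run it for real, pass `cmd` to `subprocess.run(cmd, check=True)` on a machine where tesseract is installed. Using `tessedit_char_whitelist` instead restricts output to the listed characters.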

Re: [tesseract-ocr] How to get linewise/ row-wise output rather than column wise in hOCR output

2020-07-13 Thread Shree Devi Kumar
Use page segmentation mode 6 (--psm 6) instead of the default. On Mon, Jul 13, 2020, 22:05 Deepak Sen wrote: > Hi, > I am using the latest tesseract version and getting the hOCR output of a table > where line no of (column2, row1) is not line-1 so what i want is tesseract > first goes through all the ro
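A minimal sketch of that invocation, assuming tesseract is on the PATH: `--psm 6` tells tesseract to treat the page as a single uniform block of text, and the trailing `hocr` config selects hOCR output. The helper name `hocr_command` and the file names are illustrative only.

```python
def hocr_command(image, output_base, psm=6):
    """Build (but do not run) a tesseract command producing hOCR output.

    With psm=6 the page is assumed to be one uniform block of text, which
    tends to give row-by-row line order on simple tables.
    """
    return ["tesseract", image, output_base, "--psm", str(psm), "hocr"]


# Hypothetical input image and output base name.
cmd = hocr_command("table.png", "table")
print(" ".join(cmd))
```

Running the command writes `table.hocr` next to the output base; whether the line order then matches the table rows still depends on the image.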

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-14 Thread Shree Devi Kumar
prepare dataset and train a separate >> custom model for only numbers and date. >> >> if possible then pls help me with the sample dataset and can i use this >> <https://github.com/tesseract-ocr/tesstrain> repo to train and if any >> apx count of dataset and iteration

Re: [tesseract-ocr] building tir.traineddata from scratch

2020-08-04 Thread Shree Devi Kumar
Please see https://tesseract-ocr.github.io/tessdoc/Data-Files-in-tessdata_fast.html Version string:4.00.00alpha:tir:synth20170629 LSTM training info:Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx128O1c1], flags=41, iteration=10498000, sample_iteration=10498000, null_char=267, learning_rat

Re: [tesseract-ocr] Re: Help: lstmtraning not found

2020-08-06 Thread Shree Devi Kumar
If you have tesseract and all training tools installed, you should be able to use tesseract, lstmtraining, etc. without giving the path. What's the output of: which tesseract; tesseract -v; which lstmtraining; lstmtraining -v On Fri, Aug 7, 2020, 01:13 minh...@gmail.com wrote: > Sorry that I forgot

Re: [tesseract-ocr] Re: Help: lstmtraning not found

2020-08-07 Thread Shree Devi Kumar
time > (with same data in *--train_listfile ~*). As I thought, each time the > traineddata is updated. > Is it a way to exact traineddata from best_traineddata for some selected > fonts? > > Thanks, > > TuPM > > On Friday, August 7, 2020 at 9:30:33 AM UTC+7 minh...@g

Re: [tesseract-ocr] Re: Extraction of two different language text from single image using tesseract

2020-08-19 Thread Shree Devi Kumar
For multiple languages, the standard invocation is to join the language codes with a + sign, e.g. -l ara+eng or -l eng+jpn. Alternatively, you can try the script traineddata files, e.g. Devanagari includes eng+hin+san+mar+nep. However, multi-language recognition takes more time and is not perfe
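The `+` convention can be sketched as below. The helper names `lang_argument` and `multilang_command` and the file names are assumptions for the example; the language codes are the ones mentioned in the message, and the matching traineddata files have to be installed for the command to work.

```python
def lang_argument(codes):
    """Join language codes into the value expected by tesseract's -l option."""
    return "+".join(codes)


def multilang_command(image, output_base, codes):
    """Build (but do not run) a tesseract command using several languages."""
    return ["tesseract", image, output_base, "-l", lang_argument(codes)]


# Hypothetical image containing Arabic and English text.
cmd = multilang_command("mixed.png", "mixed", ["ara", "eng"])
print(" ".join(cmd))
```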

Re: [tesseract-ocr] Tesseract could give me the position of the output characters.

2020-08-23 Thread Shree Devi Kumar
Try tsv or hocr output. Also do a Google search for receipt recognition with tesseract. I have seen a few examples where item names and prices are being recognised. You could try something similar with nutritional information. On Thu, Aug 20, 2020, 21:40 ELIANA MARTINEZ CORTES wrote: > > Hi! I am working on a p
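To show what the tsv route gives you, here is a small sketch that reads tesseract's TSV columns (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text) and keeps the word-level rows with their bounding boxes. The sample rows are made up for illustration and are not real tesseract output; `word_boxes` is a hypothetical helper name.

```python
import csv
import io

# Made-up sample in tesseract's TSV layout (not real OCR output).
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t36\t92\t220\t30\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t96\t30\t96\tCalories\n"
    "5\t1\t1\t1\t1\t2\t140\t92\t60\t30\t91\t250\n"
)


def word_boxes(tsv_text):
    """Return (text, left, top, width, height) for each word-level row."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    boxes = []
    for row in reader:
        text = (row["text"] or "").strip()
        # Level 5 rows are words; higher-level rows carry no text.
        if row["level"] != "5" or not text:
            continue
        boxes.append(
            (text, int(row["left"]), int(row["top"]),
             int(row["width"]), int(row["height"]))
        )
    return boxes


for box in word_boxes(SAMPLE_TSV):
    print(box)
```

On a real page, the TSV text would come from the `.tsv` file written by running tesseract with the `tsv` output config; pairing a label such as an item name with the number on the same line is then a matter of comparing the `top` coordinates.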

Re: [tesseract-ocr] bash: training/lstmtraining: No such file or directory during tesstutorial

2020-08-26 Thread Shree Devi Kumar
Did you install tesseract training tools? Try the following commands: lstmtraining --version; which lstmtraining; text2image --version; lstmeval --version On Tue, Aug 25, 2020 at 1:15 PM Theo M-Z wrote: > I followed the tesstutorial, creating base traineddata, but at this point, > the log file
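The same check can be scripted. This sketch only looks the tool names up on the PATH with `shutil.which`, which is roughly what the `which` commands above do; it does not verify versions. The helper name `missing_tools` is made up for the example.

```python
import shutil

# Tools named in the message above, plus tesseract itself.
TRAINING_TOOLS = ["tesseract", "lstmtraining", "text2image", "lstmeval"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools(TRAINING_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All training tools were found on PATH.")
```

If `lstmtraining` shows up as missing, the training tools were probably not built or installed, which would explain a "No such file or directory" error when the tutorial calls `training/lstmtraining`.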

Re: [tesseract-ocr] building tesseract for online hosting

2020-09-09 Thread Shree Devi Kumar
Thanks, Alex. I suggest that you also add this to tesseract documentation, tessdoc repo. On Wed, Sep 9, 2020, 23:30 Александр Поздняков wrote: > Hi. > Alternatively, use AppImage (Ubuntu >= 16.04) > 1. Download > >> wget >> https://github.com/AlexanderP/tesseract-appimage/releases/download/v5.0

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-19 Thread Shree Devi Kumar
Please share your training data so that we can test. Thanks.

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-19 Thread Shree Devi Kumar
> Each of my PNG files have file names that indicate ground truth, and I have a little script that generates ground-truth TXT files from the PNG file names. Please review your script. I notice a number of file names ending with -2. The gt.txt files for the same also contain -2 while the image only
