Re: [tesseract-ocr] how to train tesseract to detect superscripts and subscripts

2019-07-03 Thread Shree Devi Kumar
See https://github.com/Shreeshrii/tess4training#additional-training-scripts---replace-top-layer-bash On Wed, Jul 3, 2019 at 6:03 PM fady taher wrote: > Am trying to detect a superscript like the attached, I tried to add the > "Cr⁶⁺" to the training set like 15 times, but still, it couldnt be > r

Re: [tesseract-ocr] tesseract bug in windows

2019-07-03 Thread Shree Devi Kumar
Bugs are to be reported on GitHub under issues. If it is specific to Windows and uses prebuilt binaries, please report it in the repo of the source. On Wed, 3 Jul 2019, 20:26 _ Flaviu, wrote: > Sorry for this topic, but I think that tesseract library has a bug when > run in windows 10. My question is, whe

Re: [tesseract-ocr] Re: Recognized characters got multiplicated

2019-07-04 Thread Shree Devi Kumar
This is an open issue - see https://github.com/tesseract-ocr/tesseract/issues/1060 and other related issues On Thu, Jul 4, 2019 at 5:33 PM Abstract wrote: > Some more information on my trained data: > real data:12345678903542331100244117021234567 > recognized: 1234567890354233141110024411702

Re: [tesseract-ocr] Re: setting user-words in api?

2019-07-04 Thread Shree Devi Kumar
I have made a wiki page for using user_patterns with API. Please see https://github.com/tesseract-ocr/tesseract/wiki/APIExample-user_patterns You can try similarly for user_words. On Thu, Jul 4, 2019 at 4:40 PM Jochen Naumann wrote: > user_words_file also does not work, the file is not loaded

Re: [tesseract-ocr] Re: setting user-words in api?

2019-07-05 Thread Shree Devi Kumar
I haven't tried user_words yet. pre-processing the image gets you better results. It works with the modified image and \A\d\d\d\d\A\A\d\d\d On Fri, Jul 5, 2019 at 1:55 PM Jochen Naumann wrote: > Thanks, Shree. I appreciate your help! > I tried your example and it works with your
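As a rough illustration of what a pattern such as \A\d\d\d\d\A\A\d\d\d is meant to accept, here is a minimal Python sketch that translates a user pattern into a regular expression so candidate strings can be checked offline. The escape-to-class mapping (\A uppercase, \a lowercase, \c letter, \d digit, \n alphanumeric, \p punctuation, \* repetition) is my reading of the user_patterns format and is limited to ASCII here; Tesseract itself consults its unicharset, so treat the mapping and the function name as assumptions.

```python
import re

# Assumed ASCII approximations of the user-pattern character classes.
CLASSES = {
    "c": "[A-Za-z]",     # any letter
    "d": "[0-9]",        # digit
    "n": "[A-Za-z0-9]",  # letter or digit
    "p": r"[^\w\s]",     # punctuation (rough approximation)
    "a": "[a-z]",        # lowercase letter
    "A": "[A-Z]",        # uppercase letter
}


def user_pattern_to_regex(pattern):
    """Translate a user pattern into a compiled regex (hypothetical helper)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            code = pattern[i + 1]
            if code == "*" and parts:
                # '\*' lets the previous element repeat any number of times.
                parts[-1] += "*"
            else:
                parts.append(CLASSES.get(code, re.escape(code)))
            i += 2
        else:
            # Anything else is matched literally.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


if __name__ == "__main__":
    rx = user_pattern_to_regex(r"\A\d\d\d\d\A\A\d\d\d")
    for candidate in ("B1234CD567", "b1234CD567"):
        print(candidate, bool(rx.fullmatch(candidate)))
```

This only checks whether a string fits the pattern; it says nothing about how strongly Tesseract weights the pattern during recognition.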

Re: [tesseract-ocr] Choice Iterator only shows one choice for each character

2019-07-05 Thread Shree Devi Kumar
Thanks! On Fri, Jul 5, 2019 at 3:42 PM Zdenko Podobny wrote: > IMO link should be https://github.com/tesseract-ocr/tesseract/issues/2536 > > > Zdenko > > > št 4. 7. 2019 o 11:39 shree napísal(a): > >> See related discussion at >> https://github.com/tesserac

Re: [tesseract-ocr] retrained file after fine tuning the tesseract

2019-07-08 Thread Shree Devi Kumar
The 5 MB traineddata is from tessdata_fast. For retraining, the file from tessdata_best is used, which is 15 MB. You can use --convert_to_int with --stop_training to make it smaller. Check the wiki's training page for details. On Mon, Jul 8, 2019 at 2:43 PM Purushotham Rao Eravalli < purushot...@sukshi.com> wro
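A minimal sketch of the conversion step described above, written as a Python helper that only assembles the lstmtraining command line and does not run it. The flag spelling --convert_to_int is what I understand current lstmtraining builds to accept; the helper name and all file paths are hypothetical placeholders.

```python
import shlex


def stop_training_command(checkpoint, base_traineddata, output, to_int=True):
    """Build (but do not run) an lstmtraining call that turns a checkpoint
    into a traineddata file, optionally as a smaller integer model."""
    cmd = ["lstmtraining", "--stop_training"]
    if to_int:
        # Assumed flag name for the integer (fast-style) conversion.
        cmd.append("--convert_to_int")
    cmd += [
        "--continue_from", checkpoint,
        "--traineddata", base_traineddata,
        "--model_output", output,
    ]
    return cmd


if __name__ == "__main__":
    # All paths below are made up for illustration.
    cmd = stop_training_command(
        "output/eng_checkpoint",
        "tessdata_best/eng.traineddata",
        "output/eng_int.traineddata",
    )
    print(shlex.join(cmd))
```

Leaving to_int off should give the larger float model, which is the form normally kept if further fine-tuning is planned.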

Re: [tesseract-ocr] how to train tesseract to detect superscripts and subscripts

2019-07-09 Thread Shree Devi Kumar
If you use the submodule you will save time taken in running the 8-makedata_layernew.sh script. However, if you have modified training_text or want to checkout the full process, run the script. On Tue, Jul 9, 2019 at 4:33 PM fady taher wrote: > I can see that you have mentioned >> > "IT IS NOT R

Re: [tesseract-ocr] how to train tesseract to detect superscripts and subscripts

2019-07-09 Thread Shree Devi Kumar
I don't think I had any (or enough) plus superscript in my training_text. Treat this as an example and train as per the data you expect. On Tue, 9 Jul 2019, 17:01 fady taher, wrote: > Dear Shree, thanks for you quick response ... I gave a try to the > submodule ... it gave res

Re: [tesseract-ocr] how to train tesseract to detect superscripts and subscripts

2019-07-10 Thread Shree Devi Kumar
wrote: > >> will try and feed you back, thanks alot >> >> On Tue, Jul 9, 2019 at 1:40 PM Shree Devi Kumar >> wrote: >> >>> I don't think I had any (or enough) plus superscript in my training_text. >>> >>> Treat this as an example and

Re: [tesseract-ocr] Custom words combining letters + digits

2019-07-10 Thread Shree Devi Kumar
--user-words does not currently work in tesseract4. On Wed, Jul 10, 2019 at 7:59 PM David Novak wrote: > > Hello, > > I have a custom list of words that I'd like to add to (or practically > substitute for) the default word list in my language. Some of these words > combine letters & digits & pun

Re: [tesseract-ocr] Re: How to achieve very high fine-tuning accuracy on a particular font of english? (requirement: char error rate < 0.1%)

2019-07-11 Thread Shree Devi Kumar
See https://groups.google.com/forum/m/?utm_medium=email&utm_source=footer#!searchin/tesseract-ocr/Cursive/tesseract-ocr/6naBkXZvTlI On Thu, 11 Jul 2019, 11:58 sai sumanth Kalluri, wrote: > Can somebody please give me some advice regarding this? > > On Tuesday, 9 July 2019 11:52:28 UTC+5:30, sa

Re: [tesseract-ocr] Re: How to achieve very high fine-tuning accuracy on a particular font of english? (requirement: char error rate < 0.1%)

2019-07-11 Thread Shree Devi Kumar
Search the forum for Cursive On Thu, 11 Jul 2019, 13:00 sai sumanth Kalluri, wrote: > Thanks for the reply but that link does not lead anywhere. Could you > please correct it? > > On Thursday, 11 July 2019 12:34:38 UTC+5:30, shree wrote: >> >> See >> htt

Re: [tesseract-ocr] Updated: tesseract-ocr-4.1.0-1

2019-07-11 Thread Shree Devi Kumar
Thanks, Marco. Please also include link to - https://github.com/tesseract-ocr/langdata_lstm as the source of language data for LSTM training. - https://github.com/tesseract-ocr/tessdata_best and - https://github.com/tesseract-ocr/tessdata_fast for the LSTM traineddata files On F

Re: [tesseract-ocr] Re: Tesseract 4.1.0 released

2019-07-12 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc Lstmbox and wordstrbox create box files for training. Alto creates XML output. Hocr creates HTML output. On Fri, 12 Jul 2019, 13:39 ElGato ElMago, wrote: > Hello, > > How do you use Alto, LSTMBox, and WordStrBox? A

Re: [tesseract-ocr] Language detection

2019-07-16 Thread Shree Devi Kumar
see https://github.com/tesseract-ocr/tesseract/wiki/APIExample#orientation-and-script-detection-osd-example On Tue, Jul 16, 2019 at 5:42 PM Purushotham Rao Eravalli < purushot...@sukshi.com> wrote: > Hi, > Is there a way where we can detect that the text is english or else of any > other langua

Re: [tesseract-ocr] Language detection

2019-07-16 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tesseract/blob/master/unittest/osd_test.cc On Tue, Jul 16, 2019 at 5:47 PM Shree Devi Kumar wrote: > see > https://github.com/tesseract-ocr/tesseract/wiki/APIExample#orientation-and-script-detection-osd-example > > > > On Tue, Jul 1

Re: [tesseract-ocr] tesseract produces one time bad one time good results

2019-07-18 Thread Shree Devi Kumar
Binarize and invert the images to get black text on white. I tried with latest code from master branch on github, gives correct results. tesseract 2-bw.png stdout --psm 6 --dpi 300 --tessdata-dir ~/tessdata --oem 1 --user-patterns ./timestamp.patterns.txt -c lstm_use_matrix=1 -c tessedit_char_whit

Re: [tesseract-ocr] Trained data for E13B font

2019-07-18 Thread Shree Devi Kumar
y anyway. >> >> Here there is another MRZ model with training data: >> >> https://github.com/DoubangoTelecom/tesseractMRZ >> >> >> >> >> Lorenzo >> >> >> Il giorno mer 17 lug 2019 alle ore 11:26 Claudiu ha >> scritto: >&

Re: [tesseract-ocr] Trained data for E13B font

2019-07-18 Thread Shree Devi Kumar
Also https://github.com/tesseract-ocr/tesseract/pull/2576 On Fri, 19 Jul 2019, 11:14 Shree Devi Kumar, wrote: > Please check out the recent commits in master branch > > https://github.com/tesseract-ocr/tesseract/pull/2554 > > On Fri, 19 Jul 2019, 10:55 ElGato ElMago,

Re: [tesseract-ocr] Training stops before specified iterations

2019-07-18 Thread Shree Devi Kumar
The target character error rate may have been achieved. On Fri, 19 Jul 2019, 11:14 Pooja Kamra, wrote: > In training comand, max iterations given are 1. But training stops > after 4600 iterations. > What can be reason for this. > > Regards, > Pooja > > -- > You received this message because
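The stopping behaviour mentioned above can be sketched as a small Python predicate. This is a simplified model of my understanding, not lstmtraining's actual code: training ends when the best character error rate falls to the target or the iteration limit is reached. The default values used here (target 0.01 percent, a limit of 0 meaning no limit) are assumptions to verify against your build.

```python
def should_stop(iteration, best_error_rate,
                max_iterations=0, target_error_rate=0.01):
    """Return True when training would end under the assumed rules.

    best_error_rate and target_error_rate are percentages.
    max_iterations == 0 is treated as "no iteration limit" (assumption).
    """
    hit_target = best_error_rate <= target_error_rate
    hit_limit = max_iterations > 0 and iteration >= max_iterations
    return hit_target or hit_limit


if __name__ == "__main__":
    # Stops early: target error rate reached before the iteration limit.
    print(should_stop(4600, 0.009, max_iterations=10000))
    # Keeps going: error rate still above target, limit not reached.
    print(should_stop(4600, 1.439, max_iterations=10000))
```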

Re: [tesseract-ocr] Training stops before specified iterations

2019-07-19 Thread Shree Devi Kumar
Kamra wrote: > Dear Shree, > > I have not specified target error rate. What eror rate will be taken as > default. > > > > On Friday, July 19, 2019 at 11:17:12 AM UTC+5:30, shree wrote: >> >> The target character error rate may have been achieved. >> >>

Re: [tesseract-ocr] Trained data for E13B font

2019-07-19 Thread Shree Devi Kumar
one. I can supply a test case if it is expected to work well. > > On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago > wrote: > >> Lorenzo, >> >> We both have got the same case. It seems a solution to this problem >> would save a lot of people. >> >> Shr

Re: [tesseract-ocr] Trained data for E13B font

2019-07-19 Thread Shree Devi Kumar
o ElMago wrote: > Lorenzo, > > We both have got the same case. It seems a solution to this problem would > save a lot of people. > > Shree, > > I pulled the current head of master branch but it doesn't seem to contain > the merges you pointed that have been merge

Re: [tesseract-ocr] Training stops before specified iterations

2019-07-19 Thread Shree Devi Kumar
As per your screenshot 15000 iterations have been done. On Fri, Jul 19, 2019 at 3:52 PM Pooja Kamra wrote: > As per log file, finished error rate is 1.439. > > > > On Friday, July 19, 2019 at 1:24:35 PM UTC+5:30, shree wrote: >> >> Look at tesstrain.log for d

Re: [tesseract-ocr] understading lstmeval and use it on pretrained models for comparison

2019-07-19 Thread Shree Devi Kumar
or on the evaluation set. > At iteration 61235/475300/475526, Mean rms=0.521%, delta=2.073%, char > train=9.379%, word train=9.669%, skip ratio=0.1%, New worst char error = > 9.379 wrote checkpoint. > > > > Le vendredi 28 juin 2019 17:39:52 UTC+2, shree a écrit : >> >> Yo

Re: [tesseract-ocr] understading lstmeval and use it on pretrained models for comparison

2019-07-21 Thread Shree Devi Kumar
>But there are still a lot of things I do not understand. And one of them is actually causing me an issue : even with a lot of iterations (475k) I still do not see any log message with the error on the evaluation set. At iteration 61235/475300/475526, Mean rms=0.521%, delta=2.073%, char train= 9.37

Re: [tesseract-ocr] Trained data for E13B font

2019-07-22 Thread Shree Devi Kumar
9 ElGato ElMago: > >> Lorenzo, >> >> I haven't been checking psm too much. Will turn to those options after I >> see how it goes with bounding boxes. >> >> Shree, >> >> I see the merges in the git log and also see that new >> option lstm_c

Re: [tesseract-ocr] Tesseract fine tuning with another font.

2019-07-25 Thread Shree Devi Kumar
lot but didn't found a proper guide for the process. > > System: Ubuntu 14 64-bit > > @shree ma'am please help me out if you come across this post. > > -- > You received this message because you are subscribed to the Google Groups > "tesseract-ocr" group. >

Re: [tesseract-ocr] Information on the design decisions for Tesseract's neural network

2019-07-25 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#documentation On Thu, 25 Jul 2019, 21:52 Julian Gilbey, wrote: > Hello, > > I've found with a lot of reading around that tesseract (appears to) use a > neural network with the following specification: > [1,36,0,1 Ct3,3,16 Mp3,3 Lf

Re: [tesseract-ocr] Use Tesseract dll with c project

2019-07-25 Thread Shree Devi Kumar
Please suggest ways we can improve the situation. Is the documentation difficult to find? Difficult to understand? Is there a way that a FAQ type of page with relevant links be automatically sent to first time posters in this Google group? Do Google groups allow a pinned post to show on top whic

Re: [tesseract-ocr] two similar picture,one get correct result,the other gets only one char,why?

2019-07-26 Thread Shree Devi Kumar
I do not have an answer as to why. You will need to step through the code in a debugger to find out. However, a little preprocessing of the image helps: inverting and binarizing it gives correct output. On Thu, 25 Jul 2019, 11:14 Chen Yufu, wrote: > I use Tesseract command line to get OCR result,like thi

Re: [tesseract-ocr] Re: Tesseract 4.1.0 released

2019-07-28 Thread Shree Devi Kumar
It is not a bug but is intentional. For details please see discussion at https://github.com/tesseract-ocr/tesseract/issues/648#issuecomment-271870748 On Sat, Jul 27, 2019 at 4:14 PM Abdou wrote: > > Hello everyone I tried to use OCRD-train with tesseract 4.1 but I did not > succeed. I noticed t

Re: [tesseract-ocr] Re: Tesseract 4.1.0 released

2019-07-28 Thread Shree Devi Kumar
Thanks. Please add the info to Tesseract wiki page also. On Sun, 28 Jul 2019, 18:42 Alex Cohn, wrote: > Hi everybody, > > I am proud to announce Android support for the new 4.1.0 version of > tesseract OCR engine. This repo [1] includes both 3.05 and 4.1 branches, > and lets you painlessly build

Re: [tesseract-ocr] Specific localization and doing OCR

2019-07-29 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/APIExample If you want to restrict recognition to a sub-rectangle of the image - call *SetRectangle(left, top, width, height)* after SetImage. Each SetRectangle clears the recognition results so multiple rectangles can be recognized with the same imag
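Since SetRectangle takes (left, top, width, height) while many detectors report boxes as two corner points, a small conversion step is often needed first. The Python helper below is a hypothetical sketch of that conversion only; it does not call Tesseract, and the function name is my own.

```python
def to_set_rectangle_args(x1, y1, x2, y2):
    """Convert corner coordinates into (left, top, width, height).

    Accepts the corners in either order so that width and height
    never come out negative.
    """
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))
    return left, top, right - left, bottom - top


if __name__ == "__main__":
    # Example box given as top-left and bottom-right corners.
    print(to_set_rectangle_args(10, 20, 110, 70))
```

The resulting tuple can then be passed, in order, to SetRectangle after SetImage, once per region of interest.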

Re: [tesseract-ocr] How to handle wavy scans?

2019-07-30 Thread Shree Devi Kumar
You can try scantailor or a similar program to dewarp the images before feeding to tesseract. On Tue, 30 Jul 2019, 17:14 Hammer of Dawn, wrote: > A lot of the source material I have are wavy images (pdfs), since they > were scanned directly from books. Books (depending on the binding method) > t

Re: [tesseract-ocr] Re: Tesseract 4.1.0 released

2019-07-30 Thread Shree Devi Kumar
> In any case, would it be worthwhile mentioning > https://github.com/rhardih/bad in the wiki as an alternate means of > building for Android and all you want is .so files? > > > /René > > > On Sun, 28 Jul 2019 at 17:28, Alex Cohn wrote: > >> It's there, in >&

Re: [tesseract-ocr] How can I do the training using my own image in Tesseract 4.0

2019-08-01 Thread Shree Devi Kumar
See discussion in https://github.com/tesseract-ocr/tesseract/issues/2357 On Thu, Aug 1, 2019 at 1:49 PM narayana wrote: > Can you please help me on this issue; > How can I do the training using my own image in Tesseract 4.0. > I have gone through all installation steps mentioned in > https://git

Re: [tesseract-ocr] How can I do the training using my own image in Tesseract 4.0

2019-08-01 Thread Shree Devi Kumar
Also see https://github.com/OCR-D/ocrd-train On Thu, Aug 1, 2019 at 1:58 PM Shree Devi Kumar wrote: > See discussion in https://github.com/tesseract-ocr/tesseract/issues/2357 > > On Thu, Aug 1, 2019 at 1:49 PM narayana wrote: > >> Can you please help me on this issue; &

Re: [tesseract-ocr] How to install version 4.1.0 on Ubuntu?

2019-08-01 Thread Shree Devi Kumar
https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr On Thu, 1 Aug 2019, 22:06 Mox Betex, wrote: > How to install Tesseract 4.1.0 on Ubuntu? > > -- > You received this message because you are subscribed to the Google Groups > "tesseract-ocr" group. > To unsubscribe from this group and s

Re: [tesseract-ocr] Problems with training tesseract

2019-08-02 Thread Shree Devi Kumar
Have you tried - https://github.com/DoubangoTelecom/tesseractMRZ On Fri, Aug 2, 2019 at 9:26 PM Cristobal Jesus Muñoz Solano < cmunoz...@gmail.com> wrote: > Hello, I am trying to use tesseract and I have read all the documentation > and I have done many tests, sorry if this is not the place

Re: [tesseract-ocr] Can I add new trainedata in the repository, for my language. like officially

2019-08-07 Thread Shree Devi Kumar
Yes. Community contributions are welcome and are kept in https://github.com/tesseract-ocr/tessdata_contrib Please create a PR with the traineddata file and an information file similar to https://github.com/tesseract-ocr/tessdata_contrib/blob/master/khmLimon.md On Wed, Aug 7, 2019 at 3:43 PM Jenni

Re: [tesseract-ocr] Support for alto - option in Tesseract for linux

2019-08-08 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/configs/alto You can use `alto` config file or use the config variable as part of command -c tessedit_create_alto=1 On Thu, Aug 8, 2019 at 2:59 PM Tommy Klausen wrote: > Hi. > > Is the ALTO config option supported in the last lin
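The two ways of requesting ALTO output described above can be written out as command lines. The Python sketch below only builds the argument lists; the image and output names are placeholders, and it assumes a Tesseract build recent enough to ship the alto config (4.1.0 or later, as far as I know).

```python
import shlex


def alto_commands(image, output_base):
    """Return two equivalent command lines for ALTO output:
    one using the `alto` config file, one using the config variable."""
    with_config = ["tesseract", image, output_base, "alto"]
    with_variable = ["tesseract", image, output_base,
                     "-c", "tessedit_create_alto=1"]
    return with_config, with_variable


if __name__ == "__main__":
    # "page.png" and "page" are hypothetical file names.
    for cmd in alto_commands("page.png", "page"):
        print(shlex.join(cmd))
```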

Re: [tesseract-ocr] Support for alto - option in Tesseract for linux

2019-08-08 Thread Shree Devi Kumar
" in the end, right? > > Can you give me the two different commands for reading an image (with and > without the confg file)? > > torsdag 8. august 2019 11.51.27 UTC+2 skrev shree følgende: >> >> >> https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/configs/

Re: [tesseract-ocr] tesseract output is of first page only

2019-08-09 Thread Shree Devi Kumar
Try creating a multipage tiff from your pdf and try. On Fri, 9 Aug 2019, 11:11 ilevy, wrote: > I'm trying tesseract for the first time with a png of a multipage document > I saved out of a pdf (which itself was just an image). > > When I run tesseract, I get an output of the first page, but that

Re: [tesseract-ocr] Trained data for E13B font

2019-08-09 Thread Shree Devi Kumar
> Well, I read the description of ScrollView ( >>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) >>>>>>>>>> and it says: >>>>>>>>>> >>>>>>>>>> To show t

Re: [tesseract-ocr] Terrible accuracy even though "tessinput.tiff" looks fine.

2019-08-11 Thread Shree Devi Kumar
Most tesseract 4.0 models have been trained on line images with 48 pixels height. Resize your image to 48 pixels height, 300 dpi and try. My test results: ubuntu@tesseract-ocr:~/TEST$ tesseract VIN.png - --tessdata-dir ~/tessdata_fast -c tessedit_write_images=1 WD4PF 1ICDOKP075122 ubuntu@tesserac
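The resize suggested above is simple arithmetic: scale the image so that its height becomes 48 pixels while the aspect ratio is preserved. A minimal Python sketch follows; the function name is hypothetical, and the actual resampling would be done with whatever image tool you already use.

```python
def size_for_line_height(width, height, target_height=48):
    """Return (new_width, new_height) that scales a single-line image
    to the target height while keeping its aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target_height / height
    # Never let a very narrow image collapse to zero width.
    new_width = max(1, round(width * scale))
    return new_width, target_height


if __name__ == "__main__":
    # A 960x96 line image is scaled by 0.5.
    print(size_for_line_height(960, 96))
```

This applies to images holding a single text line; a full page would need to be scaled so that each line, not the whole page, ends up near that height.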

Re: [tesseract-ocr] lstmeval Can't encode transcription, Encoding of string failed!

2019-08-15 Thread Shree Devi Kumar
This means that official traineddata was not trained with some of the characters that are there in your training text. One way to verify this is to use the combine_tessdata command with -u to unpack the files in it and look at the lstm-unicharset. On Fri, 16 Aug 2019, 10:52 Jisong Xie, wrote
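Once the lstm-unicharset has been unpacked, the check described above can be automated. The Python sketch below assumes the usual unicharset layout as I understand it: a count on the first line, then one entry per line whose first space-separated field is the character, with NULL standing for the space entry. It compares single code points only, so multi-codepoint entries are not handled; the sample data is made up for illustration.

```python
def load_unichars(unicharset_text):
    """Collect the first field of every entry line in a unicharset file."""
    lines = unicharset_text.splitlines()
    chars = set()
    for line in lines[1:]:  # skip the leading count line
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]
        chars.add(" " if token == "NULL" else token)
    return chars


def missing_characters(training_text, unichars):
    """Return the non-space characters of the text absent from the set."""
    return sorted({ch for ch in training_text
                   if not ch.isspace() and ch not in unichars})


if __name__ == "__main__":
    # Illustrative unicharset fragment; the property fields are simplified.
    sample = "4\nNULL 0 Common 0\nC 5 Latin 1\nr 3 Latin 2\n6 8 Common 3\n"
    print(missing_characters("Cr⁶⁺", load_unichars(sample)))
```

Any characters reported as missing would have to be added through training with an extended unicharset rather than plain fine-tuning.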

Re: [tesseract-ocr] Re: Best Trained data for Non MRZ data

2019-08-22 Thread Shree Devi Kumar
Share a sample image. If the rest of the ID is in similar type of font, try finetuning with it for all characters. On Thu, Aug 22, 2019 at 12:16 PM Tintu Jacob wrote: > On Wednesday, August 21, 2019 at 6:24:55 AM UTC+5:30, ElGato ElMago wrote: > > It isn't OCRB then. Pick your local language f

Re: [tesseract-ocr] How to use my own traineddata language in OCR process?

2019-08-23 Thread Shree Devi Kumar
You can name your custom traineddata file with a different name eg. mycustom.traineddata, copy the file to your tessdata folder (referred by tessdata_prefix) and then use 'mycustom' instead of 'eng' in your program. On Sat, 24 Aug 2019, 09:13 Clint William Theron, < theronclintwill...@gmail.com> w

Re: [tesseract-ocr] How to use my own traineddata language in OCR process?

2019-08-26 Thread Shree Devi Kumar
d and found out about the >>>> custom traineddata idea from the following link: >>>> >>>> >>>> https://ourcodeworld.com/articles/read/580/how-to-convert-images-to-text-with-pure-javascript-using-tesseract-js >>>> >>>> It's

Re: [tesseract-ocr] my scan of alphanumeric data needs TLC

2019-08-27 Thread Shree Devi Kumar
If all your images are in this bold thick font, fine tuning for impact may help with some of the recognition errors. On Tue, 27 Aug 2019, 14:42 Stephane Charette, wrote: > I have a large number of images that contain a single line of alphanumeric > data. My scans so far have not been great, and

Re: [tesseract-ocr] my scan of alphanumeric data needs TLC

2019-08-27 Thread Shree Devi Kumar
, 2019 at 2:55 PM Shree Devi Kumar wrote: > If all your images are in this bold thick font, fine tuning for impact may > help with some of the recognition errors. > > On Tue, 27 Aug 2019, 14:42 Stephane Charette, > wrote: > >> I have a large number of images that

Re: [tesseract-ocr] net_spec with 2D LSTM?

2019-08-27 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs On Tue, 27 Aug 2019, 22:29 Timothy Snyder, wrote: > Hello all, > > Does anyone have an example of a net_spec argument that utilizes a 2D LSTM? > > Thanks, > -Tim > > -- > You received this message because you are subscribed to the Google

Re: [tesseract-ocr] best way to train german gothic font model?

2019-08-29 Thread Shree Devi Kumar
Use https://github.com/OCR-D/ocrd-train since you have line images and transcription. On Thu, Aug 29, 2019 at 1:13 PM Phillip Ströbel wrote: > dear tesseract community > > atm, i'm trying to compare the performance of different ocr engines, one > of which is tesseract. > i have a ground truth al

Re: [tesseract-ocr] Extracting text from specific region

2019-08-29 Thread Shree Devi Kumar
see https://github.com/tesseract-ocr/tesseract/wiki/APIExample#basic-example or use a uzn file On Thu, Aug 29, 2019 at 10:54 PM iFieldSmart Technologies < ifieldsm...@gmail.com> wrote: > I am getting the coordinates of a bounding box from an image file using a > different piece of code. Now I wa

Re: [tesseract-ocr] -l eng+urd not working

2019-08-29 Thread Shree Devi Kumar
Try urd+eng to give precedence to Urdu. Also see open issue https://github.com/tesseract-ocr/tesseract/issues/2626 On Fri, Aug 30, 2019 at 11:26 AM Shubham Gupta wrote: > Hi All > > I have one query i.e. if my Image contains both Urdu and English text, I > used -l parameter as eng+urd, but my o

Re: [tesseract-ocr] -l eng+urd not working

2019-08-30 Thread Shree Devi Kumar
good approach? > > Thanks > Shubham > > On Fri, Aug 30, 2019 at 12:25 PM Shree Devi Kumar > wrote: > >> Try urd+eng to give precedence to Urdu. >> >> Also see open issue >> https://github.com/tesseract-ocr/tesseract/issues/2626 >> >> On Fri,

Re: [tesseract-ocr] '33' recognized correctly, '3' not recognized at all...

2019-08-31 Thread Shree Devi Kumar
ubuntu@tesseract-ocr:~/TEST$ tesseract twonumbers.png - --psm 6 --tessdata-dir ~/tessdata --oem 1 2 127 a 15 7 56 7 58 9 58 19 65 24 91 3375 ubuntu@tesseract-ocr:~/TEST$ tesseract twonumbers.png - --psm 6 --tessdata-dir ~/tessdata_best --oem 1 2 127 a 15 7 56 7 58 9 58 19 65 24 91 3375 ub

Re: [tesseract-ocr] Re: '33' recognized correctly, '3' not recognized at all...

2019-08-31 Thread Shree Devi Kumar
I am using the latest code from master branch. I would expect same result with same image and same traineddata files. On Sun, 1 Sep 2019, 08:04 Jack, wrote: > Thank you for replying, that was very helpful. > I've now tried tessdata_best and tessdata_fast trained data found on the > tesseract gi

Re: [tesseract-ocr] Re: '33' recognized correctly, '3' not recognized at all...

2019-08-31 Thread Shree Devi Kumar
Well, I just took a screenshot of your images from the link since I could not figure out how to get individual images. The only doctoring was to save it at 300 dpi in irfanview. On Sun, 1 Sep 2019, 08:27 Jack, wrote: > Ah, now I see it has something to do with the way you doctored the images, >

Re: [tesseract-ocr] Re: Error: Deserialize header failed while fine-tuning Tesseract

2019-09-03 Thread Shree Devi Kumar
Test with 5-10 files to figure out correct process. Probably files are not in the correct location or format. On Tue, 3 Sep 2019, 17:10 Pranav Budhwant, wrote: > I tried the same with Tesseract 4.1, and I generated all the files on > Ubuntu instead of creating them on Windows and then converting

Re: [tesseract-ocr] Fine tuning existing model

2019-09-05 Thread Shree Devi Kumar
000 > > data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint > lstmtraining \ > --stop_training \ > --continue_from $^ \ > --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \ > --traineddata data/$(MODEL_NAME)/$(MO

Re: [tesseract-ocr] Failed loading language

2019-09-09 Thread Shree Devi Kumar
Combine-lang-model only creates the starter traineddata. It is used as part of the LSTM training process. It cannot be used for recognition. Training from scratch requires running the lstmtraining command. On Mon, Sep 9, 2019, 21:36 Nuno Feliciano wrote: > > > > > Hi, > > I am trying to make a model

Re: [tesseract-ocr] Failed loading language

2019-09-10 Thread Shree Devi Kumar
> *Is there a way to check if a traineddata file is valid*? > > Thanks, > Nuno > > segunda-feira, 9 de Setembro de 2019 às 17:09:39 UTC+1, shree escreveu: >> >> Combine-lang-model only creates the starter traineddata. It is used as >> part of lstm training

Re: [tesseract-ocr] Tesseract OCR 4 paper

2019-09-11 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#documentation On Wed, Sep 11, 2019 at 6:29 PM Jennil Thiyam wrote: > Does anyone has the link that describes the working of Tessercat 4, I > found paper that talks about the processing steps of tesseract 3, but > failed to get any

Re: [tesseract-ocr] Tesseract OCR 4 paper

2019-09-11 Thread Shree Devi Kumar
gration in Tesseract 4.0. On Wed, Sep 11, 2019 at 6:36 PM Jennil Thiyam wrote: > Shree do you have any other links that talk about how LSTM works in > tesseract OCR > > On Wed, Sep 11, 2019 at 6:33 PM Shree Devi Kumar > wrote: > >> >> https://github.com/tesseract-o

Re: [tesseract-ocr] Re: Tesseract 4.1 not works on windows 7

2019-09-12 Thread Shree Devi Kumar
see https://github.com/UB-Mannheim/tesseract/wiki On Fri, Sep 13, 2019 at 9:04 AM Nash Kwmz wrote: > I thought 4.1 only works on Linux for now? > > On Monday, September 9, 2019 at 7:18:11 PM UTC+8, Aravindhan G wrote: >> >> Earlier builds of tesseract were working on windows 7, but newer build >

Re: [tesseract-ocr] Problem with Wordstr *.box files

2019-09-13 Thread Shree Devi Kumar
Yes, I also noticed this problem recently. My workaround is to create the unicharset from the training text/ground truth files rather than from box files. Look at the help for unicharset_extractor On Fri, Sep 13, 2019, 22:08 J Adam Funk wrote: > Hi, > > I'm using tesseract 4.0.0 (Ubuntu packag

Re: [tesseract-ocr] Problem with Wordstr *.box files

2019-09-13 Thread Shree Devi Kumar
Alternately you can use https://github.com/tesseract-ocr/tesstrain/blob/master/generate_line_box.py On Fri, Sep 13, 2019 at 10:08 PM J Adam Funk wrote: > Hi, > > I'm using tesseract 4.0.0 (Ubuntu package version 4.0.0-2) and trying to > set up training data. I have a Python tool that puts random

Re: [tesseract-ocr] OCR of Devanagari + Diacritics + English

2019-09-15 Thread Shree Devi Kumar
Try http://ocr.sanskritdictionary.com/ for OCR of Devanagari + Diacritics + English. Its Google option gives better results than tesseract. On Sun, Sep 15, 2019, 19:43 Alexander Gribanov wrote: > Hello! > > Finally got real project for OCR. > Could anybody please give some advice in the process s

Re: [tesseract-ocr] OCR of Devanagari + Diacritics + English

2019-09-15 Thread Shree Devi Kumar
Don't know the details. On Sun, Sep 15, 2019, 21:36 Ravi Annaswamy wrote: > That is a beautiful app. > > Shree Devi Kumar, what service does the 'google' selection hit? Is it free? > > Ravi > > > On Sun, Sep 15, 2019 at 11:34 AM S

Re: [tesseract-ocr] OCR of Devanagari + Diacritics + English

2019-09-15 Thread Shree Devi Kumar
for Tamil and Sanskrit using > your scripts and guides but haven’t got a good starting point yet > > > Sent from my iPhone > > On Sep 15, 2019, at 12:55 PM, Shree Devi Kumar > wrote: > > Don't know the details. > > On Sun, Sep 15, 2019, 21:36 Ravi Annaswamy >

Re: [tesseract-ocr] Next problem with training (tesseract 4.0)

2019-09-17 Thread Shree Devi Kumar
Config files are there for some languages. They will be in the langdata or langdata_lstm repos. radical-stroke.txt is also there. You can also look at the training instructions in the wiki or in shreeshrii/tess4training. On Tue, Sep 17, 2019, 20:24 Adam Funk wrote: > Hi again, > > Using the instructions at > <

Re: [tesseract-ocr] Another spurious error message while attempting to train Tesseract.

2019-09-17 Thread Shree Devi Kumar
Page 3302 Loaded 171652/171652 lines (1-171652) If you are trying the tutorial, I suggest that you run the whole process with a small training text file. The one in langdata repo for English is less than 100 lines. Once you get the process working correctly (you need to have all required files in

Re: [tesseract-ocr] Another spurious error message while attempting to train Tesseract.

2019-09-17 Thread Shree Devi Kumar
https://github.com/Shreeshrii/tess4training On Wed, Sep 18, 2019, 01:21 David Maung wrote: > This time I ran the following command to try and prepare 1 font for > training > > src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng > --linedata_only --noextract_font_properties --lang

Re: [tesseract-ocr] Regarding box file creation using tesseract 4.0.0 and tesseract 5.0.0 (alpha version)

2019-09-18 Thread Shree Devi Kumar
Your installation of tesseract must be old. You need the config file lstmbox. On Wed, Sep 18, 2019, 15:11 isuri anuradha wrote: > Hi, > > when I try to create box files by following the > https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#making-box-files > why error prompt

Re: [tesseract-ocr] Tutorial for fine-tuning Tesseract 4 for a new font?

2019-09-18 Thread Shree Devi Kumar
Please search the forum archive. There was a recent mention regarding dot-matrix font training. On Wed, Sep 18, 2019, 15:06 Jochen Naumann wrote: > Does somebody know a good tutorial of how to fine-tune tesseract for > 1. a new ttf font > 2. images of characters > > I am especially interested in tra

Re: [tesseract-ocr] Small script to generate all boxes for ocrd-train

2019-09-18 Thread Shree Devi Kumar
Please submit as a PR to https://github.com/tesseract-ocr/tesstrain On Wed, Sep 18, 2019 at 4:08 PM Lorenzo Bolzani wrote: > > Hi, > I wrote this small script to speed up OCRD-train > training startup. > > It generates the boxes for all the images provided

Re: [tesseract-ocr] Next problem with training (tesseract 4.0)

2019-09-20 Thread Shree Devi Kumar
> > Thanks, > Adam > > > On Tuesday, 17 September 2019 16:38:19 UTC+1, shree wrote: >> >> config files are there some languages. They will be in langdata or >> langdata_lstm repos. radical_stroke.txt is also there. >> >> You can also look at training i

Re: [tesseract-ocr] Next problem with training (tesseract 4.0)

2019-09-20 Thread Shree Devi Kumar
to SquishedDawg Reducing Trie to SquishedDawg Reducing Trie to SquishedDawg On Fri, Sep 20, 2019 at 6:42 PM J Adam Funk wrote: > OK, so that "Failed..." is just a warning. > Thanks! > > > On Tuesday, 17 September 2019 16:38:19 UTC+1, shree wrote: >> >>

Re: [tesseract-ocr] text2image: No such file or directory

2019-09-21 Thread Shree Devi Kumar
During the build process, did you run `make training` and `sudo make training-install`? What platform are you using? On Sat, Sep 21, 2019 at 5:42 PM Ajinkya Khalwadekar < ajinkya.khalwade...@gmail.com> wrote: > Hi Zdenko, > > To your first question : No, /usr/local/bin/text2image does not exist. > To your 2

Re: [tesseract-ocr] text2image: No such file or directory

2019-09-21 Thread Shree Devi Kumar
stall existing version of tesseract, make clean and then rerun the build process. On Sun, Sep 22, 2019, 00:55 Ajinkya Khalwadekar < ajinkya.khalwade...@gmail.com> wrote: > Yes shree, i did both these steps, infact i am following > https://github.com/tesseract-ocr/tesseract/issues/1453 . >

Re: [tesseract-ocr] Training Sinhala fonts using Tesseract 4.0 version

2019-09-23 Thread Shree Devi Kumar
You need to use a Unicode font. Seems like FMAbhaya is not. http://www.sinhalafonts.org/fonts/13142/fm_abhaya.html https://github.com/tesseract-ocr/langdata_lstm/blob/master/sin/okfonts.txt lists the fonts used for Tesseract4 alpha On Mon, Sep 23, 2019 at 3:07 PM isuri anuradha wrote: > As th

Re: [tesseract-ocr] Corrupt eng.traineddata output file?

2019-09-25 Thread Shree Devi Kumar
Did you convert the checkpoint to traineddata?

lstmtraining \
  --stop_training \
  --continue_from $(LAST_CHECKPOINT) \
  --traineddata $(TESSDATA_BEST)/$(START_MODEL).traineddata \
  --model_output $@

On Wed, Sep 25, 2019 at 3:05 PM Adam Funk wrote: > Hi again, > > I've succeeded in generating *.lst
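
For readers working outside a Makefile, the checkpoint conversion quoted in the message above can be sketched as a plain shell helper. This is only an illustration: the function name and all three paths are hypothetical placeholders, and only the lstmtraining flags are taken from the message. The helper prints the command rather than running it, so it can be checked first.

```shell
#!/bin/sh
# Sketch only: these paths are hypothetical placeholders, not taken from the thread.
LAST_CHECKPOINT="data/checkpoints/foo_checkpoint"
START_TRAINEDDATA="tessdata_best/eng.traineddata"
MODEL_OUTPUT="data/foo.traineddata"

# Print the lstmtraining invocation that converts a checkpoint into a
# traineddata file. The flags are the ones quoted in the message above.
# Arguments: checkpoint path, starting traineddata path, output path.
checkpoint_to_traineddata_cmd() {
    printf 'lstmtraining --stop_training --continue_from %s --traineddata %s --model_output %s\n' \
        "$1" "$2" "$3"
}

checkpoint_to_traineddata_cmd "$LAST_CHECKPOINT" "$START_TRAINEDDATA" "$MODEL_OUTPUT"
```

Once the printed command looks right, it can be pasted into a terminal. The sketch does no quoting, so it assumes paths without spaces.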

Re: [tesseract-ocr] Tesseract 4 not reading Arabic numbers accurately using custom trained data file

2019-09-27 Thread Shree Devi Kumar
You are missing https://github.com/tesseract-ocr/langdata_lstm/blob/master/radical-stroke.txt On Fri, Sep 27, 2019 at 12:59 PM Béchir Gmati wrote: > Hi, I get this error when I execute the command line of > combine_lang_model. How can I fix it? > [image: Capture.JPG] > [image: Capture1.JPG] >

Re: [tesseract-ocr] Tesseract ./ configure issue

2019-09-27 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki#tesseract-4-packages-with-lstm-engine-and-related-traineddata The beta version is old. On Fri, Sep 27, 2019 at 6:09 PM Guru Mani wrote: > Hi, > > I tried to install tesseract-4.0.0-beta.1. I am facing this issue. I am > using Red Hat 7. > > Confi

Re: [tesseract-ocr] Re: Training - Finetuning Characters

2019-10-01 Thread Shree Devi Kumar
See https://github.com/Shreeshrii/tess4training On Tue, Oct 1, 2019 at 7:53 PM Dustin Theobald wrote: > Changed my evaluation to: > > ~/../../usr/local/bin/lstmeval \ > --model ~/Desktop/tesstutorial/trainplusminus/*plusminus_checkpoint* \ > --traineddata ~/Desktop/tesstutorial/trainplusminu

Re: [tesseract-ocr] Re: Training - Finetuning Characters

2019-10-01 Thread Shree Devi Kumar
specifically https://github.com/Shreeshrii/tess4training/blob/master/6-plusminus.log#L429 On Tue, Oct 1, 2019 at 9:09 PM Shree Devi Kumar wrote: > See https://github.com/Shreeshrii/tess4training > > On Tue, Oct 1, 2019 at 7:53 PM Dustin Theobald > wrote: > >> Cha

Re: [tesseract-ocr] Need Help Learning Howto Train Tesseract OCR on Fraktur Fonts - MAC - VietOCR v5.5.2 and Tesseract 4.1.0

2019-10-02 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR Training Fraktur from GT4HistOCR On Wed, Oct 2, 2019 at 10:56 AM Akos Simon wrote: > Fraktur Fonts OCR recognition with Tesseract OCR is what I am looking > for, I installed VietOCR v5.5.2 and Tesseract 4.1.0 on my mac, and now >

Re: [tesseract-ocr] Re: Training - Finetuning Characters

2019-10-02 Thread Shree Devi Kumar
-GitInstallation#build-with-training-tools You seem to be missing some steps there. On Wed, Oct 2, 2019 at 2:32 PM Dustin Theobald wrote: > Hey Shree, > > Thank you for your help! > > This doesn't work on my MAC. I can't find some of the fonts, so I only try > to c

Re: [tesseract-ocr] Re: Training - Finetuning Characters

2019-10-02 Thread Shree Devi Kumar
is around 80 lines plus the extra lines added with plusminus. On Wed, Oct 2, 2019 at 2:53 PM Shree Devi Kumar wrote: > 1. You could install on linux using the appropriate package from > https://github.com/tesseract-ocr/tesseract/wiki#tesseract-4-packages-with-lstm-engine-and-related-train

Re: [tesseract-ocr] Re: Training - Finetuning Characters

2019-10-02 Thread Shree Devi Kumar
19 at 7:38 PM Dustin Theobald wrote: > Hey shree, > > do you know how to manually install the missing fonts for MAC, like in > your docu for linux: > > sudo apt update > sudo apt install ttf-mscorefonts-installer > sudo apt install fonts-dejavu > fc-cache -vf > >

Re: [tesseract-ocr] Re: Training - Finetuning Characters

2019-10-03 Thread Shree Devi Kumar
. October 2019 16:46:25 UTC+2, shree wrote: >> >> Sorry, don't know how to add those fonts for Mac. >> >> The tutorial uses the following set of fonts: >> >> https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh#L42 >

Re: [tesseract-ocr] Re: Training Sinhala fonts using Tesseract 4.0 version

2019-10-03 Thread Shree Devi Kumar
There is no direct method for training from non-Unicode fonts. Tesseract's output is also Unicode text only. You can work from scanned images of text in non-Unicode fonts and provide the Unicode transcription of it. You could probably use a legacy-to-Unicode converter for the text. See https://gi

Re: [tesseract-ocr] Re: Training - Finetuning Characters

2019-10-03 Thread Shree Devi Kumar
achen source.” CULTURED CUTTING Home 06-13-2008, § Ø44.01189673355 € > netting Bookmark of WE MORE) STRENGTH IDENTICAL Ø2? activity PROPERTY > MAINTAINED > EOM > > The evaluation on the training data works, but he doesn't recognize any > Line in the evalplusminus/eng.tr

Re: [tesseract-ocr] Preprocessing Tools

2019-10-03 Thread Shree Devi Kumar
Tesseract uses Leptonica for image processing. You can use any image software that you are comfortable with for pre-processing. On Thu, Oct 3, 2019 at 2:06 PM Jennil Thiyam wrote: > Hi Shree, are there any tools associated with Tesseract that we can use for > preprocessing the images?

Re: [tesseract-ocr] Re: Training - Finetuning Characters

2019-10-04 Thread Shree Devi Kumar
19 at 1:48 PM Dustin Theobald wrote: > Ok, when I run make_training_data, it says "Other case ø of Ø is not in > unicharset", might this be a problem? Even though Ø is in the unicharset? > > Cheers, > Dustin > > On Thursday, 3 October 2019 at 16:52:46 UTC+2,

Re: [tesseract-ocr] Corrupt eng.traineddata output file?

2019-10-04 Thread Shree Devi Kumar
> > On Wednesday, 25 September 2019 15:10:53 UTC+1, shree wrote: >> >> Did you convert the checkpoint to traineddata? >> >> lstmtraining \ >> --stop_training \ >> --continue_from $(LAST_CHECKPOINT) \ >> --traineddata $(TESSDATA_BEST)/$(START_MODEL).t

Re: [tesseract-ocr] Re: Training Sinhala fonts using Tesseract 4.0 version

2019-10-05 Thread Shree Devi Kumar
If you use Linux, you can try something similar to the attached bash script. On Thu, Oct 3, 2019 at 2:55 PM Shree Devi Kumar wrote: > There is no direct method for training from non-Unicode fonts. Tesseract's > output is also Unicode text only. > > You can work from scanned images of te
