Re: [tesseract-ocr] Tesseract config for simple single words text and questions about learning

2018-04-30 Thread Lorenzo Bolzani
Hello ShreeDevi, thanks for your answer. I tried to use the 4.0 version but I get different kinds of errors. And, as far as I know, the whitelist is not yet supported in the 4.0 version, so I decided to go with 3.05 because I think this fe

Re: [tesseract-ocr] Pytesseract used with captcha images unable to recognize characters with lines on top

2018-05-07 Thread Lorenzo Bolzani
Try to get rid of all the noise/lines; you can use denoising before binarization, or component analysis. Then remove the white border so all the fragments have the same size. Try to do this with gimp and see if it helps before coding it. Then try psm=8, which means "single word" (this should fix the pr

Re: [tesseract-ocr] Re: Break down pedigree

2018-05-28 Thread Lorenzo Bolzani
Use opencv SIFT (or others) to align the picture with your template. http://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_feature_homography/py_feature_homography.html#feature-homography https://docs.opencv.org/3.3.0/dc/dc3/tutorial_py_matcher.html That will make

Re: [tesseract-ocr] recognition accuracy is very sensitive to holes on character

2018-06-22 Thread Lorenzo Bolzani
I'd try to upscale the images so that one letter is about 40/50 pixels tall and see if that helps. I'd also try a morphological open/erode operation (or a blur/resharpen) to simply fill the holes. I do not know if there are any special parameters for this kind of problem (that I've encountered to
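The hole-filling idea can be tried on a toy example before touching real images. The sketch below is a plain-Python grayscale erosion (a minimum filter), assuming dark ink (0) on white paper (255). The function name `erode_min` is hypothetical; on real images the same operation is normally done with an image library such as OpenCV's `cv2.erode`.

```python
def erode_min(image, radius=1):
    """Grayscale erosion: each pixel becomes the minimum of its neighbourhood.

    With dark text on a light background this thickens the strokes,
    which closes small light holes inside a character.
    """
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(min(neighbours))
        out.append(row)
    return out


# 0 = dark ink, 255 = white paper; the centre pixel is a one-pixel hole.
glyph = [
    [0, 0, 0],
    [0, 255, 0],
    [0, 0, 0],
]
filled = erode_min(glyph)
```

Erosion also makes every stroke thicker, so it is worth checking on a few samples that neighbouring letters do not merge.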

Re: [tesseract-ocr] recognition accuracy is very sensitive to holes on character

2018-06-22 Thread Lorenzo Bolzani
2018-06-22 11:41 GMT+02:00 blues : > thanks for your reply, Lorenzo > I will test more samples to see if it only happens with holes. > if so, probably just do a morph hole filling before ocr as workaround for > now. > > btw, I'm using version 3.x. Is there a chance 4.x handles this issue > better?

Re: [tesseract-ocr] Re: Word coordinate for single lines.

2018-06-22 Thread Lorenzo Bolzani
With this configuration: tesseract 3.05.01 leptonica-1.75.3 libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 : zlib 1.2.8 Running: tesseract --psm 7 -l eng 24-block-0-L-42.png out gives me: 3765 Sexualhormonbind. Globulin 1, 15 30 , 16 Upscaling the image to height 50px gives me: 3765 S

[tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Lorenzo Bolzani
Hi, I'm trying to do fine tuning of an existing model using line images and text labels. I'm running this version: tesseract 4.0.0-beta.3-56-g5fda leptonica-1.76.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Lorenzo Bolzani
> --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \ > --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \ > --model_output data/checkpoints/$(MODEL_NAME) \ > --debug_interval -1 \ > --train_listfile data/list.train \ > --eval_listfile data/list

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Lorenzo Bolzani
eg. tessdata_best/deu.traineddata for > German. > > On Fri, Jun 29, 2018 at 9:03 PM Lorenzo Bolzani > wrote: > >> Hi Shree, thanks for your answer. >> >> I tried the script setting: >> >> TESSDATA=extracted # here I have the eng.lstm an

Re: [tesseract-ocr] Fine tuning existing model

2018-07-02 Thread Lorenzo Bolzani
Hi Shree, I replaced the line: merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@" with: cp "$(TRAIN)/my.unicharset" "data/unicharset" (I write this in case someone else is following this thread). And now I have a fine tuned brand new model with only t

Re: [tesseract-ocr] Re: OCR-D training process - High error rate [Tess 4]

2018-07-04 Thread Lorenzo Bolzani
I had no problems training with the ocr-d boxes. Looking at the tiffs the first thing I'd try to do is adding some white border on left and right. For my training I used non-binarized (grayscale) data and I think it could be better (more information is available). Are you training from scratch of

Re: [tesseract-ocr] Re: OCR-D training process - High error rate [Tess 4]

2018-07-04 Thread Lorenzo Bolzani
I suspect 1800 lines may not be enough data for training from scratch and you are simply overfitting. I think 5% refers to the evaluation set, with a default split 80/20 I think. Try this to check the accuracy on the training set and the eval set: lstmeval --model your-model.traineddata --eval_li

[tesseract-ocr] Letters split in multiple parts

2018-07-05 Thread Lorenzo Bolzani
Hi, I have a small problem with some letters that are recognized as multiple letters. This is a sample (I can reproduce the problem with this image and eng "_best"): output is: 17AE4L4 The 4 is seen as three different letters. Maybe the shape of the 4 is not so common and this is creating the

Re: [tesseract-ocr] Really poor performance with decimal numbers

2018-07-06 Thread Lorenzo Bolzani
Hi, upscale and enhance contrast, but upscaling is what really matters: each letter is 20px, a dot is about three pixels, so it's probably "seen" as noise. Bye Lorenzo 2018-07-06 5:51 GMT+02:00 Alberto Andreotti : > Hello, > > I'm having problems with the simplest image possible. > It's a screenshot

Re: [tesseract-ocr] Re: OCR-D training process - High error rate [Tess 4]

2018-07-07 Thread Lorenzo Bolzani
I never had this. It's strange that you are getting this now and not during the training. I would check the location I'm running the command from, I mean, that data/train/...lstmf is there, in the correct relative place. Second I would check the lstmf file size. Then I would inspect the tiff and

Re: [tesseract-ocr] Re: OCR-D training process - High error rate [Tess 4]

2018-07-08 Thread Lorenzo Bolzani
eal data (or do a little "border augmentation", like 1px or 2px). Bye Lorenzo 2018-07-07 18:41 GMT+02:00 Lorenzo Bolzani : > > I never had this. It's strange that you are getting this now and not > during the training. > > I would check the location I'm r

[tesseract-ocr] Re: Letters split in multiple parts

2018-07-12 Thread Lorenzo Bolzani
renzo 2018-07-05 18:59 GMT+02:00 Lorenzo Bolzani : > > Hi, > I have a small problem with some letters that are recognized as multiple > letters. > > This is a sample (I can reproduce the problem with this image and eng > "_best"): > > > > output is: 17AE

Re: [tesseract-ocr] Retrain Tesseract 4.0.0 beta to recognise handwritten digits

2018-07-17 Thread Lorenzo Bolzani
Have a look at this thread: https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ It's easier than it seems, you do not need per character boxes with 4.0, just one per line (that ocr-d automatically generates). If your text is already split into lines you do not have to do anything m

Re: [tesseract-ocr] Retrain Tesseract 4.0.0 beta to recognise handwritten digits

2018-07-17 Thread Lorenzo Bolzani
Generating the training data is a completely different problem from training tesseract. If you want to recognize full words it's better to have full words (or numbers), not individual characters so that the process of splitting the words into characters is done by tesseract. Unless you just wa

Re: [tesseract-ocr] Retrain Tesseract 4.0.0 beta to recognise handwritten digits

2018-07-18 Thread Lorenzo Bolzani
hat come close to your dataset from fonts.google.com. >> Use tesstrain.sh for rendering the images, and lstmtraining to train >> tesseract - you'll achieve a fair accuracy. >> >> On Tue, Jul 17, 2018 at 11:38 PM Lorenzo Bolzani >> wrote: >> >>>

Re: [tesseract-ocr] Retrain Tesseract 4.0.0 beta to recognise handwritten digits

2018-07-18 Thread Lorenzo Bolzani
of a bank with very complex layout. I have to capture >>> details of account no and pan no. >>> >>> >>> <https://lh3.googleusercontent.com/-KlAWj6TbcPI/W07exdGvG6I/JGQ/4_32r8dwWVgwCfhM2XT358jkABGAArBoACLcBGAs/s1600/dummy_crop.jpg> >>

Re: [tesseract-ocr] How to train by tesseract 4.00

2018-07-20 Thread Lorenzo Bolzani
You have some problems with your path configuration, check the error message: Failed to read /home/tulip/Documents/Em/OCR/OCRtraining/ocrd-train/usr/share/tessdata The path does not make sense. And also the command line: combine_tessdata -u /home/tulip/Documents/Em/OCR/OCRtraining/ocrd-train/us

Re: [tesseract-ocr] Re: unrecognized argument "unrecognised argument linedata_only"

2018-07-23 Thread Lorenzo Bolzani
Please read the complete error message: it's telling you exactly where the problem is. I think you are using "fancy double quotes" or something like that rather than the normal ones. Are you doing cut and paste from some word processor? This is probably causing all the errors... 2018-07-23 9:4

Re: [tesseract-ocr] How to train by tesseract 4.00

2018-07-23 Thread Lorenzo Bolzani
The TESSDATA_PREFIX maybe? 2018-07-23 17:37 GMT+02:00 Emiliano Isaza Villamizar : > But still i don't know why this happens I haven't modified anything in the > Makefile!! What would I need to change? > > > > > On Friday, July 20, 2018 at 5:30:00 AM UTC-5, Lorenzo Blz wrote: >> >> >> You have som

Re: [tesseract-ocr] Assert failed:in file weightmatrix.cpp, line 244

2018-07-24 Thread Lorenzo Bolzani
I had this error when I was mixing best models with non best models. I would try to run again combine_tessdata -e base_model/eng.traineddata base_model/eng.lstm to generate the eng.lstm from the "_best" model (the ones from /usr/share/tessdata are not the "_best" models). Then if the error is s

Re: [tesseract-ocr] Not getting results with numbers and currency simbols in tables

2018-07-26 Thread Lorenzo Bolzani
First, read this: "Fine Tuning for ± a few characters" Then check the data/unicharset file to see if everything is ok, if there are all the characters you want. Then, 15000 iterations are

Re: [tesseract-ocr] Not getting results with numbers and currency simbols in tables

2018-07-31 Thread Lorenzo Bolzani
I'm happy to hear that and thank you for letting me know. I was wondering if the instructions were just a mess or too long :) Bye Lorenzo 2018-07-30 17:19 GMT+02:00 Emiliano Isaza Villamizar : > Lorenzo, Thank you so much for your help. I did everything step by step > and got a very good resul

Re: [tesseract-ocr] Re: Tesseract v4 number recognition

2018-09-01 Thread Lorenzo Bolzani
You can do custom fine-tuning limiting the set of output characters. See: https://groups.google.com/forum/#!searchin/tesseract-ocr/l.bolzani|sort:date/tesseract-ocr/be4-rjvY2tQ/32evtMHlAQAJ On Sat, 1 Sep 2018 at 01:29 Ahmed Essam < ahmed.es.ism...@gmail.com> wrote: > Guys if t

Re: [tesseract-ocr] Re: Fine tuning existing model

2018-09-06 Thread Lorenzo Bolzani
Hi Raniem, I did 5 fine tunings for different fonts and text content with roughly these numbers:
iterations: samples (training data)
750: 208 numbers (4 upper case + 5 digits each)
1000: 400 MRZ codes (22 uppercase chars each)
1800: 1000 numbers (10 digits each)
2250

Re: [tesseract-ocr] Re: Fine tuning existing model

2018-09-10 Thread Lorenzo Bolzani
I think there is no need to change the network definition appending layers with a limited number of output chars. The line you replaced already takes care of this with: --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" I had this error when I was mi

Re: [tesseract-ocr] Re: Fine tuning existing model

2018-09-10 Thread Lorenzo Bolzani
On Mon, 10 Sep 2018 at 15:38 Raniem wrote: > I am actually doing that not to limit the number of output chars, I am > doing it cause I thought this way I am only tuning the final layer as I > wanted to keep the weights for other layers. > I was trying to experiment whether this i

Re: [tesseract-ocr] Training with a large number of LSTMF files

2018-09-11 Thread Lorenzo Bolzani
Hi, I trained with about 50k very short samples with no problems, going up to 50k iterations in several steps. My suggestion is to train for a few iterations (like 1000), check the accuracy on the validation set (not on the training set), then set the next target to 2000 (so it trains 1000 more),
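The stepwise schedule described above can be sketched as a small helper that builds one `lstmtraining` command per target. This is only an illustration: every path and the helper names are hypothetical, and the flags used are limited to `--continue_from`, `--traineddata`, `--model_output`, `--train_listfile` and `--max_iterations`.

```python
def iteration_targets(step, final):
    """Return the successive --max_iterations targets, e.g. 1000, 2000, 3000."""
    return list(range(step, final + 1, step))


def lstmtraining_command(target, continue_from, traineddata, model_output, train_list):
    """Build the argument list for one training step (all paths are placeholders)."""
    return [
        "lstmtraining",
        "--continue_from", continue_from,
        "--traineddata", traineddata,
        "--model_output", model_output,
        "--train_listfile", train_list,
        "--max_iterations", str(target),
    ]


targets = iteration_targets(1000, 3000)
commands = [
    lstmtraining_command(
        target,
        continue_from="data/checkpoints/my_model_checkpoint",  # hypothetical path
        traineddata="data/my_model/my_model.traineddata",      # hypothetical path
        model_output="data/checkpoints/my_model",              # hypothetical path
        train_list="data/list.train",
    )
    for target in targets
]
```

Between two steps the accuracy on the validation set is checked, and the next command is only run while that score keeps improving.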

Re: [tesseract-ocr] Re: Training with a large number of LSTMF files

2018-09-12 Thread Lorenzo Bolzani
On Wed, 12 Sep 2018 at 13:15 ProgressNotPerfection < jimquitten...@gmail.com> wrote: > Hi Lorenzo > Thanks for suggestion, I began stepping up the iterations and measuring > the results, but my box crashed (looks like it ran out of memory) at 6K > iterations, so I will need to pre

Re: [tesseract-ocr] Re: Training with a large number of LSTMF files

2018-09-12 Thread Lorenzo Bolzani
On Wed, 12 Sep 2018 at 19:44 ProgressNotPerfection < jimquitten...@gmail.com> wrote: > Hi Lorenzo > To clarify, my training text is 73 lines of words (with some > numbers/punctuation etc.), each about 70 chars long including spaces. From > this text I generated a tif/box set for e

Re: [tesseract-ocr] Not getting results with numbers and currency simbols in tables

2018-10-15 Thread Lorenzo Bolzani
Just a small note (in case someone lands on this thread): I recently found out that PSM 7 and others work better than 13. See: https://github.com/tesseract-ocr/tesseract/issues/1778#issuecomment-429527692 On Tue, 31 Jul 2018 at 11:30 Lorenzo Bolzani < l.bolz...@gmail.com

Re: [tesseract-ocr] Why do I get such poor results from Tesseract for simple single character recognizing?

2018-10-15 Thread Lorenzo Bolzani
Try to use psm 7 or 13 (SINGLE_LINE and RAW_LINE). In my case 7 works best. I'm not 100% sure but it may be easier to recognize full words rather than single characters. But I do not know if this is just a test or if this is what you need to do. The default oem mode (lstm) should be the best, but

Re: [tesseract-ocr] Re: Retrain tesseract 4 model from real image (not from text file and tesstrain.sh)

2018-10-20 Thread Lorenzo Bolzani
First check the version of tesseract, just to be sure, maybe you have more than one around: lstmtraining -v If the training file is missing the error is: Failed to load list of training filenames from eng.training_files.txt If the --train_listfile option is missing the error is: Must supply a

Re: [tesseract-ocr] Tesseract misreading numbers

2018-10-20 Thread Lorenzo Bolzani
Is the image you attached the one saved from the API with the tessedit_write_images option? Or is it the one you give as input to your program? If it is not saved from the API please try to save the image as PNG immediately before the API call and compare it to the input one. Try to use a single li

Re: [tesseract-ocr] Re: Retrain tesseract 4 model from real image (not from text file and tesstrain.sh)

2018-10-27 Thread Lorenzo Bolzani
Check the unicharset file to see if all the characters you want to recognize are there. combine_tessdata -u trained_model.traineddata output_dir cat output_dir/*unicharset Otherwise you need to merge the old one with the new one before training. This is how ocrd-train
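Once the unicharset has been extracted, the check can be automated. The sketch below assumes the usual layout of a unicharset file: the first line holds the number of entries, and each following line starts with the symbol followed by a space and its properties. The sample text is abbreviated for illustration (real entries carry more property columns), and both function names are hypothetical.

```python
def unicharset_symbols(text):
    """Return (declared_count, symbols) from the contents of a unicharset file."""
    lines = text.splitlines()
    declared = int(lines[0])
    # The symbol is the first space-separated token of each entry line.
    symbols = [line.split(" ", 1)[0] for line in lines[1:] if line]
    return declared, symbols


def missing_characters(required, text):
    """List the required characters that have no entry in the unicharset."""
    _, symbols = unicharset_symbols(text)
    known = set(symbols)
    return [ch for ch in required if ch not in known]


# Abbreviated, made-up sample: real lines have more property columns.
SAMPLE = "4\nNULL 0 Common 0\nA 5 Latin\n1 8 Common\n. 10 Common\n"

missing = missing_characters("A1€", SAMPLE)
```

Any character reported as missing has to be added to the unicharset (by merging or replacing it) before training, otherwise the model cannot output it.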

Re: [tesseract-ocr] Beginner - problem with size

2018-10-28 Thread Lorenzo Bolzani
Try the -resize options from imagemagick: http://www.imagemagick.org/Usage/resize/ On Sun, 28 Oct 2018 at 07:50 cintrikz cintrikz < cintr...@gmail.com> wrote: > first let me start off by saying thank you! thank you all for the hard > work that has been put into tesseract its rea

Re: [tesseract-ocr] Line level training

2018-11-12 Thread Lorenzo Bolzani
Tesseract 4.x uses lines, not chars. Bye Lorenzo On Mon, 12 Nov 2018 at 05:42 wrote: > Dear All, > > Currently, tesseract training is based on the pair (tiff and box). > It's not easy to make box file (char level) if we try to train some scanned > document images not ge

Re: [tesseract-ocr] Line level training

2018-11-12 Thread Lorenzo Bolzani
On Mon, 12 Nov 2018 at 11:53 wrote: > That means we can label some existing images with text line boxes instead > of individual char boxes in current tesseract 4.0? I checked the box files > generated by the training process and found that char boxes were still > there. > Yes it

Re: [tesseract-ocr] Re: Tesseract 4 sometimes confuses a 4 with a 9

2018-12-08 Thread Lorenzo Bolzani
If the text is very small, like less than 20/30px, you can try to upscale it and see if it helps. Otherwise fine tuning is the only alternative I know of. If you use https://github.com/OCR-D/ocrd-train it is quite simple once you have the crops and the corresponding text. I did it a few times and

Re: [tesseract-ocr] What is the information in basetrain.log

2018-12-09 Thread Lorenzo Bolzani
You can find some details here: https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00 Lorenzo On Sun, 9 Dec 2018 at 18:02 Zohreh Khosrobeygi < beigy.zoh...@gmail.com> wrote: > Hi, > Does any one

Re: [tesseract-ocr] Getting alternative options for OCR results

2019-01-02 Thread Lorenzo Bolzani
I use a python wrapper and I can ask for alternative chars, but with 4.x I always get just one. With 3.x I used to get multiple ones. As far as I know right now 4.x does not provide this feature. Lorenzo On Wed, 2 Jan 2019 at 18:59 Zdenko Podobny wrote: > it is not available

Re: [tesseract-ocr] Evaluating Tesseract with new domain-specific documents

2019-01-25 Thread Lorenzo Bolzani
This is an option if you want to consider missing/extra chars too: https://en.wikipedia.org/wiki/Levenshtein_distance You should be able to find implementations for most languages. Bye Lorenzo On Fri, 25 Jan 2019 at 11:56 Matthew Hodgskiss < matthew.hodgsk...@gmail.com> wrote
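For reference, a minimal sketch of the Levenshtein distance and of a character error rate derived from it. The names `levenshtein` and `character_error_rate` are illustrative, and dividing by the length of the ground truth is one common normalisation, assumed here rather than taken from the thread.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(truth, ocr):
    """Edit distance divided by the length of the ground truth."""
    if not truth:
        return 0.0 if not ocr else 1.0
    return levenshtein(truth, ocr) / len(truth)


# One extra character in the OCR output counts as one edit.
extra_char_distance = levenshtein("17A4L4", "17AE4L4")
```

Unlike a position-by-position comparison, a single extra or missing character costs one edit instead of shifting every following character into an error.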

Re: [tesseract-ocr] How to optimize tesseract to maximum speed for single number (several digits) recognition

2019-01-29 Thread Lorenzo Bolzani
First double check that the Pi is not throttling due to overheating or lack of USB power. This may cause the slowdown. Usually 30/50 px of text height is fine. If the problem is tesseract, try to use the fast model (or "normal" if using best). I assume you are using the 4.x release. Try tesseract -

Re: [tesseract-ocr] How to optimize tesseract to maximum speed for single number (several digits) recognition

2019-01-30 Thread Lorenzo Bolzani
Did you check this? https://www.raspberrypi.org/forums/viewtopic.php?f=63&t=147781&start=50#p972790 On Wed, 30 Jan 2019 at 08:09 Jan Pohanka wrote: > I have already done that but haven't found anything interesting. > I tried to ask here if there are eg. any part of algorithms t

Re: [tesseract-ocr] Re: Tesseract not giving the desired output

2019-01-30 Thread Lorenzo Bolzani
Zdenko, are you 100% sure that the image is binarized before being fed to the neural network? It looks like a big waste of information to me. On Wed, 30 Jan 2019 at 07:56 Zdenko Podobny wrote: > That is not true: you do not need to transform image to grayscale. Any > image is a

Re: [tesseract-ocr] Re: Tesseract not giving the desired output

2019-01-30 Thread Lorenzo Bolzani
aseapi.cpp#L638> > > > Zdenko > > > On Wed, 30 Jan 2019 at 11:17 Lorenzo Bolzani wrote: > >> >> Zdenko, are you 100% sure that the image is binarized before being fed to >> the neural network? It looks like a big waste of information to me. >> >> >

Re: [tesseract-ocr] pytesseract: errors with recognized digits

2019-01-30 Thread Lorenzo Bolzani
Try psm 6. Try a few small upscales so that the text is between 30-40 px and see if it helps, like 31, 33, 35, 37, 39 (on a large test set). Try to crop all the white border (imagemagick, gimp) and see if it helps. Otherwise you need to fine tune the model: https://github.com/tesseract-ocr/tesse
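To turn the target heights into resize factors, a tiny helper is enough. This is a sketch: `scale_factors` is a hypothetical name, and the current text height has to be measured on your own images.

```python
def scale_factors(current_height, target_heights=(31, 33, 35, 37, 39)):
    """Map each target text height (px) to the resize factor that reaches it."""
    return {target: target / current_height for target in target_heights}


# Text that is currently 20 px tall.
factors = scale_factors(20)
```

Each factor can then be passed to the resize step of your image tool, and the resulting accuracy compared on the whole test set rather than on a single image.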

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-30 Thread Lorenzo Bolzani
If you have images of the cards with the corresponding text you could train it on the cropped/cleaned text directly. On Wed, 30 Jan 2019 at 15:41 Daniel Ferenc wrote: > So, I have figured out what was I doing wrong: > > - I am using tesseract packages I got from apt on ubuntu 18

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-31 Thread Lorenzo Bolzani
Yes, generating text is faster and easier. But the real extracted and cleaned text you are going to eventually recognize is going to be different from this, more or less depending on a lot of factors: - how similar your training font actually is - how good your cleanup will be (test this in advanc

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-31 Thread Lorenzo Bolzani
You can have a look at ocrd-train https://github.com/OCR-D/ocrd-train You just have to prepare cropped tiff and txt files with the same name containing a single line of text. At the same time, if you already set up everything for the font based training, I'd give it a try (time permitting): you

Re: [tesseract-ocr] pytesseract: errors with recognized digits

2019-01-31 Thread Lorenzo Bolzani
Check the API: https://pypi.org/project/pytesseract/ There is an example under: Support for OpenCV image/NumPy array objects You may also try different languages (I had different results just on numbers). On Thu, 31 Jan 2019 at 15:18 Aaron Spell <8383...@gmail.com> wrote: >

Re: [tesseract-ocr] Re: Tesseract not giving the desired output

2019-02-01 Thread Lorenzo Bolzani
sable the binarization step to see if I get an improvement. Maybe there are some parameters controlling this step. Thanks Lorenzo On Thu, 31 Jan 2019 at 20:42 Zdenko Podobny wrote: > see inline comments. > > On Wed, 30 Jan 2019 at 15:17 Lorenzo Bolzani wrote: > >

Re: [tesseract-ocr] Ocr-d train - Tesseract 4.0 Training

2019-02-04 Thread Lorenzo Bolzani
To use ocrd you need to prepare image files and txt files with the same name but different extension. For example: sample1.png sample1.gt.txt The gt.txt is a simple text file containing the correct text, 145, for example. The images must be cropped with no border or just a couple of pixels. Text
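Before starting a training run it can help to verify that every image has its ground-truth file. A minimal sketch, assuming the naming shown above (`sample1.png` next to `sample1.gt.txt`); the function name and the default suffix are illustrative.

```python
from pathlib import Path


def unpaired_samples(folder, image_suffix=".png"):
    """Return (images without a .gt.txt, .gt.txt files without an image)."""
    folder = Path(folder)
    images = {p.name[: -len(image_suffix)] for p in folder.glob(f"*{image_suffix}")}
    texts = {p.name[: -len(".gt.txt")] for p in folder.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Calling `unpaired_samples("data/train")` (a hypothetical folder) returns two lists of base names; both should be empty before training.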

Re: [tesseract-ocr] OCRd gives error at Makefile:84: data/list.train

2019-02-05 Thread Lorenzo Bolzani
Check the output: /bin/bash: bc: command not found You need to install the small "bc" program. Lorenzo On Tue, 5 Feb 2019 at 11:49 Kristóf Horváth < vazzzeg...@gmail.com> wrote: > So I edited my generate_line_box.py with the following code: >> >> #!/usr/bin/env python >> >> >

Re: [tesseract-ocr] OCRd gives error at Makefile:84: data/list.train

2019-02-05 Thread Lorenzo Bolzani
It depends on what OS you are using. Usually it is something like: apt-get install bc yum install bc or you can use the graphic tool to manage packages. If you are using cygwin I suppose it is similar but I never used it. It's a very common package and it is strange that it is not available by

Re: [tesseract-ocr] Tesseract Guide for newbies (first draft)

2019-02-07 Thread Lorenzo Bolzani
Hi Kristof, good work, I thought about it a few times. I gave a quick look, just a couple of quick notes, I'll try to read it better when I get time. This thread about the font size is where I got the 30/40px indication: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tes

Re: [tesseract-ocr] Ocr-d train - Tesseract 4.0 Training

2019-02-07 Thread Lorenzo Bolzani
You do not need any font or font data, just the images and the corresponding text. As a bare minimum 500/1000. On Thu, 7 Feb 2019 at 05:10 wrote: > Thanks for your response, Since these are handwritten digits I don't have > font data and what I'm having is cropped image blocks

Re: [tesseract-ocr] What other learning_rate is available apart from 20e-4?

2019-02-08 Thread Lorenzo Bolzani
Learning rate is the speed of the training. It's a continuous value, so you can use any value. To see some difference you need to change it by an order of magnitude or at least twice/half the value. In practice it is the size of the correction applied to the neural network's weights during the training (
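A toy gradient-descent example makes the effect visible. This is not tesseract's training code, just a sketch of the general update rule on a single made-up weight, minimising (w - 3)^2; the function names are illustrative.

```python
def sgd_step(weight, gradient, learning_rate):
    """One update: the correction is the gradient scaled by the learning rate."""
    return weight - learning_rate * gradient


def minimise(learning_rate, steps=50, start=0.0):
    """Minimise f(w) = (w - 3)^2 with plain gradient descent."""
    weight = start
    for _ in range(steps):
        gradient = 2.0 * (weight - 3.0)  # derivative of (w - 3)^2
        weight = sgd_step(weight, gradient, learning_rate)
    return weight


fast = minimise(0.1)    # ends very close to the optimum at 3.0
slow = minimise(0.001)  # two orders of magnitude smaller: still far away
```

With the same number of steps, the smaller rate barely moves the weight, which is why a change of an order of magnitude is needed to notice a difference.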

Re: [tesseract-ocr] [4.00] Extra symbols produced

2019-03-01 Thread Lorenzo Bolzani
Yes, I have the same problem: some characters are split, and sometimes from one character you even get three ("O0O" for example). https://github.com/tesseract-ocr/tesseract/issues/1778 I wrote some quite complex code to try to limit the problem (with psm 13). The idea is this: Process each symbol indi

Re: [tesseract-ocr] How to choose a suitable threshold for Binarisation

2019-03-08 Thread Lorenzo Bolzani
If someone wants to try this and is looking for a python implementation here is one: http://scikit-image.org/docs/dev/auto_examples/segmentation/plot_niblack_sauvola.html https://github.com/scikit-image/scikit-image/pull/905/files/bb6af8ec723776fc821654847aec04a652f70042 binary_phansalkar = thr

[tesseract-ocr] Does the psm value used to generate lstmf files influences the training?

2019-03-21 Thread Lorenzo Bolzani
Hi, I keep having problems with duplicated letters with custom fine-tuned models. For example an M becomes MH. I'm using ocrd-train with actual crops and I noticed that the lstmf files are generated with psm 6. At runtime I use psm 7. Do you think this may make a difference? From a quick test it

Re: [tesseract-ocr] General strategies for dealing with problem images

2019-03-23 Thread Lorenzo Bolzani
On Tue, 19 Mar 2019 at 06:03 Jonathan Muller < jmul...@pukogames.com> wrote: > 5 - Create a whitelist based on the zone of probable characters (this one > improves accuracy a lot !) > How do you do whitelisting with tesseract 4.x? As far as I know it is not yet supported. I do the

Re: [tesseract-ocr] confuse whether Otsu Thresholding affects lstm training

2019-04-03 Thread Lorenzo Bolzani
Hi, I train with real data. I use grayscale images, I think color makes no difference. I do a very good image cleanup: background removal, denoise, straightening, sharpening, illumination correction, contrast stretching, etc. before passing the text to tesseract. This part is likely better done o

Re: [tesseract-ocr] confuse whether Otsu Thresholding affects lstm training

2019-04-08 Thread Lorenzo Bolzani
a script for image pre-processing? Please share, if possible. > It will be helpful to many. > > On Wed, Apr 3, 2019 at 6:47 PM Lorenzo Bolzani > wrote: > >> Hi, I train with real data. I use grayscale images, I think color makes >> no difference. >> >> I do

Re: [tesseract-ocr] confuse whether Otsu Thresholding affects lstm training

2019-04-08 Thread Lorenzo Bolzani
om deep learning. We can get any > complicated feature from convolution. So theoretically, there is no need to do > such preprocessing. What do you think about this? > > > On Wed, Apr 3, 2019 at 21:17 Lorenzo Bolzani wrote: > >> Hi, I train with real data. I use grayscale im

Re: [tesseract-ocr] small image and OCR

2019-04-14 Thread Lorenzo Bolzani
Hi Alex, you need to pre-process the image a little. First negate it: tesseract expects dark text on a white background. Then use --psm 6 to tell tesseract that this is a single block of text and not a complex page to split in paragraphs. Also try psm 7, single line. tesseract --psm 6 cropped_image
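Negating an 8-bit grayscale image is just `255 - value` for each pixel. The sketch below works on a plain list of rows so it runs without any image library. The `needs_inversion` heuristic (mean brightness below 128, assuming the background covers most of the crop) is an assumption for illustration, not something tesseract does.

```python
def invert(image):
    """Negate an 8-bit grayscale image given as a list of rows."""
    return [[255 - value for value in row] for row in image]


def needs_inversion(image):
    """Guess whether the crop is light text on a dark background.

    Assumes the background dominates, so a dark average means a dark background.
    """
    pixels = [value for row in image for value in row]
    return sum(pixels) / len(pixels) < 128


# Mostly dark crop with one bright "text" pixel.
crop = [
    [10, 10, 10],
    [10, 240, 10],
]
prepared = invert(crop) if needs_inversion(crop) else crop
```

On real files the same negation is available in most tools, for example imagemagick's `-negate` option.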

Re: [tesseract-ocr] Tips and advice for preprocessing images before feeding them to tesseract.

2019-04-15 Thread Lorenzo Bolzani
This is very hard to do reliably for general images. You may use something like EAST to detect text regions, then a few tests to understand if it's black on white text or the opposite. Then you can crop the image and rescale it to a standard size (this may not be the final size you'll feed to tess

Re: [tesseract-ocr] How to choose the stop condition of LSTM training

2019-04-17 Thread Lorenzo Bolzani
Split the data set in two parts (80/20 for example), use the large one for training and the other for evaluation. Train for a few epochs (100 or 1000 depending on how much data you have), stop it and check with lstmeval if the *eval score* is improving. Restart the training adding 100/1000 to the
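The split itself can be sketched in a few lines. The 80/20 ratio comes from the text above; the function name, the fixed seed and the sample file names are illustrative assumptions.

```python
import random


def split_train_eval(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples reproducibly and split them into (train, eval)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample names; in practice these would be the lstmf file paths.
names = [f"line_{i:03d}.lstmf" for i in range(100)]
train_list, eval_list = split_train_eval(names)
```

Shuffling before the cut matters when the files are ordered by page or by font, otherwise the evaluation set may not resemble the training set.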

Re: [tesseract-ocr] How to choose the stop condition of LSTM training

2019-04-18 Thread Lorenzo Bolzani
> >> There is no existing utility to do that. However, Ray had dumped the info >> for tessdata_fast (and partly for tessdata_best) which has been posted in >> the wiki at >> >> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast >> >> >

Re: [tesseract-ocr] is there a way to scan only first word of a page?

2019-04-19 Thread Lorenzo Bolzani
Hi, if the page has a fixed simple format you can crop the image leaving only the upper part. You can use imagemagick or a python script, etc. Lorenzo On Fri, 19 Apr 2019 at 14:49 Vikas Sharma < vikasharma2...@gmail.com> wrote: > Hello guys, > > I am trying to identify page cate

Re: [tesseract-ocr] small image and OCR

2019-04-23 Thread Lorenzo Bolzani
Hi, I suspect you did a cut and paste or some edits and now you have some non-printable characters in your command line (the question mark boxes). Write it again from scratch. And you are missing one parameter in the command line, the output file, you can use "-" for standard output. $ tesseract

Re: [tesseract-ocr] Re: Recognition of "5" instead of "S"

2019-04-28 Thread Lorenzo Bolzani
I think the problem is also that the network does not expect a mix of letters and numbers. The text is processed as a continuous stream and not as individual characters. This is good for text but not for codes. So if you want to fine tune you need to provide similar mixed sequences. Also, if poss

Re: [tesseract-ocr] Simple image FAIL fails

2019-04-29 Thread Lorenzo Bolzani
Hi, inverting the image gives the correct results. Also cropping the image just around the text works. Lorenzo On Mon, 29 Apr 2019 at 19:11 Jason wrote: > Apologies for such a simple question but this is a super simple test case > and I don't understand why it isn't working. T

Re: [tesseract-ocr] Fails to recognize seemingly simple text

2019-05-02 Thread Lorenzo Bolzani
Hi, use psm 6 (or 7). Also try to crop to have a single line, if possible. Black text on white bg is better. You should be able to isolate text in this way: https://www.pyimagesearch.com/2017/07/17/credit-card-ocr-with-opencv-and-python/ Lorenzo On Thu, 2 May 2019 at 16:15 Arjun Bk

Re: [tesseract-ocr] Fine tuning existing model

2019-05-02 Thread Lorenzo Bolzani
lse to configure? >> >> >> Thanks, bye >> >> Lorenzo >> >> >> 2018-06-29 18:27 GMT+02:00 Shree Devi Kumar : >> >>> You should be able to use the new makefile after you make changes for >>> all the directory locations to match your

Re: [tesseract-ocr] Fine tuning existing model

2019-05-03 Thread Lorenzo Bolzani
s://github.com/tesseract-ocr/tessdata). >>>> >>>> >>>> >>>> One more question: I wanted to check if the output character set of the >>>> new and old model differ. I used: >>>> >>>> combine_tessdata -u eng.tra

Re: [tesseract-ocr] Fine tuning existing model

2019-05-03 Thread Lorenzo Bolzani
Shree, thanks for the clarification. On Fri, 3 May 2019 at 11:59 Shree Devi Kumar < shreesh...@gmail.com> wrote: > >There are three model sizes: best, normal and fast. Each of these can > also be converted to an integer model. > > Only `best` can be converted to integer and in fa

Re: [tesseract-ocr] OCR Failing to Consistenly Recongnize the single digit in my screenshot

2019-05-07 Thread Lorenzo Bolzani
Hi, try to invert the images (black text on white) and use psm 6 or 7. Increasing contrast may also help. Lorenzo On Tue, 7 May 2019, 08:49 Sean Connell wrote: > Currently my program searches for the picture of the word Opponents on the > screen then moves a bit and takes a picture of the

Re: [tesseract-ocr] OCR Failing to Consistenly Recongnize the single digit in my screenshot

2019-05-07 Thread Lorenzo Bolzani
This is where you need to improve contrast. https://pillow.readthedocs.io/en/stable/reference/ImageEnhance.html You need to play a little with PIL to find out what works best for your data. Lorenzo On Tue 7 May 2019 at 21:21 Sean Connell < nightfire120sla...@gmail.com> wrote:
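A small sketch of the contrast step with Pillow's `ImageEnhance`, assuming Pillow is installed. The function name `boost_contrast` and the default factor of 2.0 are only starting points for experimenting, not recommended values.

```python
from PIL import Image, ImageEnhance, ImageOps


def boost_contrast(image, factor=2.0):
    """Return a grayscale copy of `image` with contrast scaled by `factor`.

    factor 1.0 leaves the image unchanged, larger values push pixels
    away from the mean gray level.
    """
    gray = ImageOps.grayscale(image)
    return ImageEnhance.Contrast(gray).enhance(factor)
```

For example, `boost_contrast(Image.open("digit.png"), 3.0).save("digit_contrast.png")` writes a version to compare by eye or to feed to tesseract; the file names are placeholders.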

Re: [tesseract-ocr] How to extract text for processing by tesseract v4?

2019-05-08 Thread Lorenzo Bolzani
Hi, you can try a few things, but you need to write a small script (python, etc.) or use imagemagick. I suggest first trying with gimp to find what works best, and then writing the code. You want dark text on a light background. For white text on red: 1. Invert the image. Desaturate. Increase contrast.
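The three steps for white text on red (invert, desaturate, increase contrast) could look roughly like this in numpy. This is a sketch under assumptions: the input is an 8-bit RGB array, the luminance weights are the usual Rec. 601 ones, and "increase contrast" is done as a plain min/max stretch. The name `prepare_light_text` is hypothetical.

```python
import numpy as np


def prepare_light_text(rgb):
    """Turn light text on a colored background into dark text on light.

    rgb: uint8 array of shape (H, W, 3). Returns a uint8 array (H, W).
    """
    # 1. Invert: white text becomes black, red background becomes cyan.
    inverted = 255.0 - rgb.astype(np.float32)
    # 2. Desaturate: weighted sum of the R, G, B channels.
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = inverted @ weights
    # 3. Increase contrast: stretch the values to the full 0..255 range.
    lo, hi = gray.min(), gray.max()
    if hi > lo:
        gray = (gray - lo) / (hi - lo) * 255.0
    return gray.round().astype(np.uint8)
```

The result can be saved with any image library and passed to tesseract; whether a plain stretch is enough depends on the images.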

[tesseract-ocr] Processing an image batch from the API

2019-05-09 Thread Lorenzo Bolzani
Hi, is there a way to process a batch of images with a single API call? By looking at the API I'm quite sure you cannot, but maybe I'm missing something. Thanks, Lorenzo

Re: [tesseract-ocr] Processing an image batch from the API

2019-05-09 Thread Lorenzo Bolzani
Like a dozen JPEGs. I can do a for loop, but I'm looking for something like a setImages() giving me a list of results. On Thu 9 May 2019 at 16:35 Zdenko Podobny wrote: > What do you mean by " batch of images "? Tiff? > > Zdenko > > > Thu 9
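For reference, the for-loop fallback mentioned here can be wrapped so the caller still gets a list of results. This is only a sketch: `ocr_batch` is a made-up helper, it does not rely on any batch call in the tesseract API, and the default path assumes the pytesseract and Pillow packages are available.

```python
def ocr_batch(paths, ocr=None):
    """Run OCR on each path in turn and return the texts in the same order.

    `ocr` is any callable taking a path and returning a string; when it is
    omitted, pytesseract is used (an assumption, not the only option).
    """
    if ocr is None:
        # Imported lazily so the helper can be used with another backend.
        import pytesseract
        from PIL import Image

        def ocr(path):
            with Image.open(path) as img:
                return pytesseract.image_to_string(img)

    return [ocr(path) for path in paths]
```

The images are still processed one at a time; the wrapper only hides the loop.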

Re: [tesseract-ocr] looking for URLs in screen shots

2019-05-15 Thread Lorenzo Bolzani
Hi, if you are willing to program a little this is what I would try: - opencv template matching: extract a few frame fragments containing "https://" from the video then look for it in all frames (or maybe one frame out of).

Re: [tesseract-ocr] How to extract text for processing by tesseract v4?

2019-05-20 Thread Lorenzo Bolzani
I just found this: https://www.quora.com/How-do-I-fill-holes-in-image-using-image-processing/answer/V-Sri-Chakra-Kumar On Wed 8 May 2019 at 09:57 Lorenzo Bolzani wrote: > Hi, > you can try a few things, but you need to write a small script (python, > etc.) or use im

Re: [tesseract-ocr] Recommendation on how to best train Tesseract for new UTF-8 symbols

2019-05-21 Thread Lorenzo Bolzani
Hi, when you fine tune the model (maybe with ocrd-train) you can choose to restrict the model output to a smaller set of characters. No need to blacklist or anything else. If you just want to locate the symbols something like opencv matchTemplate

Re: [tesseract-ocr] Tire DOT OCR - Black Text, Black Background

2019-05-21 Thread Lorenzo Bolzani
Hi, this looks hard. You have two problems here: straightening the text and cleaning it up. Once you have straightened the text to something like this: [image: 8829199908894_crop.jpg] the google vision api recognizes it correctly. So it can be done. I do not know how they

Re: [tesseract-ocr] OCRing simple numbers unreliable

2019-05-22 Thread Lorenzo Bolzani
Hi, try these (in any combination): psm 6 or 7; remove the white border (all or most); downscale so that the font is 20 to 50 px tall; fine tune a model to recognize only numbers; threshold. Otherwise post more details about how you are using tesseract. Bye Lorenzo On Wed 22 May 2019 at 11:
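Two of these suggestions (removing the white border and picking a scale so the text ends up 20 to 50 px tall) can be sketched with numpy alone. The helper names, the `white_level` of 250 and the 35 px target are assumptions for illustration.

```python
import numpy as np


def crop_white_border(gray, white_level=250):
    """Crop a grayscale uint8 image to the bounding box of non-white pixels."""
    dark = gray < white_level
    if not dark.any():
        return gray  # nothing but border, leave the image alone
    rows = np.flatnonzero(dark.any(axis=1))
    cols = np.flatnonzero(dark.any(axis=0))
    return gray[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


def scale_for_text_height(text_height_px, target_px=35):
    """Resize factor that brings text of the given height to target_px."""
    return target_px / float(text_height_px)
```

For a single line of digits the height of the cropped image is a reasonable estimate of the text height, so `scale_for_text_height(cropped.shape[0])` gives the factor to pass to whatever resize function is in use.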

Re: [tesseract-ocr] Black & white comic text recognition

2019-05-24 Thread Lorenzo Bolzani
Hi, I do not think tesseract page segmentation can handle this kind of layout. It's more oriented towards paragraphs, tables and classic text layouts. And I think page segmentation is not based on neural networks. I would try something like opencv EAST

Re: [tesseract-ocr] unicharset_extractor error

2019-05-24 Thread Lorenzo Bolzani
Also try:

    locate tesseract
    ldconfig -p | grep tesseract
    ls -l /usr/local/lib/libtesseract*

and run:

    sudo ldconfig

after you uninstall tesseract (or even right now). On Fri 24 May 2019 at 15:37 anne < christineannecatu...@gmail.com> wrote: > These are what I get > *ldd /u

Re: [tesseract-ocr] MRZ/MRP (Machine-readable zone/passport) dataset for tesseract v4

2019-05-29 Thread Lorenzo Bolzani
Hi Mamadou, this sounds very interesting. How did you do the training and accuracy measurements? What parameters did you use for the model? Thanks, bye Lorenzo On Mon 27 May 2019 at 07:38 Mamadou wrote: > Hello, > > We have open sourced (BSD license) MRZ/MRP (Machine-readabl

Re: [tesseract-ocr] Bounding box

2019-06-09 Thread Lorenzo Bolzani
I think you are talking about preparing the training data. With tesseract 4.x you do not need to define the boxes for each character, just one big box for the whole line. Bye Lorenzo On Sun 9 Jun 2019 at 10:50 Jennil Thiyam < thiyamjen...@gmail.com> wrote: > ই 110 4657 137 47

Re: [tesseract-ocr] Bounding box

2019-06-09 Thread Lorenzo Bolzani
the link about "no need" of bounding boxes of > every unit but rather the whole line > > On Sun, Jun 9, 2019 at 2:52 PM Lorenzo Bolzani > wrote: > >> I think you are talking about preparing the training data. With >> tesseract 4.x you do not need to define the

Re: [tesseract-ocr] Tesseract does not give good output we need some suggestion.

2019-06-11 Thread Lorenzo Bolzani
Try to straighten the text: https://www.pyimagesearch.com/2017/02/20/text-skew-correction-opencv-python/ (I suspect you are already doing this) Small dots will give you problems with this method, so first make a copy of the image, run a light close/erode (google: morphology transformation) to re

Re: [tesseract-ocr] Re: FontAwesome and Tesseract

2019-06-18 Thread Lorenzo Bolzani
How many different chars do you need to detect? What is the size range (in pixels)? What kind of images, scans, smartphone pictures, screenshots? If you just want to locate the symbols something like opencv matchTemplate may

Re: [tesseract-ocr] OCR pipeline with OpenCV

2019-06-19 Thread Lorenzo Bolzani
Hi Nicolas, I think what you did is good, you just need to play with pre-processing more. I usually process the images with Gimp until I can get good results, then I try to do the same processing with opencv/PIL. You do not strictly need to threshold the image, a very very strong contrast is en

Re: [tesseract-ocr] Re: Trouble reading text "in between lines"

2019-06-26 Thread Lorenzo Bolzani
Can you cut the image vertically in a simple way? Lorenzo On Wed 26 Jun 2019 at 11:08 'Hu gePanic' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > I have "sort of" solved the problem. > > I run tesseract 2 times. > After the first run I delete all the text already
