Re: [tesseract-ocr] Re: Issues with OCR recogniztion

2022-09-10 Thread Zdenko Podobny
I would suggest reading the doc before using any tool. In this case: https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html Zdenko On Sat, 10 Sep 2022 at 9:01, Daniele Consoli wrote: > Update, I have fixed this myself by using an app to turn all pictures > black and white. > I think the iss

Re: [tesseract-ocr] Low Quality image but little no Noise

2022-09-15 Thread Zdenko Podobny
Did you try the documentation? Zdenko On Thu, 15 Sep 2022 at 11:43, Shester Msouobu wrote: > Hey ! I have a set of lot quality images tesseract can't well read. Though > there is literally no noise on there. Any help ? > > Example > > [image: images1871.png] > > Tesseract output > "Cerra)" > > -- >

Re: [tesseract-ocr] AdaptiveClassifierIsEmpty read-access violation

2022-09-22 Thread Zdenko Podobny
Tesseract 4.x is an old and unsupported version. So it would be nice if you could provide an example code with the public API that causes the read-access violation problem. function AdaptiveClassifierIsEmpty is not part of the public API ( https://github.com/tesseract-ocr/tesseract/tree/main/inclu

Re: [tesseract-ocr] Question : can I force Tesseract to follow an existing layout?

2022-09-23 Thread Zdenko Podobny
Tesseract supports uzn files [1] with psm 4. Search the forum for more details. [1] https://github.com/OpenGreekAndLatin/greek-dev/wiki/uzn-format Zdenko On Fri, 23 Sep 2022 at 17:20, Vincent Sarbach-Pulicani wrote: > Hello, > I'm working on historical newspaper from the interwar period written in 3 >
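
A minimal sketch of writing zones in the uzn layout, assuming the format described on the linked wiki page (one zone per line: left, top, width, height, then a label without spaces). The zone coordinates and labels below are made up for illustration, and the file naming convention is an assumption to check against the wiki.

```python
def format_uzn(zones):
    """Return uzn text: one 'left top width height label' line per zone."""
    lines = []
    for left, top, width, height, label in zones:
        lines.append(f"{left} {top} {width} {height} {label}")
    return "\n".join(lines) + "\n"


# Hypothetical zones for a two-block page (pixel coordinates).
zones = [(50, 40, 900, 120, "Headline"), (50, 200, 430, 1500, "Column1")]
uzn_text = format_uzn(zones)

# Assumption: for an image "page.png" the text is saved next to it as
# "page.uzn" before running tesseract with --psm 4.
print(uzn_text, end="")
```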

Re: [tesseract-ocr] get font that Leptonica determines for current character

2022-09-26 Thread Zdenko Podobny
Leptonica is an image processing library. For font guessing, use an appropriate tool: https://www.google.com/search?q=get+font+name+from+image&oq=get+font+name+from+image Tesseract 3.x is a very old and unsupported version. The current version is 5.2. Zdenko On Sat, 24 Sep 2022 at 20:35, Fish Money wrote

Re: [tesseract-ocr] Tesseract character recognition and C++

2022-09-26 Thread Zdenko Podobny
If there were a magic function for improving accuracy that works for any case, we would have implemented it years ago. Please read the relevant documentation, where we collected tesseract best practices. Image preprocessing depends on the image you did not show (instead of that you posted not formatted c

Re: [tesseract-ocr] environment variables issues

2022-09-26 Thread Zdenko Podobny
You did not write which version of tesseract you used. Based on your previous posts: What about a little testing and thinking by yourself before posting on the forum? E.g. setting the variable to C:\tesseract-ocr\ or C:\tesseract-ocr\tessdata\ ? Zdenko On Sat, 24 Sep 2022 at 20:40, Fish Money wrote
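
The "little testing" suggested above can be sketched as a check of which candidate directory actually holds the language file. This assumes the variable in question is TESSDATA_PREFIX and that the language is eng; neither is stated in the preview, and the helper name is hypothetical.

```python
from pathlib import Path


def find_tessdata(candidates, lang="eng"):
    """Return the first directory that contains <lang>.traineddata, else None."""
    for directory in candidates:
        if (Path(directory) / f"{lang}.traineddata").is_file():
            return Path(directory)
    return None


# The two directories mentioned in the reply.
candidates = [r"C:\tesseract-ocr", r"C:\tesseract-ocr\tessdata"]
print(find_tessdata(candidates))
```

Whichever directory is printed is the one worth trying as the variable's value first; `None` means the traineddata file is in neither place.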

Re: [tesseract-ocr] Tightly cropped numbers are unreadable by tesseract

2022-10-18 Thread Zdenko Podobny
> tesseract 84.png - --psm 7 84 > tesseract --version tesseract 5.2.0-53-g80de leptonica-1.83.0 (Oct 8 2022, 14:19:38) [MSC v.1929 LIB Release x64] libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.0.91) : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.12 : libwebp 1.2.2 : libopenjp2 2.5.0 Found AVX2 Fou

Re: [tesseract-ocr] Combining traineddata files

2022-10-22 Thread Zdenko Podobny
Legacy or LSTM? Zdenko On Sat, 22 Oct 2022 at 8:09, Umanda Dikwatta wrote: > Hello, > > I have several traineddata files for different fonts for the same > language. How to combine these files to get one traineddata file ? > > Best Regards > > -- > You received this message because you are subs

Re: [tesseract-ocr] how do i make this work on mac

2022-10-26 Thread Zdenko Podobny
What is your problem? Tesseract works on Mac without problems AFAIK. At least we test the Mac build with GitHub Actions: https://github.com/tesseract-ocr/tesseract/actions/workflows/autotools-macos.yml On Wed, 26 Oct 2022, 18:21 Aarik Ghosh, wrote: > how can i make it work on mac? > > -- > You receiv

Re: [tesseract-ocr] WARNING! LEAK!

2022-10-27 Thread Zdenko Podobny
I am not sure about C#, but the rules for C++ are described e.g. here: https://en.wikipedia.org/wiki/New_and_delete_(C%2B%2B) Zdenko On Thu, 27 Oct 2022 at 11:17, Björn Gitter wrote: > I'm using Tesseract 4.1.1 and Tesseract.Drawing 4.1.1 in c# > I downloaded the language data from > https://g

Re: [tesseract-ocr] Combining traineddata files

2022-10-31 Thread Zdenko Podobny
As far as I know, there is no tool that can merge finished training. IMO for LSTM it should work as incremental training, e.g. you use the output of one fine-tuning as input for the next fine-tuning... (I never tried it ;-) ) Zdenko On Sat, 22 Oct 2022 at 9:36, Umanda Dikwatta wrote: > > LSTM > On

Re: [tesseract-ocr] Hello,

2022-10-31 Thread Zdenko Podobny
Hello, how did you come to the conclusion you need to do training? You did not post any examples, but for simple solutions, there are plenty of examples on the internet (search for ALPR): https://pyimagesearch.com/2020/09/21/opencv-automatic-license-number-plate-recognition-anpr-with-python/ http

Re: [tesseract-ocr] Cannot install 5.2.0 in debian

2022-11-21 Thread Zdenko Podobny
Did you try to read https://github.com/tesseract-ocr/tesseract#installing-tesseract ? Zdenko On Mon, 21 Nov 2022 at 14:57, Kan Ping Law wrote: > My Linux version: Debian GNU/Linux 11 (bullseye) > > I enter agt-get install tesseract-ocr. > The version downloaded is 4.1.1 > If I specify the version

Re: [tesseract-ocr] Bad recognition with good input image

2022-11-24 Thread Zdenko Podobny
Please read and follow the docs: https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md You can easily get output like this: Reserved 267 Mana 702/969 Rage 50/50 Life 981/2,181 Shield 221 Ward 267 Zdenko On Thu, 24 Nov 2022 at 7:00, Austin Seymour wrote: > Working on a progr

Re: [tesseract-ocr] HRe: tesseract 4.1.1 slow in aws instance centos7

2022-11-28 Thread Zdenko Podobny
Is this version information from the server or from the laptop? General rule: use the latest stable version (5.2; 4.x is unsupported). Zdenko On Mon, 28 Nov 2022 at 15:49, Giuseppe Coniglio wrote: > Hi, I have same problem in my Oracle Linux Server 8.6 > > tesseract 4.1.1 > leptonica-1.76.0 >

Re: [tesseract-ocr] Can't get bib#'s from tshirt JPG: should be simple.

2022-11-29 Thread Zdenko Podobny
Tesseract is an OCR engine. You need to search for " text detection from natural scenes" e.g.: https://scholar.google.com/scholar?q=text+detection+in+natural+scenes&as_sdt=0&as_vis=1&oi=scholart https://www.sciencedirect.com/science/article/pii/S1877050922001867 https://d1wqtxts1xzle7.cloudfront.ne

Re: [tesseract-ocr] Text/numbers from Nature scenes?

2022-12-02 Thread Zdenko Podobny
This is an English-speaking forum. Please send your message in English. Zdenko On Sat, 3 Dec 2022, 08:34 Rolando José Torres Sánchez, < rolandojtor...@gmail.com> wrote: > BUeno creo que serviria mucho nos compartieras la imagen que presenta los > problemas, y la version de tesseract que haz usado.de

Re: [tesseract-ocr] Need Help with tiktok image processing

2022-12-09 Thread Zdenko Podobny
1. Implement text detection on the image (EAST, YOLO... see https://www.youtube.com/watch?v=ZpRNfWzuexQ) or search for "text detection python" 2. Process detected areas so there is a text without any graphics - see some suggestions in docs ( https://github.com/tesseract-ocr/tessd

Re: [tesseract-ocr] A Simple grayscale image cannot be OCR'd

2022-12-11 Thread Zdenko Podobny
Run this to understand what the problem is: > tesseract 8fXlqZY.png 8fXlqZY --psm 7 get.images Then check the binarized version of your input that is used for OCR: 8fXlqZY.processed.tif There are 2 simple ways to solve the problem: 1. using only text areas for OCR (e.g. cropping image to text wit

[tesseract-ocr] heads-up tesseract 5.3 release is comming

2022-12-13 Thread Zdenko Podobny
Dear developers, A new release of tesseract 5.3 is coming soon (expected within a week). There is available release candidate 1: https://github.com/tesseract-ocr/tesseract/releases/tag/5.3.0-rc1 Please check, if this release works for you, report possible tesseract issues, and adjust your solutio

[tesseract-ocr] Fwd: [tesseract-ocr/tesseract] Release 5.3.0 - 5.3.0

2022-12-23 Thread Zdenko Podobny
Thank you, Stefan and other contributors for this new release! Zdenko -- Forwarded message - From: Stefan Weil Date: Thu, 22 Dec 2022 at 15:18 Subject: [tesseract-ocr/tesseract] Release 5.3.0 - 5.3.0 To: tesseract-ocr/tesseract Cc: Subscribed 5.3.0

Re: [tesseract-ocr] tesseract 5.2.0 available on raspi4?

2022-12-27 Thread Zdenko Podobny
Hello, those commands are IMO for Ubuntu, and AFAIK Raspbian is based on Debian... So you got the correct reply about Raspbian/buster not being supported... Try these steps/commands for Raspberry: sudo apt update sudo apt install apt-transport-https sudo cp /etc/apt/sources.list /etc/apt/sour

Re: [tesseract-ocr] How to get bounding box with text using Tesseract4Android

2023-01-27 Thread Zdenko Podobny
Did you check the documentation? I am not familiar with this Android implementation, but I would expect that the API example in the tesseract documentation should work. Zdenko On Fri, 27 Jan 2023 at 10:52, Ju Ash wrote: > I am using 'cz.adaptech.tesseract4android:tesseract4android:4.3.0' in my > Androi

Re: [tesseract-ocr] YOLO / text detection - need help

2023-01-31 Thread Zdenko Podobny
Hello, something like this I have on my "try todo list". As far as I understand you need to train YOLO for text detections: https://github.com/Neerajj9/Text-Detection-using-Yolo-Algorithm-in-keras-tensorflow https://towardsdatascience.com/object-detection-on-newspaper-images-using-yolov3-85acfa56

Re: [tesseract-ocr] Linearized PDFs

2023-02-01 Thread Zdenko Podobny
AFAIK no, but you can use external tools like QPDF [1] to create a linearized PDF, e.g.: tesseract screen.png screen pdf qpdf --linearize screen.pdf final.pdf [1] https://qpdf.readthedocs.io/en/stable/linearization.html#writing-linearized-files Zdenko On Tue, 31 Jan 2023 at 19:50, 'Gerry St.Pierre' via tessera
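
The two commands from the reply can be chained from a script. This is only a sketch: it assumes both `tesseract` and `qpdf` are installed and on PATH, and the helper names are hypothetical.

```python
import subprocess


def linearized_pdf_commands(image, base, final_pdf):
    """Build the two command lines: OCR to <base>.pdf, then linearize it."""
    ocr = ["tesseract", image, base, "pdf"]
    linearize = ["qpdf", "--linearize", f"{base}.pdf", final_pdf]
    return [ocr, linearize]


def run_all(commands):
    """Run each command in order; stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = linearized_pdf_commands("screen.png", "screen", "final.pdf")
# run_all(commands)  # uncomment once both tools are available
```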

Re: [tesseract-ocr] How to enable tessedit_write_images on pytesseract ?

2023-02-04 Thread Zdenko Podobny
pytesseract is a wrapper around the tesseract executable, so I suggest using tesseract directly if something goes wrong... "tesseract --help-extra" is your friend. tessedit_write_images should be used this way: "-c tessedit_write_images=1" Zdenko On Sat, 4 Feb 2023 at 9:55, Mars wrote: > Hello, > > I am
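
A small sketch of turning parameter settings into the "-c name=value" form quoted above. The helper name is hypothetical; only the `-c` syntax itself comes from the reply.

```python
def config_from_params(params):
    """Join parameters into a '-c name=value -c name=value' option string."""
    return " ".join(f"-c {name}={value}" for name, value in params.items())


config = config_from_params({"tessedit_write_images": 1})
print(config)  # -c tessedit_write_images=1
```

With pytesseract, a string like this would normally be passed through the `config` argument, e.g. `pytesseract.image_to_string(image, config=config)`; running the tesseract executable directly with the same options is still the easier way to see what goes wrong.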

Re: [tesseract-ocr] How to extract non-text regions

2023-02-04 Thread Zdenko Podobny
The task you mention is called "document layout segmentation" or "document layout analysis" (https://en.wikipedia.org/wiki/Document_layout_analysis). As Muneeb mentioned, you can try https://layout-parser.github.io/, and https://github.com/qurator-spk/eynollah also looks promising. If you would

Re: [tesseract-ocr] Hyphenation postprocessing

2023-02-07 Thread Zdenko Podobny
There is a (similar) feature request: https://github.com/tesseract-ocr/tesseract/issues/728 Zdenko On Mon, 6 Feb 2023 at 3:57, Lars Aronsson wrote: > Is it possible to instruct tesseract for the image: > > Let us build a snow- > man on the lawn. > > to output in txt format: > > Let us bui

Re: [tesseract-ocr] TesserAct implementation help

2023-02-07 Thread Zdenko Podobny
Please read the tesseract documentation https://github.com/tesseract-ocr/tessdoc - there is a simple and working example of a Tesseract implementation. Zdenko On Tue, 7 Feb 2023 at 22:16, Massimo wrote: > HI, > > I would like to make an open source application using Tesseract with > xamarin form. >

Re: [tesseract-ocr] Passing bounding box coordinates of an detected object from image to extract text

2023-02-09 Thread Zdenko Podobny
Hello, Unfortunately you did not provide any examples (input image, code), so we can just guess how to help you... If you have a bounding box (BB) I assume you already have the image in memory, so I would copy/crop the image to the BB and request to OCR only that. I would assume you also would nee
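
The crop-to-bounding-box idea can be sketched as a small helper that turns a (left, top, width, height) box into a crop rectangle that stays inside the image. The padding value and the box numbers are assumptions for illustration, not something from the thread.

```python
def clamp_box(box, image_size, pad=0):
    """Convert (left, top, width, height) to (x0, y0, x1, y1) inside the image."""
    left, top, width, height = box
    img_w, img_h = image_size
    x0 = max(0, left - pad)
    y0 = max(0, top - pad)
    x1 = min(img_w, left + width + pad)
    y1 = min(img_h, top + height + pad)
    return (x0, y0, x1, y1)


# Hypothetical detection result on a 640x480 image, padded by 8 pixels.
crop_box = clamp_box((10, 5, 100, 30), (640, 480), pad=8)
print(crop_box)  # (2, 0, 118, 43)
```

The resulting tuple has the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so the cropped region can then be handed to OCR on its own.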

Re: [tesseract-ocr] unable to decode below mentioned php generated images.

2023-02-25 Thread Zdenko Podobny
Tesseract does not support breaking CAPTCHAs. Zdenko On Sat, 25 Feb 2023 at 9:40, Ek Villain wrote: > Dear, > my using java languages to decode php generated images it decodes wrong > answers or wrong text. please help me to do. that thing. > > > Thanks. > > > > [image: captcha.jpg] > >

Re: [tesseract-ocr] How to get the correct text orientation with tesseract

2023-03-11 Thread Zdenko Podobny
t, Mar 11, 2023, 8:14 PM Zdenko Podobny wrote: > >> the latest code (5.3.0) (on windows) >> >> Zdenko >> >> >> On Sat, 11 Mar 2023 at 2:16, nguyen ngoc hai >> wrote: >> >>> Dear Zdenko, >>> >>> Thank you very much for y

Re: [tesseract-ocr] Facing trouble with Tesseract OCR (from v4 to v5) for python version upgrade (from Python 3.6 to Python 3.10)

2023-03-11 Thread Zdenko Podobny
First of all: it is good manners to provide a test case (working code + input & output). Next: there were improvements (e.g. https://github.com/tesseract-ocr/tesseract/commit/3a5e5089343798932d9952628acfdf56f3108c43) in providing better bounding boxes, so you will need to make a custom build with r

Re: [tesseract-ocr] tesseract returns random and spurious characters

2023-03-24 Thread Zdenko Podobny
Hello, unless you provide a test case for reproducing the problem (+ information about tesseract, language data, platform etc.), nobody can help you... Zdenko On Tue, 21 Jun 2022 at 19:25, Z. Jay wrote: > We have been using a competing OCR tool and are now evaluating a switch to > tesseract. Howev

Re: [tesseract-ocr] Training a new language to perform ocr on tesseract ?

2023-03-24 Thread Zdenko Podobny
Did you follow the instructions in https://github.com/tesseract-ocr/tesstrain#language-data ? Zdenko On Tue, 14 Mar 2023 at 10:59, Kunal Athreya wrote: > I have prepared the ocrd-testset.zip > > for the language I'm trying to tr

Re: [tesseract-ocr] Tesseract training for New font/language

2023-04-01 Thread Zdenko Podobny
Please have a look at https://github.com/tesseract-ocr/tesstrain (especially https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip) Zdenko On Fri, 31 Mar 2023 at 7:03, Ali Abedian wrote: > Hey everyone! I'm currently working on a personal project where I'm > training a new font

Re: [tesseract-ocr] Tesseract accuracy.

2023-04-01 Thread Zdenko Podobny
As the first step, I would suggest you read https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md Next: the LSTM model is trained on words/lines of text, so it could have a problem with "code". For images like these, legacy mode is perfect. E.g.: tesseract WCAZ.png - --psm 6 --oem 0 W C

Re: [tesseract-ocr] Getting Error: No such file or directory: 'data/foo/all-lstmf'

2023-04-24 Thread Zdenko Podobny
Did you install all the necessary dependencies? Did you check and fix all errors (before this error) in the training output? Zdenko On Tue, 25 Apr 2023 at 8:21, Madhav Pandey wrote: > Hi Everyone, > > I am relatively new to tesseract and OCR as whole. > > I have been trying to training do the setup

Re: [tesseract-ocr] help, tesseract not fun in windows 11!

2023-04-24 Thread Zdenko Podobny
Seems like you are not very familiar with the operating system you are using. Tesseract (executable) is a command line program (e.g. similar to "dir" or "copy") - so "it opens a window (like a black console) and closes automatically". '"tesseract" is not recognized as an internal or external comma
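
The "is not recognized as an internal or external command" message means the shell cannot find the executable on PATH. A quick way to check this from Python, as a sketch (the function name is hypothetical):

```python
import shutil


def find_tesseract(search_path=None):
    """Return the full path of the tesseract executable, or None if not found.

    search_path=None means: use the PATH of the current process.
    """
    return shutil.which("tesseract", path=search_path)


location = find_tesseract()
print(location or "tesseract is not on PATH")
```

If this prints the fallback message, either add the install directory to PATH or call tesseract by its full path from a console window.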

Re: [tesseract-ocr] Parameters to improve detection of sparse text

2023-04-25 Thread Zdenko Podobny
First of all - this input is a regular pdf (e.g. there is text instead of an image) - IMO it should be easier to extract accurate text from the file instead of OCRing it... Next: tesseract can handle simple layout analysis (e.g. book pages), but for complex layouts like that pdf, you need to use c

Re: [tesseract-ocr] Getting Error: No such file or directory: 'data/foo/all-lstmf'

2023-04-26 Thread Zdenko Podobny
make training TESSDATA=./usr/local/share/tessdata unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt" Failed to read data from: data/foo/all-gt This indicates you already ran a training that failed... Clean your training and start it once again. Pay at

Re: [tesseract-ocr] How to use tesseract with low resolution data

2023-05-01 Thread Zdenko Podobny
Try to post an example image. Make sure you tried the operations suggested in the tesseract documentation. Zdenko On Sun, 30 Apr 2023 at 19:28, Artur Giżycki wrote: > I have file with small resolution of latin letters and tesseract often > replace letters. How to train tesseract with low resolution lette

Re: [tesseract-ocr] Tesseract completely fails to recognize consolas font from high resolution image

2023-05-01 Thread Zdenko Podobny
1. Try to use the tesseract executable if there are any problems when using API/tesseract wrappers 2. Did you try image processing (as suggested by the tesseract documentation)? 3. Did you try custom image segmentation? Your image seems like a table and the tesseract layout analysis has a

Re: [tesseract-ocr] OCR problem with condensed text

2023-05-07 Thread Zdenko Podobny
Hello, yes, you can train tesseract with your images. Have a look at https://github.com/tesseract-ocr/tesstrain and an example project https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip You can retrain ( finetune) existing (e.g. just to add new letters/symbols or font) by using

Re: [tesseract-ocr] Specify target file name patterns?

2023-05-08 Thread Zdenko Podobny
Hello, your request is not clear to me (e.g. tesseract does not OCR pdf). Maybe it would be good if you described what you are doing (command), what the output of your command is, and what your desired output is. Zdenko On Mon, 8 May 2023 at 17:18, Rob Aaldijk wrote: > I may have missed it in te

Re: [tesseract-ocr] Should tesseract work on handwritten text?

2023-05-08 Thread Zdenko Podobny
No, it should not. As far as we know, tesseract is trained on printed text. Zdenko On Tue, 9 May 2023 at 6:57, Erez Arnon wrote: > On the attached image, it gives the following gibberish output > > tesseract img.jpg - -l eng > > Estimating resolution as 195 Error in boxClipToRectangle: box outside

Re: [tesseract-ocr] Need a step-by-step training guide with python code

2023-05-17 Thread Zdenko Podobny
Hello, I am not sure what you mean by "with python code", but please follow the instructions in https://github.com/tesseract-ocr/tesstrain Zdenko On Thu, 18 May 2023 at 6:46, Steve N wrote: > > Need a step-by-step training guide with python code > > -- > You received this message because you

Re: [tesseract-ocr] How to recognize font name and then ocrwith font-specific model?

2023-05-17 Thread Zdenko Podobny
Tesseract >=4.x does not provide font name information. Version 3.x had that feature, but it was not very reliable (many fonts are difficult to distinguish, especially after preprocessing for OCR). => OCR tools (also outside of tesseract) are IMO not suitable for this task. Zdenko št 18. 5. 202

Re: [tesseract-ocr] Web server pyhton error

2023-05-18 Thread Zdenko Podobny
Is your pytesseract installation correct [1]? [1] https://github.com/madmaze/pytesseract#installation Zdenko On Thu, 18 May 2023 at 8:57, Bhupinder Rana wrote: > Tell me i very tired two or three days i dont know .Because when we > install pip install pytesseract give error > virtualenv/AppOcr/

Re: [tesseract-ocr] traning new font

2023-05-18 Thread Zdenko Podobny
Which part of the documentation [1] is not clear to you? [1] https://tesseract-ocr.github.io/tessdoc/ Zdenko On Thu, 18 May 2023 at 10:47, Moshe Haim Makies wrote: > how can i traning new font which does not exist in the language database > and in addition it is from right to left but I have the ttf

Re: [tesseract-ocr] Batch command run doesnt work in windows 10

2023-05-23 Thread Zdenko Podobny
Hello, How do you run the script? What does "nothing happens" mean? Is there any output on the screen? Any error or message? Zdenko On Tue, 23 May 2023 at 16:31, Varun Sareen wrote: > I am using the below batch file to run a ocr job on muliple jpg's. But > nothing happens after I run the job.

Re: [tesseract-ocr] Batch command run doesnt work in windows 10

2023-05-23 Thread Zdenko Podobny
I guess the problem is caused by the space in the path. Try to rename the folder "SN RC", e.g. to "SN_RC". Zdenko On Tue, 23 May 2023 at 21:16, Varun Sareen wrote: > I have also explored the possibility of an alternate script as under: > > @echo off > setlocal enabledelayedexpansion > > set "SourcePath=C
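
Renaming the folder is the fix suggested above. As an alternative sketch (not what the reply proposes), a script can pass each path as a single list element so that spaces never need quoting. The folder and file names are placeholders, and the helper names are hypothetical.

```python
from pathlib import Path


def ocr_argv(image):
    """Build a tesseract call whose output base is the image path without suffix."""
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix(""))]


def batch_argvs(folder):
    """One command per .jpg in the folder, in sorted order."""
    return [ocr_argv(p) for p in sorted(Path(folder).glob("*.jpg"))]


# A path with spaces stays one argument, so no shell quoting is involved.
argv = ocr_argv("SN RC/page 1.jpg")
print(argv)
```

Each list can be run with `subprocess.run(argv, check=True)`; because no shell parses the command line, the space in "SN RC" is harmless.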

Re: [tesseract-ocr] Not clear on the prepare training text

2023-05-24 Thread Zdenko Podobny
Please have a look at the official training process - there is an example of training data. https://github.com/tesseract-ocr/tesstrain Zdenko On Wed, 24 May 2023 at 20:08, Daniel Azubuine wrote: > I'm trying to train an African language, and I'm not clear on the prepare > training text section. > >

Re: [tesseract-ocr] Want advice on how to proceed with Tesseract and reducing recognition errors

2023-05-24 Thread Zdenko Podobny
I tried your example image with tesseract executable: > tesseract FontExample.png - -c preserve_interword_spaces=1 #*%% DRIVER LICENSE STATUS: CLS C SUSPENDED *xx LIC LMT COND CLASS GRP TYP ISSUE DT EXPIR DT CDL DISQ PROB PRIV RESTRSTATUS I D 06-16-22 03-

Re: [tesseract-ocr] beginner question : legacy compoent not present but traineddata is here

2023-05-28 Thread Zdenko Podobny
Hi, do not post screenshots: copy & paste the console text to the forum/email. You did not write how you installed tesseract, but I guess you installed fast traineddata (best and fast are without legacy data). If you want to use the legacy engine, you have to download traineddata (and copy to the appropriat

Re: [tesseract-ocr] beginner question : legacy compoent not present but traineddata is here

2023-05-29 Thread Zdenko Podobny
Run: tesseract --help-extra, read it, and think about how to use some options for solving your problem ;-) Zdenko On Mon, 29 May 2023 at 15:53, echidne wrote: > Hi, > thank you very much it was the problem. If i replaced the fra.traineddata > by the one from the github account that fixed the bug.

Re: [tesseract-ocr] Help in Training Tesseract5 using purely windows OS and python

2023-06-05 Thread Zdenko Podobny
Hello, 1. If you are a newbie to tesseract 5, one of the worst things to do is to start with training tesseract. 2. I am not sure what you mean by "I have read the documentation available on github(readme.md) " - tesseract documentation is here: https://tesseract-ocr.github.io/tessdoc/

Re: [tesseract-ocr] Where to find documentation on config files and parameters?

2023-06-05 Thread Zdenko Podobny
Funny, but when I open your link ( github ) I see there: , STRING_MEMBER(tessedit_char_blacklist, "", "Blacklist of chars not to recognize", this->params()) BTW: did you try to ru

Re: [tesseract-ocr] Need help, how to recognize numbers from this image?

2023-06-05 Thread Zdenko Podobny
Follow the suggestions in https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md Zdenko On Mon, 5 Jun 2023 at 18:42, ChrisL wrote: > I have tried with psm 1 to psm 13, does not work well. > > Thanks > > -- > You received this message because you are subscribed to the Google Groups > "tess

Re: [tesseract-ocr] Getting Error: No such file or directory: 'data/foo/all-lstmf'

2023-06-06 Thread Zdenko Podobny
Do not create files manually. If "make training" does not work, it means: 1. you are missing some dependency or the input data are wrong 2. you also missed the error message for 1. I strongly suggest you start training from the beginning (including cloning tesstrain) and pay attention to all messages: g

Re: [tesseract-ocr] unicharset is not returning anything

2023-06-11 Thread Zdenko Podobny
Hello, 1. Version 4.x is old, outdated, and unsupported. Use the current tesseract version (5.3.1) 2. Which official training procedure do you follow? 3. Do you intentionally try to train the legacy engine (I assume based on your box file)? BTW: Legacy training was broken and it is

Re: [tesseract-ocr] Segmentation fault with `tesseract -v`

2023-06-16 Thread Zdenko Podobny
How did you build tesseract? What platform did you use? What compiler? etc. Please communicate details, otherwise you are alone with your problems... Does it crash when you run it from the command line? Zdenko On Fri, 16 Jun 2023 at 6:20, Abhishek Chaudhary wrote: > Hi, I'm building tesseract fr

Re: [tesseract-ocr] What is the lstm.train file used for?

2023-06-16 Thread Zdenko Podobny
It is used for training. Zdenko On Wed, 14 Jun 2023 at 11:31, Duy Khanh wrote: > In the "Makefile" file of tesstrain, there are parts where the following > command is executed: > ``` > tesseract --psm 13 lstm.train > ``` > > Why does it run tesseract with the lstm.train file? If I am running thi

Re: [tesseract-ocr] Building for iOS arm-64 produces x86_64 library

2023-06-18 Thread Zdenko Podobny
Hello, I am not a Mac user, but the following output indicates that autotools are not able to use g++ for arm-64: checking for arm-apple-darwin64-g++... no checking for arm-apple-darwin64-clang++... no Also, you try to force linking LIBS="-lz -lpng -ljpeg -ltiff", but configure claims tiffio.h (ti

Re: [tesseract-ocr] Building for iOS arm-64 produces x86_64 library

2023-06-20 Thread Zdenko Podobny
Please do not post only the last error - usually, there is a problem before it, and e.g. the configure output could indicate a lot... Make sure you check the issue tracker, where there are already some hints on what to check, e.g. https://github.com/tesseract-ocr/tesseract/issues/3980 https://github.com/tesser

Re: [tesseract-ocr] Unable to generate Hindi line images using text2image

2023-06-20 Thread Zdenko Podobny
Please follow the official training procedure [1], read the official docs [2], or complain to the author of the tutorial you decided to follow. [1] https://github.com/tesseract-ocr/tesstrain [2] https://tesseract-ocr.github.io/tessdoc/ Zdenko On Tue, 20 Jun 2023 at 10:39, abhilash rao wrote: > Hi

Re: [tesseract-ocr] Runic OCR with tesseract

2023-06-20 Thread Zdenko Podobny
https://github.com/tesseract-ocr/langdata and https://github.com/tesseract-ocr/langdata_lstm provide input data that could be useful for tesseract training. I am not aware of Runic traineddata released by Google or contributors => you will need to create it by yourself. Zdenko ut 20. 6. 2023 o

Re: [tesseract-ocr] Original training data for eng.traineddata

2023-06-20 Thread Zdenko Podobny
With opensourced data you will not be able to create (from scratch) the same quality traineddata as Google provided. However there are some projects that fine tuned Google model successfully e.g. (UB-Mannheim/: https://madoc.bib.uni-mannheim.de/53748/ ) Zdenko st 21. 6. 2023 o 4:38 Duy Khanh n

Re: [tesseract-ocr] Regarding facing issue in tesseract download

2023-06-22 Thread Zdenko Podobny
Hello, If you are serious about getting help, provide details (we have no clue what your system is, we do not know how you try to extract an image from pdf etc...). Make sure you read the tesseract documentation first. Zdenko On Thu, 22 Jun 2023 at 14:48, Aniket Kumar wrote: > In my system When I

Re: [tesseract-ocr] START_MODEL gives Segmentation Failure Error

2023-06-24 Thread Zdenko Podobny
Hello, If you are really looking for help, you need to provide full details (e.g. the whole log of training, how you installed tesseract, which version of tesseract, how you installed the model (especially the hin model), an example of training data that helps to replicate the "Segmentation Failure", etc.). Zdenk

Re: [tesseract-ocr] Failed to load list of training filenames from data/Chin/list.train

2023-06-24 Thread Zdenko Podobny
Please provide the full log of training, including how you installed tesseract, the training tool, etc. Zdenko On Thu, 22 Jun 2023 at 11:15, abhilash rao wrote: > Hi guys so i am trying to train tesseract using wsl and when i execute the > following training command TESSDATA_PREFIX=.../tessdata mak

Re: [tesseract-ocr] Any ways to further improve OCR results

2023-06-27 Thread Zdenko Podobny
Without an example image, nobody can help you. Zdenko On Tue, 27 Jun 2023 at 12:01, Lee Kar Yee wrote: > Hi all, > > I am new to Tesseract OCR. I am trying to achieve extracting alphabets and > numbers from images. > These images are being converted from a mp4 video into frames as JPG. > > While

Re: [tesseract-ocr] OCRmyPDF and Tesseract not making PDFs searchable

2023-07-03 Thread Zdenko Podobny
1. Provide also example files (input, output) 2. Tesseract does not accept pdf (it needs an image as input), so at least 3. seems to be a problem of OCRmyPDF. Provide also the output of the "tesseract --version" command. Zdenko On Mon, 3 Jul 2023 at 21:24, Filippos Koliopanos wrote: > > Hello, > > I

Re: [tesseract-ocr] 3 signs read -> result is an extra symbol?

2023-07-08 Thread Zdenko Podobny
It is not a bug. Use a better text editor (that supports utf-8). Zdenko On Fri, 7 Jul 2023 at 8:15, z20leh wrote: > Hallo, > i use ubuntu-22.04.2-desktop-amd64 > tesseract 4.1.1 leptonica-1.82.0 > > i have small picutres with only 3 numbers. > For example: > 1,02 > in the outputfile of tesser

Re: [tesseract-ocr] Unable to get conversion from colorful odia pdf

2023-07-08 Thread Zdenko Podobny
If you are interested in getting help, please provide a description/images of what you are doing/using. Zdenko On Wed, 5 Jul 2023 at 7:17, Sailesh Agrawal wrote: > Hi, this is Sailesh > I have been trying to use tesseract for extracting oriya test from pdf, it > is working fine will black and white p

Re: [tesseract-ocr] START_MODEL gives Segmentation Failure Error

2023-07-08 Thread Zdenko Podobny
Let's start with the basics: The current leptonica version is 1.83.1 https://github.com/DanBloomberg/leptonica/releases The current tesseract version is 5.3.1 https://github.com/tesseract-ocr/tesseract/releases Use the latest version if there is a problem. Nobody wants to waste time with (probabl

Re: [tesseract-ocr] libtesseract skip OCR, just create invisible text layer

2023-07-08 Thread Zdenko Podobny
No, it is not possible (tesseract uses an image used for OCR for pdf creation, OCR output for the position of text...) Zdenko On Wed, 5 Jul 2023 at 7:12, lbr wrote: > I'm trying to create a searchable pdf out of a scanned one. I want to use > Textract as an OCR engine instead of Tesseract. Is th

Re: [tesseract-ocr] Help Using tesstrain for machine generated display

2023-07-08 Thread Zdenko Podobny
Have a look at https://github.com/tesseract-ocr/tesseract/issues/2342 and search for "tesseract OCR dot matrix", there are several suggestions on how to improve OCR results e.g. https://jeffreymorgan.io/articles/improve-dot-matrix-ocr-performance-tutorial/ PS: it does not make sense to post custom traine

Re: [tesseract-ocr] Any ways to further improve OCR results

2023-07-08 Thread Zdenko Podobny
I am not sure what you mean by "I have tried setting the Region of Interest (ROI) ", but when I cut region and pre-processed it as described in the documentation I got the correct results: tesseract frame_1-ROI1_preprocessed.png - --psm 7 GOH SCE YUAN tesseract frame_1-ROI2_preprocessed.png - --p

Re: [tesseract-ocr] OCR inconistencies

2023-07-13 Thread Zdenko Podobny
Hello, I am not sure what you mean by "Redact 5.3.1", but please provide a test case to reproduce the problem. For me tesseract works: tesseract incon.png - --- Hidden text -- Zdenko On Wed, 12 Jul 2023 at 16:37, Jamiel Impoy wrote: > Hello, > > For Redact 5.3.1, there is a strange edge c

Re: [tesseract-ocr] tesseract runs but gives no output

2023-07-14 Thread Zdenko Podobny
tesseract d:\temp\temp\Screenshot_20230601_102638.jpg -l eng+hin 1>>c:\temp\temp2.txt is not the correct command. Did you mean: tesseract d:\temp\temp\Screenshot_20230601_102638.jpg output -l eng+hin 1>>c:\temp\temp2.txt Please consult tesseract --help Zdenko On Fri, 14 Jul 2023 at 14:19, Ales Ro
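
The point of the correction is the argument order: image first, then an output base, then options. A sketch that makes the output base explicit; the helper name is hypothetical, and using "stdout" as the output base (so the text goes to standard output) is an assumption, since the reply itself writes to a file base called "output".

```python
def tesseract_argv(image, outputbase="stdout", langs=("eng",)):
    """Build 'tesseract IMAGE OUTPUTBASE -l LANG+LANG' as an argument list."""
    return ["tesseract", image, outputbase, "-l", "+".join(langs)]


argv = tesseract_argv(
    r"d:\temp\temp\Screenshot_20230601_102638.jpg",
    langs=("eng", "hin"),
)
print(argv)
```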

Re: [tesseract-ocr] missing tesseract_opencl_profile_devices.dat (or how to disable OpenCL)

2023-07-16 Thread Zdenko Podobny
There is no way to disable OpenCL at run time. OpenCL is disabled by default and marked as experimental; its use is not recommended on the forum, the issue tracker, etc. It is there (as a compile option) only as a starting point for possible developers. Zdenko ne 16. 7. 2023 o 17:21 Markus Leuthold napísal(

Re: [tesseract-ocr] missing tesseract_opencl_profile_devices.dat (or how to disable OpenCL)

2023-07-16 Thread Zdenko Podobny
It is incompetent and irresponsible to use experimental code in production/distribution. Zdenko ne 16. 7. 2023 o 21:13 Markus Leuthold napísal(a): > It looks like OpenSuse TW builds the package with "--enable-opencl" > > https://build.opensuse.org/package/view_file/openSUSE:Factory/tessera

Re: [tesseract-ocr] Tesseract-ocr in quiet mode

2023-07-23 Thread Zdenko Podobny
It is not a tesseract problem but a VB one. Proof of this can be found in pytesseract, which calls the tesseract executable without a console window. Zdenko ne 23. 7. 2023 o 15:55 nor s napísal(a): > Is there a way to have Tesseract run without producing a Dos window? I'm > incorporating a call to Tess
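As an illustration of the remark about pytesseract: on Windows, a caller can ask the OS to hide the child process window when launching the executable. The sketch below shows that idea with Python's standard `subprocess` module; the helper names are hypothetical, and it is only assumed (not confirmed in this thread) that pytesseract works along these lines.

```python
import subprocess
import sys


def hidden_console_kwargs():
    """Extra subprocess arguments that suppress the console window.

    Hypothetical helper: on Windows it sets STARTUPINFO so the child
    window is hidden; on other platforms it returns an empty dict.
    """
    kwargs = {}
    if sys.platform == "win32":
        startupinfo = subprocess.STARTUPINFO()
        startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
        startupinfo.wShowWindow = subprocess.SW_HIDE
        kwargs["startupinfo"] = startupinfo
    return kwargs


def run_tesseract_quietly(image, outputbase="stdout"):
    """Run tesseract and capture its output without showing a window."""
    return subprocess.run(
        ["tesseract", image, outputbase],
        capture_output=True,
        text=True,
        **hidden_console_kwargs(),
    )
```

A VB program would need the equivalent setting in its own process-launching call; that part is outside tesseract itself.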

Re: [tesseract-ocr] SetRectangle change?

2023-08-01 Thread Zdenko Podobny
Yes, there is a problem with SetRectangle or there is a mismatch between other API functions (e.g. GetThresholdedImage). It could be demonstrated with the attached simple code. According to API [1] SetRectangle(left, *top*, width, height) e.g. SetRectangle(left, top, width, height *.3) should OCR

Re: [tesseract-ocr] only english language is recoganising

2023-08-17 Thread Zdenko Podobny
We are sorry, but we have no clue what you are doing. Please provide the details for replicating your problem. Zdenko so 12. 8. 2023 o 20:25 V S KARTHIK napísal(a): > Hi, > malaylam or any other language is not extracting from image why?anybody > knows? > > -- > You received this message beca

Re: [tesseract-ocr] Question reg. Telugu ; char missing in ocr ; how to fix ?

2023-08-17 Thread Zdenko Podobny
Please provide details of what you are doing (including the Tesseract version, OS, and which tessdata you used...). Make sure you read the tesseract documentation, and please also provide details on which suggested solution you used and which char is missing (as not everybody is familiar with Telu

Re: [tesseract-ocr] Suggestions for Windows 10 x64 build issue

2023-08-20 Thread Zdenko Podobny
Maybe you should provide a simple test case for replicating the problem (including information on how you built tesseract and leptonica). E.g. SetRectangle_test.cpp (from https://groups.google.com/g/tesseract-ocr/c/PMHq6YSpRRE/m/Z2DCrgQlAAAJ) links without problems for me: cl /EHsc SetRectangle

Re: [tesseract-ocr] Whitelist is not accepting special characters

2023-08-27 Thread Zdenko Podobny
IMO there is no need to use psm and whitelist: tesseract text.png - -l fast/script/Latin Estimating resolution as 274 Ñato ñelo ñaña álca moño Ñoko niño niña chillňa élif For Windows I guess there could be a problem with UTF-8 in the terminal... Zdenko ne 27. 8. 2023 o 6:25 Shadya S. napísa

Re: [tesseract-ocr] Preprocess screenshot image before tesseract.

2023-08-29 Thread Zdenko Podobny
Please do not send compressed images (rar, zip) to the mailing list. Post them somewhere or use an appropriate image format to decrease their size (renaming a bmp file to png does not work). Zdenko ut 29. 8. 2023 o 9:01 Ajay Pandya napísal(a): > Hello Everyone, > > Can anyone help me with the bet

Re: [tesseract-ocr] Normalization failed for string

2023-09-14 Thread Zdenko Podobny
unicharset is created automatically (by official training procedure https://github.com/tesseract-ocr/tesstrain) Zdenko št 14. 9. 2023 o 13:56 Ali hussain napísal(a): > I have faced in my own trianed_text this normalization error. I think the > main problem is * ্য*in these words. and i di

Re: [tesseract-ocr] Strange behaviour of Tesseract

2023-09-14 Thread Zdenko Podobny
https://github.com/tesseract-ocr/tesseract/issues/845 Zdenko št 14. 9. 2023 o 16:49 Gilad Pellaeon napísal(a): > Hi, > > I am new to Tesseract. I searched for an OCR library, found Tesseract and > now I want to use it for a specific measure protocol. > > I built Tesseract 5.3.2 from source and

Re: [tesseract-ocr] Strange behaviour of Tesseract

2023-09-14 Thread Zdenko Podobny
> > Is it still broken in version 5? The thread you posted is from 2017! [image: image.png] Zdenko št 14. 9. 2023 o 17:10 Gilad Pellaeon napísal(a): > Is it still broken in version 5? The thread you posted is from 2017! > > One thing I noticed in the meantime: I stored my PNGs with paint.net

Re: [tesseract-ocr] Tesseract Custom Model Not Recognized after Training

2023-09-18 Thread Zdenko Podobny
Unfortunately you hid all the important information (e.g. how did you run training? how did you run tesseract, including tesseract options, exact command or code, ...?), so just some hints: > Error: LSTM requested, but not present!! This implies that the requested traineddata file does not contain ne

Re: [tesseract-ocr] how to manual install tesseract-ocr all code include third library code build without cmake in windows

2023-09-21 Thread Zdenko Podobny
Why do you want to compile tesseract? Zdenko št 21. 9. 2023 o 15:26 Phoenix Tree napísal(a): > i am noob. > > some limit in my windows machine , > I can't have network, I must manual download tesseract-ocr all code > include third library code > can't use cmake > but can write python script >

Re: [tesseract-ocr] quality of recognition of customer invoices

2023-09-22 Thread Zdenko Podobny
I know there are (were) people at the forum that implemented Tesseract as part of invoice processing - but as a commercial solution. It is not as easy as it looks: there is a need for a custom solution for text detection (e.g. skipping logos and other graphics, possible handwriting). As far as I r

Re: [tesseract-ocr] Multiple colours text in an image

2023-10-07 Thread Zdenko Podobny
Hello, this is about image preprocessing/thresholding rather than tesseract... Please post an example image so tesseract users can test it and suggest a possible solution. Zdenko št 21. 9. 2023 o 13:04 Iago Giné napísal(a): > Hi all, > > Is there some option to tell tesseract-ocr that there i

Re: [tesseract-ocr] "Leptonica was build without TIFF support! Disabling TIFF support..."

2023-10-08 Thread Zdenko Podobny
Please provide full logs including installation, configure parameters etc. - not screenshots. Make sure you have only one installation of the leptonica library. Make your own test of whether leptonica is built with TIFF support. Use the release target and not debug. Zdenko ne 8. 10. 2023 o 21:56 DJuego Director De Jueg

Re: [tesseract-ocr] Deserialize Header Failed

2023-10-14 Thread Zdenko Podobny
Hello, tesseract works out of the box. What does not work are you users, downloading Tesseract at night and jumping to Tesseract training. Training requires knowledge and experience that you will not get by following some random internet tutorials (most of them are outdated, pretending to be succ
