Re: [tesseract-ocr] Tessarct won't recognise single characters

2024-07-14 Thread Zdenko Podobny
> > custom_config = r' -l ' + 'eng' + '--psm 6 What is the point of this? To slow down the script? Zdenko ne 14. 7. 2024 o 15:56 René JM Clais napísal(a): > import cv2 > import pytesseract as tesser > > > originalImage = cv2.imread("myfile.jpg") #myfile.jpg ===> original image > > (th
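
An editorial sketch of the code-quality remark above, not part of the original thread. It assumes the fragment is exactly as quoted here (the snippet is truncated, so the real script may differ). As quoted, the three pieces join without a space between the language and the page segmentation option, and a single literal expresses the same intent more plainly. The name `concatenated` is illustrative only.

```python
# As quoted in the thread: three string pieces glued together at run time.
concatenated = r' -l ' + 'eng' + '--psm 6'

# One plain literal is easier to read and makes the spacing explicit.
custom_config = '-l eng --psm 6'

print(repr(concatenated))   # shows that 'eng' and '--psm' run together
print(repr(custom_config))
```

Splitting `custom_config` on whitespace gives four separate arguments, which is what a command line expects.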

Re: [tesseract-ocr] Tessarct won't recognise single characters

2024-07-14 Thread Zdenko Podobny
So you do not understand the code you posted? Zdenko ne 14. 7. 2024 o 19:44 René JM Clais napísal(a): > I don't understand what do you mean ? > > Le dim. 14 juil. 2024 à 16:13, Zdenko Podobny a écrit : > >> custom_config = r' -l ' + 'eng' + 

Re: [tesseract-ocr] Tessarct won't recognise single characters

2024-07-15 Thread Zdenko Podobny
> Le dim. 14 juil. 2024 à 19:47, Zdenko Podobny a écrit : > >> So you do not understand the code you posted? >> >> Zdenko >> >> >> ne 14. 7. 2024 o 19:44 René JM Clais napísal(a): >> >>> I don't understand what do you m

Re: [tesseract-ocr] Tessarct won't recognise single characters

2024-07-15 Thread Zdenko Podobny
nderstand the code, I have trouble > with the last module what do you think? > Since that I didn’t study and I am getting farther and further away. > I appreciate your tips. > > > On Mon, 15 Jul 2024 at 10:03 Zdenko Podobny wrote: > >> My remark is about code quality. Cod

Re: [tesseract-ocr] A few characters being misrecognized

2024-07-26 Thread Zdenko Podobny
tesseract img1.png - --psm 6 -l fra Juccsus tesseract img2.png - --psm 6 -l fra Bladë Zdenko pi 19. 7. 2024 o 5:12 Péter Györök napísal(a): > I'm using this command: > tesseract file.png - --psm 6 -l script/Latin > > img1.png returns "JUCcCcsus" instead of "Juccsus". > img2.png returns "Bla

Re: [tesseract-ocr] Re: How to prevern Tesseract from interpreting noise as characters

2024-08-04 Thread Zdenko Podobny
tesseract unnamed.jpg - Estimating resolution as 182 e.g. no recognized word... So the problem could be in the parameters you used for OCR... Before OCR I suggest image preprocessing and maybe the detection of empty pages. Have a look at leptonica example for Normalize for uneven illumination (p

Re: [tesseract-ocr] Converting colored background and colored characters to text with the Tesseract library

2024-08-04 Thread Zdenko Podobny
Captcha was created to fool OCR. Zdenko po 5. 8. 2024 o 7:27 Emre Batu napísal(a): > [image: 20240804211345.png] Hello everyone. I am using the Tesseract > library in a C# application to analyze images. However, the image I want to > convert to text contains colored characters and a colored

Re: [tesseract-ocr] Issue with Tesseract OCR: Difficulty Detecting White Text on Blue Background

2024-08-22 Thread Zdenko Podobny
Tesseract is the OCR engine and it is not a text detection tool. If you pass just blue button to tesseract, it has no problem to extract text: tesseract blue_button.png - Sign in Zdenko št 22. 8. 2024 o 9:11 Abdul Kalam Shaik napísal(a): > Thanks Ger for your response. So, my use case is lik

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-05 Thread Zdenko Podobny
have a look at provided example ocrd-testset.zip Zdenko ut 3. 9. 2024 o 16:04 'Danny' via tesseract-ocr < tesseract-ocr@googlegroups.com> napísal(a): > @zdenop wrote: > | Tesseract LSTM engine (tesseract >=v4) training scr

Re: [tesseract-ocr] Tesseract 5 with dnf

2024-09-05 Thread Zdenko Podobny
No. We do not distribute binary packages. Volunteers create and maintain them. Zdenko št 5. 9. 2024 o 20:56 Chris Crutts (agentc313) napísal(a): > on my Oracle Linux 8.10 distribution, doing > > $ sudo dnf install tesseract > > installs tesseract version 4.1.1-2.el8 and leptonica version 1.76.

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-05 Thread Zdenko Podobny
What about reading tesstrain Readme and using the example data to understand the training process better? Zdenko št 5. 9. 2024 o 17:41 'Danny' via tesseract-ocr < tesseract-ocr@googlegroups.com> napísal(a): > Hi Zdenko, > Thanks for the response. However, ocrd-testset.zip contains training > i

Re: [tesseract-ocr] Remove the thin horizontal line

2024-09-06 Thread Zdenko Podobny
have a look at http://www.leptonica.org/line-removal.html The source code is here: https://github.com/DanBloomberg/leptonica/blob/master/prog/lineremoval_reg.c Zdenko pi 6. 9. 2024 o 11:08 Sundar Andaperumal napísal(a): > Hi, > > I am trying to remove the thin horizontal line; when doing so t

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-07 Thread Zdenko Podobny
tesstrain is a tested method to train/improve tesseract language models. It creates box files for you. You can try your own ways, but your problems are your problems and you should not expect somebody to adjust the code to your needs. Of course, you are welcome to contribute your solution. Zdenko

Re: [tesseract-ocr] tesseract-ocr pdf input to searchable pdf (ocr-ed) and djvu input to searchable pdf

2019-10-21 Thread Zdenko Podobny
Yes, tesseract can create a searchable pdf (I am not sure how you define whether the process is reliable...). Tesseract input must be an image (or a list of images in a text file), so you cannot directly convert pdf or djvu files to searchable pdf. But there are tools like OCRmyPDF[1] that can help you with convert

Re: [tesseract-ocr] Accuracy with non-standard words consisting of random combinations/mix of digits + letters/characters

2019-10-22 Thread Zdenko Podobny
I am afraid that such a small fraction of text (with just letters that are commonly misinterpreted, like S or 5 or ?) cannot be recognized with 100% accuracy. Try to use it in some context (line). Zdenko po 21. 10. 2019 o 20:22 Ast napísal(a): > I've spent a good amount of time looking how to resolve this is

Re: [tesseract-ocr] Tesseract Tool -Reg

2019-10-22 Thread Zdenko Podobny
If you think that Microsoft package vc_redist.x86.exe contains malware, write to Microsoft. Or change your tool for malware detection. Zdenko ut 22. 10. 2019 o 12:30 MATHANKUMAR m napísal(a): > Hi , > > This is Mathanku

Re: [tesseract-ocr] Tesseract ocr failed to recognize number from number plate images

2019-10-22 Thread Zdenko Podobny
Unless you provide clear images (black letters on a white background, ideally with straight text, although skew could be handled by leptonica), you cannot expect tesseract to give you correct results. Zdenko ut 22. 10. 2019 o 10:00 Sangharsh Kamble napísal(a): > [image: 2.jpeg] > > [image: 4

Re: [tesseract-ocr] Accuracy with non-standard words consisting of random combinations/mix of digits + letters/characters

2019-10-23 Thread Zdenko Podobny
When I run: tesseract code_10_dejavu_sans_mono.png - I get the result *6X279SWKF* - i.e. no preprocessing is needed. Also, someone in the past posted an analysis to the forum, which showed (AFAIR) that increasing the size of letters over 30pt causes problems for tesseract 4. Zdenko st 23. 10. 2019 o 3:11 Ast napí

Re: [tesseract-ocr] Generating a PDF with Tesseract C++API (4.1Version)

2019-10-25 Thread Zdenko Podobny
Try something like this: #include #include #include #include int main() { const char* datapath = "tessdata"; std::string language_ = "deu"; std::string inputFile_ = "input.png"; const char* outputbase = "output"; tesseract::TessBaseAPI *api100 = new tesseract::TessBaseAP

Re: [tesseract-ocr] Re: Generating a PDF with Tesseract C++API (4.1Version)

2019-10-26 Thread Zdenko Podobny
Why do you think there is problem in tesseract? output.pdf is open without problem in acrobat reader, chrome/chromium, sumatrapdf. output.pdf pass without error on https://www.pdf-online.com/osa/validate.aspx, https://www.datalogics.com/products/pdftools/pdf-checker/ and https://www.pdfen.com/pdf-

Re: [tesseract-ocr] Re: Generating a PDF with Tesseract C++API (4.1Version)

2019-10-26 Thread Zdenko Podobny
Build it yourself - read tesseract wiki about possibilities. Zdenko so 26. 10. 2019 o 19:02 Ivica Anic napísal(a): > Zdenko: > can you please to say me exactly URL where I can to download > tesseract.libs , leptonica.libs (and .dll's) and tessdata > > Am Freitag, 25. Oktober 2019 16:35:14 UTC+

Re: [tesseract-ocr] tesseract and grids

2019-10-28 Thread Zdenko Podobny
It is a known issue. Zdenko po 28. 10. 2019 o 18:46 Herb Vogel napísal(a): > I'm having issues with tesseract identifying text inside of grids in the > pdf. when i convert to string... the text that was inside of the grids is > missing. > > -- > You received this message because you are subscrib

Re: [tesseract-ocr] Is it possible to pass a numpy array to Tesseract, instead of saving it to the disk.

2019-10-30 Thread Zdenko Podobny
It is not possible with PyTesseract, as it uses the tesseract executable with input from disk. If you use the tesseract API directly, you need to convert the numpy array to a PIX (leptonica) structure [1]. [1] https://stackoverflow.com/questions/55195932/typeerror-initializer-for-ctype-unsigned-int-must-be-a-cd

Re: [tesseract-ocr] Any plans for CUDA/GPU support?

2019-10-30 Thread Zdenko Podobny
AFAIK there are no official plans for this (there were some companies that were thinking providing this feature, but actually there is no result from this). Zdenko ut 29. 10. 2019 o 13:25 Alex Giokas napísal(a): > Hello, > > First of all let me start by saying that I greatly appreciate the amo

Re: [tesseract-ocr] Re: Is it possible to pass a numpy array to Tesseract, instead of saving it to the disk.

2019-10-30 Thread Zdenko Podobny
The tesseract executable can read image data (not numpy!) from stdin and write the result to stdout, so at least the disk IO operations can be avoided. Not sure if the first part (reading from stdin) can be implemented in pytesseract, but the second part should be no problem. If somebody is looking for performance seriousl
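
An editorial sketch of the stdin/stdout idea described above, not code from the thread. It assumes a `tesseract` executable on PATH and uses the special file names `stdin` and `stdout` accepted by the command-line tool. The helper names `build_stdin_stdout_cmd` and `ocr_bytes`, and the default language and page segmentation mode, are hypothetical choices for illustration.

```python
import shutil
import subprocess


def build_stdin_stdout_cmd(lang="eng", psm=6):
    """Return an argv list that makes tesseract read stdin and write stdout."""
    # "stdin" and "stdout" replace the usual input file and output base name.
    return ["tesseract", "stdin", "stdout", "-l", lang, "--psm", str(psm)]


def ocr_bytes(image_bytes, lang="eng", psm=6):
    """Send encoded image bytes (e.g. the content of a PNG file) to tesseract."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_stdin_stdout_cmd(lang, psm),
        input=image_bytes,          # encoded image, not a raw numpy buffer
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

A numpy array would first have to be encoded into an image format (PNG, TIFF, ...) in memory before being passed as `image_bytes`; that step is left out here.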

Re: [tesseract-ocr] alternate fully-searchable pdf?

2019-10-31 Thread Zdenko Podobny
You can do whatever you want if you have the knowledge: e.g. if you are able to add annotations to a pdf, you can (tesseract will provide you the txt output). IMO the right way is to fix the broken tool that pretends to be able to search pdf... Zdenko št 31. 10. 2019 o 6:55 John Lussmyer napísal(a): > I'm trying to fig

Re: [tesseract-ocr] Help

2019-11-01 Thread Zdenko Podobny
Please provide example image you try to OCR. Zdenko št 31. 10. 2019 o 22:34 Ishak DÖLEK napísal(a): > I want to train the manuscripts using tesseract v4. > Do I need to convert the image of manuscript into binary pictures to > train? What do you suggest to convert binary pictures if you need

Re: [tesseract-ocr] Using tesseract on python

2019-11-01 Thread Zdenko Podobny
Yes, there is - tesseract executable. Please read first available docs (wiki) before asking any other questions. Zdenko pi 1. 11. 2019 o 10:29 Purushotham Rao Eravalli napísal(a): > Hi I tried using pytessract for testing tesseract but it is giving null > output for few images. Is there a diff

Re: [tesseract-ocr] Re: Not getting complete text from JPG

2019-11-03 Thread Zdenko Podobny
Tesseract (at Google) was developed and extended in the Google Books (OCR) project and it works best on straight, clean lines of text. So it tends to skip graphics and noise. Your input is full of graphics. Therefore in this case you need to do your own page layout analysis and send only text regions (without bor

Re: [tesseract-ocr] Re: Help to refine captcha to use tesseract for automation

2019-11-10 Thread Zdenko Podobny
Tesseract is not designed to break captcha images. If there is captcha - there is a reason for it. If you try automatize some legitimate tasks contact owner/author of web for support. Zdenko po 11. 11. 2019 o 8:14 amo napísal(a): > I have the same question. > > -- > You received this messag

Re: [tesseract-ocr] Re: tesseract in Windows

2019-11-15 Thread Zdenko Podobny
If this discussion is to make any sense, you have to provide details and prove your statement (far better result on ubuntu). E.g. the image you tested, the exact command you used, the outputs you got, tesseract version information, the language model you used (+ check filesize)... Zdenko pi 15. 11. 2019 o 6:03 M

Re: [tesseract-ocr] Re: tesseract in Windows

2019-11-15 Thread Zdenko Podobny
ke the 't' representation shows as '!' at the > same time windows gives some drawbacks.please check with the JSON files $ > image. > > > On Fri, 15 Nov 2019 at 14:27, Zdenko Podobny wrote: > >> If this discussion should make any sense you have to provide details an

Re: [tesseract-ocr] Re: tesseract in Windows

2019-11-15 Thread Zdenko Podobny
ion of an text which is > identified and recognized then returned using python and stored in json > format for future use only. > > On Fri, 15 Nov 2019 at 17:07, Zdenko Podobny wrote: > >> Tesseract do not output json! >> Please do not report problem in tesse

Re: [tesseract-ocr] This image is very hard to recognize it

2019-11-22 Thread Zdenko Podobny
search forum for "captcha". Zdenko pi 22. 11. 2019 o 10:08 Toney Qi napísal(a): > Dear all, > > here is an issue which trouble me many days. > it looks like a simple image verification code, but it doesn't recognize > for me. > > so, could you help on this? anyone. > > Thank you very much. > >

Re: [tesseract-ocr] Using cmake to build standalone executable (statically-linked) ?

2019-11-27 Thread Zdenko Podobny
AFAIK vcpkg produces a static build of tesseract. https://github.com/microsoft/vcpkg/blob/master/ports/tesseract/portfile.cmake Zdenko st 27. 11. 2019 o 6:00 Teddy Zzang napísal(a): > I'm currently using cmake following to build tesseract > > cmake tesseract > cmake --build . --config Release

Re: [tesseract-ocr] Can't run / process images on Mac

2019-12-06 Thread Zdenko Podobny
IMO error message is clear enough: Error, cannot read input file whatever.fileformat: *No such file or directory* Zdenko ut 3. 12. 2019 o 7:43 Thomas Moine napísal(a): > > I can't code nor program so please be very precise and step-by-step in > your answers. Thanks. > > I need an accurate OCR

Re: [tesseract-ocr] I cannot use traineddata downloaded from Data Files

2019-12-08 Thread Zdenko Podobny
How did you download the files from the repository? Please check whether the files in /usr/share/tesseract-ocr/4.00/tessdata/ have the same size as in the repository. Zdenko so 7. 12. 2019 o 17:34 坂本聖 napísal(a): > Hi, > I want to use tesseract for Chinese words. So, first I tried to execute > the command
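
An editorial sketch of the file-size check suggested above, not code from the thread. It only lists local sizes; comparing them with the sizes shown in the repository is left to the reader. The function name `traineddata_sizes` is hypothetical.

```python
from pathlib import Path


def traineddata_sizes(tessdata_dir):
    """Map each *.traineddata file name in tessdata_dir to its size in bytes."""
    return {
        path.name: path.stat().st_size
        for path in sorted(Path(tessdata_dir).glob("*.traineddata"))
    }
```

Calling `traineddata_sizes("/usr/share/tesseract-ocr/4.00/tessdata/")` with the directory from the message would list the installed models. A file of only a few kilobytes is often a saved web page rather than the model itself, though that is a guess about this particular case.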

Re: [tesseract-ocr] tess-two with tessdata_fast crashes

2019-12-08 Thread Zdenko Podobny
If you want to use the API you need to spend some time with the docs and source code. You could find out quite quickly that CUBE was removed from tesseract and is not available in version 4. Zdenko ne 8. 12. 2019 o 2:37 NY C napísal(a): > Hi, I am using tess-two for OCR. > > > (Alex Chon version : h

Re: [tesseract-ocr] I cannot use traineddata downloaded from Data Files

2019-12-08 Thread Zdenko Podobny
what is output of: tesseract --version Zdenko ne 8. 12. 2019 o 15:55 坂本聖 napísal(a): > Thanks for your advice. > I downdloaded files by clicking the "download" button in > https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata. > And I moved the chi_sim.traineddata file > t

Re: [tesseract-ocr] Can Qt-box-editor be used with tesseract 4.0?

2019-12-26 Thread Zdenko Podobny
no. it was created for helping 3.x training. Zdenko št 26. 12. 2019 o 11:10 Ashwini Nande napísal(a): > Hi all, Can Qt-box-editor be used with tesseract 4.0? I want to create > traindedata from any image so want to generate/edit lstm box files. > > -- > You received this message because you ar

[tesseract-ocr] tesseract-ocr 4.1.1 release

2019-12-26 Thread Zdenko Podobny
Hello all, A stable version of the tesseract-ocr engine, 4.1.1, was released today [1]. This is a bugfix release for the 4.x branch. With this release the cppan build system was marked as obsolete. Its successor, software-network (aka sw), was implemented instead. Autotools and cmake are supported as main build syst

Re: [tesseract-ocr] Tesseract command line invocation in a Windows and Linux C++ appliction

2019-12-27 Thread Zdenko Podobny
If your project is C++, why do you need to invoke tesseract via the command line? Why would you not use the tesseract API? The process you describe is inefficient. Zdenko pi 27. 12. 2019 o 12:35 Pooja Pandey napísal(a): > Hi, > > I need to develop application in C++ which will run on Windows and Linux

Re: [tesseract-ocr] Trying to extract the text from image and tesseract is no returning text correctly.

2020-01-01 Thread Zdenko Podobny
Search internet (at the least this forum) for tesseract and captcha Zdenko st 1. 1. 2020 o 9:16 Durai K napísal(a): > Hi, > > I have following tesseract version installed on Windows 10 > > tesseract v5.0.0-alpha.20191030 > leptonica-1.78.0 > libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3)

[tesseract-ocr] Fwd: [DanBloomberg/leptonica] Release 1.79.0 - Leptonica version 1.79.0

2020-01-01 Thread Zdenko Podobny
FYI, Zdenko -- Forwarded message - Od: Dan Bloomberg Date: št 2. 1. 2020 o 1:19 Subject: [DanBloomberg/leptonica] Release 1.79.0 - Leptonica version 1.79.0 To: DanBloomberg/leptonica Cc: Subscribed Leptonica version 1.79.0

Re: [tesseract-ocr] GetComponentImages with tesseract::RIL_PARA returns the same results as tesseract::RIL_BLOCK

2020-01-03 Thread Zdenko Podobny
seems like you forget to attach you code, image, tesseract version details Zdenko pi 3. 1. 2020 o 17:13 Nils André napísal(a): > I'm trying to extract paragraphs from an image so I tried > GetComponentImages using tesseract::RIL_PARA but I just get the whole image. > > -- > You received th

Re: [tesseract-ocr] Fresh install not recognizing text like before

2020-01-04 Thread Zdenko Podobny
Your tesseract version seems to be strange (should be tesseract 5.0.0-alpha-551-g99df, but instead of git revision you have date) How did you get it? Please also provide input image for testing, command/code how you extract text from image and maybe more relevant information (e.g. with reinstalli

Re: [tesseract-ocr] PSM error

2020-01-08 Thread Zdenko Podobny
Do you use model with legacy engine support? Zdenko št 9. 1. 2020 o 8:43 MATHANKUMAR m napísal(a): > Hi, > > In tesserct-ocr config method i did used OEM & PSM for extract the > text from image. But during process i got an exception like below mentioned > > Error : "Error, unknown command

Re: [tesseract-ocr] PSM error

2020-01-09 Thread Zdenko Podobny
Do you understand what it mean when you use --oem 0? Zdenko št 9. 1. 2020 o 9:17 MATHANKUMAR m napísal(a): > Actually,I do not know how to set up this legacy type. but in OEM 0 &1 got > this error > > *Error: Tesseract (legacy) engine requested, but components are not > present in /usr/share/t

Re: [tesseract-ocr] PSM error

2020-01-09 Thread Zdenko Podobny
So let's summarize it: You asked tesseract to use the legacy engine with some language model. tesseract failed. Conclusion => you did not provide a language model that includes the legacy model. Zdenko št 9. 1. 2020 o 10:25 MATHANKUMAR m napísal(a): > yeah i can see it from help command & understand the

Re: [tesseract-ocr] Getting Tesseract Output as ANSI Encoding

2020-01-09 Thread Zdenko Podobny
Provide also detail for reproducing problem: input image and output file, tesseract version, how you get tesseract installed, which model you use for ocr... Zdenko št 9. 1. 2020 o 13:15 Manankumar Bhatt napísal(a): > Locale is English(United States) and OS is Windows 7. > > On Thursday, 9 Janu

Re: [tesseract-ocr] Inconsistent outputs between TEXT and hOCR formats

2020-01-11 Thread Zdenko Podobny
You use old tesseract version Dňa so 11. 1. 2020, 22:43 Matthew Getzin napísal(a): > Hello, > > I created an issue (see below) on Github. Not sure if it is a bug or > something for discussion forum... > > ### Environment > > * **Tesseract Version**: tesseract 4.0.0-beta.1 > leptonica-1.75.3

Re: [tesseract-ocr] Compiling tesseract 4 in Debian

2020-01-30 Thread Zdenko Podobny
It looks like another version of tesseract is installed. Uninstall the old version and reinstall the compiled one. Zdenko št 30. 1. 2020 o 5:38 lundissimo napísal(a): > I've downloaded the tesseract 4.0 Git repository on a system running > Debian linux. I've run autogen.sh and configure, then make a

Re: [tesseract-ocr] Compiling tesseract 4 in Debian

2020-01-31 Thread Zdenko Podobny
If you installed tesseract from git, it can not report version 4.0.1 - there was never such version . Zdenko št 30. 1. 2020 o 22:45 lundissimo napísal(a): > Please explain what leads you to believe there's another version of > tesseract ins

Re: [tesseract-ocr] Re: Compiling tesseract 4 in Debian

2020-01-31 Thread Zdenko Podobny
Tesseract need only leptonica. For getting tesseract please use https://github.com/tesseract-ocr/tesseract/releases Reported image libraries are output of leptonica function, so I suggest to have a look at leptonica build. If you need help please provide exact steps how you get leptonica, which co

Re: [tesseract-ocr] Re: Compiling tesseract 4 in Debian

2020-02-01 Thread Zdenko Podobny
If you are building any software from source, you should be familiar with your system and have knowledge of the build process. What you wrote indicates your system has only runtime libraries installed and you are missing the devel packages. Zdenko so 1. 2. 2020 o 1:18 lundissimo napísal(a): > Than

Re: [tesseract-ocr] approches used for language detection on images ...

2020-02-01 Thread Zdenko Podobny
You did not provide any example image, neither what kind of tools you would like to use (open source or proprietary), so... Just some additional tips to Lozenzo: 1. You can try to use tesseract for text detection too. See[1]. Maybe just use RIL_TEXTLINE, RIL_PARA or RIL_BLOCK instead of RIL_

Re: [tesseract-ocr] tesseract ocr to pdf from .tif file send from fax machine

2020-02-06 Thread Zdenko Podobny
try to use the latest version. Zdenko št 6. 2. 2020 o 20:09 George Varghese napísal(a): > tesseract installed on Windows 2012 R2 server > > tesseract version v5.0.0-alpha.20191030 with Leptonica > > command line tesseract a1.tif a -l eng -psm 4 --oem 1 -pdf - > > does create a pdf with whit

Re: [tesseract-ocr] i was trying to fetch data from image to sting, i got the error module is missing

2020-02-06 Thread Zdenko Podobny
Check out pytesseract doc how to use it. Reading docs can cure a lot of pain. Zdenko pi 7. 2. 2020 o 8:30 Sai Krishna napísal(a): > module 'pytesseract' has no attribute 'image_to_string' > > PS E:\python\OpenpyxlBlogPost-master\source\pms> pip install pytesseract > Requirement already satisf

Re: [tesseract-ocr] Re: tesseract ocr to pdf from .tif file send from fax machine

2020-02-07 Thread Zdenko Podobny
No ;-) You are using the unreleased version (5.alpha) of tesseract. So the latest version of 5.x is today code. Stable release version is 4.1.1 from December 26th (2019). Zdenko pi 7. 2. 2020 o 14:26 George Varghese napísal(a): > That was latest version for Windows 64 bit. > > On Thursday, Feb

Re: [tesseract-ocr] Re: tesseract ocr to pdf from .tif file send from fax machine

2020-02-07 Thread Zdenko Podobny
You can build it by yourself, or to wait until your packager build it, you can try to use appveyor artifact [1] from the last commit. [1] https://ci.appveyor.com/project/zdenop/tesseract/build/job/66l95n7ofxrs0xtf/artifacts Zdenko pi 7. 2. 2020 o 15:56 George Varghese napísal(a): > I found i

Re: [tesseract-ocr] Removing diagonal Text that intersect with the horizontal text I want to read.

2020-02-08 Thread Zdenko Podobny
Can you share original pdf to investigate if the problem could not be solved on pdf level (e.g. extract image from pdf without watermark) ? Your problem is not related to tesseract (or other way - tesseract is not tool that helps you remove watermark), so better option would be to post in on stack

Re: [tesseract-ocr] Tesseract OpenCL Selects Wrong Compute Device

2020-02-18 Thread Zdenko Podobny
Search forum and issue tracker for opencl topic. Zdenko st 19. 2. 2020 o 8:27 Tim Finnegan napísal(a): > I'm attempting to run GPU Acceleration during training using the OpenCL > libraries. > > I have built tesseract to use openCL, and installed the NVidia Compute > driver 440 on my Ubuntu 19.

Re: [tesseract-ocr] Re: Using tesseract on browser page insufficient

2020-02-20 Thread Zdenko Podobny
Why we should document how to use Ubuntu? You should be familiar with your OS. PPA repositories for each tesseract version are listed on https://tesseract-ocr.github.io/tessdoc/Home.html Zdenko št 20. 2. 2020 o 9:20 Alexander Dietz napísal(a): > With an update to version 4 (undocumented proced

Re: [tesseract-ocr] tesseract 3.3.0 always misinterpret few characters (desperate right now ...)

2020-02-20 Thread Zdenko Podobny
What is tesseract 3.3.0? I did not find it in https://github.com/tesseract-ocr/tesseract/releases Or did you mean 3.03-rc1 release on on Sep 20, 2014 ? Zdenko št 20. 2. 2020 o 14:25 Justin Yeh napísal(a): > Unfortunately tesseract 3.3.0 keeps misinterpreting characters such as B > and 8, or

Re: [tesseract-ocr] Re: how to use tesseract to detect table?

2020-02-25 Thread Zdenko Podobny
Article points to this code on github: https://github.com/huks0/tablerecognition/blob/master/celldetectextract.py Zdenko st 26. 2. 2020 o 7:41 Essam Zaky napísal(a): > would you download the article you described and attach it here , because > the medium site needs payed registration > > ‫في ا

Re: [tesseract-ocr] Re: how to use tesseract to detect table?

2020-02-26 Thread Zdenko Podobny
maybe have a look at https://github.com/tesseract-ocr/tesseract/issues/1714#issuecomment-588180969 (I have no time to test it yet) Zdenko st 26. 2. 2020 o 11:02 KOLLOL CHOWDHURY napísal(a): > Does anyone have solution to this? In the newer tesseract(4.x), the > option textord_dump_table_imag

Re: [tesseract-ocr] Re: how to use tesseract to detect table?

2020-02-26 Thread Zdenko Podobny
test image is https://miro.medium.com/max/2136/1*VdPb4yKCkz1RhXfafPw_rA.png ;-) Zdenko st 26. 2. 2020 o 8:03 Zdenko Podobny napísal(a): > Article points to this code on github: > https://github.com/huks0/tablerecognition/blob/master/celldetectextract.py > > > Zdenko > >

Re: [tesseract-ocr] WinError 5 PermissionError on Windows 10

2020-02-29 Thread Zdenko Podobny
Can you replicate problem with command line /"pure" tesseract? e,g, 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe' images/invoice-sample.jpg invoice-sample Zdenko pi 28. 2. 2020 o 20:31 Supharerk Thawillarp napísal(a): > > I'm new to tesseract and trying to follow tutorial on Windows 10 u

Re: [tesseract-ocr] WinError 5 PermissionError on Windows 10

2020-02-29 Thread Zdenko Podobny
This means there is problem with pytesseract/python permissions. Can you get output for pytesseract.get_tesseract_version()? Zdenko so 29. 2. 2020 o 12:10 Supharerk Thawillarp napísal(a): > No, the tesserect successfully run with output generated in textfile. > > (base) PS C:\Users\Supharerk\

Re: [tesseract-ocr] WinError 5 PermissionError on Windows 10

2020-02-29 Thread Zdenko Podobny
1. Make sure you have the latest version of tesseract. Then try this script and provide exact/full error message: import tempfile import cv2 import pytesseract from PIL import Image from pytesseract import Output pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesserac

Re: [tesseract-ocr] WinError 5 PermissionError on Windows 10

2020-03-01 Thread Zdenko Podobny
Hello, I am not able to reproduce error, errors come from here [1] where pytesseract tries to cleanup temporary files. You should report it to pytesseract project as there is no option to skip this code. Maybe you can try to modify this part of pytesseact code[2]: finally: cleanup(f.name) to

Re: [tesseract-ocr] WinError 5 PermissionError on Windows 10

2020-03-01 Thread Zdenko Podobny
anyway report it to pytesseract project, so it can be fixed - otherwise next update will bring it once again. Zdenko ne 1. 3. 2020 o 18:17 Supharerk Thawillarp napísal(a): > After diving in pytesseract.py I found one possible related issue in > the NamedTemporaryFile. > > According to the post

Re: [tesseract-ocr] Supplying a different DPI param per page

2020-03-09 Thread Zdenko Podobny
Just a quick reply (I did not test it :-) ): - tiff is a "container of images" and AFAIK each image can have its own resolution (DPI is just information for correct printing/displaying of the image) - tesseract should read a multi-page tiff image-by-image and process each one individually (includi

Re: [tesseract-ocr] Tesseract unable to read simple image correctly

2020-03-09 Thread Zdenko Podobny
Please write us what did you already tried from tesseract documentation. Zdenko po 9. 3. 2020 o 10:02 Velectico Consulting napísal(a): > *Environment* > Tesseract Version: tesseract v5.0.0-alpha.20200223 > Platform: Windows 64-bit > > *Problem: * > The attached image below is not read correctl

Re: [tesseract-ocr] Trying to build with OpenMP

2020-03-17 Thread Zdenko Podobny
Which version you try build (master) ? Did you search issue tracker / forum for problems with openmp? (There is a reason why it is turn off by default). Zdenko ut 17. 3. 2020 o 10:15 Jerry Andersson napísal(a): > Hello, I am trying to build with openmp and sw on a windows 7 machine but > I get

Re: [tesseract-ocr] Best export method

2020-03-19 Thread Zdenko Podobny
Checkout output to hocr (which is html output), tsv or pdf. See doc. Zdenko št 19. 3. 2020 o 8:04 Dayton napísal(a): > Hi All, > > I´m using Tesseract for Windows to OCR scanned documents and then format > the layout in Word in a later stage. > > The text extraction that I get in the .TXT outp
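
An editorial sketch of requesting the output formats named above, not code from the thread. It assumes the `hocr`, `tsv` and `pdf` config files that ship with tesseract are installed, and that several of them may be given in one call. The helper name `build_export_cmd` and the file names in the usage note are hypothetical.

```python
def build_export_cmd(image, outputbase, formats=("hocr", "tsv", "pdf")):
    """Return an argv list asking tesseract for the given output formats."""
    allowed = {"txt", "hocr", "tsv", "pdf"}
    unknown = [fmt for fmt in formats if fmt not in allowed]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    # Output formats are passed as config file names after the output base.
    return ["tesseract", image, outputbase, *formats]
```

For example, `build_export_cmd("scan.png", "scan")` yields a command that would write `scan.hocr`, `scan.tsv` and `scan.pdf` next to each other.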

Re: [tesseract-ocr] Re: Scan pdf file instead png

2020-03-28 Thread Zdenko Podobny
Tesseract OCRs images, not documents (pdf, docx, odt etc.). If you need multipage support, use the tif image format instead of pdf for scanning. Zdenko so 28. 3. 2020 o 20:42 Essam Zaky napísal(a): > What do you mean by "scan a pdf " ? > If you mean recognize pdf file , you can not recognize pdf f

Re: [tesseract-ocr] Can anyone tell the the improvement in 5.0.0-alpha

2020-04-02 Thread Zdenko Podobny
Just quick reply: Master branch (a.k.a 5.0.0-alpha) is development branch e.g. things there could be broken (e.g. build system or compatibility) ;-) Current stable branch/version is 4.1 where most patches from master were backported. If I remember correctly: differences between master and 4.1 a

Re: [tesseract-ocr] The text is not recognized from png

2020-04-07 Thread Zdenko Podobny
You can start with reading docs and then searching issue tracker and forum for "table". Zdenko ut 7. 4. 2020 o 7:38 amrapalli karan napísal(a): > I have this .pdf file which I am able to read only partially. I am using R > language to fetch the data from the pdf file which is uploaded in the f

Re: [tesseract-ocr] How to split a3 in single page

2020-04-07 Thread Zdenko Podobny
No. Tesseract is an OCR engine and not an image processing tool. Pdf export strictly follows the rule of not modifying the input image, i.e. if you have this need, you need to use other tools to create the pdf. Zdenko po 6. 4. 2020 o 23:51 Teo napísal(a): > I've this page, can I split this A3 scan in 2 A4, during the e

Re: [tesseract-ocr] 2 min on 1 page TIFF using Fast trained data

2020-04-13 Thread Zdenko Podobny
Why you decided to ignore instructions in comment https://github.com/tesseract-ocr/tesseract/issues/2946#issuecomment-612613461 ? Why we should care about your problems if you do not care? Zdenko ne 12. 4. 2020 o 16:00 Ravil R napísal(a): > I have my own simple Windows dll based on tesseractma

Re: [tesseract-ocr] 2 min on 1 page TIFF using Fast trained data

2020-04-13 Thread Zdenko Podobny
OS Name: Microsoft Windows 10 Pro OS Version:10.0.18362 N/A Build 18362 System Model: Latitude E5570 System Type: x64-based PC Processor(s): 1 Processor(s) Installed. [01]: Intel64 Family 6 Model 78

Re: [tesseract-ocr] 2 min on 1 page TIFF using Fast trained data

2020-04-14 Thread Zdenko Podobny
Without AVX support, tesseract 4/5 will be slow(er), so try to focus on this. Using more than one language will slow down OCR too... Zdenko On Tue, 14 Apr 2020 at 5:56, Ravil R wrote: > Oh you gave so much info, thanks! > My test exe file shows this version information: > tesseract 5.0.0 > leptonica-1.
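
One rough way to check whether a CPU advertises AVX at all is sketched below. It is an assumption-laden example, not anything from Tesseract itself: it parses text in the Linux `/proc/cpuinfo` format, so it does not apply directly to the Windows machine in this thread, and the helper names `cpu_flags` and `has_avx` are made up for illustration. It also says nothing about whether a given Tesseract build was compiled to use AVX.

```python
def cpu_flags(cpuinfo_text):
    """Collect the feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_avx(cpuinfo_text):
    """True only if the exact flag 'avx' is present (not just 'avx2')."""
    return "avx" in cpu_flags(cpuinfo_text)


# Hand-written sample; on Linux you would read the real /proc/cpuinfo instead.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2\n"
print(has_avx(sample))
```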

Re: [tesseract-ocr] 2 min on 1 page TIFF using Fast trained data

2020-04-15 Thread Zdenko Podobny
Just for future reference: for AVX (and ...) support, only tesseract needs to be rebuilt - it depends on the compiler and HW. Of course it makes sense to use the latest versions of tesseract's dependencies (because of security, bugfixes, etc.), but they have (AFAIK) minimal effect on tesseract speed (

Re: [tesseract-ocr] What is the working process of doing multiple images OCR using imagelist.txt

2020-04-17 Thread Zdenko Podobny
It loops over the file list [1], processing one filename at a time. [1] https://github.com/tesseract-ocr/tesseract/blob/cdebe13d81e2ad2a83be533886750f5491b25262/src/api/baseapi.cpp#L1007 Zdenko On Fri, 17 Apr 2020 at 12:42, mit wrote: > Hi, > > I want to know the internal memory working of tesseract
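
The behaviour described above can be pictured with a small sketch. This is not the C++ code behind the link, only an illustration of "one filename at a time": `iter_filelist`, `process_all` and the `recognize` callback are hypothetical names, and skipping blank lines is an assumption of this example.

```python
def iter_filelist(list_text):
    """Yield image paths from a file list: one path per line, blank lines skipped."""
    for line in list_text.splitlines():
        name = line.strip()
        if name:
            yield name


def process_all(list_text, recognize):
    """Run `recognize` on each listed file in order, one file at a time."""
    return [recognize(name) for name in iter_filelist(list_text)]


# `recognize` stands in for the real OCR call; here it just echoes the name.
pages = process_all("page1.png\npage2.png\n\n", lambda name: f"text of {name}")
print(pages)
```

The point of the sketch is that only one image is in flight at any moment; the list itself is just text.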

Re: [tesseract-ocr] Tessaract not able to output detected text

2020-04-28 Thread Zdenko Podobny
https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html Zdenko On Tue, 28 Apr 2020 at 20:26, payel roy wrote: > Hi Team, > > I am new to Tessaract. Following the code snippet. While running it, I > can't get result back from Tesseract on the detect texts. Please help. > > #!/usr/bin/python >

Re: [tesseract-ocr] Tessaract not able to output detected text

2020-04-29 Thread Zdenko Podobny
changing > different parameters. However I am still not able to get text from the > image. Attached my pre-processing code, which I am running before using > tesseract. But however I am unable to get text still. Please help. > > On Tue, 28 Apr 2020 at 23:57, Zdenko Podobny wrote: > >>

Re: [tesseract-ocr] Using libtesseract in Windows for screenshot OCR

2020-05-17 Thread Zdenko Podobny
On Sun, 17 May 2020 at 13:03, David Varns wrote: > > I am building some tools to extract text data from screenshots, which is a > simple (easy) case of OCR. (It is part of a platform for automated testing > of software, the tester interacts with UI elements and we want to be able > to read things

Re: [tesseract-ocr] Why does tessaract fail on this image?

2020-06-11 Thread Zdenko Podobny
https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md#missing-borders Zdenko On Wed, 10 Jun 2020 at 18:50, 'Tariq Ahmad' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote: > I cannot understand why Tessaract fails on this (cropped) image: > > > Yet if i add a random white
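
The linked section concerns text that runs right up to the edge of a tightly cropped image. As a minimal sketch of the usual workaround, the function below pads a grayscale image with a white margin before OCR. It assumes the image is held as a plain list of pixel rows, and `add_border` with its default of 10 pixels is a hypothetical helper, not a Tesseract or Leptonica call; real code would more likely use an imaging library for this step.

```python
def add_border(pixels, border=10, fill=255):
    """Return a copy of a grayscale image (list of rows) padded with `fill` on all sides."""
    width = len(pixels[0]) + 2 * border
    padded = [[fill] * width for _ in range(border)]  # top margin
    for row in pixels:
        padded.append([fill] * border + list(row) + [fill] * border)
    padded.extend([fill] * width for _ in range(border))  # bottom margin
    return padded


# A 2x2 all-black image with a 1-pixel white border becomes 4x4.
image = [[0, 0], [0, 0]]
bordered = add_border(image, border=1)
print(bordered)
```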

Re: [tesseract-ocr] Why does tessaract fail on this image?

2020-06-12 Thread Zdenko Podobny
Search the forum/issue tracker - there is an explanation of why LSTM cannot provide exact character box coordinates. If you need exact character boxes, IMO you need to use the legacy engine (but it could have other problems). Zdenko On Fri, 12 Jun 2020 at 12:31, 'Tariq Ahmad' via tesseract-ocr < tesseract-ocr@googlegr

Re: [tesseract-ocr] Optimize tesseract

2020-06-26 Thread Zdenko Podobny
There is no magic command/parameter that solves issues like this. And you did not provide enough information (e.g. what "python tesseract 4.0" is) to analyze whether you follow best practices. If you are really interested in help, you have to provide more information (e.g. HW & OS specification,

Re: [tesseract-ocr] training the layout/segmentation/word detection engine

2020-07-01 Thread Zdenko Podobny
Try this: https://github.com/Sintun/PersonalHelperPrograms/blob/master/Tesseract/tess.cpp Longer story: https://github.com/tesseract-ocr/tesseract/issues/1714 Zdenko On Wed, 1 Jul 2020 at 10:29, amit...@gmail.com wrote: > I want to optimise tesseract 4 (lstm) for a set of documents I have. > I

Re: [tesseract-ocr] Anyway to disable internal image preprocessing? (internal operations make really BAD result)

2020-07-03 Thread Zdenko Podobny
First of all: you do not mention any important information, such as which tesseract version you use, which language model, etc. Next: " -c tessedit_write_image=1" produces "Could not set option: tessedit_write_image=1" ;-) Next: if you want to avoid tesseract binarization (Otsu), you must provide really

Re: [tesseract-ocr] Tesseract makes different predictions on seemingly equal images. How to make it more robust?

2020-07-14 Thread Zdenko Podobny
Try to use the latest version of tesseract. Zdenko On Tue, 14 Jul 2020 at 16:04, MysteriousGuy wrote: > I am using Tesseract to extract text from images attached. For some > reason, even though the images are nearly identical, tesseract makes a > mistake in one of them: for 'bad.png' the output

Re: [tesseract-ocr] tessaract ocr on capcha images--how to perform well?

2020-07-14 Thread Zdenko Podobny
There is absolutely no intention to help you (or others) use OCR to break captchas. Zdenko On Tue, 14 Jul 2020 at 19:53, Omar Hasan wrote: > Hello! I am trying to run ocr on capcha images. well, for normal images > tessaract performs well, but for images below attachments, it performs b

Re: [tesseract-ocr] Are character bboxes trustworthy?

2020-07-24 Thread Zdenko Podobny
Do you use the LSTM or the legacy engine? If LSTM: search the issue tracker/PRs/(forum?) for the bounding box problem (and Noah Metzger's patches). There are rumours that if you need really good bounding boxes, you have to use the latest 3.5 version, because changes in the 4.x version (and later) also affected legacy

Re: [tesseract-ocr] Are character bboxes trustworthy?

2020-07-25 Thread Zdenko Podobny
As I mentioned, if you need good bounding boxes, you have to use the legacy engine. There are several issues & comments explaining why it is a problem to get accurate bounding boxes, e.g. https://github.com/tesseract-ocr/tesseract/issues/2825#issuecomment-579220987 Zdenko On Sat, 25 Jul 2020 at 0:44, 'robinw...@google

Re: [tesseract-ocr] Train for big letters in the beginning of the sentences(pic)

2020-08-04 Thread Zdenko Podobny
Not sure what you mean... tesseract big_low.jpeg - --psm 6 Warning: Invalid resolution 0 dpi. Using 70 instead. FY, MINERS.—TO LET, ON LEASE, on such terms as may be agreed on, the MINERALS in the ESTATE of KNOCKSHINNOCK, lying in the parish of New Cumnock, and county of Ayr. Acdead vein has be

Re: [tesseract-ocr] How can I use tesseract library in Visual Studio?

2020-08-04 Thread Zdenko Podobny
Did you try looking at the documentation? Zdenko On Tue, 4 Aug 2020 at 20:22, Kirankumar Chincholi wrote: > Hello everyone, > I hope everyone is fine and safe, I am Kiran,I just tried some basic > openCV tutorial using Visual Studio 2019. Now, I need to extract text from > images by using tesseract

Re: [tesseract-ocr] Getting started with contributions

2020-08-05 Thread Zdenko Podobny
Just send pull requests to the GitHub repository. Zdenko On Thu, 6 Aug 2020 at 7:34, Uddeshya Tyagi wrote: > Hello developers! I'm Uddeshya Tyagi,a computer science student from > Jiit,Noida,India.I recently learnt basics of *tesseract* library.I,now > want to *contribute* to this project,so please
