issues with box file(missing characters)

2009-08-20 Thread Chris
Using Tesseract 2.04 through the command line with the default eng training files. I am attempting to make a box file from the following image. The first line is read fine, but the second is ignored. Is there anything I can do, or any parameters I can pass, to adjust how the box file is created? I have the

box file exception

2009-09-21 Thread Chris
When creating box files on blank images I get an exception. I am calling tesseract through command-line calls using Java exec. If the image is blank, I get a debugger prompt because of an unhandled win32 exception. Before you lecture me on misusing this functionality, I have found this

Problem Installing

2009-11-16 Thread Chris
Hello, I'm having a problem installing v2.04 and I'm not having a lot of luck finding a solution. I'm trying to install on CentOS 5. These are the errors I've found in config.log (after configure): /usr/bin/uname -p = unknown /bin/uname -X = unknown /bin/arch = i686 /usr/bin/a

How to perform OCR with region dependent settings

2011-10-01 Thread Chris
do I tell tesseract that field 3 may only contain numbers). I want to prevent tesseract from translating the whole document without paying attention to the characteristics of each form field. Thanks in advance, Chris -- You received this message because you are subscribed to the Google Groups "

Decreased accuracy after training for specific characters

2012-02-11 Thread Chris
necessary characters from the data. I'm doing my training by taking the box files and stripping out all the characters I don't need and then running through the training instructions. I'm using tesseract3.01 Any thoughts? Cheers Chris.

Re: There is no really good and working tutorial for tesseract-ocr for iPhone

2012-02-11 Thread Chris
This one worked well for me: http://tinsuke.wordpress.com/2011/11/01/how-to-compile-and-use-tesseract-3-01-on-ios-sdk-5/ On Jan 21, 7:45 pm, isicom wrote: > Hi there, > > now I spend my one big whole saturday (day and evening) to try to > compile tesseract-ocr for iPhone. My envirement is: > MA

Re: Decreased accuracy after training for specific characters

2012-02-11 Thread Chris
ing single characters? On Feb 11, 10:17 am, Chris wrote: > Hi All, > > I'm using tesseract quite successfully in my code. I have a > preprocessing step that locate the characters I need to recognise and > then I feed them into tesseract using the PSM_SINGLE_CHAR mode. > >

Re: Decreased accuracy after training for specific characters

2012-02-12 Thread Chris
I think you are right - I don't think the sample box data provided for download can be the same data that is used by google to create the trained data. On Feb 12, 12:42 pm, Zdenko Podobný wrote: > Hi Chris, > > I have the same experience - that leads me to conclusion it does not &

Re: Decreased accuracy after training for specific characters

2012-02-13 Thread Chris
traineddata eng. delete the bits you don't need - in my case I don't need any of the dawg files as I'm just recognising single chars then do: combine_tessdata eng. On Feb 12, 2:59 pm, Chris wrote: > I think you are right - I don't think the sample box data provided for &

Re: Problem Recognizing Numbers

2012-02-13 Thread Chris
I'd try segmenting the numbers out yourself and feeding them into tesseract as individual characters. Might work better than feeding it the whole image. Make sure you put some padding around each character. On Feb 13, 1:56 am, JD wrote: > I'm using v 3.01 on Windows 7 to perform OCR on another p
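The padding advice in the reply above can be sketched in a few lines of Python. This is only an illustration, assuming each segmented character is available as a grayscale crop stored as a list of rows; the function name `pad_glyph`, the 10-pixel margin, and the white background value are hypothetical choices, not values given in the thread.

```python
def pad_glyph(pixels, margin=10, background=255):
    """Return a copy of a grayscale character crop with a uniform border.

    pixels is a list of equal-length rows of 0-255 values. The margin and
    background defaults are illustrative assumptions, not thread values.
    """
    if not pixels or not pixels[0]:
        raise ValueError("empty glyph crop")
    width = len(pixels[0])
    if any(len(row) != width for row in pixels):
        raise ValueError("rows must have equal length")
    blank_row = [background] * (width + 2 * margin)
    side = [background] * margin
    # Blank rows above, the original rows with side margins, blank rows below.
    padded = [list(blank_row) for _ in range(margin)]
    padded += [side + list(row) + side for row in pixels]
    padded += [list(blank_row) for _ in range(margin)]
    return padded
```

The padded crop could then be written out with whatever imaging library is already in use and passed to tesseract in single-character mode.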

Recognizing color-on-black screenshots of fixed fonts

2012-06-12 Thread Chris
For a project I want to recognize the text taken from screenshots from programs and games. I have a lot of assumed knowledge which should help me with the recognition: - The font used is usually arial 12pt, plus one or two others. - Background is (usually) black - Font can be different colors, i

[tesseract-ocr] How to upgrade to Tesseract 4.0 with C++ in Visual Studio.

2018-06-21 Thread Chris
Following these steps: https://github.com/tesseract-ocr/tesseract/wiki/Compiling#windows on the official project's "Compiling" page, I was successfully able to get tesseract 3.05.01 and other required packages installed to start using #include in Visual Studio. However, tesseract 3.05.01 isn't

Output of tesseract is not as useful without font baseline information?

2013-09-11 Thread chris
ate. I assume other people have solved this problem already, is there something obvious I'm missing? Thanks, Chris p.s. I realize there is software out there that will OCR PDFs and do this work for me - for my project, the OCR is part of a larger process and so I really need to have more

Is this a bug? Should I report it?

2013-09-17 Thread chris
t have the classifier components so Classify::InitAdaptiveClassifier() is crashing on the line ASSERT_HOST(tessdata_manager.SeekToStart(TESSDATA_INTTEMP)); Is this a known problem, is there a known workaround? Thanks, Chris

Allowing tesseract to recognize a lot of additional words (similar to .user-words but maybe not)

2013-09-17 Thread chris
temp, pffmtable, and normproto files dependent on the word list, or are they dependent only on the language & font? Thanks, Chris

Adding German words to English user-words

2013-09-18 Thread chris
iihl, Dörfer -> Derfer, schützen -> schiitzen). So as a test of the "user-words" facility I created a eng.user-words (attached) that contained a few German words. When I do the OCR, it still gets those words wrong. Is this proof that I'm creating the user-words wrong? Than

Mapping Tesseract Fonts to Windows?

2013-09-20 Thread chris
hat doesn't tell the user anything. I would need a way to ask Tesseract, "what is the glyph for an uppercase G in an Arial font of height 34". Does that exist? Thanks, Chris

Re: Mapping Tesseract Fonts to Windows?

2013-09-21 Thread chris
rd "goodbye" will take up the wrong amount of space and I'll either end up with lots of white space or (more often) the words all run over each other. Does that make more sense? Thanks, Chris On Friday, September 20, 2013 5:39:07 PM UTC-6, Quan Nguyen wrote: > > You&#

What does Tesseract store (internally) for fonts it's been trained on?

2013-09-21 Thread chris
ll me (for example) what a capital Q looks like in the URW Bookman font? Thanks, Chris p.s. I know that I can download the font, I'm specifically asking whether this information can be acquired solely from Tesseract.

Re: Mapping Tesseract Fonts to Windows?

2013-09-21 Thread chris
character widths in >> the font that Windows is using are very different from the character widths >> in the actual DejaVu Sans font, so the word "goodbye" will take up the >> wrong amount of space and I'll either end up with lots of white space or >>

Re: How to best OCR a page with mixed text and images?

2013-10-02 Thread chris
Alas, I'm building in Visual Studio on Windows, and it looks like Olena/Milena doesn't support that platform. On Monday, September 30, 2013 8:24:24 PM UTC-6, Chris Shearer Cooper wrote: > > Is there some way to analyze the image (maybe something in Leptonica) > before sending

[tesseract-ocr] Re: Searchable PDF output with oversized font

2014-11-23 Thread Chris
Hi Ryan, I ran into the same problem. Have you solved it? Best regards, Chris On Wednesday, September 17, 2014 7:26:02 PM UTC+2, Ryan Johnson wrote: > > Hi all, > > I'm having problems with tesseract-ocr since upgrading to Ubuntu 14.04 > LTS. When I use either hocr or th

Re: [tesseract-ocr] Re: Searchable PDF output with oversized font

2014-11-25 Thread Chris
"$page" -s -o "$page.pdf.bak" < "$page.hocr" > #rm -rf $page > done > > pdftk $1_out_*.tif.pdf.bak cat output "$1.tmp.pdf" > Thank you, Chris On Sunday, November 23, 2014 5:12:12 PM UTC+1, shree wrote: > > Have you tried with version comp

RE: Extracting text from PDF

2010-01-04 Thread Chris Faust
Personally, I would just use Image::Magick or GD to convert the .pdf into a .tiff and then simply have tesseract ocr it. Someone else may have a better solution though. -Original Message- From: tesseract-ocr@googlegroups.com [mailto:tesseract-...@googlegroups.com] On Behalf Of Eitan Sen

[tesseract-ocr] Tesseract in containers

2019-10-19 Thread Chris G
Greetings, I am hoping this question is not too general; I am really just looking for others' experiences. We are running one of the latest versions in containers in a Kubernetes cluster. Performance is not great. We are doing PDF conversion and generating a searchable PDF, which is what

[tesseract-ocr] CPU types

2020-01-07 Thread Chris G
We are running Tesseract in Kubernetes in Azure. If anyone is running the same, what types of VMs are you using, with what type of CPU? I am looking into what CPUs support optimal Tesseract

[tesseract-ocr] Obtain both PDF and HOCR output from single scan?

2020-03-11 Thread Chris Falter
possible to produce *both* with a single scan? Or do we have to do 2 scans, one for each output? Thanks in advance for your help! And I apologize if my search through this forum's messages failed to find an answer that already exists. Best, Chris Falter

Re: [tesseract-ocr] Obtain both PDF and HOCR output from single scan?

2020-03-12 Thread Chris Falter
Thanks! On Wednesday, March 11, 2020 at 9:39:14 PM UTC-4, shree wrote: > > Use both at end of command line eg. > > tesseract image outbase -l foo --oem 1 hocr pdf > > On Thu, Mar 12, 2020, 03:59 Chris Falter > > wrote: > >> Hi, >> >> My project is
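The one-pass command quoted in the reply above (`tesseract image outbase -l foo --oem 1 hocr pdf`) can be wrapped for scripted use. The sketch below is a minimal illustration: the helper names `build_tesseract_command` and `run_ocr` are hypothetical, and the default language `eng` is an assumption, since the thread only shows the placeholder `foo`.

```python
import subprocess


def build_tesseract_command(image, outbase, lang="eng", oem=1, outputs=("hocr", "pdf")):
    """Assemble one tesseract invocation that writes several output formats.

    Mirrors the command quoted in the thread; listing more than one output
    config at the end of the command line requests all of them in one run.
    """
    if not outputs:
        raise ValueError("at least one output config is required")
    return ["tesseract", str(image), str(outbase), "-l", lang, "--oem", str(oem), *outputs]


def run_ocr(image, outbase, **kwargs):
    """Run the assembled command; assumes the tesseract binary is on PATH."""
    subprocess.run(build_tesseract_command(image, outbase, **kwargs), check=True)
```

Calling `run_ocr("image", "outbase", lang="foo")` would be expected to leave both `outbase.hocr` and `outbase.pdf` behind from a single recognition pass, as the reply describes.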

[tesseract-ocr] Incorrect OCR of 4-digit number

2022-02-26 Thread Chris McClelland
esseract engine, and I tried with the latest commits (4767ea9 & e2aad9b) of both tessdata and tessdata_best. Can I do anything to improve the OCR result in this sort of scenario? Chris

[tesseract-ocr] Difficult image, any tips would be appreciated

2022-11-12 Thread Chris E.
le to achieve a good result from this kind of image with proper training? Any further ideas/tips would be appreciated! Greetings, Chris [image: temp2.jpg]

Re: [tesseract-ocr] Difficult image, any tips would be appreciated

2022-11-13 Thread Chris E.
nd to get a little better results is to segment the image manually and then feed the individual segments into tesseract. My problem is, that I need to rely on the results (perhaps not 99%, but at least 90%), and that sounds pretty hard to achieve. Greetings, Chris On Sunday, November 13,

Re: [tesseract-ocr] Difficult image, any tips would be appreciated

2022-11-13 Thread Chris E.
BTW, Google Lens detects ALL text on the image perfectly ;) On Sunday, November 13, 2022 at 3:10:51 PM UTC+1 Chris E. wrote: > Hi Lorenzo, > > thank you so much for your ideas! Unfortunately, I don't think I can get a > better image quality. It's a VGA signal that

[tesseract-ocr] Re: Difficult image, any tips would be appreciated

2022-11-13 Thread Chris E.
er image quality. I would still prefer a local solution with tesseract, will post my updates. Greetings, Chris On Sunday, November 13, 2022 at 9:12:57 PM UTC+1 tfmo...@gmail.com wrote: > The image has "mosquito noise" around the characters which indicates that > it's been compr

Degenerate cases for automatic page segmentation

2014-02-05 Thread Chris Adams
linked image above runs quickly. https://gist.github.com/af9c84d4232da1e6127f Obviously, there are ways we could work around this in our workflow but I was wondering whether there's been any thought to something as simple as a time limit for this stage? Chris

[tesseract-ocr] Too few characters. Skipping this page

2014-04-19 Thread Chris Nevin
ignore the structure and just pick up on the characters? Basically any advice as to what would be a good way to go about this would be helpful! Even if I should look at training Tesseract or creating a word list with the chemical elements or something? Thanks a lot! Chris

[tesseract-ocr] Re: Single character recognition

2014-04-24 Thread Chris Dopuch
Could you post the exact command line command you used to get good results for these images? On Thursday, April 10, 2014 11:03:20 AM UTC-5, Vipul Aggarwal wrote: > > I am working on images with single character. > However, tesseract is unable to recognize them. > > This is how I initialized it:

[tesseract-ocr] Binarizing Image gives terrible results

2014-04-24 Thread Chris Ramirez
I am developing an app for android that involves processing an image and producing a text result. I am trying to increase the accuracy by processing the image, enhancing it, and binarizing it. The problem is that when I binarize it, the results come out completely terrible. Not a single charact

[tesseract-ocr] Best image pre-processing software

2014-08-20 Thread Chris Smeal
I've been doing some research on using Tesseract for both document scans and text in scenery, and I was wondering what image processors are best? Given I have a lot of images, I cannot process each batch by hand, so I will have to make a pretty smart pre-processing app before I get started. T

[tesseract-ocr] PDF output not searchable within SumatraPDF

2014-10-14 Thread Chris Cameron
ibjpeg 8d : libpng 1.5.18 : libtiff 4.0.3 : zlib 1.2.8 SumatraPDF v2.5.2 Adobe Reader 11.0.07 Can someone help me out with why this might be happening? Thanks, Chris

[tesseract-ocr] Re: PDF output not searchable within SumatraPDF

2014-10-15 Thread Chris Cameron
All the files I mention can be found here: https://www.dropbox.com/sh/v5w4zl0c2z1wra1/AACxjmomYL4o-iQEhBrLvNgHa Incidentally, I now see that Chrome's PDF viewer is also unable to search the PDF. Thanks, Chris

Re: Image Recognition

2008-10-24 Thread Chris Penn
They are all the same size. They are all in a textline. What is the training process? Chris... On Fri, Oct 24, 2008 at 12:17 PM, Ray Smith <[EMAIL PROTECTED]> wrote: > If they aren't organized into textlines or they are not of roughly the same > size, then it won't work, o

Re: Debian-AMD64: Java Error

2008-11-02 Thread Chris Penn
did you try these? http://linux.softpedia.com/progDownload/OCRopus-Download-40603.html one is tesseract, the other is Ocropus, for 32 and 64bit. Chris... On Sun, Nov 2, 2008 at 5:25 PM, charlesrkiss <[EMAIL PROTECTED]> wrote: > > During ./configure, Tesseract is unable to: >

Re: license for language file

2008-11-05 Thread Chris Penn
The Aspirin/MIGRAINES system is no longer used. Tesseract can also make use of the libtiff library. (www.libtiff.org) Without libtiff, Tesseract can only read uncompressed and G3 compressed TIFF files. Chris... On Wed, Nov 5, 2008 at 3:53 AM, [EMAIL PROTECTED] <[EMAIL PROTECTED]>

[tesseract-ocr] Save recordingdevice or set programatically

2015-05-01 Thread Chris Ongena
nd use that upon start? Or can I set the device programatically in any other way? Thx Chris

[tesseract-ocr] Numbers often recognised as characters (5 -> s)

2016-06-02 Thread Chris H
I am having similar issues to those this user had: https://github.com/tesseract-ocr/tesseract/issues/305 In my case I'm inputting lines of text containing mostly words with the occasional use of numbers. Sometimes numbers such as 5 are being confused with the letter s. Are there some settings that

[tesseract-ocr] 4.00 alpha whitelist / blacklist

2017-01-27 Thread Chris Harris
Using 4.00 alpha built from master on Windows 10. Is the LSTM classifier intended to honor the tessedit_char_blacklist and tessedit_char_whitelist? Just testing from the command line: specifying --oem 0 works correctly; specifying --oem 1, the blacklist and whitelist seem to be ignored. Thank you

[tesseract-ocr] Whitelisting apostrophes problem

2017-04-03 Thread Chris H
I am having trouble whitelisting and OCRing apostrophes (English single right quotes). Given something like the attached image, without specifying a whitelist, apostrophes are output: $ tesseract --user-words ./.user.words /tmp/test-ocr.png stdout Doctor‘s Mask But due to noise (not necessarily

[tesseract-ocr] tesseract 4 skips over some text

2017-07-18 Thread Chris Hawley
The file that I am running OCR on: https://drive.google.com/file/d/0B-iKKP8eIvdgZkhObUVXUVJ1N28/view?usp=sharing Before anyone asks, it's part of the CIA's CREST dataset. I noticed tesseract seems to skip over some text. The command that I am using is E:\Tesseract\build\bin\Release\tesseract.ex

[tesseract-ocr] Tesseract 5 with dnf

2024-09-05 Thread Chris Crutts (agentc313)
on my Oracle Linux 8.10 distribution, doing $ sudo dnf install tesseract installs tesseract version 4.1.1-2.el8 and leptonica version 1.76.0-2.el8 As of today, 9/5/2024, the newest version is Release 5.4.1 · tesseract-ocr/tesseract (github.com)

Re: Is this a bug? Should I report it?

2013-09-18 Thread Chris Shearer Cooper
wrote: > On Wed, Sep 18, 2013 at 06:14:48PM +0200, zdenko podobny wrote: > > My understanding was that Chris was intentionally creating a > > training without the classifier components, for some reason. > > > > Me too. But than consequences (missing re

How to best OCR a page with mixed text and images?

2013-09-30 Thread Chris Shearer Cooper
Is there some way to analyze the image (maybe something in Leptonica) before sending it to Tesseract so that I can prevent Tesseract from trying to extract text from pictures on the page? Or is there a Tesseract setting or extra function call I can make to do this? Thanks, Chris

[tesseract-ocr] Improve text extraction when some text is inverted

2021-07-01 Thread 'Chris' via tesseract-ocr
anyone suggest ways to tackle this? All the discussions I have seen are for when the whole image is inverted, but here it is only some of the text. Regards, Chris

Re: [tesseract-ocr] Improve text extraction when some text is inverted

2021-07-02 Thread 'Chris' via tesseract-ocr
hanks for that. Worst case, I could get Tesseract to look at the original image and an inverted image and then combine the results. Whilst simpler, that would double the time taken. If it helps I could provide a sample C# project next week. Chris On Friday, 2 July 2021 at 11:56:26 UTC+1 zdeno
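The fallback described in this message, running OCR on both the original and an inverted copy and then combining the results, might look roughly like the Python sketch below. Everything here is an assumption made for illustration: the image is a list of grayscale rows, `recognize` is a hypothetical callable that returns a `{region: (text, confidence)}` mapping, and keeping the more confident text per region is just one possible way to combine the two passes. The poster's own project is in C# and is not shown in the thread.

```python
def invert(pixels, max_value=255):
    """Invert a grayscale image given as rows of 0..max_value values."""
    return [[max_value - value for value in row] for row in pixels]


def merge_passes(normal, inverted):
    """Merge two {region: (text, confidence)} results.

    For each region, keep whichever pass reported the higher confidence.
    """
    merged = dict(normal)
    for region, (text, confidence) in inverted.items():
        if region not in merged or confidence > merged[region][1]:
            merged[region] = (text, confidence)
    return merged


def ocr_both_polarities(pixels, recognize):
    """Run the hypothetical recognize() on both polarities and merge.

    This performs two recognition passes, which is the doubled cost
    mentioned in the message.
    """
    return merge_passes(recognize(pixels), recognize(invert(pixels)))
```

How regions are keyed (for example by bounding box) is left open, since the thread does not say how the two result sets would be aligned.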