tesseract training bugs and questions

2009-06-22 Thread Eugene Reimer
Although it is less than clear, I got the impression from http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract that the dictionary files must be created by the wordlist2dawg program even if one wants to use empty word-lists. However, when I run wordlist2dawg with an empty input file

Re: tesseract training bugs and questions

2009-06-23 Thread Eugene Reimer
Thanks Ray for shedding light on many things that were bothering me. Your warning about mixing fonts on one training image is making me rethink my method. I was attempting to use actual images of scanned pages from the book as my training pages, however they involve two different although clo

Re: Overlap Box

2009-07-15 Thread Eugene Reimer
The message about overlapping boxes is often a red herring. The workaround in such cases is to make the box taller. See my post about "box overlaps no blobs or blobs in multiple rows" messages. Its subject-line is "Re: tesseract training bugs", and I'm about to resend it. KHEM Sochenda

Re: tesseract training bugs

2009-07-15 Thread Eugene Reimer
Eugene Reimer wrote, On 2009-06-23 23:11: > Thanks Ray. However I'm unable to accept your explanation of those > "box overlaps no blobs or blobs in multiple rows" messages. The first > of those in my boxfile occurs for the "." line reproduced here > to

Re: Overlab Box

2009-07-15 Thread Eugene Reimer
My earlier response won't help in your case. And I don't know your alphabet. If the blobs in those pairs of overlapping boxes are supposed to make up one character then you'll want to combine their boxes. However if they are separate characters, then you'll need to spread them out, as sugge

Re: Overlab Box

2009-07-15 Thread Eugene Reimer
The simplest solution is probably to combine the bounding boxes for that pair, so tesseract will recognize them as one unit, even though you want it to produce two characters. The TrainingTesseract page covers doing that. KHEM Sochenda wrote, On 2009-07-15 17:18: >Thanks Eugene, > >The chara
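A minimal sketch of combining two box-file lines into one, as suggested above. It assumes the box-file layout `<char> <left> <bottom> <right> <top>` with one box per line; the function name `merge_boxes` is hypothetical and not part of tesseract.

```python
def merge_boxes(line_a, line_b):
    """Merge two box-file lines into a single line covering both glyphs.

    Assumes each line is '<char> <left> <bottom> <right> <top>'.
    The merged line carries both characters and the union rectangle.
    """
    ch_a, *coords_a = line_a.split()
    ch_b, *coords_b = line_b.split()
    left_a, bottom_a, right_a, top_a = map(int, coords_a[:4])
    left_b, bottom_b, right_b, top_b = map(int, coords_b[:4])
    merged = (
        min(left_a, left_b),      # leftmost edge of either box
        min(bottom_a, bottom_b),  # lowest edge
        max(right_a, right_b),    # rightmost edge
        max(top_a, top_b),        # highest edge
    )
    return f"{ch_a}{ch_b} " + " ".join(str(v) for v in merged)


print(merge_boxes("r 10 20 18 40", "n 17 20 30 41"))
```

The merged line replaces the two original lines in the box file, so the overlapping pair is trained as one unit that yields two characters.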

Re: Recreating German language files from source

2009-07-20 Thread Eugene Reimer
As the warden in Cool Hand Luke was fond of saying... --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To post to this group, send email to tesseract-ocr@googlegroups.com To unsubscribe from thi

Re: Recreating German language files from source

2009-07-21 Thread Eugene Reimer
Jeff, I also felt the need for the "source" form of those training images, so much so that I wrote a little bash script to construct the source from the box-file. Its detection of space characters could be improved. It can be simplified for those willing to work in utf8 (my favourite text-ed

Re: Recreating German language files from source

2009-07-21 Thread Eugene Reimer
jeffrey.ratcli...@gmail.com wrote, On 2009-07-22 00:15: > Eugene, > Thanks for that. > Do the word lists have to come from the training images? AFAICT, you > could throw more or less any dictionary data in - and presumably the > results would vary. My current project involves an obscure langu

Bounding boxes

2009-07-23 Thread Eugene Reimer
I'm encountering a problem so weird I can hardly believe what I'm seeing. When I run tesseract with batch.nochop makebox on certain images the resulting boxes (rectangles) have both the top and bottom too low by approximately 12-pixels. The "Issue" http://code.google.com/p/tesseract-ocr/is

Re: Bounding boxes

2009-07-23 Thread Eugene Reimer
> thereitself. > > On Thu, Jul 23, 2009 at 3:45 PM, Eugene Reimer <mailto:erei...@shaw.ca>> wrote: > > > I'm encountering a problem so weird I can hardly believe what I'm > seeing. When I run tesseract with batch.nochop makebox on certain >

Re: Recreating German language files from source

2009-07-26 Thread Eugene Reimer
Jeffrey Ratcliffe wrote, On 2009-07-26 06:21: > Would you mind specifying a licence for this (preferably in the source)? Done. It's still in http://ereimer.net/programs/extract-tesseract-trainingpage-source cheers, Eugene

Re: options to use in xsane

2009-08-03 Thread Eugene Reimer
Something that may be the answer, although my French isn't good enough to tell: http://doc.ubuntu-fr.org/xsane2tess notbitmonk wrote, On 2009-08-03 21:15: >Does anybody knows what options to provide to xsane to use this ocr >instead of gocr?

Re: Agglutinated letters

2009-09-23 Thread Eugene Reimer
That ought to work. And you don't need your fictive letter, since the tesseract training allows for one blob to become two characters. See http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract Svetlin Nakov wrote, On 2009-09-23 11:12: > I have the following idea: add few fictive let

Re: Agglutinated letters

2009-09-23 Thread Eugene Reimer
The link you sent was nothing to do with multiple letter >blobs. > >Thanks, > >Svetlin Nakov >Development Manager >Intelligent Software Consulting (ISC) >-Original Message- >From: tesseract-ocr@googlegroups.com [mailto:tesseract-...@googlegroups.com] >On Be

filtering of emails needed

2009-10-20 Thread Eugene Reimer
We keep getting all this obvious spam, and yet when I reply to one of them with suggestions about detecting and discarding such then my email does not come through. This suggests that there is some filtering but it's not very effective. I originally sent some suggestions on 2009-10-06 23:47,

filtering of emails

2009-10-20 Thread Eugene Reimer
Why would people think this group is into porn? The masochism stuff seems easier to understand:-)

Re: Simple digits are not correctly recognised

2009-10-21 Thread Eugene Reimer
If you remove the underlining that will solve the problem. I just tried it and tesseract got all 3 right, and that was without any whitelisting or retraining for digits-only. dnagir wrote, On 2009-10-21 02:01: >Hi, > >I am trying to recognise 2 simple images: first has 0 (zero) digit and >se

Re: Simple digits are not correctly recognised

2009-10-21 Thread Eugene Reimer
It is simple enough to remove such borders using a program such as unpaper. Dmitiry Nagirnyak wrote, On 2009-10-21 04:05: > Also is there a way to tell tesseract there /*might*/ be some small > borders on the image?

Re: Can tesseract be used for correlation pattern recognization(pic added)?

2009-10-25 Thread Eugene Reimer
There appears to be a typo involved. Ray must've meant http://code.google.com/p/leptonica/ JerryPu wrote, On 2009-10-25 21:43: >Would you please give links about Leptoinca? I can't find infomation >about it. >And I can't find computer vision libraries which are open sources, can >you suggest

Re: Boxes have wrong coordinates

2009-11-28 Thread Eugene Reimer
Both your experiences sound like variations of the issue I reported for version 2.03, and again for 2.04. See my email from 2009-07-23 05:15, and the "issue" report: http://code.google.com/p/tesseract-ocr/issues/detail?id=223 In my examples the box-coordinates had sensible X-values, but the

Re: Boxes have wrong coordinates

2009-11-29 Thread Eugene Reimer
>> On Nov 28, 2009 7:40 PM, "Eugene Reimer" > <mailto:erei...@shaw.ca>> wrote: >> >> Both your experiences sound like variations of the issue I reported for >> version 2.03, and again for 2.04. See my email from 2009-07-23 05:15, >> and the "

Re: newbie question - Compiling Teseract / Language Support

2009-12-01 Thread Eugene Reimer
Wouldn't the simplest solution be for the "install" to not install any language-files. Then the install-instructions would say to install the language-files one wants AFTER installing tesseract; the only other change would be the location one is instructed to copy them to. That would also sol

Re: Ubntu 9.04. tesseract version 2.03, 204 and 3.0

2010-01-07 Thread Eugene Reimer
It's certainly possible to have multiple working versions installed in Linux. Whether or not it's easily done will depend on how the "installer" is written, and I haven't studied it. However, installing two versions by using the --prefix option on the ./configure commandlines should be easy t

Re: OCR Ukrainian text

2010-01-09 Thread Eugene Reimer
Tesseract will do the first step, converting an image of text into text. For the 2nd part, translating Ukrainian text into English, you'll want to look at other things such as Babelfish, none of which will do a perfect translation (a human is still needed for that) but can be better than nothi

Re: Box file editor gui in C++

2010-01-12 Thread Eugene Reimer
Finding your Linux build-instructions isn't easy -- since they're hidden in a file described as "Windows build instructions". mackie wrote, On 2010-01-12 15:26: I've added instructions how to build it on Linux and on Windows. It builds smoothly either on Fedora and on Widnows. Are you sure y

Re: problem with font sizes

2010-01-17 Thread Eugene Reimer
You say that when you do the same things in Photoshop the problem goes away... obviously the resulting image from Photoshop must be different from the IM-produced image, so if you tell us more about how they differ then we may all learn something useful? namenick wrote, On 2010-01-17 23:05:

Re: tesseract crashing while creating boxfile

2010-03-16 Thread Eugene Reimer
One problem is that your image is a JPEG, not a TIFF. Jonah wrote, On 2010-03-16 19:06: Tesseract crashes while creating a boxfile with this TIF: http://drop.io/j6guurf

Re: tesseract crashing while creating boxfile

2010-03-17 Thread Eugene Reimer
I take it back. With the interface provided by that file-sharing site, there is one way to end up with a JPEG, but there's another way to end up with a TIFF. I've also had very poor results when trying to train tesseract on small sizes of text so I'm unable to help there. Jonah wrote, On 2

Re: need guidance

2010-03-20 Thread Eugene Reimer
See the previous posts on this list about running Tesseract on an ARM-processor. They were by ben.hay...@gmail.com and it sounds as though he's got it working, since he's looking at ways to make it faster. piyu wrote, On 2010-03-20 11:42: i'm studying in final year of engineering n i'm doing a pro

Re: Tesseract - Did Any One Try - French MT Script

2010-03-21 Thread Eugene Reimer
I tried it using http://ereimer.net/programs/tesseract-training-from-source with the font from http://www.free-fonts-ttf.org/true-type-fonts/french-script-mt-2944-download.htm and while the newly constructed training files worked well on spaced-out text they worked very badly on the sort of t

Re: Forking tesseract.

2010-04-13 Thread Eugene Reimer
Hello, I'm already too busy with too many projects, so I won't be able to contribute much. However I am interested in OCR, especially in getting a good open-source OCR tool. I used to be a reasonably competent C programmer, but retired long ago, missed out on C++, and prefer easier languages

Re: Disable Special characters?

2010-04-16 Thread Eugene Reimer
One way is to train with only those characters you want recognized. That method works for command-line usage too. MARTIN Pierre wrote, On 2010-04-16 19:22: if someone has an answer to this one, i am wondering if it is possible to "force" the recognition to only given characters (Known before

Re: Text position / color

2010-05-04 Thread Eugene Reimer
Just grab the pixels in that "box", go through them to find the one furthest from your background colour, and you're done. (Pixels on an edge will be a blend of the background and font colour.) Probably the easiest image format to work with is "Plain PPM" since it consists entirely of ASCII c
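A rough sketch of that approach, assuming the image is a Plain PPM (`P3`) file and the box is given in pixel coordinates with the origin at the top-left (tesseract box files count from the bottom-left, so the y values would need flipping first). The helper names `parse_plain_ppm` and `text_colour` are hypothetical.

```python
def parse_plain_ppm(text):
    """Parse a Plain PPM (P3) string into (width, height, list of RGB tuples)."""
    tokens = []
    for line in text.splitlines():
        # Anything after '#' on a line is a comment in the PPM format.
        tokens.extend(line.split("#", 1)[0].split())
    if not tokens or tokens[0] != "P3":
        raise ValueError("not a Plain PPM (P3) image")
    width, height = int(tokens[1]), int(tokens[2])
    # tokens[3] is the maximum colour value; samples follow it.
    values = [int(t) for t in tokens[4:4 + 3 * width * height]]
    pixels = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return width, height, pixels


def text_colour(width, pixels, box, background):
    """Return the pixel in the box that is furthest from the background colour.

    box is (left, top, right, bottom), inclusive, with a top-left origin.
    """
    left, top, right, bottom = box

    def distance_squared(pixel):
        return sum((c - b) ** 2 for c, b in zip(pixel, background))

    candidates = [
        pixels[y * width + x]
        for y in range(top, bottom + 1)
        for x in range(left, right + 1)
    ]
    return max(candidates, key=distance_squared)


sample = "P3\n# tiny example\n2 2\n255\n255 255 255  128 128 128\n0 0 0  255 255 255\n"
w, h, px = parse_plain_ppm(sample)
print(text_colour(w, px, (0, 0, 1, 1), (255, 255, 255)))
```

Blended edge pixels (the grey one here) lose out to the pure font colour because they sit closer to the background.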

Re: Bad Read?

2010-07-02 Thread Eugene Reimer
The command-line tesseract on that image does produce two lines. Mind you, the first line consists entirely of gibberish. Here's what I get: .>’¢:>¢:>C)_§? 522960 That's on Linux with tesseract version 2.04 with the "eng" language-files. Jimmy O'Regan wrote, On 2010-07-02 13:30: Honestly, I'

Re: need help converting .jpg files to .tif for OCR

2010-07-03 Thread Eugene Reimer
You'll need to upscale the image before reducing it to black-and-white. Reducing to B+W isn't essential. fontenot.1031 wrote, On 2010-07-03 01:23: Hey. I have a bunch of .jpg files of the pages of the book L'Etranger that I need to OCR. However, when I convert them into a .tif file so that

Re: need help converting .jpg files to .tif for OCR

2010-07-04 Thread Eugene Reimer
Scaling by a factor that's bigger than one. Just google for "imagemagick scaling". fontenot.1031 wrote, On 2010-07-04 16:47: Can you tell me what upscaling is or how to do it with ImageMagick? I don't know that much about images, jpeg or tiff. Thanks a lot. (also I think the imgur link is me
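As an illustration only, here is a small helper that builds an ImageMagick `convert` command line for upscaling with `-resize`. The 300% default and the file names are assumptions, not values from this thread, and the function name `upscale_command` is hypothetical. Newer ImageMagick releases name the program `magick` instead of `convert`.

```python
def upscale_command(src, dst, percent=300):
    """Build an ImageMagick command that scales src by percent and writes dst."""
    if percent <= 100:
        raise ValueError("upscaling needs a factor bigger than one (percent > 100)")
    # -resize with a percentage scales both dimensions by that factor.
    return ["convert", src, "-resize", f"{percent}%", dst]


print(" ".join(upscale_command("page.jpg", "page.tif")))
```

The returned list can be passed to `subprocess.run(..., check=True)` once ImageMagick is installed.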

Re: request: windows release with support for compressed tiff's

2010-07-15 Thread Eugene Reimer
I agree that Windows is rubbish. However, to make such a statement is to engage in Microsoft-bashing:-) Jimmy O'Regan wrote, On 2010-07-15 12:34: I'm not interested in Microsoft bashing; I'm not interested in seeing myths perpetuated. But I still think Windows is rubbish.

Re: mis-decoding a single line of text

2010-07-27 Thread Eugene Reimer
A quick glance at the documentation will tell you that "the dictionary" lives in several DAWG files, as well as in that user-words file. patrickq wrote, On 2010-07-27 14:59: I get HAX 6 5-5,- with Tesseract 3.0 What I find remarkable is that half the folks on this forum would love to disable the

Re: No treatment for touching letters?

2010-08-12 Thread Eugene Reimer
You could probably improve its ability to recognize "00" as two 0's by training it on such paired symbols. Mind you, I have also been surprised by cases where a perfectly clear and flawless symbol gets subdivided, like a N becoming |\| or an H becoming I-I, which indicates that tesseract has c

Re: Extracting text from scanned PDF docs

2010-09-24 Thread Eugene Reimer
Ghostscript is good for working with PDFs containing text; yours likely have images but no text. Using something like pdfimages to extract the raster-images from a PDF will give you what you want, without any unwanted rescaling. Kevin Carlson wrote, On 2010-09-24 12:37: We receive PDF fi
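A hedged sketch of driving pdfimages from a script. It relies only on the basic `pdfimages PDF-file image-root` invocation; the paths and the helper name `pdfimages_command` are made up for illustration.

```python
from pathlib import Path


def pdfimages_command(pdf_path, out_dir="."):
    """Build a pdfimages command that writes images named after the PDF."""
    pdf = Path(pdf_path)
    # pdfimages appends a counter and extension to this image root.
    image_root = Path(out_dir) / pdf.stem
    return ["pdfimages", str(pdf), str(image_root)]


print(pdfimages_command("scans/invoice.pdf", "out"))
```

Running the resulting list with `subprocess.run(..., check=True)` extracts the embedded images at their stored resolution, and those files can then be converted to whatever format tesseract expects.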

Re: Assertion failed: MaxNumConfig <= MAX_NUM_CONFIGS

2010-09-30 Thread Eugene Reimer
Advice on increasing MAX_NUM_CONFIGS from Ray Smith 2009-07-07 13:52: The 32 font limit (MAX_NUM_CONFIGS) was a hardware limit. (Long story) The code that reads the inttemp file in 2.04 and below is specific to the value of MAX_NUM_CONFIGS so you can increase it as long as you retrain yourself

Re: Tesseract Training

2011-02-19 Thread Eugene Reimer
Would a "basic shape" be the same as a "shape", or as a "utf8"? Hmm, perhaps it is a "call them what you like"? Ray Smith wrote, On 2011-02-19 21:12: Sorry to be late on this very long thread, but you guys are making lives difficult for yourselves by getting hold of the wrong end of the stic