Although it is less than clear, I got the impression from
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract that the
dictionary files must be created by the wordlist2dawg program even if
one wants to use empty word-lists. However, when I run wordlist2dawg
with an empty input file
Thanks Ray for shedding light on many things that were bothering me.
Your warning about mixing fonts on one training image is making me
rethink my method. I was attempting to use actual images of scanned
pages from the book as my training pages; however, they involve two
different although clo
The message about overlapping boxes is often a red herring. The
workaround in such cases is to make the box taller. See my post
about "box overlaps no blobs or blobs in multiple rows" messages. Its
subject-line is "Re: tesseract training bugs", and I'm about to resend it.
KHEM Sochenda
Eugene Reimer wrote, On 2009-06-23 23:11:
> Thanks Ray. However I'm unable to accept your explanation of those
> "box overlaps no blobs or blobs in multiple rows" messages. The first
> of those in my boxfile occurs for the "." line reproduced here
> to
My earlier response won't help in your case. And I don't know your
alphabet. If the blobs in those pairs of overlapping boxes are supposed
to make up one character then you'll want to combine their boxes.
However if they are separate characters, then you'll need to spread them
out, as sugge
The simplest solution is probably to combine the bounding boxes for that
pair, so tesseract will recognize them as one unit, even though you want
it to produce two characters. The TrainingTesseract page covers doing that.
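
For anyone who would rather script the combining than edit the box-file by hand, here is a minimal sketch. It assumes the 2.x box-file layout of one glyph followed by left, bottom, right and top coordinates per line; the function names are made up for illustration and are not part of tesseract.

```python
def parse_box_line(line):
    """Split a box-file line 'glyph left bottom right top' into its parts."""
    glyph, left, bottom, right, top = line.split()[:5]
    return glyph, int(left), int(bottom), int(right), int(top)


def merge_boxes(line_a, line_b):
    """Return one box-file line covering both boxes, labelled with both glyphs."""
    ga, la, ba, ra, ta = parse_box_line(line_a)
    gb, lb, bb, rb, tb = parse_box_line(line_b)
    # The union of two rectangles: smallest left/bottom, largest right/top.
    return f"{ga}{gb} {min(la, lb)} {min(ba, bb)} {max(ra, rb)} {max(ta, tb)}"
```

The merged line replaces the original pair in the box-file, so that the one blob is trained as the two-character string.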
KHEM Sochenda wrote, On 2009-07-15 17:18:
>Thanks Eugene,
>
>The chara
As the warden in Cool Hand Luke was fond of saying...
--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from thi
Jeff,
I also felt the need for the "source" form of those training images, so
much so that I wrote a little bash script to construct the source from
the box-file. Its detection of space characters could be improved. It
can be simplified for those willing to work in utf8 (my favourite
text-ed
jeffrey.ratcli...@gmail.com wrote, On 2009-07-22 00:15:
> Eugene,
> Thanks for that.
> Do the word lists have to come from the training images? AFAICT, you
> could throw more or less any dictionary data in - and presumably the
> results would vary.
My current project involves an obscure langu
I'm encountering a problem so weird I can hardly believe what I'm
seeing. When I run tesseract with batch.nochop makebox on certain
images the resulting boxes (rectangles) have both the top and bottom too
low by approximately 12-pixels. The "Issue"
http://code.google.com/p/tesseract-ocr/is
> there itself.
>
> On Thu, Jul 23, 2009 at 3:45 PM, Eugene Reimer <erei...@shaw.ca> wrote:
>
>
> I'm encountering a problem so weird I can hardly believe what I'm
> seeing. When I run tesseract with batch.nochop makebox on certain
>
Jeffrey Ratcliffe wrote, On 2009-07-26 06:21:
> Would you mind specifying a licence for this (preferably in the source)?
Done. It's still in
http://ereimer.net/programs/extract-tesseract-trainingpage-source
cheers,
Eugene
Something that may be the answer, although my French isn't good enough
to tell: http://doc.ubuntu-fr.org/xsane2tess
notbitmonk wrote, On 2009-08-03 21:15:
>Does anybody know what options to provide to xsane to use this ocr
>instead of gocr?
>
>
That ought to work. And you don't need your fictive letter, since the
tesseract training allows for one blob to become two characters. See
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
Svetlin Nakov wrote, On 2009-09-23 11:12:
> I have the following idea: add few fictive let
The link you sent had nothing to do with multiple letter
>blobs.
>
>Thanks,
>
>Svetlin Nakov
>Development Manager
>Intelligent Software Consulting (ISC)
>-Original Message-
>From: tesseract-ocr@googlegroups.com [mailto:tesseract-...@googlegroups.com]
>On Be
We keep getting all this obvious spam, and yet when I reply to one of
them with suggestions about detecting and discarding such spam, my email
does not come through. This suggests that there is some filtering but
it's not very effective. I originally sent some suggestions on
2009-10-06 23:47,
Why would people think this group is into porn? The masochism stuff
seems easier to understand:-)
If you remove the underlining that will solve the problem. I just tried
it and tesseract got all 3 right, and that was without any whitelisting
or retraining for digits-only.
dnagir wrote, On 2009-10-21 02:01:
>Hi,
>
>I am trying to recognise 2 simple images: first has 0 (zero) digit and
>se
It is simple enough to remove such borders using a program such as unpaper.
Dmitiry Nagirnyak wrote, On 2009-10-21 04:05:
> Also is there a way to tell tesseract there /*might*/ be some small
> borders on the image?
There appears to be a typo involved. Ray must've meant
http://code.google.com/p/leptonica/
JerryPu wrote, On 2009-10-25 21:43:
>Would you please give links about Leptoinca? I can't find information
>about it.
>And I can't find computer vision libraries which are open source, can
>you suggest
Both your experiences sound like variations of the issue I reported for
version 2.03, and again for 2.04. See my email from 2009-07-23 05:15,
and the "issue" report:
http://code.google.com/p/tesseract-ocr/issues/detail?id=223
In my examples the box-coordinates had sensible X-values, but the
>> On Nov 28, 2009 7:40 PM, "Eugene Reimer" <erei...@shaw.ca> wrote:
>>
>> Both your experiences sound like variations of the issue I reported for
>> version 2.03, and again for 2.04. See my email from 2009-07-23 05:15,
>> and the "
Wouldn't the simplest solution be for the "install" to not install any
language-files? Then the install-instructions would say to install the
language-files one wants AFTER installing tesseract; the only other
change would be the location one is instructed to copy them to.
That would also sol
It's certainly possible to have multiple working versions installed in
Linux. Whether or not it's easily done will depend on how the
"installer" is written, and I haven't studied it. However, installing
two versions by using the --prefix option on the ./configure
commandlines should be easy t
Tesseract will do the first step, converting an image of text into
text. For the 2nd part, translating Ukrainian text into English, you'll
want to look at other things such as Babelfish, none of which will do a
perfect translation (a human is still needed for that) but can be better
than nothi
Finding your Linux build-instructions isn't easy -- since they're hidden
in a file described as "Windows build instructions".
mackie wrote, On 2010-01-12 15:26:
I've added instructions on how to build it on Linux and on Windows.
It builds smoothly on both Fedora and Windows. Are you sure y
You say that when you do the same things in Photoshop the problem goes
away... obviously the resulting image from Photoshop must be different
from the IM-produced image, so if you tell us more about how they differ
then we may all learn something useful?
namenick wrote, On 2010-01-17 23:05:
One problem is that your image is a JPEG, not a TIFF.
Jonah wrote, On 2010-03-16 19:06:
Tesseract crashes while creating a boxfile with this TIF:
http://drop.io/j6guurf
I take it back. With the interface provided by that file-sharing site,
there is one way to end up with a JPEG, but there's another way to end
up with a TIFF.
I've also had very poor results when trying to train tesseract on small
sizes of text so I'm unable to help there.
Jonah wrote, On 2
See the previous posts on this list about running Tesseract on an
ARM-processor. They were by ben.hay...@gmail.com and it sounds as though
he's got it working, since he's looking at ways to make it faster.
piyu wrote, On 2010-03-20 11:42:
I'm studying in the final year of engineering and I'm doing a pro
I tried it using
http://ereimer.net/programs/tesseract-training-from-source with the font
from
http://www.free-fonts-ttf.org/true-type-fonts/french-script-mt-2944-download.htm
and while the newly constructed training files worked well on spaced-out
text they worked very badly on the sort of t
Hello,
I'm already too busy with too many projects, so I won't be able to
contribute much. However I am interested in OCR, especially in getting
a good open-source OCR tool. I used to be a reasonably competent C
programmer, but retired long ago, missed out on C++, and prefer easier
languages
One way is to train with only those characters you want recognized.
That method works for command-line usage too.
MARTIN Pierre wrote, On 2010-04-16 19:22:
If someone has an answer to this one, I am wondering if it is possible
to "force" the recognition to only given characters (Known before
Just grab the pixels in that "box", go through them to find the one
furthest from your background colour, and you're done. (Pixels on an
edge will be a blend of the background and font colour.) Probably the
easiest image format to work with is "Plain PPM" since it consists
entirely of ASCII c
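
A minimal sketch of that approach, assuming the image has already been converted to Plain PPM (the ASCII variant with magic number P3). The function names and the box convention (top-left origin, right and bottom exclusive) are illustrative assumptions, not anything tesseract provides.

```python
def read_plain_ppm(text):
    """Parse Plain PPM (P3) text into (width, height, list of RGB tuples)."""
    tokens = []
    for line in text.splitlines():
        tokens.extend(line.split("#", 1)[0].split())  # drop '#' comments
    if not tokens or tokens[0] != "P3":
        raise ValueError("not a Plain PPM (P3) file")
    width, height, _maxval = (int(t) for t in tokens[1:4])
    values = [int(t) for t in tokens[4:4 + 3 * width * height]]
    pixels = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return width, height, pixels


def furthest_colour(pixels, width, box, background):
    """Return the pixel colour inside box that is furthest from background."""
    left, top, right, bottom = box  # assumed: top-left origin, exclusive ends

    def squared_distance(pixel):
        return sum((a - b) ** 2 for a, b in zip(pixel, background))

    region = [pixels[y * width + x]
              for y in range(top, bottom)
              for x in range(left, right)]
    return max(region, key=squared_distance)
```

Squared Euclidean distance in RGB is only one possible measure of "furthest"; it is enough to separate the font colour from blended edge pixels when the background is fairly uniform.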
The command-line tesseract on that image does produce two lines. Mind
you, the first line consists entirely of gibberish. Here's what I get:
.>’¢:>¢:>C)_§?
522960
That's on Linux with tesseract version 2.04 with the "eng" language-files.
Jimmy O'Regan wrote, On 2010-07-02 13:30:
Honestly, I'
You'll need to upscale the image, and do so before reducing it to
black-and-white. Reducing to B+W isn't essential.
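
As a rough sketch of the upscaling step, the following builds an ImageMagick command line and runs it. It assumes ImageMagick's "convert" program is on the PATH (newer installs may call it "magick" instead); the function names and the default factor of 3 are arbitrary choices for illustration, not a recommended setting.

```python
import subprocess


def upscale_command(src, dst, factor=3.0):
    """Build an ImageMagick command that enlarges src by the given factor."""
    if factor <= 1:
        raise ValueError("upscaling needs a factor bigger than one")
    percent = f"{factor * 100:g}%"  # e.g. factor 3 becomes "300%"
    return ["convert", src, "-resize", percent, dst]


def upscale(src, dst, factor=3.0):
    """Run the command; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(upscale_command(src, dst, factor), check=True)
```

Calling upscale("page.jpg", "page.tif") would write an enlarged TIFF, which can then be fed to tesseract as usual.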
fontenot.1031 wrote, On 2010-07-03 01:23:
Hey. I have a bunch of .jpg files of the pages of the book L'Etranger
that I need to OCR. However, when I convert them into a .tif file so
that
Scaling by a factor that's bigger than one. Just google for
"imagemagick scaling".
fontenot.1031 wrote, On 2010-07-04 16:47:
Can you tell me what upscaling is or how to do it with ImageMagick? I
don't know that much about images, jpeg or tiff. Thanks a lot. (also I
think the imgur link is me
I agree that Windows is rubbish. However, to make such a statement is
to engage in Microsoft-bashing:-)
Jimmy O'Regan wrote, On 2010-07-15 12:34:
I'm not interested in Microsoft bashing; I'm not interested in seeing
myths perpetuated. But I still think Windows is rubbish.
A quick glance at the documentation will tell you that "the dictionary"
lives in several DAWG files, as well as in that user-words file.
patrickq wrote, On 2010-07-27 14:59:
I get HAX 6 5-5,- with Tesseract 3.0
What I find remarkable is that half the folks on this forum would love
to disable the
You could probably improve its ability to recognize "00" as two 0's by
training it on such paired symbols.
Mind you, I have also been surprised by cases where a perfectly clear
and flawless symbol gets subdivided, like an N becoming |\| or an H
becoming I-I, which indicates that tesseract has c
Ghostscript is good for working with PDFs containing text; yours likely
have images but no text. Using something like pdfimages to extract
the raster-images from a PDF will give you what you want, without any
unwanted rescaling.
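
A small sketch of driving pdfimages from a script, assuming the pdfimages program (from xpdf or poppler) is installed. The helper names and the output-directory layout are hypothetical; by default pdfimages writes files named after the given prefix, in PPM or PBM format.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, out_dir):
    """Build a pdfimages command that writes images as out_dir/<pdf name>-NNN."""
    prefix = Path(out_dir) / Path(pdf_path).stem
    return ["pdfimages", str(pdf_path), str(prefix)]


def extract_images(pdf_path, out_dir):
    """Create out_dir if needed and extract the embedded raster images."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, out_dir), check=True)
```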
Kevin Carlson wrote, On 2010-09-24 12:37:
We receive PDF fi
Advice on increasing MAX_NUM_CONFIGS from Ray Smith 2009-07-07 13:52:
The 32 font limit (MAX_NUM_CONFIGS) was a hardware limit. (Long story)
The code that reads the inttemp file in 2.04 and below is specific to
the value of MAX_NUM_CONFIGS so you can increase it as long as you
retrain yourself
Would a "basic shape" be the same as a "shape", or as a "utf8"? Hmm,
perhaps it is a "call them what you like"?
Ray Smith wrote, On 2011-02-19 21:12:
Sorry to be late on this very long thread, but you guys are making
lives difficult for yourselves by getting hold of the wrong end of the
stic