Re: Disable Special characters?

2010-04-21 Thread Zdenko Podobný
Hello, maybe this problem occurs if tesseract is not installed in the standard place and there is no environment setting (export TESSDATA_PREFIX="directory in which your tessdata resides/") as mentioned in http://code.google.com/p/tesseract-ocr/wiki/ReleaseNotes. I have (on linux) tesseract 2.04 installed
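
A minimal sketch, in Python, of the check this reply hints at: is TESSDATA_PREFIX set, and does it point at a directory that contains tessdata? The helper name find_tessdata is hypothetical and is not part of Tesseract.

```python
import os
from pathlib import Path


def find_tessdata(env=None):
    """Return the tessdata directory implied by TESSDATA_PREFIX, or None.

    TESSDATA_PREFIX is expected to name the directory in which the
    tessdata folder resides, as described in the post.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return None  # variable is not set at all
    candidate = Path(prefix) / "tessdata"
    # Only report the directory if it really exists on disk.
    return candidate if candidate.is_dir() else None
```

Calling find_tessdata() with no arguments inspects the current environment; a None result suggests the variable is missing or points at the wrong place.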

Re: Training for Swedish, Danish, Norwegian, old spelling, fraktur

2010-04-23 Thread Zdenko Podobný
Hello, please read the wiki pages at http://code.google.com/p/tesseract-ocr/wiki, especially http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract, where the training process for tesseract 2.04 is described. In svn (http://code.google.com/p/tesseract-ocr/source/checkout) there is already (pre?) re

Re: TRAINING ... Font name = UnknownFont.

2010-04-24 Thread Zdenko Podobný
Dňa 19.04.2010 09:05, MARTIN Pierre wrote / napísal(a): > Hello Zdpo, > > As said in my mail on 13th of April, as an answer to Sriranga: > > >>> I am extremely thankful for the attachment. I could not understand "OCRB >>> font" - which I don't have. It is presumed any fonts can do/be used ? >

Re: Tesseract 3.0 without page layout analysis?

2010-04-29 Thread Zdenko Podobný
Hi Patrick, Do you have experience that it works (e.g. it produces different output for different "Page seg mode")? I tried several options but I got the same output. I used a scan of a 4-column magazine page as the input file. Maybe I did something wrong, maybe I do not understand what the result should be.

Re: Extracting files from .tessdata

2010-04-29 Thread Zdenko Podobný
Hi Ramon, I do not have source files for dawg dictionaries and I am not able to "decompile" them. Anyway, I think creating dictionaries is the easiest part of tesseract training: based on the wiki[1], the input is a simple utf-8 file with one word per line. This file is split into several files: * lang.pu
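
The input format mentioned here (UTF-8, one word per line) can be produced with a few lines of Python. This is only a sketch: the function name write_word_list is hypothetical, and the de-duplication step is an assumption rather than something the wiki is known to require.

```python
def write_word_list(words, path):
    """Write unique, non-empty words to `path`, one per line, as UTF-8.

    Returns the number of words written.
    """
    seen = set()
    lines = []
    for word in words:
        word = word.strip()
        if word and word not in seen:  # skip blanks and repeats
            seen.add(word)
            lines.append(word)
    # newline="\n" keeps Unix line endings on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in lines:
            handle.write(word + "\n")
    return len(lines)
```

The resulting file is what would then be handed to the dictionary tools described on the training wiki.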

Re: Benefit of the dictionary

2010-05-01 Thread Zdenko Podobný
determine the version of the ambigs file? Whether the ambigs file of > tess.3.o is supported for utf-8 say Kannada or any of Indic lang? Previous > version of tesseract 2.xx did not support utf-8 > With regards, > -sriranga(77yrsold) > > 2010/5/1 Zdenko Podobný > > >>

Re: Training tesseract for hand written letters

2010-05-17 Thread Zdenko Podobný
Hello, can you provide more information (OS? how did you install Tesseract?) Zd. Dňa 08.05.2010 19:53, Thilanka Kaushalya wrote / napísal(a): > Hi, > > I'm a doing a handwritten character recognition using Tesseract. I > tried to train the Tesseract exe for my data set. on windows

Re: Tesseract on RH EL4 TIFFCleanup undefined symbol error

2010-05-18 Thread Zdenko Podobný
Hello, I usually get this error if I mix libraries (e.g. if I have an old version of a library but your library/program expects a newer version). If you did not compile the library yourself, you should complain to the packager of your program about the wrong handling of dependencies... If you compiled some

Re: Integrating Tesseract with another open source project

2010-05-22 Thread Zdenko Podobný
see http://code.google.com/p/tesseract-ocr/wiki/ReadMe: Another important change is that you should *really* be using TessBaseAPI if you are linking with another program. In Linux (non-Windows) the main library is now libtesseract_api.a instead of the old libtesseract_full.a. In wi

Re: Extracting files from .tessdata

2010-05-22 Thread Zdenko Podobný
(a): > Hi Zdenko, > > After some tests, I realized I need the tiff pair boxes that the > creators used to generate Catalan tessdata file. > > Do you know a way to contact to them? > > Ramon. > > > > > On 29 Abr, 23:49, Zdenko Podobný wrote: > >> Hi

Re: Danish fraktur support in r319

2010-05-24 Thread Zdenko Podobný
Dňa 24.05.2010 19:46, Jimmy O'Regan wrote / napísal(a): > On 24 May 2010 17:41, Lars Aronsson wrote: > >> Peter Alberti wrote: >> > I've trained tesseract r319 (3.0) to support Danish texts written in > fraktur. It is not > perfect but good enough that I hope it may be useful t

Re: Danish fraktur support in r319

2010-05-24 Thread Zdenko Podobný
Dňa 24.05.2010 20:26, Jimmy O'Regan wrote / napísal(a): > 2010/5/24 Zdenko Podobný : > >> Dňa 24.05.2010 19:46, Jimmy O'Regan wrote / napísal(a): >> >>> On 24 May 2010 17:41, Lars Aronsson wrote: >>> >>> >>>> Pet

Re: Danish fraktur support in r319

2010-05-24 Thread Zdenko Podobný
Dňa 24.05.2010 21:39, Lars Aronsson wrote / napísal(a): > Jimmy O'Regan wrote: > ‘ScrollView::Image(Pix*&, inT32&, int)’ > ../viewer/scrollview.h:266: note: candidates are: void > ScrollView::Image(const char*, int, int) > >> Weird. It's there: >>337 theraysmith #ifdef HAVE_LIBLE

Re: Call for testers...

2010-05-26 Thread Zdenko Podobný
Hello, the compilation process ran without problems: ./runautoconf ./configure make Then I installed it: sudo make install When I tried to run it: /usr/local/bin/tesseract I got an error: /usr/local/bin/tesseract: error while loading shared libraries: libtesseract_api.so.

Re: Call for testers...

2010-05-26 Thread Zdenko Podobný
Dňa 26.05.2010 20:52, Jimmy O'Regan wrote / napísal(a): > 2010/5/26 Zdenko Podobný : > >> Hello, >> >> compilation process without problem: >> >> ./runautoconf >> ./configure >> make >> >> Than I installed it: >>

Re: Call for testers...

2010-05-27 Thread Zdenko Podobný
Dňa 26.05.2010 21:28, Jimmy O'Regan wrote / napísal(a): > 2010/5/26 Zdenko Podobný : > >> Dňa 26.05.2010 20:52, Jimmy O'Regan wrote / napísal(a): >> >>>> >>>> >>> I didn't do anything with the Java stuff - did

Re: Generated ZERO tr..BLOBS IN r379 and 869 tr.blobs in r-319

2010-06-05 Thread Zdenko Podobný
Dňa 05.06.2010 14:57, Jimmy O'Regan wrote / napísal(a): > On Saturday, June 5, 2010, zdpo wrote: > >> Dear Sriranga, >> >> your box file is wrong (for tesseract 3.0 and >r319). It did not match >> to description in "Make Box Files" on >> http://code.google.com/p/tesseract-ocr/wiki/TrainingT

Re: Forking tesseract.

2010-06-09 Thread Zdenko Podobný
Hello, do you also intend to release tiff/box files for (new) languages (in )? Can you provide a short example of a punc-dawg and a number-dawg file? BR, Zd. Dňa 25.05.2010 06:44, Ray Smith wrote / napísal(a): > I would be very happy for someone to take over maintenance of the autotools > par

Re: *** glibc detected *** tesseract: double free or corruption

2010-07-13 Thread Zdenko Podobný
Hi, I have Mandriva 2010.1 64 bit (upgraded from 2010) with tesseract-2.04-5mdv2010.1. I can run without problems (in xterm or konsole): /usr/bin/tesseract resized_slide7.tif resized_slide7 which produces output (resized_slide7.txt). BUT: /usr/bin/tesseract resized_slide5.tif resiz

Re: Localisation

2010-07-13 Thread Zdenko Podobný
Dňa 13.07.2010 12:29, Jimmy O'Regan wrote / napísal(a): > On 12 July 2010 20:19, Jeffrey Ratcliffe wrote: > >> On 12 July 2010 17:16, Jimmy O'Regan wrote: >> >>> Is anybody interested in seeing localisation support in Tesseract? >>> (Which begs the follow-up question: is anybody willin

Re: Can't get the user dictionary to work

2010-08-01 Thread Zdenko Podobný
Dňa 30.07.2010 15:04, patrickq wrote / napísal(a): > This what I did: > > 1. Created a text file called eng.user-words, containing: > Chest > Chestnut > Floor > Vice > > 2. Placed the file in the tessdata folder (next to eng.traineddata) > > 3. Ran recognition on an image returning "Chesf" instea

Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

2010-08-01 Thread Zdenko Podobný
Dňa 28.07.2010 17:02, Jimmy O'Regan wrote / napísal(a): > I grepped the code and it seems to be looking for something called LANG.user-words, but that didn't seem to do anything -- I got the same garbled text when I ran Tesseract 3 the second time. >> Turns out T3 does

Re: Announcement: new version of pyTesseractTrainer available

2010-08-13 Thread Zdenko Podobný
Dňa 13.08.2010 23:19, Robert Komar wrote / napísal(a): > On Fri, 13 Aug 2010, zdenko podobny wrote: > >> Because IFAIK nobody react on Catalin e-mail I offered him to create >> project >> to collect patches and possibly to solve known issues. Because of my low >> time resource project is looking s

Re: Announcement: new version of pyTesseractTrainer available

2010-08-13 Thread Zdenko Podobný
Dňa 14.08.2010 00:17, Jimmy O'Regan wrote / napísal(a): > On 13 August 2010 21:54, zdenko podobny wrote: > >> Hello, >> I would like to announce new version 1.01 of pyTesseractTrainer - successor >> of tesseractTrainer.py Version 1.00 is identical with tesseractTrainer.py. >> Features: >> >> v

Re: Tesseract Training Problem (under Mac)

2010-09-06 Thread Zdenko Podobný
So details for training are split into 2 wiki pages: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract2 http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 Unfortunately comments (now irrelevant) stay on http://code.

Re: Training japanese for 3.0

2010-09-19 Thread Zdenko Podobný
Hi Stane, why doesn't it look healthy? ;-) There is one easy way to find out whether it is correct or not: test it ;-) BTW: when I searched for mistakes in the former wiki (the corrections are now included in http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3) I noticed that: a) unicharset_ex

Re: Training japanese for 3.0

2010-09-19 Thread Zdenko Podobný
Dňa 19.09.2010 16:01, Jimmy O'Regan wrote / napísal(a): > 2010/9/19 Zdenko Podobný : >> Hi Stane, >> >> why it doesn't look healthy? ;-) >> There is one easy way how to find if it correct or not: to test it ;-) >> >> BTW: when I searched

Re: Help on training tesseract for new language

2010-09-28 Thread Zdenko Podobný
You mixed tesseract versions: the "combine_tessdata" command is part of tesseract 3.00. Tesseract-2.04 did not use .traineddata. Zd. Dňa 28.09.2010 17:25, TesseractNoob wrote / napísal(a): > I need to train tesseract for a new language. So I was successfully > created the 8 files without any issues

Re: Provide/visualize baseline info?

2011-02-05 Thread Zdenko Podobný
I am not sure if it helps you, but did you try debug mode (http://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging)? Zd. Dňa 05.02.2011 01:33, daemon-s wrote / napísal(a): Hi! I train Tess using separate images for every text line. Recognition is also ran over single text line

Re: Trying to OCR a LCD Display

2011-02-25 Thread Zdenko Podobný
Can you post an example somewhere? Zdenko Dňa 18.02.2011 20:29, GigaGuy wrote / napísal(a): Can someone explain to me how I can train tesseract to recognized the numbers on an lcd display? They are terminal font. I have a weather station and a webcam. I am taking a pic and trying to ocr it to p

Re: Tesseract and Windows 7 64 Bit

2011-02-27 Thread Zdenko Podobný
Dňa 26.02.2011 23:24, andy_syme wrote / napísal(a): Tesseract 3 doesn't seem to work on my win7 64 bit laptop and I think I read somewhere that tesseract 3 doesn't work with 64 bit. Please be more specific: What does "doesn't seem to work" mean? Where did you read that it does not work on 64b

Re: Tesserac 2.0 not working

2011-03-06 Thread Zdenko Podobný
Hi, I am not familiar with your code/application, so just a hint: it sounds to me like tesseract is not able to find the English language data in "D:\Projects\AMCDF\Source\Frameworks\Device\AMCDF.Device.GUI\Resources\tessdata\" AFAIK tessnet2 is based on tesseract 2.0x so you need there 2.0x language

Re: can't read frequent_words_list file

2011-03-06 Thread Zdenko Podobný
Hello, the message is clear: it cannot find the file "frequent_words_list". Does it exist? Can you send a listing of the directory where you run 'wordlist2dawg'? Zdenko Dňa 06.03.2011 14:02, Sang Đặng Minh wrote / napísal(a): I using tesseract 2.0.4, and I creat 2 UTF-8 text file, then and then use wor

Re: tesseract-3.01 compiling issue on linux

2011-04-07 Thread Zdenko Podobný
Did you run "./runautoconf ; ./configure" before running make? I have no problem compiling revision 581 on linux. Zdenko Dňa 07.04.2011 01:56, zl2k wrote / napísal(a): hi, all, I just checked out tesseract-3.01(r581) from svn but got the following compiling error on linux box colfind.cpp

Re: How to improve an existing language?

2011-04-13 Thread Zdenko Podobný
Hi, deu-frak should be a community language pack :-) prepared by the piggy (see [1] or [2]) so further improvements should be possible. At the moment I cannot find fraktur.tgz (in Google Group files - maybe it was removed), but there were other people interested in improving it (see [3]).

Re: How to improve an existing language?

2011-04-13 Thread Zdenko Podobný
Dňa 13.04.2011 19:32, Jimmy O'Regan wrote / napísal(a): 2011/4/13 Zdenko Podobný: Hi, deu-frak should be community language pack :-) prepared by the piggy (see [1] or [2]) so further improvements should be possible. At the moment I can not find fraktur.tgz (in Google Group files - may

Re: boxtiff file for arabic

2011-04-21 Thread Zdenko Podobný
There were several such requests for tesseract 3.00 but without result ;-) In the past somebody tried to train tesseract based on the released box+tiff files ([1], boxtiff-2.01-*.tar.gz [2]) and he got a different result than the published one ;-) So I think it does not make sense to wait for box+tiff files...

Re: Catalan language

2011-05-15 Thread Zdenko Podobný
Dňa 11.05.2011 23:18, jinglada wrote / napísal(a): On May 11, 9:53 pm, zdenko podobny wrote: On Wed, May 11, 2011 at 9:22 PM, jinglada wrote: In the /usr/share/tesseract-ocr/tessdata I have the following files: cat.DangAmbigs spa.DangAmbigs eng.DangAmbigs fra.DangAmbigs por.DangAmbigs ca

Re: Training procedure

2011-06-21 Thread Zdenko Podobný
As far as I know tesseract is developed (or at least tested) on Ubuntu :-). The Windows version is a port ;-) BTW: this is a stupid bug/feature: you can fix it by renaming the file 'spa.cour.g4.tr' to 'spa.cour.exp4.tr'. See the comment in the source code [1]. This worked for tesseract 3.01 (revision ) on Mandr
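
The rename suggested here can be scripted. The Python sketch below only computes the new name; it assumes that any '.gN.tr' suffix should become '.expN.tr', which generalizes the single example given in the post, and the function name fix_training_filename is hypothetical.

```python
import re


def fix_training_filename(name):
    """Rewrite a trailing '.gN.tr' to '.expN.tr'; leave other names unchanged."""
    # Assumption: N is one or more digits, as in 'spa.cour.g4.tr'.
    return re.sub(r"\.g(\d+)\.tr$", r".exp\1.tr", name)
```

Passing the result to os.rename together with the old name would perform the actual rename on disk.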

Re: Training procedure

2011-06-21 Thread Zdenko Podobný
I know there is one bug in 3.00 (already fixed in svn for version 3.01) that "works" on linux but not windows [1]. A patch is included in that issue if needed, also with an explanation why it has problem on Linux/Mac and not Windows. If possible I suggest using a recent revision of the source code (589)

Re: Can't make box files

2011-06-21 Thread Zdenko Podobný
Hi, I ran: /usr/local/bin/tesseract tesseract-ocr/eurotext.tif tesseract-ocr/eurotext tesseract-ocr/tessdata/tessconfigs/batch.nochop tesseract-ocr/tessdata/configs/makebox and it created eurotext.box in tesseract-ocr/ (the same directory where the image is located). If you have problem to f

Re: read other languages by tesseract on c #

2011-10-07 Thread Zdenko Podobný
If the error message is: "Unable to load unicharset file C:\Program Files\Tesseract-OCR\ita.unicharset" then it means that your program expects the language files (ita.*) in the directory "C:\Program Files\Tesseract-OCR\" and not in "...\tessdata" Zdenko Dňa 07.10.2011 18:20, Alessandro Latella w

Re: Stopping Tesseract title when using command line tool

2011-12-01 Thread Zdenko Podobný
Simple: no. Complicated: redirect standard output (stdout). BTW: it is displayed when tesseract is started, not just before the results are written to the file (results are not displayed). If you do not like this behavior you can modify the source and create your own version. Or you can use tesseract librar
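
The "complicated" option, redirecting the output, might look like this when tesseract is started from a Python script. This is a sketch, not part of Tesseract: run_quiet is a hypothetical helper, and discarding stderr as well as stdout is an added assumption in case a given build prints its banner there.

```python
import subprocess


def run_quiet(command):
    """Run `command` with stdout and stderr discarded; return its exit code."""
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,  # the redirect suggested in the post
        stderr=subprocess.DEVNULL,  # assumption: the banner may go here instead
        check=False,                # let the caller inspect the exit code
    )
    return completed.returncode
```

For example, run_quiet(["tesseract", "page.tif", "page"]) would run OCR on a hypothetical page.tif without printing the title; the recognized text still ends up in page.txt.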

Re: Tesseract Dll For Visual Basic Express 2008

2011-12-30 Thread Zdenko Podobný
http://groups.google.com/group/tesseract-dev/browse_thread/thread/75be5c97eb4d1b3c Tom expects he can finish it (add some docs + packaging) after the beginning of the year. But the dll seems to work. :-) At least nobody reported otherwise. Zdenko Dňa 28.12.2011 15:02, Lahiru Himash Madusanka

Re: Version 3.02 in alpha

2012-02-04 Thread Zdenko Podobný
You are not able to compile any c++ program on linux from source. Teaching you how to compile source is out of tesseract's scope. You should first read some manual on how to compile a program from source. Zdenko Dňa 04.02.2012 08:31, Sriranga(78yrsold) wrote / napísal(a): Zenko, Thanks for the

Re: Tess 3.02 English training set broken?

2012-02-05 Thread Zdenko Podobný
Can you please provide more details (OS, compiler, how you run/use tesseract)? Zdenko Dňa 05.02.2012 15:38, patrickq wrote / napísal(a): I am running the latest Tess 3.02 with the new English training set and get the following crash at init with lang: actual_tessdata_num_entries_<= TESSDATA_

Re: Decreased accuracy after training for specific characters

2012-02-12 Thread Zdenko Podobný
Hi Chris, I have the same experience - that leads me to the conclusion that it does not make sense to train "common" fonts... I think google uses a different process (more detailed; more/other tools?) compared to the information available on the wiki... IMHO the situation is improving with each release, so I wait f

Re: Error during using tesseract-ocr

2012-03-06 Thread Zdenko Podobný
This is a leptonica error message that indicates a problem with image support. If you have the version from today, please send the output of: tesseract -v Zdenko Dňa 06.03.2012 15:42, Ivan Mushketik wrote / napísal(a): > Hello. > > I tried to run tesseract-ocr v 3.02 with the following params: > tesseract pho

Re: .net 3 Confidence is always 0

2012-03-16 Thread Zdenko Podobný
Dňa 14.03.2012 19:49, Curtis wrote / napísal(a): > I am using the vs 3 .net wrapper. > When I run the function Recognize it ocrs the image fine and I can get > the string. > I need the confidence level of each character, but it is always 0. > What am I doing wrong? > > > > Dim image As New

Re: Version 3.02 in alpha

2012-03-18 Thread Zdenko Podobný
Hi, you did not give all details, so I need to guess some of them: 1. I guess that you run something like this: $ tesseract binarized.jpg content -l deu but you created the box file with the command $ tesseract binarized.jpg binarized makebox if yes, then the difference is in the used language file

Re: tesseract under windows and paths

2012-03-22 Thread Zdenko Podobný
Hi Simon, I implemented "--disable-tessdata-prefix" for configure in revision 708. That means if you build tesseract with this option, TESSDATA_PREFIX is not set during the build process to the installation directory (usually /usr/share or /usr/local/share on linux). I tested it in mingw+msys on Windows

Re: Enchancing "half-toned" image for tesseract processing

2012-03-31 Thread Zdenko Podobný
Dňa 31.03.2012 15:59, klo wrote / napísal(a): > > I have a scanned PDF material to which I want to add hidden text layer, so > I could index the document. I used ghostscript black and white tiff output > device (tiffg4) to extract pages as tiff images, and here is example of > what they look

Re: How to instruct tesseract not to use ligatures (i.e. don't use ﬁ, ﬂ... instead fi, fl...)

2012-03-31 Thread Zdenko Podobný
Dňa 31.03.2012 16:17, klo wrote / napísal(a): > In my simple testing, I find this most common problem, is there a way to > instruct tesseract not to use those glyphs without limiting it to ASCII? > > I use tesseract 3.01 BTW > put them on the blacklist with the variable tessedit_char_blacklist (search f
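
As a sketch of the blacklist approach, the Python snippet below builds the text of a config file that sets tessedit_char_blacklist to the common Latin ligatures. The choice of ligature characters and the function name blacklist_config are assumptions; only the variable name comes from the post.

```python
# Unicode ligatures ff, fi, fl, ffi, ffl (U+FB00 to U+FB04); an assumed list.
LIGATURES = "\ufb00\ufb01\ufb02\ufb03\ufb04"


def blacklist_config(chars=LIGATURES):
    """Return one config-file line: the variable name, a space, then its value."""
    return f"tessedit_char_blacklist {chars}\n"
```

Saved as UTF-8 under a hypothetical name such as noligatures, the file would be passed as a config argument at the end of the tesseract command line.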

Re: Different output for almost identical images

2012-04-06 Thread Zdenko Podobný
Dňa 06.04.2012 17:35, Rufus wrote / napísal(a): > Thanks for the reply. > > I've tried another image(bad2.tiff), which is still a bit different from > good.tiff, and is of the same order regarding the compression ratio. > However, tesseract still doesn't output anything for bad2.tiff. > I then tr

Re: segfaulting again (svn 3.02)

2012-04-12 Thread Zdenko Podobný
Dňa 12.04.2012 18:09, Falke wrote / napísal(a): > i hope this posts in the right order > > addendum to my reply from 30 minutes ago: I rebuilt the bad build. > Didn't help; same error. > Does it mean it segfaults? Can you provide more info (OS, platform, how you run OCR...)? I tried it (tesserac

Re: Tesseract Classification

2012-04-13 Thread Zdenko Podobný
2.01 is too old. So I would suggest using the tesseract executable only, or upgrading to tesseract 3.02. Save your data as an image (in a format that is recognized by tesseract 2.01) and run: tesseract image_file output_file Zdenko Dňa 13.04.2012 06:20, Ankur Rana wrote / napísal(a): > Please see the att

Re: Configuration / Documentation

2012-04-13 Thread Zdenko Podobný
Dňa 13.04.2012 09:20, troplin wrote / napísal(a): > Hello, > > is there any documentation about config files and configuration variables? > I am especially interested in a list of the most important/useful variables > from a user point of view. > > Regarding config files and API, is the "api_con

Re: Windows newline?

2012-04-25 Thread Zdenko Podobný
Dňa 25.04.2012 14:06, Nonmaskable Interrupt wrote / napísal(a): > I just built 3.02 from svn using VS2008 and it seems to work fine, except > that newline characters > are Linux standard ('\n') instead of windows ('\r\n') standard. This is a > change from previous behavior; > is it intentional?
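
Whatever the answer, the line endings of the OCR output can be converted afterwards. A minimal Python sketch, with the hypothetical helper name to_crlf:

```python
def to_crlf(text):
    """Convert all line endings in `text` to Windows style (CRLF)."""
    # Normalize existing CRLF pairs first so they are not doubled.
    return text.replace("\r\n", "\n").replace("\n", "\r\n")
```

When reading and writing the files for this conversion, open them with newline="" so that Python does not translate the line endings a second time.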

Re: how to know the unlabelled blobs location!?

2012-05-03 Thread Zdenko Podobný
Dňa 02.05.2012 16:46, nkantan r wrote / napísal(a): > hi all! > find below the log on generating a tr file; > > > Page 0 > APPLY_BOXES: >Boxes read from boxfile:3312 >Boxes failed resegmentation: 0 >Found 3312 good blobs and 3 unlabelled blobs in 0 words. >0 remainin

Re: Tesseract set accuracy as speed

2012-05-09 Thread Zdenko Podobný
I would expect "For English (and a few other languages)" = "that in svn" -- Zdenko Dňa 06.05.2012 13:40, Sriranga(78yrs) wrote / napísal(a): > which are *few other* languages likely have cube module is trained as well > as tesseract?Accucary is preferred to speed. > > On Sun, May 6, 2012 at 1:18

Re: Error In Code

2012-05-17 Thread Zdenko Podobný
What kind of problem? IMO the original reporter did not reveal what the real problem was... ;-) -- Zdenko Dňa 16.05.2012 21:24, Aaron Campos wrote / napísal(a): > I have the same problem. how resolve it? > > El viernes, 6 de enero de 2012 11:31:42 UTC+1, Lahiru Himash Madusanka > escribió: >> It has be

Creating searchable pdf with tesseract and pdfbeads

2012-05-27 Thread Zdenko Podobný
Maybe you can write a blog post (then post the link to the forum ;-) ) about the work-flow (needed changes, time spent at each step etc.) This could be useful also for non-tesseract communities. -- Zdenko Dňa 26.05.2012 09:01, Galt wrote / napísal(a): > Here's my pdf if anyone is interested: > > http://folkpla

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-27 Thread Zdenko Podobný
Well, thanks should go to David who fixed the code and Galt who reported/tested it. My problem (excluding lack of time ;-) ) is that there is no working hocr validity tool. hocr-tools[1] has something but it looks to have a problem with recent python PyXML[2] (I just did a quick test). I saw some attempts that rep

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-28 Thread Zdenko Podobný
Dňa 26.05.2012 23:09, Galt wrote / napísal(a): > Worderful news, Zdenko! > >> Yesterday David Eger commit patch that should fix tesseract-ocr hOCR output >> to follow hOCR spec. > I wonder what he did? see [1] and [2]. And I did today r729... We tested output with pdfbeads (1.0.9) and ExactImage

Re: Tesseract remove non-text regions

2012-06-26 Thread Zdenko Podobný
Dňa 26.06.2012 13:47, christy wrote / napísal(a): > You can call the following from baseapi.cpp as :- > Boxa* boxa = NULL; > Pixa* pixa = NULL; > Pix *pix = tesseract_->pix_binary(); tesseract_ is a protected object so it is not available. IMO it could be replaced with: Pix *pix = api->GetThresholde

Re: debug mode for tesseract 3.01

2012-06-27 Thread Zdenko Podobný
Dňa 21.06.2012 08:22, eva charles wrote / napísal(a): > Is the procedure to run the debug mode for tesseract 3.01 same as that > given in the ViewerDebugging wiki ( > http://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging)? > > Please help! > Yes. -- Zdenko -- You received this message bec

Re: Tesseract can read a pic, but not a manipulated one

2012-07-03 Thread Zdenko Podobný
Dňa 03.07.2012 10:32, Acanis wrote / napísal(a): > Hey, > > Ill attach some pics to show my problem. > I get a picture "Bohne.jpg", which is fine for tesseract. But the arrow get > sometimes a "1" and sometimes not! 1. I did not find an arrow in the English traineddata, so it will not be recognized correctl

Re: tesseract-ocr does not output desired results

2012-07-17 Thread Zdenko Podobný
Dňa 17.07.2012 02:32, Wei Liu wrote / napísal(a): > > My platform: Mac OS X 10.7.4 + Xcode 4.3.2 + OpenCV 2.4.0 > > > I want to use tesseract-ocr to recognize a few image (see attachment), and > I wrote a simple function to process the image using OpenCV, which is shown > as following > > > char*

Re: developing new program which passes memory buffer with OCR data to be recognized to tesseract library

2012-07-19 Thread Zdenko Podobný
Dňa 19.07.2012 03:32, newtotesseract wrote / napísal(a): > Hi, > > Thanks for the suggestion. > I found the thread "Include Tesseract in C++ > code" > closer to what I am looking for. > > But, did not get how to create static arch

Re: broken link: http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-3.02.tar.gz on http://tesseract-ocr.googlecode.com/svn/trunk/vs2008/doc/setup.html#initial-build-directory-setup

2012-07-26 Thread Zdenko Podobný
Dňa 26.07.2012 05:44, newtotesseract wrote / napísal(a): > Hi, > > Link > http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-3.02.tar.gz > referenced > on Setting up > Tesseract-OCR

Re: Feature request: ranking of dictionary word frequency

2012-08-23 Thread Zdenko Podobný
Dňa 23.08.2012 13:08, Nick White wrote / napísal(a): > A great addition to training would be if one dictionary file was > used, combining freq-words and all-words, and a relative frequency > probability score was given to each word. This would allow more > fine-grained scoring based on exactly how

Re: Extracting text from Gas Sign

2012-09-02 Thread Zdenko Podobný
Did you read the FAQ? Did you search this forum for image preprocessing? Which suggestions have you already tried? -- Zdenko Dňa 01.09.2012 22:17, Mark Stephens wrote / napísal(a): > Perhaps it was a poor assumption but I would have thought it would be > relatively easy to extract the text from a gas sign.

Re: does the latest version of Tesseract OCR perform auto rotation on image out of the box

2012-09-02 Thread Zdenko Podobný
Dňa 02.09.2012 13:40, js wrote / napísal(a): > *What steps will reproduce the problem?* > 1. Ran the tesseract.exe [v 3.0+] with populated engilsh tessdata > directory. > > tesseract.exe my_image.tif myimage_out -l eng -psm 0 > > > *What is the expected output? What do you see instead?* > the myi

Re: where is the tesseract 3.02 ?

2012-09-09 Thread Zdenko Podobný
yes, at the moment 3.02 is in svn repository at http://tesseract-ocr.googlecode.com/svn/trunk/ On 08.09.2012 21:54, fulberto100 wrote: 3.02 version is this? svn checkout *http*://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only On Saturday, September 8, 2012 12:38:28 PM UTC+3,

Re: Can tesseract accept regions?

2012-10-22 Thread Zdenko Podobný
On 22.10.2012 10:15, satuon wrote: Can I specify the regions where usable text is to tessaract, so it doesn't try to OCR the entire page? The page can contain pictures and other non-text areas. Yes you can - have a look at SetRectangle[1] in tesseract API [1] https://code.google.com/p/tessera

Re: Working directories

2012-10-22 Thread Zdenko Podobný
On 22.10.2012 15:04, José Luis Rey wrote: First Thanks for this fantastic project, I will try to collaborate all I Can!! It's possible to set working dirs on Command line? Can you please clarify your understanding of "working dirs"? -- Zdenko

Re: Can't make working tesseract in simple project vs2008 tess3.02.02

2012-11-04 Thread Zdenko Podobný
1. Download tesseract-ocr-3.02.02-win32-lib-include-dirs.zip and tesseract-ocr-API-Example-vs2008.zip

Re: Confidence in HOCR file

2012-11-14 Thread Zdenko Podobný
Word confidence in HOCR was implemented after releasing 3.02.02 ;-) so you need to check out the current svn and recompile. For char confidence you need to use the tesseract API (to program it). Search in this forum (maybe issues) for an example. -- Zdenko On 14.11.2012 15:40, José Luis Rey wrote: Hello
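
Once a build with word confidence is available, the values can be read back from the hOCR file. The Python sketch below assumes that each word is a span with class ocrx_word whose title attribute carries an x_wconf entry, which is how hOCR output commonly looks; the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class WordConfidenceParser(HTMLParser):
    """Collect (word, confidence) pairs from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []        # finished (text, confidence) pairs
        self._in_word = False  # True while inside an ocrx_word span
        self._text = []
        self._conf = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            # Assumed title layout: "bbox x0 y0 x1 y1; x_wconf NN"
            match = re.search(r"x_wconf\s+(\d+)", attrs.get("title") or "")
            self._conf = int(match.group(1)) if match else None
            self._text = []
            self._in_word = True

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._conf))
            self._in_word = False


def word_confidences(hocr):
    """Return a list of (word, confidence) tuples found in an hOCR string."""
    parser = WordConfidenceParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Words without an x_wconf entry are returned with a confidence of None, which is what output from builds that predate the feature would produce.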

Re: Having traindata files uncombined

2012-11-15 Thread Zdenko Podobný
Can you please use version 3.02 instead of 3.01 and write the exact error message? It is possible to copy text from the windows console - select the relevant text/lines with the left mouse button pressed, then click with the right mouse button outside of the selected text but inside the console window - highlight will di

Re: Can I configure Tesseract to *always* match a dictionary word?

2012-11-15 Thread Zdenko Podobný
Regarding "user_patterns_suffix" have a look at the tesseract manual page [1]. I am not sure if it is possible to force tesseract to choose the ocr output from the dictionary (I never tried it ;-) ) But you can increase the dictionary strength with the variables language_model_penalty_non_freq_dict_word and la

Re: Problem with ViewerDebugging with tesseract 3.02.02

2012-11-18 Thread Zdenko Podobný
Did you compile tesseract yourself? If yes, did you use the standard compilation process (autotools)? Which options did you use? Can you send me config.log? Or did you use the version from your distribution? Do you have only one installation of tesseract on your system? -- Zdenko On 18.11.2012

Re: problems with grayed background

2012-11-28 Thread Zdenko Podobný
On 28.11.2012 11:10, sascha4j wrote: i have trouble to ocr an image like in the attachment. only the word text is recognized. i tried several binarization algorithms, but without success. does it make sense to binarize the image ? or has tesseract it's own binarization? Yes, it has

Re: cntraining-debug.exe problem in setup tesseract3.01 with visual studio 2010

2012-11-30 Thread Zdenko Podobný
On 30.11.2012 09:02, Iris wrote: I follow the step according to http://code.google.com/p/tesseract-ocr/wiki/ReadMePre3 I first install the windows installer then unpack the source code, win_vs libraries and language datafile. But when I compile the tesseract project uner visual studio 2010, a win

Re: problem with LED-fonts recognition ;(

2012-12-06 Thread Zdenko Podobný
Thanks for correcting me - really I did not remember that information. -- Zdenko On 05.12.2012 08:06, Speedy wrote: Just check out Ray's October 2007 paper "An Overview of the Tesseract OCR Engine" where it says: The first step is a connected component analysis in which outlines of the compo

Re: Vector image input?

2012-12-14 Thread Zdenko Podobný
tesseract-ocr uses leptonica for image IO. The list of supported input types also depends on the leptonica configuration, e.g. if you did not compile jpeg support into leptonica, jpeg will not be supported in tesseract-ocr. So creating a list of supported types would be tricky. For possible supported type yo

Re: Training Tesseract for single digit

2013-01-08 Thread Zdenko Podobný
On 08.01.2013 17:13, sunitha raghurajan wrote: I am using Tesseract to read license plate. The tesseract is giving wrong output for digit six. My question is, Can I train the tesseract for single digit 'six'. Any help truly appreciated. Can you post an example of an image (with digit 6) that you tr

Re: Segmentation fault while making box file from .tif image

2013-01-21 Thread Zdenko Podobný
https://code.google.com/p/tesseract-ocr/wiki/FAQ#actual_tessdata_num_entries_<=_TESSDATA_NUM_ENTRIES:Error:Ass On 20.01.2013 15:43, Firas almannaa wrote: Hello, I'm trying to make

Re: Detecting Font size in android using Tesseract+Leptonica

2013-01-21 Thread Zdenko Podobný
See my test for font features[1]. It produces output for font size. [1] http://pastebin.com/0dV84hBa -- Zdenko On 21.01.2013 14:33, Karthik Kannan wrote: I'm making an android app to perform OCR on text using Tessearact and Leptonica(for binarization and Otsu thr

Re: How to use OCR in C++ project

2013-03-02 Thread Zdenko Podobný
On 01.03.2013 21:46, Vicky Patil wrote: Hi, I would like to ask whether anyone here knows how to use tesseract in c++? I need to do some character recognition but I do not know how to implement tesseract into my project. Should I be using the dll that comes with the download or I should import th

Re: How to get desire words coordinate from characters coordinate

2013-03-04 Thread Zdenko Podobný
Depending on your skills: a) You can analyze the spacing between boxes to identify words (if you want to use the box file) b) You can parse the tesseract hOCR output (if you do not know what hOCR is, search this forum) c) You can use the C++/C API of tesseract to create your own output - have a look at hocr
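Option (a) above can be sketched roughly as follows. This is only an illustration, not how tesseract itself segments words: it assumes the usual box-file layout of one glyph per line ("glyph left bottom right top page"), and the helper names and the gap_factor threshold are made up for this example and would need tuning for a real font.

```python
from statistics import median


def parse_box_line(line):
    """Parse one box-file line: '<glyph> <left> <bottom> <right> <top> <page>'."""
    parts = line.split()
    left, bottom, right, top = (int(v) for v in parts[1:5])
    return parts[0], left, bottom, right, top


def merge_boxes(boxes):
    """Join the glyphs of one word and compute the enclosing rectangle."""
    text = "".join(b[0] for b in boxes)
    left = min(b[1] for b in boxes)
    bottom = min(b[2] for b in boxes)
    right = max(b[3] for b in boxes)
    top = max(b[4] for b in boxes)
    return text, (left, bottom, right, top)


def group_into_words(box_lines, gap_factor=0.5):
    """Split consecutive character boxes into words at wide horizontal gaps.

    gap_factor is a hypothetical threshold: a gap wider than
    gap_factor * median glyph width starts a new word.
    """
    boxes = [parse_box_line(l) for l in box_lines if l.strip()]
    if not boxes:
        return []
    typical_width = median(b[3] - b[1] for b in boxes)
    words = []
    current = [boxes[0]]
    for prev, box in zip(boxes, boxes[1:]):
        gap = box[1] - prev[3]
        # A wide gap is a word break; a strongly negative gap means the
        # next glyph jumped back to the left, i.e. a new text line began.
        if gap > gap_factor * typical_width or gap < -typical_width:
            words.append(current)
            current = []
        current.append(box)
    words.append(current)
    return [merge_boxes(w) for w in words]
```

Each returned item is the word text plus its (left, bottom, right, top) rectangle in the same coordinate system as the box file.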

Re: Tesseract Variable - tosp_min_sane_kn_sp

2013-05-12 Thread Zdenko Podobný
On 10.05.2013 17:51, newbie wrote: Hello all, Could you please explain to me how we use the variable "tosp_min_sane_kn_sp", and what its value means? Please advise. Thanks. Can you please explain to me why you want to use something you have no clue about (and even you don'

Re: [tesseract-ocr] Empty page result. Bug?

2016-04-21 Thread Zdenko Podobný
Please read the wiki https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method Zdenko On Wed, Apr 20, 2016 at 10:37 PM, S.J. Becker wrote: > > I've attached two files. > > The first file is my original one. It returns empty page (with > eng.traineddata). > > I noti

Re: [tesseract-ocr] Empty page result. Bug?

2016-04-21 Thread Zdenko Podobný
On Thu, Apr 21, 2016 at 8:45 AM, ShreeDevi Kumar wrote: > Please file an issue on GitHub repo with these files so that it can be > looked at by the developers. > > Why? To waste their time??? E.g. the presented command does not work ('tesseract -c tessedit_create_tsv=1 tess_1_1b.tif tess'). If you re

Re: [tesseract-ocr] Empty page result. Bug?

2016-04-23 Thread Zdenko Podobný
On Thu, Apr 21, 2016 at 11:53 PM, S.J. Becker wrote: > > This page only shows the same list I've seen many times before without > any explanation: > > What does it mean when it says "script detection"? > See https://en.wikipedia.org/wiki/List_of_writing_systems and https://github.com/tesseract-ocr/t

Re: [tesseract-ocr] Empty page result. Bug?

2016-04-23 Thread Zdenko Podobný
On Fri, Apr 22, 2016 at 12:27 AM, S.J. Becker wrote: > > I just did more testing. > > My one word or single character image works with > -psm 7 > -psm 8 > > my two or three lines of text image works with the default of > -psm 3 > as well as > -psm 4 > > They both seem to work with > -psm 6 > > I

Re: [tesseract-ocr] Error when running tesseract in arabic

2016-05-22 Thread Zdenko Podobný
This is not error output. It is a leptonica warning message that it cannot process the given image format in memory and has to use a temp file instead (AFAIK it is platform dependent). Zdenko On Sun, May 22, 2016 at 8:33 AM, wrote: > Hey guys, > I'm running this command: > tesseract photo.jpeg out -

Re: [tesseract-ocr] Error when running tesseract in arabic

2016-05-22 Thread Zdenko Podobný
Without the input image, nobody will help you. Zdenko On Mon, May 23, 2016 at 8:09 AM, wrote: > So... Do you have any idea why the output file is empty then? > > On Sunday, May 22, 2016 at 3:17:10 PM UTC+3, zdenop wrote: >> >> This is not error output. It is leptonica warning message that it can no

Re: [tesseract-ocr] configure: error: leptonica library missing -- FAQ is not working

2016-05-24 Thread Zdenko Podobný
Have a look at config.log for the reason (search for "pixCreate"). Zdenko On Tue, May 24, 2016 at 10:51 AM, Dennis Park wrote: > faq says: > CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" > ./configure > would work, but i still got the problem: > .. > checking for mbsta
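If config.log is long, the search suggested above can be done with a few lines of Python. This is just a convenience sketch: find_in_log is a hypothetical helper written for this example (not part of tesseract or autoconf), and it only prints the matching lines with some context so the compiler or linker error following the pixCreate check is easier to spot.

```python
def find_in_log(log_text, needle="pixCreate", context=2):
    """Return (line_number, surrounding_lines) for every line containing needle.

    line_number is 1-based; surrounding_lines includes up to `context`
    lines before and after the match.
    """
    lines = log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            hits.append((i + 1, lines[start:end]))
    return hits
```

Reading the file with open("config.log", errors="replace").read() and passing the text to find_in_log gives the matches; a larger context value shows more of the failing test program and the error message that comes after it.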

Re: [tesseract-ocr] configure: error: leptonica library missing -- FAQ is not working

2016-05-24 Thread Zdenko Podobný
Those lines are just a summary of the test above them: configure:16988: checking for leptonica configure:17007: result: yes configure:17009: checking for pixCreate in -llept configure:17034: g++ -o conftest -g -O2 -I/home/work/.jumbo/include -I/home/work/.jumbo/include/leptonica -L/home/work/.jumbo

Re: [tesseract-ocr] Getting a blank tessinput.tif file

2016-06-06 Thread Zdenko Podobný
Your leptonica build supports only a limited number of image formats. What image are you trying to process? Zdenko On Mon, Jun 6, 2016 at 1:08 PM, Ashish Goel wrote: > Hello All, > > I am trying to do OCR on a bunch of images. Getting some failures, and I > want to analyse them. > So, to do that, I am tr
