tif
> and lstmf files. Am I right? so where should I place this script file in
> tesseract? or should I directly run this before the generation of the
> box,tif and lstmf files? Please correct me if my understanding is wrong.
>
> Thank you.
>
> On Sat, Oct 5, 2019 at 10:55 PM Shre
see
https://github.com/UB-Mannheim/tesseract/wiki/Install-additional-language-and-script-models
On Tue, Oct 8, 2019 at 3:09 PM Leopold Hamminger
wrote:
> Thank you, Zdenko
>
> I downloaded tesseract and installed it on my PC running Win 10. tesseract
> --version returns: v5.0-0-alpha.20190708. -
See
https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
This was for Devanagari and Indic languages.
Also see
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-text-requirements
On Thu, Oct 10, 2019 at 12:45 PM peter bence
wrote:
> I'm wor
I suggest that you open an issue in the tesstrain repo.
The makefile does training from scratch. Is that what you wanted? Do you
have a large enough training text - how many lines? How many iterations for
training?
Eval Char error rate=133.3, Word error rate=96.875
That is a very high error rate.
@AlexanderP may be able to build one.
On Thu, Oct 10, 2019 at 8:09 PM 'Mario Trojan' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:
> Dear Community,
>
> what's the best way to use tesseract under CentOS 8 right now?
>
> Currently we're using the EPEL package under CentOS 7.7 (3.04.00)
opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
> dnf install tesseract
> dnf install tesseract-langpack-deu
>
>
> чт, 10 окт. 2019 г. в 18:31, Shree Devi Kumar :
>
>> @AlexanderP may be able to build one.
>>
>> On Thu, Oct 10, 2019 at 8:09 PM 'Mario
Replace AEN in your box files with AWN and rerun training, using the
original tif files
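For example, a rough sed sketch - assuming AEN here means the Arabic-Indic
digits U+0660-U+0669 and AWN the ASCII digits 0-9 (adjust to the characters
actually present in your box files):
# replace Arabic-Indic digits with ASCII digits in all box files
sed -i -e 's/٠/0/g' -e 's/١/1/g' -e 's/٢/2/g' -e 's/٣/3/g' -e 's/٤/4/g' \
    -e 's/٥/5/g' -e 's/٦/6/g' -e 's/٧/7/g' -e 's/٨/8/g' -e 's/٩/9/g' *.box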
On Mon, Oct 14, 2019, 12:16 Mobeen Ali wrote:
> Hello everyone! I'm stuck with a problem of creating a traineddata file
> that reads numerals in arabic and gives output in english numerals.
>
>- Input = A
sin) folder. But
> still same problem is there by giving warning message and normalization
> failed message [1]
>
>
>
> On Mon, 14 Oct 2019, 18:34 Shree Devi Kumar, wrote:
>
>> What about text in langdata_lstm?
>>
>> On Mon, Oct 14, 2019 at 2:44 PM Isurianurad
ue, 15 Oct 2019, 12:21 Shree Devi Kumar, wrote:
>
>> Check if you also have an installed version of tesstrain.sh?
>>
>>
>> On Tue, Oct 15, 2019, 11:26 Isurianuradha96
>> wrote:
>>
>>> I changed as you mentioned but giving the same warning as the
There are also third-party GUI front ends for Tesseract. The ones that I
have used at times are VietOCR and gImageReader.
On Wed, Oct 16, 2019, 17:13 Leopold Hamminger
wrote:
> I was new a few weeks ago and found tesseract quite easy to use. However,
> you should know the basics of console inpu
See
https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
On Fri, Oct 18, 2019 at 9:10 AM 'abram stern' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:
> Hi tesseract community,
>
> I'm working on a research project about OCR and I'm wondering where the
> includ
https://github.com/tesseract-ocr/langdata_lstm
has the files used.
On Fri, Oct 18, 2019 at 9:39 AM Shree Devi Kumar
wrote:
> See
> https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
>
>
> On Fri, Oct 18, 2019 at 9:10 AM 'abram stern' via tesser
You can try with uzn files. See https://jsoma.github.io/kull/#/
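A uzn file is plain text with one zone per line; if I remember the format
right it is left top width height followed by a label, and tesseract picks up
image.uzn automatically when you run it with --psm 4 (names below are only an
example):
# image.uzn
20  40  400  60  Text
20 120  400  60  Text
tesseract image.png out --psm 4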
On Fri, Oct 18, 2019 at 11:03 AM Rahul Dochak
wrote:
> Hi All,
>
> I have a task and I could see a way to approach this but i do not know
> how to ,what i am trying to do is this:
> I want to make a form recogniser and then extr
Rahul
>
> On Friday, October 18, 2019 at 11:16:54 AM UTC+5:30, shree wrote:
>>
>> You can try with uzn files. See https://jsoma.github.io/kull/#/
>>
>> On Fri, Oct 18, 2019 at 11:03 AM Rahul Dochak
>> wrote:
>>
>>> Hi All,
>>>
>>
Check your netspec. Does it meet the required VGSL spec? See the wiki for
details and the netspecs used for various languages.
On Fri, Oct 18, 2019, 15:07 Shubham Gupta wrote:
> Hi All
>
> I am training Tesseract for Perso-Arabic languages using my custom
> dataset. I get *Segmentation fault-core dumped
Please check Tesseract version on both with
tesseract -v
Share an example image and the output you received on Mac OS and Ubuntu.
On Wed, Oct 23, 2019, 00:46 Yu Wang wrote:
> Hi, I experienced the same as Karan reported. I first installed tesseract
> on my macbook pro, then later on an Ubunt
blicly.
>
> On Wed, Oct 23, 2019 at 3:10 AM Shree Devi Kumar
> wrote:
>
>> Please check Tesseract version on both with
>>
>> tesseract -v
>>
>> Share an example image and the output you received on Mac OS and Ubuntu.
>>
>>
>>
>>
Looks OK. The dimensions need to match the bounding boxes in your tif.
You can also extract the unicharset from the training text.
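For example, a sketch (check unicharset_extractor --help for the exact
options in your version; file names are placeholders):
unicharset_extractor --output_unicharset lang.unicharset --norm_mode 2 \
  lang.training_text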
On Thu, Oct 24, 2019, 15:00 Adam Funk wrote:
> Hi,
>
> I'm a bit confused by some of the comments in the tesseract
> documentation, issues, and wiki about the addition of
You are mixing legacy Tesseract training and LSTM training.
The traineddata and other files from jTessBoxEditor seem to be for the
legacy engine.
On Fri, Oct 25, 2019, 11:18 'ZenMaster181' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:
> Hi, I am new to this training tesseract.
> I
If you have the box and tiff files from jTessBoxEditor, you can use
https://github.com/tesseract-ocr/tesstrain for training.
However, training is needed only in special cases.
Have you tried with existing traineddata files?
On Fri, Oct 25, 2019 at 1:02 PM 'ZenMaster181' via tesseract-ocr <
tesseract
Please see
https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--4alpha--network-specification-for-tessdata_fast
Your traineddata file is of the legacy format. It will NOT work.
You are mixing many different approaches for training.
If you have box/tiff pairs, use the makefile from tesseract-ocr/tesstrain.
If you want to train from text and fonts, use
tesseract-ocr/tesseract/src/training/tesstrain.sh
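For the text-and-fonts route a typical tesstrain.sh invocation looks roughly
like this (directories are placeholders for your own setup):
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng \
  --linedata_only --noextract_font_properties \
  --langdata_dir ../langdata_lstm --tessdata_dir ./tessdata \
  --output_dir ~/tesstutorial/engtrain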
On Fri, Oct 25, 2019 at 2:57 PM 'ZenMaster181' via tesseract-ocr <
tesseract-ocr@g
Please open an issue in the tesstrain repository and include test data for
your issue to be reproduced, e.g. a sample of files that fail (text that
leads to the 4c lstmf files).
On Sun, Oct 27, 2019 at 10:16 PM J Adam Funk wrote:
> I have partly figured out what's wrong. From the 9 matching *.
Have you tried to OCR it character by character, using the appropriate psm?
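For single characters that would be --psm 10, e.g. on a crop containing just
one glyph (file name is only an example):
tesseract char_crop.png stdout --psm 10 -l eng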
On Tue, Oct 29, 2019, 09:42 Dave Wood wrote:
> I am trying to use Tesseract to OCR screen shots from various Windows
> applications. So essentially the data is a random collection of letters
> and numbers, not written words
It fails with the latest code.
See https://github.com/tesseract-ocr/tesseract/issues/2748
Try with an older commit.
On Tue, Nov 5, 2019, 11:32 Khangaroo wrote:
> Hi. I'm trying to create a fine-tuned model for Tesseract, but the
> tesstrain.sh script always appears to fail on "Phase E: Generatin
Do a Google search for uzn files; there are utilities to generate them.
On Tue, Nov 5, 2019, 14:20 Shree Devi Kumar wrote:
> It fails with latest code.
>
> See https://github.com/tesseract-ocr/tesseract/issues/2748
>
>
> Try with an older commit.
>
> On Tue, Nov 5, 2019, 11:32 K
See
https://stackoverflow.com/questions/34981144/split-text-lines-in-scanned-document
On Sat, Nov 9, 2019 at 3:10 AM Aaron Stewart
wrote:
> If you have any suggestions on how to split input images into individual
> text lines, I would appreciate it. I am able to use Python and OpenCV, but
> I
See
https://github.com/tesseract-ocr/tesseract/issues/2580#issuecomment-553393800
for
an example
On Wed, Nov 13, 2019 at 6:17 PM Kljuka Kljucavnicar
wrote:
> Hi,
> I would like to OCR an image with a single word on it and output .hocr
> file with coordinates of each character on that image (norm
Process the same invoice image on both platforms with the tesseract command
line and compare those results.
Post the results of tesseract --version on each.
State which traineddata file you are using - the language is eng, but is it
from best/fast or tessdata… etc.
On Fri, Nov 15, 2019 at 5:18 PM MATHANKUMAR m
tesseract --version
Share output of above command on each platform.
Share an image and output on each platform.
On Sun, Nov 17, 2019 at 12:54 PM Mobeen Ali wrote:
> Hi everyone!
>
> i have successfully created my own custom traineddata file. I've done the
> training on ubuntu OS and it was giv
You can use --oem 0 and 2 only with the traineddata files from the tessdata
repo. Those are the only files which also include the legacy models.
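For example, assuming you have downloaded eng.traineddata from the tessdata
repo into a directory of your own (the path below is just a placeholder):
tesseract image.png out -l eng --oem 0 --tessdata-dir /path/to/tessdata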
On Tue, Nov 19, 2019, 11:07 MATHANKUMAR m wrote:
> I do facing an issue while using the OCR engine modes 0 & 2.
>
>
> Failed loading language 'eng',Tesseract c
to work with oem 0,2 values.
> what i supposed to do get a response from those values
>
> On Tuesday, 19 November 2019 12:12:20 UTC+5:30, shree wrote:
>>
>> You can use --oem 0 and 2 only with the traineddata file from tessdata
>> repo. Those are the only files which also ha
able then provide me.
> On Tuesday, 19 November 2019 12:23:37 UTC+5:30, shree wrote:
>>
>> If you so want, you can copy the legacy model files from the traineddata
>> in tessdata repo to another traineddata.
>>
>> See the combine_tessdata command for unpacking and combini
https://github.com/tesseract-ocr/tesseract/wiki/Compiling-%E2%80%93-GitInstallation
On Wed, Nov 20, 2019, 12:46 Essam Zaky wrote:
> Dears sorry for this basic question
> I'm new in Linux world
> now i need to build ,debug , and trace tesseract code and see how it's
> working step by step in linu
https://www.cs.cmu.edu/~gilpin/tutorial/
https://web.eecs.umich.edu/~sugih/pointers/summary.html
On Wed, Nov 20, 2019, 16:10 Essam Zaky wrote:
>
> Thanks Shree
> The link describes the build process
> ?but what is the IDE will be used to debug and trace the code ,In windows
&g
convert test1.png -despeckle -despeckle -despeckle -despeckle -despeckle
-despeckle -despeckle -despeckle -despeckle -despeckle miff:- | textcleaner
-f 25 -o 10 - result.png
convert -units PixelsPerInch result.png -resample 300 result1.png
tesseract result1.png -
27627
uses textcleaner from http:
ref: https://imagemagick.org/discourse-server/viewtopic.php?t=33628#p154457
On Sat, Nov 23, 2019 at 11:53 AM lucmaa wrote:
> Hi, shree
> Why is the option -despeckle repeated so many times in the command
> convert?
>
> On Friday, 22 November 2019 13:17:32 UTC+8, shree wrote:
>
Training for all languages, including RTL languages, is done in LTR order.
See https://github.com/tesseract-ocr/tesseract/issues/2082 and other
related issues on GitHub.
On Sun, Nov 24, 2019 at 1:28 AM Ishak DÖLEK wrote:
> Hi;
> I create a trainneddata for an Arabic font.
> I prepared the ara.train
Have you tried `osd` - orientation and script detection?
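For example, with osd.traineddata available, --psm 0 reports orientation and
script without doing full recognition (file name is only an example):
tesseract image.png - --psm 0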
On Mon, Nov 25, 2019 at 8:13 PM Jeetendra Ahuja <
jeetendra.ahuja...@gmail.com> wrote:
> So before processing a document, we want to rejects ones which are CJK so
> I've used Tesseract for this.. It does pretty good job but some times when
Also try with 300 dpi
On Mon, Nov 25, 2019 at 9:45 PM Jeetendra Ahuja <
jeetendra.ahuja...@gmail.com> wrote:
> Nopes, I will do it. Thanks.
>
> On Monday, November 25, 2019 at 9:48:08 AM UTC-5, shree wrote:
>>
>> have you tried `osd` - orientation and script detection?
tessdata supports both the legacy engine and the LSTM engine. tessdata_fast
and tessdata_best only support the LSTM engine.
To use tessdata_fast, use OEM engine mode 1.
On the command line it is --oem 1. Please look up the corresponding syntax.
On Sat, Dec 7, 2019, 14:06 NY C wrote:
> Hi, I am using tess-two
ocrEngineMode
On Sat, Dec 7, 2019, 14:35 Shree Devi Kumar wrote:
> tessdata supports both legacy engine and lstm engine. Tessdata_fast and
> tessdata_best only support lstm engine.
>
> To use tessdata_fast , use oem engine code 1.
>
> On command line it is --oem 1.
text2image is not for use with scanned images.
Please see the repo tesseract-ocr/tesstrain for training using images.
On Mon, Dec 9, 2019, 15:23 P007 wrote:
> Hi,
> I want to use tesseract-OCR for Hindi language working with images. after
> installation all steps when I tried to execute the com
Run tesseract --version on the different systems.
Are the traineddata files being used on the different systems the same?
Share an image and the different output received in each case.
On Mon, Dec 16, 2019, 17:58 adesh gautam wrote:
> Hi,
>
> I am using tesseract-ocr on my images, and i am gett
The Tesseract 4 LSTM engine and traineddata work on line images.
Character-level bounding boxes are not accurate, as has been reported in
multiple issues.
On Mon, Dec 16, 2019, 19:02 Mazzwar wrote:
> Supposing I have a dataset of images with bounding boxed words, is it
> possible to retrain the word
one with an lstm? Thanks
>
> On Monday, December 16, 2019 at 6:02:51 PM UTC+2, shree wrote:
>>
>> Tesseract 4 lstm engine and traineddata work on line images. Character
>> level bounding boxes are not accurate as has been reported in multiple
>> issues.
&g
12:47:28 PM UTC+5:30, shree wrote:
>>
>> Please check file sizes for eng.traineddata - they may be different
>> versions even though they are called the same.
>>
>> On Mon, Dec 16, 2019 at 9:06 PM adesh gautam wrote:
>>
>>>
>>> There is the
You can try to fine-tune tessdata_best/script/Arabic.traineddata for Ottoman.
If you have line images and their ground-truth transcription, you can use
the makefile process from tesstrain.
See https://github.com/tesseract-ocr/tesstrain/issues/128
Tesseract recognizes images to Unicode code points (UTF8
Please use https://github.com/tesseract-ocr/tesstrain
This works on line images and their ground-truth transcription.
On Windows, you could install WSL for running the *NIX scripts.
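A typical invocation is something like the following (variable names as in
the tesstrain Makefile at the time of writing; check its README for your
version):
make training MODEL_NAME=foo START_MODEL=eng TESSDATA=~/tessdata_best \
  MAX_ITERATIONS=10000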
On Thu, Dec 19, 2019 at 11:14 AM preeti padalia
wrote:
> Hi,
>
> We are using tesseract to perform actions and v
Check https://github.com/OpenITI/OCR_GS_Data/tree/master/AzTurkish/kulliyati
On Fri, Dec 20, 2019, 03:30 Serkan Taş wrote:
> Hi Shree,
>
> I checked git page you referred and need some time to prepare line images
> and their ground-truth transcription. I guess I can but will ta
You can create a traineddata file with --stop_training while lstmtraining
continues to run.
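For example (paths are placeholders; this is the usual checkpoint to
traineddata conversion):
lstmtraining --stop_training \
  --continue_from output/foo_checkpoint \
  --traineddata data/foo/foo.traineddata \
  --model_output output/foo.traineddata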
If you are using the tesstrain makefile, it has a target called traineddata
which will generate a traineddata file for each intermediate checkpoint.
You can stop and start training but I have a feeling that tra
Please see the repo tesseract-ocr/tesstrain, specifically wiki pages
regarding training for Fraktur.
On Fri, Dec 27, 2019, 00:51 Scott M. Sanders wrote:
> If you can't see the bad_rep.html, here is a pdf version.
>
> Le jeudi 26 décembre 2019 14:17:46 UTC-5, Scott M. Sanders a écrit :
>>
>>
>> I
Run the command
combine_tessdata -u eng.traineddata eng.
This will unpack all components of the traineddata file, including the
lstm-unicharset.
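After unpacking you should see the components as separate files with the
eng. prefix; the exact list depends on the traineddata, but for an LSTM model
it should include something like
eng.lstm  eng.lstm-unicharset  eng.lstm-recoder  eng.lstm-word-dawg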
On Fri, Dec 27, 2019, 14:27 Ashwini Nande wrote:
> How to generate lstm-unicharset for tesseract 4?
>
Formatting info is not retained in Tesseract 4. It was available in 3.0x.
On Fri, Dec 27, 2019, 22:29 Scott M. Sanders wrote:
> I added the following code, which has improved the results. I thought that
> adding 'alto' would create an xml file with formatting information, but it
> didn't work. Is
Please also provide tesseract version information from a machine where it
is working.
On Sat, Jan 4, 2020 at 1:51 PM Votum V wrote:
> I've been using tesseract for a while now to read text from images that I
> take with a script for a game I am automating. I recently had to do a fresh
> install
2lib/1.0.6 liblz4/1.7.5
>
>
>
> On Saturday, January 4, 2020 at 4:44:39 AM UTC-4, shree wrote:
>>
>> Please also provide tesseract version information from a machine where it
>> is working.
>>
>> On Sat, Jan 4, 2020 at 1:51 PM Votum V wrote:
>>
>>>
Thanks for the info. It looks like a helpful set of tools.
Please confirm whether this is for training legacy tesseract and which
versions of tesseract are compatible with it.
On Sun, Jan 5, 2020, 02:22 Wincent Balin wrote:
> Hi all,
>
> I would like to announce pytesstrain, a collection of Tes
try --psm 6
ubuntu@tesseract-ocr:~/TEST$ tesseract lao.jpg -
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 197
Empty page!!
Estimating resolution as 197
Empty page!!
ubuntu@tesseract-ocr:~/TEST$ tesseract lao.jpg - --dpi 300
Empty page!!
Empty page!!
ubuntu@tesserac
ica-1.78.0
>>> libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 :
>>> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>> Found AVX2
>>> Found AVX
>>> Found SSE
>>> Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3
Have you tried OMP_THREAD_LIMIT=1?
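For example, to run single threaded:
OMP_THREAD_LIMIT=1 tesseract image.png out -l eng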
On Tue, Jan 7, 2020 at 4:18 AM George Varghese wrote:
>
> reason I want to do this :
>
> I found that sometime other processes which runs on the same server, gets
> an exit code of 255 and does not complete. So If I can limit the usage of
> tesseract to 2 core
Read your text file line by line and run text2image to create box/tif pairs,
similar to the following:
text2image --fonts_dir="$unicodefontdir" --text="${linetext}" \
  --strip_unrenderable_words --xsize=2500 --ysize=300 --leading=32 \
  --margin=12 --exposure=0 --font="$fontname" \
  --outputbase="${fontname// /_}.exp0"
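A rough driver loop could look like this (only a sketch: it writes each line
to a temporary file because --text expects a file name, at least in the
versions I have used, and my_training_text.txt stands for your own text file):
n=0
while IFS= read -r line; do
  printf '%s\n' "$line" > line.txt
  text2image --fonts_dir="$unicodefontdir" --text=line.txt \
    --strip_unrenderable_words --xsize=2500 --ysize=300 --leading=32 \
    --margin=12 --exposure=0 --font="$fontname" \
    --outputbase="${fontname// /_}.exp$n"
  n=$((n+1))
done < my_training_text.txt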
't know how to run it and work with it, so please if you can help me to
> make a new traindata because I don't wanna use existing traindata!
> Thanks
>
>
> On Wednesday, January 8, 2020 at 8:35:56 AM UTC+3:30, shree wrote:
>>
>> Read your textfile line by line
>>
r/eng/eng.traineddata \
> > --model_output /data/output/mem.traineddata
> >
> > The file I'm using in --continue_from always has a fresh timestamp, but
> > the other checkpoints (with numbers in the filenames) in the same
> > directory are quite old.
>
try hocr output as follows
tesseract choices.png choices -c lstm_choice_mode=2 hocr
On Thu, Jan 9, 2020 at 11:43 AM 叶新舟 wrote:
> Hi:
>I found that tesseract by default return a recognize result (a single
> char for example) with the maxinum confidence,
>yet in my case, I want a list (
The output is UTF-8. How are you opening it? What is your locale?
On Thu, Jan 9, 2020 at 5:37 PM Manankumar Bhatt
wrote:
>
> I am running command "Tesseract image.jpg output -l eng -psm 6" which
> generates output.txt file.
>
> On Thursday, 9 January 2020 15:15:29 UTC+5:30, universal reseller wrote:
Tesseract reads only image files, not PDF. You can convert the PDF to images
(tif, png) and OCR those.
Or use wrappers around Tesseract which take a PDF and convert it to text.
Look under Add-Ons in the wiki.
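For example, with poppler-utils (file names are placeholders; pdftoppm
numbers the output pages, e.g. page-1.png, page-2.png):
pdftoppm -r 300 -png input.pdf page
tesseract page-1.png out -l eng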
On Mon, Jan 13, 2020, 00:31 'pjfarley3' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:
i text.
>
> Library : Tess-Two
>
> Platform : Android
>
> How i can fix the problem related to spaces. Hereby, attaching a
> screenshot, input and output text.
>
> Regards
>
> On Tuesday, May 29, 2018 at 4:33:43 PM UTC+5:30, shree wrote:
>>
>> set the config vari
Take a look at tesseract-ocr/tesstrain
On Tue, Jan 14, 2020 at 10:13 PM 'Fabio Lugli' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:
> Hello everyone, i'm trying to train tesseract on handwriting, knowing that
> it's not the best option, using the latest version available for Windows.
Please share a couple of lstmf files for testing.
On Wed, Jan 15, 2020 at 8:03 PM 'Fabio Lugli' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:
> After some work i am able to:
> - Use the method *lstmbox* of *tesseract.exe* to obtain the *.box* files
> of my *.tif* images
> - Use the t
ooglegroups.com> wrote:
> Yes, i forgot to do it in the latest post. I share a couple of the images
> and their correspondant .*box *and .*lstmf *files. The others that i
> tried until now are very similar to these ones.
>
> Il giorno mercoledì 15 gennaio 2020 15:38:23 UTC+1, sh
ase I can simply copy those file in the folder?
>
> Il giorno giovedì 16 gennaio 2020 10:45:59 UTC+1, shree ha scritto:
>>
>> Are you sure you have the files in the right places? It seems to work for
>> me...
>>
>> ubuntu@tesseract-ocr:~/tesseract$ cd ../TEST/
at this is not what i should have inside *all-lstmf*
> ?
>
> Il giorno giovedì 16 gennaio 2020 12:04:50 UTC+1, shree ha scritto:
>>
>> tesseract unpack is a new feature by @stweil - not yet in the master
>> branch. I was testing to see that your lstmf files are read corre
ittsskell from*
> *Iteration 0: BEST OCR TEXT : k MOVE t0 stoe Mr. GarkkeldR Prom*
> *File eng.test.pro0.lstmf line 0 :*
>
> And then nothing. It opens a new terminal prompt. Could it be using
> windows the cause of this issue?
>
> P.S. Thank you for all your time that you pass answ
orno giovedì 16 gennaio 2020 14:26:50 UTC+1, shree ha scritto:
>>
>> I haven't trained on windows. If you want to do training, it will be
>> better to use Linux.
>>
>> On Thu, Jan 16, 2020 at 6:30 PM 'Fabio Lugli' via tesseract-ocr <
>> tesser
dd ons" part of the wiki doesn't actually have
> a PDF-to-OCR'ed-text wrapper as far as I can see.
>
> Still searching for a solution, but thanks for trying to help.
>
> Peter
>
> On Monday, January 13, 2020 at 1:49:31 AM UTC-5, pjfarley3 wrote:
>>
>
Verify that you don't have an older version of tesstrain.sh.
Try using tesseract/src/training/tesstrain.sh
and see if maxpages takes effect.
On Sat, Jan 18, 2020 at 1:47 PM Fil wrote:
> I'm trying to figure out how to train tesseract from scratch using
> auto-generated box/tif/lstm files. I've be
See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#ocr-results
Sometimes using multiple models (last three) from training gives better
results.
On Mon, Jan 20, 2020 at 1:52 PM 'Fabio Lugli' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:
> After working a couple of days on
Please share your input files to see if I can replicate this.
of pages I specified, not just
>> generate exactly what's in the eng.training_text file and nothing more/less.
>>
>> On Tuesday, January 21, 2020 at 10:37:41 PM UTC-8, shree wrote:
>>>
>>> Please share your input files to see if I can replicate this.
>>&g
Is there a Unicode font for Modi script?
On Sun, Jan 26, 2020, 21:22 'Nilambari Joshi' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:
> Hi... I want to create Modi script (Marathi language) traineddata in
> tesseract for OCR. Can somebody guide what steps should I follow.
> I referred
Thanks for the link to the Modi Unicode font.
I would convert the Marathi training text to Modi script (using Aksharamukha)
and then train using the Unicode font.
On Sun, Jan 26, 2020 at 10:28 PM Patrick CHEW
wrote:
>
> On Jan 26, 2020, at 08:16, Shree Devi Kumar wrote:
>
> Is there a
Not all viewers work alike. Try with the free Adobe Acrobat Reader or the
viewer in Chrome.
When I last checked, most readers/viewers will select and search text in
Tesseract-generated PDFs. Many times the highlighting of the selection is
incorrect, but if you copy and paste, all recognized text should b
ance.
>
> On Sunday, January 26, 2020 at 12:26:51 PM UTC-5, shree wrote:
>>
>> Thanks for the link to Modi Unicode font.
>>
>> I would convert the Marathi training text to Modi script (use
>> Aksharamukha) and then train using the unicode font.
>>
>> O
Please see https://github.com/Shreeshrii/tesstrain-ckb. It uses a modified
training text based on what you sent and earlier text that I had from
Pewan and other corpora.
Currently the training data includes
* AWN 0-9
* AEN - Arabic numbers
* No Persian numbers since some shapes are similar to Arab
-txt2img.sh
https://github.com/Shreeshrii/tesstrain-ckb/blob/master/3-img2lstmf.sh
https://github.com/Shreeshrii/tesstrain-ckb/blob/master/4-train-layer.sh
On Tue, Jan 28, 2020 at 12:08 PM manu pranay
wrote:
> shree,
> can you please help me out how to perform arabic training on tesse
Please see https://github.com/tesseract-ocr/tesstrain/wiki
There are already newly trained models by @stweil for Fraktur.
On Tue, Jan 28, 2020, 22:46 Val LNB wrote:
> *How to perform incremental training on Tesseract 4.0+?*
>
>
> I want to improve the existing fraktur (frk) model with some 6000
. Pango suggested font 'MarthiCursiveT Medium'*
>
> Please advise for both the queries.Thanks in advance
>
> On Monday, January 27, 2020 at 3:22:17 AM UTC-5, shree wrote:
>>
>> For LSTM training punc, numbers, wordlist are NOT required. You can add
>> them if y
The default language that tesseract uses when none is specified is eng.
Hence you get a box file with English characters.
There is currently no `Modi` traineddata, so you can't use that. You could
use `-l mar` to use Marathi but obviously the recognition will not be
correct.
I suggest that you use
>
> Interestingly, .png failes are used when running training so I could have
> perhaps skipped conversion to .tif since I started with .png! :)
>
> Now, the big question, how long will it take to run 10,000 epochs on
> average 4 core Xeon v3 VM?
>
>
>
>
>
>
ing tesseract
>> with images. Thanks once again
>>
>>
>>
>> On Friday, January 31, 2020 at 12:39:31 AM UTC-5, shree wrote:
>>>
>>> Please see https://github.com/Shreeshrii/tesstrain-modi for finetune
>>> training for Modi from Marathi using synt
If you send a couple of scanned images with their ground truth
transcription and box files, I can test with that and suggest next steps.
On Sat, Feb 1, 2020, 09:28 Shree Devi Kumar wrote:
> tesseract-ocr/tesstrain repo has makefile for training with images.
>
> See
> https:
The version of Leptonica that you have
leptonica-1.79.0
libpng 1.2.50 : zlib 1.2.8
only has support for png. All other image formats will fail.
You need to change the Leptonica build to include libtiff etc.
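On Debian/Ubuntu that would be roughly the following (package names may
differ on your distro):
sudo apt-get install libtiff-dev libjpeg-dev libpng-dev zlib1g-dev libwebp-dev
cd leptonica-1.79.0
./configure
make && sudo make install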
On Sat, Feb 1, 2020, 05:48 lundissimo wrote:
> Thank you for that link. I hadn't retrieved the file f
https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00#training-just-a-few-layers
On Sat, Feb 1, 2020 at 11:33 AM manu pranay
wrote:
> Thank you so much for your help shree.
> the links you provided were very helpful for me.
>
> now i am trying to train lstm training wit
data/modi/list.eval \
--max_iterations 99
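For reference, those two lines are the tail of an lstmtraining call; the full
command for replacing the top layer looks roughly like this (paths are
placeholders, flags as in the 'train just a few layers' section of the
training wiki):
lstmtraining \
  --continue_from data/eng.lstm \
  --append_index 5 --net_spec '[Lfx192 O1c1]' \
  --model_output data/modi/layer \
  --traineddata data/modi/modi.traineddata \
  --train_listfile data/modi/list.train \
  --eval_listfile data/modi/list.eval \
  --max_iterations 99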
On Sat, Feb 1, 2020 at 11:33 AM manu pranay
wrote:
> Thank you so much for your help shree.
> the links you provided were very helpful for me.
>
> now i am trying to train lstm training with retraining the top layer.
> can you please pro
For Debian you can also get the latest packages from
https://notesalexp.org/tesseract-ocr/
On Sat, Feb 1, 2020 at 10:56 AM Shree Devi Kumar
wrote:
> The version of leptonica that you have
>
> leptonica-1.79.0
> libpng 1.2.50 : zlib 1.2.8
>
> Only has support for png. Al
https://github.com/impactcentre/ocrevalUAtion
https://github.com/eddieantonio/ocreval
https://github.com/tesseract-ocr/tesstrain/wiki/German-Konzilsprotokolle
On Sat, Feb 1, 2020 at 4:31 PM manu pranay wrote:
> thank you shree.
> I am done with my retraining top layer training with
-modiLayer_1.017_157724_324000/report_modiLayer_1.017_157724_324000-modi-ALL.txt
for an example
Do you have a workflow for tesseract training using your tools? If so, I
would like to add/refer to it in Tesseract documentation.
On Tue, Feb 4, 2020 at 2:06 AM Wincent Balin
wrote:
> Hi Shree,
>
> I am
>
> By the way, I added a create_ground_truth utility, which creates .gt.txt
> files as well as the associated .tif files for every specified font, to
> the package. I think it could be useful for anyone who does not have a
> ground truth collection yet.
>
> Thanks, I tried it with latest tesseract
Re: max threads, please see
https://github.com/tesseract-ocr/tesseract/issues/263#issuecomment-455614504
I will test the new scripts later and report back
On Mon, Feb 10, 2020 at 12:28 AM Wincent Balin
wrote:
> Hello Shree,
>
> I just uploaded new version of the package. About the fix
Hello Wincent,
Thanks for the new version of the package.
No errors regarding fonts now, and it is not slow either.
Tested on Ubuntu.
On Mon, Feb 10, 2020 at 12:28 AM Wincent Balin
wrote:
> Hello Shree,
>
> I just uploaded new version of the package. About the fixes:
>
> 1. --fonts_d