Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-19 Thread Shree Devi Kumar
On Sat, Sep 19, 2020 at 10:15 PM Shree Devi Kumar wrote: > > Each of my PNG files have file names that indicate ground truth, and I > have a little script that generat

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-20 Thread Shree Devi Kumar
7 --oem 1 -c > > tessedit_char_whitelist=',0123456789' > 638,997.png out > Failed to load any lstm-specific dictionaries for lang swtor!! > Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica > Warning: Invalid resolution 0 dpi. Using 70 instead. > >

Re: [tesseract-ocr] Making a serachable PDF.

2020-09-25 Thread Shree Devi Kumar
Try a front end such as gImageReader (a GUI) or OCRmyPDF (a command-line tool). Tesseract does not take PDF as input. On Fri, Sep 25, 2020, 12:58 Arvind Mahesh wrote: > > Complete programming noob, so please pardon my ignorance. > I really want to convert this PDF into a searchable pdf but I barely > understand any

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-27 Thread Shree Devi Kumar
Thank you for sharing the results of your trial with fine-tuning and getting better results with the official traineddata after pre-processing the images. Hope your notes will help other users with similar questions. On Sun, Sep 27, 2020, 20:51 Grad wrote: > @shree thank you for the advice,

Re: [tesseract-ocr] Diacriticals Training

2020-09-27 Thread Shree Devi Kumar
I am currently running a training run based on synthetic training data for Sanskrit to support both Devanagari script with vedic accents as well as iAST (Roman with diacritics support). I will share the traineddata for you and others who are interested to test how well it works with real life image

Re: [tesseract-ocr] unable to install tesseract-ocr in RHEL

2020-09-29 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki#rhelcentosscientific-linux-fedora-opensuse-packages On Tue, Sep 29, 2020, 12:53 Yeshwant Kumar wrote: > Hi folks, > > We are having a problem in building docker image. > > > > When base image of docker is *ubuntu:focal* we are able to install

Re: [tesseract-ocr] Diacriticals Training

2020-10-01 Thread Shree Devi Kumar
Please read tesseract documentation regarding lstm training by replacing a layer. On Thu, Oct 1, 2020, 11:29 shreyansh dwivedi wrote: > Hello Shree, > Firstly, thank you for looking into it. Secondly, I would be grateful if > you share the piece of code with the explanation part of how

Re: [tesseract-ocr] Tesseract failing for very clear image

2020-10-05 Thread Shree Devi Kumar
Try adding a small white border to the image and see if that helps. Also try --psm 6. On Mon, Oct 5, 2020, 11:00 Guillaume Bersac wrote: > Hello, > > ### Environment > **Tesseract Version**: > tesseract 4.0.0 > leptonica-1.76.0 > libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : > libtiff 4.
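
The white-border suggestion can be illustrated with a minimal sketch. It assumes the image is held as a plain list of rows of 8-bit grayscale values, which is only an illustration; the function name `add_white_border` and the default white value of 255 are hypothetical, and a real workflow would normally do the padding with an image library or editor.

```python
from typing import List


def add_white_border(pixels: List[List[int]], border: int, white: int = 255) -> List[List[int]]:
    """Return a copy of a grayscale image padded with `border` white pixels on every side.

    `pixels` is a list of equal-length rows; the 255-is-white convention is an assumption.
    """
    if border < 0:
        raise ValueError("border must not be negative")
    if not pixels:
        return []
    width = len(pixels[0])
    blank_row = [white] * (width + 2 * border)
    # Blank rows on top, then each original row padded left and right, then blank rows below.
    padded = [list(blank_row) for _ in range(border)]
    for row in pixels:
        padded.append([white] * border + list(row) + [white] * border)
    padded.extend(list(blank_row) for _ in range(border))
    return padded
```

A one-pixel image with a one-pixel border becomes a 3x3 image with the original pixel in the centre.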

Re: [tesseract-ocr] PyTesseract not recognizing decimal points

2020-10-06 Thread Shree Devi Kumar
Have you tried cropping the image to remove the arrowhead to see if that improves the result? On Tue, Oct 6, 2020 at 9:42 AM Andrew wrote: > As per my question on StackOverflow: PyTesseract not recognizing decimals >

Re: [tesseract-ocr] Diacriticals Training

2020-10-08 Thread Shree Devi Kumar
On Mon, Sep 28, 2020 at 12:19 PM Shree Devi Kumar wrote: > I am currently running a training run based on synthetic training data for > Sanskrit to support both Devanagari script with vedic

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-10-10 Thread Shree Devi Kumar
n is if i have a substantial > amount of images and then process and produce the line image and ground > truth from it- will that help me in improving the detection? > > On Sunday, September 27, 2020 at 9:21:17 PM UTC+6 Grad wrote: > >> @shree thank you for the advice, it wa

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-10-11 Thread Shree Devi Kumar
Tesseract will make a checkpoint, if needed, every 100 iterations, so I suggest a minimum of 50-100 line images to test finetuning. Also, one of your image samples has a lot of noise on the right side. Crop all extra parts. Also for `ben` you should choose the Indic language option in tesstrain. On S

Re: [tesseract-ocr] training doubts

2020-10-20 Thread Shree Devi Kumar
For English, most of the time, preprocessing your images and using the official traineddata will give better results than trying to do training. For finetuning (https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#fine-tuning-for-impact), what is recommended is using the existing trai

Re: [tesseract-ocr] add new characters

2020-10-24 Thread Shree Devi Kumar
Ray has suggested using plus-minus type of training for adding a couple of characters to the traineddata. Did you try that? Please share the training data you used (box/tiff pairs or lstmf files). I have done replace-a-layer training for Sanskrit. It adds the two characters you want (in addition

Re: [tesseract-ocr] Fwd: Training tesseract OCR

2020-10-30 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesstrain On Sat, Oct 31, 2020 at 9:54 AM bosh sherikar wrote: > Please Reply back > > -- Forwarded message - > From: bosh sherikar > Date: Tue, Oct 13, 2020 at 10:42 PM > Subject: Training tesseract OCR > To: > > > Dear community, > > I ha

Re: [tesseract-ocr] URGENT DEADLINE: NEED HELP WITH NEW LANGUAGE, PLEASE RESPOND

2020-10-31 Thread Shree Devi Kumar
Are you trying to train for the legacy tesseract engine? On Sun, Nov 1, 2020, 03:29 Cailey McVay wrote: > Hello! > I am working on a project that is trying to read borehole video depths. We > trained a new language to read these numbers called NTS. When we use > tesseract on the images without t

Re: [tesseract-ocr] URGENT DEADLINE: NEED HELP WITH NEW LANGUAGE, PLEASE RESPOND

2020-10-31 Thread Shree Devi Kumar
>When we use tesseract on the images without the trained language we receive outputs that are accurate about 50% of the time. You haven't shared a sample image. Sometimes preprocessing the images, or using a whitelist in the case of a limited character set, can be the solution rather than training. On Sun,
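
As a rough illustration of the whitelist approach, the sketch below only assembles a command line; it does not run tesseract. It reuses the `--oem`, `-c` and `tessedit_char_whitelist` pieces that appear in an earlier snippet in this archive. The function name `build_tesseract_command` and the default values (`--psm 7`, `--oem 1`) are assumptions for illustration, not settings recommended in this thread.

```python
from typing import List


def build_tesseract_command(image: str, output_base: str, whitelist: str,
                            psm: int = 7, oem: int = 1) -> List[str]:
    """Assemble (but do not run) a tesseract call restricted to a character whitelist.

    The psm/oem defaults are illustrative assumptions; pick values that suit your images.
    """
    return [
        "tesseract", image, output_base,
        "--psm", str(psm),
        "--oem", str(oem),
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]
```

Passing the result as a list to `subprocess.run` avoids shell quoting problems with characters such as the comma in the whitelist.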

Re: [tesseract-ocr] URGENT DEADLINE: NEED HELP WITH NEW LANGUAGE, PLEASE RESPOND

2020-11-01 Thread Shree Devi Kumar
ple image. I believe we are using the legacy > engine. Does this help? > > On Saturday, October 31, 2020 at 11:15:46 PM UTC-4 shree wrote: > >> >When we use tesseract on the images without the trained language we >> receive outputs that are accurate about 50% of the time. >

Re: [tesseract-ocr] Diacriticals Training

2020-11-05 Thread Shree Devi Kumar
ratch. ,. shapetable, tr etc are all files for legacy engine, 3.0x and before. It is supported in tesseract4 with --oem 0 On Thu, Nov 5, 2020, 17:14 Shree Devi Kumar wrote: > Are you trying to train for the legacy tesseract engine? > > On Thu, Nov 5, 2020, 16:46 shreyansh dwivedi >

Re: [tesseract-ocr] Low tesseract accuracy

2020-11-11 Thread Shree Devi Kumar
Suggest you pre-process images instead of training. See https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html On Tue, Nov 10, 2020 at 12:14 PM Dinesh Yakkanti wrote: > Hello Everyone, >I am trying to build custom tesseract-ocr model. I am getting > high error rate even if i kep

Re: [tesseract-ocr] Diacriticals Training

2020-11-12 Thread Shree Devi Kumar
for 4.0. you can try plusminus or replace top layer type of training. For good results you need a lot of training data, eg. 5 text lines. On Thu, Nov 12, 2020, 12:21 shreyansh dwivedi wrote: > Hello shree, > Than, what is the way to train the sanskrit along with roman diacritical

Re: [tesseract-ocr] Diacriticals Training

2020-11-12 Thread Shree Devi Kumar
On Thu, Nov 12, 2020 at 4:08 PM Shree Devi Kumar wrote: > Please see tesseract-ocr/tesstrain repo > > You need line images and their groun

Re: [tesseract-ocr] tesseract-ocr for train persian language

2020-12-11 Thread Shree Devi Kumar
I don't think jTessBoxEditor supports RTL languages like Persian. You can try using tesstrain.sh On Fri, Dec 11, 2020 at 8:57 PM alireza m wrote: > hi i want to train persian language by b nazanin font but jTessBoxEditor > doesn't have b nazanin font how can i add new font???

Re: [tesseract-ocr] Diacriticals Training

2020-12-14 Thread Shree Devi Kumar
Appreciate your offer to help and provide feedback as well as training data. Let me try to answer your queries: 1. > I have been using san. But was unaware that you can also use Devanagari. What is the difference? san has been trained for Sanskrit. But it is missing certain Devanagari character

Re: [tesseract-ocr] Tesseract Performance

2020-12-24 Thread Shree Devi Kumar
>testing an unseen image, the performance was exactly the same. Can you share the image (preferably a page) and expected result? On Thu, Dec 24, 2020 at 8:36 PM Soumik Ranjan Dasgupta < ranjansou...@gmail.com> wrote: > Hi everyone, > I wanted to do fine-tune the ben.traineddata model by using so

Re: [tesseract-ocr] Tesseract Performance

2021-01-01 Thread Shree Devi Kumar
Shreeshrii, > > Can you please tell me the training command used? Also, how can I create > the graphs and these other documents? > > On Sat, 26 Dec 2020, 18:37 Shree Devi Kumar, wrote: > >> Soumik, >> >> I used your groundtruth and trained using ben as the START

Re: [tesseract-ocr] Tesseract Performance

2021-01-01 Thread Shree Devi Kumar
data (not seen by lstmtraining either for training or eval) shows an improvement over both ben and script/Bengali. To improve results further, check the groundtruth transcription for any missing words, normalize the text and try with some more training data. On Fri, Jan 1, 2021 at 6:41 PM Shree Devi

Re: [tesseract-ocr] Tesseract Performance

2021-01-07 Thread Shree Devi Kumar
rt bidi.algorithm > ModuleNotFoundError: No module named 'bidi' > Makefile:207: recipe for target 'data/ben-ground-truth/24-022.box' failed > make: *** [data/ben-ground-truth/24-022.box] Error 1 > > I should mention I double checked the 24-022.gt.txt and 24-022.tif

Re: [tesseract-ocr] Tesseract Performance

2021-01-07 Thread Shree Devi Kumar
> ranjansou...@gmail.com> wrote: > >> Hi Shree, >> >> I installed the bidi module. The error went away, but the training does >> not happen again. Please find the log and training script attached. >> FYI I am using the makefile from the master branch. Do I n

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
Old versions of tesstrain.sh used to limit training to 3 pages. Looks like you may have an old version in the path somewhere. On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 wrote: > I have a script to train tesseract and I ran it on Arch Linux, Debian, and > even a docker container and they all produce

Re: [tesseract-ocr] Tesseract Performance

2021-01-07 Thread Shree Devi Kumar
Or you may have an old version of data/ben/checkpoints/ben_checkpoint

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
n > > On Thursday, January 7, 2021 at 11:01:55 AM UTC-6 shree wrote: > >> Old versions of tesstrain.sh used to limit training to 3 pages. Looks >> like you may have an old version in the path somewhere. >> >> On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 wrote: >> >

Re: [tesseract-ocr] Easily readable Russian not recognized in language app screenshot

2021-01-07 Thread Shree Devi Kumar
ar Russian, no-noise PNGs—and what could be done about it. >> >> On Thursday, October 8, 2020 at 7:08:28 AM UTC+2 shree wrote: >> >>> Give each region of interest separately.

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
ther 2 errors are occurring? > On Thursday, January 7, 2021 at 11:28:12 AM UTC-6 shree wrote: > >> Your training text file is only 175 lines, so the rendered image fits in >> 4 pages. You need to use a larger text if you want more pages. >> >> Also check that your fonts supp

Re: [tesseract-ocr] make training does nothing when run

2021-01-08 Thread Shree Devi Kumar
>After placing the groundtruth files in a folder called *data/foo-ground-truth* inside the main *tesseract *repo folder, data/foo-ground-truth needs to be under the tesstrain folder, not the tesseract folder. You can keep the ground truth in a different location; in that case you have to refer to it whi

Re: [tesseract-ocr] make training does nothing when run

2021-01-08 Thread Shree Devi Kumar
`tessdoc` repo. On Fri, Jan 8, 2021 at 9:05 PM Keith wrote: > Shree, > > Thank you for your reply. I should have gone to bed (it was like 2 AM my > time on a work night) instead of continuing to bang my head. > > When I saw your message this morning, I was thinking, "

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-12 Thread Shree Devi Kumar
; encoding string problem. I wonder if it's a problem with the unicharset > extractor? > On Monday, January 11, 2021 at 11:30:39 AM UTC-6 shree wrote: > >> Please see https://github.com/tesseract-ocr/tesseract/issues/3001 for >> updates >> >> On Saturday, Jan

Re: [tesseract-ocr] Beginner question : could not initialize tesseract, missing eng.traineddata file in tessdata

2021-01-19 Thread Shree Devi Kumar
>*wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata * That is not correct. You need to get the `raw` file. https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata *wget https://githu
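
The difference between the two links is that the `blob` URL returns GitHub's HTML page for the file, while the `raw` URL returns the file itself. A minimal sketch of the rewrite follows; the function name `github_blob_to_raw` is hypothetical, and the assumed path layout (`/<owner>/<repo>/blob/<branch>/<file>`) is inferred from the URLs in this message.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url: str) -> str:
    """Turn a github.com '/blob/' page URL into the matching '/raw/' download URL."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Assumed path shape: /<owner>/<repo>/blob/<branch>/<file path...>
    if parts.netloc != "github.com" or len(segments) < 6 or segments[3] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    segments[3] = "raw"
    return urlunsplit((parts.scheme, parts.netloc, "/".join(segments),
                       parts.query, parts.fragment))
```

The rewritten URL can then be handed to wget or any other downloader. A traineddata file saved from the `blob` link is an HTML page, not model data.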

Re: [tesseract-ocr] New release for tessdata_{fast,best}?

2021-01-27 Thread Shree Devi Kumar
>The Internet Archive has switched to using Tesseract for all our OCR, I am so happy to hear this. It will be great to have the Indic languages that were marked as non-ocrable so far be converted to text correctly on Internet Archive. Is there any page with instructions to do this? Can a language

Re: {EXTERNAL}[tesseract-ocr] Installing tessdata

2021-01-27 Thread Shree Devi Kumar
Please see https://tesseract-ocr.github.io/tessdoc/Data-Files.html Also the readme files in the three repos https://github.com/tesseract-ocr/tessdata_fast On Thu, Jan 28, 2021, 03:20 Peter Kronenberg wrote: > Hi, can someone help with these questions? Just trying to understand > better how

Re: [tesseract-ocr] Training tesseract, APPLY_BOXES: ... FAILURE! Couldn't find a matching blob for BENGALI language.

2021-01-28 Thread Shree Devi Kumar
For Bengali, you need to train the LSTM model. Legacy model training won't work. On Thu, Jan 28, 2021, 22:32 Boring Guy69 wrote: > > Hello i am new to tesseract. i am working on bengali language [kalpurush > font]. > I got lots of error when i make TR files. if i describe my work flow > At first

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
Add the following to your lstmtraining command and see. --debug_interval -1 On Fri, Feb 5, 2021 at 4:05 PM Kumar Rajwani wrote: > HI, > i am trying to finetune eng.traindata as per my images i have tried to > train but all time i am stuck somewhere can you tell me how can i procced > further.

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
lView: Waiting for server... > Error: Unable to access jarfile ./ScrollView.jar > sh: 1: kill: No such process > On Friday, February 5, 2021 at 4:28:14 PM UTC+5:30 shree wrote: > >> Add the following to your lstmtraining command and see. >> --debug_interval -1 >> >>

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
On Fri, Feb 5, 2021 at 4:44 PM Kumar Rajwani wrote: > hi, > i have tried minus 1 and got following result > Iteration 0: GROUND TRUTH : ) @® > Iteration 0: BEST OCR TEXT : Yo > File eng.arial.exp0.lstmf line 0 : > > What's your version of tesseract? What o/s? > Without your files, it's diffic

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
usp=sharing >> this is my notebook you can see complete process in finetune 2 section. >> >> >> On Friday, February 5, 2021 at 4:55:43 PM UTC+5:30 shree wrote: >> >>> On Fri, Feb 5, 2021 at 4:44 PM Kumar Rajwani >>> wrote: >>> >>>>

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
> On Friday, February 5, 2021 at 5:50:30 PM UTC+5:30 Kumar Rajwani wrote: > >> main thing is i want to learn about training tesseract on image level so >> can you please tell me how can i procced further. i want to know where is >> the main problem. >> >>

Re: [tesseract-ocr] Tesseract output text and symbol

2021-02-18 Thread Shree Devi Kumar
See https://tesseract-ocr.github.io/tessdoc/FAQ.html#what-page-separators-are-used-in-txt-output-by-tesseract-400 On Thu, Feb 18, 2021, 12:15 J Cassar wrote: > Good Day, > > I've used tesseract on a number of jpeg images ( see input image attached) > and it works fine as it outputs the text. How

Re: [tesseract-ocr] Re: To make traineddata file non-traineable

2021-02-24 Thread Shree Devi Kumar
Yes. Usage for compacting LSTM component to int: combine_tessdata -c traineddata_file On Wed, Feb 24, 2021 at 10:56 PM Jennil Thiyam wrote: > HI shree, so by running this command, the model will be in its > integer/fast version? > > On Wed, Feb 24, 2021 at 10:27 AM shree wro

Re: [tesseract-ocr] Training Tessearct for custom data --Urgent Help Required

2021-03-13 Thread Shree Devi Kumar
You have not stated the version of tesseract that you are using. >We downloaded some online training data available for the language Malayalam You have not mentioned from where you got it. Are these the official traineddata files? >we found that few special characters in the language are not pic

Re: [tesseract-ocr] Training Tessearct for custom data --Urgent Help Required

2021-03-20 Thread Shree Devi Kumar
avinash singh wrote: > Hello Shree, > > Thank you for your reply, > > We have used tesseract 4.0 alpha > > The Training Data is used from the below > > https://github.com/tesseract-ocr/tessdata_best > > https://tesseract-ocr.github.io/tessdoc/Data-Files.html >

Re: [tesseract-ocr] Properly Insert OCR Into Separate Columns

2021-03-21 Thread Shree Devi Kumar
Please see the newly added table detector to the master branch https://github.com/tesseract-ocr/tesseract/pull/3330 On Mon, Mar 22, 2021, 10:53 Daniel Lu wrote: > Hi, > > I am trying to read hundreds of pages of information like the picture > below into a CSV file. For us humans, it is very cle

Re: [tesseract-ocr] downgrade to last tessract alpha version tesseract 5.0.0-alpha-20201231-246-gfe61

2021-03-23 Thread Shree Devi Kumar
Please report as issue in tesseract repo. On Tue, Mar 23, 2021, 13:46 Kumar Rajwani wrote: > The latest push is working fine but when image is blury or have some noise > it can't able to pass the image. it shows Detected 12 diacritics . The > previous version was working fine with my images. > >

Re: [tesseract-ocr] downgrade to last tessract alpha version tesseract 5.0.0-alpha-20201231-246-gfe61

2021-03-23 Thread Shree Devi Kumar
@AlexanderP/tesseract-debian Is there a way to use older ppa versions? On Tue, Mar 23, 2021, 13:46 Kumar Rajwani wrote: > The latest push is working fine but when image is blury or have some noise > it can't able to pass the image. it shows Detected 12 diacritics . The > previous version was w

Re: [tesseract-ocr] Installing tesseract 5 via vcpkg

2021-03-24 Thread Shree Devi Kumar
Yes, -head doesn't work with vcpkg. You can install the dependencies via vcpkg and then build tesseract. See https://github.com/tesseract-ocr/tesseract/actions/runs/681261367/workflow for the steps. On Wed, Mar 24, 2021, 19:44 Fábio Ramos wrote: > Hello, I've tried using vcpkg to install tesse

Re: [tesseract-ocr] pytesseract having high accuracy but performing very very slow

2021-03-25 Thread Shree Devi Kumar
Try with a newer version of tesseract. On Thu, Mar 25, 2021, 13:19 Vidya Chitragar < vidya.chitra...@lucidatechnologies.com> wrote: > Hi Every one. > I am using pytesseract with tesseract-ocr version 3.05.02 for conversion > of scanned pdf document of 1000k pages to searchable pdf document but my >

Re: [tesseract-ocr] tesseract failing on extremely simple example

2021-03-27 Thread Shree Devi Kumar
Do you have the font used in the sample? Do you only need to recognise numbers in it? On Sat, Mar 27, 2021, 16:10 Marvin Thielk wrote: > I've tried a variety of pre-processing attempts and different configs, but > this feels like it should be an easy detection task. > > I've tried with several d

Re: [tesseract-ocr] tesseract failing on extremely simple example

2021-03-30 Thread Shree Devi Kumar
k you so much! > > What hyperparameters did you use for training? number of pages? epochs? > > Which model did you start with? your file seems smaller than other > eng.traineddata files. > > Thanks, > ~Marvin > > On Sun, Mar 28, 2021 at 10:16 AM Shree Devi Kumar > wrote

Re: [tesseract-ocr] tesseract WIndows 10 Newbie here

2021-04-01 Thread Shree Devi Kumar
Check that the tesseract directory is added to Path so that it can be found. On Fri, Apr 2, 2021, 11:03 Gianfranco Dy wrote: > One of the requirements of alimranahmed/LaraOCR is that the Tesseract > command is accessible > > [image: Capture.PNG] > after installing tesseract-ocr-w64-setup-v5.0.
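
One quick way to check whether the command can be found on the search path is sketched below, using Python's standard `shutil.which`. The wrapper name `find_executable` is hypothetical, and the sketch only reports what is found; it does not change the Path setting.

```python
import shutil
from typing import Optional


def find_executable(name: str = "tesseract") -> Optional[str]:
    """Return the full path of `name` if it is on the search path, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_executable()
    if location is None:
        print("tesseract was not found; add its install directory to Path")
    else:
        print(f"tesseract found at {location}")
```

A result of `None` from the same process that will launch tesseract suggests the install directory still needs to be added to Path.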

Re: [tesseract-ocr] Lstm training parameters

2021-04-04 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files-in-tessdata_fast.md https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files-in-tessdata_best.md On Mon, Apr 5, 2021, 00:52 Adriana Camilleri wrote: > By any chance, is there any information out there about the training > p

Re: [tesseract-ocr] Unable to understand Iterations?

2021-04-14 Thread Shree Devi Kumar
It has seen only 600 lines of data, of which only 300 have been used for learning. An iteration is different from an epoch, which is one pass through all the training data. On Wed, Apr 14, 2021, 01:36 GCP COGNEXT wrote: > What does *At Iteration 300/600/600.* > > Let's assume I have 10k data and I want
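
To make the three numbers easier to track across a long log, a small parser can pull them out of each line. This is a sketch only: the field names `learning`, `training` and `sample` are assumptions based on the reply above (300 lines used for learning out of 600 seen), and the regular expression covers just the "At iteration a/b/c" fragment quoted in the question.

```python
import re
from typing import NamedTuple, Optional


class IterationCounts(NamedTuple):
    learning: int   # lines actually used for learning (first number)
    training: int   # lines seen so far (second number); name is an assumption
    sample: int     # third number; name is an assumption


_ITERATION_RE = re.compile(r"At iteration (\d+)/(\d+)/(\d+)", re.IGNORECASE)


def parse_iteration_counts(line: str) -> Optional[IterationCounts]:
    """Extract the three slash-separated counters from a training log line, if present."""
    match = _ITERATION_RE.search(line)
    if match is None:
        return None
    return IterationCounts(*(int(group) for group in match.groups()))
```

For the line quoted in the question, the parser yields 300, 600 and 600.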

Re: [tesseract-ocr] What is Max Iterations & Epochs in tesstrain Makefile

2021-04-14 Thread Shree Devi Kumar
See https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#lstmtraining-command-line Epoch support has recently been added to the tesstrain makefile; it is converted to a number of iterations based on the amount of training data. On Wed, Apr 14, 2021, 01:36 GCP COGNEXT wrote: > Hi All, > > I w

Re: [tesseract-ocr] What do iteration numbers mean in the train logging?

2021-04-14 Thread Shree Devi Kumar
https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#iterations-and-checkpoints Epoch size depends on your training data. If you have 1000 lines of training data, then 1 epoch is 1000 iterations. If you have 5 lines of training text, 1 epoch is 5 iterations. On Wed,
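
The arithmetic above (one epoch equals one iteration per training line) can be written down as a tiny helper. This is a sketch of the relationship as stated in this message, not of how any particular training script computes its limits; the function name `epochs_to_iterations` is hypothetical.

```python
def epochs_to_iterations(epochs: int, training_lines: int) -> int:
    """Number of iterations needed to pass over `training_lines` lines `epochs` times."""
    if epochs < 0 or training_lines < 0:
        raise ValueError("epochs and training_lines must not be negative")
    return epochs * training_lines
```

With 1000 lines of training data, one epoch is 1000 iterations and three epochs are 3000.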

Re: [tesseract-ocr] What are Langdata repository given for retraining Tesseract

2021-04-15 Thread Shree Devi Kumar
Use langdata_lstm repo for LSTM training. That has larger training text. On Thu, Apr 15, 2021, 00:52 Venkatapathy S wrote: > Hi, > I want to retrain Tesseract from the scratch for a particular language(I > have read as many resources as possible, including warnings, from the > Tutorial

Re: [tesseract-ocr] Doubt about using 5.0.0-beta-20210916 before release version is available

2021-10-19 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/issues/3331#issuecomment-946532564 On Tue, Oct 19, 2021, 16:26 juan carlos hernández < juan.carlos.h.c.valen...@gmail.com> wrote: > Hello > I'm working in a project that needs OCR and we have choosed to use > Tesseract. We would like to use v5.0.0, b

Re: [tesseract-ocr] Re: Using Tesseract for Handwriting..

2021-11-20 Thread Shree Devi Kumar
Please see https://github.com/tesseract-ocr/tesstrain/wiki for detailed examples of tesseract training for handwritten texts. On Sat, Nov 20, 2021 at 11:53 AM Peter Geraghty wrote: > sorry, by word recognition, I meant word and character localization. > > On Friday, November 19, 2021 at 11:04:38

Re: [tesseract-ocr] Re: Using Tesseract for Handwriting..

2021-11-21 Thread Shree Devi Kumar
Also see the Technical Information section in https://tesseract-ocr.github.io/tessdoc/ On Mon, Nov 22, 2021, 01:36 Peter Geraghty wrote: > Thank you!!! will do! > > On Sunday, November 21, 2021 at 12:51:51 AM UTC-6 shree wrote: > >> Please see https://github.com/tesseract-

Re: [tesseract-ocr] training scripts in 5.0.0

2021-12-04 Thread Shree Devi Kumar
Please see the tesstrain repo. The Python versions of tesstrain.sh etc. have been moved there. On Sat, Dec 4, 2021, 22:37 Marco Atzeri wrote: > Hi, > > I am updating the cygwin package from 4.1.1 to 5.0.0 > and I noticed that 3 scripts > >language-specific.sh >tesstrain.sh >tesstrain_utils

Re: [tesseract-ocr] compile tessract 5.0 in win10

2022-01-01 Thread Shree Devi Kumar
You can download windows binaries from https://github.com/UB-Mannheim/tesseract/wiki On Sat, Jan 1, 2022, 16:54 杜德銘 wrote: > the original : > > vcpkg install tesseract:x64-windows for 64-bit. Use –head for the master > branch. > > is not 5.0, is 4.1. > > can update this command? > > reply by

Re: [tesseract-ocr] compile tessract 5.0 in win10

2022-01-01 Thread Shree Devi Kumar
I have also posted in vcpkg repo for them to update the official package to 5.0.0. https://github.com/microsoft/vcpkg/issues/16019 On Sat, Jan 1, 2022, 17:20 Shree Devi Kumar wrote: > You can download windows binaries from > https://github.com/UB-Mannheim/tesseract/wiki > > >

Re: [tesseract-ocr] Tesseract 4.0 - Multiline text

2022-03-23 Thread Shree Devi Kumar
Use the hocr option. On Thu, Mar 24, 2022, 10:52 Muraliraj DK wrote: > I am not sure if you have looked at the image. What i meant on Multi line > text is when the sentence is wrapped to next line i would like to extract > as single sentence instead of 2 lines (paragraph). > > Single line is - s

Re: [tesseract-ocr] Ubuntu : Unable to locate package libleptonica-dev

2022-03-31 Thread Shree Devi Kumar
https://packages.ubuntu.com/focal/libleptonica-dev On Fri, Apr 1, 2022, 11:07 polki paul wrote: > Hello, > > how to install libleptonica-dev on Ubuntu 20.04 ? > > > *sudo apt-get updatesudo apt-get install libleptonica-dev* > > > > > *Reading package lists... DoneBuilding dependency treeReading

Re: [tesseract-ocr] Ubuntu : Unable to locate package libleptonica-dev

2022-03-31 Thread Shree Devi Kumar
ionic main" and paste it as shown below on the next line. If you are using a different release of ubuntu, then replace bionic with the respective release name. deb http://archive.ubuntu.com/ubuntu bionic universe On Fri, Apr 1, 2022, 11:49 Shree Devi Kumar wrote: > https://packages.ubu

Re: [tesseract-ocr] Running Tesseract 5 on Linux

2022-04-03 Thread Shree Devi Kumar
Have you tried instructions on https://tesseract-ocr.github.io/tessdoc/Installation.html On Sun, Apr 3, 2022, 22:08 'Peter Kronenberg' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > Has anyone had any luck installing Tesseract 5 on Linux? It doesn’t seem > to be available in any of

[tesseract-ocr] Re: Kurdish traineddata

2022-10-16 Thread Shree Devi Kumar
Thank you for sharing information regarding successful training of Kurdish traineddata for Tesseract. Please also let us know whether the traineddata is available for others to use. You may want to contribute to the tess_contrib repo. Let us know whether the recognition covers 0-9 digits in Arabi

Re: [tesseract-ocr] Re: Kurdish traineddata

2022-10-17 Thread Shree Devi Kumar
ble on > > https://github.com/KurdishBLARK/KurdishOCR > > On Sun, Oct 16, 2022 at 20:59 Shree Devi Kumar > wrote: > >> Thank you for sharing information regarding successful training of >> Kurdish traineddata for Tesseract. >> >> Please also let us know wheth

Re: [tesseract-ocr] Tesseract training for New font/language

2023-04-01 Thread Shree Devi Kumar
Aurebesh seems to be different symbols mapped to the English alphabet rather than a new font for English, hence training would need to be for a new language rather than just fine-tuning. On Sat, Apr 1, 2023, 10:47 Ali Abedian wrote: > Hello, > > Thank you for providing the references, but I'm st

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-08-09 Thread Shree Devi Kumar
Include the default fonts also in your fine-tuning list of fonts and see if that helps. On Wed, Aug 9, 2023, 2:27 PM Ali hussain wrote: > I have trained some new fonts by fine-tune methods for the Bengali > language in Tesseract 5 and I have used all official trained_text and > tessdata_best and

Re: [tesseract-ocr] How to get the net_spec

2023-09-16 Thread Shree Devi Kumar
, iteration=6112200, sample_iteration=6112270, null_char=284, learning_rate=0.001, momentum=0.5, adam_beta=0.999 On Fri, Sep 15, 2023, 9:50 PM Des Bw wrote: > For the last couple of days, I have been trying to train the amh data to > include some missing characters. > > I have seen th

Re: [tesseract-ocr] How to get the net_spec

2023-09-16 Thread Shree Devi Kumar
The language name headings seem to be missing from the tessdoc page for tessdata_fast. Please revert to an older version of the page from its history. On Sat, Sep 16, 2023, 2:08 PM Shree Devi Kumar wrote: > > https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-tessdata_best.md >

Re: [tesseract-ocr] How to get the net_spec

2023-09-16 Thread Shree Devi Kumar
ory of components from the .traineddata file. *-l* *.traineddata* *FILE*...: List the network information. On Sat, Sep 16, 2023, 2:11 PM Shree Devi Kumar wrote: > The language name headings seem to be missing from the tessdoc page for > tessdata_fast > > Please revert to an older ve

Re: [tesseract-ocr] Does the checkpoint_name contain the number of iterations

2023-09-20 Thread Shree Devi Kumar
See https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#iterations-and-checkpoints On Wed, Sep 20, 2023, 2:53 AM Des Bw wrote: > I couldn't understand what the numbers on the checkpoint_names are. > I looked at this one: but clear to me. > > https://github.com/tesseract-ocr

Re: [tesseract-ocr] How to generate training images with noise

2023-10-12 Thread Shree Devi Kumar
Have you looked at https://github.com/tesseract-ocr/tesstrain On Thu, Oct 12, 2023, 11:45 PM Keith Smith wrote: > Hello, > > I am trying to use tesseract to OCR the MICR line of checks (i.e. the > micr-e13b font). The training data that I found at > https://github.com/BigPino67/Tesseract-MIC

Re: [tesseract-ocr] How to generate training images with noise

2023-10-13 Thread Shree Devi Kumar
com/tesseract-ocr/tesstrain assumes the ground truth > (images + box files) already exist. > > On Fri, Oct 13, 2023 at 1:00 AM Shree Devi Kumar > wrote: > >> Have you looked at >> >> https://github.com/tesseract-ocr/tesstrain >> >> >> >> On Thu

Re: [tesseract-ocr] How to generate training images with noise

2023-10-13 Thread Shree Devi Kumar
what I understand, > tesseract requires on the order of 10K images and box files to train on. > However, unless I am missing something, what I read at > https://github.com/tesseract-ocr/tesstrain assumes the ground truth > (images + box files) already exist. > > On Fri, Oct 13, 2023 at

Re: Hindi training data - unicharset_extractor error

2013-04-17 Thread Shree Devi Kumar
Thanks. I did follow the training wiki. However, since Hindi uses CUBE mode, it is not possible to train for that. I am trying to train for san - Sanskrit which uses the same devanagari script, in Non-cube mode. On Thu, Apr 18, 2013 at 1:34 AM, Sven Pedersen wrote: > This is covered in the FAQ

Re: Hindi training data - unicharset_extractor error

2013-04-17 Thread Shree Devi Kumar
Thanks, Zdenko! I think it would be helpful to add this to the training pages wiki in the next update. If possible, also add a list of the languages that use the Cube mode. On Thu, Apr 18, 2013 at 3:05 AM, zdenko podobny wrote: > > >> >> >> >> I remember one user post, that he wasted a lo

concatenating tr files

2013-04-18 Thread Shree Devi Kumar
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 says: An alternative to multi-page tiffs is to create many single-page tiffs for > a single font, and then you must cat together the tr files for each font > into several single-font tr files. In any case, the input tr files to > mftr

Re: concatenating tr files

2013-04-18 Thread Shree Devi Kumar
Thanks, Zdenko. Will do and post the link here. Shree Devi Kumar भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Apr 18, 2013 at 11:46 PM, zdenko podobny wrote: > post somewhere your files, so we can test it on li

Re: tesseract testing suite

2013-04-18 Thread Shree Devi Kumar
On Thu, Apr 18, 2013 at 11:02 PM, Nick White wrote: > Hi Shree, > > I'm glad you found my article helpful. Apologies for the delay in my > reply to you. I'll answer your questions below. > Thanks, Nick! > > > I have found that trying to improve recognit

How do I add this to unicharambigs file?

2013-04-22 Thread Shree Devi Kumar
While doing OCR with san.traineddata I am getting many cases where [ga ग] [virāma ्] [ZWJ], i.e. ग्‍‍ followed by ा, is being output instead of ग; similarly for श, ण, etc. Zero width joiner is not a unit in the unichar file. And, most half letters are shown with viraama - so I may have ग् in u

Re: Training individual characters in an existing language

2013-04-22 Thread Shree Devi Kumar
> where x is the ISO639-3 language code. The UTF-8-encoded file should > contain equal sign-delimited oldValue=newValue pairs. > Shree Devi Kumar भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Apr 22, 2013 at 2:00 PM, A
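The substitution file format quoted above (UTF-8, one equal sign-delimited oldValue=newValue pair per line) could be read and applied roughly as follows. This is only a sketch for post-processing OCR output: the function names are hypothetical, and it does not claim to reproduce how VietOCR itself handles the file.

```python
def parse_substitutions(text):
    """Parse 'oldValue=newValue' lines into an ordered list of (old, new) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        old, new = line.split("=", 1)  # split on the first '=' only
        if old:
            pairs.append((old, new))
    return pairs


def load_substitutions(path):
    """Read a UTF-8 substitution file from disk (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as handle:
        return parse_substitutions(handle.read())


def apply_substitutions(ocr_text, pairs):
    """Apply each replacement in file order to the OCR output."""
    for old, new in pairs:
        ocr_text = ocr_text.replace(old, new)
    return ocr_text
```

Replacements are applied in file order, so an earlier pair can affect what a later pair matches.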

Re: Training individual characters in an existing language

2013-04-22 Thread Shree Devi Kumar
See http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/combine_tessdata.1.html for instructions on how to unpack the unicharambigs file and how to overwrite it in the traineddata after update. Shree Devi Kumar भजन - कीर्तन
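A rough sketch of that unpack, edit, overwrite cycle, driven from Python. It assumes the `-u` (unpack) and `-o` (overwrite components) options described on the linked combine_tessdata man page; the helper names and the `san.` file names are only illustrative, so check the options against the man page for your Tesseract version.

```python
import subprocess


def unpack_command(traineddata, prefix):
    """Command that unpacks every component, e.g. san.unicharambigs for prefix 'san.'."""
    return ["combine_tessdata", "-u", traineddata, prefix]


def overwrite_command(traineddata, component_files):
    """Command that writes the given edited components back into the traineddata."""
    return ["combine_tessdata", "-o", traineddata, *component_files]


def run(command):
    """Run one of the commands above; raises CalledProcessError on failure."""
    subprocess.run(command, check=True)
```

For example, `run(unpack_command("san.traineddata", "san."))`, then edit `san.unicharambigs`, then `run(overwrite_command("san.traineddata", ["san.unicharambigs"]))`.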

Re: concatenating tr files

2013-04-22 Thread Shree Devi Kumar
s, all of same font and tesseract seems to be working. Maybe the errors will come if I try to use more than one font or if I go over the 32 file limit. Shree Devi Kumar भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, Apr 23

Re: Training individual characters in an existing language

2013-04-23 Thread Shree Devi Kumar
e first field of some line in the unicharset file, ie >>>> it must be a recognizable unit. >>>> >>> >>> If that doesn't work, you can try post-processing the OCR output. >>> VietOCR allows a user defined substitution file for the same. >>

Re: concatenating tr files

2013-04-23 Thread Shree Devi Kumar
Thanks, Quan. On Tue, Apr 23, 2013 at 4:15 AM, Quan Nguyen wrote: > .tr are binary files; as such, you should use: > > copy /b san.sanskrit2003.exp0*.tr san.sanskrit2003.exp2000.tr > > -- > -- > You received this message because you are subscribed to the Google > Groups "tesseract-ocr" group.
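For anyone not on Windows, the same byte-for-byte concatenation as `copy /b` can be sketched in Python. The function name is hypothetical; the files are opened in binary mode, following the advice quoted above.

```python
import glob
import shutil


def concat_binary(pattern, output_path):
    """Append every file matching pattern, in sorted order, to output_path as raw bytes."""
    # Collect the sources first so the output file is never read as an input.
    sources = sorted(p for p in glob.glob(pattern) if p != output_path)
    with open(output_path, "wb") as out:
        for src in sources:
            with open(src, "rb") as handle:
                shutil.copyfileobj(handle, out)
    return sources
```

A call mirroring the quoted command would be `concat_binary("san.sanskrit2003.exp0*.tr", "san.sanskrit2003.exp2000.tr")`.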

Re: tesseract is not opening my tiff's

2013-05-05 Thread Shree Devi Kumar
I think you need "quote marks" around the filenames as they have spaces in them. Shree Devi Kumar भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, May 6, 2013 at 4:30 AM, wrote: > Hi, > I am totally new wi
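When tesseract is called from a script rather than typed at a prompt, one way to avoid the quoting problem is to pass each filename as its own list element. A small sketch, with hypothetical file names; `shlex.join` is used only to show a POSIX-shell-quoted form of the command, and Windows `cmd` quoting differs.

```python
import shlex


def tesseract_command(image_path, output_base):
    """Build the argument list; each path stays one argument even if it contains spaces."""
    return ["tesseract", image_path, output_base]


cmd = tesseract_command("my scan 01.tif", "my scan 01")
# Shows how the same command would need to be quoted in a POSIX shell.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would then run it without any shell quoting at all.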

Re: jTessBoxEditor 0.6 Beta release

2013-05-06 Thread Shree Devi Kumar
modify the box files through the editor. Please read the program documentation / help file for more details. Shree If you want t Shree Devi Kumar भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, May 6, 2013 at 12:19 PM, mamata

Re: Diff between unicharambigs and DangAmbigs

2013-05-07 Thread Shree Devi Kumar
unicharset file. I have a feeling that the ligature - "U+017FU+0068" is not there in it, but U+EAB1 is there. That is why you could use the latter in your substitutions. Shree Shree Devi Kumar भजन - कीर्तन - आ

Re: VS2008 Express Edition - how to use this to see debug values?

2013-05-07 Thread Shree Devi Kumar
, I am assuming that the other case would have a 'space' character between them. Anyway, That was the reason for wanting to follow the program in VS2008. If you know of some instructions/tutorial to do that and can point me to it, that will be great. Thanks, Shree Shree

Re: VS2008 Express Edition - how to use this to see debug values?

2013-05-08 Thread Shree Devi Kumar
Thanks, Tom. Appreciate the instructions and I'll give them a try. Any idea when the 'significant changes in the works' will be ready for release? Shree Shree Devi Kumar भजन - कीर्तन - आरती @ http://bhajans.rampariv
