Re: [tesseract-ocr] Retrain Tesseract 4.0.0 beta to recognise handwritten digits

2018-07-19 Thread chandra churh chatterjee
Environment : Ubuntu 16.04 LTS --- Check : Running

2018-07-19 Thread Ramakant Kushwaha
Thanks @Chandra. I am a beginner at this, so please help me with the complete documentation. On Thu, Jul 19, 2018 at 3:38 PM, chandra churh chatterjee < chandrachurh.chatterje...@gmail.com> wrote: > I have already used tesseract 4.0 version for training on hand written > digits. > The steps are as f

2018-07-19 Thread chandra churh chatterjee
I have already used Tesseract 4.0 to train on handwritten digits. The steps are as follows:
1. The best approach is to use some handwritten fonts from Google or anywhere else.
2. Use the "tesstrain.sh" script to generate the starter trained data using the text corpus containing only 0-9 di

2018-07-18 Thread Ramakant Kushwaha
Thanks Lorenzo, I will try OpenCV + SIFT + MNIST and will update you soon. On Wednesday, July 18, 2018 at 5:26:05 PM UTC+5:30, Lorenzo Blz wrote: > > A MNIST trained model does character recognition, not detection. You first > need to isolate characters to use it. The advantage is that it

2018-07-18 Thread Lorenzo Bolzani
An MNIST-trained model does character recognition, not detection. You first need to isolate characters to use it. The advantage is that it is already trained and I think it may work better than fine tuning tesseract because the handwritten digits are quite different from standard fonts. The di

2018-07-18 Thread Ramakant Kushwaha
@Lorenzo As per my understanding, MNIST is useful for recognising individual characters/digits, so to use MNIST I have to follow the steps below (*correct me if I am wrong*):
1. Gray + Threshold (OpenCV)
2. Extract connected components (MSER, OpenCV)
3. Run a loop over the (sorted) connected components list and crop in
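
The threshold, connected-components, and crop steps listed above can be sketched without any dependencies. This is only an illustration under stated assumptions: it uses a plain breadth-first flood fill over a list-of-lists grayscale image rather than OpenCV or MSER, and the function names (`threshold`, `connected_boxes`, `crop`) and the cutoff of 128 are hypothetical choices, not anything taken from the thread.

```python
from collections import deque


def threshold(gray, cutoff=128):
    """Binarize a grayscale grid: dark pixels (ink) become 1, light pixels 0."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def connected_boxes(binary):
    """Return inclusive (left, top, right, bottom) boxes of 4-connected ink
    regions, sorted left to right."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one component, tracking its bounding box.
            seen[y][x] = True
            queue = deque([(x, y)])
            left = right = x
            top = bottom = y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)


def crop(gray, box):
    """Cut the inclusive box out of the original grayscale grid."""
    left, top, right, bottom = box
    return [row[left:right + 1] for row in gray[top:bottom + 1]]
```

Each crop would then be resized to whatever input size the chosen digit classifier expects. On a real form the printed box borders will also show up as components, so some filtering by size or position is likely needed before classification.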

2018-07-18 Thread Lorenzo Bolzani
This is exactly the MNIST problem. I would not use tesseract for this. You can download something like this: https://github.com/EN10/KerasMNIST that comes with pre-trained models too. The problem you'll have will be to extract the di

2018-07-18 Thread Soumik Ranjan Dasgupta
Follow https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 to create the traineddata. Copy the eng.traineddata file to the $TESSDATA_PREFIX directory, and you'll be good to go. On Wed, Jul 18, 2018 at 1:20 PM Soumik Ranjan Dasgupta < srd1...@cse.jgec.ac.in> wrote: > I normally use

2018-07-18 Thread Soumik Ranjan Dasgupta
I normally use a custom python file to generate the training text. Attaching a sample text corpus containing only digits 1234. On Wed, Jul 18, 2018 at 12:04 PM Ramakant Kushwaha < ramakant.sing...@gmail.com> wrote: > @Soumik,Thanks Soumik, but I am not getting it, please provide me some > links t
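
The attached script is not reproduced in the archive, so the following is only a guess at what such a generator might look like, not the poster's actual file. It emits lines made of random digit groups separated by single spaces; the function name, group sizes, and line count are all assumptions chosen for illustration.

```python
import random


def make_digit_corpus(num_lines=1000, min_groups=3, max_groups=8,
                      max_group_len=6, seed=0):
    """Return a list of text lines containing only digits 0-9 and single spaces."""
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    lines = []
    for _ in range(num_lines):
        groups = []
        for _ in range(rng.randint(min_groups, max_groups)):
            length = rng.randint(1, max_group_len)
            groups.append("".join(rng.choice("0123456789") for _ in range(length)))
        lines.append(" ".join(groups))
    return lines


if __name__ == "__main__":
    # Print a few sample lines; redirect or write them to a training text file as needed.
    for line in make_digit_corpus(num_lines=3):
        print(line)
```

Writing the returned lines to a UTF-8 text file, one per line, gives a digits-only training text that can be handed to the rendering step.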

2018-07-17 Thread Ramakant Kushwaha
@Soumik, thanks, but I do not follow. Please give me some links to help me understand it; I am very new to this. Can you guide me in creating a text corpus of digits with different fonts? @Lorenzo, I want to detect digits written in the boxes of the image below; it's a cash deposit form of a ban

2018-07-17 Thread Soumik Ranjan Dasgupta
Try creating a text corpus with only digits, using various handwritten fonts from fonts.google.com that come close to your dataset. Use tesstrain.sh to render the images and lstmtraining to train tesseract; you'll achieve fair accuracy. On Tue, Jul 17, 2018 at 11:38 PM Lorenzo Bolzani wrot
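
As a rough sketch of the two steps named above, the helpers below only assemble argument lists for `tesstrain.sh` and `lstmtraining`; they do not run anything. The flags are the ones documented on the Tesseract 4.00 training wiki page, but every path, font name, and the iteration count is a placeholder assumption, and the fine-tuning variant (`--continue_from`) is one possible choice rather than something the thread prescribes.

```python
def tesstrain_command(lang, training_text, fonts_dir, fonts,
                      langdata_dir, tessdata_dir, output_dir):
    """Build the tesstrain.sh call that renders line images from a text corpus."""
    return [
        "tesstrain.sh",
        "--lang", lang,
        "--linedata_only",                # LSTM training needs line data only
        "--noextract_font_properties",
        "--training_text", training_text,
        "--fonts_dir", fonts_dir,
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--output_dir", output_dir,
        "--fontlist", *fonts,             # one argument per font name
    ]


def lstmtraining_command(continue_from, traineddata, train_listfile,
                         model_output, max_iterations=400):
    """Build an lstmtraining call that fine-tunes from an existing LSTM model."""
    return [
        "lstmtraining",
        "--continue_from", continue_from,
        "--traineddata", traineddata,
        "--train_listfile", train_listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]
```

Either list can be passed to `subprocess.run(cmd, check=True)` once the placeholder paths point at real files. Check the flags against the installed version with `tesstrain.sh --help` and `lstmtraining --help` before relying on them.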

2018-07-17 Thread Lorenzo Bolzani
Generating the training data is a completely different problem from training tesseract. If you want to recognize full words, it's better to have full words (or numbers), not individual characters, so that the process of splitting the words into characters is done by tesseract. Unless you just wa

2018-07-17 Thread Ramakant Kushwaha
Thank you so much for guiding me. I have read the links and sub-links provided and, as suggested, I will use OCR-D (https://github.com/OCR-D/ocrd-train) for training. I want to know the best way to create pairs of [*.tif, *.gt.txt] from a tif image for two or more fonts. Is there any

2018-07-17 Thread Lorenzo Bolzani
Have a look at this thread: https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ It's easier than it seems, you do not need per character boxes with 4.0, just one per line (that ocr-d automatically generates). If your text is already split into lines you do not have to do anything m
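
For images that are already split into lines, producing the [*.tif, *.gt.txt] pairs mostly comes down to writing one transcription file next to each line image. The sketch below assumes the transcriptions are already known and held in a dictionary keyed by image file name; the function names and the example file names are hypothetical.

```python
from pathlib import Path


def write_ground_truth(transcriptions, directory):
    """Write one single-line <stem>.gt.txt per image name; return the paths written."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for image_name, text in transcriptions.items():
        text = text.strip()
        if "\n" in text:
            raise ValueError(f"{image_name}: ground truth must be a single line")
        gt_path = directory / (Path(image_name).stem + ".gt.txt")
        gt_path.write_text(text + "\n", encoding="utf-8")
        written.append(gt_path)
    return written


def missing_pairs(directory):
    """Return the names of .tif images that have no matching .gt.txt file."""
    directory = Path(directory)
    return sorted(
        image.name
        for image in directory.glob("*.tif")
        if not (directory / (image.stem + ".gt.txt")).exists()
    )
```

For example, `write_ground_truth({"line_0001.tif": "12345"}, "ground-truth")` would create `ground-truth/line_0001.gt.txt`, and `missing_pairs("ground-truth")` lists any line images that still lack a transcription. The directory name here is a placeholder; check the ocrd-train README for the location it expects.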