[tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread 대학원 컴퓨터공학과
Hi all, I tried to run an example of LSTM training and used the following command: *for f in *.tif; do tesseract $f ${f%.*} -l deu lstmbox; done* The resulting box files seem to use a single-level box instead of character-level boxes. All the characters share the same coordinates, width a

[tesseract-ocr] Arabic characters and numbers

2023-11-01 Thread Mostafa Abdo
Is there a traineddata file that contains both Arabic characters and numbers? I can get only characters or only numbers, not both. Also, I use this with Java, not the OCR tool. -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and

[tesseract-ocr] Re: Arabic OCR

2023-11-01 Thread Mostafa Abdo
Did you find anything? On Wednesday, 16 March 2022 at 09:47:05 UTC+2 Tahir Rehman wrote: > Hi all, > > I'm working on a project that needs OCR for Arabic cards. These cards > can be identity cards, business cards, or visiting cards. > > If anyone has an idea for any open source project or op

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread Zdenko Podobny
Are you following official tutorials? Did you read the documentation? Have you tried to check the official training repository and provided examples? Zdenko On Wed, 1 Nov 2023 at 10:15, TRAN TRONG KHANH[학생](대학원 컴퓨터공학과) < khanhtran...@khu.ac.kr> wrote: > Hi all, > > I tried to run an example of

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread Keith Smith
FYI, I asked the same question in https://groups.google.com/g/tesseract-ocr/c/9myrnSD0HKM On Wednesday, November 1, 2023 at 7:21:37 AM UTC-4 zdenop wrote: > Are you following official tutorials? > Did you read the documentation? > Have you tried to check the official training repository and pro

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread Dellu Bw
On 1 Nov 2023 at 11:51:27 AM, TRAN TRONG KHANH[학생](대학원 컴퓨터공학과) < khanhtran...@khu.ac.kr> wrote: > Are you trying to generate box files from the images (tif files)?

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread Des Bw
I don't know what you are trying to do. I am not familiar with this method of box generation. But, I think the command you are running is supposed to generate them with the same coordinates. Look at the example here: https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html On Wednes

Re: [tesseract-ocr] How to generate training images with noise

2023-11-01 Thread Des Bw
I am not sure if you are supposed to use those box files for training purposes. All the guides and manuals I have read use either the text2image script or the manual method (which is presumably outdated). On Wednesday, October 18, 2023 at 6:27:58 PM UTC+3 Keith Smith wrote: > I tried using

[tesseract-ocr] Re: Arabic characters and numbers

2023-11-01 Thread Des Bw
Doesn't the official Arabic model include the numerals? Arabic numerals are supposed to be part of almost all the models. The Amharic model I am working on, for example, does recognize Arabic numerals (along with the regular letter characters, of course).

Re: [tesseract-ocr] OCR

2023-11-01 Thread Des Bw
You need to try to process the images first. I recommend trying ScanTailor. You can then import the processed images into Tesseract. The accuracy will improve. Are you using the official English model to OCR them? On Wednesday, November 1, 2023 at 2:18:54 PM UTC+3 zdenop wrote: > Read the do

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread Des Bw
"Please note that box files generated using makebox config file are OK for training legacy models but not for LSTM training.". Makebox is the tool included inside tesseract to generate box files. It looks like that was used for the legacy model. For the current model, text2image is the way to d

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread 대학원 컴퓨터공학과
Thank you for your responses. Regarding my question and referring to the official documentation at Tesseract Documentation, the generated .box files have the *same coordinates* for every character because they use line-level boxes instead of character-level boxes. Also, I have a couple of conce

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread 대학원 컴퓨터공학과
Thank you for your responses. Regarding my question and referring to the official documentation at https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html , the generated .box files for LSTM-based training have the *same coordinates* for every character because they use line-level bo
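One quick way to check this on your own output is to count how many distinct boxes a .box file contains. The sketch below assumes the usual box layout of `symbol left bottom right top page` on each line; `count_distinct_boxes` is a hypothetical helper name for illustration, not part of Tesseract.

```shell
# Sketch only: count the distinct bounding boxes in a Tesseract .box file.
# Assumes each line ends with "left bottom right top page"; fields are read
# from the end of the line so that a space or tab symbol does not shift them.
# count_distinct_boxes is a hypothetical helper name, not a Tesseract command.
count_distinct_boxes() {
  awk 'NF >= 5 {
         key = $(NF-4) " " $(NF-3) " " $(NF-2) " " $(NF-1)
         if (!(key in seen)) { seen[key] = 1; n++ }
       }
       END { print n + 0 }' "$1"
}
```

If the count is far smaller than the number of lines in the file, the characters are most likely sharing line-level boxes rather than carrying their own character-level boxes.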

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread Des Bw
*1. Using synthetic data:* What can you do if you do not have data that is confirmed to be accurate? The only workaround I know of is to use synthetic data. That is, you generate the images from the texts using the text2image script and then train on those. The accuracy of the result mod

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread Des Bw
To clarify, Shree's script is useful in case your images are not single line. If they are all single line, that script won't do much for you. On Wednesday, November 1, 2023 at 4:20:09 PM UTC+3 Des Bw wrote: > > *1. Using synthetic data:* > What can you do if you do not have data that is conf

[tesseract-ocr] Re: Arabic characters and numbers

2023-11-01 Thread Mostafa Abdo
I tried ara.traineddata, Arabic.traineddata and ara-Amiri.traineddata; none of them has the Arabic (Indian) numbers, but they do have the normal (English) numbers. On Wednesday, 1 November 2023 at 14:09:45 UTC+2 desal...@gmail.com wrote: > Doesn't the official Arabic model include the numerals? > The Arabic

[tesseract-ocr] Re: Arabic characters and numbers

2023-11-01 Thread Tom Morris
On Wednesday, November 1, 2023 at 10:02:22 AM UTC-4 mosta@gmail.com wrote: I tried ara.traineddata , Arabic.traineddata and ara-Amiri.traineddata all don't have the Arabic (Indian) numbers but have the normal (English) numbers You might want to clarify whether you are referring to: https:

[tesseract-ocr] Re: Dot-matrix woes

2023-11-01 Thread Slartybartfast
Doesn't anybody have any ideas? :-( On Tuesday, October 24, 2023 at 5:40:20 PM UTC+1 Slartybartfast wrote: > Hi > I am a new tesseract user, and I'm really struggling to get it to produce > any kind of sensible results, especially with numerical text. I have some > text that looks like this: >