Re: [tesseract-ocr] Should box include surrounding space?

2023-10-18 Thread 'Danny Wilson' via tesseract-ocr
Sorry, I had the coordinate system flipped in my last post. Here is a correct image, produced by text2image, that includes both FULLWIDTH COMMA and COMMA. For both types of comma, the boxes produced by text2image include only the boundaries of the glyph itself and do not consider the vertical
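
For readers who trip over the same coordinate flip: Tesseract box files put the origin at the bottom-left of the image, and each line reads `<symbol> <left> <bottom> <right> <top> <page>`. The following is only a minimal sketch of converting one such line to a top-left-origin rectangle; the names `Box`, `parse_box_line` and `to_top_left` are made up for illustration and are not part of Tesseract or text2image.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """One line of a Tesseract box file (origin at bottom-left)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the symbol field is kept intact,
    # even when the symbol itself is a space.
    char, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def to_top_left(box: Box, image_height: int) -> Tuple[int, int, int, int]:
    """Return (x, y, width, height) with the origin at the top-left."""
    return (
        box.left,
        image_height - box.top,   # flip the vertical axis
        box.right - box.left,
        box.top - box.bottom,
    )


if __name__ == "__main__":
    # Hypothetical box line and image height, for illustration only.
    box = parse_box_line(", 12 5 18 14 0")
    print(to_top_left(box, image_height=48))
```

The image height has to come from the rendered image itself, since the box file does not record it.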

[tesseract-ocr] How to do OCR for 96dpi screenshots from computer display with 100% accuracy?

2023-10-18 Thread Vadim Melnik
Hello, We are processing screenshot PNG images from a computer display at 96 dpi resolution. These are just B/W images rendered with a single known TrueType font at a fixed size, without antialiasing or any other subpixel rendering. The picture structure is clear, opaque and pixelated like listed bel

Re: [tesseract-ocr] Should box include surrounding space?

2023-10-18 Thread 'Danny Wilson' via tesseract-ocr
Because of some issues with licensed fonts not working with text2image, we wrote our own image and box file generator in Swift on the Mac. We use that to generate a data set of 100,000 text lines and feed that into the regular training on Linux. Using a non-licensed font, I checked what box te

Re: [tesseract-ocr] How to generate training images with noise

2023-10-18 Thread Keith Smith
I tried using tesstrain but am getting 0% accuracy, so any help on what I'm doing wrong or misunderstanding would be greatly appreciated. Specifically, here is what I did given my 20K check images and data from my x9.37 file. For each check, I 1. cropped the image so that they included only t

[tesseract-ocr] Re: Should box include surrounding space?

2023-10-18 Thread Des Bw
But if your only option is to manually edit the boxes, I really have no knowledge of it. I have never tried that route. On Wednesday, October 18, 2023 at 3:43:51 PM UTC+3 Des Bw wrote: > You need a lot of data. That is all. > If you can collect a lot of text lines that contain all those typ

[tesseract-ocr] Re: Should box include surrounding space?

2023-10-18 Thread Des Bw
You need a lot of data. That is all. If you can collect a lot of text lines that contain all those types of commas, and produce the training material using text2image (synthetic data) for each font, I am pretty sure Tesseract will learn all of them with no problem. On Wednesday, October 18, 2

[tesseract-ocr] Re: Should box include surrounding space?

2023-10-18 Thread 'Danny' via tesseract-ocr
There are a few "commas" used in CJK, which makes it complicated for me. *FULLWIDTH COMMA U+FF0C* which might have the glyph in the center of the box or in the lower left corner depending on the font: [image: Screenshot 2023-10-18 at 17.19.27.pn

[tesseract-ocr] Re: Watching the learning iteration is better method than watching the BCER

2023-10-18 Thread Des Bw
In other words, the BCER is an unreliable measure of accuracy. At least, that is my experience training from synthetic data. On Wednesday, October 18, 2023 at 10:10:00 AM UTC+3 Des Bw wrote: > I am just writing a little observation here for beginners like me. > (I would love to be corrected i

[tesseract-ocr] Watching the learning iteration is better method than watching the BCER

2023-10-18 Thread Des Bw
I am just writing a little observation here for beginners like me (I would love to be corrected if I am wrong). I am training by cutting the top layer of a best model to improve the existing model. I have about 400,000 lines of text and generated the box and image files using text2image.
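
One way to check a model independently of the error rate reported during training is to run it on held-out lines and compute a character error rate yourself. The following is only a rough sketch, assuming you already have (ground truth, OCR output) string pairs; `edit_distance` and `char_error_rate` are illustrative names, not Tesseract functions, and this is not necessarily how the trainer computes its own figure.

```python
from typing import Iterable, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total edit distance divided by total ground-truth length."""
    pairs = list(pairs)
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical held-out lines: (ground truth, OCR output).
    sample = [("abc", "abd"), ("hello", "hello")]
    print(char_error_rate(sample))
```

Using lines that were never part of the training list matters here; otherwise the number says little about how the model behaves on new text.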