If you have single-line images, then you only need a matching single-line text transcription for each image to use the tesstrain Makefile training process. The Makefile will generate the required box files for you.
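As a rough sketch of how that pairing could be automated when you already have the text for each image: the script below writes one transcription file next to each line image, using the naming rule from the tesstrain README (image extension replaced by .gt.txt). The `labels` mapping, the file names, and the helper names are hypothetical, and it assumes the images are already cropped to a single line and sit in the ground-truth folder (by default data/MODEL_NAME-ground-truth in tesstrain).

```python
from pathlib import Path
from typing import List, Mapping

# Image extensions tesstrain accepts; longer ones first so that
# ".bin.png" is matched before the plain ".png" suffix.
IMAGE_EXTS = (".bin.png", ".nrm.png", ".png", ".tif")


def gt_path_for(image_path: Path) -> Path:
    """Return the .gt.txt path that belongs to a line image."""
    name = image_path.name
    for ext in IMAGE_EXTS:
        if name.endswith(ext):
            return image_path.with_name(name[: -len(ext)] + ".gt.txt")
    raise ValueError(f"unsupported image extension: {name}")


def write_ground_truth(labels: Mapping[str, str], gt_dir: Path) -> List[Path]:
    """Write one single-line transcription per image in gt_dir.

    `labels` maps an image file name to its text (hypothetical input;
    adapt this to however your transcriptions are stored).
    """
    written = []
    for image_name, text in labels.items():
        image_path = gt_dir / image_name
        if not image_path.is_file():
            raise FileNotFoundError(image_path)
        line = text.strip()
        # Transcriptions must be a single line of plain text.
        if "\n" in line or "\r" in line:
            raise ValueError(f"transcription for {image_name} is not a single line")
        out = gt_path_for(image_path)
        out.write_text(line + "\n", encoding="utf-8")
        written.append(out)
    return written
```

With the .gt.txt files in place, running `make training MODEL_NAME=...` from the tesstrain checkout should pick up the image and transcription pairs and create the box files itself.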
This is different from the old text2image process.

>> Images must be TIFF and have the extension .tif or PNG and have the extension .png, .bin.png or .nrm.png.
>> Transcriptions must be single-line plain text and have the same name as the line image but with the image extension replaced by .gt.txt.

Please try a test run with the example set-up.

On Fri, Oct 13, 2023, 3:43 PM Keith Smith <keithsmith...@gmail.com> wrote:

> Yes I have. I am asking about how to automate the generation of the
> ground truth images and box files, because from what I understand,
> tesseract requires on the order of 10K images and box files to train on.
> However, unless I am missing something, what I read at
> https://github.com/tesseract-ocr/tesstrain assumes the ground truth
> (images + box files) already exists.
>
> On Fri, Oct 13, 2023 at 1:00 AM Shree Devi Kumar <shreesh...@gmail.com>
> wrote:
>
>> Have you looked at
>>
>> https://github.com/tesseract-ocr/tesstrain
>>
>> On Thu, Oct 12, 2023, 11:45 PM Keith Smith <keithsmith...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I am trying to use tesseract to OCR the MICR line of checks (i.e. the
>>> micr-e13b font). The training data that I found at
>>> https://github.com/BigPino67/Tesseract-MICR-OCR/blob/master/Tessdata/mcr.traineddata
>>> does not produce accurate results on my data set.
>>>
>>> I have a set of over 20K check images along with the MICR text for those
>>> images; however, I do not have box files for them.
>>>
>>> So I started generating box files and manually correcting them via
>>> JTessBoxEditor, but I soon learned that it would take a LONG time to do
>>> this for enough checks to properly train tesseract. So I have just started
>>> generating synthetic images using tesseract's text2image; however, the
>>> images generated are perfect (i.e. no blur, skew, etc), so I am doubting
>>> that this will result in training tesseract to handle my less-than-perfect
>>> check images.
>>>
>>> Does anyone have suggestions for the best methodology to use? Is there
>>> a way to get text2image (or another tool) to generate less-than-perfect
>>> images? Or can someone suggest a less labor intensive way of using real
>>> check images to train tesseract?
>>>
>>> Thanks in advance,
>>> Keith
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/b92d2ab9-3da1-4ef8-bafe-5217821c5601n%40googlegroups.com