Forgive me, I have lots of questions and will be trying to separate out one question per conversation (so that those searching later may more easily find the answers).
I'm working with scanned images of a book with a textbook-like layout: occasional drop caps, and text in 2 or occasionally 3 columns that flows around images (sometimes an actual square or rectangle; other times the image had its background removed and the text flows around the subject). Most of the book is English, but there is topic-specific jargon, abbreviations of the jargon, and, even worse, acronyms and symbols for said jargon. Where fractions are used, they are smart fractions (so something like 1/4" takes up the space of 2 characters, not 4). There is also tabular data (worst case, I'm fine with the tabular stuff not being included in the OCR results).
The lighting during the scan was uneven, and the original images were taken at approx. 250 dpi. I've preprocessed the images, including binarization and upscaling to 300 dpi for Tesseract to work with, but the uneven lighting couldn't be entirely fixed (I would need to rescan, unless someone knows of a way to fix it in GIMP, and rescanning is not an option right now), which made binarization of some blocks on some pages less successful than others. That's the background; I may need to refer back to it in other questions.
So far (I've tried OEM 0 and 1) the results are "ok", but there are errors: high-confidence words that are wrong, low-confidence words that are actually correct, difficulty with the fractions, and orphans from the drop caps. Some of the jargon-related stuff is iffy too (though when the lighting and binarization are clear, LSTM runs pick most of it up pretty well).
Using an hOCR viewer isn't going so well. I'm using ScribeOCR, which I found out about on this list. The physical book these images were taken from is approximately US Letter sized, and ScribeOCR is "stuck" on showing me the whole page, which makes the text too small to actually read (and since I have wrong high-confidence and correct low-confidence words, I can't depend on the color coding). If I could read it, I could correct there.
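
To make concrete what I mean by reviewing words outside a viewer, here is a minimal sketch (Python, standard library only) that lists each recognized word with its bounding box and confidence. It assumes Tesseract's usual hOCR word markup, i.e. `span` elements with class `ocrx_word` whose `title` attribute carries `bbox` and `x_wconf`. The names `HocrWords`, `parse_hocr_words` and the `SAMPLE` fragment are made up for illustration, and it only reads the hOCR; it does not write corrections back.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect text, bbox and confidence for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being collected, if any
        self._depth = 0       # nested <span> depth inside the current word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._current is not None:
            # A span nested inside a word (e.g. per-character boxes).
            self._depth += 1
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+)", title)
            self._current = {
                "id": attrs.get("id"),
                "bbox": tuple(map(int, bbox.groups())) if bbox else None,
                "conf": int(conf.group(1)) if conf else None,
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag != "span" or self._current is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(self._current)
        self._current = None


def parse_hocr_words(hocr_text):
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hypothetical two-word fragment in the shape Tesseract writes.
SAMPLE = """<span class='ocr_line' id='line_1_1' title='bbox 36 92 580 116'>
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 96'>Cut</span>
<span class='ocrx_word' id='word_1_2' title='bbox 109 92 140 116; x_wconf 41'>1/4"</span>
</span>"""

words = parse_hocr_words(SAMPLE)
# Lowest confidence first; words without a confidence sort to the top.
for w in sorted(words, key=lambda w: -1 if w["conf"] is None else w["conf"]):
    print(w["conf"], w["bbox"], w["text"])
```

Since my confidences are unreliable in both directions, a listing like this would still have to be read in full against the page image rather than filtered by a threshold; the bounding boxes are there so each word can be located on the scan.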
So, how, exactly, does one go about correcting hocr results?