[tesseract-ocr] Hyphenation postprocessing

Lars Aronsson Sun, 05 Feb 2023 18:57:49 -0800

Is it possible to instruct tesseract, for an image that reads:

 Let us build a snow-
 man on the lawn.


to output in txt format:

 Let us build a
 snowman on the lawn.

This would almost preserve line breaks, while at
the same time making hyphenated words whole
and searchable.

It seems to me that the tesseract source already has code to
recognize hyphenated words, so it should be possible to
implement this behaviour as an option.
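
Until such an option exists, the same effect can be approximated by
post-processing the plain txt output. The following is only a minimal
sketch in Python, not anything taken from the tesseract source: the
function name join_hyphenated is hypothetical, and the rule it uses (a
line ending in a letter plus "-", followed by a line starting with a
lowercase letter) is an assumed heuristic. It will wrongly merge genuine
compounds such as "well-" / "known" because it has no dictionary.

```python
def join_hyphenated(text: str) -> str:
    """Move a word fragment hyphenated at a line end down to the next line.

    Heuristic (an assumption, not tesseract behaviour): a line counts as
    hyphenated when it ends in a letter followed by "-" and the next
    line starts with a lowercase letter.
    """
    lines = text.split("\n")
    result = []
    carry = ""  # fragment carried down from the previous line
    for i, line in enumerate(lines):
        if carry:
            # Prepend the carried fragment, keeping the line's indentation.
            body = line.lstrip()
            indent = line[: len(line) - len(body)]
            line = indent + carry + body
            carry = ""
        stripped = line.rstrip()
        nxt = lines[i + 1].lstrip() if i + 1 < len(lines) else ""
        if (
            len(stripped) >= 2
            and stripped.endswith("-")
            and stripped[-2].isalpha()
            and nxt[:1].islower()
        ):
            # Split off the last word (without its hyphen) and carry it down.
            head, _, fragment = stripped[:-1].rpartition(" ")
            result.append(head.rstrip())
            carry = fragment
        else:
            result.append(line)
    return "\n".join(result)


if __name__ == "__main__":
    print(join_hyphenated("Let us build a snow-\nman on the lawn."))
```

The number of lines is unchanged, so line breaks are preserved except
for the fragment that moves down. A line consisting of a single
hyphenated word becomes empty under this scheme.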


--
  Lars Aronsson (l...@aronsson.se)
  Project Runeberg - free Nordic literature - http://runeberg.org/

