[ https://issues.apache.org/jira/browse/TIKA-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806141#comment-16806141 ]
Oleg Tikhonov commented on TIKA-2650:
-------------------------------------

There is no simple solution. Here is some research related to [automatic text correction|https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=11&ved=2ahUKEwiW5P23r6zhAhU-TxUIHXawD2wQFjAKegQIAxAC&url=https%3A%2F%2Flinguistics.washington.edu%2Ffile%2F532%2Fdownload%3Ftoken%3DhlHhM4Qw&usg=AOvVaw09nb2qj9vESK5LHV-LORcn]

> Soft-hyphen is not extracted properly
> -------------------------------------
>
>                 Key: TIKA-2650
>                 URL: https://issues.apache.org/jira/browse/TIKA-2650
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.18
>            Reporter: Saurabh Patil
>            Priority: Blocker
>         Attachments: Peter Rabbit.pdf, output.txt
>
>
> We are trying to extract text from a PDF. If the PDF has a long word at the
> end of a line, the word is split: a soft hyphen follows the first half and
> the rest of the word goes to the next line. But while extracting this text,
> Tika automatically replaces the hyphen with a space.
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
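To make the problem in the quoted report concrete, here is a minimal Java sketch of a naive post-processing pass. It assumes the extracted text still contains the soft hyphen (U+00AD) or a hard hyphen before the line break, which the report says is not the case for the Tika 1.18 output. The class `SoftHyphenSketch` and its method `rejoin` are hypothetical names for illustration and are not part of Tika.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: rejoins words split at a line break.
final class SoftHyphenSketch {

    // A soft hyphen (U+00AD) plus any whitespace after it, line break included.
    private static final Pattern SOFT_BREAK = Pattern.compile("\u00AD\\s*");

    // A hard hyphen between a letter and a lowercase letter, split by a line break.
    private static final Pattern HARD_BREAK =
            Pattern.compile("(?<=\\p{L})-\\R\\s*(?=\\p{Ll})");

    static String rejoin(String text) {
        // Step 1: a soft hyphen only marks a permitted break point, so drop it
        // together with the line break that follows it.
        String joined = SOFT_BREAK.matcher(text).replaceAll("");
        // Step 2 (heuristic): treat "letter, hyphen, line break, lowercase letter"
        // as a split word. This also merges genuinely hyphenated compounds.
        return HARD_BREAK.matcher(joined).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(rejoin("beau\u00AD\ntiful day")); // soft hyphen case
        System.out.println(rejoin("rab-\nbit"));             // hard hyphen case
        System.out.println(rejoin("well-\nknown"));          // wrongly merged compound
    }
}
```

The last call shows the limitation: "well-known" broken across a line comes back as "wellknown", because a hard hyphen at a line break does not say whether it belongs to the word. Deciding that reliably needs a dictionary or a language model, which is consistent with the comment that there is no simple solution.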