Hello everybody, I am not giving up the fight, but it is hard. I have found a solution for the ligatures: a better converter that translates PDF to plain text more precisely. But a new problem has appeared. In French in particular, though it should be the same in English, I have a big problem with ' and " marks and brackets, which pollute the word counts. At the moment the removePunctuation function is not able to deal with this "punctuation".
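As an illustration only, here is a minimal base R sketch of one way to handle such characters: replace them with a space instead of deleting them, so that a word glued to a quote or bracket is split off cleanly. The helper name strip_punct and the sample sentence are hypothetical, and the list of typographic quotes is an assumption about what the PDF converter produces.

```r
# Hypothetical helper (not part of tm): replace punctuation with a space
# rather than deleting it, so "L'analyse" becomes "L analyse", not "Lanalyse".
strip_punct <- function(x) {
  # Typographic quotes and guillemets, written as Unicode escapes
  # (assumed set; extend it if the converter emits other marks)
  x <- gsub("[\u2018\u2019\u201C\u201D\u00AB\u00BB]", " ", x, perl = TRUE)
  # ASCII punctuation: ' " ( ) [ ] { } . , ; and so on
  x <- gsub("[[:punct:]]", " ", x, perl = TRUE)
  # Collapse the runs of spaces left behind and trim both ends
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

# Made-up sample sentence with quotes and brackets attached to words
txt <- "L\u2019analyse (dite \u00ABfine\u00BB) des 'mots' et des [termes]."
clean <- strip_punct(txt)
words <- strsplit(tolower(clean), " ", fixed = TRUE)[[1]]
print(words)
```

With recent versions of tm, a function like this could probably be applied to a whole corpus with tm_map(corpus, content_transformer(strip_punct)) before building the TermDocumentMatrix, but that step is not tested here.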
The solution would be a more precise version of removePunctuation that can remove any bracket or quote mark. The difficulty is that these marks are not separated from the word by a space; normally they sit directly before or after the word. So removePunctuation treats the mark as part of the word. Another problem, less important, is that words are miscounted depending on whether they carry a plural s or not, and so on. Maybe the TermDocumentMatrix function has an option to ask for the word alone, but I have not found it.

For the moment I deal with this problem by hand. I open all the texts in Word and eliminate all the brackets with a small macro, but that is not an easy way to work with several hundred texts in my corpus. For plurals, I take the word with and without the s and reconcile the counts myself. Fortunately, I only want to keep the 40 most meaningful words of the corpus.

I know what kind of improvement could be made, but I am just a user, not an engineer. I think small improvements could be made by the magical engineers who work for the community, as I modestly try to do with my comments.

Thanks for everything,
Mickaël

