Hello,

For an R solution to importing text for text mining, use the tm package.
Check out lines 6-11 in the following repo:
https://github.com/andrewdefries/CorpusReaders/blob/master/CorpusReader/server.R

Using tm you can import text and perform some operations:

    library(tm)
    # MyCorpus is assumed to be an existing tm corpus (see the repo above)
    # content_transformer() keeps the result a corpus in tm 0.6 and later
    MyCorpus <- tm_map(MyCorpus, content_transformer(tolower))
    MyCorpus <- tm_map(MyCorpus, stemDocument, language = "english")
    MyCorpus <- tm_map(MyCorpus, removeWords, stopwords("english"))
    dtm <- DocumentTermMatrix(MyCorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))
    dtm <- removeSparseTerms(dtm, 0.1) # or 0.2

You can also import text without a package, like so:

    LoadMe <- readLines("out2.txt")
    # split each line on runs of non-word characters
    wordList <- strsplit(LoadMe, "\\W+", perl = TRUE)
    # change into vector from list
    wordVector <- unlist(wordList)

On Monday, October 6, 2014 5:57:26 PM UTC-7, Maureen Kole wrote:
>
> Hello,
>
> Thank you Tesseract developers! I really appreciate your work. I am
> running:
>
> Ubuntu 14.04 LTS
> Tesseract 3.03.03
> Leptonica 1.70
>
> I would like to import data stored in tables into R as a dataframe. This
> will be easiest if the output produced by Tesseract is a delimited file.
> It is not clear to me if using the hOCR option can produce a delimited
> file. If yes, how would one do this for my version? The other option I am
> looking at is preserving multiple spaces in the output txt file and using
> multiple spaces for the delimiter. If this is possible, how would one do
> this for my version?
>
> I attached the png (ndomprod93), the output produced by Tesseract not
> using the hOCR option (out1), and the output produced by Tesseract using
> the hOCR option, (out2).
>
> Cheers and kindly,
> Maureen
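P.S. On the multiple-space delimiter idea from the original question: below is a minimal base R sketch, assuming the plain-text output keeps at least two spaces between table columns (I have not checked whether Tesseract 3.03 does this). The sample lines and column names are made up for illustration; real input would come from readLines() on the output file.

```r
# Made-up sample lines standing in for OCR text output (an assumption,
# not the contents of the attached files).
ocr_lines <- c(
  "Item         1992   1993",
  "Crude oil    10     12",
  "Natural gas  7      9"
)

# Split each line on runs of two or more spaces, so single spaces
# inside a cell ("Crude oil") are preserved.
fields <- strsplit(trimws(ocr_lines), " {2,}")

# Keep only rows with as many fields as the header; ragged rows
# would need manual review.
n_cols <- length(fields[[1]])
ok <- lengths(fields) == n_cols

# Bind the data rows (everything after the header) into a data frame.
body <- do.call(rbind, fields[ok][-1])
df <- as.data.frame(body, stringsAsFactors = FALSE)
names(df) <- fields[[1]]

# Convert every column except the first to numeric.
df[-1] <- lapply(df[-1], as.numeric)
print(df)
```

Rows whose field count differs from the header are silently dropped here, so it is worth comparing sum(ok) against length(ocr_lines) before trusting the result.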