[tesseract-ocr] Re: produce delimited output using hOCR or by preserving original document spacing

Andrew Defries Tue, 07 Oct 2014 14:17:36 -0700

Hello,

For an R solution to importing text for text mining, use the tm package.


See lines 6-11 in the following repo:

https://github.com/andrewdefries/CorpusReaders/blob/master/CorpusReader/server.R

Using tm you can import text and perform some operations:

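library(tm) # stemDocument also needs the SnowballC package installed
# MyCorpus is assumed to be an existing tm corpus
# wrap base functions such as tolower in content_transformer (tm >= 0.6)
MyCorpus<-tm_map(MyCorpus, content_transformer(tolower))
# remove stopwords before stemming, otherwise stemmed stopwords slip through
MyCorpus<-tm_map(MyCorpus, removeWords, stopwords("english"))
MyCorpus<-tm_map(MyCorpus, stemDocument, language="english")
# MyCorpus is already a corpus, so pass it to DocumentTermMatrix directly
dtm <- DocumentTermMatrix(MyCorpus,
    control = list(removePunctuation = TRUE, stopwords = TRUE))
# keep only terms with sparsity below the threshold
dtm<-removeSparseTerms(dtm,0.1) #or 0.2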


Also you can import text without a package like so:

LoadMe<-readLines("out2.txt")

#split each line on runs of non-word characters (spaces, punctuation)
wordList<-strsplit(LoadMe, "\\W+", perl=TRUE)

#change into vector from list
wordVector<-unlist(wordList)
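
For table-like text, the same base R approach can split on runs of spaces instead of on every non-word character. The following is only a minimal sketch, assuming the OCR output keeps at least two spaces between columns, which may not hold for your file. The sample lines, the column count, and the column names are made up for illustration and are not taken from out1 or out2.

```r
# Hypothetical sample lines standing in for readLines("out2.txt")
lines <- c("Product A   12   3.5",
           "Product B   7    4.25")

# Split each line on runs of two or more spaces,
# so single spaces inside a field ("Product A") survive
fields <- strsplit(trimws(lines), " {2,}", perl = TRUE)

# Keep only rows with the expected number of columns (assumed to be 3 here)
n_cols <- 3
fields <- fields[lengths(fields) == n_cols]

# Bind the rows into a character matrix, then convert to a data frame
tbl <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(tbl) <- c("item", "count", "value")  # hypothetical column names

# Convert the numeric columns from character
tbl$count <- as.integer(tbl$count)
tbl$value <- as.numeric(tbl$value)
```

Rows where the OCR output collapsed the spacing will not have the expected number of fields and are dropped by the length check, so it is worth comparing nrow(tbl) with the number of lines read.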



On Monday, October 6, 2014 5:57:26 PM UTC-7, Maureen Kole wrote:
>
> Hello,
>
> Thank you Tesseract developers! I really appreciate your work. I am 
> running:
>
> Ubuntu 14.04 LTS
> Tesseract 3.03.03
> Leptonica 1.70
>
> I would like to import data stored in tables into R as a dataframe. This 
> will be easiest if the output produced by Tesseract is a delimited file. It 
> is not clear to me if using the hOCR option can produce a delimited file. 
> If yes, how would one do this for my version? The other option I am looking 
> at is preserving multiple spaces in the output txt file and using multiple 
> spaces for the delimiter. If this is possible, how would one do this for my 
> version?
>
> I attached the png (ndomprod93), the output produced by Tesseract not 
> using the hOCR option (out1), and the output produced by Tesseract using 
> the hOCR option, (out2).
>
> Cheers and kindly,
> Maureen
>
