I've used the Java code from the TextExtractor answer at
http://stackoverflow.com/questions/10250617/java-apache-poi-can-i-get-clean-text-from-ms-word-doc-files
with good success in Clojure projects.  Either drop it in as Java source 
or convert it to Clojure code.  You'll probably want the tika-parsers jar 
instead of the tika-app jar, though.
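
For illustration, here is a minimal sketch of the Tika route using plain
Java interop from Clojure. It is not the TextExtractor code from the link
above; it uses Tika's own org.apache.tika.Tika facade instead. It assumes
the tika-parsers jar is on the classpath, and the namespace and the
extract-text function name are made up for this example.

```clojure
(ns doc-text.core                       ; hypothetical namespace
  (:require [clojure.java.io :as io])
  (:import (org.apache.tika Tika)))

(defn extract-text
  "Returns the plain text of the document at `path` as a string.
  Tika detects the file type itself (.doc, .pdf, .odt, ...)."
  [path]
  (let [tika (doto (Tika.)
               ;; parseToString truncates at 100,000 characters by default;
               ;; -1 lifts that limit.
               (.setMaxStringLength -1))
        ^java.io.File f (io/file path)]
    (.parseToString tika f)))
```

Something like (extract-text "report.doc") should then return the document
body as one string, where "report.doc" is only a placeholder file name.
Lifting the length limit is a choice made for this sketch; for very large
files you may prefer to keep a cap.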

Brendan

On Thursday, January 2, 2014 6:33:11 PM UTC-5, Ron Toland wrote:
>
> If all you need is the text, you could use Apache Tika to extract it: 
> http://tika.apache.org/
>
> There's a simple clojure lib to get you started: 
> https://github.com/alexott/clj-tika
>
> I've used it to pull text out of .doc, .pdf, and .odt files.
>
> Ron
>
> On Wednesday, January 1, 2014 11:49:30 PM UTC-8, Joshua Mendoza wrote:
>>
>> Hi!
>>
>> I've been looking for libraries or resources to read MS .doc files in 
>> Clojure, but found none. Has anyone tried, used, encountered, or 
>> witnessed such a thing?
>>
>> I found a lot of information made publicly available by the government 
>> in .doc files, and I want to process it automatically with Clojure.
>>
>> The closest thing I know of is Incanter, but that reads XLS files, which 
>> is not useful at all for this...
>>
>> Well, any help would be great.
>>
>> Thank you!
>>
>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en