Re: Parsing MSWord

2008-11-12 Thread Alexander Aristov
Antiword would be hard to integrate into Nutch as it is not Java-based. It will require native calls. Alexander 2008/11/12 Sertic Mirko, Bedag <[EMAIL PROTECTED]> > Hi > > You can also use a tool called "antiword" to extract the text from a .doc > file, and then > give the text to lucene. > > See he

Re: Parsing MSWord

2008-11-11 Thread dipesh
Thank you, it was really helpful. I also found some similar work being done in the Nutch project. Regards, Dipesh On Wed, Nov 12, 2008 at 12:52 PM, Dave Newton <[EMAIL PROTECTED]> wrote: > --- On Tue, 11/11/08, dipesh wrote: > > I wanted to know if there are classes in Lucene that support > > pa

RE: Parsing MSWord

2008-11-11 Thread John Griffin
Dipesh, Start here: http://poi.apache.org/ John G. -----Original Message----- From: dipesh [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 11, 2008 8:38 PM To: java-user@lucene.apache.org Subject: Parsing MSWord Hello, I wanted to know if there are classes in Lucene that support parsing MSW

Re: Parsing MSWord

2008-11-11 Thread Dave Newton
--- On Tue, 11/11/08, dipesh wrote: > I wanted to know if there are classes in Lucene that support > parsing MSWord documents. Searching the web might help: http://www.google.com/search?q=lucene+%2Bword The Apache Tika project (http://incubator.apache.org/tika/) might also be of interest. Dav