Re: Indexing MSword Documents

jim shirreffs Fri, 08 Jun 2007 11:36:20 -0700

I looked at Nutch's code, but it is too complicated for me to follow.

I do not understand the guts of Lucene and how analyzers, parsers, readers, etc. all fit together. I suppose I will be forced to learn it all someday, but at the moment I am adhering to KISS: Keep It Simple, Stupid.


Thanks for taking the time to reply.


jim s

----- Original Message -----
From: "Mathieu Lecarme" <[EMAIL PROTECTED]>
To: <java-user@lucene.apache.org>
Sent: Friday, June 08, 2007 12:48 PM
Subject: Re: Indexing MSword Documents

Why not use Document?
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/document/Document.html

HTMLDocument handles HTML-specific concerns such as encoding, headers, and
other particulars.

Nutch uses dedicated Word tools
(http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/msword/package-summary.html),
but, IMHO, that is not the most difficult part.

M.

On 8 June 07 at 19:23, jim shirreffs wrote:

Hi,
I am trying to index msword documents. I've got things working, but I do not think I am doing things properly.
To index msword docs I use an extractor to extract the text. Then I write the text to a .txt file and index that using an HTMLDocument object. It seems to me that since I have the text, I should be able to just do a
       Doc.add("content", the_text_from_the_word_doc, ???, ???);
But looking at Document.java, it seems the field "content" requires a reader. So I write a temporary file to satisfy that requirement.
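
A temporary file is not the only way to obtain a reader: java.io.StringReader wraps a String that is already in memory. The sketch below uses only the JDK; the class and method names (InMemoryReaderSketch, asReader, drain) are made up for illustration. The Lucene call in the comment is an assumption about the Field constructors available in the version in use and should be checked against its javadoc.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class InMemoryReaderSketch {

    // Wraps already-extracted text in a Reader without touching the file system.
    static Reader asReader(String extractedText) {
        return new StringReader(extractedText);
    }

    // Reads a Reader to the end; stands in for whatever code consumes the Reader.
    static String drain(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder for the String returned by the Word extractor.
        String content = "text pulled out of a Word file";
        Reader reader = asReader(content);

        // Assumption: with Lucene on the classpath, the reader could be handed
        // to a field along the lines of
        //   doc.add(new Field("contents", reader));
        // instead of writing a .txt file and re-reading it.
        System.out.println(drain(reader));
    }
}
```

The point of the sketch is only that the reader requirement does not force a round trip through disk; whether a reader-backed or a string-backed field is the better fit depends on whether the text also needs to be stored in the index.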
What I would like to have is an MSWORDDocument class that would take the extracted text as an argument to the constructor and create a Document object that I could get.
If anyone has any ideas, please let me know.

Here is my code segment. Notice the msword hack:


/*
* make a document
*/

try
{
  if (ftype.startsWith("text"))
  {
     doc = HTMLDocument.Document(f);
  }
  else if (ftype.equals("application/pdf"))
  {
     doc = LucenePDFDocument.getDocument(f);
  }
  else if (ftype.equals("application/msword"))
  {
     FileInputStream fin = new FileInputStream(f.getAbsolutePath());
     WordExtractor extractor = new WordExtractor(fin);
     String content = extractor.getText();
     if(debug) System.out.println(content);
     String tempFileName=f.getAbsolutePath() + ".txt";
     BufferedWriter bw = new BufferedWriter(new FileWriter(tempFileName, false));
     bw.write(content);
     bw.close();
     File df = new File(tempFileName);
     doc = HTMLDocument.Document(df);
     df.delete();
  }
  else if (ftype.equals("binary"))
  {
     return null;
  }
  else
  {
     if(debug) System.out.println("Unknown file type not ascii or  pdf.");
     doc = HTMLDocument.Document(f);
  }
}
catch(java.lang.InterruptedException ie)
{
  throw ie;
}
catch(java.io.IOException ioe)
{
  throw ioe;
}





Thanks in advance


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


