I am trying to index MS Word documents. I have it working, but I do not think I am doing it properly.

To index Word docs I use an extractor to pull out the text. Then I write the text to a .txt file and index that file using an HTMLDocument object. It seems to me that since I already have the text, I should be able to just do a

       Doc.add("content", the_text_from_the_word_doc, ???, ???);

But looking at Document.java, it seems the field "content" requires a Reader. So I write a temporary file to satisfy that requirement.
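If a java.io.Reader really is all the field needs, one possible way around the temporary file is to wrap the extracted String in a java.io.StringReader. The following is only a minimal sketch using the standard library; the class name ReaderFromText and its helper methods are made up for illustration and are not part of Lucene or of the extractor.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Lucene: shows that text already held
// in memory can be exposed as a Reader without touching the file system.
class ReaderFromText {

    // Wrap already-extracted text in a Reader; nothing is written to disk.
    static Reader asReader(String extractedText) {
        return new StringReader(extractedText);
    }

    // Drain a Reader back into a String, only to show that the text
    // survives the round trip unchanged.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by extractor.getText().
        String content = "Text pulled out of a Word file";
        Reader reader = asReader(content);
        System.out.println(readAll(reader));
    }
}
```

Whether the Lucene version in use has a field factory that accepts such a Reader (or the String directly) is something to check against its Javadoc; if it does, the Reader from asReader(content) could presumably stand in for the temp-file reader.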

What I would like to have is an MSWordDocument class that takes the extracted text as an argument to its constructor and creates a Document object that I can retrieve.

If anyone has any ideas, please let me know.

Here is a code segment. Note the msword hack:


/*
 * make a document
 */
try
{
  if (ftype.startsWith("text"))
  {
     doc = HTMLDocument.Document(f);
  }
  else if (ftype.equals("application/pdf"))
  {
     doc = LucenePDFDocument.getDocument(f);
  }
  else if (ftype.equals("application/msword"))
  {
     // msword hack: extract the text, dump it to a temp .txt file,
     // index that file, then delete it
     FileInputStream fin = new FileInputStream(f.getAbsolutePath());
     WordExtractor extractor = new WordExtractor(fin);
     String content = extractor.getText();
     if(debug) System.out.println(content);
     String tempFileName = f.getAbsolutePath() + ".txt";
     BufferedWriter bw = new BufferedWriter(new FileWriter(tempFileName, false));
     bw.write(content);
     bw.close();
     File df = new File(tempFileName);
     doc = HTMLDocument.Document(df);
     df.delete();
  }
  else if (ftype.equals("binary"))
  {
     return null;
  }
  else
  {
     if(debug) System.out.println("Unknown file type not ascii or pdf.");
     doc = HTMLDocument.Document(f);
  }
}
catch(java.lang.InterruptedException ie)
{
  throw ie;
}
catch(java.io.IOException ioe)
{
  throw ioe;
}

Thanks in advance

