Why don't use Document?
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/
org/apache/lucene/document/Document.html
HTMLDocument manage HTML stuff like encoding, header, and other
specificity.
Nutch use specific word tools (http://lucene.apache.org/nutch/apidocs/
org/apache/nutch/parse/msword/package-summary.html), but, IMHO, it's
not the more difficult part.
M.
Le 8 juin 07 à 19:23, jim shirreffs a écrit :
Hi,
I am trying to index msword documents. I've got things working but
I do not think I am doing things properly.
To index msword docs I use an extractor to extract the text. Then I
write the text to a .txt file and index that using an HTMLDocument
object. Seems to me that since I have the text I should be able to
just do a
Doc.add("content", the_text_from_the_word_doc, ???, ???);
But looking at Document.java it seems the field "content" requires
a reader. So I write a temporary file to satified that requirement.
What I would like to have is an MSWORDDocument class that would
take the extracted text as a argument to the constructor and create
a Ducument object that I could get.
If any one has any idea, please let me know.
Here is my code segment. Notice the msword hack,
/*
* make a document
*/
try
{
if (ftype.startsWith("text"))
{
doc = HTMLDocument.Document(f);
}
else if (ftype.equals("application/pdf"))
{
doc = LucenePDFDocument.getDocument(f);
}
else if (ftype.equals("application/msword"))
{
FileInputStream fin = new FileInputStream(f.getAbsolutePath());
WordExtractor extractor = new WordExtractor(fin);
String content = extractor.getText();
if(debug) System.out.println(content);
String tempFileName=f.getAbsolutePath() + ".txt";
BufferedWriter bw = new BufferedWriter(new FileWriter
(tempFileName, false));
bw.write((String) content.toString());
bw.close();
File df = new File(tempFileName);
doc = HTMLDocument.Document(df);
df.delete();
}
else if (ftype.equals("binary"))
{
return null;
}
else
{
if(debug) System.out.println("Unknown file type not ascii or
pdf.");
doc = HTMLDocument.Document(f);
}
}
catch(java.lang.InterruptedException ie)
{
throw ie;
}
catch(java.io.IOException ioe)
{
throw ioe;
}
Thanks in advance
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]