I thought of one more thing you should be aware of. The
default field length for any field (no matter which of the
two forms you use) is 10,000 tokens.
This can easily be changed; see
IndexWriter.setMaxFieldLength().
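As a sketch (against the Lucene 2.x-era API discussed in this thread; the index path and analyzer choice are just placeholders), raising the limit looks like:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class RaiseFieldLength {
    public static void main(String[] args) throws Exception {
        // "/path/to/index" is a placeholder path.
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), true);
        // The default is 10,000 tokens per field; raise it so long
        // documents are indexed in full.
        writer.setMaxFieldLength(Integer.MAX_VALUE);
        writer.close();
    }
}
```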
Best
Erick
On Thu, Jul 24, 2008 at 9:25 AM, starz10de <[EMAIL PROTECTED]> wrote:
Dear Erick ,
Thanks for your answer. I tried another way, where I read the text files
before I index them. I will also try your solution here.
best regards
Erick Erickson wrote:
OK, I'm finally catching on. You have to change the demo code to
get the contents into something besides an input stream, so you
can use one of the alternate forms of the Field constructor. For
instance, you could read it all into a string and use the form:
doc.add(new Field("contents", contentString, Field.Store.YES, Field.Index.TOKENIZED));
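Fleshed out a little, the read-into-a-string approach might look like this (a sketch against the Lucene 2.x API; the method and field names are just examples):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FileDocBuilder {
    // Read the whole file into a String instead of handing the demo
    // an input stream, then use the String form of the Field constructor.
    public static Document fileToDocument(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
        reader.close();

        Document doc = new Document();
        // Store.YES keeps the verbatim text in the index;
        // Index.TOKENIZED analyzes it so it is searchable.
        doc.add(new Field("contents", sb.toString(),
                          Field.Store.YES, Field.Index.TOKENIZED));
        return doc;
    }
}
```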
Hi Erick,
I don't remove stop words, because I index parallel corpora used for
learning translations between pairs of languages, so every word is
important. I even developed my own analyzer for Arabic, which just removes
punctuation and special symbols and returns only Arabic text.
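For what it's worth, an analyzer like the one described (keep letters, drop punctuation and special symbols) can be sketched on the Lucene 2.x CharTokenizer; the class name here is hypothetical, not the poster's actual code:

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical sketch: emit runs of letters and treat punctuation
// and special symbols as separators.
public class LetterOnlyAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new CharTokenizer(reader) {
            protected boolean isTokenChar(char c) {
                // Letters (including Arabic) are token characters;
                // everything else is dropped as a separator.
                return Character.isLetter(c);
            }
        };
    }
}
```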
<<>>
This is not strictly true. For instance, stop words aren't even indexed.
Reconstructing a document from the index is very expensive
(see Luke for examples of how this is done).
You can get the text back verbatim if you store it in your index. See
Field.Store.YES (or Field.Store.COMPRESS). Stora
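The stored-field round trip mentioned above might look like this (a sketch against the Lucene 2.x API; the searcher, document id, and "contents" field name are assumptions for illustration):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;

public class StoredFieldLookup {
    // Sketch: after indexing with Field.Store.YES, read the text back
    // verbatim at search time.
    public static String storedText(IndexSearcher searcher, int docId)
            throws Exception {
        Document hit = searcher.doc(docId);
        // Returns the text exactly as stored; null if the field
        // was not stored in the index.
        return hit.get("contents");
    }
}
```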