Re: storing the contents of a document in the lucene index

2008-07-30 Thread Erick Erickson
I thought of one more thing you should be aware of. The default maximum field length for any field (no matter which of the two forms you use) is 10,000 tokens. This can easily be changed; see IndexWriter.setMaxFieldLength(). Best Erick On Thu, Jul 24, 2008 at 9:25 AM, starz10de <[EMAIL PROTECTED]> w
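
To make the effect of that limit concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class and method names are hypothetical, and whitespace splitting stands in for whatever analyzer is actually used. It only illustrates that tokens past the limit are dropped from the indexed field, so raising the limit with IndexWriter.setMaxFieldLength() matters for long documents.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; Lucene applies the limit internally while indexing.
final class FieldLimitSketch {
    // The default mentioned in the message above: 10,000 tokens per field.
    static final int DEFAULT_MAX_FIELD_LENGTH = 10_000;

    // Returns the tokens that would survive a per-field cap of maxFieldLength.
    // Whitespace splitting is an assumption standing in for a real analyzer.
    static List<String> tokensThatWouldBeIndexed(String text, int maxFieldLength) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields one empty token; skip it
            }
            if (kept.size() >= maxFieldLength) {
                break; // everything after the cap is ignored
            }
            kept.add(token);
        }
        return kept;
    }
}
```

With a cap of 2, the text "a b c d" keeps only "a" and "b"; a search for "d" would then find nothing in that field.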

Re: storing the contents of a document in the lucene index

2008-07-24 Thread starz10de
Dear Erick, Thanks for your answer. I tried another way, where I read the text files before I index them. I will also try your solution here. Best regards Erick Erickson wrote: > > OK, I'm finally catching on. You have to change the demo code to > get the contents into something besides an

Re: storing the contents of a document in the lucene index

2008-07-23 Thread Erick Erickson
OK, I'm finally catching on. You have to change the demo code to get the contents into something besides an input stream, so you can use one of the alternate forms of the Field constructor. For instance, you could read it all into a string and use the form: doc.add(new Field("content", ,
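
The "read it all into a string" step could look roughly like the sketch below. The class and method names are hypothetical, and UTF-8 is an assumed encoding; adjust it to match the corpus. The Lucene call appears only as a comment, using the Field(String, String, Field.Store, Field.Index) constructor form from Lucene 2.x as an assumption about the version in use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: loads a whole text file so it can be handed to a
// String-based Field constructor instead of a Reader or input stream.
final class ContentReader {
    static String readWholeFile(Path path) {
        try {
            // Assumes the file is UTF-8 encoded.
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Possible use inside the indexing code (Lucene 2.x style, shown as an assumption):
//   String text = ContentReader.readWholeFile(file.toPath());
//   doc.add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
```

Reading the whole file into memory is fine for ordinary text documents; for very large files the stored field will also make the index correspondingly larger.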

Re: storing the contents of a document in the lucene index

2008-07-23 Thread starz10de
Hi Erick, I don't remove the stop words, as I index parallel corpora which are used for learning the translations between pairs of languages, so every word is important. I even developed my own analyzer for Arabic, which just removes punctuation and special symbols and returns only Arabic text.

Re: storing the contents of a document in the lucene index

2008-07-22 Thread Erick Erickson
<<>> This is not strictly true. For instance, stop words aren't even indexed. Reconstructing a document from the index is very expensive (see Luke for examples of how this is done). You can get the text back verbatim if you store it in your index. See Field.Store.YES (or Field.Store.COMPRESS). Stora