Re: Lucene performance: benchmarktemplate.xml

2008-04-15 Thread Cass Costello
I just did that so I could read it. :) I'll leave it up until Glen resends or posts it somewhere... http://www.casscostello.com/?page_id=28 On Tue, Apr 15, 2008 at 5:18 PM, Ian Holsman <[EMAIL PROTECTED]> wrote: > Hi Glen. > can you resend this in plain text? > or put the HTML up on a server s

Re: Lucene performance: benchmarktemplate.xml

2008-04-15 Thread Ian Holsman
Hi Glen. can you resend this in plain text? or put the HTML up on a server somewhere and point to it with a brief summary in the post? I'd love to look and read it, all those tags are making me go blind. Glen Newton wrote: Hardware Environment Dedicated machine for indexing: yes CPU: D

Re: replace field in doc?

2008-04-15 Thread AJ Weber
Characters or "terms"? (And btw: what's the difference?) The javadoc says 10,000 "terms", which I assume generally equates to "words" (and given that the analyzer might use stemming, stop words, etc.). Great info. Thanks again! -AJ - Original Message - From: Erick Erickson To

Re: replace field in doc?

2008-04-15 Thread Erick Erickson
Well, "my way" would certainly be simpler to read six months from now when you look at this code again. And I'm quite sure you can add the same field multiple times, so whatever you want. Do note, though, that Lucene defaults to 10,000 characters in any single field no matter which way yo

Re: Which will be faster?

2008-04-15 Thread Michael McCandless
The index should be identical in these two cases as long as the single string yields the same tokens during analysis as the concatenation of the tokens from the separate strings. So index size & search speed would be the same. Mike Darren Govoni wrote: I guess I meant searching the index
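The condition described above (same tokens, same order) can be illustrated with a small plain-Java sketch. It does not use the Lucene API: `tokenize` is a hypothetical stand-in for an analyzer that lowercases and splits on whitespace, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class TokenEquivalenceSketch {
    // Hypothetical stand-in for an analyzer: lowercase, then split on whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Approach A: several values for one field; their tokens are appended in order.
    static List<String> tokensFromValues(List<String> values) {
        List<String> all = new ArrayList<>();
        for (String v : values) {
            all.addAll(tokenize(v));
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> a = tokensFromValues(Arrays.asList("one", "two", "three"));
        // Approach B: one value holding the whole string.
        List<String> b = tokenize("one two three");
        System.out.println(a.equals(b)); // prints "true"
    }
}
```

With a real analyzer the comparison only holds if splitting the text into separate values does not change what the analyzer emits, which is the caveat in the message above.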

Re: Search for phrases

2008-04-15 Thread Daniel Naber
On Tuesday, 15 April 2008, palexv wrote: > I have not tokenized phrases in index. > What query should I use? > Simple TermQuery does not work. Probably PhraseQuery with an argument like "java dev" (no asterisk). > If I try to use QueryParser, what analyzer should I use? Probably KeywordAnaly

Re: replace field in doc?

2008-04-15 Thread AJ Weber
I ended up doing this: String docText = doc.get("body"); Field fCurAll = doc.getField("all"); if ((fCurAll != null) && (docText != null)) { String newAll = fCurAll.stringValue() + "

Re: Which will be faster?

2008-04-15 Thread Darren Govoni
I guess I meant searching the index, size of index etc. So they would search essentially the same? Sorry that wasn't clear from my original email. Darren - Original Message - From: "Erick Erickson" <[EMAIL PROTECTED]> To: Sent: Tuesday, April 15, 2008 1:15 PM Subject: Re: Which will

Test

2008-04-15 Thread Glen Newton
Test

Re: Which will be faster?

2008-04-15 Thread Erick Erickson
I wouldn't worry about it too much, since there'll be overhead for you building up the string in the first place as well. I suspect that the time difference will be dwarfed by the indexing process. So I'd do what's easiest first... Erick On Tue, Apr 15, 2008 at 10:51 AM, darren <[EMAIL PROTEC

Re: replace field in doc?

2008-04-15 Thread Erick Erickson
You can freely add the same field (with different text) to a doc. For instance Document doc = new Document(); doc.add("field", "this is the first"); doc.add("field", "starting the second "); IndexWriter.addDocument(doc) is functionally the same as Document doc = new Document(); doc.add("field",

Lucene performance: benchmarktemplate.xml

2008-04-15 Thread Glen Newton
Hardware Environment Dedicated machine for indexing: yes CPU: Dual processor dual core Xeon CPU 3.00GHz; hyperthreading ON for 8 virtual cores RAM: 8GB Drive configuration: Dell EMC AX150 storage array fibre channel Software environment Lucene Version: 2.3.1 Java Version: Java(TM)

Lucene performance: benchmarktemplate.xml

2008-04-15 Thread Glen Newton
Hardware Environment Dedicated machine for indexing: yes CPU: Dual processor dual core Xeon CPU 3.00GHz; hyperthreading ON for 8 virtual cores RAM: 8GB Drive configuration: Dell EMC AX150 storage array fibre channel Software environment Lucene Version: 2.3.1 Java Versio

Re: Which will be faster?

2008-04-15 Thread Michael McCandless
Most likely B will be somewhat faster. There is some small overhead to each field instance. Mike darren wrote: Hi, Pardon the noob question. But which approach is going to be faster over extremely large document sets. A or B? A) Multiple field values, Stored.NO,TOKENIZED. word: one word: t

Which will be faster?

2008-04-15 Thread darren
Hi, Pardon the noob question. But which approach is going to be faster over extremely large document sets. A or B? A) Multiple field values, Stored.NO,TOKENIZED. word: one word: two word: three B) Single field value, Stored.NO,TOKENIZED word: one two three Thanks for the tip. Darren

Re: Implementing CMS search function using Lucene

2008-04-15 Thread Илья Казначеев
On Sunday 13 April 2008 14:20:01, Grant Ingersoll wrote: Thanks for your reply! > > I don't want it to work more than half second on > > reasonable sized index. Also, I don't want to hard-code exact list > > of fields, > > I might add them as I develop the system. Is this doable,

replace field in doc?

2008-04-15 Thread AJ Weber
I'm curious how people are building the "all" Field (for searching "all of the terms at once"). I understand using store=NO, Index=Tokenized is generally the way to add the field, but what if I need to basically use multiple classes to build my Document before adding it to the index (keeping th

Re: Max length

2008-04-15 Thread Erick Erickson
The default is 10,000 characters, but, as Grant says, you can change it with IndexWriter.setMaxFieldLength(). Erick On Tue, Apr 15, 2008 at 6:31 AM, WATHELET Thomas <[EMAIL PROTECTED]> wrote: > Hi my question is very simple, > Is there a size limitation for the text to index > Because I try to i
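The symptom in this thread (text stored in full, but words past some point not found) is what a per-field cap on indexed content would produce. The plain-Java sketch below models that effect only; it is not Lucene code. It assumes the cap counts terms, as the javadoc quoted elsewhere in this listing says, and `indexedTerms` and `searchable` are hypothetical helper names.

```java
import java.util.ArrayList;
import java.util.List;

final class MaxFieldLengthSketch {
    // Model of a per-field cap: keep only the first maxTerms whitespace-separated terms.
    static List<String> indexedTerms(String text, int maxTerms) {
        List<String> kept = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (t.isEmpty()) {
                continue;
            }
            if (kept.size() >= maxTerms) {
                break; // everything after the cap is never indexed
            }
            kept.add(t);
        }
        return kept;
    }

    // A word is findable only if it survived the cap.
    static boolean searchable(String text, int maxTerms, String word) {
        return indexedTerms(text, maxTerms).contains(word);
    }

    public static void main(String[] args) {
        String doc = "alpha beta gamma delta epsilon";
        System.out.println(searchable(doc, 3, "beta"));     // true: within the cap
        System.out.println(searchable(doc, 3, "epsilon"));  // false: beyond the cap
        System.out.println(searchable(doc, 10, "epsilon")); // true: cap raised
    }
}
```

In Lucene itself the corresponding knob is the IndexWriter.setMaxFieldLength() call named in the replies; raising it before indexing is the suggested fix.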

Re: Search for phrases

2008-04-15 Thread Erick Erickson
It would help a lot if you provided a couple of examples of inputs into your index and expected outputs for queries. For instance, you say: <<>> But then in your follow-up you say <<>> Well, if you haven't tokenized your input streams at index time and query time, you can't get what your first s

Re: Max length

2008-04-15 Thread Grant Ingersoll
On IndexWriter, have a look at the setMaxFieldLength() method. On Apr 15, 2008, at 6:31 AM, WATHELET Thomas wrote: Hi my question is very simple, Is there a size limitation for the text to index Because I try to index a long document and the content of this one is stored correctly into the in

Max length

2008-04-15 Thread WATHELET Thomas
Hi, my question is very simple: is there a size limitation for the text to index? Because I try to index a long document, and its content is stored correctly in the index, but it seems that indexing stops at the middle of the document. I can't find any word located after the middle. A

Re: stemming in Lucene

2008-04-15 Thread Hannu Väisänen
Wojtek H wrote: >Snowball stemmers are part of Lucene, but for few languages only >But maybe there is a better way or there are people working on >something like that? I use Malaga (http://home.arcor.de/bjoern-beutel/malaga/) for lemmatization and index the result. http://joyds1.joensuu.fi/progra

Re: Sorting consumes hundreds of MBytes RAM

2008-04-15 Thread Timo Nentwig
What do you mean by "that's true"? That Lucene does read all data available in the index for this field into memory? In this case index sharding should help, right? On Sun, 13 Apr 2008, Otis Gospodnetic wrote: Date: Sun, 13 Apr 2008 20:25:09 -0700 (PDT) From: Otis Gospodnetic <[EMAIL PROTECTED