what i am measuring is this Analyzer analyzer = new StandardAnalyzer(new String[]{});
if(fldArray.length > 1) { BooleanClause.Occur[] flags = {BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD}; query = MultiFieldQueryParser.parse(queryString, fldArray, flags, analyzer); //parse the } else { query = new QueryParser("tags", analyzer).parse(queryString); System.err.println("QUERY IS " + query.toString()); } long ts = System.currentTimeMillis(); hits = searcher.search(query, new Sort("sid", true)); java.util.Set terms = new java.util.HashSet(); query.extractTerms(terms); Object[] strTerms = terms.toArray(); long ta = System.currentTimeMillis(); System.err.println("Retrived Hits in " + (ta - ts)); recent figures for this .. Retrived Hits in 26974 (msec) Retrived Hits in 61415 (msec) The query itself is just a word, eg "apple" The index is contructed as follows .. // this process runs as a daemon process ... iwriter = new IndexWriter(INDEX_PATH, new StandardAnalyzer(new String[]{}), false); Document doc = new Document(); doc.add(new Field("id", String.valueOf(story.getId()), Field.Store.YES, Field.Index.NO)); doc.add(new Field("sid",String.valueOf(story.getDate().getTime()), Field.Store.YES, Field.Index.UN_TOKENIZED)); String tags = getTags(story.getId()); doc.add(new Field("tags", tags, Field.Store.YES, Field.Index.TOKENIZED)); doc.add(new Field("headline", story.getHeadline(), Field.Store.YES, Field.Index.TOKENIZED)); doc.add(new Field("blurb", story.getBlurb(), Field.Store.YES, Field.Index.TOKENIZED)); doc.add(new Field("content", story.getContent(), Field.Store.YES, Field.Index.TOKENIZED)); doc.add(new Field("catid", String.valueOf(story.getCategory()), Field.Store.YES, Field.Index.TOKENIZED)); iwriter.addDocument(doc); // then iwriter.close() optimize just runs once a day, after some deletions . The tags are select words .. in total about 20K different ones in combination .. so story.getTags() -> Returns a string of the type "XXX YYY ZZZ YYY CCC DDD" story.getId() -> returns a long story.sid -> thats a long too story.getContent() -> returns text in most cases sometimes its blank story.getHeadline() -> returns text usually about 512 chars story.getBlurb() -> returns text about 255 chars story.getCatid() -> returns a long that covers both sections i.e. the read and the write .. I did look at luke, but unfortunately the docs dont seem to refer to a commandline interface to it (unless i missed something).. This is running on a headless box .. Cheers Mohammed. On 8/19/06, Erick Erickson <[EMAIL PROTECTED]> wrote:
This is a loooonnnnnggg time, I think you're right, it's excessive. What are you timing? The time to complete the search (i.e. get a Hits object back) or the total time to assemble the response? Why I ask is that the Hits object is designed to return the fir st100 or so docs efficiently. Every 100 docs or so, it re-executes the query. So, if you're returning a large result set, the using the Hits object to iterate over them, this could account for your time. Use a HitCollector instead... But do note this from the javadoc for hitcollector ---- . For good search performance, implementations of this method should not call Searcher.doc(int)< file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/search/Searcher.html#doc%28int%29 >or IndexReader.document(int)< file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/index/IndexReader.html#document%28int%29 >on every document number encountered. Doing so can slow searches by an order of magnitude or more. ----- FWIW, I have indexes larger that 1G that return in far less time than you are indicating, through three layers and constructing web pages in the meantime. It contains over 800K documents and the response time is around a second (haven't timed it lately). This includes 5-way sorts. You might also either get a copy of Luke and have it explain exactly what the parse does or use one of the query exlain calls (sorry, don't remember them off the top of my head) to see what query is *actually* submitted and whether it's what you expect. Are you using wildcards? They also have an effect on query speed. If none of this applies, perhaps you could post the query and the how the index is constructed. If you haven't already gotten a copy of Luke, I heartily recommend it.... Hope this helps Erick On 8/19/06, M A <[EMAIL PROTECTED]> wrote: > > Hi there, > > I have an index with about 250K document, to be indexed full text. > > there are 2 types of searches carried out, 1. using 1 field, the other > using > 4 .. for a query string ... > > given the nature of the queries required, all stop words are maintained in > the index, thereby allowing for phrasal queries, (this is a requirement) > .. > > So search I am using the following .. > > if(fldArray.length > 1) > { > // use four fields > BooleanClause.Occur[] flags = {BooleanClause.Occur.SHOULD, > BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD, > BooleanClause.Occur.SHOULD}; > query = MultiFieldQueryParser.parse(queryString, fldArray, flags, > analyzer); //parse the > } > else > { > //use only 1 field > query = new QueryParser("tags", analyzer).parse(queryString); > } > > > When i search on the 4 fields the average search time is 16 sec .. > When i search on the 1 field the average search time is 9 secs ... > > The Analyzer used for both searching and indexing is > Analyzer analyzer = new StandardAnalyzer(new String[]{}); > > The index size is about a 1GB .. > > The documents vary in size some are less than 1K max size is about 5k > > > > Is there anything I can do to make this faster.... 16 secs is just not > acceptable .. > > Machine : 512MB, celeron 2600 ... Lucene 2.0 > > I could go for a bigger machine but wanted to make sure that the problem > was > not something I was doing, given 250K is not that large a figure .. > > Please Help > > Thanx > > Mo > >