Re: search problem

2006-04-26 Thread April06

I guess that fixes the problem.
Thanks
--
View this message in context: 
http://www.nabble.com/search-problem-t1506294.html#a4096490
Sent from the Lucene - Java Users forum at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Cannot save index to 'index' directory, please delete it first

2006-04-26 Thread 一只小蚂蚁
I got an error like this when I ran the demo in Lucene 1.9.1: "Cannot save
index to 'index' directory, please delete it first".
Can anyone tell me why? I have set the classpath.

--
『Busy and bustling all day, yet accomplishing nothing』

一只小蚂蚁
http://blog.csdn.net/qixiang_nj


java.io.IOException: Stale NFS file handle

2006-04-26 Thread Schwenker, Stephen
Hey,
 
I'm running into this exception with my Lucene searching.  We have a cluster of
two servers that execute searches and one server in the back end that writes to
the index.  I thought that setting up the external boxes on NFS would be
fine, since searching doesn't require locking.  Can anyone tell me why this
may be happening and possibly suggest a fix?  I've already tried setting
-Dorg.apache.lucene.lockDir=/tmp in the JVM args, but it doesn't seem to do
the job.
 
I have also considered local filesystems on each cluster member, but the index
is updated frequently and would need to be mirrored too often for it to be
worthwhile.  Any suggestions would be helpful.
 
Thank you,
 
 
Steve.
 
 
Here is the stack trace in case you need it.
 
2006-04-26 08:57:36,160 INFO  [STDOUT] java.io.IOException: Stale NFS file 
handle
2006-04-26 08:57:36,163 INFO  [STDOUT]  at 
java.io.RandomAccessFile.readBytes(Native Method)
2006-04-26 08:57:36,164 INFO  [STDOUT]  at 
java.io.RandomAccessFile.read(RandomAccessFile.java:315)
2006-04-26 08:57:36,164 INFO  [STDOUT]  at 
org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:449)
2006-04-26 08:57:36,165 INFO  [STDOUT]  at 
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:45)
2006-04-26 08:57:36,166 INFO  [STDOUT]  at 
org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:219)
2006-04-26 08:57:36,166 INFO  [STDOUT]  at 
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:64)
2006-04-26 08:57:36,167 INFO  [STDOUT]  at 
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:33)
2006-04-26 08:57:36,167 INFO  [STDOUT]  at 
org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:56)
2006-04-26 08:57:36,168 INFO  [STDOUT]  at 
org.apache.lucene.index.TermBuffer.read(TermBuffer.java:62)
2006-04-26 08:57:36,169 INFO  [STDOUT]  at 
org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:117)
2006-04-26 08:57:36,170 INFO  [STDOUT]  at 
org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:148)
2006-04-26 08:57:36,170 INFO  [STDOUT]  at 
org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:157)
2006-04-26 08:57:36,171 INFO  [STDOUT]  at 
org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:151)
2006-04-26 08:57:36,172 INFO  [STDOUT]  at 
org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:300)
2006-04-26 08:57:36,173 INFO  [STDOUT]  at 
org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:78)
2006-04-26 08:57:36,173 INFO  [STDOUT]  at 
org.apache.lucene.search.Similarity.idf(Similarity.java:255)
2006-04-26 08:57:36,174 INFO  [STDOUT]  at 
org.apache.lucene.search.TermQuery$TermWeight.&lt;init&gt;(TermQuery.java:43)
2006-04-26 08:57:36,175 INFO  [STDOUT]  at 
org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:142)
2006-04-26 08:57:36,175 INFO  [STDOUT]  at 
org.apache.lucene.search.BooleanQuery$BooleanWeight.&lt;init&gt;(BooleanQuery.java:203)
2006-04-26 08:57:36,176 INFO  [STDOUT]  at 
org.apache.lucene.search.BooleanQuery$BooleanWeight2.&lt;init&gt;(BooleanQuery.java:330)
2006-04-26 08:57:36,177 INFO  [STDOUT]  at 
org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:372)
2006-04-26 08:57:36,177 INFO  [STDOUT]  at 
org.apache.lucene.search.Query.weight(Query.java:93)
2006-04-26 08:57:36,178 INFO  [STDOUT]  at 
org.apache.lucene.search.Hits.&lt;init&gt;(Hits.java:48)
2006-04-26 08:57:36,179 INFO  [STDOUT]  at 
org.apache.lucene.search.Searcher.search(Searcher.java:53)


Highlight

2006-04-26 Thread anton feldmann

Hi

I wrote a program that converts a PDF document into a Lucene document. The
fields are "contents", "sentence", ...

How do I display the sentence that the query string occurs in, and how do I
highlight that string?


cheers

anton feldmann

package de.coli.seek.lucene;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Calendar;
import java.util.StringTokenizer;

import java.net.URL;
import java.net.URLConnection;

import java.util.Date;

import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDDocumentInformation;

import org.pdfbox.exceptions.CryptographyException;
import org.pdfbox.exceptions.InvalidPasswordException;

import org.pdfbox.util.PDFTextStripper;

public final class Sentence2Document
{
    private static final char FILE_SEPARATOR = System.getProperty("file.separator").charAt(0);

    // given caveat of increased search times when using
    // MICROSECOND, only use SECOND by default
    private DateTools.Resolution dateTimeResolution = DateTools.Resolution.SECOND;

    /**
     * accessor
     * @return current date/time resolution
     */
    public DateTools.Resolution getDateTimeResolution()
    {
        return dateTimeResolution;
    }

    /**
     * mutator
     * @param resolution set new date/time resolution
     */
    public void setDateTimeResolution( DateTools.Resolution resolution )
    {
        dateTimeResolution = resolution;
    }

    //
    // compatibility methods for lucene-1.9+
    //
    private String timeToString( long time )
    {
        return DateTools.timeToString( time, dateTimeResolution );
    }

    private static void addKeywordField( Document document, String name, String value )
    {
        if ( value != null )
        {
            document.add( new Field( name, value, Field.Store.YES, Field.Index.UN_TOKENIZED ) );
        }
    }

    private static void addTextField( Document document, String name, Reader value )
    {
        if ( value != null )
        {
            document.add( new Field( name, value ) );
        }
    }

    private static void addTextField( Document document, String name, String value )
    {
        if ( value != null )
        {
            document.add( new Field( name, value, Field.Store.YES, Field.Index.TOKENIZED ) );
        }
    }

    private void addTextField( Document document, String name, Date value )
    {
        if ( value != null )
        {
            addTextField( document, name, DateTools.dateToString( value, dateTimeResolution ) );
        }
    }

    private void addTextField( Document document, String name, Calendar value )
    {
        if ( value != null )
        {
            addTextField( document, name, value.getTime() );
        }
    }

    private static void addUnindexedField( Document document, String name, String value )
    {
        if ( value != null )
        {
            document.add( new Field( name, value, Field.Store.YES, Field.Index.NO ) );
        }
    }

    private static void addUnstoredKeywordField( Document document, String name, String value )
    {
        if ( value != null )
        {
            document.add( new Field( name, value, Field.Store.NO, Field.Index.UN_TOKENIZED ) );
        }
    }

    /**
     * private constructor; instances are created internally by the static
     * factory methods.
     */
    private Sentence2Document()
    {
        // utility class should not be instantiated directly
    }

    /**
     * This will get a lucene document from a PDF file.
     *
     * @param is The stream to read the PDF from.
     *
     * @return The lucene document.
     *
     * @throws IOException If there is an error parsing or indexing the document.
     */
    public static Document getDocument( InputStream is ) throws IOException
    {
        Sentence2Document converter = new Sentence2Document();
        return converter.convertDocument( is );
    }

    /**
     * Convert the PDF stream to a lucene document.
     *
     * @param is The input stream.
     * @return The input stream converted to a lucene document.
     * @throws IOException If there is an error converting the PDF.
     */
    public Document convertDocument( InputStream is ) throws IOException
    {
        Document document = new Document();
        addContent( document, is, "" );
        return document;
    }

    /**
     * This will take a reference to a PDF document and create a lucene document.
     *
     * @param file A reference to a PDF document.
     * @return The converted lucene document.
     *
     * @throws IOException If there is an exception while converting the document.
     */
    public Document convertDocument( File file ) throws IOException
    {
        Doc

RAM Directory / querying Performance issue

2006-04-26 Thread zzzzz shalev
I've rewritten the RAMDirectory to support 64-bit indexes (I still haven't had
time to contribute this to Lucene; hopefully in the coming months when I have a
free second).

My question: I have a machine with 4 GB of RAM and a 3 GB index file. I can
successfully load the 3 GB index into memory, and the first few queries run
with normal response times, but response time very quickly becomes unbearably
slow (load testing with one concurrent user).

How are queries expanded in memory when run (how much memory do they use)?
Could this be an issue of the queries themselves taking up large chunks of RAM?



MatchAllDocsQuery, MultiSearcher and a custom HitCollector throwing exception

2006-04-26 Thread jm
Hi,

I have encountered an issue with Lucene 1.9.1. It involves
MatchAllDocsQuery, MultiSearcher, and a custom HitCollector. The
following code throws java.lang.UnsupportedOperationException.

If I remove the MatchAllDocsQuery clause (comment out the whole //1
block), or if I don't use the custom HitCollector (ms.search(mbq);
instead of ms.search(mbq, allcoll);), the exception goes away. By
stepping into the source, it seems to be due to MatchAllDocsQuery
not implementing extractTerms().
I have never looked at Lucene internals before; any help as to what
extractTerms() should do, or any other hint to overcome this?

thanks,


Searcher searcher = new IndexSearcher("c:\\projects\\mig\\runtime\\index\\01Aug16\\");
Searchable[] indexes = new IndexSearcher[1];
indexes[0] = searcher;
MultiSearcher ms = new MultiSearcher(indexes);

AllCollector allcoll = new AllCollector(ms);

BooleanQuery mbq = new BooleanQuery();
mbq.add(new TermQuery(new Term("body", "value1")), BooleanClause.Occur.MUST_NOT);
// 1
MatchAllDocsQuery alld = new MatchAllDocsQuery();
mbq.add(alld, BooleanClause.Occur.MUST);
//

System.out.println("Query: " + mbq.toString());

// 2
ms.search(mbq, allcoll);
//ms.search(mbq);




Re: MatchAllDocsQuery, MultiSearcher and a custom HitCollector throwing exception

2006-04-26 Thread Yonik Seeley
Hi Jim,

This went to the old mailing list...
Could you email this to java-user@lucene.apache.org
and maybe open a JIRA bug for it?

-Yonik

On 4/26/06, jm <[EMAIL PROTECTED]> wrote:
> [original message quoted in full; snipped]


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: MatchAllDocsQuery, MultiSearcher and a custom HitCollector throwing exception

2006-04-26 Thread jm
On 4/26/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Hi Jim,
>
> This went to the old mailing list...
> Could you email this to java-user@lucene.apache.org
> and maybe open a JIRA bug for it?
>
> -Yonik
>
> [nested quote of the original message snipped]




Re: MatchAllDocsQuery, MultiSearcher and a custom HitCollector throwing exception

2006-04-26 Thread jm
OK, thanks for letting me know.

I entered a bug, 556.
javi

On 4/26/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Hi Jim,
>
> This went to the old mailing list...
> Could you email this to java-user@lucene.apache.org
> and maybe open a JIRA bug for it?
>
> -Yonik
>
> [nested quote of the original message snipped]




Partial token matches

2006-04-26 Thread Eric Isakson
Hi All,

Just wanted to throw out something I'm working on. It is working well for me, 
but I wanted to see if anyone can suggest any other alternatives that might 
perform better than what I'm doing now.

I have a field in my index that contains keywords (back-of-the-book index
terms) and a UI feature that allows the user to find documents that contain a
partial keyword supplied by the user. So a particular document in my index
might have the token "informat" in the keywords field, and the user may supply
"form" in the UI and should get a match.

My old implementation does not use Lucene and just uses String.matches with a 
regular expression that looks like ".*form.*". I reimplemented using Lucene and 
just tokenize the field so I get the tokens

informat
nformat
format
ormat
rmat
mat
at
t

Then I use a prefix query to find hits. Both implementations ignore case in the 
search and the hit order is controlled by another field that I'm sorting on, so 
relevance ranking is not important in this use case. Search time performance is 
crucial, time to create the index and index size are not really important. The 
index is created statically at application startup or possibly delivered to the 
application and is not updated while the application is using it.

Thanks for any suggestions,
Eric
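The suffix scheme Eric describes (index every suffix of a keyword, then match a partial keyword as a prefix of some suffix) can be sketched without Lucene. The class and method names below are illustrative assumptions, not Lucene API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of the "suffix rotation" trick: index every suffix of a token so
// that a prefix query ("form*") matches an infix of the original token
// ("informat"). Names are illustrative, not Lucene API.
public class SuffixRotation {

    // All suffixes of the token, lowercased so the search is case-insensitive.
    public static List<String> suffixes(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < t.length(); i++) {
            out.add(t.substring(i));
        }
        return out;
    }

    // A partial keyword matches if it is a prefix of any indexed suffix,
    // i.e. a substring of the original token.
    public static boolean matches(String token, String partial) {
        String p = partial.toLowerCase(Locale.ROOT);
        for (String s : suffixes(token)) {
            if (s.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(suffixes("informat")); // [informat, nformat, format, ormat, rmat, mat, at, t]
        System.out.println(matches("informat", "form")); // true
    }
}
```

In Lucene terms, `suffixes()` is what the analyzer emits for each keyword at index time, and `matches()` is what the PrefixQuery effectively does against the term dictionary.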




Re: java.io.IOException: Stale NFS file handle

2006-04-26 Thread Otis Gospodnetic
Steve,

There are some locks involved in search, like the one that gets written to the
FS before the reader reads all the segment/index files listed in the segments
file.  Once they are all read, the lock is released.  Setting the lock dir to
the local /tmp doesn't sound good, as locks have to be in a common location in
order to have the desired locking effect.

As for a suggestion for large, frequently updated indices - have you considered
NAS?

Otis

- Original Message 
From: "Schwenker, Stephen" <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, April 26, 2006 9:19:55 AM
Subject: java.io.IOException: Stale NFS file handle

[original message and stack trace quoted in full; snipped]






Lucene search benchmark/stress test tool

2006-04-26 Thread Otis Gospodnetic
Hi,

I'm about to write a little command-line Lucene search benchmark tool.  I'm
interested in benchmarking search performance, with the ability to specify the
concurrency level (# of parallel search threads) and capture response timings,
so I can calculate min, max, and mean times.  Something like the 'ab' (Apache
Benchmark) tool, but for Lucene.

Has anyone already written something like this?

Thanks,
Otis







Re: Partial token matches

2006-04-26 Thread Erick Erickson
I'm sure the guys will chime in, but I think you're in significant danger of
getting a "too many clauses" exception thrown. Try searching on, say, "an".
Under the covers, Lucene expands your query to have a clause for *every*
item in your index that starts with "an", so there's a clause for "an" "ana"
"anb", "anaa", "anab", ... The shorter your term, the more there'll be,
and if there are more than 1024, you'll get the exception above. You can set
the number of clauses to a bigger number, but that may not scale well.

Consider writing a filter (see Lucene In Action). The filter will return a
bitset with a bit turned on for each potential match, and avoid this issue.
RegexTermEnum helps a lot here.

Try searching the archive for a thread started by me, titled "I just don't
get wildcards at all" for an exposition by the guys on this sort of thing.
That thread centers on wildcard queries, but I'm pretty sure PrefixQuery
suffers from the same issue.

Chris, Erik, Yonik... do I have this right?

Erick
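The clause expansion Erick describes can be illustrated with a toy term dictionary. The counting code below is an illustration, not Lucene internals, though the 1024 default clause limit is real:

```java
import java.util.List;

// Toy illustration of prefix-query expansion: a PrefixQuery rewrites to one
// clause per matching term in the index's term dictionary, so short prefixes
// can exceed the default clause limit (1024 in Lucene 1.9) and throw
// a TooManyClauses exception.
public class PrefixExpansion {

    // Count how many clauses the rewritten query would contain.
    public static int countClauses(List<String> termDictionary, String prefix) {
        int clauses = 0;
        for (String term : termDictionary) {
            if (term.startsWith(prefix)) {
                clauses++;
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("an", "ana", "anab", "and", "answer", "ant", "zoo");
        // "an*" expands to 6 clauses here; against a real term dictionary a
        // two-letter prefix can easily exceed 1024 clauses.
        System.out.println(countClauses(terms, "an")); // 6
    }
}
```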


DateTools question

2006-04-26 Thread Bill Snyder
Hello,

Why does DateTools.dateToString() return a String representation of my Date,
but in a different TimeZone. Does it use its own Calendar/TimeZone settings?

F.I.

DateFormat format = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss.SSS");
System.out.println(DateTools.dateToString(format.parse("2006-04-26 07:29:52.581"), DateTools.Resolution.MINUTE));

will print out

200604261129

Why the 4 hour difference?

Thanks!

--Bill


Re: DateTools question

2006-04-26 Thread Chris Hostetter

: Why does DateTools.dateToString() return a String representation of my Date,
: but in a different TimeZone. Does it use its own Calendar/TimeZone settings?

Yes, DateTools is hardcoded to use GMT for its string representations.

It wouldn't be safe for DateTools to use your current TimeZone/Locale,
because once you've indexed the value, your index might be used by another
application (or another instance of your application) running in a
different locale.

The important thing is not what string DateTools.dateToString returns;
it's whether you get an equivalent date back (based on the resolution you
specified) when you do something like this...

  Date a = ...;
  DateTools.Resolution r = ...;
  Date b = DateTools.stringToDate(DateTools.dateToString(a,r));
  System.out.println("Is '"+a+"' the same as '"+b+"' with "+r+" resolution?");


-Hoss
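The four-hour gap in Bill's example is simply his local zone (US Eastern daylight time) versus GMT; a plain java.text sketch, with no Lucene involved, reproduces the shift. The class and method names are illustrative:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Parse a timestamp in a given zone, then format the same instant the way
// DateTools does: in GMT, at MINUTE resolution (yyyyMMddHHmm).
public class GmtShift {

    public static String localToGmt(String local, String zone) {
        try {
            SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
            in.setTimeZone(TimeZone.getTimeZone(zone)); // interpret input in this zone
            Date instant = in.parse(local);

            SimpleDateFormat out = new SimpleDateFormat("yyyyMMddHHmm");
            out.setTimeZone(TimeZone.getTimeZone("GMT")); // DateTools formats in GMT
            return out.format(instant);
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // 07:29 US Eastern daylight time is 11:29 GMT -- the "4 hour difference".
        System.out.println(localToGmt("2006-04-26 07:29:52.581", "America/New_York")); // 200604261129
    }
}
```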





Re: Partial token matches

2006-04-26 Thread Chris Hostetter

: I'm sure the guys will chime in, but I think you're in significant danger of
: getting a "too many clauses" exception thrown. Try searching on, say, "an".
: Under the covers, Lucene expands your query to have a clause for *every*
: item in your index that starts with "an", so there's a clause for "an" "ana"
: "anb", "anaa", "anab", ... The shorter your term, the more there'll be,
: and if there are more than 1024, you'll get the exception above. You can set
: the number of clauses to a bigger number, but that may not scale well.

When using any of the queries that expand into a BooleanQuery, there is
almost always the possibility of hitting TooManyClauses -- but this
approach of using PrefixQuery is definitely safer/faster than a straight
use of WildcardQuery -- at the expense of a bigger index.

The idea mentioned in this thread is basically the same as an idea Erik
Hatcher has suggested in the past, which I've taken to referring to as
"wildcard term rotating"...
  http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12261.html

: Consider writing a filter (see Lucene In Action). The filter will return a
: bitset with a bit turned on for each potential match, and avoid this issue.

very true -- but at the expense of scoring information (i.e., how many times
does the term appear in the document?) ... it's all a question of
priorities.





-Hoss





Filter operation

2006-04-26 Thread Tom Emerson
Greetings,

If I write a filter, does this run over the documents in the index *before*
a search is made (i.e., every document in the index is touched) or on the
result set after the search? If it is run over all of the documents, doesn't
this become a performance bottleneck on any non-trivial filter?

--
Tom Emerson
[EMAIL PROTECTED]
http://www.dreamersrealm.net/~tree


Dealing with acronyms

2006-04-26 Thread Hannes Carl Meyer

Hi All,

I would like to enable users to do an acronym search on my index.
My idea is the following:

1.) Extract acronyms (ABS, ESP, VCG etc.) from the given document (which 
is going to be indexed)


2.) Store the extracted acronyms in a field, for example called "case"

3.) On search, asking the user to use case:"ABS" to search for acronyms

Any experience with this kind of pattern? Other ideas or best practices?

Thank you in advance and best regards

Hannes




How to display a field value

2006-04-26 Thread anton feldmann

Hi

How do I display the whole field value of a document in which the query
string is found?


cheers

anton




Re: Dealing with acronyms

2006-04-26 Thread Stefan Will
This makes perfect sense to me. Of course the hard part will be how to 
extract the acronyms.


-- Stefan

Hannes Carl Meyer wrote:

> [original message quoted in full; snipped]








Re: RAM Directory / querying Performance issue

2006-04-26 Thread Doug Cutting
Is this markedly faster than using an MMapDirectory?  Copying all this 
data into the Java heap (as RAMDirectory does) puts a tremendous burden 
on the garbage collector.  MMapDirectory should be nearly as fast, but 
keeps the index out of the Java heap.


Doug

z shalev wrote:

> [original message quoted in full; snipped]





Re: Dealing with acronyms

2006-04-26 Thread Rajesh Munavalli
On 4/26/06, Hannes Carl Meyer <[EMAIL PROTECTED]> wrote:
>
> Hi All,
>
> I would like enable users to do an acronym search on my index.
> My idea is the following:
>
> 1.) Extract acronyms (ABS, ESP, VCG etc.) from the given document (which
> is going to be indexed)


In case you haven't already seen it, you might find this useful:
http://www.cs.waikato.ac.nz/~nzdl/publications/1999/Yeates-Auto-Extract.pdf


2.) Store the extracted acronyms in a field, for example called "case"
>
> 3.) On search, asking the user to use case:"ABS" to search for acronyms


I would rather store them in the same field with others, so that you can do
phrase queries. Store the acronyms just like you would store synonyms. More
information on how to store synonyms is in "Lucene in Action" book. This
would facilitate queries like "USA President". If you store "USA" in a
separate field, you wouldn't be able to match this query.

Any experience with this kind of pattern? Other ideas or best practices?

I would also look at HMMs/CRFs to extract acronyms. You need to come up with
a list of features to identify a potential acronym. For ex:
- All Caps
- The acronym appears repeatedly in the rest of the text
- Found in the acronym dictionary...etc

Hope this helps,

--Rajesh Munavalli
Blog: http://munavalli.blogspot.com
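A minimal sketch of the first feature (all-caps tokens) using a plain regex; the pattern and class name here are illustrative assumptions, and real extraction would combine the other features Rajesh lists:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive acronym-candidate extraction: 2-6 letter all-caps tokens. This covers
// only the "All Caps" feature; repetition counts and a dictionary lookup
// would be needed to filter the candidates further.
public class AcronymExtractor {

    private static final Pattern ACRONYM = Pattern.compile("\\b[A-Z]{2,6}\\b");

    public static Set<String> extract(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = ACRONYM.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("Cars with ABS and ESP are safer. The VCG unit failed."));
        // [ABS, ESP, VCG]
    }
}
```

The extracted set would then be added to the document, either in a separate "case" field or alongside the body tokens as Rajesh suggests.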


Re: Dealing with acronyms

2006-04-26 Thread Hannes Carl Meyer

Rajesh Munavalli schrieb:

On 4/26/06, Hannes Carl Meyer <[EMAIL PROTECTED]> wrote:
  

Hi All,

I would like to enable users to do an acronym search on my index.
My idea is the following:

1.) Extract acronyms (ABS, ESP, VCG etc.) from the given document (which
is going to be indexed)




In case you haven't already seen it, you might find this useful.
http://www.cs.waikato.ac.nz/~nzdl/publications/1999/Yeates-Auto-Extract.pdf


2.) Store the extracted acronyms in a field, for example called "case"
  

3.) On search, asking the user to use case:"ABS" to search for acronyms




I would rather store them in the same field with others, so that you can do
phrase queries. Store the acronyms just like you would store synonyms. More
information on how to store synonyms is in "Lucene in Action" book. This
would facilitate queries like "USA President". If you store "USA" in a
separate field, you wouldn't be able to match this query.

Any experience with this kind of pattern? Other ideas or best practices?

I would also look at HMMs/CRFs to extract acronyms. You need to come up with
a list of features to identify a potential acronym. For ex:
- All Caps
- The acronym appears repeatedly in the rest of the text
- Found in the acronym dictionary...etc

Hope this helps,

--Rajesh Munavalli
Blog: http://munavalli.blogspot.com

  

Hi,

thank you, that's good advice. I don't have the Lucene in Action book, 
but I think it's worth taking a look at.


So I guess it's done by writing or extending an analyzer?

H.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Dealing with acronyms

2006-04-26 Thread Rajesh Munavalli
>
>
> So I guess it's done by writing or extending an analyzer?
>
Yes, that's correct.

--Rajesh Munavalli
Blog: http://munavalli.blogspot.com


Re: performance differences between 1.4.3 and 1.9.1

2006-04-26 Thread Daniel Naber
On Mittwoch 26 April 2006 01:22, RONALD MANTAY wrote:

>   However when searching muliple indexes with multiSearcher and with a
> FuzzyQuery with a prefixLength of 1. The search against 3.7m documents
> spread over 23 indexes (due to the natural grouping of the data) the
> time changed from 800ms to 4500 ms.

MultiSearcher in Lucene 1.4 had a broken ranking implementation. This has 
been fixed in Lucene 1.9, but the fix might have bad effects on performance. 
23 indexes is quite a lot; you may be able to speed things up greatly by 
using a smaller number of indexes.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to serach in sentence and dispaly the whole sentence

2006-04-26 Thread anton feldmann
Are field names in a document unique, or can I add a field with the 
name "sentence" for each sentence in a text document?


Grant Ingersoll schrieb:

Anton,

I think there are at least a couple of ways of doing this.  I assume 
you have a program that does sentence detection already, as Lucene 
does not provide this.  If not, I am sure a search of the web will 
find one that has high accuracy.

You can:
1. Index each sentence as a separate Document.  You will need a field 
on the Document relating it back to the overall file so you can 
reconstruct it.
2. As you index, insert sentence/paragraph boundary markers into your 
index and then use the SpanQuery functionality.  Search this mail 
archive for sentence boundary detection and Span Query (try the dev 
list too).  I think there was a discussion between me, Doug and Hoss 
on how to do this.
3. Do search as you do now and then post process to figure out what 
sentence it came from.  This will be inefficient, but I don't know 
what your requirements are that way, so it may work for you.


There are probably other ways too.
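Since Lucene does not ship a sentence detector, a rough one is available in the JDK itself via java.text.BreakIterator; each piece could then become its own Document for option 1. A minimal sketch (the field names "sentence"/"file"/"seq" in the comment are illustrative, not Lucene conventions):

```java
import java.text.BreakIterator;
import java.util.*;

public class SentenceSplitter {
    // Split text into sentences using the JDK's locale-aware boundary rules.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next())
            out.add(text.substring(start, end).trim());
        return out;
    }

    public static void main(String[] args) {
        String doc = "Lucene is a search library. It does not split sentences. "
                + "This sketch does.";
        List<String> s = sentences(doc, Locale.ENGLISH);
        // Each sentence would become its own Lucene Document, e.g. with
        // fields "sentence" (the text), "file" and "seq" (to reconstruct).
        for (int i = 0; i < s.size(); i++)
            System.out.println(i + ": " + s.get(i));
    }
}
```

BreakIterator is only a heuristic (it stumbles on abbreviations), so a dedicated sentence detector is still worth looking for if accuracy matters.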

anton feldmann wrote:

I intend to make a search that finds a word or a word pair in a sentence 
or a paragraph, and then displays the sentence as a whole. To make this 
possible I probably need to extend Lucene, but where do I start? I have 
no idea how I would have to change the index files, or whether that would 
conform with what the Lucene team intends.


cheers

anton feldmann


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: performance differences between 1.4.3 and 1.9.1

2006-04-26 Thread Andy Goodell
For my application we have several hundred indexes, different subsets
of which are searched depending on the situation.  Aside from not
upgrading to Lucene 1.9, or making a big index for every possible
subset, do you have any ideas for how we can maintain fast
performance?

- andy g

On 4/26/06, Daniel Naber <[EMAIL PROTECTED]> wrote:
> MultiSearcher in Lucene 1.4 had a broken ranking implementation. This has
> been fixed in Lucene 1.9, but the fix might have bad effects on performance.
> 23 indexes is quite a lot; you may be able to speed things up greatly by
> using a smaller number of indexes.
>
> Regards
>  Daniel
>
> --
> http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: performance differences between 1.4.3 and 1.9.1

2006-04-26 Thread karl wettin


27 apr 2006 kl. 02.18 skrev Andy Goodell:


For my application we have several hundred indexes, different subsets
of which are searched depending on the situation.  Aside from not
upgrading to Lucene 1.9, or making a big index for every possible
subset, do you have any ideas for how we can maintain fast
performance?


You probably need to explain the reason for splitting them up in
order to get a good answer to that. And how big are they?


Without knowing anything about your application I would say: merge them
all into one index and add a field that you constrain with a boolean clause.
But with a few hundred indices it sounds like you have a design
that doesn't work with the above.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: DateTools question

2006-04-26 Thread Bill Snyder
Makes sense. Thanks for the response!

--Bill

On 4/26/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
>
> : Why does DateTools.dateToString() return a String representation of my
> Date,
> : but in a different TimeZone. Does it use its own Calendar/TimeZone
> settings?
>
> Yes, DateTools is hardcoded to use GMT for its string representations.
>
> It wouldn't be safe for DateTools to use your current TimeZone/Locale,
> because once you've indexed the value, your index might be used by another
> application (or another instance of your application) running in a
> different locale.
>
> The important thing is not what string DateTools.dateToString returns,
> it's whether you get an equivalent date back (based on the resolution you
> specified) when you do something like this...
>
>   Date a = ...;
>   DateTools.Resolution r = ...;
>   Date b = DateTools.stringToDate(DateTools.dateToString(a,r));
>   System.out.println("Is '"+a+"' the same as '"+b+"' with "+r+"
> resolution?");
>
>
> -Hoss
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

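The behavior Hoss describes can be reproduced with plain java.text.SimpleDateFormat: pin the formatter to GMT (as DateTools does) and only compare dates after a round trip at the chosen resolution. The "yyyyMMdd" pattern below is an assumption for illustration; DateTools has its own per-resolution patterns.

```java
import java.text.SimpleDateFormat;
import java.util.*;

public class GmtRoundTrip {
    public static void main(String[] args) throws Exception {
        // Day resolution, pinned to GMT so the stored string is
        // independent of the indexing machine's locale and time zone.
        SimpleDateFormat f = new SimpleDateFormat("yyyyMMdd");
        f.setTimeZone(TimeZone.getTimeZone("GMT"));

        Date a = new Date();
        String s = f.format(a);  // what would be stored in the index
        Date b = f.parse(s);     // truncated to day resolution

        // The strings agree even though b lost the time-of-day part.
        System.out.println(f.format(a).equals(f.format(b)));
    }
}
```

The printed string may differ from your local date near a time-zone boundary, but the round trip at the stated resolution is still lossless, which is the property that matters for range queries.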

Re: Lucene search benchmark/stress test tool

2006-04-26 Thread Sunil Kumar PK
Hi,

I have added some code to the Lucene 1.9 source to benchmark the
performance of the remote ParallelMultiSearcher.

I have recorded the time to execute the 'searchables[i].docFreq(term)' 
method (in MultiSearcher.java) on both client and server, and likewise for 
the 'searchable.search' method (in ParallelMultiSearcher.java). I have also 
recorded the total time taken to get the Hits object.

I have tested different complex boolean queries and taken the average time 
for each query. While doing this I got stuck on some doubts, listed below.

What I have understood of the remote ParallelMultiSearcher search procedure 
is that it first computes the weight for the query on each index 
sequentially (one by one, e.g. it calculates the query weight on index 1 
first and then on index 2), and then searches each index and merges 
the results.

I want to know whether there is any way to merge the weight calculation 
for an index and the search of that index into a single RPC, instead of 
doing the two in separate steps.

Another question: in RemoteParallelMultiSearcher the method 
docFreq(Term term) is not parallelized. Why is that?


Regards

Sunil


On 4/26/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I'm about to write a little command-line Lucene search benchmark
> tool.  I'm interested in benchmarking search performance and the ability to
> specify concurrency level (# of parallel search threads) and response
> timing, so I can calculate min, max, average, and mean times.  Something
> like 'ab' (Apache Benchmark) tool, but for Lucene.
>
> Has anyone already written something like this?
>
> Thanks,
> Otis
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
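A minimal 'ab'-style harness along the lines Otis describes: a fixed pool of threads fires queries concurrently, latencies are collected, and min/max/mean are reported. The search() method here is a stub (the query string and sleep time are made up); a real version would call IndexSearcher.search instead.

```java
import java.util.*;
import java.util.concurrent.*;

public class SearchBench {
    // Stand-in for an actual Lucene search call.
    static void search(String query) throws InterruptedException {
        Thread.sleep(5); // pretend the search takes ~5 ms
    }

    public static void main(String[] args) throws Exception {
        int threads = 4, requestsPerThread = 10;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Queue<Long> latencies = new ConcurrentLinkedQueue<>();

        // Each worker runs its share of requests and records per-call latency.
        for (int t = 0; t < threads; t++)
            pool.submit(() -> {
                for (int i = 0; i < requestsPerThread; i++) {
                    long start = System.nanoTime();
                    try { search("foo AND bar"); }
                    catch (InterruptedException e) { return; }
                    latencies.add((System.nanoTime() - start) / 1_000_000);
                }
            });
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);

        LongSummaryStatistics stats =
                latencies.stream().mapToLong(Long::longValue).summaryStatistics();
        System.out.printf("n=%d min=%dms max=%dms mean=%.1fms%n",
                stats.getCount(), stats.getMin(), stats.getMax(), stats.getAverage());
    }
}
```

Percentiles (and the median) would need the raw latency list sorted rather than LongSummaryStatistics, but the shape of the harness stays the same.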