Re: simple (?) question about scoring

2006-11-03 Thread Paul Elschot
Michele,

On Friday 03 November 2006 07:07, Michele Amoretti wrote:
> I have a question: is the score for a document different if I have
> only that document in my index, or if I have N documents?
> If the answer is yes, I will put all N documents together, otherwise I
> will evaluate them one by one.
> 
> Btw, I will ask the ws developer about how queries are interpreted by
> the search engine.

To compute the score for only a subset of the lucene documents
one normally uses a Filter. Assuming you get the primary keys of
the docs to be scored, you can look them up in the lucene index
and use their internal lucene document numbers to create the Filter.
Then search your query with this filter.
Have a look at the source code of RangeFilter.bits() to see how to
get to the internal document numbers from a set of terms.
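The shape of that idea, as a plain-JDK sketch (a HashMap stands in for the term lookup that RangeFilter.bits() does against the index; the class and key names here are invented for illustration):

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyFilterSketch {

    // Stand-in for looking a primary-key term up in the index and
    // getting back its internal Lucene document number.
    static final Map<String, Integer> KEY_TO_DOCNUM = new HashMap<String, Integer>();
    static {
        KEY_TO_DOCNUM.put("pk-17", 0);
        KEY_TO_DOCNUM.put("pk-42", 1);
        KEY_TO_DOCNUM.put("pk-99", 2);
    }

    // Build the bit set a Filter would hand to the searcher: one bit
    // per internal document number that is allowed to match.
    static BitSet bits(List<String> primaryKeys) {
        BitSet bits = new BitSet();
        for (String key : primaryKeys) {
            Integer docNum = KEY_TO_DOCNUM.get(key);   // index lookup
            if (docNum != null) {
                bits.set(docNum);                      // allow this doc
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet allowed = bits(Arrays.asList("pk-42", "pk-99", "pk-unknown"));
        System.out.println(allowed);   // only docs 1 and 2 may be scored
    }
}
```

The real version would do the lookup inside Filter.bits(IndexReader) with a TermDocs enumeration, exactly as RangeFilter does.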

Btw, when your database uses the same query to obtain this set of
documents, you might consider moving this function into Lucene
completely, because that would allow you to avoid using a filter altogether.

Regards,
Paul Elschot


> 
> Thanks
> 
> On 11/3/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> > : the list is not ordered (I do not know the details of the search
> > : engine, I only have its result for a query)
> > :
> > : then I have this list of documents, which represents a subset of the corpus
> > :
> > : I have to rank the documents of the list, using your scoring algorithm
> >
> > In other words, out of a large corpus C, this webservice has
> > told you that the documents comprising subset S are the top N matching
> > documents for your query Q (where N << sizeof(C))
> >
> > your goal is to sort S as best as possible.
> >
> > You could try indexing all the docs in S in a Lucene RAMDirectory and then
> > search on them, but my original point about the score being
> > fairly meaningless in an index of only 1 document still applies somewhat
> > ... if all of the documents you get back already have a lot in common
> > (they must have something in common or the webservice wouldn't have
> > returned them in response to your query) it may be hard to get a
> > meaningful document frequency for the words in your query.
> >
> > you also may run into confusion about what exactly your query "is" and
> > whether or not your interpretation matches that of the webservice ... at a
> > very simplistic level, if the query is "Java Lucene" and your webservice
> > interprets that as an "OR" query and you interpret that as an "AND" query,
> > you might find that the scores you compute for all the docs are 0.
> >
> > If i were in your shoes, i'd try to work with whoever runs this webservice
> > to make it return more useful information -- at the very least to return
> > results in sorted order.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



for admins: mailing list like spam

2006-11-03 Thread Michele Amoretti

Hi,

why not put an automatic [LUCENE USER] tag at the beginning of
e-mail subjects?

It will make the mailing list easier to read (I am using Gmail and I do
not have client-side filters).

--
Michele Amoretti, Ph.D.
Distributed Systems Group
Dipartimento di Ingegneria dell'Informazione
Università degli Studi di Parma
http://www.ce.unipr.it/people/amoretti




Re: search within search

2006-11-03 Thread spinergywmy

Hi,

   Doron, good call, thanks.

   I have another problem: I do not actually perform a real search within
search. The way I have coded it, the second search goes back to the index
directory and searches the entire index again, rather than caching the
first search's result.

   How can I solve this problem? Do I need to use a QueryFilter and
restructure the code (which is time consuming), or is there a way to get it
done without restructuring? Or should I use a BitSet within my existing
code?
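That is essentially what QueryFilter does for you: it remembers which documents the first query matched, so the second query only scores documents inside that set. The shape of the idea, as a plain-JDK sketch (java.util.BitSet stands in for Lucene's internal bit set; the class and method names are invented for illustration):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class CachedFilterSketch {

    // First-pass results cached by query string, so a second search
    // never has to rescan the whole index for them.
    private final Map<String, BitSet> cache = new HashMap<String, BitSet>();

    // Remember which documents the first query matched.
    void cacheFirstSearch(String query, BitSet matching) {
        cache.put(query, (BitSet) matching.clone());
    }

    // "Search within search": intersect the new query's matches with
    // the cached bits instead of searching the entire index again.
    BitSet searchWithin(String firstQuery, BitSet secondMatches) {
        BitSet result = (BitSet) cache.get(firstQuery).clone();
        result.and(secondMatches);
        return result;
    }
}
```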

   Thanks.


regards,
Wooi Meng
-- 
View this message in context: 
http://www.nabble.com/search-within-search-tf2558237.html#a7153721
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Announcement: Lucene powering Monster job search index (Beta)

2006-11-03 Thread Paramasivam Srinivasan
Hi Peter

When I use a custom HitCollector, it affects the application performance.
Also, how do you accomplish grouping the results without affecting
performance? And if possible, please give a code snippet for a custom
HitCollector.

TIA

Sri

"Peter Keegan" <[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]
> Joe,
>
> Fields with numeric values are stored in a separate file as binary values in
> an internal format. Lucene is unaware of this file and unaware of the range
> expression in the query. The range expression is parsed outside of Lucene
> and used in a custom HitCollector to filter out documents that aren't in the
> requested range(s). A goal was to do this without having to modify Lucene.
> Our scheme is pretty efficient, but not very general purpose in its current
> form, though.
>
> Peter
>
>
> On 10/30/06, Joe Shaw <[EMAIL PROTECTED]> wrote:
>>
>> Hi Peter,
>>
>> On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote:
>> > Numeric range search is one of Lucene's weak points (performance-wise) so we
>> > have implemented this with a custom HitCollector and an extension to the
>> > Lucene index files that stores the numeric field values for all documents.
>> >
>> > It is important to point out that this has all been implemented with the
>> > stock Lucene 2.0 library. No code changes were made to the Lucene core.
>>
>> Can you give some technical details on the extension to the Lucene index
>> files?  How did you do it without making any changes to the Lucene core?
>>
>> Thanks,
>> Joe



Re: simple (?) question about scoring

2006-11-03 Thread Michele Amoretti

Ok, sorry I did not read it in depth.

Now, where can I find an example of:

- building the RAMDirectory
- scoring all documents against the query?

thanks

On 11/3/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:


: I have a question: is the score for a document different if I have
: only that document in my index, or if I have N documents?
: If the answer is yes, I will put all N documents together, otherwise I
: will evaluate them one by one.

as i said before, yes it does...

>> For most of the various types of Queries that exist in Lucene, the
>> score is very dependent on how common the Terms involved are in the
>> Corpus as a whole -- if your Corpus consists of only 1 Document, then
>> your scores are going to be relatively meaningless.

...you will see a big difference between an index containing 1 doc, and an
index containing 10 docs which all match your query, and an index
containing 10 docs.

I believe Doron already suggested you take a look at the document
explaining how Lucene's scoring works, correct? ...

   http://lucene.apache.org/java/docs/scoring.html


-Hoss
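The corpus dependence described above comes largely from the idf factor. A toy computation, assuming the classic Lucene-style formula idf = ln(numDocs / (docFreq + 1)) + 1 (a sketch of the effect, not the exact library code):

```java
public class IdfSketch {

    // Classic Lucene-style inverse document frequency.
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Same term, same document -- very different term weight
        // depending on corpus size and how common the term is in it.
        System.out.println(idf(1, 1));    // 1-doc index:      ~0.31
        System.out.println(idf(10, 1));   // rare in 10 docs:  ~2.61
        System.out.println(idf(10, 10));  // common in 10:     ~0.90
    }
}
```

This is why a 1-document index produces relatively meaningless scores: there is no corpus left to discriminate against.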







--
Michele Amoretti, Ph.D.
Distributed Systems Group
Dipartimento di Ingegneria dell'Informazione
Università degli Studi di Parma
http://www.ce.unipr.it/people/amoretti




Re: simple (?) question about scoring

2006-11-03 Thread Michele Amoretti

http://javatechniques.com/public/java/docs/basics/lucene-memory-search.html

is this good? it seems to be good..

On 11/3/06, Michele Amoretti <[EMAIL PROTECTED]> wrote:

Ok, sorry I did not read it in depth.

Now, where can I find an example of:

- building the RAMDirectory
- scoring all documents against the query?

thanks

On 11/3/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : I have a question: is the score for a document different if I have
> : only that document in my index, or if I have N documents?
> : If the answer is yes, I will put all N documents together, otherwise I
> : will evaluate them one by one.
>
> as i said before, yes it does...
>
> >> For most of the various types of Queries that exist in Lucene, the
> >> score is very dependent on how common the Terms involved are in the
> >> Corpus as a whole -- if your Corpus consists of only 1 Document, then
> >> your scores are going to be relatively meaningless.
>
> ...you will see a big difference between an index containing 1 doc, and an
> index containing 10 docs which all match your query, and an index
> containing 10 docs.
>
> I believe Doron already suggested you take a look at the document
> explaining how Lucene's scoring works, correct? ...
>
>http://lucene.apache.org/java/docs/scoring.html
>
>
> -Hoss
>


--
Michele Amoretti, Ph.D.
Distributed Systems Group
Dipartimento di Ingegneria dell'Informazione
Università degli Studi di Parma
http://www.ce.unipr.it/people/amoretti




--
Michele Amoretti, Ph.D.
Distributed Systems Group
Dipartimento di Ingegneria dell'Informazione
Università degli Studi di Parma
http://www.ce.unipr.it/people/amoretti




Re: for admins: mailing list like spam

2006-11-03 Thread Erik Hatcher


On Nov 3, 2006, at 3:20 AM, Michele Amoretti wrote:

why not put an automatic [LUCENE USER] tag at the beginning of
e-mail subjects?


Because the To and Reply-To headers indicate the list.  All Apache e-mail
lists operate the same way, and we are not going to change this behavior.


Erik





Re: simple (?) question about scoring

2006-11-03 Thread Michele Amoretti

Yes! I modified the example to be compliant with the 2.1 API, and I added
the hits.score() call for each returned result.

It works!

[java] Hits for "freedom" were found in quotes by:
[java]   1. Mohandas Gandhi with score = 0.53033006
[java]   2. Ayn Rand with score = 0.25
[java]   3. Friedrich Hayek with score = 0.1875

[java] Hits for "free" were found in quotes by:
[java]   1. Ayn Rand with score = 0.5986179

[java] Hits for "progress or achievements" were found in quotes by:
[java]   1. Theodore Roosevelt with score = 0.14965448
[java]   2. Friedrich Hayek with score = 0.11224086


I will start from this, for my purposes.

Thank you for all the hints.

Michele



On 11/3/06, Michele Amoretti <[EMAIL PROTECTED]> wrote:

http://javatechniques.com/public/java/docs/basics/lucene-memory-search.html

is this good? it seems to be good..

On 11/3/06, Michele Amoretti <[EMAIL PROTECTED]> wrote:
> Ok, sorry I did not read it in depth.
>
> Now, where can I find an example of:
>
> - building the RAMDirectory
> - scoring all documents against the query?
>
> thanks
>
> On 11/3/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> > : I have a question: is the score for a document different if I have
> > : only that document in my index, or if I have N documents?
> > : If the answer is yes, I will put all N documents together, otherwise I
> > : will evaluate them one by one.
> >
> > as i said before, yes it does...
> >
> > >> For most of the various types of Queries that exist in Lucene, the
> > >> score is very dependent on how common the Terms involved are in the
> > >> Corpus as a whole -- if your Corpus consists of only 1 Document, then
> > >> your scores are going to be relatively meaningless.
> >
> > ...you will see a big difference between an index containing 1 doc, and an
> > index containing 10 docs which all match your query, and an index
> > containing 10 docs.
> >
> > I believe Doron already suggested you take a look at the document
> > explaining how Lucene's scoring works, correct? ...
> >
> >http://lucene.apache.org/java/docs/scoring.html
> >
> >
> > -Hoss
> >
>
>
> --
> Michele Amoretti, Ph.D.
> Distributed Systems Group
> Dipartimento di Ingegneria dell'Informazione
> Università degli Studi di Parma
> http://www.ce.unipr.it/people/amoretti
>


--
Michele Amoretti, Ph.D.
Distributed Systems Group
Dipartimento di Ingegneria dell'Informazione
Università degli Studi di Parma
http://www.ce.unipr.it/people/amoretti




--
Michele Amoretti, Ph.D.
Distributed Systems Group
Dipartimento di Ingegneria dell'Informazione
Università degli Studi di Parma
http://www.ce.unipr.it/people/amoretti




Suspected problem in the QueryParser

2006-11-03 Thread Lucifer Hammer

Hi,

I recently stumbled across what I think might be a bug in the QueryParser.
Before I enter it as a bug, I wanted to run it by this group to see if I'm
just not looking at the boolean expression correctly.

Here's the issue:

I created an index with 5 documents, all have one field: "text", with the
following contents:
doc1:text:"Table Chair Spoon"
doc2:text:"Table Chair Spoon Fork"
doc3:text:"Table Spoon Fork"
doc4:text:"Chair Spoon Fork"
doc5:text:"Spoon Fork"

When I enter the query: "Table AND NOT Chair"  I get one hit, doc3
When I enter the query: "Table AND (NOT Chair)" I get 0 hits.

I had thought that both queries would return the same results.  Is this a
bug, or, am I not understanding the query language correctly?
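For what it's worth, this matches a long-standing Lucene behavior: a BooleanQuery that contains only prohibited clauses matches nothing, so the parenthesized "(NOT Chair)" evaluates to the empty set before the AND is applied. A toy set model of the two parses (plain Java, invented for illustration only):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PureNegativeSketch {

    static Set<Integer> docs(Integer... ids) {
        return new HashSet<Integer>(Arrays.asList(ids));
    }

    // "a AND NOT b": the NOT is a prohibited clause inside the same
    // enclosing BooleanQuery, so it subtracts from a's matches.
    static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new HashSet<Integer>(a);
        r.removeAll(b);
        return r;
    }

    // "a AND sub": intersect with whatever the sub-query matched.
    static Set<Integer> and(Set<Integer> a, Set<Integer> sub) {
        Set<Integer> r = new HashSet<Integer>(a);
        r.retainAll(sub);
        return r;
    }

    public static void main(String[] args) {
        Set<Integer> table = docs(1, 2, 3); // docs containing Table
        Set<Integer> chair = docs(1, 2, 4); // docs containing Chair

        // "Table AND NOT Chair" -> Table minus Chair
        System.out.println(andNot(table, chair));     // [3]: one hit, as observed

        // "(NOT Chair)" becomes its own BooleanQuery with only a
        // prohibited clause, which matches no documents at all...
        Set<Integer> pureNegative = docs();
        // ...so "Table AND (NOT Chair)" intersects with the empty set.
        System.out.println(and(table, pureNegative)); // []: 0 hits, as observed
    }
}
```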

I'm attaching test code.  The program creates an index in the directory
which you pass into the main program.

Thanks!
L
--

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import java.io.File;
import java.io.IOException;

public class IndexTest {

    public static void create(File indexDir) throws IOException {
        IndexWriter writer = new IndexWriter(indexDir, new WhitespaceAnalyzer(), true);
        String[] texts = {
            "Table Chair Spoon",
            "Table Chair Spoon Fork",
            "Table Spoon Fork!",
            "Chair Spoon Fork",
            "Spoon Fork"
        };
        for (String text : texts) {
            Document doc = new Document();
            doc.add(new Field("text", text,
                              Field.Store.YES,
                              Field.Index.TOKENIZED,
                              Field.TermVector.NO));
            writer.addDocument(doc);
        }
        writer.close();
    }

    public static void query(File indexDir, String queryString) throws IOException {
        Query query = null;
        try {
            QueryParser qp = new QueryParser("text", new WhitespaceAnalyzer());
            qp.setDefaultOperator(QueryParser.OR_OPERATOR);
            query = qp.parse(queryString);
        } catch (Exception qe) {
            System.out.println(qe.toString());
        }
        if (query == null) return;
        System.out.println("Query: " + query.toString());

        IndexReader reader = IndexReader.open(indexDir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Hits hits = searcher.search(query);
        System.out.println("Hits: " + hits.length());
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("text") + " ");
        }
        searcher.close();
        reader.close();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            throw new Exception("Usage: " + IndexTest.class.getName() + " <index-dir>");
        }
        File indexDir = new File(args[0]);
        create(indexDir);
        query(indexDir, "Table AND NOT Chair");
        query(indexDir, "Table AND (NOT Chair)");
    }
}


Re: Modelling relational data in Lucene Index?

2006-11-03 Thread Erick Erickson

One thing it took me a while to grasp, and that is not automatic for folks with
significant database backgrounds, is that the fields in a Lucene document are
only related to those of any other document by the meaning you, as a
programmer, understand. That is, document 1 may have fields a, b, c.
Document 2 may have fields b, e, g. There is no requirement that, in this
example, document 1 have fields e and g, and vice versa. In
other words, Lucene documents don't fit into a table model.

The reason I mention that is that I'm extremely leery of packing data that
really doesn't belong together into a single field. Plus, your searching
becomes more complicated.

In your example above, what happens if the file name and image are similar
enough to produce false hits? Whereas if you stored them as separate fields
in a document, you don't have this kind of problem.
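A concrete illustration of that false-hit risk, as a plain-Java sketch (the helper names are invented; substring matching stands in for what a query over the packed value would do):

```java
public class PackedFieldSketch {

    // One packed value loses the boundary between file name and type.
    static boolean packedMatch(String name, String type, String term) {
        return (name + type).contains(term);
    }

    // Separate fields keep the semantics: you search only the field you mean.
    static boolean typeFieldMatch(String type, String term) {
        return type.equals(term);
    }

    public static void main(String[] args) {
        // Searching for assets of type "image":
        // a text file whose *name* merely contains "image"...
        System.out.println(packedMatch("imagenotes.txt", "text", "image")); // true: false hit
        System.out.println(typeFieldMatch("text", "image"));                // false: correct

        // ...vs a real image file.
        System.out.println(packedMatch("photo.jpg", "image", "image"));     // true
        System.out.println(typeFieldMatch("image", "image"));               // true
    }
}
```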

So, if you can cleverly de-normalize your data in such a way as to satisfy
all the searches you'll ever want to perform, you can store it all in a
Lucene index and be happy. If you can't, you could use Lucene to search the
parts you *do* care about and store the rest in a database. Or, you could
just use a database. I believe it all hinges on whether you have a fixed set
of queries you can anticipate (and thus reflect in a Lucene index) or not.

Best
Erick

On 11/2/06, Rajesh parab <[EMAIL PROTECTED]> wrote:


Thanks for the feedback, Chris.

I agree with you. The data set should be flattened out to store inside
Lucene index. The Folder-File was just an example. As you know, in
relational database, we can have more complex relationships. I understand
that this model may not work for deeper relationships.

What I am mainly interested in is just one level deep relationship. But, I
would like to search on the additional attributes of the related object. For
example, in the relationship for Folder-File, I would like to use additional
file attributes as search criteria along with file name while searching for
folders.

The way I see it is having a single field for the related object and all its
additional attributes, using some separator while capturing this data
inside the Lucene Field object. For example -

new Field("file", "abc.txtimage");

But, I am not quite sure if this model will work.

BTW. I did not understand what you meant by the detached approach. Can you
please elaborate?

Regards,
Rajesh

- Original Message 
From: Chris Lu <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, November 2, 2006 7:57:46 PM
Subject: Re: Modelling relational data in Lucene Index?


For this specific question, you can create an index on files, search for
files that are of type image, and from the matched files find the unique
directories (this can be done in Lucene, or you can do it via Java).

Of course this does not scale to deeper relationships. Usually you do
need to flatten the database objects in order to use Lucene. It's
just trading space for speed.

I would prefer a detached approach instead of Hibernate's or EJB's
approach, which is kind of too tightly coupled with the system. How do you
rebuild if the index is corrupted, or you have a new Analyzer, or the
schema evolves? How do you make it multi-thread safe?

--
Chris Lu
-
Instant Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com

On 11/2/06, Mark Miller <[EMAIL PROTECTED]> wrote:
> Lucene is probably not the solution if you are looking for a relational
> model. You should be using a database for that. If you want to combine
> Lucene with a relational model, check out Hibernate and the new EJB
> annotations that it supports...there is a cool little Lucene add-on that
> lets you declare fields to be indexed (and how) with annotations.
>
> - Mark
>
> Rajesh parab wrote:
> > Hi,
> >
> > As I understand, Lucene has a flat structure where you can define
multiple fields inside the document. There is no relationship between any
field.
> >
> > I would like to enable index-based search for some of the components
inside a relational database. For example, let's say a "Folder" object. The Folder
object can have a relationship with a File object. The File object, in turn, can
have attributes like is-image, is-text-file, etc. So, the structure is
> >
> > Folder -- > File
> >  |
> >  --- > is image, is text file, ..
> >
> >
> > I would like to enable a search to find a Folder with File of type
image. How can we model such relational data inside Lucene index?
> >
> > Regards,
> > Rajesh
> >
> >
> >

Re: for admins: mailing list like spam

2006-11-03 Thread Patrick Turcotte


It will make the mailing list easier to read (I am using Gmail and I do
not have client-side filters).



That is not true.

You can have labels, and, if you look at the top of the page, right beside
the  "Search the Web" button, you have a "create filter" link.

Patrick


Re: experiences with lingpipe

2006-11-03 Thread Breck Baldwin



Martin Braun wrote:

Hi Breck,

I have tried your tutorial and built (hopefully) a successful
SpellCheck.model file of 49M.
My Lucene index directory is 2.4G. When I try to read the model with the
readmodel function,
I get an "Exception in thread "main" java.lang.OutOfMemoryError: Java
heap space", though I started Java with -Xms1024m -Xmx1024m.

How much RAM will I need for the model? (I only have 2 GB of physical
RAM, and Lucene is also using some memory.)


You need to increase the memory for Java. I think 32-bit Java is limited
to a 1.3 gig heap but could be wrong. No heuristics at the tip of my
fingers.


To make the spell checker smaller you can prune the tokens using the
pruneLM method in the TrainSpellChecker. Pruning the 1 counts should
make a big difference and not hurt spelling too much (depends on how
things are parameterized). Probably up to 5 counts won't matter.


Also look at my tuning tutorial that is in very rough shape but will
get you going on tuning at:

cvs -d:pserver:[EMAIL PROTECTED]:/usr/local/sandbox co 
querySpellCheckTuner


I will try to get another pass at it over the weekend.

b reck




Is there a "rule of thumb" to calculate the needed amount of memory of
the model?

thanks in advance,

martin




Tuning params dominate the performance space. A small beam (16 active
hypotheses) will be quite snappy (I have 200 queries/sec with a 32 beam
over an 80 gig text collection that, with some pruning, was 5 gig in memory
running an 8-gram model)








RE: experiences with lingpipe

2006-11-03 Thread Vladimir Olenin
> You need to increase the memory for Java. I think 32-bit Java is
> limited to a 1.3 gig heap but could be wrong. No heuristics at the tip
> of my fingers.

32-bit JVM under Linux/Windows. Solaris runs OK. Limit on the heap is
~1.7 - 1.8Gb.

-Original Message-
From: Breck Baldwin [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 03, 2006 9:59 AM
To: java-user@lucene.apache.org
Subject: Re: experiences with lingpipe



Martin Braun wrote:
> Hi Breck,
> 
> I have tried your tutorial and built (hopefully) a successful
> SpellCheck.model file of 49M.
> My Lucene index directory is 2.4G. When I try to read the model with
> the readmodel function, I get an "Exception in thread "main"
> java.lang.OutOfMemoryError: Java heap space", though I started Java
> with -Xms1024m -Xmx1024m.
> 
> How much RAM will I need for the model? (I only have 2 GB of physical
> RAM, and Lucene is also using some memory.)

You need to increase the memory for Java. I think 32-bit Java is limited
to a 1.3 gig heap but could be wrong. No heuristics at the tip of my
fingers.

To make the spell checker smaller you can prune the tokens using the
pruneLM method in the TrainSpellChecker. Pruning the 1 counts should
make a big difference and not hurt spelling too much (depends on how
things are parameterized). Probably up to 5 counts won't matter.

Also look at my tuning tutorial that is in very rough shape but will get
you going on tuning at:

cvs -d:pserver:[EMAIL PROTECTED]:/usr/local/sandbox co
querySpellCheckTuner

I will try to get another pass at it over the weekend.

b reck


> 
> Is there a "rule of thumb" to calculate the needed amount of memory of

> the model?
> 
> thanks in advance,
> 
> martin
> 
> 
> 
Tuning params dominate the performance space. A small beam (16 active
hypotheses) will be quite snappy (I have 200 queries/sec with a 32 beam
over an 80 gig text collection that, with some pruning, was 5 gig in
memory running an 8-gram model)

> 
> 
> 



RE: Any experience with spring's lucene support?

2006-11-03 Thread Vladimir Olenin
Haven't used them, but had a look at them some time ago. Seems like a
nice set of helper factory classes to manage Lucene engine through
Spring IoC. Can't do much wrong in here I guess... If you'd be using
Spring in your app, you'd have to come up with similar factories either
way, so probably it'd make sense to reuse the ones in springmodules. The
only 'non-factory' classes I noticed are the 'DB indexing' ones. The only
problem (from my estimation) is that the DB access layer is fixed to
Spring SQL classes (i.e., you probably wouldn't be able to use iBatis or
Hibernate easily).

As to Compass, these guys probably have similar Spring classes as well
as some other stuff. One person on the list used it (Compass) in a
production environment and says he's quite happy with it.

But generally, it's probably worthwhile to go to the SpringModules and
Compass forums respectively for more info...

Vlad

-Original Message-
From: lude [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 03, 2006 1:36 AM
To: java-user
Subject: Re: Any experience with spring's lucene support?

Nobody here who is using spring-modules?

On 11/1/06, lude <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> while starting a new project we are thinking about using the 
> spring-modules for working with lucene. See:
> https://springmodules.dev.java.net/
>
> Does anybody have experience with this higher-level Lucene API?
> How does it compare to Compass?
> (Dis-)Advantages of using spring-modules lucene support?
>
> Thanks
> lude
>




Re: Announcement: Lucene powering Monster job search index (Beta)

2006-11-03 Thread Daniel Rosher

Hi Peter,

Does this mean you are calculating the euclidean distance twice ... once for
the HitCollector to filter 'out of range' documents, and then again for the
custom Comparator to sort the returned documents? Especially since the
filtering is done outside Lucene?

Regards,
Dan



Joe,

Fields with numeric values are stored in a separate file as binary values in
an internal format. Lucene is unaware of this file and unaware of the range
expression in the query. The range expression is parsed outside of Lucene
and used in a custom HitCollector to filter out documents that aren't in the
requested range(s). A goal was to do this without having to modify Lucene.
Our scheme is pretty efficient, but not very general purpose in its current
form, though.

Peter


On 10/30/06, Joe Shaw <[EMAIL PROTECTED]> wrote:


Hi Peter,

On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote:
> Numeric range search is one of Lucene's weak points (performance-wise) so we
> have implemented this with a custom HitCollector and an extension to the
> Lucene index files that stores the numeric field values for all documents.
>
> It is important to point out that this has all been implemented with the
> stock Lucene 2.0 library. No code changes were made to the Lucene core.

Can you give some technical details on the extension to the Lucene index
files?  How did you do it without making any changes to the Lucene core?

Thanks,
Joe






TooManyClauses with MultiTermQueries

2006-11-03 Thread Eric Louvard

Hello, I've been working with Lucene for several years.
One of my biggest problems was Lucene's inability to search with
wildcards, so I developed my own MultiTermQueries.


Now there's a standard class for this, but you'll always get an
exception if your search is too generic, 'a*' for example.
I can't solve this problem, but I make it acceptable with the following
algorithm:

- get all possible terms.
- sort them (currently by the length difference between the search term and
the candidate (if you search 'TooMany*' then 'TooManyDog' ranks better than
'TooManyClauses')).
- keep only the allowed number (I may want my BooleanQuery not to have over
100 terms, for example).

- search these.

For this Query I can then call:
.getWarnings(), which gives me a string describing the limitation
("Have found 265654 terms for your search, please be more precise.")

.getTermsList(), the list of all searched terms (useful for the user too).

So I can always get a result. Mostly, with the sorting, I am getting
the intended term (you can use another sort). I can limit maxClauseCount
to small values (avoiding out-of-memory errors and getting better
performance).


Hope this can help someone. I think it would be a nice feature to
implement in Lucene.
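A plain-Java sketch of the algorithm described above (a simple list stands in for the index's term enumeration; all class and method names are invented):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class CappedWildcardSketch {

    // Expand a prefix query, but keep only the closest maxClauses terms
    // instead of throwing a TooManyClauses-style exception.
    static List<String> expand(String prefix, List<String> allTerms, int maxClauses) {
        List<String> matches = new ArrayList<String>();
        for (String t : allTerms) {
            if (t.startsWith(prefix)) {
                matches.add(t);
            }
        }
        // Rank by length difference from the search term: shorter
        // completions are assumed closer to what the user meant.
        final int prefixLen = prefix.length();
        Collections.sort(matches, new Comparator<String>() {
            public int compare(String a, String b) {
                return (a.length() - prefixLen) - (b.length() - prefixLen);
            }
        });
        return matches.subList(0, Math.min(maxClauses, matches.size()));
    }

    // The warning string mentioned above, produced only when terms were dropped.
    static String warning(int found, int kept) {
        return found <= kept ? ""
            : "Have found " + found + " terms for your search, please be more precise.";
    }
}
```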



PS: sorry for my poor english.

--
Mit freundlichen Grüßen

i. A. Éric Louvard
HAUK & SASKO Ingenieurgesellschaft mbH
Zettachring 2
D-70567 Stuttgart

Phone: +49 7 11 7 25 89 - 19
Fax: +49 7 11 7 25 89 - 50
E-Mail: [EMAIL PROTECTED]
www: www.hauk-sasko.de








Multi valued fields

2006-11-03 Thread Seeta Somagani
Hi all,

 

Our company has a set of assets and we use meta-data (XML files) to
describe each asset. My job is to index and search over the meta-data
associated with the assets. The interesting aspect of my problem is that
an asset can have more than one meta-data file associated with it,
depending on the context that the asset lies in. The search result must
display an asset only once. If more than one meta-data associated with
it match the search query, we need to display the different meta-data
associated with the asset in order of relevance as part of one hit to be
able to show the user the various contexts that this asset occurs in. 

 

My first idea was to index each meta-data file into its own document and
merge the documents with the same asset_id on search. But there are
hundreds of thousands of meta-data files, and the search results can run
into the hundreds. 

 

My next idea was to index all the meta-data associated with an asset
into multi-valued fields. But, I cannot see a way to rank within the
multi-valued fields. 

 

Another crazy idea that crossed my mind - how about building a separate
index that indexes document ids of the documents associated with an
asset, so that I can look it up to merge the hits?

 

Any thoughts?

 

Seeta



Re: Announcement: Lucene powering Monster job search index (Beta)

2006-11-03 Thread Peter Keegan

Paramasivam,

Take a look at Solr, in particular the DocSetHitCollector class. The
collector simply sets a bit in a BitSet, or saves the docIds in an array
(for low hit counts). Solr's BitSet was optimized (by Yonik, I believe) to
be faster than Java's BitSet, so this HitCollector is very fast. This is
essentially what we are doing for counting.
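A standalone sketch of that small-array-then-BitSet idea (names are hypothetical, not Solr's actual code; Solr extends Lucene's HitCollector and uses its own optimized bitset class, while this sketch uses plain java.util.BitSet so it compiles on its own):

```java
import java.util.BitSet;

// Buffer doc ids in a small int array; fall back to a BitSet once it
// overflows. Roughly the DocSetHitCollector idea, minus the Lucene
// HitCollector base class (a real collector would also ignore the score).
public class DocSetCollector {
    private final int[] smallSet;   // cheap storage for low hit counts
    private int count = 0;
    private BitSet bits;            // allocated only on overflow
    private final int maxDoc;

    public DocSetCollector(int smallSetSize, int maxDoc) {
        this.smallSet = new int[smallSetSize];
        this.maxDoc = maxDoc;
    }

    public void collect(int doc) {
        if (bits == null && count < smallSet.length) {
            smallSet[count++] = doc;
        } else {
            if (bits == null) {     // first overflow: migrate the array
                bits = new BitSet(maxDoc);
                for (int i = 0; i < count; i++) bits.set(smallSet[i]);
            }
            bits.set(doc);
            count++;
        }
    }

    public int getCount() { return count; }

    public BitSet getDocSet() {     // normalize to a BitSet for callers
        if (bits == null) {
            bits = new BitSet(maxDoc);
            for (int i = 0; i < count; i++) bits.set(smallSet[i]);
        }
        return bits;
    }
}
```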

Peter

On 11/2/06, Paramasivam Srinivasan <[EMAIL PROTECTED]> wrote:


Hi Peter

When I use the custom HitCollector, it affects the application performance.
Also, how do you accomplish grouping the results without affecting
performance? If possible, please give a code snippet for the custom
HitCollector.

TIA

Sri

"Peter Keegan" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Joe,
>
> Fields with numeric values are stored in a separate file as binary values
> in an internal format. Lucene is unaware of this file and unaware of the
> range expression in the query. The range expression is parsed outside of
> Lucene and used in a custom HitCollector to filter out documents that
> aren't in the requested range(s). A goal was to do this without having to
> modify Lucene. Our scheme is pretty efficient, but not very general
> purpose in its current form, though.
>
> Peter
>
> On 10/30/06, Joe Shaw <[EMAIL PROTECTED]> wrote:
>>
>> Hi Peter,
>>
>> On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote:
>> > Numeric range search is one of Lucene's weak points (performance-wise)
>> > so we have implemented this with a custom HitCollector and an extension
>> > to the Lucene index files that stores the numeric field values for all
>> > documents.
>> >
>> > It is important to point out that this has all been implemented with
>> > the stock Lucene 2.0 library. No code changes were made to the Lucene
>> > core.
>>
>> Can you give some technical details on the extension to the Lucene index
>> files?  How did you do it without making any changes to the Lucene core?
>>
>> Thanks,
>> Joe
>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: search within search

2006-11-03 Thread Doron Cohen
spinergywmy <[EMAIL PROTECTED]> wrote on 03/11/2006 00:40:42:

>I have another problem: I do not perform a real search-within-search
> with the way that I have coded it, because for the second search I
> actually go back to the index directory and search the entire indices
> again, rather than caching the first search result.
>
>How can I solve this problem? Do I need to use a QueryFilter and
> reconstruct the code again (which is time consuming), or is there any
> way I can get it done without reconstructing? Or do I need to use a
> BitSet within my existing code?

This was the recommendation you got on this in the list (forgot who it
was): submit query1 ANDed with query2. True, this is searching again the
"entire" index. In particular, it is re-doing work already done for query1.
However this is the simplest approach, with equivalent results. Unless you
are facing performance problems this should be sufficient.

If however, you are facing performance issues, say, if the queries are very
large, and the index is large as well, and you have more than 2 stages
(search within (search within (search within search (..., and
resubmitting a larger and larger boolean query is out of the question - you
can go with the filter approach. For this you can use your hit collector,
which, while query1 is processed, would populate a bitset to be used for a
filter for query2, should that be requested by the user. But I wouldn't go
there if I don't have to.
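A minimal sketch of that filter approach, assuming the Lucene 2.0 API (the class name and the `collect()` helper are hypothetical): while query1 runs, a HitCollector fills a BitSet that is then wrapped as a Filter for query2.

```java
import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class FirstSearchFilter extends Filter {
    private final BitSet bits;

    private FirstSearchFilter(BitSet bits) { this.bits = bits; }

    // the Filter contract: which documents query2 may see
    public BitSet bits(IndexReader reader) { return bits; }

    // run query1 once, remembering every matching internal doc id
    public static FirstSearchFilter collect(Searcher searcher, Query query1,
                                            int maxDoc) throws IOException {
        final BitSet bits = new BitSet(maxDoc);
        searcher.search(query1, new HitCollector() {
            public void collect(int doc, float score) { bits.set(doc); }
        });
        return new FirstSearchFilter(bits);
    }
}
```

Usage would then be something like `searcher.search(query2, FirstSearchFilter.collect(searcher, query1, reader.maxDoc()))`.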

Doron



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Announcement: Lucene powering Monster job search index (Beta)

2006-11-03 Thread Peter Keegan

Daniel,
Yes, this is correct if you happen to be doing a radius search and sorting
by mileage.
Peter

On 11/3/06, Daniel Rosher <[EMAIL PROTECTED]> wrote:


Hi Peter,

Does this mean you are calculating the Euclidean distance twice ... once
for the HitCollector to filter 'out of range' documents, and then again
for the custom Comparator to sort the returned documents, especially
since the filtering is done outside Lucene?

Regards,
Dan


>Joe,
>
>Fields with numeric values are stored in a separate file as binary values
>in an internal format. Lucene is unaware of this file and unaware of the
>range expression in the query. The range expression is parsed outside of
>Lucene and used in a custom HitCollector to filter out documents that
>aren't in the requested range(s). A goal was to do this without having to
>modify Lucene. Our scheme is pretty efficient, but not very general
>purpose in its current form, though.
>
>Peter
>
>
>On 10/30/06, Joe Shaw <[EMAIL PROTECTED]> wrote:
>>
>> Hi Peter,
>>
>> On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote:
>> > Numeric range search is one of Lucene's weak points (performance-wise)
>> > so we have implemented this with a custom HitCollector and an extension
>> > to the Lucene index files that stores the numeric field values for all
>> > documents.
>> >
>> > It is important to point out that this has all been implemented with
>> > the stock Lucene 2.0 library. No code changes were made to the Lucene
>> > core.
>>
>> Can you give some technical details on the extension to the Lucene index
>> files?  How did you do it without making any changes to the Lucene core?
>>
>> Thanks,
>> Joe
>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>




Re: How to get Term Weights (document term matrix)?

2006-11-03 Thread Chris Hostetter

I don't really know what a "term matrix" is, but when you ask about
"weight", is it possible you are just looking for the TermDocs.freq() of
the term/doc pair?


: Date: Thu, 02 Nov 2006 12:45:30 +0100
: From: Soeren Pekrul <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: How to get Term Weights (document term matrix)?
:
: Hello,
:
: I would like to extract and store the document term matrix externally. I
: iterate the terms and the documents for each term:
: TermEnum terms=IndexReader.terms();
: while(terms.next()) {
:   TermDocs docs=IndexReader.termDocs(terms.term());
:   while(docs.next()) {
:   //store the term, the document and the weight
:   }
: }
:
: How can I get the term weight for a document?
:
: Thanks. Sören
:
: -
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Intermittent search performance problem

2006-11-03 Thread Ben Dotte

Hi,

I'm trying to figure out a way to troubleshoot a performance problem
we're seeing when searching against a memory-based index. What happens
is we will run a search against the index and it generally returns in
1 second or less. But every once in a while it takes 15-20 seconds for
the exact same search for no apparent reason. There is nothing else
going on in the system to cause this behavior.

I have tried hooking up YourKit profiler to see where the time is
going but it doesn't even record the extra time being taken up, even
when I ask for method invocation counts.

This is very strange, we have been using Lucene for years in
production and I've never seen a problem like it. It is also only
exhibited in one particular index, we cannot reproduce the problem
with other indexes. This index has around 170,000 documents in it and
does not have a particularly large amount of data relative to our
other indexes.

I would really appreciate any suggestions for tracking down the
culprit. Since YourKit is missing the extra time it seems like some
sort of lock/synchronized method issue but I've only really seen that
type of problem using disk indexing when the indexes aren't optimized.
We're currently on Lucene 2.0 but I had the same problem with 1.9.1.

Thanks,
Ben

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Modelling relational data in Lucene Index?

2006-11-03 Thread Emmanuel Bernard

Hi
No, he is talking about
http://www.hibernate.org/hib_docs/annotations/reference/en/html/lucene.html

Also note that I'm about to release a new, much more flexible version
http://www.mail-archive.com/hibernate-dev%40lists.jboss.org/msg00392.html
and, for the future (but flexible),
http://www.mail-archive.com/hibernate-dev%40lists.jboss.org/msg00393.html

Note that Compass is an alternative approach. I haven't really looked at 
the project in detail; the main drawbacks for me and some other people 
who compared the two were:

- it requires you to deal with a different API than your ORM
- it does not give you back a managed (ORM) object on query results
- it abstracts Lucene quite a lot

I guess you need to check for yourself

Emmanuel

Rajesh parab wrote:

Thanks Mark.

Can you please tell me more about the Lucene add-on you are talking about? Are 
you talking about Compass?

Regards,
Rajesh

- Original Message 
From: Mark Miller <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, November 2, 2006 7:29:10 PM
Subject: Re: Modelling relational data in Lucene Index?

Lucene is probably not the solution if you are looking for a relational 
model. You should be using a database for that. If you want to combine 
Lucene with a relational model, check out Hibernate and the new EJB 
annotations that it supports...there is a cool little Lucene add-on that 
lets you declare fields to be indexed (and how) with annotations.


- Mark

Rajesh parab wrote:
  

Hi,

As I understand, Lucene has a flat structure where you can define multiple 
fields inside the document. There is no relationship between any field.

I would like to enable index-based search for some of the components inside a relational database. For example, let's say a "Folder" object. The Folder object can have a relationship with a File object. The File object, in turn, can have attributes like is-image, is-text-file, etc. So, the structure is:

Folder --> File
             |
             --> is image, is text file, ..


I would like to enable a search to find a Folder with a File of type image. How 
can we model such relational data inside a Lucene index?

Regards,
Rajesh




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  






Re: Re: for admins: mailing list like spam

2006-11-03 Thread Mike Klaas

On 11/3/06, Patrick Turcotte <[EMAIL PROTECTED]> wrote:

>
> It will make mails list more easy to read (I am using gmail and I do
> not have client-side filters).


That is not true.

You can have labels, and, if you look at the top of the page, right beside
the  "Search the Web" button, you have a "create filter" link.


"Skip Inbox" is particularly important when doing this.

-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Modelling relational data in Lucene Index?

2006-11-03 Thread Emmanuel Bernard

Hi,
What exactly are your concerns about the "non-detached" approach (see 
below)?


Chris Lu wrote:

> I would prefer a detached approach instead of Hibernate or EJB's
> approach, which is kind of too tightly coupled with any system. How to

it is probably going to be coupled with yours ;-)

> rebuild if the index is corrupted, or you have a new Analyzer, or

I've introduced a session.index() which forces the (re)indexing of the 
document

> schema evolves? How to make it multi-thread safe?

What do you mean by multithread safe? The indexing?
The indexing is multithread-safe in the Hibernate Lucene integration.

The query process?
The query doesn't have to be, since you query on a given session (aka user 
conversation), so no multithread threat here.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Suspected problem in the QueryParser

2006-11-03 Thread Chris Hostetter

: When I enter the query: "Table AND NOT Chair"  I get one hit, doc3
: When I enter the query: "Table AND (NOT Chair)" I get 0 hits.
:
: I had thought that both queries would return the same results.  Is this a
: bug, or, am I not understanding the query language correctly?

it's a confusing eccentricity of the QueryParser syntax ... as a general
rule, things in parens need to be self-contained, effective queries ... if
you have something in parens which would not make sense as a query by
itself, then it won't make any more sense as part of a larger query.

In your case, the query "  NOT Chair " is the problem ... you can't have
a negative clause in isolation by itself -- it doesn't make sense because
there isn't anything positively selecting results for you to then exclude
results from.


As a side note: I strongly encourage you to train yourself to think in
terms of MUST, MUST_NOT and SHOULD (which are represented in the query
parser as the prefixes "+", "-" and the default) instead of in terms of
AND, OR, and NOT ... Lucene's BooleanQuery (and thus Lucene's QueryParser)
is not a strict Boolean Logic system, so it's best not to try and think
of it like one.
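For example, the intent behind "Table AND NOT Chair" can be expressed directly with MUST/MUST_NOT clauses (a sketch assuming the Lucene 2.0 BooleanClause.Occur API; the field name is hypothetical):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ProhibitedClauseExample {
    public static Query tableNotChair(String field) {
        BooleanQuery q = new BooleanQuery();
        // the MUST clause positively selects documents ...
        q.add(new TermQuery(new Term(field, "table")),
              BooleanClause.Occur.MUST);
        // ... and the MUST_NOT clause excludes from that set
        q.add(new TermQuery(new Term(field, "chair")),
              BooleanClause.Occur.MUST_NOT);
        return q;   // query-parser equivalent: +table -chair
    }
}
```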

-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Modelling relational data in Lucene Index?

2006-11-03 Thread Chris Lu

I personally like your effort, but technically I would disagree.

The SOLR project, and the project I am working on, DBSight, have a
detached approach that is implementation-agnostic, no matter if it's
Java, Ruby, PHP, or .NET. The returned results can be rendered HTML,
JSON, or XML. I don't think you can be more flexible than that. If
creating a new index takes 5 minutes without any coding, you can
create something more creative.

From the business side, you don't need to worry about indexing when
designing a system. New requirements may come; it's very hard to
anticipate all the needs.

Technically, a detached approach gives more flexibility with resources
like CPU, memory, and hard drive. For example, if your index grows
large, say 1GB, indexing can take hours with merging; I am not sure how
Compass or Hibernate/Lucene handles that. Do you need to rewrite code at
that time? I actually feel it's a dangerous trap.


> I've introduced a session.index() which forces the (re)indexing of the
> document

So does it mean you need to write some code to fix the index if it
crashes?

> What do you mean by multithread safe? The indexing?
> the indexing is multithread safe in the Hibernate Lucene integration

The indexing can be thread-safe. But will it affect the searching? With
many files changing and merging, if you cache the searcher, the
searching will get "read past EOF" exceptions. If you don't cache the
searcher, you lose Lucene's built-in caching (FieldCacheImpl).

> The query process?
> the query doesn't have to since you query on a give session (aka user
> conversation), so no multithread threat here.

So you are not caching the searcher.

--
Chris Lu
-
Instant Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com

On 11/3/06, Emmanuel Bernard <[EMAIL PROTECTED]> wrote:

Hi,
What exactly are your concerned about the "non-detached" approach (see
below)?

Chris Lu wrote:
>
> I would prefer a detached approach instead of Hibernate or EJB's
> approach, which is kind of too tightly coupled with any system. How to
it is probably going to be couple with yours ;-)
> rebuild if the index is corrupted, or you have a new Analyzer, or
I've introduced a session.index() which forces the (re)indexing of the
document
> schema evolves? How to make it multi-thread safe?
What do you mean by multithread safe? The indexing?
the indexing is multithread safe in the Hibernate Lucene integration

The query process?
the query doesn't have to since you query on a give session (aka user
conversation), so no multithread threat here.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: TooManyClauses with MultiTermQueries

2006-11-03 Thread Silvy Mathews
Hi All,
I also need to resolve this issue. What is the best way to catch this exception?
Thanks
Mathews

-Original Message-
From: Eric Louvard [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 03, 2006 8:36 AM
To: java-user@lucene.apache.org
Subject: TooManyClauses with MultiTermQueries

Hello, I have been working with Lucene for several years.
One of my biggest problems was the inability of Lucene to search with 
wildcards, so I developed my own MultiTermQueries.

Now there's a standard class for this, but you'll always get an 
exception if your search is too generic, 'a*' for example.
I can't solve this problem, but I make it acceptable with the following 
algorithm:
- get all possible terms.
- sort them (currently by the length difference to the search term: if 
you search 'TooMany*' then 'TooManyDog' ranks better than 
'TooManyClauses').
- keep only the allowed number (I want my BooleanQuery not to exceed 100 
terms, for example).
- search this.

For this Query I can call:
.getWarnings() gives me a string with a description of the limitation 
("Found 265654 terms for your search; please be more precise.")
.getTermsList() gives the list of all searched terms (also useful for the user).

So I can always get a result. Mostly, with this sorting, I am getting 
the term the user searched for (you can use another sort). I can limit 
maxClauseCount to a small value (avoiding out-of-memory errors and 
improving performance).

Hope this can help someone. I think it would be a nice feature to 
implement in Lucene.


PS: sorry for my poor English.

-- 
Mit freundlichen Grüßen

i. A. Éric Louvard
HAUK & SASKO Ingenieurgesellschaft mbH
Zettachring 2
D-70567 Stuttgart

Phone: +49 7 11 7 25 89 - 19
Fax: +49 7 11 7 25 89 - 50
E-Mail: [EMAIL PROTECTED]
www: www.hauk-sasko.de





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Intermittent search performance problem

2006-11-03 Thread Yonik Seeley

On 11/3/06, Ben Dotte <[EMAIL PROTECTED]> wrote:

I'm trying to figure out a way to troubleshoot a performance problem
we're seeing when searching against a memory-based index. What happens
is we will run a search against the index and it generally returns in
1 second or less. But every once in a while it takes 15-20 seconds for
the exact same search for no apparent reason.


Are you sure it's not just a big GC?  How big is your heap?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Intermittent search performance problem

2006-11-03 Thread Ben Dotte

Good suggestion. I tried watching the GCs in YourKit while testing, but
unfortunately they don't seem to line up with the searches that take
forever. They also don't last long enough to account for that kind of
time. I have our heap limited to 1GB right now and it's using around
768MB of that.

On 11/3/06, Ben Dotte <[EMAIL PROTECTED]> wrote:

I'm trying to figure out a way to troubleshoot a performance problem
we're seeing when searching against a memory-based index. What happens
is we will run a search against the index and it generally returns in
1 second or less. But every once in a while it takes 15-20 seconds for
the exact same search for no apparent reason.


Are you sure it's not just a big GC?  How big is your heap?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to get Term Weights (document term matrix)?

2006-11-03 Thread Soeren Pekrul

Chris Hostetter wrote:

I don't really know what a "term matrix" is, but when you ask about
"weight' is it possible you are just looking for the TermDoc.freq() of the
term/doc pair?


Thank you Chris,

that was also my first idea. I wanted to get the document frequency
indexreader.docFreq(term)
and the term frequency
termdoc.freq()
to calculate the term weight by myself.
If I change the scoring by subclassing the Similarity class, I have to 
change the code for the term weight calculation as well. The better way 
would be to use the same scoring engine for a single term weight and for 
the ranking of search results.


It seems that there is no simple function to ask for the weight of a term 
in a document directly. So I decided not to iterate the documents of a 
term or the terms of a document. Instead, I'm iterating the terms of the 
index, searching for each term, iterating the result documents, and using 
the score as my term weight for the document term matrix:


TermEnum terms=indexreader.terms();
while(terms.next()) {
  Term term=terms.term();
  // write the term to the external document term matrix
  Hits hits=indexsearcher.search(new TermQuery(term));
  for(int i=0; i<hits.length(); i++) {
    // write the document id (key, URL or index number) to the document term matrix
    float weight=hits.score(i);
    // write the term weight to the document term matrix
  }
}

Sören

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to get Term Weights (document term matrix)?

2006-11-03 Thread Chris Hostetter

: It seems that there is no simple function to ask the weight for a term
: in a document directly. So I decide not to iterate the documents of a

as i said: it depends on what you mean by "term weight" ...

: term or the terms of a document. I'm iterating the terms of the index,
: searching for the term, iterating the result documents and using the
: score as my term weight for the document term matrix:

...okay, so it sounds like you're defining the term weight of a doc/term to
be the score of that document when searching for that term.

You really, *REALLY* don't want to be doing this using the "Hits" class
like in your example ...
   1) this will re-execute your search behind the scenes many many times
   2) the scores returned by "Hits" are pseudo-normalized ... they will be
      meaningless for any sort of comparison.

if your concern is making sure that the score you get back matches the
score you would get from executing a search even if you change the
Similarity, you could just make sure you use the lengthNorm and tf
functions from the Similarity class just like TermScorer does ... or you
could keep executing a TermQuery for each term like you are now, but using
a HitCollector so you get the raw score.

take a look at the Searcher.search methods that take in a HitCollector.
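A minimal sketch of that HitCollector-based approach, assuming the Lucene 2.0 API (the class name and helper method are hypothetical):

```java
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;

public class RawTermWeights {
    // Fill weights[doc] with the raw (un-normalized) score of `term`
    // for every matching document; unmatched docs stay at 0.
    public static float[] collect(Searcher searcher, Term term, int maxDoc)
            throws IOException {
        final float[] weights = new float[maxDoc];
        searcher.search(new TermQuery(term), new HitCollector() {
            public void collect(int doc, float score) {
                // raw score, unlike the pseudo-normalized Hits.score(i)
                weights[doc] = score;
            }
        });
        return weights;
    }
}
```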



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: search within search

2006-11-03 Thread spinergywmy

Hi,

   Doron, thanks for the advice.

regards,
Wooi Meng
-- 
View this message in context: 
http://www.nabble.com/search-within-search-tf2558237.html#a7171019
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]