Thanks again

I am not really sure, why it is enough for you to sort the first 50 highest
ranking hits, but if you only want to do this, sorting afterwards is quite
straightforward.
Just to clarify.. And I know it may seem strange...
But I'll mostly be conducting "long" phrase (3 or 4 word) searches on the index. So I don't expect that many hits to appear given that all I have are 5-grams.

If I have above 50 hits then problem solved, need not do anything else.
If I have below 50 hits then I want to look at the frequency count of the hits I got.

Anyway.... I was just curious about the behavior.
My fault. Thanks for all the ideas.

Another idea is to not index the count itself, but more use the count as a
boost factor for each document. The ranking algorithm of lucene will then
boost docs with higher counts to the top results. But notice: The boost is
combined with the already available boots (multiplied), so the sorting will
be a combination of boost and native lucene rank. You can try to overcome
that by scaling the boost very large (factor 10 or so should be enough for
5-grams).

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Nuno Seco [mailto:ns...@dei.uc.pt]
Sent: Thursday, November 12, 2009 6:08 PM
To: java-user@lucene.apache.org
Subject: Re: OutOfMemoryError when using Sort

Ok. Thanks.

The doc. says:
"Finds the top |n| hits for |query|, applying |filter| if non-null, and
sorting the hits by the criteria in |sort|."

I understood that only the hits (50 in this) for the current search
would be sorted...
I'll just do the ordering afterwards. Thank you for clarifying this issue.


--
Nuno Seco


Jake Mannix wrote:
Sorting utilizes a FieldCache: the forward lookup - the value a document
has
for a
particular field (as opposed to the usual "inverted" way of looking at
all
documents
which contains a given term), which lives in memory, and takes up as
much
space
as one 4-bytes * numDocs.

If you've indexed the entire 5Gram data set on one index, on one
machine,
you've
got how many documents?  Billions?  This will take up billions*4bytes of
RAM
to
do this sort.

You need to shard your index (break it up onto multiple machines, do
your
sort
distributed, and merge the results) if you want to do this sorting with
any
kind
of performance.

  -jake

On Thu, Nov 12, 2009 at 7:57 AM, Nuno Seco <ns...@dei.uc.pt> wrote:


Hello List.

I'm having a problem when I add a Sort object to my searcher:
  docs = searcher.search(parser.parse(search), null, 50, sort);

Every time I execute a query I get an OutOfMemoryError exception.
But if I execute the query without the Sort object it works fine

Let me briefly explain how my index is structured.
I'm indexing the Google 5Grams
(
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-
to-you.html).
The index just has two fields:
  data = new Field("data", tokens[0], Field.Store.YES,
Field.Index.ANALYZED, Field.TermVector.NO);
  count = new Field("count", tokens[1], Field.Store.YES,
Field.Index.NO, Field.TermVector.NO);

the data corresponds to the 5 gram; e.g.: "my business manager informed
me"
and the count is simply an integer that represents the frequency of the
ngram.

The index size after optimization is 63G.

If I do not store the data field using:
  data = new Field("data", tokens[0], Field.Store.NO,
Field.Index.ANALYZED, Field.TermVector.NO);
the total size drops to 32G


But using either index with the Sort object causes the exception. I'm
creating the Sort object like:
  Sort sort = new Sort(new SortField("count", SortField.INT));

Note: That even with out using the Sort object I still need to pump the
jvm to 2G (-Xmx2048m). But thats ok...


So.... Basically what I want is to order those first 50 hits I get
according to their frequency counts (count field).


I'm using:
java version "1.6.0_16" (64 bit)
lucene 2.9.1
linux ext3 FS
linux kernel 2.6.31-15

Can anybody help me or redirect me in the right direction?

Thanks

--
Nuno Seco

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to