Re: Distinct search

Erick Erickson Wed, 11 Oct 2006 08:31:08 -0700

There's no real group_by functionality in Lucene. I'd have to ask, though,
"why organize your index this way"? I'm guessing that you're approaching
this from a database perspective, and if that's so, you may want to re-think
some things. Although see below for my contradicting myself.....


Lucene excels as a text search engine, NOT an RDBMS. It's almost a sure bet
that when you find yourself trying to do DB-like things in Lucene, you
should either
1> re-think how you use Lucene
2> use a database or
3> use a hybrid solution, using Lucene for your text searches and a DB for
the DB-like things you want to do.

In your example (and I understand that you've perhaps simplified it enough
for brevity that the following is inapplicable), instead of indexing these
records, why not put all the text in a single field for each doc ID? e.g.
Document doc = new Document();
doc.add(new Field("id", "10", ....));
doc.add(new Field("some_text", "some text here", .....));
// NOTE: the field name is exactly the same as on the previous line.
doc.add(new Field("some_text", "some another text here", ......));
writer.addDocument(doc);

This will create one Lucene document, with an id of 10, and text "some text
here some another text here". (I left out the storage and indexing flags
above.)

Now, when you search, your hits object will have one and only one entry for
doc ID 10. It'll have relevance scores, and should fix you right up. This
assumes that you're breaking your some_text up into tokens using the
appropriate tokenizer.
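
If your records arrive one row at a time, the merging step can happen before anything reaches Lucene. What follows is only a minimal sketch in plain Java, with no Lucene calls in it: RecordMerger, Row and mergeByDocumentId are hypothetical names invented for illustration, and it assumes that joining the texts with a single space is acceptable for your data.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of Lucene: collapses several rows that share
// a document_id into one text per ID, ready to be indexed as one document.
final class RecordMerger {

    // One source row (document_id, some_text). Hypothetical type.
    static final class Row {
        final String documentId;
        final String text;

        Row(String documentId, String text) {
            this.documentId = documentId;
            this.text = text;
        }
    }

    // Returns document_id -> concatenated text, preserving first-seen order.
    static Map<String, String> mergeByDocumentId(List<Row> rows) {
        Map<String, StringBuilder> merged = new LinkedHashMap<>();
        for (Row row : rows) {
            StringBuilder sb =
                merged.computeIfAbsent(row.documentId, k -> new StringBuilder());
            if (sb.length() > 0) {
                sb.append(' '); // assumption: a single space separates the texts
            }
            sb.append(row.text);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : merged.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }
}
```

Each entry of the returned map would then become one Lucene document, with the key in the id field and the value in some_text.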


Note: it didn't occur to me until I'd used Lucene for some time, but
according to a discussion a while back, the above is exactly equivalent to
doc.add(new Field("id", "10", ....));
doc.add(new Field("some_text", "some text here some another text here", ......));
writer.addDocument(doc);

Of course, how this applies to your paging issues is another story. I'm also
dealing with trying to get a mapping between offsets into a document and the
corresponding pages. It's interesting, especially when it comes to wildcard
queries, and I haven't found a satisfactory solution yet. One "interesting"
issue if you choose to consider each page (record) as a Lucene document is
how you deal with relevancy. That is, how do 10 hits on 3 pages of a 100
page book rank compared to 25 hits on 15 pages of a 900 page book? Which is
"more relevant"? This may be completely irrelevant to your problem, but I'm
inferring that your records correspond to a page......

Erik Hatcher suggested re-casting all the queries into Span queries and then
using a Spans object. This, together with perhaps bumping the offsets of the
first term of each page by, say, 10,000 might work for me. I'll know more in
a day or two....
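
The arithmetic behind the offset-bumping idea can be sketched as follows. This is only one possible reading of it, not a tested design: it assumes every page starts at a position that is a multiple of 10,000, that no page holds 10,000 or more tokens, and that pages are numbered from 0. PagePositions and its methods are hypothetical names, not Lucene API.

```java
// Hypothetical helper, NOT part of Lucene: maps between token positions and
// pages when each page's first term is placed at a multiple of PAGE_GAP.
final class PagePositions {

    // Assumed gap between page starts; must exceed the longest page in tokens.
    static final int PAGE_GAP = 10_000;

    // Position assigned to the first token of the given 0-based page.
    static int firstPositionOfPage(int pageIndex) {
        return pageIndex * PAGE_GAP;
    }

    // 0-based page that contains the given (non-negative) token position.
    static int pageOfPosition(int position) {
        return position / PAGE_GAP;
    }

    // Token offset of the position relative to the start of its page.
    static int offsetWithinPage(int position) {
        return position % PAGE_GAP;
    }
}
```

Under these assumptions, a match position reported for a span, say 30042, would map to page 3 (0-based) at token 42 of that page.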

Hope this helps
Erick

On 10/11/06, Eugeny N Dzhurinsky <[EMAIL PROTECTED]> wrote:


Hi there!

I have an index structure like this:

document_id
some_text
.....

when searching for some set of documents, there could be a case when several
comments for the same document match the search criteria. In such a case I need
to get a single hit for all such cases, in other words, perform a "group by"-like
operation based on document_id. For example, if I have records

1 : 10 : some text here
2 : 10 : some another text here

and the search string was "+some_text:some", I need to get only one hit for both
these records (return only document_id).

I know I could collect all hits and then filter them, but I also need paging
functionality, so if I need to collect 1.000.000 hits into 50.000 records,
I need to traverse all 1.000.000 records, put 50.000 unique items into a
helper array, then get the last page with 10 results, and it will take a lot of
time.

--
Eugene N Dzhurinsky
