Re: Distinct search

Eugeny N Dzhurinsky Wed, 11 Oct 2006 08:45:15 -0700

On Wed, Oct 11, 2006 at 11:30:03AM -0400, Erick Erickson wrote:
> There's no real group_by functionality in Lucene. I'd have to ask, though,
> "why organize your index this way"? I'm guessing that you're approaching
> this from a database perspective, and if that's so, you may want to re-think
> some things. Although see below for my contradicting myself.....
> 
> Lucene excels as a text search engine, NOT an RDBMS. It's almost a sure bet
> that when you find yourself trying to do DB-like things in Lucene, you
> should either
> 1> re-think how you use Lucene
> 2> use a database or
> 3> use a hybrid solution, using Lucene for your text searches and a DB for
> the DB-like things you want to do.
> 
> In your example (and I understand that you've perhaps simplified it enough
> for brevity that the following is inapplicable), instead of indexing these
> records, why not put all the text in a single field for each doc ID? e.g.
> Document doc = new Document();
> doc.add(new Field("id", "10", ....));
> doc.add(new Field("some_text", "some text here",.....));
> // NOTE: the field name is exactly the same as on the previous line.
> doc.add(new Field("some_text", "some another text here",......));
> writer.addDocument(doc);
> 
> This will create one Lucene document, with an id of 10, and text "some text
> here some another text here". (I left out the storage and indexing flags
> above.)
> 
> Now, when you search, your Hits object will have one and only one entry for
> doc ID 10. It'll have relevance scores, and should fix you right up. This
> assumes that you're breaking your some_text up into tokens using the
> appropriate tokenizer.
> 
> 
> Note: it didn't occur to me until I'd used Lucene for some time, but
> according to a discussion a while back, the above is exactly equivalent to
> doc.add(new Field("id", "10", ....));
> doc.add(new Field("some_text", "some text here some another text here",
> ......));
> writer.addDocument(doc);
> 
> Of course, how this applies to your paging issues is another story. I'm also
> dealing with trying to get a mapping between offsets into a document and the
> corresponding pages. It's interesting, especially when it comes to wildcard
> queries, and I haven't found a satisfactory solution yet. One "interesting"
> issue if you choose to consider each page (record) as a Lucene document is
> how you deal with relevancy. That is, how do 10 hits on 3 pages of a 100
> page book rank compared to 25 hits on 15 pages of a 900 page book? Which is
> "more relevant"? This may be completely irrelevant to your problem, but I'm
> inferring that your records correspond to a page......
> 
> Erik Hatcher suggested re-casting all the queries into Span queries and then
> using a Spans object. This, together with perhaps bumping the offsets of the
> first term of each page by, say, 10,000 might work for me. I'll know more in
> a day or two....
> 
> Hope this helps
> Erick



Thank you very much for the prompt response!
I am really trying to apply full-text search to an existing database, so I may
be thinking in RDBMS terms. I hadn't thought of adding several fields with the
same name to a Document object that is to be indexed. This might help me, but:

* if something (for instance, a comment on a document) is changed, I need to
update the index somehow
* if a new comment is added, I need to add it to the index record
* if a comment is deleted, I need to remove it from the index record

As far as I remember, Lucene doesn't allow index records to be modified in
place, so I need to delete the record and add it again. Am I correct?
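
If so, the update flow I have in mind would look roughly like the sketch
below. It is plain Java with an in-memory map standing in for the index, not
Lucene API: the names ReindexSketch, replaceRecord and searchableText are
made up for illustration. The idea is that all three cases above (comment
changed, added, deleted) reduce to the same step: re-read the current
comments for the id from the database, delete the old record, and add a
freshly built one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for an index holding one record per id. Hypothetical, not Lucene API. */
class ReindexSketch {
    // id -> all comment texts that make up that id's single index record
    private final Map<String, List<String>> index = new LinkedHashMap<>();

    /**
     * Replaces the whole record for an id: delete first, then add again.
     * currentComments is assumed to be the full, current list read from the database.
     */
    void replaceRecord(String id, List<String> currentComments) {
        index.remove(id); // step 1: delete the old record, if there is one
        if (!currentComments.isEmpty()) {
            // step 2: add a freshly built record with every current comment
            index.put(id, new ArrayList<>(currentComments));
        }
    }

    /** Text a search would see for this id: all comments joined by a space. */
    String searchableText(String id) {
        List<String> texts = index.get(id);
        return texts == null ? "" : String.join(" ", texts);
    }

    /** Number of records currently in the toy index. */
    int size() {
        return index.size();
    }
}
```

With this shape, the caller never edits a record in place; it only needs to
know which id was touched and then calls replaceRecord with whatever the
database currently holds for that id.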

-- 
Eugene N Dzhurinsky

