Re: Distinct search

Erick Erickson Wed, 11 Oct 2006 09:25:04 -0700

No problem. Partly, it's helping me clarify my current problem <G>....


Yes, you must delete and re-add a document to change it. You might want to
look at the IndexModifier class. Be aware of some things:
1> Lucene doc IDs may change when the index is changed, I think after
optimization. So, in order to find specific docs you'll want to add a unique
ID field.
2> Deleting a document only marks it as deleted. It's still actually in the
index (although you won't be able to see it). Optimizing an index is what
physically removes it (and shrinks the index size).
3> after changing an index, you must open a new IndexReader before you see
the changes.
4> The only information you can reliably get out of an index is data you've
stored (e.g. Field.Store.YES). This only matters if you're trying to read
the doc *from* the index and change it. If you're just replacing it with a
fresh copy from somewhere else (e.g. re-reading the doc from the database),
it's not a problem.
5> Lucene has no notion of constraints. That is, Lucene is perfectly happy
with two identical documents in the index, so be sure to delete before you
add <G>....
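
To make points 1, 2 and 5 concrete, here is a minimal sketch in plain Java.
It is a toy in-memory model, NOT Lucene's API: the class ToyIndex and all of
its methods are hypothetical names made up for illustration. It only shows
the bookkeeping: delete by your own unique id, add the fresh copy, and note
that internal positions shift once deleted entries are physically dropped.
With the real library the same pattern would be a delete on a Term for your
id field followed by adding the new Document.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory model of the behaviours listed above. NOT Lucene's API. */
final class ToyIndex {
    /** One indexed document: our own unique "id" field plus its text. */
    static final class Doc {
        final String id;
        final String text;
        boolean deleted; // deleting only sets this flag (point 2)
        Doc(String id, String text) { this.id = id; this.text = text; }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Adds a document. There is no uniqueness check (point 5). */
    public void add(String id, String text) { docs.add(new Doc(id, text)); }

    /** Marks every live document with this id as deleted; returns how many. */
    public int deleteById(String id) {
        int n = 0;
        for (Doc d : docs) {
            if (!d.deleted && d.id.equals(id)) { d.deleted = true; n++; }
        }
        return n;
    }

    /** "Update" means: delete by unique id, then add the fresh copy. */
    public void update(String id, String text) {
        deleteById(id);
        add(id, text);
    }

    /** Physically drops deleted docs; internal positions shift (point 1). */
    public void optimize() { docs.removeIf(d -> d.deleted); }

    /** Number of slots, including docs that are only marked deleted. */
    public int slots() { return docs.size(); }

    /** Number of live (visible) documents with this id. */
    public int liveCount(String id) {
        int n = 0;
        for (Doc d : docs) {
            if (!d.deleted && d.id.equals(id)) n++;
        }
        return n;
    }

    /** Internal position of the first live doc with this id, or -1. */
    public int docNumber(String id) {
        for (int i = 0; i < docs.size(); i++) {
            Doc d = docs.get(i);
            if (!d.deleted && d.id.equals(id)) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.add("10", "some text here");
        idx.add("11", "other text");
        idx.update("10", "some text here plus a new comment");
        System.out.println(idx.slots() + " slots, doc 10 at " + idx.docNumber("10"));
        idx.optimize();
        System.out.println(idx.slots() + " slots, doc 10 at " + idx.docNumber("10"));
    }
}
```

The internal position of doc "10" differs before and after optimize(), which
is why the lookup goes through the id field rather than a remembered position.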

Erick

On 10/11/06, Eugeny N Dzhurinsky <[EMAIL PROTECTED]> wrote:


On Wed, Oct 11, 2006 at 11:30:03AM -0400, Erick Erickson wrote:
> There's no real group_by functionality in Lucene. I'd have to ask, though,
> "why organize your index this way"? I'm guessing that you're approaching
> this from a database perspective, and if that's so, you may want to re-think
> some things. Although see below for my contradicting myself.....
>
> Lucene excels as a text search engine, NOT an RDBMS. It's almost a sure bet
> that when you find yourself trying to do DB-like things in Lucene, you
> should either
> 1> re-think how you use Lucene
> 2> use a database or
> 3> use a hybrid solution, using Lucene for your text searches and a DB for
> the DB-like things you want to do.
>
> In your example (and I understand that you've perhaps simplified it enough
> for brevity that the following is inapplicable), instead of indexing these
> records, why not put all the text in a single field for each doc ID? e.g.
> Document doc = new Document();
> doc.add(new Field("id", "10", ....));
> doc.add(new Field("some_text", "some text here", .....));
> // NOTE: the field name is exactly the same as on the previous line.
> doc.add(new Field("some_text", "some another text here", ......));
> writer.addDocument(doc);
>
> This will create one Lucene document, with an id of 10, and text "some text
> here some another text here". (I left out the storage and indexing flags
> above.)
>
> Now, when you search your hits object will have one and only one entry for
> doc ID 10. It'll have relevance scores, and should fix you right up. This
> assumes that you're breaking your some_text up into tokens using the
> appropriate tokenizer.
>
>
> Note: it didn't occur to me until I'd used Lucene for some time, but
> according to a discussion a while back, the above is exactly equivalent to
> doc.add(new Field("id", "10", ....));
> doc.add(new Field("some_text", "some text here some another text here",
> ......));
> writer.addDocument(doc);
>
> Of course, how this applies to your paging issues is another story. I'm also
> dealing with trying to get a mapping between offsets into a document and the
> corresponding pages. It's interesting, especially when it comes to wildcard
> queries, and I haven't found a satisfactory solution yet. One "interesting"
> issue if you choose to consider each page (record) as a Lucene document is
> how you deal with relevancy. That is, how do 10 hits on 3 pages of a 100
> page book rank compared to 25 hits on 15 pages of a 900 page book? Which is
> "more relevant"? This may be completely irrelevant to your problem, but I'm
> inferring that your records correspond to a page......
>
> Erik Hatcher suggested re-casting all the queries into Span queries and then
> using a Spans object. This, together with perhaps bumping the offsets of the
> first term of each page by, say, 10,000 might work for me. I'll know more in
> a day or two....
>
> Hope this helps
> Erick


Thank you very much for a prompt response!
I am really trying to apply full-text search to an existing database, so I
might be thinking in RDBMS terms. In general, I didn't think of adding several
fields with the same name to a Document object that is to be indexed. This
might help me, but:

* if something (for instance, a comment on a document) is changed, I need to
change the index somehow
* if a new comment is added, I need to add it to an index record
* if a comment is deleted, I need to remove it from an index record

As far as I remember, Lucene doesn't allow modifying index records, so I need
to delete the record and add it again. Am I correct?

--
Eugene N Dzhurinsky

