On Wed, Oct 11, 2006 at 11:30:03AM -0400, Erick Erickson wrote:
> There's no real group_by functionality in Lucene. I'd have to ask, though,
> "why organize your index this way"? I'm guessing that you're approaching
> this from a database perspective, and if that's so, you may want to re-think
> some things. Although see below for my contradicting myself.....
>
> Lucene excels as a text search engine, NOT an RDBMS. It's almost a sure bet
> that when you find yourself trying to do DB-like things in Lucene, you
> should either
> 1> re-think how you use Lucene
> 2> use a database or
> 3> use a hybrid solution, using Lucene for your text searches and a DB for
>    the DB-like things you want to do.
>
> In your example (and I understand that you've perhaps simplified it enough
> for brevity that the following is inapplicable), instead of indexing these
> records, why not put all the text in a single field for each doc ID? e.g.
>
>   Document doc = new Document();
>   doc.add(new Field("id", "10", ....));
>   doc.add(new Field("some_text", "some text here", ....));
>   // NOTE: the field name is exactly the same as on the previous line.
>   doc.add(new Field("some_text", "some another text here", ....));
>   writer.addDocument(doc);
>
> This will create one Lucene document, with an id of 10, and text "some text
> here some another text here". (I left out the storage and indexing flags
> above.)
>
> Now, when you search, your hits object will have one and only one entry for
> doc ID 10. It'll have relevance scores, and should fix you right up. This
> assumes that you're breaking your some_text up into tokens using the
> appropriate tokenizer.
>
> Note: it didn't occur to me until I'd used Lucene for some time, but
> according to a discussion a while back, the above is exactly equivalent to
>
>   doc.add(new Field("id", "10", ....));
>   doc.add(new Field("some_text", "some text here some another text here", ....));
>   writer.addDocument(doc);
>
> Of course, how this applies to your paging issues is another story. I'm also
> dealing with trying to get a mapping between offsets into a document and the
> corresponding pages. It's interesting, especially when it comes to wildcard
> queries, and I haven't found a satisfactory solution yet. One "interesting"
> issue if you choose to consider each page (record) as a Lucene document is
> how you deal with relevancy. That is, how do 10 hits on 3 pages of a 100
> page book rank compared to 25 hits on 15 pages of a 900 page book? Which is
> "more relevant"? This may be completely irrelevant to your problem, but I'm
> inferring that your records correspond to a page......
>
> Erik Hatcher suggested re-casting all the queries into Span queries and then
> using a Spans object. This, together with perhaps bumping the offsets of the
> first term of each page by, say, 10.000 might work for me. I'll know more in
> a day or two....
>
> Hope this helps
> Erick

Thank you very much for the prompt response!

I am really trying to apply full-text search to an existing database, so I may well be thinking in RDBMS terms.

In general, I hadn't thought of adding several fields with the same name to a Document object that is to be indexed. This might help me, but:

* if something (for instance, a comment on a document) is changed, I need to change the index somehow
* if a new comment is added, I need to add it to the index record
* if a comment is deleted, I need to remove it from the index record

As far as I remember, Lucene doesn't allow index records to be modified, so I need to delete the record and add it again. Am I correct?

--
Eugene N Dzhurinsky
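
To make the question concrete, here is a minimal sketch of the delete-and-re-add pattern for a record whose comments live in one repeated field. It is a plain-Java stand-in and deliberately does not use the Lucene API: the class CommentIndex and its methods add, delete, replace and search are hypothetical names invented for this illustration, and the whitespace tokenizing is an assumption, not what a Lucene analyzer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for an index; NOT the Lucene API.
class CommentIndex {
    // record id -> every value of the repeated "some_text" field
    private final Map<String, List<String>> records = new LinkedHashMap<>();

    // Index one record with any number of values for the same field.
    void add(String id, List<String> texts) {
        records.put(id, new ArrayList<>(texts));
    }

    // Remove the whole record; there is no in-place edit.
    void delete(String id) {
        records.remove(id);
    }

    // A changed, added, or removed comment is handled the same way:
    // drop the old record and index the full, current list again.
    void replace(String id, List<String> texts) {
        delete(id);
        add(id, texts);
    }

    // Ids of records where some value contains the term as a
    // whitespace-separated, case-insensitive token (assumed tokenizing).
    Set<String> search(String term) {
        String wanted = term.toLowerCase();
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : records.entrySet()) {
            for (String text : e.getValue()) {
                for (String token : text.toLowerCase().split("\\s+")) {
                    if (token.equals(wanted)) {
                        hits.add(e.getKey());
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        CommentIndex index = new CommentIndex();
        index.add("10", Arrays.asList("some text here", "some another text here"));
        System.out.println(index.search("another")); // one hit for record 10

        // The second comment was deleted in the database: rebuild record 10.
        index.replace("10", Arrays.asList("some text here"));
        System.out.println(index.search("another")); // no hits any more
    }
}
```

The point of the sketch is that all three cases from the list above (changed, added, deleted comment) collapse into one operation: re-read the current comments for the record from the database and re-index the whole record under the same id.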