There's no real group_by functionality in Lucene. I'd have to ask, though, "why organize your index this way?" I'm guessing that you're approaching this from a database perspective, and if that's so, you may want to re-think some things. Although see below, where I contradict myself.....
Lucene excels as a text search engine, NOT an RDBMS. It's almost a sure bet that when you find yourself trying to do DB-like things in Lucene, you should either 1> re-think how you use Lucene, 2> use a database, or 3> use a hybrid solution, using Lucene for your text searches and a DB for the DB-like things you want to do.

In your example (and I understand that you've perhaps simplified it enough for brevity that the following is inapplicable), instead of indexing these records, why not put all the text in a single field for each doc ID? e.g.

Document doc = new Document();
doc.add(new Field("id", "10", ....));
doc.add(new Field("some_text", "some text here", .....));
doc.add(new Field("some_text", "some another text here", ......)); // NOTE: the field name is exactly the same as on the previous line.
writer.addDocument(doc);

This will create one Lucene document, with an id of 10 and text "some text here some another text here" (I left out the storage and indexing flags above). Now when you search, your Hits object will have one and only one entry for doc ID 10. It'll have relevance scores, and should fix you right up. This assumes that you're breaking your some_text up into tokens using the appropriate tokenizer.

Note: it didn't occur to me until I'd used Lucene for some time, but according to a discussion a while back, the above is exactly equivalent to

doc.add(new Field("id", "10", ....));
doc.add(new Field("some_text", "some text here some another text here", ......));
writer.addDocument(doc);

Of course, how this applies to your paging issues is another story. I'm also dealing with trying to get a mapping between offsets into a document and the corresponding pages. It's interesting, especially when it comes to wildcard queries, and I haven't found a satisfactory solution yet.

One "interesting" issue, if you choose to consider each page (record) as a Lucene document, is how you deal with relevancy. That is, how do 10 hits on 3 pages of a 100-page book rank compared to 25 hits on 15 pages of a 900-page book? Which is "more relevant"? This may be completely irrelevant to your problem, but I'm inferring that your records correspond to a page......

Erik Hatcher suggested re-casting all the queries into Span queries and then using a Spans object. This, together with perhaps bumping the offsets of the first term of each page by, say, 10,000, might work for me. I'll know more in a day or two....
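Going back to the indexing snippet above for a second: if it helps to see it in one runnable piece, here's a minimal sketch against the 2.x API. The Field.Store/Field.Index flags, the StandardAnalyzer, and the RAMDirectory are just my guesses at what was behind the "...." I left out, so swap in whatever fits your setup:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class CombinedFieldExample {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        // "true" means create a fresh index
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document doc = new Document();
        // keep the id un-tokenized so it comes back exactly as stored
        doc.add(new Field("id", "10", Field.Store.YES, Field.Index.UN_TOKENIZED));
        // two values added under the same field name end up in one logical field
        doc.add(new Field("some_text", "some text here",
                          Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("some_text", "some another text here",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();
    }
}

Searching that index the usual IndexSearcher/Hits way should then give you exactly one hit for id 10, however many of your original records contributed text to it.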
Hope this helps
Erick

On 10/11/06, Eugeny N Dzhurinsky <[EMAIL PROTECTED]> wrote:

Hi there! I have an index structure like this:

document_id
some_text
.....

When searching for some set of documents, there could be a case where several comments for the same document match the search criteria. In such a case I need to get a single hit for all of them - in other words, perform a "group by"-like operation based on document_id.

For example, if I have the records

1 : 10 : some text here
2 : 10 : some another text here

and the search string was "+some_text:some" - I need to get only one hit for both of these records (and return only the document_id).

I know I could collect all hits and then filter them, but I also need paging functionality, so if I need to collapse 1,000,000 hits into 50,000 records - I would have to traverse all 1,000,000 records, put the 50,000 unique items into a helper array, and then get the last page with 10 results - and it will take a lot of time.

--
Eugene N Dzhurinsky