This isn't a "How do I index a zip file?" question. It's a bit more complicated than that.
We have an index where zip files are broken apart and the contained files are indexed. The index also contains a doc for the zip file itself. The user has the option of (A) querying for the contained files that match the query (a vanilla query), or (B) querying for the unique set of zip files that have contained files that match the query. My question is how to *efficiently* accomplish option (B) in Lucene.

In case it helps, here's another way to explain the requirement in a relational model. If you had a table of docs with these columns:

    MyDocs table
    ============
    Docid
    ZipfileName
    Filename
    Other columns to match on...

then option (B) can be returned with a simple join:

    select distinct zip.docid, zip.other-columns, ...
    from mydocs zip, mydocs contained
    where contained.zipfilename = zip.filename
    and contained.docid matches lucene query...

In Lucene, the conceptual, straightforward solution is something like this:

    Do a Lucene query to get the matching contained docs.
    For each matching doc:
        Look up the zip filename via a field on the doc.
        If the zip file is not part of our zipfile result set yet, then
            Save the zip filename in the result set.
    Run another Lucene query to look up the zipfile docids in the zipfile result set.
    Read any required fields for each zipfile doc.
    Return the zipfile result set with the required fields.

The trouble with this solution is that it is very slow and a memory hog, since every matching contained doc has to be visited and its zip filename read before the second query can run.

Does anyone have any nifty ideas that beat this straightforward approach? We would also entertain alternative indexing approaches. We even considered concatenating all the text of the contained docs into a single doc indexed as the zip file, but Lucene only indexes part of a large file, and even if that were resolved, proximity searches could return false positives by matching terms that come from different contained files.

And FYI, scoring is not an issue on the zip file. It's purely match or no-match semantics.
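To make the two-pass logic concrete, here is a minimal Python sketch of the straightforward approach described above. It does not use Lucene at all: a plain in-memory list stands in for the index, and the field names (`docid`, `zipfilename`, `filename`, `text`), the `matching_zip_docs` helper, and the sample data are all hypothetical, chosen only to mirror the MyDocs columns.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for the index: one dict per document.
# Contained files carry the name of their zip in "zipfilename";
# the doc for a zip file itself has "zipfilename" set to None.
Doc = Dict[str, Optional[str]]


def matching_zip_docs(index: List[Doc], matches: Callable[[Doc], bool]) -> List[Doc]:
    """Return the zip file docs that have at least one matching contained doc."""
    # Pass 1: visit every matching contained doc and collect the
    # distinct zip filenames (the set removes duplicates).
    zip_names = set()
    for doc in index:
        if doc["zipfilename"] is not None and matches(doc):
            zip_names.add(doc["zipfilename"])

    # Pass 2: look up the zip file docs whose filename is in the set.
    return [
        doc
        for doc in index
        if doc["zipfilename"] is None and doc["filename"] in zip_names
    ]


# Made-up sample data: two zip files, three contained files.
index = [
    {"docid": "1", "zipfilename": None, "filename": "a.zip", "text": ""},
    {"docid": "2", "zipfilename": "a.zip", "filename": "readme.txt", "text": "lucene index"},
    {"docid": "3", "zipfilename": "a.zip", "filename": "notes.txt", "text": "lucene query"},
    {"docid": "4", "zipfilename": None, "filename": "b.zip", "text": ""},
    {"docid": "5", "zipfilename": "b.zip", "filename": "todo.txt", "text": "nothing relevant"},
]

result = matching_zip_docs(index, lambda d: "lucene" in d["text"])
print([d["docid"] for d in result])  # prints ['1']
```

Both contained files in a.zip match, yet a.zip is returned only once. The cost is visible in pass 1: its work and memory grow with the number of matching contained docs rather than with the number of zip files returned, which is the inefficiency I'd like to avoid.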
Thanks,
- Eric Scott