oh...@cox.net wrote:
> - I'd have to create a (very small) index, for each sub-document, where I do
>   the Document.add() with just the (for example) two terms, then
> - Run a query against the 1-entry index, which
> - Would either give me a "yes" or "no" (for that sub-document)
>
> As I said, I'm concerned about overhead.  Some of the documents are quite large,
> containing >20K sub-documents.  That means that, for such a document, I'd have to
> create >20K indexes.

No, I'm talking about a separate document in the same index.

There are a few approaches here:

1) Index each sub-document separately. So if you have fields 'doc#', 'docname', 'subdoc#', and 'subdocterms', you might do:

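   // 'writer' is assumed to be an already-open IndexWriter. Field's extra
   // constructor arguments (store/index options) are omitted for brevity.
   for (Doc parent : docs) {
     for (SubDoc child : parent.subDocs()) {
       Document luceneDoc = new Document();
       // Parent-level fields are repeated on every sub-document.
       luceneDoc.add(new Field("doc#", parent.number()));
       luceneDoc.add(new Field("docname", parent.name()));
       luceneDoc.add(new Field("subdoc#", child.number()));
       luceneDoc.add(new Field("subdocterms", child.data()));
       writer.addDocument(luceneDoc);
     }
   }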

This means that in your index after indexing 2 docs with 2 subdocs each, you'll have
   (Lucene #)   doc#   docname   subdoc#   subdocterms
   ----------------------------------------------------
   0            100    Foo       101       subdoc1 terms here
   1            100    Foo       102       subdoc2 terms
   2            200    Bar       201       subdoc1 terms from doc2
   3            200    Bar       202       some more subdoc text

So the search you're doing is actually on the subdoc level. This can get complicated, especially as subdocs from the same parent doc may come back out of order, etc, depending on scoring/sorting.

Also, if there is a lot of data at the parent level, you're obviously duplicating it. This can get nasty.
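One way to cope with the out-of-order problem is to regroup the hits by 'doc#' after the search. Here's a minimal plain-Java sketch, assuming you've already read the 'doc#' and 'subdoc#' values out of each hit in rank order. The Hit class and groupByParent() are names I made up for illustration; none of this is Lucene API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (no Lucene types): regroup subdoc-level hits by parent doc#.
class SubDocGrouping {

    // Hypothetical holder for the 'doc#' and 'subdoc#' values of one hit.
    static final class Hit {
        final String docNumber;
        final String subDocNumber;

        Hit(String docNumber, String subDocNumber) {
            this.docNumber = docNumber;
            this.subDocNumber = subDocNumber;
        }
    }

    // LinkedHashMap keeps parents in the order of their first (best-ranked)
    // hit; each list keeps its subdocs in the order the hits came back.
    static Map<String, List<String>> groupByParent(List<Hit> rankedHits) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Hit hit : rankedHits) {
            grouped.computeIfAbsent(hit.docNumber, k -> new ArrayList<>())
                   .add(hit.subDocNumber);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Subdocs of doc 100 come back interleaved with a subdoc of doc 200.
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("100", "102"));
        hits.add(new Hit("200", "201"));
        hits.add(new Hit("100", "101"));
        System.out.println(groupByParent(hits)); // prints {100=[102, 101], 200=[201]}
    }
}
```

Whether you want parents ordered by their best subdoc (as here) or by some combined score is a design choice; this just shows the regrouping step.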

2) Maintain a (logically) separate subdoc index. You could have something like:
   doc#   docname  bigblobofdocdata
   ---------------------------------
   100    Foo      lots of data here...
   200    Bar      and lots more here..
in one index, and
   doc#   subdoc#   subdocterms
   ---------------------------------
   100    101       subdoc1 terms here
   100    102       subdoc2 terms
   200    201       subdoc1 terms from doc2
   200    202       some more subdoc text
in a second index.

Then you can FIRST search on the doc index to do any matches on 'docname' etc, then use the IDs you find to filter the subdoc index -- so if the user searches for 'docname=foo' and 'subdocterms=text', you first do the docname search to get the docname-matching doc (100), then do a search on the second index for 'subdocterms', but also filter where doc#=100.

Note they don't HAVE to be separate indexes -- you can actually keep these in the same physical index, with some sort of discriminator (all docs in an index don't have to have the same fields).
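To make the two-step flow of approach 2 concrete, here's a rough sketch that models the two indexes as in-memory lists. It is NOT Lucene code: ParentDoc, SubDoc, matchingParents() and matchingSubDocs() are all hypothetical names, and the "search" is just a naive whitespace term match. It only shows the order of operations: find the parent doc#s first, then filter the subdoc matches by them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// In-memory model of the two (logical) indexes; not Lucene API.
class TwoPhaseSearch {

    static final class ParentDoc {
        final String docNumber;
        final String docName;

        ParentDoc(String docNumber, String docName) {
            this.docNumber = docNumber;
            this.docName = docName;
        }
    }

    static final class SubDoc {
        final String docNumber;
        final String subDocNumber;
        final String terms;

        SubDoc(String docNumber, String subDocNumber, String terms) {
            this.docNumber = docNumber;
            this.subDocNumber = subDocNumber;
            this.terms = terms;
        }
    }

    // Phase 1: collect the doc# of every parent whose docname matches.
    static Set<String> matchingParents(List<ParentDoc> parents, String docName) {
        Set<String> docNumbers = new HashSet<>();
        for (ParentDoc parent : parents) {
            if (parent.docName.equalsIgnoreCase(docName)) {
                docNumbers.add(parent.docNumber);
            }
        }
        return docNumbers;
    }

    // Phase 2: subdocs containing the term, restricted to the allowed doc#s.
    static List<String> matchingSubDocs(List<SubDoc> subDocs, String term,
                                        Set<String> allowedDocNumbers) {
        String wanted = term.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (SubDoc sub : subDocs) {
            List<String> words =
                Arrays.asList(sub.terms.toLowerCase(Locale.ROOT).split("\\s+"));
            if (words.contains(wanted) && allowedDocNumbers.contains(sub.docNumber)) {
                result.add(sub.subDocNumber);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<ParentDoc> parents = Arrays.asList(
            new ParentDoc("100", "Foo"), new ParentDoc("200", "Bar"));
        List<SubDoc> subDocs = Arrays.asList(
            new SubDoc("100", "101", "subdoc1 terms here"),
            new SubDoc("100", "102", "subdoc2 terms"),
            new SubDoc("200", "201", "subdoc1 terms from doc2"),
            new SubDoc("200", "202", "some more subdoc text"));

        Set<String> allowed = matchingParents(parents, "foo");
        // Subdoc 201 also contains "terms" but is dropped by the doc# filter.
        System.out.println(matchingSubDocs(subDocs, "terms", allowed)); // prints [101, 102]
    }
}
```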

3) Do some really hardcore tricks with spanqueries. This is what I'm working on at the moment, so it's near and dear to my heart. It's not for the faint-hearted, though, and if you're new to Lucene it may scare you off, sorry! Basically Lucene has the concept of 'positions' for terms -- metadata about where in the document the term can be found. This lets you do 'near' queries, etc.

We're taking advantage of that to do some many-to-one stuff like your problem. Using the first example, with term positions indicated in [], we position terms from different subdocs with a large gap between them, like so:

   (Lucene #)   doc#   docname   subdoc#   subdocterms
   ----------------------------------------------------
   0            100    Foo       101[0]    subdoc1[0] terms[1] here[2]
                                 102[100]  subdoc2[100] terms[101]

   1            200    Bar       201[0]    subdoc1[0] terms[1] from[2] doc2[3]
                                 202[100]  some[100] more[101] subdoc[102] text[103]

So in each doc, subdoc #1's terms start at 0, #2's at 100, #3's at 200, etc. Then when we search we can say 'the terms you're looking for must be in the same 100-position block' to find only subdocs that match all subdoc-related subqueries. This is pretty hairy but is working well for us -- it massively reduces our indexing and search times compared to the duplicated document way I mentioned above.
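The position arithmetic behind this can be sketched in plain Java. This is only a model of the idea, not our actual code and not Lucene API: PositionBlocks, assignPositions() and inSameBlock() are hypothetical names, and I'm assuming whitespace-separated terms and a block size of 100 as in the table above. Note the scheme only works if no subdoc has more terms than the block size, otherwise its terms would spill into the next subdoc's block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Model of the "one position block per subdoc" scheme; not Lucene API.
class PositionBlocks {
    static final int BLOCK_SIZE = 100; // gap between subdoc start positions

    // Term j of subdoc i gets position i * BLOCK_SIZE + j.
    static Map<String, List<Integer>> assignPositions(List<String> subDocs) {
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int i = 0; i < subDocs.size(); i++) {
            String[] terms = subDocs.get(i).split("\\s+");
            if (terms.length > BLOCK_SIZE) {
                throw new IllegalArgumentException(
                    "subdoc " + i + " does not fit in one block");
            }
            for (int j = 0; j < terms.length; j++) {
                positions.computeIfAbsent(terms[j], k -> new ArrayList<>())
                         .add(i * BLOCK_SIZE + j);
            }
        }
        return positions;
    }

    // True if some occurrence of termA and some occurrence of termB fall
    // into the same BLOCK_SIZE-wide block, i.e. the same subdoc.
    static boolean inSameBlock(Map<String, List<Integer>> positions,
                               String termA, String termB) {
        List<Integer> none = Collections.emptyList();
        for (int posA : positions.getOrDefault(termA, none)) {
            for (int posB : positions.getOrDefault(termB, none)) {
                if (posA / BLOCK_SIZE == posB / BLOCK_SIZE) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The two subdocs of doc 200 from the table above.
        Map<String, List<Integer>> positions = assignPositions(Arrays.asList(
            "subdoc1 terms from doc2", "some more subdoc text"));
        System.out.println(inSameBlock(positions, "terms", "doc2")); // prints true
        System.out.println(inSameBlock(positions, "terms", "text")); // prints false
    }
}
```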

Cheers,

Paul
