Re: Possible to invoke same Lucene query on a String?

Paul Cowan Thu, 20 Aug 2009 21:19:28 -0700

oh...@cox.net wrote:

- I'd have to create a (very small) index, for each sub-document, where I do 
the Document.add() with just the (for example) two terms, then
- Run a query against the 1-entry index, which
- Would either give me a "yes" or "no" (for that sub-document)


As I said, I'm concerned about overhead.  Some of the documents are quite large, 
containing >20K sub-documents.  That means that, for such a document, I'd have to 
create >20K indexes.


No, I'm talking about a separate document in the same index.

There are a few approaches here:

1) Index each sub-document separately. So if you have fields 'doc#', 'docname', 'subdoc#', and 'subdocterms', you might do:


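   // 'writer' is assumed to be an already-open IndexWriter, and the
   // accessors are assumed to return Strings. The Store/Index choices
   // below are illustrative only.
   for (Doc parent : Docs) {
     for (SubDoc child : parent.subDocs()) {
       Document luceneDoc = new Document();
       luceneDoc.add(new Field("doc#", parent.number(), Field.Store.YES, Field.Index.NOT_ANALYZED));
       luceneDoc.add(new Field("docname", parent.name(), Field.Store.YES, Field.Index.ANALYZED));
       luceneDoc.add(new Field("subdoc#", child.number(), Field.Store.YES, Field.Index.NOT_ANALYZED));
       luceneDoc.add(new Field("subdocterms", child.data(), Field.Store.NO, Field.Index.ANALYZED));
       writer.addDocument(luceneDoc);  // one Lucene document per sub-document
     }
   }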

This means that in your index, after indexing 2 docs with 2 subdocs each, you'll have

   (Lucene #)   doc#   docname   subdoc#   subdocterms
   ----------------------------------------------------
   0            100    Foo       101       subdoc1 terms here
   1            100    Foo       102       subdoc2 terms
   2            200    Bar       201       subdoc1 terms from doc2
   3            200    Bar       202       some more subdoc text

So the search you're doing is actually on the subdoc level. This can get complicated, especially as subdocs from the same parent doc may come back out of order, etc, depending on scoring/sorting.

Also, if there is a lot of data at the parent level, you're obviously duplicating it. This can get nasty.
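
To make the "out of order" point concrete, here is a minimal sketch in plain Java (no Lucene calls) of regrouping subdoc-level hits under their parent doc#. The Hit class and the groupByParent name are made up for illustration; in a real application the values would come from the stored 'doc#' and 'subdoc#' fields of each hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SubDocHitGrouper {
    /** Hypothetical holder for one subdoc-level hit. */
    static final class Hit {
        final String docNumber;
        final String subDocNumber;
        final float score;

        Hit(String docNumber, String subDocNumber, float score) {
            this.docNumber = docNumber;
            this.subDocNumber = subDocNumber;
            this.score = score;
        }
    }

    /**
     * Groups hits (already in rank order) by parent doc#. Parents appear in
     * the order of their first, i.e. best-ranked, hit.
     */
    static Map<String, List<String>> groupByParent(List<Hit> hitsInRankOrder) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Hit hit : hitsInRankOrder) {
            grouped.computeIfAbsent(hit.docNumber, k -> new ArrayList<>())
                   .add(hit.subDocNumber);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Subdocs of doc 200 are interleaved with a subdoc of doc 100.
        List<Hit> hits = Arrays.asList(
            new Hit("200", "202", 0.9f),
            new Hit("100", "101", 0.7f),
            new Hit("200", "201", 0.4f));
        System.out.println(groupByParent(hits)); // prints {200=[202, 201], 100=[101]}
    }
}
```

The insertion-ordered map keeps the parents ranked by their best subdoc, which is one possible policy among several.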

2) Maintain a (logically) separate subdoc index. You could have something like:

   doc#   docname  bigblobofdocdata
   ---------------------------------
   100    Foo      lots of data here...
   200    Bar      and lots more here..

in one index, and

   doc#   subdoc#   subdocterms
   ---------------------------------
   100    101       subdoc1 terms here
   100    102       subdoc2 terms
   200    201       subdoc1 terms from doc2
   200    202       some more subdoc text

Then you can FIRST search on the doc index to do any matches on 'docname' etc, then use the IDs you find to filter the subdoc index -- so if the user searches for 'docname=foo' and 'subdocterms=text', you first do the docname search to get the docname-matching doc (100), then do a search on the second index for 'subdocterms', but also filter where doc#=100.

Note they don't HAVE to be separate indexes -- you can actually keep these in the same physical index, with some sort of discriminator (all docs in an index don't have to have the same fields).
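
The two-stage lookup can be sketched in plain Java, with lists standing in for the two indexes. This only models the logic, it is not Lucene code; the class, the row types and the method names are all made up for illustration, and the rows are the ones from the tables above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TwoStageSearch {
    /** Hypothetical row of the doc-level index. */
    static final class DocRow {
        final String docNumber;
        final String docName;

        DocRow(String docNumber, String docName) {
            this.docNumber = docNumber;
            this.docName = docName;
        }
    }

    /** Hypothetical row of the subdoc-level index. */
    static final class SubDocRow {
        final String docNumber;
        final String subDocNumber;
        final String terms;

        SubDocRow(String docNumber, String subDocNumber, String terms) {
            this.docNumber = docNumber;
            this.subDocNumber = subDocNumber;
            this.terms = terms;
        }
    }

    /** Stage 1: collect the doc#s whose docname matches (case-insensitive). */
    static Set<String> matchDocNumbers(List<DocRow> docIndex, String docName) {
        Set<String> matches = new HashSet<>();
        for (DocRow row : docIndex) {
            if (row.docName.equalsIgnoreCase(docName)) {
                matches.add(row.docNumber);
            }
        }
        return matches;
    }

    /** Stage 2: search subdocs for a term, filtered to the allowed doc#s. */
    static List<String> matchSubDocs(List<SubDocRow> subDocIndex,
                                     Set<String> allowedDocNumbers, String term) {
        List<String> matches = new ArrayList<>();
        for (SubDocRow row : subDocIndex) {
            boolean hasTerm = Arrays.asList(row.terms.split("\\s+")).contains(term);
            if (hasTerm && allowedDocNumbers.contains(row.docNumber)) {
                matches.add(row.subDocNumber);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<DocRow> docs = Arrays.asList(new DocRow("100", "Foo"), new DocRow("200", "Bar"));
        List<SubDocRow> subDocs = Arrays.asList(
            new SubDocRow("100", "101", "subdoc1 terms here"),
            new SubDocRow("100", "102", "subdoc2 terms"),
            new SubDocRow("200", "201", "subdoc1 terms from doc2"),
            new SubDocRow("200", "202", "some more subdoc text"));
        Set<String> fooDocs = matchDocNumbers(docs, "foo");
        System.out.println(matchSubDocs(subDocs, fooDocs, "terms")); // prints [101, 102]
    }
}
```

With this sample data, searching doc 100 for 'text' returns nothing, because only subdoc 202 (under doc 200) contains that term.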

3) Do some really hardcore tricks with spanqueries. This is what I'm working on at the moment, so it's near and dear to my heart. It's not for the faint-hearted, though, and if you're new to Lucene may scare you off, sorry! Basically Lucene has the concept of 'positions' for terms -- metadata about where in the document the term can be found. This lets you do 'near' queries, etc.

We're taking advantage of that to do some many-to-one stuff like your problem. Using the first example, with term positions indicated in [], we position terms from different subdocs with a large gap between them, like so:


   (Lucene #)   doc#   docname   subdoc#   subdocterms
   ----------------------------------------------------
   0            100    Foo       101[0]    subdoc1[0] terms[1] here[2]
                                 102[100]  subdoc2[100] terms[101]

   1            200    Bar       201[0]    subdoc1[0] terms[1] from[2] doc2[3]
                                 202[100]  some[100] more[101] subdoc[102] text[103]

So in each doc, subdoc #1's terms start at 0, #2's at 100, #3's at 200, etc. Then when we search we can say 'the terms you're looking for must be in the same 100-position block' to find only subdocs that match all subdoc-related subqueries. This is pretty hairy but is working well for us -- it massively reduces our indexing and search times compared to the duplicated document way I mentioned above.
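
The position arithmetic can be modelled in plain Java as below. This is only a sketch of the block idea, not Lucene code and not our actual implementation: the class and method names are made up, the real thing goes through span queries, and it assumes no subdoc has more than 100 terms (otherwise its terms would spill into the next block).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class PositionBlocks {
    /** Gap between the start positions of consecutive subdocs. */
    static final int GAP = 100;

    /** Assigns positions: subdoc i (0-based) gets i*GAP, i*GAP+1, ... */
    static Map<String, List<Integer>> assignPositions(List<List<String>> subDocs) {
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int i = 0; i < subDocs.size(); i++) {
            List<String> terms = subDocs.get(i);
            if (terms.size() > GAP) {
                throw new IllegalArgumentException("subdoc " + i + " overflows its block");
            }
            for (int j = 0; j < terms.size(); j++) {
                positions.computeIfAbsent(terms.get(j), k -> new ArrayList<>())
                         .add(i * GAP + j);
            }
        }
        return positions;
    }

    /** Returns the block numbers (0-based subdoc slots) containing ALL query terms. */
    static Set<Integer> blocksMatchingAll(Map<String, List<Integer>> positions,
                                          List<String> queryTerms) {
        Set<Integer> result = null;
        for (String term : queryTerms) {
            Set<Integer> blocks = new TreeSet<>();
            for (int position : positions.getOrDefault(term, Collections.<Integer>emptyList())) {
                blocks.add(position / GAP); // integer division picks the block
            }
            if (result == null) {
                result = blocks;
            } else {
                result.retainAll(blocks);   // keep blocks shared by every term so far
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        // Doc 200 ("Bar") from the table above: subdoc 201, then subdoc 202.
        List<List<String>> bar = Arrays.asList(
            Arrays.asList("subdoc1", "terms", "from", "doc2"),
            Arrays.asList("some", "more", "subdoc", "text"));
        Map<String, List<Integer>> positions = assignPositions(bar);
        System.out.println(blocksMatchingAll(positions, Arrays.asList("some", "text")));    // prints [1]
        System.out.println(blocksMatchingAll(positions, Arrays.asList("subdoc1", "text"))); // prints []
    }
}
```

The second query matches the parent doc as a whole but no single subdoc, which is exactly the case the block constraint is meant to reject.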


Cheers,

Paul

