oh...@cox.net wrote:
> - I'd have to create a (very small) index for each sub-document, where I do
>   the Document.add() with just the (for example) two terms, then
> - Run a query against the 1-entry index, which
> - Would either give me a "yes" or "no" (for that sub-document)
>
> As I said, I'm concerned about overhead. Some of the documents are quite large,
> containing >20K sub-documents. That means that, for such a document, I'd have to
> create >20K indexes.
No, I'm talking about a separate document in the same index.
There are a few approaches here:
1) Index each sub-document separately. So if you have fields 'doc#',
'docname', 'subdoc#', and 'subdocterms', you might do:
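    for (Doc parent : docs) {
      for (SubDoc child : parent.subDocs()) {
        // One Lucene document per sub-document; parent fields are repeated.
        Document luceneDoc = new Document();
        // Field store/index options omitted for brevity.
        luceneDoc.add(new Field("doc#", parent.number()));
        luceneDoc.add(new Field("docname", parent.name()));
        luceneDoc.add(new Field("subdoc#", child.number()));
        luceneDoc.add(new Field("subdocterms", child.data()));
        writer.addDocument(luceneDoc);  // 'writer' is your IndexWriter
      }
    }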
This means that in your index after indexing 2 docs with 2 subdocs each,
you'll have
(Lucene #)  doc#  docname  subdoc#  subdocterms
----------------------------------------------------------
0           100   Foo      101      subdoc1 terms here
1           100   Foo      102      subdoc2 terms
2           200   Bar      201      subdoc1 terms from doc2
3           200   Bar      202      some more subdoc text
So the search you're doing is actually at the subdoc level. This can get
complicated, especially as subdocs from the same parent doc may come
back out of order, depending on scoring/sorting, so you may have to
regroup the hits by parent yourself.
Also, if there is a lot of data at the parent level, you're duplicating
it for every subdoc, which gets expensive quickly.
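One way to cope with the out-of-order hits in approach 1 is to regroup them by parent after the search. The following is only a sketch in plain Java with no Lucene calls; it assumes the hits have already been reduced to their doc# and subdoc# values in rank order, and SubDocHit and HitGrouper are hypothetical names, not part of Lucene:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one hit from the flattened index: just the
// doc# and subdoc# fields, in the order the search returned them.
final class SubDocHit {
    final int docNumber;
    final int subDocNumber;

    SubDocHit(int docNumber, int subDocNumber) {
        this.docNumber = docNumber;
        this.subDocNumber = subDocNumber;
    }
}

final class HitGrouper {
    // Groups hits by parent doc#. Parents appear in the order of their
    // best-ranked subdoc; subdocs keep their rank order within each parent.
    static Map<Integer, List<Integer>> groupByParent(List<SubDocHit> hits) {
        Map<Integer, List<Integer>> grouped = new LinkedHashMap<>();
        for (SubDocHit hit : hits) {
            grouped.computeIfAbsent(hit.docNumber, k -> new ArrayList<>())
                   .add(hit.subDocNumber);
        }
        return grouped;
    }
}
```

If the hits come back as (200, 202), (100, 101), (200, 201), the grouped result lists parent 200 first with subdocs 202 and 201, then parent 100 with subdoc 101.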
2) Maintain a (logically) separate subdoc index. You could have
something like:
doc#  docname  bigblobofdocdata
-----------------------------------
100   Foo      lots of data here...
200   Bar      and lots more here..
in one index, and
doc#  subdoc#  subdocterms
--------------------------------------
100   101      subdoc1 terms here
100   102      subdoc2 terms
200   201      subdoc1 terms from doc2
200   202      some more subdoc text
Then you can FIRST search on the doc index to do any matches on
'docname' etc, then use the IDs you find to filter the subdoc index --
so if the user searches for 'docname=foo' and 'subdocterms=terms', you
first do the docname search to get the docname-matching doc (100), then
do a search on the second index for 'subdocterms', but also filter where
doc#=100.
Note they don't HAVE to be separate indexes -- you can actually keep
these in the same physical index, with some sort of discriminator (all
docs in an index don't have to have the same fields).
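The two-step lookup in approach 2 can be sketched without any Lucene code. This is a rough in-memory model only: the two arrays stand in for the two (logical) indexes using the sample rows above, the term match is a simple whitespace split, and TwoStepSearch and its method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TwoStepSearch {
    // Rows of the parent "index": {doc#, docname}.
    static final String[][] DOCS = {
        {"100", "Foo"},
        {"200", "Bar"},
    };
    // Rows of the subdoc "index": {doc#, subdoc#, subdocterms}.
    static final String[][] SUBDOCS = {
        {"100", "101", "subdoc1 terms here"},
        {"100", "102", "subdoc2 terms"},
        {"200", "201", "subdoc1 terms from doc2"},
        {"200", "202", "some more subdoc text"},
    };

    // Step 1: collect the doc# of every parent whose docname matches.
    static Set<String> matchingDocNumbers(String docName) {
        Set<String> ids = new HashSet<>();
        for (String[] doc : DOCS) {
            if (doc[1].equalsIgnoreCase(docName)) {
                ids.add(doc[0]);
            }
        }
        return ids;
    }

    // Step 2: search subdocterms, keeping only rows whose doc# passed step 1.
    static List<String> search(String docName, String term) {
        Set<String> allowed = matchingDocNumbers(docName);
        List<String> subDocNumbers = new ArrayList<>();
        for (String[] sub : SUBDOCS) {
            boolean hasTerm = Arrays.asList(sub[2].split(" ")).contains(term);
            if (allowed.contains(sub[0]) && hasTerm) {
                subDocNumbers.add(sub[1]);
            }
        }
        return subDocNumbers;
    }
}
```

With the sample rows, TwoStepSearch.search("foo", "terms") yields subdocs 101 and 102, while TwoStepSearch.search("foo", "text") yields nothing, because the only subdoc containing "text" belongs to doc 200.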
3) Do some really hardcore tricks with span queries. This is what I'm
working on at the moment, so it's near and dear to my heart. It's not
for the faint-hearted, though, and if you're new to Lucene it may scare
you off, sorry! Basically Lucene has the concept of 'positions' for terms --
metadata about where in the document each term can be found. This lets
you do 'near' queries, etc.
We're taking advantage of that to do some many-to-one stuff like your
problem. Using the first example, with term positions indicated in [],
we position terms from different subdocs with a large gap between them,
like so:
(Lucene #)  doc#  docname  subdoc#   subdocterms
------------------------------------------------------------------------------
0           100   Foo      101[0]    subdoc1[0] terms[1] here[2]
                           102[100]  subdoc2[100] terms[101]
1           200   Bar      201[0]    subdoc1[0] terms[1] from[2] doc2[3]
                           202[100]  some[100] more[101] subdoc[102] text[103]
So in each doc, subdoc #1's terms start at 0, #2's at 100, #3's at 200,
etc. Then when we search we can say 'the terms you're looking for must
be in the same 100-position block' to find only subdocs that match all
subdoc-related subqueries. This is pretty hairy but is working well for
us -- it massively reduces our indexing and search times compared to the
duplicated document way I mentioned above.
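The position arithmetic behind this can be illustrated in plain Java. This is not our actual span-query code and makes no Lucene calls; it is a toy model that assumes a gap of 100 and fewer than 100 terms per subdoc, with BlockPositions and its methods invented for the example. How the positions actually get written to the index (for example through token position increments) is not shown:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class BlockPositions {
    // Positions reserved per subdoc (assumption: no subdoc has more terms).
    static final int GAP = 100;

    // Assigns positions for one parent doc: subdoc i's terms start at i * GAP.
    static Map<String, List<Integer>> index(List<String[]> subDocs) {
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int i = 0; i < subDocs.size(); i++) {
            String[] terms = subDocs.get(i);
            for (int j = 0; j < terms.length; j++) {
                positions.computeIfAbsent(terms[j], k -> new ArrayList<>())
                         .add(i * GAP + j);
            }
        }
        return positions;
    }

    // Returns the block numbers (subdoc indexes) that contain every query term.
    static Set<Integer> blocksContainingAll(Map<String, List<Integer>> positions,
                                            String... queryTerms) {
        Set<Integer> result = null;
        for (String term : queryTerms) {
            Set<Integer> blocks = new HashSet<>();
            for (int pos : positions.getOrDefault(term, new ArrayList<>())) {
                blocks.add(pos / GAP); // integer division maps position to block
            }
            if (result == null) {
                result = blocks;
            } else {
                result.retainAll(blocks);
            }
        }
        return result == null ? new HashSet<>() : result;
    }
}
```

Using doc 200 from the table, "terms" and "doc2" both fall in block 0, so that subdoc matches; "terms" (position 1) and "text" (position 103) fall in different blocks, so a query requiring both matches no subdoc even though both terms occur in the same parent doc.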
Cheers,
Paul