Let me see if I have a clue what you're trying to do. Warning: I'm a bit
confused since "filter" has a very specific meaning in Lucene, so when you
talk about filters I'm assuming that you're NOT talking about Lucene
filters, but rather just a set of flags you're associating with each
document, and then a set of flags with the same semantics at search time.

If that's all true, here's the first approach I would take...

Don't store a bitmask in Lucene. Rather, index a field for each flag with
each document. NOTE: you do NOT have to have the same fields for all
documents....

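Something like (Lucene 2.0-style API, where writer is an open IndexWriter)
Document doc = new Document();
doc.add(new Field("flag1", "Y", Field.Store.NO, Field.Index.UN_TOKENIZED));
doc.add(new Field("flag2", "Y", Field.Store.NO, Field.Index.UN_TOKENIZED));
writer.addDocument(doc);

doc = new Document();
// NOTE: no "flag2" field. No problem.
doc.add(new Field("flag1", "Y", Field.Store.NO, Field.Index.UN_TOKENIZED));
writer.addDocument(doc);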

Now your searches are simple. Just search for the "or" of the fields with
the flags you're interested in, i.e. flag1:Y OR flag2:Y .....
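
If your application already carries an integer bit mask around, bridging to
per-flag fields could look roughly like the sketch below. It is plain Java
with no Lucene calls, and it assumes that bit 0 maps to "flag1", bit 1 to
"flag2", and so on. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the names and the bit-to-field mapping are assumptions.
class FlagMaskQuery {

    // Returns the field names ("flag1", "flag2", ...) for every bit set in mask.
    // Bit 0 maps to "flag1", bit 1 to "flag2", and so on.
    static List<String> fieldsFor(long mask) {
        List<String> fields = new ArrayList<>();
        for (int bit = 0; bit < Long.SIZE; bit++) {
            if ((mask & (1L << bit)) != 0) {
                fields.add("flag" + (bit + 1));
            }
        }
        return fields;
    }

    // Builds a query string that ORs one "field:Y" clause per set flag.
    static String toQuery(long mask) {
        List<String> clauses = new ArrayList<>();
        for (String field : fieldsFor(mask)) {
            clauses.add(field + ":Y");
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        // Mask with bits 0 and 2 set, i.e. flags 1 and 3.
        System.out.println(toQuery(0b101L));
    }
}
```

The same fieldsFor() list could drive indexing too: add one field per name
returned, so the index and the query agree on which bit means which flag.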

If you want to get fancy, you can use TermDocs to enumerate the document IDs
with values for specified fields and perhaps even create a Lucene filter
based on that enumeration. This is much faster than you may think.
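
To make the idea concrete without tying it to a particular Lucene version,
here is a toy model in plain Java. It assumes the IDs of the documents that
contain a flag term have already been enumerated (the job TermDocs does in
Lucene) and only shows turning them into a bitset with one bit per document.
The class and method names are hypothetical.

```java
import java.util.BitSet;

// Toy model only: the int[] stands in for the doc IDs an index enumeration yields.
class FlagBitsDemo {

    // Sets one bit per matching document ID; maxDoc is the total number of documents.
    static BitSet bitsFor(int[] matchingDocIds, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int docId : matchingDocIds) {
            bits.set(docId);
        }
        return bits;
    }

    public static void main(String[] args) {
        // Pretend documents 0, 3 and 7 out of 8 carry flag1.
        BitSet flag1 = bitsFor(new int[] {0, 3, 7}, 8);
        System.out.println("doc 3 flagged: " + flag1.get(3));
        System.out.println("flagged docs: " + flag1.cardinality());
    }
}
```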

You probably want to index but not store such flags.

I suspect that this will be waaaaay faster than trying to inspect a binary
field in each document and then see if the bits were set, because that would
require you to read each document rather than just look at the terms in the
index.

I doubt that this will add much in the way of size to your index, and
anyway, disk space is cheap.

NOTE: the down side here is that you must delete and re-add a document to
modify it, which may be slow. But you'd have to do that when you updated
your bit mask anyway.....


Another approach: create a set of Lucene Filters (really, these are just
Java bitsets), one for each flag. Each one is a bitset with one bit for each
document, or about 1 MB of memory per flag with 8M docs. So you'd populate
flag1Filter, flag2Filter... and have these ready whenever you needed them.

You should very rapidly be able to do any of the logical operations on these
bitsets (OR in your case) and use the resulting Lucene filter in your query.
It's up to you whether you create these filters as part of server warm-up or
just create them when needed, letting the first user who encounters them pay
the price for creating them. This is kind of the Solr warm-up idea. The
CachingWrapperFilter class should keep these around for you. Creating a
Lucene filter is much faster than I expected. See Lucene in Action for a
sample.
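
The bitset arithmetic itself is simple. Below is a minimal sketch using
java.util.BitSet. It leaves out how the result gets wrapped in a Lucene
Filter, since that depends on your Lucene version, and the class and method
names are made up. At one bit per document, 8,000,000 documents come to
8,000,000 / 8 = 1,000,000 bytes per flag, which is where the 1 MB figure
comes from.

```java
import java.util.BitSet;

// Hypothetical helper for combining per-flag bitsets; not a Lucene class.
class FlagFilterUnion {

    // ORs the given per-flag bitsets into a new bitset; the inputs are left unchanged.
    static BitSet anyOf(BitSet... flagBits) {
        BitSet result = new BitSet();
        for (BitSet bits : flagBits) {
            result.or(bits);
        }
        return result;
    }

    // Bytes needed for one flag at one bit per document, rounded up.
    static long bytesPerFlag(long numDocs) {
        return (numDocs + 7) / 8;
    }

    public static void main(String[] args) {
        BitSet flag1 = new BitSet();
        flag1.set(0);
        flag1.set(2);
        BitSet flag2 = new BitSet();
        flag2.set(2);
        flag2.set(5);

        // Documents that have flag1 or flag2 (or both).
        BitSet either = anyOf(flag1, flag2);
        System.out.println("matching docs: " + either);
        System.out.println("bytes per flag for 8M docs: " + bytesPerFlag(8_000_000L));
    }
}
```

Building a fresh result in anyOf() rather than OR-ing into one of the inputs
keeps the cached per-flag bitsets reusable across queries.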

Of course, I may be entirely misunderstanding your problem, in which case
I'd ask you to explain a bit more <G>.

Best
Erick

On 11/9/06, Larry Taylor <[EMAIL PROTECTED]> wrote:

Hello,

I am currently evaluating Lucene to see if it would be appropriate to
replace my company's current search software. So far everything has been
looking great, however there is one requirement that I am not too
certain about.

What we need to do is to be able to store a bit mask specifying various
filter flags for a document in the index and then search this field by
specifying another bit mask with desired filters, returning documents
that have any of the specified flags set. In other words, we are doing a
bitwise AND on the stored filter bit mask and the specified filter bit
mask and if it is non-zero, we want to return the document.

Before I started toying around with various options myself, I wanted to
see if any of you good folks in the Lucene community had some
suggestions for an efficient way to implement this.

We currently need to index ~8,000,000 documents. We have several filter
flag fields, the most important of which currently has 7 possible flags
with any combination of the flags being valid. The number of flags is
expected to increase rather rapidly in the near future.

My preemptive thanks for your suggestions,


Lawrence Taylor
Senior Software Engineer
Employon
Message was edited by: ltaylor.employon