Trying to generate a list of DISTINCT field names from all documents in an index

todd.hunt Thu, 15 Dec 2011 06:31:48 -0800

Hi,
 
I have come across a problem with our code that is not scaling well and I'm
hoping there is a way I can tweak our existing code to run faster.
 
We are indexing on a Java object called "Node".  A "Node" can have one or
more "Attributes".  The "Attributes" consist of a key / value pair and the
index value of the Node they are associated with.  The Attributes are
basically metadata about the Node.  We are using a FieldBridge to add the
Attribute keys and values to the Node "document" in Lucene.
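
To make the data model concrete, here is a minimal plain-Java sketch of how a Node's Attributes end up as fields on one document. The class shapes, the "id" field name, and the toDocumentFields method are assumptions for illustration only; they are not our real classes and not the FieldBridge API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one Attribute: a key / value pair. */
final class Attribute {
    final String key;
    final String value;

    Attribute(String key, String value) {
        this.key = key;
        this.value = value;
    }
}

/** Hypothetical stand-in for a Node and its Attributes. */
final class Node {
    final long id;
    final List<Attribute> attributes = new ArrayList<>();

    Node(long id) {
        this.id = id;
    }

    /** Flattens the Node into field name -> value pairs, one field per Attribute key. */
    Map<String, String> toDocumentFields() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", Long.toString(id)); // assumed name for the Node's own field
        for (Attribute a : attributes) {
            // Simplification: a repeated key overwrites the earlier value here.
            fields.put(a.key, a.value);
        }
        return fields;
    }
}
```

The point of the sketch is that the set of field names differs from document to document, because each Attribute key becomes its own field name.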
 
Our current logic uses a Collector to find all of the "Attributes"
associated with a Node document and put them into a Set.  That Set is then
returned to the UI so that the user can have a drop-down list of choices to
search on.
 
Here is part of the Collector code:
 
    try {
        searcher.search(query, new Collector() {
            private int docBase;

            @Override
            public void setScorer(Scorer scorer) throws IOException {
                // No-op: scores are not needed to collect field names.
            }

            @Override
            public void collect(int docId) {
                // docId is relative to the current segment reader.
                int doc = docId + docBase;
                try {
                    // Loads the stored fields of every matching document.
                    Document document = searcher.doc(doc);
                    List fieldList = document.getFields();
                    for (Object fieldObj : fieldList) {
                        if (fieldObj instanceof Fieldable) {
                            Fieldable field = (Fieldable) fieldObj;
                            String fieldName = field.name();
                            if (!excludedFieldNameSet.contains(fieldName)) {
                                results.add(fieldName);
                            }
                        }
                    }
                } catch (IOException e) {
                    throw JavaUtils.asRuntimeException(e);
                }
            }

            @Override
            public void setNextReader(IndexReader indexReader, int docBase) throws IOException {
                this.docBase = docBase;
            }

            @Override
            public boolean acceptsDocsOutOfOrder() {
                return true;
            }
        });
    } catch (IOException e) {
        throw JavaUtils.asRuntimeException(e);
    }
}


This logic was very fast with our customers who had tens of thousands of
Nodes with 2 or more Attributes per node.  But now we have a customer with
over a million nodes and at least 5 attributes per node.  So it is taking 10
to 20 seconds to generate this list, which is way too slow.
 
My "Plan B" is to cache the list of unique attribute fields either in
another Lucene index, EHCache, or in memory on the server.
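
For the in-memory variant of Plan B, a rough sketch of what I have in mind is below. It uses only the JDK; the class name AttributeNameCache and its methods are hypothetical, and it assumes our indexing code can hand over each document's field names at the moment the Node is indexed.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory cache of distinct attribute field names. */
final class AttributeNameCache {
    // Concurrent set so indexing threads and UI requests can share it safely.
    private final Set<String> names = ConcurrentHashMap.newKeySet();
    // Field names that must never appear in the drop-down list.
    private final Set<String> excluded;

    AttributeNameCache(Collection<String> excludedFieldNames) {
        this.excluded = new HashSet<>(excludedFieldNames);
    }

    /** Call whenever a Node document is indexed, passing its field names. */
    void recordFieldNames(Collection<String> fieldNames) {
        for (String name : fieldNames) {
            if (!excluded.contains(name)) {
                names.add(name);
            }
        }
    }

    /** Returns a sorted copy of the names seen so far, for the UI. */
    SortedSet<String> snapshot() {
        return new TreeSet<>(names);
    }
}
```

With this, the UI request would only copy a small set instead of visiting every document. Two gaps remain in the sketch: the cache would have to be filled once from the existing index at startup (for example with the slow scan above), and a name is never removed when the last Node carrying that attribute is deleted, so it would need an occasional rebuild.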
 
The reason we started down this path is because the attributes that can be
added to a node are dynamic.  So initially, going through all the documents
looking for unique attributes seemed like a good solution.  
 
I've read through the Lucene In Action book and various postings online. 
Maybe I'm not looking for the correct terms, but I can't find anything that
will return and cache a list of unique field names.  If anyone can help
point me towards a better solution, please let me know.  As I said
above, I'd like to be able to keep most of what we have now, but if I need
to scrap this code and do something different, I'm all for it.  I'd even
change the way our Node document is stored in Lucene if that would make a
difference.
 
Thank you,
 
Todd
