Hi,

I have come across a problem with our code that is not scaling well, and I'm hoping there is a way to tweak our existing code to run faster.

We are indexing a Java object called "Node". A Node can have one or more "Attributes". Each Attribute consists of a key/value pair plus the index value of the Node it is associated with. The Attributes are basically metadata about the Node. We use a FieldBridge to add the Attribute keys and values to the Node "document" in Lucene.

Our current logic uses a Collector to visit every Node document matched by the query, read the names of its fields (the Attribute keys), and put them into a Set. That Set is then returned to the UI so that the user gets a drop-down list of attributes to search on.

Here is part of the Collector code:

        searcher.search(query, new Collector() {
            private int docBase;

            @Override
            public void setScorer(Scorer scorer) throws IOException {
                // No-op
            }

            @Override
            public void collect(int docId) {
                int doc = docId + docBase;
                try {
                    Document document = searcher.doc(doc);
                    List fieldList = document.getFields();
                    for (Object fieldObj : fieldList) {
                        if (fieldObj instanceof Fieldable) {
                            Fieldable field = (Fieldable) fieldObj;
                            String fieldName = field.name();
                            if (!excludedFieldNameSet.contains(fieldName)) {
                                results.add(fieldName);
                            }
                        }
                    }
                } catch (IOException e) {
                    throw JavaUtils.asRuntimeException(e);
                }
            }

            @Override
            public void setNextReader(IndexReader indexReader, int docBase) throws IOException {
                this.docBase = docBase;
            }

            @Override
            public boolean acceptsDocsOutOfOrder() {
                return true;
            }
        });
    } catch (IOException e) {
        throw JavaUtils.asRuntimeException(e);
    }
}
This logic was very fast for our customers who had tens of thousands of Nodes with 2 or more Attributes per Node. But now we have a customer with over a million Nodes and at least 5 Attributes per Node, and it takes 10 to 20 seconds to generate this list, which is way too slow.

My "Plan B" is to cache the list of unique attribute fields, either in another Lucene index, in EHCache, or in memory on the server.

We started down this path because the attributes that can be added to a Node are dynamic, so initially, going through all the documents looking for unique attributes seemed like a good solution.

I've read through the Lucene in Action book and various postings online. Maybe I'm not searching for the correct terms, but I can't find anything that will return and cache a list of unique field names.

If anyone can point me towards a better solution, please let me know. As I said, I'd like to keep most of what we have now, but if I need to scrap this code and do something different, I'm all for it. I'd even change the way our Node document is stored in Lucene if that would make a difference.

Thank you,
Todd

--
View this message in context: http://lucene.472066.n3.nabble.com/Trying-to-generate-a-list-of-DISTINCT-field-names-from-all-documents-in-an-index-tp3588729p3588729.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
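
To make the in-memory variant of "Plan B" concrete, here is a minimal sketch of what such a cache might look like. It uses only the JDK, and everything in it is an assumption for illustration: the class name AttributeNameRegistry and its methods are hypothetical and not part of Lucene, Hibernate Search, or our code. The idea is to record each attribute key at the moment it is written to the index (for example from the FieldBridge), so that building the drop-down list no longer has to load every stored document.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical helper (not a Lucene class): remembers every distinct
// attribute key seen at index time so the UI list can be served from memory.
class AttributeNameRegistry {
    // Field names that should never appear in the drop-down list.
    private final Set<String> excluded;
    // Thread-safe, sorted set of the distinct names recorded so far.
    private final ConcurrentSkipListSet<String> names = new ConcurrentSkipListSet<String>();

    AttributeNameRegistry(Collection<String> excludedFieldNames) {
        this.excluded = Collections.unmodifiableSet(new HashSet<String>(excludedFieldNames));
    }

    // Assumed to be called wherever an attribute field is added to a Node document.
    void record(String fieldName) {
        if (fieldName != null && !excluded.contains(fieldName)) {
            names.add(fieldName);
        }
    }

    // Returns a sorted copy, so callers cannot modify the registry itself.
    SortedSet<String> snapshot() {
        return new TreeSet<String>(names);
    }

    public static void main(String[] args) {
        AttributeNameRegistry registry = new AttributeNameRegistry(Arrays.asList("id"));
        registry.record("color");
        registry.record("id");     // excluded, ignored
        registry.record("size");
        registry.record("color");  // duplicate, ignored
        System.out.println(registry.snapshot()); // prints [color, size]
    }
}
```

Two caveats with this sketch: the registry starts empty after a server restart, so it would need to be rebuilt once (for instance with a single pass like the Collector above) or persisted; and a name stays in the set even after the last Node using that attribute is deleted, unless the set is periodically rebuilt.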