Re: simple (?) question about scoring
Michele, On Friday 03 November 2006 07:07, Michele Amoretti wrote: > I have a question: is the score for a document different if I have > only that document in my index, or if I have N documents? > If the answer is yes, I will put all N documents together, otherwise I > will evaluate them one by one. > > Btw, I will ask the ws developer about how queries are interpreted by > the search engine. To compute the score for only a subset of the lucene documents one normally uses a Filter. Assuming you get the primary keys of the docs to be scored, you can look them up in the lucene index and use their internal lucene document numbers to create the Filter. Then search your query with this filter. Have a look at the source code of RangeFilter.bits() to see how to get to the internal document numbers from a set of terms. Btw, when your database uses the same query to obtain this set of documents, you might consider moving this function into Lucene completely, because this will allow you to avoid using a filter altogether. Regards, Paul Elschot > > Thanks > > On 11/3/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: > > > > : the list is not ordered (I do not know the details of the search > > : engine, I only have its result for a query) > > : > > : then I have this list of documents, which represents a subset of the corpus > > : > > : I have to rank the documents of the list, using your scoring algorithm > > > > In other words, out of a large corpus C, this webservice has > > told you that the documents comprising subset S are the top N matching > > documents for your query Q (where N << sizeof(C)) > > > > your goal is to sort S as best as possible. > > > > You could try indexing all the docs in S in a Lucene RAMDirectory and then > > search on them, but my original point about the score being > > fairly meaningless in an index of only 1 document still applies somewhat > > ...
if all of the documents you get back already have a lot in common > > (they must have something in common or the webservice wouldn't have > > returned them in response to your query) it may be hard to get a > > meaningful document frequency of the words in your query. > > > > you also may run into confusion about what exactly your query "is" and > > whether or not your interpretation matches that of the webservice ... at a > > very simplistic level, if the query is "Java Lucene" and your webservice > > interprets that as an "OR" query and you interpret that as an "AND" query, > > you might find that the scores you compute for all the docs are 0. > > > > If i were in your shoes, i'd try to work with whoever runs this webservice > > to make it return more useful information -- at the very least to return > > results in sorted order. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
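For readers wanting to try Paul's suggestion, here is a rough sketch against the Lucene 2.0 API (untested; the field name and the shape of the key lookup are assumptions, adapt to whatever field holds your primary keys):

```java
import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

/**
 * Restricts scoring to documents whose primary-key field matches one
 * of the given keys, in the spirit of RangeFilter.bits(). The field
 * name passed in (e.g. "pk") is hypothetical.
 */
public class PrimaryKeyFilter extends Filter {
    private final String field;
    private final String[] keys;

    public PrimaryKeyFilter(String field, String[] keys) {
        this.field = field;
        this.keys = keys;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs();
        try {
            for (int i = 0; i < keys.length; i++) {
                // Look the key up and mark its internal lucene document
                // number(s), as Paul describes.
                termDocs.seek(new Term(field, keys[i]));
                while (termDocs.next()) {
                    bits.set(termDocs.doc());
                }
            }
        } finally {
            termDocs.close();
        }
        return bits;
    }
}
```

Then search with something like `searcher.search(query, new PrimaryKeyFilter("pk", keys))`, so only the listed documents are scored.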
for admins: mailing list like spam
Hi, why not put an automatic [LUCENE USER] tag at the beginning of e-mail subjects? It would make the mailing list easier to read (I am using Gmail and I do not have client-side filters). -- Michele Amoretti, Ph.D. Distributed Systems Group Dipartimento di Ingegneria dell'Informazione Università degli Studi di Parma http://www.ce.unipr.it/people/amoretti
Re: search within search
Hi Doron, good call, thanks. I have another problem: the way I have coded it, I do not perform a real search-within-search, because for the second search I go back to the index directory and search the entire index again, rather than using the cached results of the first search. How can I solve this problem? Do I need to use a QueryFilter and restructure the code, which is time consuming, or is there any way to get it done without restructuring? Or do I need to use a BitSet within my existing code? Thanks. regards, Wooi Meng -- View this message in context: http://www.nabble.com/search-within-search-tf2558237.html#a7153721 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Announcement: Lucene powering Monster job search index (Beta)
Hi Peter, When I use the custom HitCollector, it affects the application performance. How do you accomplish grouping the results without affecting performance? Also, if possible, please give a code snippet for the custom HitCollector. TIA Sri "Peter Keegan" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Joe, > > Fields with numeric values are stored in a separate file as binary values > in > an internal format. Lucene is unaware of this file and unaware of the > range > expression in the query. The range expression is parsed outside of Lucene > and used in a custom HitCollector to filter out documents that aren't in > the > requested range(s). A goal was to do this without having to modify Lucene. > Our scheme is pretty efficient, but not very general purpose in its > current > form, though. > > Peter > > > On 10/30/06, Joe Shaw <[EMAIL PROTECTED]> wrote: >> >> Hi Peter, >> >> On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote: >> > Numeric range search is one of Lucene's weak points (performance-wise) >> so we >> > have implemented this with a custom HitCollector and an extension to >> > the >> > Lucene index files that stores the numeric field values for all >> documents. >> > >> > It is important to point out that this has all been implemented with >> > the >> > stock Lucene 2.0 library. No code changes were made to the Lucene core. >> >> Can you give some technical details on the extension to the Lucene index >> files? How did you do it without making any changes to the Lucene core? >> >> Thanks, >> Joe >> >> >> - >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> >
Re: simple (?) question about scoring
Ok, sorry I did not read it in depth. Now, where can I find an example of: - building the RAMDirectory - scoring all documents against the query? thanks On 11/3/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: : I have a question: is the score for a document different if I have : only that document in my index, or if I have N documents? : If the answer is yes, I will put all N documents together, otherwise I : will evaluate them one by one. as i said before, yes it does... >> For most of the various types of Queries that exist in Lucene, the >> score is very dependent on how common the Terms involved are in the >> Corpus as a whole -- if your Corpus consists of only 1 Document, then >> your scores are going to be relatively meaningless. ...you will see a big difference between an index containing 1 doc, and an index containing 10 docs which all match your query, and an index containing 10 docs. I believe Doron already suggested you take a look at the document explaining how Lucene's Scoring works, correct? ... http://lucene.apache.org/java/docs/scoring.html -Hoss -- Michele Amoretti, Ph.D. Distributed Systems Group Dipartimento di Ingegneria dell'Informazione Università degli Studi di Parma http://www.ce.unipr.it/people/amoretti
Re: simple (?) question about scoring
http://javatechniques.com/public/java/docs/basics/lucene-memory-search.html is this good? it seems to be good.. On 11/3/06, Michele Amoretti <[EMAIL PROTECTED]> wrote: Ok, sorry I did not read it in depth. Now, where can I find an example of: - building the RAMDirectory - scoring all documents against the query? thanks On 11/3/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: > > : I have a question: is the score for a document different if I have > : only that document in my index, or if I have N documents? > : If the answer is yes, I will put all N documents together, otherwise I > : will evaluate them one by one. > > as i said before, yes it does... > > >> For most of the various types of Queries that exist in Lucene, the > >> score is very dependent on how common the Terms involved are in the > >> Corpus as a whole -- if your Corpus consists of only 1 Document, then > >> your scores are going to be relatively meaningless. > > ...you will see a big difference between an index containing 1 doc, and an > index containing 10 docs which all match your query, and an index > containing 10 docs. > > I believe Doron already suggested you take a look at the Document > explainaing how Lucene's Scoring works correct? ... > >http://lucene.apache.org/java/docs/scoring.html > > > -Hoss > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > -- Michele Amoretti, Ph.D. Distributed Systems Group Dipartimento di Ingegneria dell'Informazione Università degli Studi di Parma http://www.ce.unipr.it/people/amoretti -- Michele Amoretti, Ph.D. Distributed Systems Group Dipartimento di Ingegneria dell'Informazione Università degli Studi di Parma http://www.ce.unipr.it/people/amoretti - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: for admins: mailing list like spam
On Nov 3, 2006, at 3:20 AM, Michele Amoretti wrote: why not to put a [LUCENE USER] automatic tag at the beginning of e-mails subjects? Because the To and Reply-to headers indicate the list. All Apache e-mail lists operate the same, and we are not going to change this behavior. Erik
Re: simple (?) question about scoring
Yes! I modified the example to be compliant with 2.1 api, and I added the hits.score() call, for each discovered results. It works! [java] Hits for "freedom" were found in quotes by: [java] 1. Mohandas Gandhi with score = 0.53033006 [java] 2. Ayn Rand with score = 0.25 [java] 3. Friedrich Hayek with score = 0.1875 [java] Hits for "free" were found in quotes by: [java] 1. Ayn Rand with score = 0.5986179 [java] Hits for "progress or achievements" were found in quotes by: [java] 1. Theodore Roosevelt with score = 0.14965448 [java] 2. Friedrich Hayek with score = 0.11224086 I will start from this, for my purposes. Thank you for all the hints. Michele On 11/3/06, Michele Amoretti <[EMAIL PROTECTED]> wrote: http://javatechniques.com/public/java/docs/basics/lucene-memory-search.html is this good? it seems to be good.. On 11/3/06, Michele Amoretti <[EMAIL PROTECTED]> wrote: > Ok, sorry I did not read it in depth. > > Now, where can I find an example of: > > - building the RAMDirectory > - scoring all documents against the query? > > thanks > > On 11/3/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: > > > > : I have a question: is the score for a document different if I have > > : only that document in my index, or if I have N documents? > > : If the answer is yes, I will put all N documents together, otherwise I > > : will evaluate them one by one. > > > > as i said before, yes it does... > > > > >> For most of the various types of Queries that exist in Lucene, the > > >> score is very dependent on how common the Terms involved are in the > > >> Corpus as a whole -- if your Corpus consists of only 1 Document, then > > >> your scores are going to be relatively meaningless. > > > > ...you will see a big difference between an index containing 1 doc, and an > > index containing 10 docs which all match your query, and an index > > containing 10 docs. > > > > I believe Doron already suggested you take a look at the Document > > explainaing how Lucene's Scoring works correct? ... 
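For the archive, the in-memory approach from that page boils down to something like the following sketch (Lucene 2.0-style API, untested; the field name and sample data are placeholders):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class MemorySearchSketch {
    public static void main(String[] args) throws Exception {
        // Index the N documents returned by the web service in memory.
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
        String[] texts = { /* the documents to be ranked */ };
        for (int i = 0; i < texts.length; i++) {
            Document doc = new Document();
            doc.add(new Field("content", texts[i], Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.close();

        // Score all of them against the query and print the ranking.
        IndexSearcher searcher = new IndexSearcher(directory);
        Query query = new QueryParser("content", new StandardAnalyzer()).parse("freedom");
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println((i + 1) + ". " + hits.doc(i).get("content")
                + " with score = " + hits.score(i));
        }
        searcher.close();
    }
}
```

As Hoss noted, the scores are only as meaningful as the document frequencies in this tiny N-document corpus allow.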
> > > >http://lucene.apache.org/java/docs/scoring.html > > > > > > -Hoss > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > -- > Michele Amoretti, Ph.D. > Distributed Systems Group > Dipartimento di Ingegneria dell'Informazione > Università degli Studi di Parma > http://www.ce.unipr.it/people/amoretti > -- Michele Amoretti, Ph.D. Distributed Systems Group Dipartimento di Ingegneria dell'Informazione Università degli Studi di Parma http://www.ce.unipr.it/people/amoretti -- Michele Amoretti, Ph.D. Distributed Systems Group Dipartimento di Ingegneria dell'Informazione Università degli Studi di Parma http://www.ce.unipr.it/people/amoretti - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Suspected problem in the QueryParser
Hi, I recently stumbled across what I think might be a bug in the QueryParser. Before I enter it as a bug, I wanted to run it by this group to see if I'm just not looking at the boolean expression correctly. Here's the issue: I created an index with 5 documents, all have one field: "text", with the following contents:

doc1: text:"Table Chair Spoon"
doc2: text:"Table Chair Spoon Fork"
doc3: text:"Table Spoon Fork"
doc4: text:"Chair Spoon Fork"
doc5: text:"Spoon Fork"

When I enter the query "Table AND NOT Chair" I get one hit, doc3. When I enter the query "Table AND (NOT Chair)" I get 0 hits. I had thought that both queries would return the same results. Is this a bug, or am I not understanding the query language correctly? I'm attaching test code. The program creates an index in the directory which you pass into the main program. Thanks! L

--

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import java.io.File;
import java.io.IOException;

public class IndexTest {

    public static void create(File indexDir) throws IOException {
        IndexWriter writer = new IndexWriter(indexDir, new WhitespaceAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("text", "Table Chair Spoon", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));
        writer.addDocument(doc);
        doc = new Document();
        doc.add(new Field("text", "Table Chair Spoon Fork", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));
        writer.addDocument(doc);
        doc = new Document();
        doc.add(new Field("text", "Table Spoon Fork!", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));
        writer.addDocument(doc);
        doc = new Document();
        doc.add(new Field("text", "Chair Spoon Fork", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));
        writer.addDocument(doc);
        doc = new Document();
        doc.add(new Field("text", "Spoon Fork", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));
        writer.addDocument(doc);
        writer.close();
    }

    public static void query(File indexDir, String queryString) throws IOException {
        Query query = null;
        try {
            QueryParser qp = new QueryParser("text", new WhitespaceAnalyzer());
            qp.setDefaultOperator(QueryParser.OR_OPERATOR);
            query = qp.parse(queryString);
        } catch (Exception qe) {
            System.out.println(qe.toString());
        }
        if (query == null) return;
        System.out.println("Query: " + query.toString());
        IndexReader reader = IndexReader.open(indexDir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Hits hits = searcher.search(query);
        System.out.println("Hits: " + hits.length());
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("text") + " ");
        }
        searcher.close();
        reader.close();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            throw new Exception("Usage: " + IndexTest.class.getName() + "");
        }
        File indexDir = new File(args[0]);
        create(indexDir);
        query(indexDir, "Table AND NOT Chair");
        query(indexDir, "Table AND (NOT Chair)");
    }
}
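A likely explanation, if I recall the parser's behavior correctly: the parenthesized form produces a nested BooleanQuery whose only clause is prohibited, and a BooleanQuery with no positive clause matches no documents. Printing the parsed queries makes this visible (a sketch, untested; the exact toString output may differ slightly by version):

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;

public class ParseDemo {
    public static void main(String[] args) throws Exception {
        QueryParser qp = new QueryParser("text", new WhitespaceAnalyzer());

        // Prohibition applied at the top level: docs with Table and
        // without Chair can still match.
        System.out.println(qp.parse("Table AND NOT Chair"));
        // expected shape: +text:Table -text:Chair

        // The parentheses create a nested BooleanQuery containing only
        // a prohibited clause; that sub-query matches nothing, so the
        // whole conjunction returns 0 hits.
        System.out.println(qp.parse("Table AND (NOT Chair)"));
        // expected shape: +text:Table +(-text:Chair)
    }
}
```

A common workaround is to give the purely negative sub-query something positive to subtract from, e.g. "Table AND (*:* NOT Chair)"-style rewrites, or simply keep the NOT at the top level.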
Re: Modelling relational data in Lucene Index?
One thing it took me a while to grasp, and is not automatic for folks with significant database backgrounds is that the fields in a Lucene document are only related to those of any other document by the meaning you, as a programmer, understand. That is, document 1 may have fields a, b, c. Document 2 may have fields b, e, g. There is no requirement that, in this example, document 1 has fields e and g for instance. and vice-versa. In other words, Lucene documents don't fit into a table model. The reason I mention that is that I'm extremely leery of packing data in a field that really doesn't belong together. Plus, your searching becomes more complicated. In your example above, what happens if the file name and image are similar enough to produce false hits? Whereas if you stored them as separate fields in a document, you don't have this kind of problem. So, if you can cleverly de-normalize your data in such a way as to satisfy all the searches you'll ever want to perform, you can store it all in a Lucene index and be happy. If you can't, you could use Lucene to search the parts you *do* care about and store the rest in a database. Or, you could just use a database. I believe it all hinges on whether you have a fixed set of queries you can anticipate (and thus reflect in a Lucene index) or not. Best Erick On 11/2/06, Rajesh parab <[EMAIL PROTECTED]> wrote: Thanks for feedback Chris. I agree with you. The data set should be flattened out to store inside Lucene index. The Folder-File was just an example. As you know, in relational database, we can have more complex relationships. I understand that this model may not work for deeper relationships. What I am mainly interested in is just one level deep relationship. But, I would like to search on the additional attributes of the related object. For example, in the relationship for Folder-File, I would like to use additional file attributes as search criteria along with file name while searching for folders. 
The way I see it is to have a single field for the related object and all its additional attributes, and use some separator while capturing this data inside the Lucene Field object. For example - new Field("file", "abc.txtimage"); But, I am not quite sure if this model will work. BTW, I did not understand what you meant by the detached approach. Can you please elaborate? Regards, Rajesh - Original Message From: Chris Lu <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, November 2, 2006 7:57:46 PM Subject: Re: Modelling relational data in Lucene Index? For this specific question, you can create an index on files, search files that are of type image, and from matched files, find the unique directories (can be done in lucene or you can do it via java). Of course this does not scale to deeper relationships. Usually you do need to flatten the database objects in order to use lucene. It's just trading space for speed. I would prefer a detached approach instead of Hibernate or EJB's approach, which is kind of too tightly coupled with any system. How to rebuild if the index is corrupted, or you have a new Analyzer, or schema evolves? How to make it multi-thread safe? -- Chris Lu - Instant Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com On 11/2/06, Mark Miller <[EMAIL PROTECTED]> wrote: > Lucene is probably not the solution if you are looking for a relational > model. You should be using a database for that. If you want to combine > Lucene with a relational model, check out Hibernate and the new EJB > annotations that it supports...there is a cool little Lucene add-on that > lets you declare fields to be indexed (and how) with annotations. > > - Mark > > Rajesh parab wrote: > > Hi, > > > > As I understand, Lucene has a flat structure where you can define multiple fields inside the document. There is no relationship between any field.
> > > > I would like to enable index based search for some of the components inside a relational database. For example, let's say a "Folder" object. The Folder object can have a relationship with a File object. The File object, in turn, can have attributes like is image, is text file, etc. So, the structure is > > > > Folder -- > File > > | > > --- > is image, is text file, .. > > > > > > I would like to enable a search to find a Folder with a File of type image. How can we model such relational data inside a Lucene index? > > > > Regards, > > Rajesh > > > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
Re: for admins: mailing list like spam
It will make mails list more easy to read (I am using gmail and I do not have client-side filters). That is not true. You can have labels, and, if you look at the top of the page, right beside the "Search the Web" button, you have a "create filter" link. Patrick
Re: experiences with lingpipe
Martin Braun wrote: Hi Breck, i have tried your tutorial and built (hopefully) a successful SpellCheck.model File with 49M. My Lucene Index directory is 2,4G. When I try to read the Model with the readmodel function, i get an "Exception in thread "main" java.lang.OutOfMemoryError: Java heap space", though I started java with -Xms1024m -Xmx1024m. How much RAM will I need for the Model (I only have 2 GB of physical RAM, and lucene's also using some memory). You need to increase the memory for java. I think 32-bit java is limited to a 1.3 gig heap but could be wrong. No heuristics at the tip of my fingers. To make the spell checker smaller you can prune the tokens using the pruneLM method in the TrainSpellChecker. Pruning the 1 counts should make a big difference and not hurt spelling too much (depends on how things are parameterized). Probably up to 5 counts won't matter. Also look at my tuning tutorial that is in very rough shape but will get you going on tuning at: cvs -d:pserver:[EMAIL PROTECTED]:/usr/local/sandbox co querySpellCheckTuner I will try to get another pass at it over the weekend. Breck Is there a "rule of thumb" to calculate the needed amount of memory of the model? thanks in advance, martin Tuning params dominate the performance space. A small beam (16 active hypotheses) will be quite snappy (I have 200 queries/sec with a 32 beam, over an 80 gig text collection that with some pruning was 5 gig in memory running an 8 gram model)
RE: experiences with lingpipe
> You need to increase the memory for java. I think 32-bit jave is limited to a 1.3 gig heap but > could be wrong. No heuristics at the tip of my fingers. 32-bit JVM under Linux/Windows. Solaris runs OK. Limit on the heap is ~1.7 - 1.8Gb. -Original Message- From: Breck Baldwin [mailto:[EMAIL PROTECTED] Sent: Friday, November 03, 2006 9:59 AM To: java-user@lucene.apache.org Subject: Re: experiences with lingpipe Martin Braun wrote: > Hi Breck, > > i have tried your tutorial and built (hopefully) a successful > SpellCheck.model File with 49M. > My Lucene Index directory is 2,4G. When I try to read the Model with > the readmodel function, i get an "Exception in thread "main" > java.lang.OutOfMemoryError: Java heap space", though I started java > with -Xms1024m -Xmx1024m. > > How many RAM will I need for the Model (I only have 2 GB of physical > RAM, and lucene's also using some memory). You need to increase the memory for java. I think 32-bit jave is limited to a 1.3 gig heap but could be wrong. No heuristics at the tip of my fingers. To make the spell checker smaller you can prune the tokens using the pruneLM method in the TrainSpellChecker. Pruning the 1 counts should make a big difference and not hurt spelling too much (depends on how things are paramterized). Probably up to 5 counts won't matter. Also look at my tuning tutorial that is in very rough shape but will get you going on tuning at: cvs -d:pserver:[EMAIL PROTECTED]:/usr/local/sandbox co querySpellCheckTuner I will try to get another pass at it over the weekend. b reck > > Is there a "rule of thumb" to calculate the needed amount of memory of > the model? > > thanks in advance, > > martin > > > Tuning params dominate the performance space. A small beam (16 active hypotheses) will be quite snappy (I have 200 queries/sec with a 32 beam. 
over a 80 gig text collection that with some pruning was 5 gig in memory running an 8 gram model) > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] - - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Any experience with spring's lucene support?
Haven't used them, but had a look at them some time ago. Seems like a nice set of helper factory classes to manage Lucene engine through Spring IoC. Can't do much wrong in here I guess... If you'd be using Spring in your app, you'd have to come up with similar factories either way, so probably it'd make sense to reuse the ones in springmodules. The only 'non-factory' classes I noticed is 'DB indexing'. The only problem (from my estimations) is that the DB Access layer is fixed to Spring SQL classes (ie, you probably wouldn't be able to use iBatis or Hibernate easily). As to compass, probably these guys have similar Spring classes as well as some other stuff. One person on the list used it (Compass) in production environment and says he's quite happy with it. But generally, it's probably worthwhile to go to SpringModules forum and Compass forum accordingly for more info... Vlad -Original Message- From: lude [mailto:[EMAIL PROTECTED] Sent: Friday, November 03, 2006 1:36 AM To: java-user Subject: Re: Any experience with spring's lucene support? Nobody here, who is using spring-modules? On 11/1/06, lude <[EMAIL PROTECTED]> wrote: > > Hello, > > while starting a new project we are thinking about using the > spring-modules for working with lucene. See: > https://springmodules.dev.java.net/ > > Does anybody has experience with this higher level lucene API? > How does it compare to Compass? > (Dis-)Advantages of using spring-modules lucene support? > > Thanks > lude > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Announcement: Lucene powering Monster job search index (Beta)
Hi Peter, Does this mean you are calculating the Euclidean distance twice ... once for the HitCollector to filter 'out of range' documents, and then again for the custom Comparator to sort the returned documents? especially since the filtering is done outside Lucene? Regards, Dan Joe, Fields with numeric values are stored in a separate file as binary values in an internal format. Lucene is unaware of this file and unaware of the range expression in the query. The range expression is parsed outside of Lucene and used in a custom HitCollector to filter out documents that aren't in the requested range(s). A goal was to do this without having to modify Lucene. Our scheme is pretty efficient, but not very general purpose in its current form, though. Peter On 10/30/06, Joe Shaw <[EMAIL PROTECTED]> wrote: Hi Peter, On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote: > Numeric range search is one of Lucene's weak points (performance-wise) so we > have implemented this with a custom HitCollector and an extension to the > Lucene index files that stores the numeric field values for all documents. > > It is important to point out that this has all been implemented with the > stock Lucene 2.0 library. No code changes were made to the Lucene core. Can you give some technical details on the extension to the Lucene index files? How did you do it without making any changes to the Lucene core? Thanks, Joe
TooManyClauses with MultiTermQueries
Hello, I have been working with Lucene for several years. One of my biggest problems was Lucene's inability to search with wildcards, so I developed my own MultiTermQueries. Now there is a standard class for this, but you will always get an exception if your search is too generic, 'a*' for example. I can't solve this problem, but I make it acceptable with the following algorithm: - get all possible terms. - sort them (currently by the length difference between the search term and the matched term: if you search 'TooMany*' then 'TooManyDog' ranks better than 'TooManyClauses'). - keep only the allowed number (I do not want my BooleanQuery to exceed 100 terms, for example). - search with these. On this Query I can call: .getWarnings(), which gives me a string describing the limitation ("Found 265654 terms for your search, please be more precise."), and .getTermsList(), the list of all searched terms (useful for the user, too). So I always get a result. Mostly, thanks to the sorting, I get the term that was searched for (you can use another sort). I can limit maxClauseCount to a small value (avoiding out-of-memory errors and getting better performance). Hope this can help someone. I think it would be a nice feature to implement in Lucene. PS: sorry for my poor english. -- Mit freundlichen Grüßen i. A. Éric Louvard HAUK & SASKO Ingenieurgesellschaft mbH Zettachring 2 D-70567 Stuttgart Phone: +49 7 11 7 25 89 - 19 Fax: +49 7 11 7 25 89 - 50 E-Mail: [EMAIL PROTECTED] www: www.hauk-sasko.de
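In code, the enumerate-and-truncate step Éric describes might look roughly like this (a sketch against the Lucene 2.0 API, untested; the ranking step is simplified to plain truncation, and the class name and warning wording are illustrative):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.WildcardTermEnum;

/**
 * Enumerate the terms a wildcard would expand to, keep at most
 * maxTerms of them, and count how many were found so the caller
 * can warn the user to be more precise.
 */
public class BoundedWildcardExpansion {
    public static List expand(IndexReader reader, Term wildcard, int maxTerms)
            throws IOException {
        List terms = new ArrayList();
        int total = 0;
        WildcardTermEnum termEnum = new WildcardTermEnum(reader, wildcard);
        try {
            do {
                Term t = termEnum.term();
                if (t == null) break;
                total++;
                // A fuller version would collect everything, rank by
                // length difference to the pattern, then truncate.
                if (terms.size() < maxTerms) {
                    terms.add(t);
                }
            } while (termEnum.next());
        } finally {
            termEnum.close();
        }
        if (total > maxTerms) {
            System.out.println("Found " + total
                + " terms for your search, please be more precise.");
        }
        return terms;
    }
}
```

The resulting list can be OR-ed into a BooleanQuery that is guaranteed to stay under maxClauseCount.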
Multi valued fields
Hi all, Our company has a set of assets and we use meta-data (XML files) to describe each asset. My job is to index and search over the meta-data associated with the assets. The interesting aspect of my problem is that an asset can have more than one meta-data file associated with it, depending on the context that the asset lies in. The search result must display an asset only once. If more than one meta-data associated with it match the search query, we need to display the different meta-data associated with the asset in order of relevance as part of one hit to be able to show the user the various contexts that this asset occurs in. My first idea was to index each meta-data file into its own document and merge the documents with the same asset_id on search. But, there are hundreds of thousands of meta-data and the search results can run into hundreds. My next idea was to index all the meta-data associated with an asset into multi-valued fields. But, I cannot see a way to rank within the multi-valued fields. Another crazy idea that crossed my mind - how about building a separate index that indexes document ids of the documents associated with an asset, so that I can look it up to merge the hits? Any thoughts? Seeta
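One way to sketch the "display an asset only once" requirement without a second index: index each meta-data file as its own document, then collapse hits to one entry per asset at collect time, using the FieldCache to map internal doc numbers to asset ids. This assumes every document carries an indexed, untokenized asset_id field (all names hypothetical; Lucene 2.0-style API, untested):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

/**
 * Collapses hits to one entry per asset, remembering the best score
 * seen for each asset_id. Per-asset doc lists for the "contexts in
 * order of relevance" display could be accumulated the same way.
 */
public class AssetCollapsingCollector extends HitCollector {
    private final String[] assetIds;          // asset_id per internal doc number
    private final Map bestScorePerAsset = new HashMap();

    public AssetCollapsingCollector(IndexReader reader) throws IOException {
        this.assetIds = FieldCache.DEFAULT.getStrings(reader, "asset_id");
    }

    public void collect(int doc, float score) {
        String assetId = assetIds[doc];
        Float best = (Float) bestScorePerAsset.get(assetId);
        if (best == null || score > best.floatValue()) {
            bestScorePerAsset.put(assetId, new Float(score));
        }
    }

    public Map getBestScores() {
        return bestScorePerAsset;
    }
}
```

The FieldCache array is built once per reader, so with hundreds of thousands of meta-data documents the per-hit cost is just an array lookup and a map update.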
Re: Announcement: Lucene powering Monster job search index (Beta)
Paramasivam, Take a look at Solr, in particular the DocSetHitCollector class. The collector simply sets a bit in a BitSet, or saves the docIds in an array (for low hit counts). Solr's BitSet was optimized (by Yonik, I believe) to be faster than Java's BitSet, so this HitCollector is very fast. This is essentially what we are doing for counting. Peter On 11/2/06, Paramasivam Srinivasan <[EMAIL PROTECTED]> wrote: Hi Peter When I use the CustomHitCollector, it affect the application performance. Also how you accomplish the grouping the results with out affecting performance. Also If possible give some code snippet for custome hitcollector. TIA Sri "Peter Keegan" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Joe, > > Fields with numeric values are stored in a separate file as binary values > in > an internal format. Lucene is unaware of this file and unaware of the > range > expression in the query. The range expression is parsed outside of Lucene > and used in a custom HitCollector to filter out documents that aren't in > the > requested range(s). A goal was to do this without having to modify Lucene. > Our scheme is pretty efficient, but not very general purpose in its > current > form, though. > > Peter > > > On 10/30/06, Joe Shaw <[EMAIL PROTECTED]> wrote: >> >> Hi Peter, >> >> On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote: >> > Numeric range search is one of Lucene's weak points (performance-wise) >> so we >> > have implemented this with a custom HitCollector and an extension to >> > the >> > Lucene index files that stores the numeric field values for all >> documents. >> > >> > It is important to point out that this has all been implemented with >> > the >> > stock Lucene 2.0 library. No code changes were made to the Lucene core. >> >> Can you give some technical details on the extension to the Lucene index >> files? How did you do it without making any changes to the Lucene core? 
>> >> Thanks, >> Joe >> >> >> - >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> >
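The collecting pattern Peter describes can be sketched roughly as follows. This is a hypothetical, self-contained illustration, not Solr's actual code: java.util.BitSet stands in for Solr's optimized bit set, and a local HitCollector interface stands in for Lucene 2.0's abstract org.apache.lucene.search.HitCollector so the example compiles without the Lucene jar.

```java
import java.util.BitSet;

public class BitSetCollectorDemo {

    // Stand-in for Lucene 2.0's HitCollector (an abstract class with
    // collect(int doc, float score)); redeclared here so the sketch
    // is self-contained.
    public interface HitCollector {
        void collect(int doc, float score);
    }

    // Sets one bit per matching document. There is no per-hit object
    // allocation, which is why this kind of collector is fast.
    public static class BitSetHitCollector implements HitCollector {
        public final BitSet bits;

        public BitSetHitCollector(int maxDoc) {
            bits = new BitSet(maxDoc);
        }

        public void collect(int doc, float score) {
            bits.set(doc); // score is unused for pure counting
        }
    }

    // Simulate a search driving the collector with some doc ids
    // (duplicates are harmless: the bit is simply set again).
    public static BitSet collect(int[] matchingDocs, int maxDoc) {
        BitSetHitCollector c = new BitSetHitCollector(maxDoc);
        for (int doc : matchingDocs) {
            c.collect(doc, 1.0f);
        }
        return c.bits;
    }

    public static void main(String[] args) {
        BitSet bits = collect(new int[] {1, 3, 5, 3}, 8);
        System.out.println("hit count = " + bits.cardinality());
    }
}
```

Counting hits is then just BitSet.cardinality(), and grouping or faceting can be built from cheap and()/or() intersections of such sets.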
Re: search within search
spinergywmy <[EMAIL PROTECTED]> wrote on 03/11/2006 00:40:42: > I have another problem: the way I have coded it, I do not really perform a search within a search, because for the second search I actually go back to the index directory and search the entire index again, rather than searching the cached results of the first search. > > How can I solve this problem? Do I need to use a QueryFilter and restructure my code, which is time consuming, or is there any way I can get it done without restructuring? Or do I need to use a BitSet within my existing code? This was the recommendation you got on this in the list (I forgot from whom): submit query1 ANDed with query2. True, this is searching the "entire" index again. In particular, it is re-doing work already done for query1. However, this is the simplest approach, with equivalent results. Unless you are facing performance problems, this should be sufficient. If, however, you are facing performance issues - say, the queries are very large, the index is large as well, you have more than 2 stages (search within (search within (search within search (...)))), and resubmitting a larger and larger boolean query is out of the question - you can go with the filter approach. For this you can use your own hit collector, which, while query1 is processed, would populate a bitset to be used in a filter for query2, should that be requested by the user. But I wouldn't go there if I didn't have to. Doron
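The filter approach Doron describes can be sketched with plain bit sets. This is a hypothetical illustration, not code from the thread: query1's hit collector populates a BitSet, and query2's matches are intersected with it, which is logically equivalent to submitting query1 AND query2.

```java
import java.util.BitSet;

public class SearchWithinSearchDemo {

    // What a hit collector for query1 would produce: one bit per
    // matching document id.
    public static BitSet toBits(int[] docs, int maxDoc) {
        BitSet b = new BitSet(maxDoc);
        for (int d : docs) {
            b.set(d);
        }
        return b;
    }

    // Restrict query2's matches to documents that also matched query1.
    // The and() is the "filter" step; the result is the same set of
    // documents that (query1 AND query2) would match.
    public static BitSet searchWithin(int[] query1Docs, int[] query2Docs, int maxDoc) {
        BitSet first = toBits(query1Docs, maxDoc);
        BitSet result = toBits(query2Docs, maxDoc);
        result.and(first);
        return result;
    }

    public static void main(String[] args) {
        BitSet r = searchWithin(new int[] {0, 2, 4, 6}, new int[] {2, 3, 6, 7}, 8);
        System.out.println(r); // docs matching both stages
    }
}
```

Chaining further stages (search within search within search ...) is just repeated and() calls against the previous stage's bit set.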
Re: Announcement: Lucene powering Monster job search index (Beta)
Daniel, Yes, this is correct if you happen to be doing a radius search and sorting by mileage. Peter On 11/3/06, Daniel Rosher <[EMAIL PROTECTED]> wrote: Hi Peter, Does this mean you are calculating the euclidean distance twice ... once for the HitCollector to filter 'out of range' documents, and then again for the custom Comparator to sort the returned documents? Especially since the filtering is done outside Lucene? Regards, Dan >Joe, > >Fields with numeric values are stored in a separate file as binary values in >an internal format. Lucene is unaware of this file and unaware of the range >expression in the query. The range expression is parsed outside of Lucene >and used in a custom HitCollector to filter out documents that aren't in the >requested range(s). A goal was to do this without having to modify Lucene. >Our scheme is pretty efficient, but not very general purpose in its current >form, though. > >Peter > > >On 10/30/06, Joe Shaw <[EMAIL PROTECTED]> wrote: >> >> Hi Peter, >> >> On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote: >> > Numeric range search is one of Lucene's weak points (performance-wise) >> so we >> > have implemented this with a custom HitCollector and an extension to the >> > Lucene index files that stores the numeric field values for all >> documents. >> > >> > It is important to point out that this has all been implemented with the >> > stock Lucene 2.0 library. No code changes were made to the Lucene core. >> >> Can you give some technical details on the extension to the Lucene index >> files? How did you do it without making any changes to the Lucene core? >> >> Thanks, >> Joe
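One way to avoid the double computation Daniel asks about is to cache each candidate's distance during the filtering pass and reuse it for the sort. A minimal self-contained sketch, not Peter's actual implementation (which works inside a HitCollector/Comparator pair over Lucene doc ids):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RadiusSearchDemo {

    // Filter points to those within `radius` of (cx, cy) and return
    // their indices sorted nearest-first. Each distance is computed
    // exactly once and cached in `dist` for the sort to reuse.
    public static List<Integer> search(double[][] pts, double cx, double cy, double radius) {
        final double[] dist = new double[pts.length];
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < pts.length; i++) {
            double dx = pts[i][0] - cx;
            double dy = pts[i][1] - cy;
            dist[i] = Math.sqrt(dx * dx + dy * dy); // computed once per doc
            if (dist[i] <= radius) {
                hits.add(i);                        // the "filter" step
            }
        }
        // The "mileage sort" reuses the cached distances.
        hits.sort(Comparator.comparingDouble(i -> dist[i]));
        return hits;
    }

    public static void main(String[] args) {
        double[][] pts = { {0, 0}, {3, 0}, {10, 0} };
        System.out.println(search(pts, 0, 0, 5)); // in-radius points, nearest first
    }
}
```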
Re: How to get Term Weights (document term matrix)?
I don't really know what a "term matrix" is, but when you ask about "weight", is it possible you are just looking for the TermDocs.freq() of the term/doc pair? : Date: Thu, 02 Nov 2006 12:45:30 +0100 : From: Soeren Pekrul <[EMAIL PROTECTED]> : Reply-To: java-user@lucene.apache.org : To: java-user@lucene.apache.org : Subject: How to get Term Weights (document term matrix)? : : Hello, : : I would like to extract and store the document term matrix externally. I : iterate the terms and the documents for each term: : TermEnum terms=IndexReader.terms(); : while(terms.next()) { : TermDocs docs=IndexReader.termDocs(terms.term()); : while(docs.next()) { : //store the term, the document and the weight : } : } : : How can I get the term weight for a document? : : Thanks. Sören -Hoss
Intermittent search performance problem
Hi, I'm trying to figure out a way to troubleshoot a performance problem we're seeing when searching against a memory-based index. What happens is we will run a search against the index and it generally returns in 1 second or less. But every once in a while it takes 15-20 seconds for the exact same search for no apparent reason. There is nothing else going on in the system to cause this behavior. I have tried hooking up YourKit profiler to see where the time is going but it doesn't even record the extra time being taken up, even when I ask for method invocation counts. This is very strange, we have been using Lucene for years in production and I've never seen a problem like it. It is also only exhibited in one particular index, we cannot reproduce the problem with other indexes. This index has around 170,000 documents in it and does not have a particularly large amount of data relative to our other indexes. I would really appreciate any suggestions for tracking down the culprit. Since YourKit is missing the extra time it seems like some sort of lock/synchronized method issue but I've only really seen that type of problem using disk indexing when the indexes aren't optimized. We're currently on Lucene 2.0 but I had the same problem with 1.9.1. Thanks, Ben
Re: Modelling relational data in Lucene Index?
Hi No, he is talking about http://www.hibernate.org/hib_docs/annotations/reference/en/html/lucene.html Also note that I'm about to release a new version that is much more flexible http://www.mail-archive.com/hibernate-dev%40lists.jboss.org/msg00392.html and for the future (even more flexible) http://www.mail-archive.com/hibernate-dev%40lists.jboss.org/msg00393.html Note that Compass is an alternative approach. I haven't really looked at the project in detail, but the main drawbacks for me and some other people who compared the two were: - it requires you to deal with a different API than your ORM - it does not give you back a managed (ORM) object on query results - it abstracts quite a lot of Lucene I guess you need to check for yourself Emmanuel Rajesh parab wrote: Thanks Mark. Can you please tell me more about the Lucene add-on you are talking about? Are you talking about Compass? Regards, Rajesh - Original Message From: Mark Miller <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, November 2, 2006 7:29:10 PM Subject: Re: Modelling relational data in Lucene Index? Lucene is probably not the solution if you are looking for a relational model. You should be using a database for that. If you want to combine Lucene with a relational model, check out Hibernate and the new EJB annotations that it supports... there is a cool little Lucene add-on that lets you declare fields to be indexed (and how) with annotations. - Mark Rajesh parab wrote: Hi, As I understand, Lucene has a flat structure where you can define multiple fields inside the document. There is no relationship between any fields. I would like to enable index-based search for some of the components inside a relational database. For example, let's say a "Folder" object. The Folder object can have a relationship with a File object. The File object, in turn, can have attributes like is image, is text file, etc. So, the structure is Folder --> File --> is image, is text file, ...
I would like to enable a search to find a Folder with a File of type image. How can we model such relational data inside a Lucene index? Regards, Rajesh - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
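Since Lucene has no joins, the usual answer to Rajesh's question is to denormalize: copy the parent Folder's key onto each File document at index time, query the flat file documents, and collect the parent ids. A toy sketch with maps standing in for Lucene documents (the field names folderId/fileType are hypothetical):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DenormalizeDemo {

    // One "document" per File, carrying its own attributes plus the
    // parent Folder's key - the denormalization step.
    public static Map<String, String> fileDoc(String folderId, String type) {
        Map<String, String> doc = new HashMap<>();
        doc.put("folderId", folderId); // parent key copied onto the child
        doc.put("fileType", type);
        return doc;
    }

    // "Find folders containing a file of the given type": a flat query
    // over file documents, then collect the distinct parent ids.
    public static Set<String> foldersWithFileType(List<Map<String, String>> index, String type) {
        Set<String> folders = new TreeSet<>();
        for (Map<String, String> doc : index) {
            if (type.equals(doc.get("fileType"))) {
                folders.add(doc.get("folderId"));
            }
        }
        return folders;
    }

    public static void main(String[] args) {
        List<Map<String, String>> index = List.of(
            fileDoc("f1", "image"), fileDoc("f1", "text"), fileDoc("f2", "text"));
        System.out.println(foldersWithFileType(index, "image"));
    }
}
```

In real Lucene the same shape holds: index File documents with a keyword field for the parent key, run the attribute query, and read the parent key from each hit.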
Re: Re: for admins: mailing list like spam
On 11/3/06, Patrick Turcotte <[EMAIL PROTECTED]> wrote: > > It will make the mailing list easier to read (I am using gmail and I do > not have client-side filters). That is not true. You can have labels, and, if you look at the top of the page, right beside the "Search the Web" button, you have a "create filter" link. "Skip Inbox" is particularly important when doing this. -Mike
Re: Modelling relational data in Lucene Index?
Hi, What exactly are your concerns about the "non-detached" approach (see below)? Chris Lu wrote:

> I would prefer a detached approach instead of Hibernate or EJB's
> approach, which is kind of too tightly coupled with any system.

It is probably going to be coupled with yours ;-)

> How to rebuild if the index is corrupted, or you have a new Analyzer, or
> the schema evolves?

I've introduced a session.index() which forces the (re)indexing of the document.

> How to make it multi-thread safe?

What do you mean by multithread safe? The indexing? The indexing is multithread safe in the Hibernate Lucene integration. The query process? The query doesn't have to be, since you query on a given session (aka user conversation), so no multithread threat here.
Re: Suspected problem in the QueryParser
: When I enter the query: "Table AND NOT Chair" I get one hit, doc3 : When I enter the query: "Table AND (NOT Chair)" I get 0 hits. : : I had thought that both queries would return the same results. Is this a : bug, or am I not understanding the query language correctly? It's a confusing eccentricity of the QueryParser syntax ... as a general rule, things in parens need to be self-contained, effective queries ... if you have something in parens which would not make sense as a query by itself, then it won't make any more sense as part of a larger query. In your case, the sub-query "NOT Chair" is the problem ... you can't have a negative clause in isolation by itself -- it doesn't make sense because there isn't anything positively selecting results for you to then exclude results from. As a side note: I strongly encourage you to train yourself to think in terms of MUST, MUST_NOT and SHOULD (which are represented in the query parser as the prefixes "+", "-" and the default) instead of in terms of AND, OR, and NOT ... Lucene's BooleanQuery (and thus Lucene's QueryParser) is not a strict Boolean Logic system, so it's best not to try to think of it like one. -Hoss
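Hoss's point about isolated negative clauses can be made concrete with a toy model of BooleanQuery's matching rules. This is a hypothetical sketch, not Lucene's actual code: scoring, coord, and the real API are omitted, and only the MUST/SHOULD/MUST_NOT occur logic is modeled.

```java
import java.util.List;
import java.util.Set;

public class BooleanClauseDemo {

    public enum Occur { MUST, SHOULD, MUST_NOT }

    // One term clause with its occur flag.
    public static class Clause {
        public final String term;
        public final Occur occur;
        public Clause(String term, Occur occur) {
            this.term = term;
            this.occur = occur;
        }
    }

    // A doc (modeled as its set of terms) matches when every MUST clause
    // matches, no MUST_NOT clause matches, and at least one positive
    // (MUST or SHOULD) clause matched. A query containing only MUST_NOT
    // clauses therefore matches nothing: there is no positive clause
    // selecting documents to exclude results from.
    public static boolean matches(Set<String> doc, List<Clause> clauses) {
        boolean positiveMatched = false;
        for (Clause c : clauses) {
            boolean hit = doc.contains(c.term);
            switch (c.occur) {
                case MUST:
                    if (!hit) return false;
                    positiveMatched = true;
                    break;
                case SHOULD:
                    if (hit) positiveMatched = true;
                    break;
                case MUST_NOT:
                    if (hit) return false;
                    break;
            }
        }
        return positiveMatched;
    }

    public static void main(String[] args) {
        Set<String> doc3 = Set.of("table");
        // +table -chair matches doc3 ...
        List<Clause> q = List.of(new Clause("table", Occur.MUST),
                                 new Clause("chair", Occur.MUST_NOT));
        // ... but the parenthesized sub-query (NOT chair) alone cannot match.
        List<Clause> sub = List.of(new Clause("chair", Occur.MUST_NOT));
        System.out.println(matches(doc3, q));
        System.out.println(matches(doc3, sub));
    }
}
```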
Re: Modelling relational data in Lucene Index?
I personally like your effort, but technically I would disagree. The SOLR project, and the project I am working on, DBSight, have a detached approach which is implementation agnostic, no matter whether it's Java, Ruby, PHP, or .NET. The returned results can be rendered HTML, JSON, or XML. I don't think you can be more flexible than that. If creating a new index takes 5 minutes without any coding, you can create something more creative. From the business side, you don't need to worry about indexing when designing a system. New requirements may come. It's very hard trying to anticipate all the needs. Technically, a detached approach gives more flexibility on resources like CPU, memory, and hard drive. For example, if your index grows large, say 1G, indexing can take hours with merging. I am not sure how Compass or Hibernate/Lucene handles it. Need to re-write code at that time? I actually feel it's a dangerous trap. > I've introduced a session.index() which forces the (re)indexing of the document So does it mean you need to write some code to fix the index if it's crashed? > What do you mean by multithread safe? The indexing? the indexing is multithread safe in the Hibernate Lucene integration The indexing can be threadsafe. But will it affect the searching? With many files changing and merging, if you cache the searcher, the searching will get "read past EOF" exceptions. If you don't cache the searcher, you will lose the built-in caching, FieldCacheImpl, in Lucene. > The query process? the query doesn't have to since you query on a give session (aka user conversation), so no multithread threat here. So you are not caching the searcher. -- Chris Lu - Instant Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com On 11/3/06, Emmanuel Bernard <[EMAIL PROTECTED]> wrote: Hi, What exactly are your concerns about the "non-detached" approach (see below)?
Chris Lu wrote:

> I would prefer a detached approach instead of Hibernate or EJB's
> approach, which is kind of too tightly coupled with any system.

It is probably going to be coupled with yours ;-)

> How to rebuild if the index is corrupted, or you have a new Analyzer, or
> the schema evolves?

I've introduced a session.index() which forces the (re)indexing of the document.

> How to make it multi-thread safe?

What do you mean by multithread safe? The indexing? The indexing is multithread safe in the Hibernate Lucene integration. The query process? The query doesn't have to be, since you query on a given session (aka user conversation), so no multithread threat here.
RE: TooManyClauses with MultiTermQueries
Hi All, I also need to resolve this issue. What is the best way to catch this exception? Thanks Mathews -Original Message- From: Eric Louvard [mailto:[EMAIL PROTECTED] Sent: Friday, November 03, 2006 8:36 AM To: java-user@lucene.apache.org Subject: TooManyClauses with MultiTermQueries Hello, I have been working with Lucene for several years. One of my biggest problems was the inability of Lucene to search with wildcards, so I developed my own MultiTermQueries. Now there's a standard class for this, but you'll always get an exception if your search is too generic, 'a*' for example. I can't solve this problem, but I make it acceptable with the following algorithm: - get all possible terms. - sort them (currently by the length difference between the search term and the expansion: if you search 'TooMany*' then 'TooManyDog' ranks better than 'TooManyClauses'). - take only the allowed number (I don't want my BooleanQuery to exceed 100 terms, for example). - search with these. For this Query I can call .getWarnings(), which gives me a string describing the limitation ("Have found 265654 terms for your search, please be more precise."), and .getTermsList(), the list of all searched terms (also useful for the user). So I always have a result. Mostly, thanks to the sorting, I get the term that was searched for (you can use another sort). I can limit maxClauseCount to small values (avoiding out-of-memory errors, with better performance). Hope this can help someone. I think it would be a nice feature to implement in Lucene. -- Mit freundlichen Grüßen i. A. Éric Louvard HAUK & SASKO Ingenieurgesellschaft mbH Zettachring 2 D-70567 Stuttgart Phone: +49 7 11 7 25 89 - 19 Fax: +49 7 11 7 25 89 - 50 E-Mail: [EMAIL PROTECTED] www: www.hauk-sasko.de
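Eric's algorithm - collect the expansions, sort them by length difference from the search term, keep only the allowed number - can be sketched in a self-contained way. The method names are hypothetical and the term dictionary is a plain collection rather than a Lucene TermEnum; this is not his actual class.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

public class WildcardTermLimiter {

    // Expand a prefix against a term dictionary, ranking expansions by
    // length difference from the prefix (shorter expansions first, as
    // described above) and capping the result at maxClauses so the
    // rewritten BooleanQuery can never throw TooManyClauses.
    public static List<String> limit(String prefix, Collection<String> dictionary, int maxClauses) {
        List<String> candidates = new ArrayList<>();
        for (String t : dictionary) {
            if (t.startsWith(prefix)) {
                candidates.add(t);
            }
        }
        candidates.sort(Comparator
                .comparingInt((String t) -> t.length() - prefix.length())
                .thenComparing(Comparator.naturalOrder())); // stable tie-break
        return candidates.subList(0, Math.min(maxClauses, candidates.size()));
    }

    // The user-facing warning Eric's .getWarnings() would return when
    // the expansion had to be truncated; null when nothing was dropped.
    public static String warning(int found, int allowed) {
        return found > allowed
            ? "Have found " + found + " terms for your search, please be more precise."
            : null;
    }

    public static void main(String[] args) {
        List<String> dict = List.of("toomanydog", "toomanyclauses", "toomany", "table");
        System.out.println(limit("toomany", dict, 2)); // shortest expansions kept
        System.out.println(warning(3, 2));
    }
}
```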
Re: Intermittent search performance problem
On 11/3/06, Ben Dotte <[EMAIL PROTECTED]> wrote: I'm trying to figure out a way to troubleshoot a performance problem we're seeing when searching against a memory-based index. What happens is we will run a search against the index and it generally returns in 1 second or less. But every once in a while it takes 15-20 seconds for the exact same search for no apparent reason. Are you sure it's not just a big GC? How big is your heap? -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server
Re: Intermittent search performance problem
Good suggestion, I tried watching the GCs in YourKit while testing but unfortunately they don't seem to line up with the searches that take forever. They also don't last long enough to make up that kind of time. I have our heap limited to 1GB right now and it's using around 768MB of that. On 11/3/06, Ben Dotte <[EMAIL PROTECTED]> wrote: I'm trying to figure out a way to troubleshoot a performance problem we're seeing when searching against a memory-based index. What happens is we will run a search against the index and it generally returns in 1 second or less. But every once in a while it takes 15-20 seconds for the exact same search for no apparent reason. Are you sure it's not just a big GC? How big is your heap? -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server
Re: How to get Term Weights (document term matrix)?
Chris Hostetter wrote: > I don't really know what a "term matrix" is, but when you ask about > "weight" is it possible you are just looking for the TermDocs.freq() of > the term/doc pair? Thank you Chris, that was also my first idea. I wanted to get the document frequency indexreader.docFreq(term) and the term frequency termdocs.freq() to calculate the term weight by myself. If I change the scoring by subclassing the Similarity class I have to change the code for the term weight calculation as well. The better way would be to use the same scoring engine for a single term weight and for the ranking of search results. It seems that there is no simple function to ask for the weight of a term in a document directly. So I decided not to iterate the documents of a term or the terms of a document. Instead, I am iterating the terms of the index, searching for each term, iterating the result documents and using the score as my term weight for the document term matrix:

TermEnum terms = indexreader.terms();
while (terms.next()) {
    Term term = terms.term();
    // write the term to the external document term matrix
    Hits hits = indexsearcher.search(new TermQuery(term));
    for (int i = 0; i < hits.length(); i++) {
        // write the document id (key, URL or index number) to the document term matrix
        float weight = hits.score(i);
        // write the term weight to the document term matrix
    }
}

Sören
Re: How to get Term Weights (document term matrix)?
: It seems that there is no simple function to ask the weight for a term
: in a document directly. So I decide not to iterate the documents of a
: term or the terms of a document.

as I said: it depends on what you mean by "term weight" ...

: I'm iterating the terms of the index,
: searching for the term, iterating the result documents and using the
: score as my term weight for the document term matrix:

... okay, so it sounds like you're defining the term weight of a doc/term pair to be the score of that document when searching for that term. You really, *REALLY* don't want to be doing this using the "Hits" class like in your example ... 1) this will re-execute your search behind the scenes many many times 2) the scores returned by "Hits" are pseudo-normalized ... they will be meaningless for any sort of comparison. If your concern is making sure that the score you get back matches the score you would get from executing a search even if you change the Similarity, you could just make sure you use the lengthNorm and tf functions from the Similarity class just like TermScorer does ... or you could keep executing a TermQuery for each term like you are now, but using a HitCollector so you get the raw score. Take a look at the Searcher.search methods that take in a HitCollector. -Hoss
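For reference, the raw per-term weight Hoss alludes to can be computed directly from the Similarity factors. The formulas below follow Lucene's DefaultSimilarity (tf = sqrt(freq), idf = log(numDocs/(docFreq+1)) + 1, lengthNorm = 1/sqrt(numTerms)); boosts and query-time normalization are ignored, so treat this as an illustrative sketch, not the exact value TermScorer produces.

```java
public class TermWeightDemo {

    // DefaultSimilarity's term-frequency factor.
    public static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // DefaultSimilarity's inverse document frequency factor.
    public static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // DefaultSimilarity's document length normalization.
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Raw (unnormalized) weight of a term in one document: the quantity
    // one would store in a document-term matrix instead of Hits scores.
    public static double weight(int freq, int docFreq, int numDocs, int docLen) {
        return tf(freq) * idf(docFreq, numDocs) * lengthNorm(docLen);
    }

    public static void main(String[] args) {
        // term appears 4 times in a 100-term doc; 10 of 1000 docs contain it
        System.out.println(weight(4, 10, 1000, 100));
    }
}
```

Because these are plain functions of freq, docFreq, numDocs and document length, the matrix can be built from TermEnum/TermDocs alone, without running a single search.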
Re: search within search
Hi, Doron, thanks for the advice. regards, Wooi Meng -- View this message in context: http://www.nabble.com/search-within-search-tf2558237.html#a7171019 Sent from the Lucene - Java Users mailing list archive at Nabble.com.