Syns2Index utility: version of Lucene and Java
I am trying to use the Syns2Index utility to convert WordNet into a Lucene index. First I downloaded the latest JDK and Lucene 2.0, but soon realized that both were too new for compiling Syns2Index.java. By deciphering the error messages I worked my way down to j2sdk1.4.2_13 and Lucene 1.4.3. (I am running XP SP2.)

I have copied the java\org\apache\lucene directory into the same folder as the Syns2Index.java file. I have a feeling that my classpath is most likely set right (or at least close), but I get a huge number of identical compile errors.

Command used:

D:\InfringeDetector\JavaLucene>javac -classpath "D:\Project\JavaLucene;C:\j2sdk1.4.2_13" D:\Project\JavaLucene\org\apache\lucene\wordnet\Syns2Index.java

Compile results (posting just a few from the bottom of my screen):

C:\j2sdk1.4.2_13\java\nio\DirectByteBuffer.java:843: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.nio.DirectByteBuffer
        assert (off <= lim);

C:\j2sdk1.4.2_13\java\nio\DirectByteBuffer.java:934: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.nio.DirectByteBuffer
        assert (off <= lim);

C:\j2sdk1.4.2_13\java\nio\Bits.java:642: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.nio.Bits
        assert (reservedMemory > -1);

C:\j2sdk1.4.2_13\java\lang\CharacterDataLatin1.java:284: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.lang.CharacterDataLatin1
        assert (data.length == (256 * 2));

C:\j2sdk1.4.2_13\java\lang\CharacterData.java:956: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.lang.CharacterData
        assert (data.length == (678 * 2));

C:\j2sdk1.4.2_13\java\nio\DirectByteBufferR.java:165: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.nio.DirectByteBufferR
        assert (pos <= lim);

C:\j2sdk1.4.2_13\java\nio\DirectByteBufferR.java:479: cannot resolve symbol
symbol  : method assert (boolean)
location: class java.nio.DirectByteBufferR
        assert (off <= lim);

Note: Some input files use or override a deprecated API.
Note: Recompile with -deprecation for details.
100 errors
206 warnings

I have to admit that I am fairly new to Java, but past the HelloWorld setups. I have been banging my head against the wall and Google for 10 hours. Please help!!!

-marie
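The errors above come from javac recompiling the JDK's own source files: because C:\j2sdk1.4.2_13 is on the classpath (which also acts as the default sourcepath), javac finds java\nio\*.java and similar files there and compiles them without -source 1.4, so the 1.4 assert keyword is parsed as an unknown method. The JDK does not need to be on the classpath at all; only the Lucene code does. A sketch of a corrected command, assuming the Lucene 1.4.3 jar has been copied to D:\Project\JavaLucene as lucene-1.4.3.jar (adjust the jar name and paths to your setup):

    cd D:\Project\JavaLucene
    javac -classpath "D:\Project\JavaLucene;D:\Project\JavaLucene\lucene-1.4.3.jar" org\apache\lucene\wordnet\Syns2Index.java

Dropping the JDK entry is the key change; the JDK's own classes are found automatically via the bootclasspath.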
Re: How to set query time scoring
Thanks for the instant reply. More specifically, what I am trying to do is:

1) show the results that contain the exact query phrase on top, followed by the ANDed results, followed by the ORed results.
2) introduce a new parameter that uses the query phrase to influence the ranking.

regards
Sajid

Bhavin Pandya wrote:
>
> Hi sajid,
>
> As you already boost data at indexing time...
> You can boost the query at search time...
> e.g. if you are firing a boolean query and a phrase query, you might need to
> boost the phrase query:
>
> PhraseQuery pq = new PhraseQuery();
> pq.setBoost(2.0f);
>
> Thanks.
> Bhavin pandya
>
> ----- Original Message -----
> From: "Sajid Khan" <[EMAIL PROTECTED]>
> Sent: Monday, November 27, 2006 10:17 AM
> Subject: How to set query time scoring
>
>> I have already set some score at index time, and now I want to set
>> some score at query time. But I am not getting any idea of how to set
>> the score at query time in Lucene.
>> Has anybody an idea how to do this?
>>
>> Regards
>> Sajid
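One way to get the ordering Sajid describes, building on Bhavin's suggestion, is to combine the three query forms as optional clauses of a single BooleanQuery with descending boosts, so phrase matches outrank pure AND matches, which outrank pure OR matches. A rough sketch against the 2.0 API; the field name "body", the two terms, the boost values and the searcher variable are all placeholders, and this biases the ranking rather than strictly partitioning the result list:

    // classes from org.apache.lucene.index and org.apache.lucene.search
    PhraseQuery phrase = new PhraseQuery();           // exact phrase, highest boost
    phrase.add(new Term("body", "quick"));
    phrase.add(new Term("body", "fox"));
    phrase.setBoost(10.0f);

    BooleanQuery and = new BooleanQuery();            // all terms required
    and.add(new TermQuery(new Term("body", "quick")), BooleanClause.Occur.MUST);
    and.add(new TermQuery(new Term("body", "fox")), BooleanClause.Occur.MUST);
    and.setBoost(5.0f);

    BooleanQuery or = new BooleanQuery();             // any term, no extra boost
    or.add(new TermQuery(new Term("body", "quick")), BooleanClause.Occur.SHOULD);
    or.add(new TermQuery(new Term("body", "fox")), BooleanClause.Occur.SHOULD);

    BooleanQuery combined = new BooleanQuery();
    combined.add(phrase, BooleanClause.Occur.SHOULD);
    combined.add(and, BooleanClause.Occur.SHOULD);
    combined.add(or, BooleanClause.Occur.SHOULD);

    Hits hits = searcher.search(combined);            // phrase hits score highest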
Re: Question about the "not" in lucene
Thank you for your answer. But is it possible to group clauses with a "not"?

Example: type:product NOT (name:"toto" OR name:"titi") ??

Christophe

Mark Miller wrote:

Personally, I think of it not as a 'not' operator, but more as a 'but not' or 'and not' operator. That's not totally the case I believe, but it gives you semantics that work. Truly I think that each part of the query creates a score and the NOT query scores 0. That gives a different result than a boolean system. More than a few times it has been mentioned that Lucene is a scoring system and not a boolean system.

- Mark

christophe leroy wrote:

Hello,

I don't understand how to use "not" with Lucene. I think that it is not a boolean not. I read the documentation but it is not clear enough on how the "not" works.

For example, I tried this request: type:product --> I got 100 responses. That is normal.
Then I tried this request: type:product AND name:test --> I got 1 response. That is normal too.
And when I tried this request: type:product AND (name:test OR NOT name:test) --> I got 1 response only. I should get 100 responses if the "not" were a boolean not.

Could you explain to me how the "not" works?

Thanks in advance,
Christophe
Hits length with no sorting or scoring
Hello, I have an application in which we only need to know the total number of documents matching a query. In this case we do not need any sorting or scoring or to store any reference to the matching documents. Can you tell me how to execute such a query with maximum performance? Thanks Laurie
Re: Database searching using Lucene....
This has been discussed extensively on this thread, so I think you'd get the fastest answers by searching the mail archive for database, db, etc. The short answer is "it all depends upon what you want to accomplish and the characteristics of your problem". Erick On 11/27/06, Inderjeet Kalra <[EMAIL PROTECTED]> wrote: Hi, I need some inputs on the database searching using lucene. Lucene directly supports the document searching but I am unable to find out the easy and the fastest way for database searching. Which option would be better - SPs or Lucene search engine in terms of implementation, performance and security...if anyone has already done analysis on the same, can you please provide me the comparison matrix or benchmarks for the same ? Thanks in advance Regards Inderjeet ***The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review,retransmission,dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer.***
Re: Question about the "not" in lucene
Yes, I believe that it is entirely possible. You can nest and link boolean clauses all you want: your example query would be a boolean with two top-level clauses, one required to be there and one required not to be there. The second top-level clause would itself be a boolean query with two clauses, both with a SHOULD.

Now, what I think happens (I haven't looked myself) is that type:product will score a document positively if found, but the NOT clause will score a document to 0 if either of its sub-clauses is found. Those 0 scores will not be returned as hits.

Notice that if you just have NOT (name:"toto" OR name:"titi"), ALL of the docs will score 0 one way or another - the docs not found will be 0 and the docs found will be scored 0 by the NOT... so you will not get a result. If you use the special query that matches all docs and then add the NOT query, it will work as expected: all docs get a positive score, but the NOT query zeroes out those matching the MUST_NOT clause.

I am an unclear kind of guy, so I hope that gives some help.

- Mark

hawat23 wrote:

Thank you for your answer. But is it possible to group clauses with a "not"?

Example: type:product NOT (name:"toto" OR name:"titi") ??

Christophe

Mark Miller wrote:

Personally, I think of it not as a 'not' operator, but more as a 'but not' or 'and not' operator. That's not totally the case I believe, but it gives you semantics that work. Truly I think that each part of the query creates a score and the NOT query scores 0. That gives a different result than a boolean system. More than a few times it has been mentioned that Lucene is a scoring system and not a boolean system.

- Mark

christophe leroy wrote:

Hello,

I don't understand how to use "not" with Lucene. I think that it is not a boolean not. I read the documentation but it is not clear enough on how the "not" works.

For example, I tried this request: type:product --> I got 100 responses. That is normal.
Then I tried this request: type:product AND name:test --> I got 1 response. That is normal too.
And when I tried this request: type:product AND (name:test OR NOT name:test) --> I got 1 response only. I should get 100 responses if the "not" were a boolean not.

Could you explain to me how the "not" works?

Thanks in advance,
Christophe
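In code, the grouped NOT that Mark describes for Christophe's example would look roughly like the sketch below (field names and values taken from the example; the commented-out variant shows the purely negative form, using MatchAllDocsQuery as the concrete "query that matches all docs"):

    // classes from org.apache.lucene.index and org.apache.lucene.search
    BooleanQuery names = new BooleanQuery();                 // (name:toto OR name:titi)
    names.add(new TermQuery(new Term("name", "toto")), BooleanClause.Occur.SHOULD);
    names.add(new TermQuery(new Term("name", "titi")), BooleanClause.Occur.SHOULD);

    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("type", "product")), BooleanClause.Occur.MUST);
    query.add(names, BooleanClause.Occur.MUST_NOT);          // exclude the whole group

    // Purely negative variant: give the NOT something positive to subtract from.
    // BooleanQuery onlyNot = new BooleanQuery();
    // onlyNot.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
    // onlyNot.add(names, BooleanClause.Occur.MUST_NOT);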
StackOverflowError while calling IndexReader.deleteDocuments(new Term())
I was trying to build a lucene index (Lucene 2.0, JDK 5) with approximately 15 documents containing about 25 fields for each document. After indexing about 45000 documents, the program crashed. It was running as a batch job and did not log the cause of the crash.

In order to identify why the process crashed, I restarted the job about 50 documents before the crash point so that I can identify the problem. At this point, the program first tries to delete the document if it's already present in the index and then adds it. As soon as I start the program, the program aborts with a StackOverflowError while calling indexreader.deleteDocuments(new Term()) method (even for the document that was indexed earlier).

Here is the partial stacktrace:

Exception in thread "main" java.lang.StackOverflowError
        at java.lang.ref.Reference.<init>(Reference.java:207)
        at java.lang.ref.WeakReference.<init>(WeakReference.java:40)
        at java.lang.ThreadLocal$ThreadLocalMap$Entry.<init>(ThreadLocal.java:240)
        at java.lang.ThreadLocal$ThreadLocalMap$Entry.<init>(ThreadLocal.java:235)
        at java.lang.ThreadLocal$ThreadLocalMap.getAfterMiss(ThreadLocal.java:375)
        at java.lang.ThreadLocal$ThreadLocalMap.get(ThreadLocal.java:347)
        at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:225)
        at java.lang.ThreadLocal.get(ThreadLocal.java:127)
        at org.apache.lucene.index.TermInfosReader.getEnum(TermInfosReader.java:79)
        at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:139)
        at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:50)
        at org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:392)
        at org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:348)
        at org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)

The last line [at org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)] repeats another 1010 times before the program crashes.

I understand that without the actual index or the documents, it's nearly impossible to narrow down the cause of the error. However, can you please point to any theoretical reason why org.apache.lucene.index.MultiTermDocs.next will go into an infinite loop?
Re: Searching by bit masks
I have the same problem here. I have an interest bit field, which I receive from the application backend. I have control over how the documents are built.

To be specific, the field looks like this:

ID : interest
1  : sport
2  : music
4  : film
8  : clubs

So someone interested in sports and music can be found with "interest & 3", e.g. when using SQL. I do not wish to post-filter the results.

On to Lucene: is there a filter which supports this kind of query? Someone suggested splitting the bits into fields:

> Document doc = new Document();
> doc.add("flag1", "Y");
> doc.add("flag2", "Y");
> IndexWriter.add(doc);

Is this helpful at all? Code would be helpful too as I am a newbie.

ltaylor.employon wrote:
>
> Hello,
>
> I am currently evaluating Lucene to see if it would be appropriate to
> replace my company's current search software. So far everything has been
> looking great, however there is one requirement that I am not too
> certain about.
>
> What we need to do is to be able to store a bit mask specifying various
> filter flags for a document in the index and then search this field by
> specifying another bit mask with desired filters, returning documents
> that have any of the specified flags set. In other words, we are doing a
> bitwise AND of the stored filter bit mask and the specified filter bit
> mask, and if it is non-zero, we want to return the document.
>
> Before I started toying around with various options myself, I wanted to
> see if any of you good folks in the Lucene community had some
> suggestions for an efficient way to implement this.
>
> We currently need to index ~8,000,000 documents. We have several filter
> flag fields, the most important of which currently has 7 possible flags
> with any combination of the flags being valid. The number of flags is
> expected to increase rather rapidly in the near future.
>
> My preemptive thanks for your suggestions,
>
> Lawrence Taylor
> Senior Software Engineer
> Employon
Re: RAMDirectory vs MemoryIndex
On Nov 26, 2006, at 8:57 AM, jm wrote: I tested this. I use a single static analyzer for all my documents, and the caching analyzer was not working properly. I had to add a method to clear the cache each time a new document was to be indexed, and then it worked as expected. I have never looked into lucenes inner working so I am not sure if what I did is correct. Makes sense, I've now incorporated that as well by adding a clear() method and extracting the functionality into a public class AnalyzerUtil.TokenCachingAnalyzer. I also had to comment some code cause I merged the memory stuff from trunk with lucene 2.0. Performance was certainly much better (4 times faster in my very gross testing), but for my processing that operation is only a very small, so I will keep the original way, without caching the tokens, just to be able to use the unmodified lucene 2.0. I found a data problem in my tests, but as I was not going to pursue that improvement for now I did not look into it. Ok. Wolfgang. thanks, javier On 11/23/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: Out of interest, I've checked an implementation of something like this into AnalyzerUtil SVN trunk: /** * Returns an analyzer wrapper that caches all tokens generated by the underlying child analyzer's * token stream, and delivers those cached tokens on subsequent calls to * tokenStream(String fieldName, Reader reader). * * This can help improve performance in the presence of expensive Analyzer / TokenFilter chains. * * Caveats: * 1) Caching only works if the methods equals() and hashCode() methods are properly * implemented on the Reader passed to tokenStream(String fieldName, Reader reader). * 2) Caching the tokens of large Lucene documents can lead to out of memory exceptions. * 3) The Token instances delivered by the underlying child analyzer must be immutable. * * @param child *the underlying child analyzer * @return a new analyzer */ public static Analyzer getTokenCachingAnalyzer(final Analyzer child) { ... } Check it out, and let me know if this is close to what you had in mind. Wolfgang. On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote: > I've never tried it, but I guess you could write an Analyzer and > TokenFilter that no only feeds into IndexWriter on > IndexWriter.addDocument(), but as a sneaky side effect also > simultaneously saves its tokens into a list so that you could later > turn that list into another TokenStream to be added to MemoryIndex. > How much this might help depends on how expensive your analyzer > chain is. For some examples on how to set up analyzers for chains > of token streams, see MemoryIndex.keywordTokenStream and class > AnalzyerUtil in the same package. > > Wolfgang. > > On Nov 22, 2006, at 4:15 AM, jm wrote: > >> checking one last thing, just in case... >> >> as I mentioned, I have previously indexed the same document in >> another >> index (for another purpose), as I am going to use the same analyzer, >> would it be possible to avoid analyzing the doc again? >> >> I see IndexWriter.addDocument() returns void, so it does not seem to >> be an easy way to do that no? >> >> thanks >> >> On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: >>> >>> On Nov 21, 2006, at 12:38 PM, jm wrote: >>> >>> > Ok, thanks, I'll give MemoryIndex a go, and if that is not good >>> enoguh >>> > I will explore the other options then. >>> >>> To get started you can use something like this: >>> >>> for each document D: >>> MemoryIndex index = createMemoryIndex(D, ...) 
>>> for each query Q: >>> float score = index.search(Q) >>> if (score > 0.0) System.out.println("it's a match"); >>> >>> >>> >>> >>>private MemoryIndex createMemoryIndex(Document doc, Analyzer >>> analyzer) { >>> MemoryIndex index = new MemoryIndex(); >>> Enumeration iter = doc.fields(); >>> while (iter.hasMoreElements()) { >>>Field field = (Field) iter.nextElement(); >>>index.addField(field.name(), field.stringValue(), analyzer); >>> } >>> return index; >>>} >>> >>> >>> >>> > >>> > >>> > On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: >>> >> On Nov 21, 2006, at 7:43 AM, jm wrote: >>> >> >>> >> > Hi, >>> >> > >>> >> > I have to decide between using a RAMDirectory and >>> MemoryIndex, but >>> >> > not sure what approach will work better... >>> >> > >>> >> > I have to run many items (tens of thousands) against some >>> >> queries (100 >>> >> > at most), but I have to do it one item at a time. And I already >>> >> have >>> >> > the lucene Document associated with each item, from a previous >>> >> > operation I perform. >>> >> > >>> >> > From what I read MemoryIndex should be faster, but apparently I >>> >> cannot >>> >> > reuse the document I already have, and I have to create a new >>> >> > MemoryIndex per item. >>> >> >>> >> A MemoryIndex object holds
Re: RAMDirectory vs MemoryIndex
On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: On Nov 26, 2006, at 8:57 AM, jm wrote: > I tested this. I use a single static analyzer for all my documents, > and the caching analyzer was not working properly. I had to add a > method to clear the cache each time a new document was to be indexed, > and then it worked as expected. I have never looked into lucenes inner > working so I am not sure if what I did is correct. Makes sense, I've now incorporated that as well by adding a clear() method and extracting the functionality into a public class AnalyzerUtil.TokenCachingAnalyzer. yes, same here, I could have posted my code, sorry, but I was not sure if it was even correct... When theres is a new lucene 2.1 or whatever I'll incorporate to that optimization into my code. thanks > > I also had to comment some code cause I merged the memory stuff from > trunk with lucene 2.0. > > Performance was certainly much better (4 times faster in my very gross > testing), but for my processing that operation is only a very small, > so I will keep the original way, without caching the tokens, just to > be able to use the unmodified lucene 2.0. I found a data problem in > my tests, but as I was not going to pursue that improvement for now I > did not look into it. Ok. Wolfgang. > > thanks, > javier > > On 11/23/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: >> Out of interest, I've checked an implementation of something like >> this into AnalyzerUtil SVN trunk: >> >>/** >> * Returns an analyzer wrapper that caches all tokens generated by >> the underlying child analyzer's >> * token stream, and delivers those cached tokens on subsequent >> calls to >> * tokenStream(String fieldName, Reader reader). >> * >> * This can help improve performance in the presence of expensive >> Analyzer / TokenFilter chains. >> * >> * Caveats: >> * 1) Caching only works if the methods equals() and hashCode() >> methods are properly >> * implemented on the Reader passed to tokenStream(String >> fieldName, Reader reader). >> * 2) Caching the tokens of large Lucene documents can lead to out >> of memory exceptions. >> * 3) The Token instances delivered by the underlying child >> analyzer must be immutable. >> * >> * @param child >> *the underlying child analyzer >> * @return a new analyzer >> */ >>public static Analyzer getTokenCachingAnalyzer(final Analyzer >> child) { ... } >> >> >> Check it out, and let me know if this is close to what you had in >> mind. >> >> Wolfgang. >> >> On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote: >> >> > I've never tried it, but I guess you could write an Analyzer and >> > TokenFilter that no only feeds into IndexWriter on >> > IndexWriter.addDocument(), but as a sneaky side effect also >> > simultaneously saves its tokens into a list so that you could later >> > turn that list into another TokenStream to be added to MemoryIndex. >> > How much this might help depends on how expensive your analyzer >> > chain is. For some examples on how to set up analyzers for chains >> > of token streams, see MemoryIndex.keywordTokenStream and class >> > AnalzyerUtil in the same package. >> > >> > Wolfgang. >> > >> > On Nov 22, 2006, at 4:15 AM, jm wrote: >> > >> >> checking one last thing, just in case... >> >> >> >> as I mentioned, I have previously indexed the same document in >> >> another >> >> index (for another purpose), as I am going to use the same >> analyzer, >> >> would it be possible to avoid analyzing the doc again? 
>> >> >> >> I see IndexWriter.addDocument() returns void, so it does not >> seem to >> >> be an easy way to do that no? >> >> >> >> thanks >> >> >> >> On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: >> >>> >> >>> On Nov 21, 2006, at 12:38 PM, jm wrote: >> >>> >> >>> > Ok, thanks, I'll give MemoryIndex a go, and if that is not good >> >>> enoguh >> >>> > I will explore the other options then. >> >>> >> >>> To get started you can use something like this: >> >>> >> >>> for each document D: >> >>> MemoryIndex index = createMemoryIndex(D, ...) >> >>> for each query Q: >> >>> float score = index.search(Q) >> >>> if (score > 0.0) System.out.println("it's a match"); >> >>> >> >>> >> >>> >> >>> >> >>>private MemoryIndex createMemoryIndex(Document doc, Analyzer >> >>> analyzer) { >> >>> MemoryIndex index = new MemoryIndex(); >> >>> Enumeration iter = doc.fields(); >> >>> while (iter.hasMoreElements()) { >> >>>Field field = (Field) iter.nextElement(); >> >>>index.addField(field.name(), field.stringValue(), >> analyzer); >> >>> } >> >>> return index; >> >>>} >> >>> >> >>> >> >>> >> >>> > >> >>> > >> >>> > On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: >> >>> >> On Nov 21, 2006, at 7:43 AM, jm wrote: >> >>> >> >> >>> >> > Hi, >> >>> >> > >> >>> >> > I have to decide between using a RAMDirectory and >> >>> MemoryInd
Re: StackOverflowError while calling IndexReader.deleteDocuments(new Term())
On 11/27/06, Suman Ghosh <[EMAIL PROTECTED]> wrote: The last line [at org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)] repeats another 1010 times before the program crashes. I understand that without the actual index or the documents, it's nearly impossible to narrow down the cause of the error. However, can you please point to any theoretical reason why org.apache.lucene.index.MultiTermDocs.next will go into an infinite loop? MultiTermDocs.next() is a recursive function. From what I can see of it though, it shouldn't recurse greater than the number of segments in the index. How many segments do you have in your index? What IndexWriter settings have you changed (mergeFactor, maxMergeDocs, etc)? -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: StackOverflowError while calling IndexReader.deleteDocuments(new Term())
Here are the values: mergeFactor=10 maxMergeDocs=10 minMergeDocs=100 And I see your point. At the time of the crash, I have over 5000 segments. I'll try some conservative number and try to rebuild the index. On 11/27/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 11/27/06, Suman Ghosh <[EMAIL PROTECTED]> wrote: > The last line [at > org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)] > repeats another 1010 times before the program crashes. > > I understand that without the actual index or the documents, it's > nearly impossible to narrow down the cause of the error. However, can > you please point to any theoretical reason why > org.apache.lucene.index.MultiTermDocs.next will go into an infinite > loop? MultiTermDocs.next() is a recursive function. From what I can see of it though, it shouldn't recurse greater than the number of segments in the index. How many segments do you have in your index? What IndexWriter settings have you changed (mergeFactor, maxMergeDocs, etc)? -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
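For reference, in Lucene 2.0 the three settings Suman lists map onto IndexWriter setters (minMergeDocs is the old 1.4-era field; its 2.0 counterpart is setMaxBufferedDocs). A sketch using the values quoted above; the index path and analyzer are placeholders. Note that a maxMergeDocs this small stops segments from ever being merged past 10 documents, which could by itself account for a very large segment count:

    IndexWriter writer = new IndexWriter("/path/to/index",        // placeholder path
                                         new StandardAnalyzer(),  // placeholder analyzer
                                         false);                  // false = append to an existing index
    writer.setMergeFactor(10);       // how many segments get merged at once
    writer.setMaxMergeDocs(10);      // segments larger than this are never merged again
    writer.setMaxBufferedDocs(100);  // 2.0 replacement for the 1.4 minMergeDocs field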
Re: StackOverflowError while calling IndexReader.deleteDocuments(new Term())
On 11/27/06, Suman Ghosh <[EMAIL PROTECTED]> wrote: Here are the values: mergeFactor=10 maxMergeDocs=10 minMergeDocs=100 And I see your point. At the time of the crash, I have over 5000 segments. I'll try some conservative number and try to rebuild the index. Although I don't see how those settings can produce 5000 segments, I've developed a non-recursive patch you might want to try: https://issues.apache.org/jira/browse/LUCENE-729 The patch is to the Lucene trunk (current devel version), so if you want to stick with Lucene 2.0, you might have to patch by hand. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Hits length with no sorting or scoring
On Monday 27 November 2006 14:30, Hirsch Laurence wrote:
> Hello,
>
> I have an application in which we only need to know the total number of
> documents matching a query. In this case we do not need any sorting or
> scoring or to store any reference to the matching documents. Can you
> tell me how to execute such a query with maximum performance?

A fairly quick way is to implement your own HitCollector to count, and then use the appropriate methods of IndexSearcher.

If you really need maximum performance, this bit of code avoids computing the score values and invoking the HitCollector (untested):

// s is the IndexSearcher, query the Query
org.apache.lucene.search.Scorer scorer =
    query.weight(s).scorer(s.getIndexReader());
int count = 0;
while (scorer.next()) {
  count++;
}

Regards,
Paul Elschot
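For completeness, the HitCollector variant Paul mentions first might look like the following sketch (scores are still computed here, but nothing is sorted or stored; s and query are the same IndexSearcher and Query as above):

    final int[] count = new int[1];
    s.search(query, new org.apache.lucene.search.HitCollector() {
      public void collect(int doc, float score) {
        count[0]++;               // just count, never touch the document
      }
    });
    System.out.println("matching docs: " + count[0]);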
Re: StackOverflowError while calling IndexReader.deleteDocuments(new Term())
Yonik, Thanks for the pointer. I'll try the nightly build once the change is committed. Suman On 11/27/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 11/27/06, Suman Ghosh <[EMAIL PROTECTED]> wrote: > Here are the values: > > mergeFactor=10 > maxMergeDocs=10 > minMergeDocs=100 > > And I see your point. At the time of the crash, I have over 5000 > segments. I'll try some conservative number and try to rebuild the > index. Although I don't see how those settings can produce 5000 segments, I've developed a non-recursive patch you might want to try: https://issues.apache.org/jira/browse/LUCENE-729 The patch is to the Lucene trunk (current devel version), so if you want to stick with Lucene 2.0, you might have to patch by hand. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: RAMDirectory vs MemoryIndex
On Nov 27, 2006, at 9:57 AM, jm wrote: On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: On Nov 26, 2006, at 8:57 AM, jm wrote: > I tested this. I use a single static analyzer for all my documents, > and the caching analyzer was not working properly. I had to add a > method to clear the cache each time a new document was to be indexed, > and then it worked as expected. I have never looked into lucenes inner > working so I am not sure if what I did is correct. Makes sense, I've now incorporated that as well by adding a clear() method and extracting the functionality into a public class AnalyzerUtil.TokenCachingAnalyzer. yes, same here, I could have posted my code, sorry, but I was not sure if it was even correct... When theres is a new lucene 2.1 or whatever I'll incorporate to that optimization into my code. thanks Actually, now I'm considering reverting back to the version without a public clear() method. The rationale is that this would be less complex and more consistent with the AnalyzerUtil design (simple methods generating simple anonymous analyzer wrappers). If desired, you can still (re)use a single static "child" analyzer instance. It's cheap and easy to create a new caching analyzer on top of the static analyzer, and to do so before each document. The old one will simply be gc'd. Let me know if that'd work for you. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
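In code, the pattern Wolfgang describes would be roughly the following (a sketch only; CHILD stands for your single shared analyzer, getTokenCachingAnalyzer is the AnalyzerUtil method discussed earlier in the thread, and writer/doc are the IndexWriter and Document already in hand):

    // One shared child analyzer, reused for every document.
    static final Analyzer CHILD = new StandardAnalyzer();

    // Per document: wrap the shared child in a fresh caching analyzer.
    // The wrapper is cheap to create and is simply garbage collected afterwards,
    // so no clear() method is needed.
    Analyzer caching = AnalyzerUtil.getTokenCachingAnalyzer(CHILD);
    writer.addDocument(doc, caching);   // first pass analyzes and caches the tokens
    // ... reuse 'caching' wherever the same tokens are needed again, e.g. when
    // feeding a MemoryIndex for the same document (subject to the Reader
    // equals()/hashCode() caveat quoted above) ...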
Re: Searching by bit masks
Well, you really have the code already . From the top... 1> there's no good way to support searching bitfields If you wanted, you could probably store it as a small integer and then search on it, but that's waaay too complicated than you want. 2> Add the fields like you have the snippet from, something like Document doc = new Document. if (bitsfromdb & 1) { doc.add("sport", "y"); } if (bitsfromdb & 2) { doc.add("music", "y"); } . . . IndexWriter.add(doc); Now, when searching, search on things like new Term("sport", "y")). and you'll only get the documents that correspond to the 2s bit being set. Watch out for capitalization. Y may not be equivalent to y. It depends on the analyzer you use at index AND search time. You can or as many of these together as you want. In your example, you could have up to 4 sub-clauses just for the bitmask-equivalents. NOTE: the documents won't all have the same fields. A document may not have, for instance, the "sports" field. This is OK in Lucene, but not the first thing folks with their DB hat on think of Get a copy of Luke (google lucene luke) and get familiar with it for examining your index and the effects of various analyzers. Really, really, really get a copy of Luke. Really. Do you have a copy of "Lucene In Action"? If not, I highly recommend it. It has tons of useful examples as well as a good introduction to many of the concepts. It's written to the 1.4 codebase, so be warned that there are some incompatibilities that are, for the most part, minor. Best Erick On 11/27/06, Biggy <[EMAIL PROTECTED]> wrote: i have the same problem here. I have an interest bit field, which i receive from the applciation backend. I have control over how the docuemtns are built. To be specific, the field looks like this: ID: interest 1 : sport 2 : music 4 : film 8 : clubs So someone interested in sports and music can be found by "interest & 3" => e.g. when using SQL. I do not wish to Post-Filter the results On to Lucene, Is there a filter which supports this kind of query ? Someone suggested splitting the bits into fields: > Document doc = new Document(); > doc.add("flag1", "Y"); > doc.add("flag2", "Y"); > IndexWriter.add(doc); Is this helpful at all ? Code would be helpful too as i am a newbie ltaylor.employon wrote: > > Hello, > > I am currently evaluating Lucene to see if it would be appropriate to > replace my company's current search software. So far everything has been > looking great, however there is one requirement that I am not too > certain about. > > What we need to do is to be able to store a bit mask specifying various > filter flags for a document in the index and then search this field by > specifying another bit mask with desired filters, returning documents > that have any of the specified flags set. In other words, we are doing a > bitwise OR on the stored filter bit mask and the specified filter bit > mask and if it is non-zero, we want to return the document. > > Before I started toying around with various options myself, I wanted to > see if any of you good folks in the Lucene community had some > suggestions for an efficient way to implement this. > > We currently need to index ~8,000,000 documents. We have several filter > flag fields, the most important of which currently has 7 possible flags > with any combination of the flags being valid. The number of flags is > expected to increase rather rapidly in the near future. 
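Erick's snippet is pseudocode; spelled out against the 2.0 Field API it might look like the sketch below. bitsFromDb is assumed to be the int flag value coming from the backend, with the sport/music bits from the earlier table; using an untokenized field also sidesteps the capitalization/analyzer issue he mentions.

    // Indexing: one flag field per set bit. doc and writer are the usual
    // Document and IndexWriter.
    Document doc = new Document();
    if ((bitsFromDb & 1) != 0) {
        doc.add(new Field("sport", "y", Field.Store.NO, Field.Index.UN_TOKENIZED));
    }
    if ((bitsFromDb & 2) != 0) {
        doc.add(new Field("music", "y", Field.Store.NO, Field.Index.UN_TOKENIZED));
    }
    // ... one field per remaining flag ...
    writer.addDocument(doc);

    // Searching: "any of sport or music" becomes an OR over the flag fields.
    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("sport", "y")), BooleanClause.Occur.SHOULD);
    q.add(new TermQuery(new Term("music", "y")), BooleanClause.Occur.SHOULD);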
> > My preemptive thanks for your suggestions, > > > Lawrence Taylor > Senior Software Engineer > Employon > Message was edited by: ltaylor.employon > > > -- View this message in context: http://www.nabble.com/Searching-by-bit-masks-tf2603918.html#a7564237 Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: RAMDirectory vs MemoryIndex
yes that would be ok for my, as long as I can reuse my child analyzer. On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: On Nov 27, 2006, at 9:57 AM, jm wrote: > On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: >> >> On Nov 26, 2006, at 8:57 AM, jm wrote: >> >> > I tested this. I use a single static analyzer for all my documents, >> > and the caching analyzer was not working properly. I had to add a >> > method to clear the cache each time a new document was to be >> indexed, >> > and then it worked as expected. I have never looked into lucenes >> inner >> > working so I am not sure if what I did is correct. >> >> Makes sense, I've now incorporated that as well by adding a clear() >> method and extracting the functionality into a public class >> AnalyzerUtil.TokenCachingAnalyzer. > yes, same here, I could have posted my code, sorry, but I was not > sure if it was even correct... > When theres is a new lucene 2.1 or whatever I'll incorporate to that > optimization into my code. thanks Actually, now I'm considering reverting back to the version without a public clear() method. The rationale is that this would be less complex and more consistent with the AnalyzerUtil design (simple methods generating simple anonymous analyzer wrappers). If desired, you can still (re)use a single static "child" analyzer instance. It's cheap and easy to create a new caching analyzer on top of the static analyzer, and to do so before each document. The old one will simply be gc'd. Let me know if that'd work for you. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: RAMDirectory vs MemoryIndex
Ok. I reverted back to the version without a public clear() method. Wolfgang. On Nov 27, 2006, at 12:17 PM, jm wrote: yes that would be ok for my, as long as I can reuse my child analyzer. On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: On Nov 27, 2006, at 9:57 AM, jm wrote: > On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote: >> >> On Nov 26, 2006, at 8:57 AM, jm wrote: >> >> > I tested this. I use a single static analyzer for all my documents, >> > and the caching analyzer was not working properly. I had to add a >> > method to clear the cache each time a new document was to be >> indexed, >> > and then it worked as expected. I have never looked into lucenes >> inner >> > working so I am not sure if what I did is correct. >> >> Makes sense, I've now incorporated that as well by adding a clear() >> method and extracting the functionality into a public class >> AnalyzerUtil.TokenCachingAnalyzer. > yes, same here, I could have posted my code, sorry, but I was not > sure if it was even correct... > When theres is a new lucene 2.1 or whatever I'll incorporate to that > optimization into my code. thanks Actually, now I'm considering reverting back to the version without a public clear() method. The rationale is that this would be less complex and more consistent with the AnalyzerUtil design (simple methods generating simple anonymous analyzer wrappers). If desired, you can still (re)use a single static "child" analyzer instance. It's cheap and easy to create a new caching analyzer on top of the static analyzer, and to do so before each document. The old one will simply be gc'd. Let me know if that'd work for you. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Querying performance decrease in 1.9.1 and 2.0.0
Stanislav, On Wednesday 22 November 2006 09:52, Stanislav Jordanov wrote: > Paul, > We are working on delivering the next release by the end of the week so > I have to take care of 2 or 3 issues before I try the nightly build. > I promise to try it and report the results here. I have made a first attempt at restoring the old query performance here: http://issues.apache.org/jira/browse/LUCENE-730 Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: StackOverflowError while calling IndexReader.deleteDocuments(new Term())
Suman Ghosh wrote: On 11/27/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 11/27/06, Suman Ghosh <[EMAIL PROTECTED]> wrote: > Here are the values: > > mergeFactor=10 > maxMergeDocs=10 > minMergeDocs=100 > > And I see your point. At the time of the crash, I have over 5000 > segments. I'll try some conservative number and try to rebuild the > index. Although I don't see how those settings can produce 5000 segments, I've developed a non-recursive patch you might want to try: https://issues.apache.org/jira/browse/LUCENE-729 Suman, I'd really like to understand how you're getting so many segments in your index. Is this (getting 5000 segments) easy to reproduce? Are you closing / reopening your writer every so often (eg to delete documents or something)? Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Searching by bit masks
Erick Erickson wrote: Well, you really have the code already . From the top... 1> there's no good way to support searching bitfields If you wanted, you could probably store it as a small integer and then search on it, but that's waaay too complicated than you want. 2> Add the fields like you have the snippet from, something like Document doc = new Document. if (bitsfromdb & 1) { doc.add("sport", "y"); } if (bitsfromdb & 2) { doc.add("music", "y"); } Beware that if there are a large number of bits, this is going to impact memory usage due to there being more fields. Perhaps a better way would be to use a single "bits" field and store the words "sport", "music", ... in that field. Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW 2007, AustraliaPh: +61 2 9280 0699 Web: http://nuix.com/ Fax: +61 2 9212 6902 This message is intended only for the named recipient. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this message or attachment is strictly prohibited. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
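Daniel's single-field variant might be sketched like this, assuming the "bits" field is analyzed by something simple such as WhitespaceAnalyzer so that each flag word becomes its own term (doc and bitsFromDb as in the earlier sketch):

    // Indexing: one field holding a space-separated word per set flag.
    StringBuffer bits = new StringBuffer();
    if ((bitsFromDb & 1) != 0) bits.append("sport ");
    if ((bitsFromDb & 2) != 0) bits.append("music ");
    if ((bitsFromDb & 4) != 0) bits.append("film ");
    if ((bitsFromDb & 8) != 0) bits.append("clubs ");
    doc.add(new Field("bits", bits.toString(), Field.Store.NO, Field.Index.TOKENIZED));

    // Searching: "any of these flags" is an OR over terms in the single field.
    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("bits", "sport")), BooleanClause.Occur.SHOULD);
    q.add(new TermQuery(new Term("bits", "music")), BooleanClause.Occur.SHOULD);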
Re: StackOverflowError while calling IndexReader.deleteDocuments(new Term())
Mike, I've not tried it yet, but I think the problem can be reproduced. However, it'll take a few hours to reach that threshhold since my code also needs to extract text from some very large PDF documents to store in the index. I'll post the pseudo-code of my code tomorrow. Maybe that'll help point to mistakes I'm making in the logic. Suman On 11/27/06, Michael McCandless <[EMAIL PROTECTED]> wrote: Suman Ghosh wrote: > On 11/27/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: >> On 11/27/06, Suman Ghosh <[EMAIL PROTECTED]> wrote: >> > Here are the values: >> > >> > mergeFactor=10 >> > maxMergeDocs=10 >> > minMergeDocs=100 >> > >> > And I see your point. At the time of the crash, I have over 5000 >> > segments. I'll try some conservative number and try to rebuild the >> > index. >> >> Although I don't see how those settings can produce 5000 segments, >> I've developed a non-recursive patch you might want to try: >> https://issues.apache.org/jira/browse/LUCENE-729 Suman, I'd really like to understand how you're getting so many segments in your index. Is this (getting 5000 segments) easy to reproduce? Are you closing / reopening your writer every so often (eg to delete documents or something)? Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]