Re: Index updates between machines

2007-04-06 Thread Chun Wei Ho
Thanks for the ideas. We are testing the suggested methods and changes to see if they work with our current setup, and are checking whether the disks are the bottleneck in this case, but feel free to drop more hints. :) At the moment we are copying the index at an off-peak hour, but we would also

Re: search on colon ":" ending words

2007-04-06 Thread Felix Litman
We ended up using String newquery = query.replace(": ", " ") (the first argument is a ":" with a space after it, in quotes). It worked great. Now results come back even if you use a colon in the query. And one can still use ":" as a special operator if there is no space afterwards. Great suggestion. Thanks! Felix --
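
A minimal sketch of the approach described, assuming (since the preview is garbled) that the intent is to drop a colon only when a space follows it, i.e. a word-ending colon, so that field:term syntax keeps working:

    // Hypothetical helper; strips word-ending colons so QueryParser
    // no longer sees them as (empty) field specifiers.
    // "boston: hotels" -> "boston hotels", but "title:lucene" is untouched.
    public static String stripWordEndingColons(String query) {
        return query.replace(": ", " ");
    }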

Re: Out of memory exception for big indexes

2007-04-06 Thread Bublic Online
Hi Ivan, Chris and all! I'm the contributor of LUCENE-769 and I recommend it too :) The OutOfMemory error was one of the main reasons I made it. Regards, Artem Vasiliev On 4/6/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: : The problem I suspect is the sorting. As I understand, Lucene : bu

Re: Out of memory exception for big indexes

2007-04-06 Thread Otis Gospodnetic
Craig, This just shows you that the JVM OOMed while running that particular method, and does not necessarily mean that that method is what's consuming your RAM. Run your app and, if you are using Java 1.5/1.6, run jmap against that java process and tell it to show you how much memory objects are
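
For reference, the jmap histogram being suggested can be taken like this on a Sun JDK 5/6 (pid is the id of the running Java process; the output lists per-class instance counts and byte totals, largest first):

    jmap -histo <pid>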

Re: Explanation from FunctionQuery

2007-04-06 Thread Annona Keene
And it was as easy as all that... Thanks. - Original Message From: Chris Hostetter <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, April 6, 2007 12:23:30 PM Subject: Re: Explanation from FunctionQuery : So we reach a problem at extractTerms. I get an explanation no

Re: Out of memory exception for big indexes

2007-04-06 Thread Chris Hostetter
: The problem I suspect is the sorting. As I understand, Lucene : builds internal caches for sorting and I suspect that this is the root : of your problem. You can test this by trying your problem queries : without sorting. If sorting really is the cause of your problems, you may want to try out
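
A sketch of the A/B test being suggested, against the Lucene 2.x API (the field names and query are placeholders):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    Query query = new TermQuery(new Term("contents", "foo")); // placeholder

    // A: plain relevance-ranked search -- no sort cache involved
    TopDocs plain = searcher.search(query, null, 100);

    // B: the same search with the suspect sort; this forces Lucene to
    // build its internal field cache for "myfield" -- if only B blows up,
    // sorting is the culprit
    TopFieldDocs sorted = searcher.search(query, null, 100, new Sort("myfield"));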

Re: Out of memory exception for big indexes

2007-04-06 Thread Chris Hostetter
: Would it be fair to say that you can expect OutOfMemory errors if you : run complex queries? ie sorts, boosts, weights... not intrinsically ... the amount of memory used has more to do with the size of the index and the sorting done than it does with the number of clauses in your query (of course

Re: Explanation from FunctionQuery

2007-04-06 Thread Chris Hostetter
: So we reach a problem at extractTerms. I get an explanation no problem ... : I'm using the version of FunctionQuery from the JIRA attachment. That seems like the heart of the problem ... I haven't looked at the version in JIRA for a while, but the version committed into Solr does prov

Re: indexing and searching real numbers

2007-04-06 Thread Yonik Seeley
On 4/5/07, Leon <[EMAIL PROTECTED]> wrote: I need to index and search real numbers in Lucene. I found the NumberUtils class in the Solr project, which permits one to encode doubles into strings so that alphanumeric ordering corresponds correctly to the ordering on numbers. When I use ConstantScoreRan
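
A sketch of the encode-then-range-search idea, with two loud assumptions: the NumberUtils method name (double2sortableStr) is recalled from the Solr 1.x utility class, and ConstantScoreRangeQuery lived in Solr's org.apache.solr.search package at the time (it later moved into Lucene core):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.search.Query;
    import org.apache.solr.search.ConstantScoreRangeQuery;
    import org.apache.solr.util.NumberUtils;

    // Index the double in its sortable-string form; UN_TOKENIZED keeps the
    // encoded value intact (no analyzer touches it)
    Document doc = new Document();
    doc.add(new Field("price", NumberUtils.double2sortableStr(12.5),
            Field.Store.NO, Field.Index.UN_TOKENIZED));

    // Search: encode both endpoints the same way, so plain lexicographic
    // range comparison matches the numeric range 10.0 <= price <= 20.0
    Query q = new ConstantScoreRangeQuery("price",
            NumberUtils.double2sortableStr(10.0),
            NumberUtils.double2sortableStr(20.0),
            true, true);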

Re: Out of memory exception for big indexes

2007-04-06 Thread Craig W Conway
Would it be fair to say that you can expect OutOfMemory errors if you run complex queries? ie sorts, boosts, weights... My query looks like this: +(pathNodeId_2976569:1^5.0 pathNodeId_2976969:1 pathNodeId_2976255:1 pathNodeId_2976571:1) +(pathClassId:1 pathClassId:346 pathClassId:314) -id:369

Re: UN_TOKENIZED and StandardAnalyzer

2007-04-06 Thread Roberto Fonti
Thanks Erick for your help. Actually I was already using Luke! The only thing I was missing was the possibility of using different Analyzers at the same time, with PerFieldAnalyzerWrapper. Thank you again. Best, Roberto Erick Erickson wrote: Really, really, really get a copy of Luke. Rea
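
For reference, the PerFieldAnalyzerWrapper setup being described looks roughly like this in the Lucene 2.x API (the field names and analyzer choices are illustrative assumptions):

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    // StandardAnalyzer for most fields, but leave CATEGORY alone so queries
    // match the UN_TOKENIZED values written at index time
    PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzer.addAnalyzer("CATEGORY", new KeywordAnalyzer());

    QueryParser parser = new QueryParser("CONTENT", analyzer);
    Query q = parser.parse("CATEGORY:\"Home Appliances\""); // throws ParseException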

Re: luke v0.7 and SnowBallAnalyzer

2007-04-06 Thread Daniel Naber
On Thursday 05 April 2007 17:07, Paul Hermans wrote: > I do receive the message > "java.lang.ClassNotFound: > net.sf.snowball.ext.GermansStemmer". This class is not part of the lukeall-0.7.jar, but it's in lucene-snowball-2.1.0.jar (which you can find on the Luke homepage). You will then need t

Re: IndexReader.deleteDocuement(); How to use it with our code??

2007-04-06 Thread Otis Gospodnetic
Hi Ratnesh, 1. There is no need to use that many question marks, really. 2. Use java-user list, not java-dev 3. You cannot delete using negative criteria. You can delete 1 Document using its document ID, or you can delete 1 or more Documents using a Term where you specify a field name and a val
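
A minimal Lucene 2.x sketch of the Term-based deletion described in point 3 (the field name and value are placeholders):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    IndexReader reader = IndexReader.open("/path/to/index");
    // Deletes every document whose "id" field contains "42"; there is no
    // way to say "delete everything that does NOT match X"
    int deleted = reader.deleteDocuments(new Term("id", "42"));
    reader.close(); // deletions are committed when the reader is closed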

Re: Out of memory exception for big indexes

2007-04-06 Thread Otis Gospodnetic
Ivane, Sorts will eat your memory, and how much they use depends on what you store in them: ints, Strings, floats... A profiler like JProfiler will tell you what's going on and who's eating your memory. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com

IndexModifier's docCount is inconsistent

2007-04-06 Thread Cheolgoo Kang
When we use IndexModifier's docCount() method, it calls its underlying IndexReader's numDocs() or IndexWriter's docCount() method. The problem is that IndexReader.numDocs() accounts for deleted documents, but IndexWriter.docCount() ignores them. So, I've made some modifications in IndexWriter.
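
A sketch of the inconsistency, against the Lucene 2.x API (paths and field names are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Suppose the index holds N documents; delete one of them:
    IndexReader reader = IndexReader.open("/path/to/index");
    reader.deleteDocuments(new Term("id", "1"));
    reader.close();

    IndexWriter writer = new IndexWriter("/path/to/index",
            new StandardAnalyzer(), false);
    System.out.println(writer.docCount()); // still N -- ignores the deletion
    writer.close();

    IndexReader reader2 = IndexReader.open("/path/to/index");
    System.out.println(reader2.numDocs()); // N - 1 -- respects the deletion
    reader2.close();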

Re: Out of memory exception for big indexes

2007-04-06 Thread Erick Erickson
I can only shed a little light on a couple of points, see below. On 4/6/07, Ivan Vasilev <[EMAIL PROTECTED]> wrote: Hi All, I have the following problem - we have OutOfMemoryException when searching on the indexes that are of size 20 - 40 GB and contain 10 - 15 million docs. When we make searc

Re: UN_TOKENIZED and StandardAnalyzer

2007-04-06 Thread Erick Erickson
Really, really, really get a copy of Luke. Really. Use it to open your index and run experimental queries, especially to see how they get rewritten (but be sure to pick the appropriate analyzer). Google "lucene luke". Really, really get a copy. It'll help you make MUCH faster progress than waitin

Re: Lucene for name matching

2007-04-06 Thread moraleslos
Thanks guys! I really really appreciate your feedback. I didn't know a "simple" problem like people name matching would be this complicated. I knew there would be some unusual circumstances or rules, but I did not realize how much work has been done to solve parts of the problem (string matching

Re: Explanation from FunctionQuery

2007-04-06 Thread Annona Keene
Ok, glossing over some of the details was not the best idea. ms is a MultiSearcher, in the sense that it's something I wrote that extends MultiSearcher. And this part I should have mentioned before: the explain method being called is the one in org.apache.lucene.search.Searcher. So explain is public E
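
For reference, the inherited method in question can be called directly on the searcher; a minimal sketch (the query and doc id are placeholders):

    import org.apache.lucene.search.Explanation;

    // 'ms' is the MultiSearcher subclass mentioned above; explain(Query, int)
    // is inherited from org.apache.lucene.search.Searcher
    Explanation expl = ms.explain(query, docId);
    System.out.println(expl.toString());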

Re: Lucene for name matching

2007-04-06 Thread Grant Ingersoll
I agree, SecondString was helpful to me. Also have a look at William Winkler's work at the US Census. We did similar things to come up with blocking criteria to get an initial division into duplicates, unique and undecided. Then we refined on the undecided set. No approach is going to b

Out of memory exception for big indexes

2007-04-06 Thread Ivan Vasilev
Hi All, I have the following problem - we have OutOfMemoryException when searching on indexes that are 20 - 40 GB in size and contain 10 - 15 million docs. When we make searches we perform a query that matches all the results but we DO NOT fetch all the results - we fetch 100 of them. We also
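
For context, "fetch 100 of them" looks something like this in the Lucene 2.x API (the query is a placeholder):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    // Only the top 100 hits are materialized, however many documents match
    TopDocs top = searcher.search(query, null, 100);
    for (ScoreDoc sd : top.scoreDocs) {
        // sd.doc is the internal document id, sd.score its relevance score
    }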

UN_TOKENIZED and StandardAnalyzer

2007-04-06 Thread Roberto Fonti
Hi All, I'm indexing categories with this code: for (Category category : item.getCategories()) { lucene_doc.add(new Field("CATEGORY", category.getName(), Field.Store.NO, Field.Index.UN_TOKENIZED)); } And searching using the query: Str
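
One standard way to hit an UN_TOKENIZED field without an analyzer getting in the way is a hand-built TermQuery, which bypasses query parsing and analysis entirely; a minimal sketch:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // The value must match the indexed value exactly (including case),
    // because UN_TOKENIZED fields are written without any analysis
    Query q = new TermQuery(new Term("CATEGORY", "Home Appliances"));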

Re: Lucene for name matching

2007-04-06 Thread eks dev
I've been doing this for the past couple of years, and yes, we use Lucene for some key parts of the problem. Basically, the problem you face is how to get extremely high recall without compromising precision. Hard! The key problem is performance: imagine you have a DB with 10Mio persons you need to