Re: WildCard search replacement

2005-04-20 Thread Volodymyr Bychkoviak
I used it to measure speed, but I was planning to use it in a file search application, where you need wildcard searches like *.txt and so on. The thing is that the file search application is not my primary job, so I will tune it later. This is just an example to give you an idea of how it can work. reg

Re: numDocs method of IndexReader

2005-04-20 Thread Otis Gospodnetic
--- Tomcat Programmer <[EMAIL PROTECTED]> wrote:
>
> Hi Otis,
>
> Thanks for your answer on the integer issue. I was not sure if the index was actually limited, or if it was just the numDocs method call. I guess it really does not matter which it is; and for me, I don't think my index w

Re: Lucene bulk indexing

2005-04-20 Thread Otis Gospodnetic
That sounds way too long, unless you have veeery slow disks, veeery large Documents (long fields that you analyze, index, and store in Lucene), or some such. If you have very long fields you could try setting http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#

Re: fields that are indexed as UnStored

2005-04-20 Thread Chuck Williams
Omar Didi writes (4/20/2005 5:05 PM):
Hi guys, If a field is indexed as UnStored, how can I get its value? I tried document.get("UnStored_field"), but it returns null.
You didn't store it, so it's not there. If the field happens to be a single Term, you might be able to find it in the index, expensiv

Re: Lucene bulk indexing

2005-04-20 Thread Aalap Parikh
Hi, I have similar issues with indexing time. I am doing a SELECT from a database and getting back 10,000 rows. I then index each row, so I end up with 10,000 documents in my Lucene index. Each doc has 27 fields. I added some timing code to my indexing process. The DB select call takes a
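
The kind of timing code mentioned above might look like the following minimal sketch. PhaseTimer is a hypothetical helper written for this illustration, not part of Lucene, and the phase names used with it are placeholders. It only accumulates wall-clock time per named phase, so that the database SELECT, the document building, and the index writes can be compared.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Lucene): accumulates wall-clock time per named phase. */
final class PhaseTimer {
    private final Map<String, Long> totalsNanos = new LinkedHashMap<>();

    /** Runs the task and adds its elapsed time to the running total for the phase. */
    void time(String phase, Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            // Record the time even if the task throws.
            totalsNanos.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total time recorded for a phase in milliseconds, or 0 if the phase never ran. */
    long totalMillis(String phase) {
        return totalsNanos.getOrDefault(phase, 0L) / 1_000_000L;
    }

    /** Phase names in the order they were first timed. */
    Iterable<String> phases() {
        return totalsNanos.keySet();
    }
}
```

Wrapping the SELECT once and the per-row work inside the loop, for example timer.time("select", ...) and timer.time("addDocument", ...), gives per-phase totals that show where the time actually goes. A profiler gives a finer picture, but this is often enough to tell the database apart from the indexer.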

fields that are indexed as UnStored

2005-04-20 Thread Omar Didi
Hi guys, If a field is indexed as UnStored, how can I get its value? I tried document.get("UnStored_field"), but it returns null.
thanks
-Original Message-
From: Kevin L. Cobb [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 20, 2005 8:52 AM
To: java-user@lucene.apache.org
Subject: RE: Best way

RE: Best way to purposely corrupt an index?

2005-04-20 Thread Kevin L. Cobb
My policy on this type of exception handling is to only bite off what you can chew. If you catch an IOException, then you simply report to the user that an unexpected error has occurred and the search engine is unavailable at the moment. Errors should be logged and developers should look at the sp
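
A minimal sketch of that policy, using only the JDK. SafeSearch and its IndexSearch interface are hypothetical names introduced for this illustration; IndexSearch stands in for whatever call actually queries the index. The IOException is logged with its stack trace for developers, and the user only sees a generic message.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical wrapper: log the failure for developers, show the user a generic message. */
final class SafeSearch {
    /** Stand-in for the real index lookup, which may fail with an IOException. */
    interface IndexSearch {
        String run(String query) throws IOException;
    }

    static final String UNAVAILABLE =
            "An unexpected error occurred; search is unavailable at the moment.";

    private static final Logger LOG = Logger.getLogger(SafeSearch.class.getName());

    /** Returns the search result, or the generic message if the index could not be read. */
    static String search(IndexSearch index, String query) {
        try {
            return index.run(query);
        } catch (IOException e) {
            // No attempt to diagnose the cause here; the log entry is for later inspection.
            LOG.log(Level.SEVERE, "Index search failed for query: " + query, e);
            return UNAVAILABLE;
        }
    }
}
```

The design choice is deliberate: the catch block does not try to tell a corrupt index from a permissions problem. It only keeps the application responsive and leaves the diagnosis to whoever reads the log.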

Re: Passing XML objects to the analyzer ?

2005-04-20 Thread Paul Libbrecht
I don't agree with this if the query is expected to contain the same "text-encoding" as the content being analyzed. So one example would be matching f is continuous since it is the product of g and x |-> x^2 (in "email notation", we work with a semantic encoding) This combination of text and

Re: Scoring, cosine measure

2005-04-20 Thread Andrzej Bialecki
Daniel Naber wrote:
On Wednesday 20 April 2005 18:22, Paul Elschot wrote:
Has anyone tried an index based on n-grams? Nutch has bigrams for phrases with frequently occurring words.
Also the spell checker in SVN uses n-grams I think.
Yes, but Nutch uses word n-grams, whereas the spell checker use
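
To illustrate the distinction being drawn, here is a small sketch of the two kinds of n-grams. It assumes the truncated sentence goes on to contrast word n-grams with character (letter) n-grams. NGrams is a hypothetical class written for this example and is not the Nutch or spell checker code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of word n-grams versus character n-grams. */
final class NGrams {
    /** Word n-grams: a sliding window over whitespace-separated words. */
    static List<String> wordNGrams(String text, int n) {
        requirePositive(n);
        String trimmed = text.trim();
        String[] words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }

    /** Character n-grams: a sliding window over the characters of one word. */
    static List<String> charNGrams(String word, int n) {
        requirePositive(n);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    private static void requirePositive(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
    }
}
```

For example, wordNGrams("the quick brown fox", 2) yields "the quick", "quick brown", "brown fox", while charNGrams("lucene", 3) yields "luc", "uce", "cen", "ene". Word n-grams help with phrases; character n-grams help with matching misspelled or partial words.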

Re: Scoring, cosine measure

2005-04-20 Thread David Spencer
Daniel Naber wrote:
On Wednesday 20 April 2005 18:22, Paul Elschot wrote:
Has anyone tried an index based on n-grams? Nutch has bigrams for phrases with frequently occurring words.
Also the spell checker in SVN uses n-grams I think.
SVN here: http://svn.apache.org/repos/asf/lucene/java/trunk/co

Re: Scoring, cosine measure

2005-04-20 Thread Daniel Naber
On Wednesday 20 April 2005 18:22, Paul Elschot wrote:
>
> Has anyone tried an index based on n-grams?
>
> Nutch has bigrams for phrases with frequently occurring words.
Also the spell checker in SVN uses n-grams I think.
Regards
Daniel
-- 
http://www.danielnaber.de

RE: What is going on with subversion.

2005-04-20 Thread Peter Veentjer - Anchor Men
You are right.
From: Volodymyr Bychkoviak [mailto:[EMAIL PROTECTED]
Sent: Wed 20-4-2005 18:12
To: java-user@lucene.apache.org
Subject: Re: What is going on with subversion.
IMHO QueryParser.DEFAULT_OPERATOR_AND and QueryParser.DEFAULT_OPERATOR_OR should be

Re: WildCard search replacement

2005-04-20 Thread Aalap Parikh
Hi,
> Also this analyzer is not used in any application, I wrote it only to measure search speed.
So you don't use the method you described for your wildcard search trick?
Thanks, Aalap.
- To unsubscribe, e-mail: [EMAIL PR

Re: Scoring, cosine measure

2005-04-20 Thread Paul Elschot
On Wednesday 20 April 2005 14:04, Barbara Krausz wrote:
> Hi,
> currently I'm writing my bachelor's thesis about Lucene. I searched for theoretical information, for example about the IR model Lucene uses, but I couldn't find anything, so I had to figure it out on my own.
> I think Lucene uses the

Re: What is going on with subversion.

2005-04-20 Thread Volodymyr Bychkoviak
IMHO QueryParser.DEFAULT_OPERATOR_AND and QueryParser.DEFAULT_OPERATOR_OR should be used instead of QueryParser.AND and QueryParser.OR.
Peter Veentjer - Anchor Men wrote:
package com.jph.lucene.parsers;
/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Versio

Re: What is going on with subversion.

2005-04-20 Thread Erik Hatcher
On Apr 20, 2005, at 9:47 AM, Peter Veentjer - Anchor Men wrote:
I guess and hope the original MultifieldQueryParser in Lucene 2.0 will be of better design.
MFQP in Subversion is what you'll get unless someone supplies patches to improve it.
Erik

RE: What is going on with subversion.

2005-04-20 Thread Peter Veentjer - Anchor Men
The MultiFieldQueryParser is terribly designed. It looks like it extends the QueryParser, but it doesn't. There are only a few static methods that restrict the functionality of the QueryParser a lot. That is why I have created this util class, that does exactly the same job and has a few extra fea

Re: What is going on with subversion.

2005-04-20 Thread Volodymyr Bychkoviak
Thanks.
Peter Veentjer - Anchor Men wrote:
package com.jph.lucene.parsers;
/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the Licens

RE: What is going on with subversion.

2005-04-20 Thread Peter Veentjer - Anchor Men
package com.jph.lucene.parsers;
/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/

RE: Passing XML objects to the analyzer ?

2005-04-20 Thread Vanlerberghe, Luc
The problem with this approach is that the Analyzer you will use for indexing will be *very* different from the one used for searching. The way I see it, the Document objects passed to Lucene should contain fields that are as text-based as possible, comparable to what a user would type whi

Re: What is going on with subversion.

2005-04-20 Thread Volodymyr Bychkoviak
Sorry, I've already read about the servers moving. Can somebody mail me the latest MultiFieldQueryParser.java and the highlighting source code? I can't get them from Subversion and I need them urgently. Thanks in advance.
Regards, Volodymyr Bychkoviak
Volodymyr Bychkoviak wrote:
I can't connect to svn.ap

What is going on with subversion.

2005-04-20 Thread Volodymyr Bychkoviak
I can't connect to svn.apache.org. It seems that apache.org is down.

Scoring, cosine measure

2005-04-20 Thread Barbara Krausz
Hi, currently I'm writing my bachelor's thesis about Lucene. I searched for theoretical information, for example about the IR model Lucene uses, but I couldn't find anything, so I had to figure it out on my own. I think Lucene uses the vector space model with a variation of the cosine measure (cosine
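
For reference, the textbook cosine measure of the vector space model can be sketched as below. This is the plain formula, dot(a, b) / (|a| * |b|), over term-weight vectors; it is not Lucene's scoring code, which the post itself describes only as a variation. Cosine is a hypothetical class name and the term weights are whatever the caller supplies (raw term frequencies, tf-idf, and so on).

```java
import java.util.Map;

/** Hypothetical illustration of the textbook cosine measure; not Lucene's scoring code. */
final class Cosine {
    /** Cosine of the angle between two sparse term-weight vectors, or 0 if either is all zeros. */
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w; // only terms present in both vectors contribute
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}
```

With a query vector {lucene: 1, search: 1} and a document vector {lucene: 1, index: 1}, the dot product is 1 and both lengths are sqrt(2), so the similarity is 0.5.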

Re: WildCard search replacement

2005-04-20 Thread Volodymyr Bychkoviak
Aalap Parikh wrote:
Hi Volodymyr, About the trick you described about wildcard search replacement, you mentioned:
> So I found the following workaround. I index this field as a sequence of terms, each containing a single digit from the needed value. (For example I have “123214213” value that n
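
The quoted message is cut off, so the following is only a guess at the idea, not the author's actual analyzer. It assumes the value is split into one term per character and that a wildcard search such as *214* is then replaced by looking for the terms 2, 1, 4 in consecutive positions, which is what a phrase query would do. DigitTerms is a hypothetical class and models the matching with plain lists rather than a Lucene index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of the workaround: one term per character, substring search as a phrase match. */
final class DigitTerms {
    /** Splits a value such as "123214213" into single-character terms. */
    static List<String> toTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < value.length(); i++) {
            terms.add(String.valueOf(value.charAt(i)));
        }
        return terms;
    }

    /** True if the phrase terms occur at consecutive positions in the indexed terms. */
    static boolean containsPhrase(List<String> indexed, List<String> phrase) {
        return Collections.indexOfSubList(indexed, phrase) >= 0;
    }
}
```

Under these assumptions, containsPhrase(toTerms("123214213"), toTerms("214")) is true, because the terms 2, 1, 4 appear next to each other, while the same check for "999" is false. The appeal of the approach is that a phrase lookup over a ten-term vocabulary avoids enumerating every indexed term the way a leading wildcard does.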

Re: Best way to purposely corrupt an index?

2005-04-20 Thread Maik Schreiber
> It looks to me that if I do get an IOException, I will then have to perform a number of additional checks to eliminate the other possible causes of IOExceptions (such as permissions issues), and by a process of elimination, determine a corrupt index.
Slightly off-topic: That's exactly

Re: Lucene bulk indexing

2005-04-20 Thread Volodymyr Bychkoviak
Hi, The best way to determine bottlenecks is profiling. (JProfiler is a very good tool for that. It's a commercial product with a free evaluation.) I was indexing 1.5 million documents in 45 minutes. Before optimizing, it took much more time to index. Optimization was done through 'select' query changin

Re: Best way to purposely corrupt an index?

2005-04-20 Thread Andy Roberts
On Wednesday 20 Apr 2005 08:27, Maik Schreiber wrote:
> > As the index is rather critical to my program, I just wanted to make it really robust, and able to cope should a problem occur with the index itself. Otherwise, the user will be left with a non-functioning program with no explana

Re: Best way to purposely corrupt an index?

2005-04-20 Thread Maik Schreiber
> As the index is rather critical to my program, I just wanted to make it really robust, and able to cope should a problem occur with the index itself. Otherwise, the user will be left with a non-functioning program with no explanation. That's my reasoning anyway.
You should perhaps go

Re: Best way to purposely corrupt an index?

2005-04-20 Thread Andy Roberts
On Tuesday 19 Apr 2005 22:37, Daniel Herlitz wrote:
> I would suggest you simply do not create unusable indexes. :-)
I agree! :) I am obviously very confident that my application is building indexes correctly. I'm thinking of the rarer instances whereby user or system error has caused a proble