Re: how many size of the index is the lucene's limit on per server ?

2009-03-02 Thread Danil ŢORIN
It depends what you call a server:
- 4 dual Xeon, 64G RAM, 1TB of 15000 rpm raid10 hard-disks is one thing
- 1 P4, 512M RAM, 40G 5400 rpm hard-disk, Win2K is completely something else
It depends on index structure and the size of the documents you index/store. It depends on the way you query

Analyze other language using English Analyzer

2009-03-02 Thread Ganesh
Hello all, I am using the default English Snowball analyzer to index and search English documents. I may also need to index European and Chinese documents. What is the impact of using the English analyzer on European or Chinese language documents? Will I be able to index and search them as expected?

How to index Named Entities

2009-03-02 Thread Seid Mohammed
I want to index document contents in two ways: once as plain content, and once as named entities. The scenario is like this: if I have the document "the source of Nile is Ethiopia", then I want to index "source" as normal content, "Nile" as a river name, and "Ethiopia" as a country name. so that

how many size of the index is the lucene's limit on per server ?

2009-03-02 Thread buddha1021
Hi: what is the maximum index size Lucene can handle per server? I mean the size up to which search stays very fast and is not affected by the huge index. What is the per-server limit beyond which search speed drops? Does any expert have experience to te

Re: Indexing synonyms for multiple words

2009-03-02 Thread Sumukh
Thanks for your suggestion Michael and thanks to Uwe for clarifying. Payload is currently used to store only the start positions. What I gathered from your suggestion is that we could possibly store the end position, or span, or some other complex encoding in order to store the extra informati

Re: term position in phrase query using queryparser

2009-03-02 Thread Matt Ronge
On Feb 25, 2009, at 2:52 PM, Tim Williams wrote: Is there a syntax to set the term position in a query built with queryparser? For example, I would like something like: PhraseQuery q = new PhraseQuery(); q.add(t1, 0); q.add(t2, 0); q.setSlop(0); As I understand it, the slop defaults to 0, bu

Re: Confidence scores at search time

2009-03-02 Thread Ken Williams
On 3/2/09 4:23 PM, "Ken Williams" wrote: > On 3/2/09 1:58 PM, "Erik Hatcher" wrote: > >> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: >>> In the output, I get explanations like "0.88922405 = (MATCH) product >>> of:" >>> with no details. Perhaps I need to do something different in >>> ind

Re: Confidence scores at search time

2009-03-02 Thread Ken Williams
On 3/2/09 1:58 PM, "Erik Hatcher" wrote: > > On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: >> In the output, I get explanations like "0.88922405 = (MATCH) product >> of:" >> with no details. Perhaps I need to do something different in >> indexing? > > Explanation.toString() only returns t

Re: Confidence scores at search time

2009-03-02 Thread Ken Williams
On 3/2/09 4:19 PM, "Steven A Rowe" wrote: > On 3/2/2009 at 4:22 PM, Grant Ingersoll wrote: >> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: >>> Also, while perusing the threads you refer to below, I saw a >>> reference to the following link, which seems to have gone dead: >>> >>> https://i

RE: Confidence scores at search time

2009-03-02 Thread Steven A Rowe
On 3/2/2009 at 4:22 PM, Grant Ingersoll wrote: > On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: > > Also, while perusing the threads you refer to below, I saw a > > reference to the following link, which seems to have gone dead: > > > > https://issues.apache.org/bugzilla/show_bug.cgi?id=31841 >

Re: Faceted Search using Lucene

2009-03-02 Thread Michael McCandless
So then all is good. We were only pursuing this to explain it. Now that we know your directories are empty, that explains it. So you should call maybeReopen() inside get(), as long as it does not slow queries down. Mike Amin Mohammed-Coleman wrote: I think that is the case. When my

Re: Faceted Search using Lucene

2009-03-02 Thread Amin Mohammed-Coleman
I think that is the case. When my SearchManager is initialised the directories are empty, so when I do a get() nothing is present. Subsequent calls seem to work. Is there something I can do? Or do I accept this, or just do a maybeReopen() and then a get()? As you mentioned it depends on timing but I

Re: Confidence scores at search time

2009-03-02 Thread Grant Ingersoll
On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: Hi Grant, It's true, I may have an X-Y problem here. =) My basic need is to sacrifice recall to achieve greater precision. Rather than always presenting the user with the top N documents, I need to return *only* the documents that seem rele

Re: Extracting TFIDF vectors

2009-03-02 Thread Grant Ingersoll
Have a look at the MoreLikeThis contrib module in the contrib section of Lucene. You can start with that, and then do the additions and subtractions from there. On Mar 2, 2009, at 9:35 AM, Gregory Gay wrote: Hi, I'm a complete novice at Lucene, and I'm looking for a little bit of help

Re: Faceted Search using Lucene

2009-03-02 Thread Michael McCandless
Well the code looks fine. I can't explain why you see no search results if you don't call maybeReopen() in get, unless at the time you first create SearcherManager the Directories each have an empty index in them. Mike Amin Mohammed-Coleman wrote: Hi Here is the code that I am using, I'

Re: Marking commit points as deleted does not clean up on IW.close

2009-03-02 Thread Michael McCandless
You mean on calling IndexWriter.close, with a deletion policy that's functionally equivalent to KeepOnlyLastCommitDeletionPolicy, you somehow see that last 2 commits remaining in the Directory once IndexWriter is done closing? That's odd. Are you sure "onCommit()" is really calling dele

Re: Faceted Search using Lucene

2009-03-02 Thread Amin Mohammed-Coleman
Hi Here is the code that I am using, I've modified the get() method to include the maybeReopen() call. Again I'm not sure if this is a good idea. public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException { final String searchTerm = searchRequest.getSearchTerm();

Marking commit points as deleted does not clean up on IW.close

2009-03-02 Thread Shalin Shekhar Mangar
Hello, In Solr, when a user calls commit, the IndexWriter is closed (causing a commit). It is opened again only when another document is added or, a delete is performed. In order to support replication, Solr trunk now uses a deletion policy. The default policy is (should be?) equivalent to KeepOnl

Re: Confidence scores at search time

2009-03-02 Thread Erik Hatcher
On Mar 2, 2009, at 2:47 PM, Ken Williams wrote: Finally, I seem unable to get Searcher.explain() to do much useful - my code looks like: Searcher searcher = new IndexSearcher(reader); QueryParser parser = new QueryParser(LuceneIndex.CONTENT, analyzer); Query query = pa

Re: Confidence scores at search time

2009-03-02 Thread Ken Williams
Hi Grant, It's true, I may have an X-Y problem here. =) My basic need is to sacrifice recall to achieve greater precision. Rather than always presenting the user with the top N documents, I need to return *only* the documents that seem relevant. For some searches this may be 3 documents, for so

Re: Restricting the result set with hierarchical ACL

2009-03-02 Thread Chris Lu
There are two ways to handle this: 1) During indexing time, expand the group tree and store them to the documents, like "groups:1 2 3" 2) When indexing, storing only the exact group the document belongs to. Then during search time, expand group tree to search all the groups the user belongs to, inc
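Option 1 above (expanding the group tree at indexing time) could be sketched roughly as follows. This is a minimal illustration, not code from the thread: the `GroupTree` class, its methods, and the choice to expand a group into itself plus its ancestors are all assumptions, and the right direction of expansion depends on the actual ACL rule.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical group tree: each group id maps to its parent id (the root has no entry).
class GroupTree {
    private final Map<Integer, Integer> parentOf = new HashMap<>();

    void addChild(int parent, int child) {
        parentOf.put(child, parent);
    }

    // Returns the group itself followed by all of its ancestors up to the root.
    Set<Integer> withAncestors(int group) {
        Set<Integer> expanded = new LinkedHashSet<>();
        Integer current = group;
        // out.add() returning false would mean a cycle, so the loop also stops there.
        while (current != null && expanded.add(current)) {
            current = parentOf.get(current);
        }
        return expanded;
    }

    // Builds a whitespace-separated value suitable for a "groups" field, e.g. "3 2 1".
    String asFieldValue(int group) {
        StringBuilder sb = new StringBuilder();
        for (int g : withAncestors(group)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(g);
        }
        return sb.toString();
    }
}
```

The same `withAncestors` helper could instead be applied to the user's groups at search time, which corresponds to option 2; the trade-off is a larger index versus a larger query.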

Re: Faceted Search using Lucene

2009-03-02 Thread Michael McCandless
It makes perfect sense to call maybeReopen() followed by get(), as long as maybeReopen() is never slow enough to be noticeable to an end user (because you are making random queries pay the reopen/warming cost). If you call maybeReopen() after get(), then that search will not see the new
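The ordering effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `SnapshotManager` is a hypothetical stand-in and not the SearcherManager from the thread, and plain integers stand in for the index state and the open searcher.

```java
// Toy model: the "index" is a version counter and the "searcher" is the version it was opened on.
class SnapshotManager {
    private int indexVersion = 0;    // stand-in for what is committed on disk
    private int searcherVersion = 0; // stand-in for the currently open searcher

    // A writer commits new documents; the open searcher does not see them yet.
    synchronized void commit() {
        indexVersion++;
    }

    // Swaps in a fresh searcher only if the index changed since the last (re)open.
    synchronized void maybeReopen() {
        if (searcherVersion != indexVersion) {
            searcherVersion = indexVersion;
        }
    }

    // Returns whatever searcher is current; it never reopens by itself.
    synchronized int get() {
        return searcherVersion;
    }
}
```

In this model a get() issued after a commit but before maybeReopen() still returns the old version, while calling maybeReopen() first makes the following get() see the commit, at the price of paying the reopen cost on that request.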

Re: Restricting the result set with hierarchical ACL

2009-03-02 Thread Ken Krugler
Hi Markus, I need to restrict the resultset to the appropriate rights of the user who is searching the index. A document may belong to several groups. A user must belong to all groups of the document to find it. There's one additional problem: The groups are a tree. A user is automatically in e

Re: Question on Proximity Search in Lucene Query

2009-03-02 Thread Erick Erickson
See page 88 in Lucene In Action for a fuller explanation, including ordering considerations. But basically, phrase query slop is the maximum number of "moves" required to get all the words next to each other in the proper order. If you can get all the words next to each other within slop moves,

Question on Proximity Search in Lucene Query

2009-03-02 Thread Vasudevan Comandur
Hi All, I had posted the below mentioned query a week back and I have not received any response from the group so far. I was wondering if this is a trivial question to the group or it has been answered previously. I appreciate your answers or any pointers to the answers are also welcome.

Re: Restricting the result set with hierarchical ACL

2009-03-02 Thread Erick Erickson
If you have a reasonable way of getting the doc IDs that your user is allowed to see (and it appears you do), you probably want a Filter. At root a Filter is just a BitSet where you turn on the bit for each document that *could* be allowed in the results and pass that filter to the appropriate sear
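The bit-set idea described above can be sketched without any Lucene classes. This is an illustration only: `AllowedDocsFilter` and its methods are hypothetical names, and in a real index the allowed doc IDs would come from whatever ACL lookup is available.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class AllowedDocsFilter {
    // Builds a bit set with one bit per document; a set bit means "may appear in results".
    static BitSet allowedBits(int maxDoc, int[] allowedDocIds) {
        BitSet bits = new BitSet(maxDoc);
        for (int id : allowedDocIds) {
            bits.set(id);
        }
        return bits;
    }

    // Keeps only those matching documents whose bit is set, preserving their order.
    static List<Integer> applyFilter(int[] matchingDocIds, BitSet allowed) {
        List<Integer> visible = new ArrayList<Integer>();
        for (int id : matchingDocIds) {
            if (allowed.get(id)) {
                visible.add(id);
            }
        }
        return visible;
    }
}
```

A set bit only means the document could be returned; the query still decides which of the allowed documents actually match.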

Restricting the result set with hierarchical ACL

2009-03-02 Thread Markus Malkusch
Dear list I need to restrict the resultset to the appropriate rights of the user who is searching the index. A document may belong to several groups. A user must belong to all groups of the document to find it. There's one additional problem: The groups are a tree. A user is automatically in ever

Re: Faceted Search using Lucene

2009-03-02 Thread Amin Mohammed-Coleman
Hi Just out of curiosity, does it not make sense to call maybeReopen() and then call get()? If I call get() then I have a new MultiSearcher, so a call to maybeReopen() won't reinitialise the MultiSearcher. Unless I pass the MultiSearcher into the maybeReopen() method. But somehow that doesn't m

Re: N-grams with numbers and Shinglefilters

2009-03-02 Thread Raymond Balmès
Yes, I understand by now that I don't need a ShingleFilter. Yes, I will have many of these phrases in the documents... this is why I thought I shouldn't use Lucene fields. I will investigate further; your keyword approach sounds possible, thanks for the tip. However I presume I may need to normaliz

Re: Indexing synonyms for multiple words

2009-03-02 Thread Michael McCandless
Since Lucene doesn't represent/store end position for a token, I don't think the index can properly represent SYN spanning two positions? I suppose you could encode this into payloads, and create a custom query that would look at the payload to enforce the constraint. Or, if you switch to

RE: Sort Collection of ScoreDocs

2009-03-02 Thread Chetan Shah
Perfect Thanks. Was also looking at org.apache.lucene.search.ScoreDocComparator Uwe Schindler wrote: > > How about java.util.Arrays.sort() on the array using a simple > Comparator with a compare() that returns -Float.compare(a.score, > b.score)? This is just about 7 lines of Java code. > >

Re: queryNorm affect on score

2009-03-02 Thread Peter Keegan
If I set the boost=0 at query time and the query contains only terms with boost=0, the scores are NaN (because weight.queryNorm = 1/0 = infinity), instead of 0. Peter On Sun, Mar 1, 2009 at 9:27 PM, Erick Erickson wrote: > FWIW, Hossman pointed out that the difference between index and > query
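The arithmetic behind the NaN can be reproduced in plain Java. This is a sketch that assumes the default query normalization is 1 / sqrt(sum of squared weights) and that a term's weight is boost times idf; the class and method names are made up for illustration.

```java
class ZeroBoostNorm {
    // Assumed default normalization: 1 / sqrt(sumOfSquaredWeights).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Weight of a single-term query after normalization.
    static float normalizedWeight(float boost, float idf) {
        float weight = boost * idf;              // 0 when boost is 0
        float norm = queryNorm(weight * weight); // 1 / sqrt(0) = Infinity
        return weight * norm;                    // 0 * Infinity = NaN in IEEE 754
    }
}
```

With every boost at zero the sum of squared weights is zero, the norm becomes infinite, and multiplying it back into a zero weight yields NaN rather than 0, which matches the behaviour reported above.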

RE: N-grams with numbers and Shinglefilters

2009-03-02 Thread Steven A Rowe
Hi Raymond, On 3/2/2009 at 10:09 AM, Raymond Balmès wrote: > suppose I have a tri-gram, what I want to do is index the tri-gram > "string digit1 digit2" as one indexing phrase, and not index each token > separately. As long as you don't want any transformation performed on the phrase or its comp

RE: Sort Collection of ScoreDocs

2009-03-02 Thread Uwe Schindler
How about java.util.Arrays.sort() on the array using a simple Comparator with a compare() that returns -Float.compare(a.score, b.score)? This is just about 7 lines of Java code. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original
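The suggestion above might look roughly like the following. To keep the sketch self-contained, `Hit` is a hypothetical stand-in for ScoreDoc with just a document id and a score; swapping the arguments to Float.compare gives the same descending order as negating its result.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical stand-in for ScoreDoc: a document id plus its relevance score.
class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

class SortHitsByScore {
    // Sorts the array in place, highest score first, without re-running the query.
    static void sortByScoreDescending(Hit[] hits) {
        Arrays.sort(hits, new Comparator<Hit>() {
            @Override
            public int compare(Hit a, Hit b) {
                // Arguments are swapped to get descending order.
                return Float.compare(b.score, a.score);
            }
        });
    }
}
```

Arrays.sort on objects is stable, so hits with equal scores keep their previous relative order.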

Restricting the result set with hierarchical ACL

2009-03-02 Thread markus
Dear list I need to restrict the resultlist to the appropriate rights of the user who is searching the index. A document may belong to several groups. A user must belong to all groups of the document to find it. There's one additional problem: The groups are a tree. A user is automatically in eve

Sort Collection of ScoreDocs

2009-03-02 Thread Chetan Shah
Is there an existing Utility class which will sort a collection of ScoreDocs ? I have a result set (array of ScoreDocs) stored in JVM and want to sort them by relevanceScore. I do not want to execute the query again. The stored result set is sorted by another term and hence the need. Would highly

RE: Indexing synonyms for multiple words

2009-03-02 Thread Uwe Schindler
I think his problem is, that "SYN" is a synonym for the phrase "WORD1 WORD2". Using these positions, a phrase like "SYN WORD2" would also match (or other problems in queries that depend on order of words). Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail:

Indexing synonyms for multiple words

2009-03-02 Thread Sumukh
> > Hi, > > I'm fairly new to Lucene. I'd like to know how we can index synonyms for > multiple words. > > This is the scenario: > > Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. > > Now assume the two words combined WORD1 WORD2 can be replaced by another > word SYN. > > If I place SYN afte

Re: N-grams with numbers and Shinglefilters

2009-03-02 Thread Raymond Balmès
Well, in the meantime I've looked at the details of the implementation and it gave me an idea for what I'm looking for: suppose I have a tri-gram, what I want to do is index the tri-gram "string digit1 digit2" as one indexing phrase, and not index each token separately. In the shingle filter,

Re: Indexing synonyms for multiple words

2009-03-02 Thread Michael McCandless
Shouldn't WORD2's position be 1 more than your SYN? I.e., don't you want these positions: WORD1 2, WORD2 3, SYN 2? The position is the starting position of the token; Lucene doesn't store an ending position. Mike Sumukh wrote: Hi, I'm fairly new to Lucene. I'd like to know how we
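How those positions follow from position increments can be sketched as below. This assumes zero-based positions, which matches the numbering in the message, and a position increment of 0 for the synonym; `PositionDemo` is a hypothetical helper, not a Lucene class.

```java
class PositionDemo {
    // Turns per-token position increments into absolute start positions.
    static int[] positions(int[] increments) {
        int[] result = new int[increments.length];
        // Start at -1 so that a first token with increment 1 lands on position 0.
        int current = -1;
        for (int i = 0; i < increments.length; i++) {
            current += increments[i];
            result[i] = current;
        }
        return result;
    }
}
```

For the token stream AAA, BBB, WORD1, SYN, WORD2, EEE with increments 1, 1, 1, 0, 1, 1 this puts SYN on the same position as WORD1 and WORD2 one position later. Since only start positions exist, SYN looks like a single-position token immediately followed by WORD2.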

Re: Indexing synonyms for multiple words

2009-03-02 Thread Erick Erickson
This has been discussed on the user list, so searching there might get you an answer quicker. See: http://wiki.apache.org/lucene-java/MailingListArchives I don't remember the results, but... Best Erick On Mon, Mar 2, 2009 at 9:13 AM, Sumukh wrote: > Hi, > > I'm fairly new to Lucene. I'd like to

Extracting TFIDF vectors

2009-03-02 Thread Gregory Gay
Hi, I'm a complete novice at Lucene, and I'm looking for a little bit of help with something. How can I extract the TF*IDF vector for each document in the indexed collection? Also for the query? I need to build a user-feedback system which manipulates the query based on the liked and disliked do

Indexing synonyms for multiple words

2009-03-02 Thread Sumukh
Hi, I'm fairly new to Lucene. I'd like to know how we can index synonyms for multiple words. This is the scenario: Consider a sentence: AAA BBB WORD1 WORD2 EEE FFF GGG. Now assume the two words combined WORD1 WORD2 can be replaced by another word SYN. If I place SYN after WORD1 with positionIn

Re: Faceted Search using Lucene

2009-03-02 Thread Amin Mohammed-Coleman
In my test case I have a set up method that should populate the indexes before I start using the document searcher. I will start adding some more debug statements. So basically I should be able to do: get() followed by maybeReopen. I will let you know what the outcome is. Cheers Amin On Mon,

Re: Faceted Search using Lucene

2009-03-02 Thread Michael McCandless
Is it possible that when you first create the SearcherManager, there is no index in each Directory? If not... you better start adding diagnostics. EG inside your get(), print out the numDocs() of each IndexReader you get from the SearcherManager? Something is wrong and it's best to exp

RE: N-grams with numbers and Shinglefilters

2009-03-02 Thread Steven A Rowe
Hi Raymond, On 3/1/2009, Raymond Balmès wrote: > I'm trying to index (& search later) documents that contain tri-grams > however they have the following form: > > <2 digit> <2 digit> > > Does the ShingleFilter work with numbers in the match ? Yes, though it is the tokenizer and previous filter

Re: Faceted Search using Lucene

2009-03-02 Thread Amin Mohammed-Coleman
Nope. If I remove the maybeReopen() the search doesn't work. It only works when I call maybeReopen() followed by get(). Cheers Amin On Mon, Mar 2, 2009 at 12:56 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > That's not right; something must be wrong. > > get() before maybeReopen() sh

Re: Faceted Search using Lucene

2009-03-02 Thread Michael McCandless
That's not right; something must be wrong. get() before maybeReopen() should simply let you search based on the searcher before reopening. If you just do get() and don't call maybeReopen() does it work? Mike Amin Mohammed-Coleman wrote: I noticed that if i do the get() before the maybeRe

Re: Faceted Search using Lucene

2009-03-02 Thread Amin Mohammed-Coleman
I noticed that if I do the get() before the maybeReopen() then I get no results. But otherwise I can change it further. On Mon, Mar 2, 2009 at 11:46 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > > There is no such thing as final code -- code is alive and is always > changing ;) > >

Re: Merging database index with fulltext index

2009-03-02 Thread Marcelo Ochoa
Hi: The key to avoiding bad performance when merging a database result is to reduce the number of rows visited by your first query. As an example, take a look at these two queries using Lucene Domain Index; the two are equivalent: Option A: select * from (select rownum as ntop_pos,q.* fro

Re: Faceted Search using Lucene

2009-03-02 Thread Michael McCandless
There is no such thing as final code -- code is alive and is always changing ;) It looks good to me. Though one trivial thing is: I would move the code in the try clause up to and including the multiSearcher=get() out above the try. I always attempt to "shrink wrap" what's inside a try
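The "shrink wrap" advice above could be sketched like this. Everything here is hypothetical: `acquire` and `release` stand in for get() and whatever releases the searcher, and the "search" just returns the term length so the example stays self-contained.

```java
class ShrinkWrappedTry {
    // Counts acquisitions that have not been released yet (hypothetical bookkeeping).
    static int outstanding = 0;

    static String acquire() {
        outstanding++;
        return "searcher";
    }

    static void release(String searcher) {
        outstanding--;
    }

    static int search(String term) {
        // Validation and acquisition happen before the try: if either throws,
        // nothing has been acquired by this call, so there is nothing to release.
        if (term == null || term.isEmpty()) {
            throw new IllegalArgumentException("Search string cannot be empty");
        }
        String searcher = acquire();
        try {
            // Only the code that actually uses the acquired resource sits inside the try.
            return term.length();
        } finally {
            release(searcher);
        }
    }
}
```

Keeping the try block this small makes it clear that the finally clause pairs with exactly one successful acquisition.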

Re: Adding another factor to Lucene search

2009-03-02 Thread Ian Lea
Hi Document.setBoost(float boost) where boost is either your score as is, or a value based on that score, might do the trick for you. Other boosting and custom score options include BoostingQuery, BoostingTermQuery and CustomScoreQuery. A google search for "lucene boosting" throws up lots of h

Re: search by word offset

2009-03-02 Thread Shashi Kant
Not sure what you are asking about, but you might want to take a look at http://lucene.apache.org/java/2_4_0/api/contrib-surround/index.html The Surround parser offers many features around the span query (which I suspect is what you are looking for) Shashi On Mon, Mar 2, 2009 at 4:57 AM, shb w

Adding another factor to Lucene search

2009-03-02 Thread liat oren
Hi, I would like to add to lucene's score another factor - a score between words. I have an index that holds couple of words with their score. How can I take it into account when using Lucene search? Many thanks, Liat

search by word offset

2009-03-02 Thread shb
Hi, I need help. I need to search by word in sentences with Lucene. For example, for the word "bbb" I got the right results, all the sentences: "text ok ok ok bbb", "text 2 bbb text", "bbb text 4...". But I need the result by the word offset in the sentence, like this: "bbb text 4...".

Re: Faceted Search using Lucene

2009-03-02 Thread Amin Mohammed-Coleman
Hi there Good morning! Here is the final search code: public Summary[] search(final SearchRequest searchRequest) throws SearchExecutionException { final String searchTerm = searchRequest.getSearchTerm(); if (StringUtils.isBlank(searchTerm)) { throw new SearchExecutionException("Search string ca