Re: Strange tokenization with StandardFilter

2005-11-23 Thread yahootintin . 11533894
Yes, this is a repeat... I mailed this a few days ago and it never made it to the list, so I reposted. Now it suddenly appears... weird!
--- java-user@lucene.apache.org wrote:
> On 21 Nov 2005, at 18:54, [EMAIL PROTECTED] wrote:
> > I'm using a StandardFilter and seeing some strange tokeni

Strange tokenization with StandardFilter

2005-11-23 Thread yahootintin . 11533894
I'm using a StandardFilter and seeing some strange tokenization.
Here's the input: apache.org hosts lucene at apache.org.
Here are the tokens it outputs: apache.org hosts lucene at apacheorg
Is it a bug that "apache.org" and "apache.org." don't convert to the same token?
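One plausible reading of the output above is that the trailing dot makes "apache.org." look like an acronym (letter runs each followed by a dot, as in "I.B.M."), and acronym tokens then have their dots stripped. The sketch below only mimics that behaviour in plain Python; the `ACRONYM` pattern, `filter_token`, and `analyze` are hypothetical names for illustration and are not Lucene's grammar or API.

```python
import re
from typing import List

# Assumption: an "acronym" is two or more letter runs, each followed by a dot
# (for example "I.B.M." or "apache.org."). This imitates, but is not, Lucene's grammar.
ACRONYM = re.compile(r"(?:[A-Za-z]+\.){2,}")


def filter_token(token: str) -> str:
    """Strip the dots from acronym-shaped tokens; leave other tokens unchanged."""
    if ACRONYM.fullmatch(token):
        return token.replace(".", "")
    return token


def analyze(text: str) -> List[str]:
    """Split on whitespace (a stand-in for real tokenization) and filter each token."""
    return [filter_token(token) for token in text.split()]


print(analyze("apache.org hosts lucene at apache.org."))
```

Under this assumption "apache.org" has no trailing dot, so it does not match the acronym shape and keeps its dot, while "apache.org." matches and is collapsed.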

Re: Inconsistent StandardTokenizer behaviour

2005-11-22 Thread yahootintin . 11533894
Cool, I'll take a look at fixing this.
--- java-user@lucene.apache.org wrote:
> On 21 Nov 2005, at 19:39, [EMAIL PROTECTED] wrote:
> > This is the results for the StandardTokenizer:
> > input - output token - output type
> > 1. 1.2 - 1.2 -
> > 2. 1.2. - 1.2 -

Re: Inconsistent StandardTokenizer behaviour

2005-11-21 Thread yahootintin . 11533894
Sorry for the bad looking table. Retrying...
input string - output token (output type)
1. 1.2 - 1.2 ()
2. 1.2. - 1.2 ()
3. a.b - a.b ()
4. a.b. - a.b. ()
5. www.apache.org - www.apache.org ()
6. www.apache.org. - www.apache.org. ()
--- java-user@lucene.apache.org wrote: This is the results for

Inconsistent StandardTokenizer behaviour

2005-11-21 Thread yahootintin . 11533894
These are the results for the StandardTokenizer:
input - output token - output type
1. 1.2 - 1.2 -
2. 1.2. - 1.2 -
3. a.b - a.b -
4. a.b. - a.b. -
5. www.apache.org - www.apache.org -
6. www.apache.org. - www.apache.org. -
Number 6 should still be

Anyone using Chandler Lucene / Berkeley DB?

2005-08-04 Thread yahootintin . 11533894
how well does it work? does it provide the ability to search shortly after adding a document?

Search shortly after adding a doc

2005-08-04 Thread yahootintin . 11533894
i want to use lucene to search shortly (within a second) after adding a document. closing a writer to ensure the new document is written and then opening an index reader seems to be too slow on large indexes. how do other people handle this? (i know this can be solved with a database but i'd

Re: retrieving raw scores

2005-07-19 Thread yahootintin . 11533894
Thanks. I'll try that...
--- java-user@lucene.apache.org wrote:
Use HitCollector's collect method:
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/HitCollector.html#collect(int,%20float)
> Otis
> --- [EMAIL PROTECTED] wrote:
> > hi,
> > i need to retrieve th

retrieving raw scores

2005-07-19 Thread yahootintin . 11533894
hi, i need to retrieve the raw scores (3.6, 2.8, etc) for a hit and not the normalized score (1.0, 0.8, etc). commenting out the normalizing code in Hits.java does what i want. is there a better way to do this? i'm wondering about adding a method to Similarity.java that looks like this: boole
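For readers unsure what "normalized" means here, the following is a minimal sketch of the arithmetic, assuming the normalization simply divides every score by the top raw score whenever that top score exceeds 1.0. The function name `normalize` is hypothetical; this is plain Python, not Lucene code.

```python
def normalize(raw_scores):
    """Scale scores so the top hit is 1.0, but only when the top raw score exceeds 1.0.

    Assumption: this mirrors the kind of normalization described in the post;
    it is an illustration, not the library's implementation.
    """
    if not raw_scores:
        return []
    top = max(raw_scores)
    factor = 1.0 / top if top > 1.0 else 1.0
    return [score * factor for score in raw_scores]


# Raw scores from the post: 3.6 and 2.8 come out as roughly 1.0 and 0.8.
print(normalize([3.6, 2.8]))
```

Because the scaling factor depends on the top hit of each query, normalized scores are not comparable between queries, which is one reason to want the raw values.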

Loading large index into RAM

2005-07-07 Thread yahootintin . 11533894
Is it possible to use a RAMDirectory to load a 5 GB index into RAM on Linux? I have access to a server with 6 GB of RAM and will try it next week but I've heard that Java on Linux may only support up to 2 GB of RAM per process. Anyone already tried this? Thanks.

Lucene faster on JDK 1.5?

2005-07-07 Thread yahootintin . 11533894
Are people seeing a significant speed improvement with Lucene when they upgrade to JDK 1.5?

Re: Strategy for making short documents not bubble to the top?

2005-06-29 Thread yahootintin . 11533894
Hi Jian, Thanks for the reply. The problem with that is it completely ignores document length. A book that mentions "frog" 5 times in its 2,000 pages should be less relevant than a book that mentions "frog" 4 times in its 4 pages. I really want to lower the document length weight instead of rem

Strategy for making short documents not bubble to the top?

2005-06-29 Thread yahootintin . 11533894
Hi, Short documents bubble to the top of the results because the field length is short. Does anyone have a good strategy for working around this? Will doing something like log(document length) flatten out my results while still making them meaningful? I'm going to try some different approaches
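To make the log(document length) idea concrete, here is a small sketch comparing two length-normalization curves. It assumes the default norm is 1/sqrt(number of terms) and uses 1/(1 + ln(number of terms)) as one hypothetical flatter alternative; both function names are made up for illustration and neither is a Lucene call.

```python
import math


def default_length_norm(num_terms):
    """Assumed default: short fields get a much larger boost than long ones."""
    return 1.0 / math.sqrt(num_terms)


def log_length_norm(num_terms):
    """Hypothetical flatter curve: length still matters, but far less."""
    return 1.0 / (1.0 + math.log(num_terms))


# Compare how strongly each curve favours a 10-term field over a 100,000-term field.
short_len, long_len = 10, 100_000
sqrt_ratio = default_length_norm(short_len) / default_length_norm(long_len)
log_ratio = log_length_norm(short_len) / log_length_norm(long_len)
print(f"sqrt-based advantage of the short field: {sqrt_ratio:.1f}x")
print(f"log-based advantage of the short field:  {log_ratio:.1f}x")
```

The log-based curve keeps the ordering (shorter fields still score higher for the same term frequency) while shrinking the gap, which is the "flatten but stay meaningful" behaviour the question asks about.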

Span query performance issue

2005-06-24 Thread yahootintin . 11533894
Hi, I'm comparing SpanNearQuery to PhraseQuery results and noticing about an 8x difference on Linux. Is a SpanNearQuery doing 8x as much work? I'm considering diving into the code if the results sound unusual to people. But if it's really doing that much more work, I won't spend time optimiz

Re: Calculating idf across multiple indexes

2005-06-06 Thread yahootintin . 11533894
Hmmm... I'll look into that. I thought the MultiSearcher would still need access to each index. Does the RemoteSearchable avoid that? Will it allow me to delegate searching to multiple boxes and then collate the results correctly? Thanks for the tip about the RemoteSearchable. --- java-us

Re: Calculating idf across multiple indexes

2005-06-06 Thread yahootintin . 11533894
Hi Daniel, The problem is that if I tell Lucene about only one of the indexes it has no way of knowing what the total document frequency is across the other index servers. Does that make sense? I think my collator will need to calculate the idf somehow. Thanks. --- java-user@lucene.apa

Calculating idf across multiple indexes

2005-06-06 Thread yahootintin . 11533894
Hi, Due to the size of my index, I need to break it into several different segments. I have a service that gets a query from the user and contacts each index searcher service asynchronously and waits for the results. The results are then collated and returned to the user. The problem is tha
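The collation problem can be sketched as follows, assuming each index server is able to report, per query term, its local document frequency and its total document count (that reporting step is an assumption, not something the post describes). The idf formula used is ln(numDocs / (docFreq + 1)) + 1; the helper names are hypothetical.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency: ln(numDocs / (docFreq + 1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def global_idf(shard_stats):
    """Combine per-index statistics into one idf.

    shard_stats: list of (doc_freq, num_docs) pairs, one pair per index
    (hypothetical input; each index server would have to supply these).
    """
    total_doc_freq = sum(doc_freq for doc_freq, _ in shard_stats)
    total_docs = sum(num_docs for _, num_docs in shard_stats)
    return idf(total_doc_freq, total_docs)


# A term that is rare in one index and common in another.
shards = [(9, 1000), (90, 1000)]
local_idfs = [idf(doc_freq, num_docs) for doc_freq, num_docs in shards]
print("local idfs:", [round(value, 3) for value in local_idfs])
print("global idf:", round(global_idf(shards), 3))
```

Each index on its own produces a different idf for the same term, so scores from different servers are not directly comparable until the frequencies are summed and a single idf is applied everywhere.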