Re: Sort by count?

2009-03-09 Thread Koji Sekiguchi
> First, I rewrote the Similarity (including lengthNorm), but it did not work, so I modified the Lucene source by setting norm_table = 1.0 (all entries). That works.
If you override lengthNorm(), you need to reindex for it to take effect. Koji

Re: Re: Re: Sort by count?

2009-03-09 Thread hyj
> >: yes, it works... but the NORM_TABLE(normalization) in Similarity cannot >: be eliminated... > >the norm table is used to encode the values generated based on your >lengthNorm function and document/field boosts -- so make lengthNorm return >"1", and don't use index time boosts (or better ye

Re: Re: Re: Sort by count?

2009-03-09 Thread hyj
> >: yes, it works... but the NORM_TABLE(normalization) in Similarity cannot >: be eliminated... > >the norm table is used to encode the values generated based on your >lengthNorm function and document/field boosts -- so make lengthNorm return >"1", and don't use index time boosts (or better y

Re: Re: Sort by count?

2009-03-09 Thread Chris Hostetter
: yes, it works... but the NORM_TABLE(normalization) in Similarity cannot
: be eliminated...
the norm table is used to encode the values generated based on your lengthNorm function and document/field boosts -- so make lengthNorm return "1", and don't use index time boosts (or better yet: use
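
Editor's sketch of the advice in this thread: a small, self-contained model of why returning a constant from lengthNorm makes ranking follow the raw term count. It does not use the Lucene API. The class and method names are hypothetical, the toy score is an illustration only, and the 1/sqrt(numTerms) formula is assumed to mirror what the default Similarity computes.

```java
// Self-contained model of length normalization; none of this is Lucene code.
final class LengthNormSketch {
    // Default-style norm (assumed formula): shorter fields get a larger factor.
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Flat norm, as suggested in the thread: field length no longer matters.
    static float flatLengthNorm(int numTerms) {
        return 1.0f;
    }

    // Toy score (hypothetical): raw term count scaled by the chosen norm.
    static float toyScore(int termCount, float norm) {
        return termCount * norm;
    }
}
```

With the default-style norm, a 4-term field containing 2 matches outranks a 100-term field containing 3 matches; with the flat norm the order follows the counts. Because norms are written at index time, a change like this only shows up after reindexing, which is the point Koji makes above.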

Re: A model for predicting indexing memory costs?

2009-03-09 Thread Michael McCandless
mark harwood wrote: I've been building a large index (hundreds of millions) with mainly structured data which consists of several fields with mostly unique values. I've been hitting out of memory issues when doing periodic commits/closes which I suspect is down to the sheer number of ter

Re: Instantiating a RAMDirectory from a mutating directory

2009-03-09 Thread Michael McCandless
You're welcome, and let us know how it goes! Mike Kieran Topping wrote: Mike, many thanks for this most comprehensive reply. Actually, I believe that NOTE only applies to the two addIndexes methods that take Directory. So I think this approach will work fine in general. Have you hit any

Re: Lucene 2.9

2009-03-09 Thread Yonik Seeley
On Mon, Mar 9, 2009 at 2:02 PM, Michael McCandless wrote: > Once added, something inside the index (a "write once" schema) records > that this field is an IntField and then it's an error to ever use a > different type field by that same name. I dunno... coupling functionality to restrictions seem

Re: Lucene 2.9

2009-03-09 Thread Michael McCandless
markharw00d wrote: >>(a "write once" schema) I like this idea. Enforcing consistent field-typing on instances of fields with the same name does not seem like an unreasonable restriction - especially given the upsides to this. And also when it's "opt-in", ie, you can continue to use untyp

Re: Scores between words. Boosting?

2009-03-09 Thread liat oren
I have an index that has for every two words a score. I would like my analyzer - that is a combination of whitespace tokenizer, a stop words analyzer and stemming. The regular score of Lucene takes into account the position of the words. I would like to add another factor to that score which is t

Re: Lucene 2.9

2009-03-09 Thread markharw00d
>>(a "write once" schema) I like this idea. Enforcing consistent field-typing on instances of fields with the same name does not seem like an unreasonable restriction - especially given the upsides to this. It doesn't dispense with all the full schema logic in Solr but seems like a useful ba

Re: Lucene 2.9

2009-03-09 Thread Michael McCandless
mark harwood wrote: Time for some standardised index metadata? OK, thinking out loud... What if we created IntField, subclassing Field. It holds a single int, and you can add it to Document just like any other field. Once added, something inside the index (a "write once" schema) records th

[ANNOUNCE] Lucene Java 2.4.1 released

2009-03-09 Thread Michael McCandless
Release 2.4.1 of Lucene is now available. This release fixes bugs from 2.4.0, including one data loss bug where in certain situations binary fields would be truncated to 0 bytes. 2.4.1 has no new features, nor API changes or changes to file formats, so it's fully compatible with 2.4.0. See chan

Re: Lucene 2.9

2009-03-09 Thread Michael McCandless
mark harwood wrote: This trie/parser issue is an example of a broader issue for me. Yeah I agree. There was also a new Document impl attached in Jira somewhere to more strongly type fields (can't find it now), ie IntField, DateField, etc. And it also ties into refactoring AbstractField/Field

Re: ZipFile directory implementation

2009-03-09 Thread Michael McCandless
tsuraan wrote: Sounds interesting. Can you tell us a bit more about the use case for it? Is it basically you are in a situation where you can't unzip the index? Indices compress pretty nicely: 30% to 50% in my experience. So, if your indices are read-only anyhow (mine aren't live; we

Re: ZipFile directory implementation

2009-03-09 Thread tsuraan
> Also, have you looked at how it performs? Just making a directory of 1,000,000 documents and reading from it, it looks like this implementation is probably unbearably slow, unless Lucene has some really good caching. ZipFile gives InputStreams for the zip contents, and InputStreams don't suppor

Re: How to search both Tokenized and Untokenized fields

2009-03-09 Thread Erick Erickson
PerFieldAnalyzerWrapper is your friend, assuming that you have separate fields, some tokenized and some not. If you *don't* have separate fields, then we need more details of what you hope to accomplish... something like (+tokenized:value1 +tokenized:value2) (+untokenized:value3 +untokenized:valu

How to search both Tokenized and Untokenized fields

2009-03-09 Thread rokham
Hi, I've been trying to find a way which allows executing a query that contains both Tokenized and Untokenized fields on Lucene's index, without having to parse the query. I've been able to execute a query which only uses Tokenized fields as follows: QueryParser queryParser = new QueryParser(

Re: ZipFile directory implementation

2009-03-09 Thread tsuraan
> Sounds interesting. Can you tell us a bit more about the use case for it? Is it basically you are in a situation where you can't unzip the index? Indices compress pretty nicely: 30% to 50% in my experience. So, if your indices are read-only anyhow (mine aren't live; we do batch jobs to modif

Re: Scores between words. Boosting?

2009-03-09 Thread Grant Ingersoll
Hmmm, I have some inklings of an idea, but can we take a step back? Can you explain the problem you are trying to solve at a higher level (instead of the current solution)? I imagine it is something related to co-occurrence analysis. On Mar 8, 2009, at 8:05 AM, liat oren wrote: Hi Gran

Re: Lucene 2.9

2009-03-09 Thread Yonik Seeley
On Mon, Mar 9, 2009 at 8:10 AM, Michael McCandless wrote: > Could we add APIs to QueryParser so the application can state the > disposition > toward certain fields? overriding QueryParser.getRangeQuery() seems the most powerful and flexible (and it's already there). -Yonik http://www.lucidimagin

Re: sloppyFreq question

2009-03-09 Thread Peter Keegan
The reason I asked about Span scoring is that the behavior changed when I switched from TermQuery to BoostingTermQuery to take advantage of payloads. It seems to me that a SpanTermQuery and BoostingTermQuery should behave the same as TermQuery with respect to term frequency. The 'edit distance' is

Re: Lucene 2.9

2009-03-09 Thread mark harwood
>>Maybe we could do something similar to declare that a given field uses Trie*, >>and with what datatype. With the current implementation you can at least test for the presence of a field called: [fieldName]#trie ..which tells you some form of trie is used but could be extended to include
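
Editor's sketch of the marker-field test described above: given the field names present in an index, check whether `[fieldName]#trie` is among them. How the field names are obtained is left out, and the class and method names are hypothetical; only the `#trie` suffix comes from the message.

```java
import java.util.Collection;

// Checks for the "[fieldName]#trie" helper field mentioned in the thread.
final class TrieMarkerSketch {
    static final String TRIE_SUFFIX = "#trie";

    // True if the index's field names include the trie marker for fieldName.
    static boolean looksTrieEncoded(String fieldName, Collection<String> indexedFieldNames) {
        return indexedFieldNames.contains(fieldName + TRIE_SUFFIX);
    }
}
```

As the message notes, this only says that some form of trie encoding is in use for the field, not which datatype it holds.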

RE: Lucene 2.9

2009-03-09 Thread Allahbaksh Mohammedali Asadullah
Hi, It would be really nice to have query parsing that depends on the field, so that a query like amount >15 just works. Basically we need a pluggable query parser which can convert different queries like amount >15 to a Lucene-specific query. That is what I think. I

Re: Lucene 2.9

2009-03-09 Thread Michael McCandless
Uwe Schindler wrote: Or perhaps we should move Trie* into core Lucene, and then build a real (ootb) integration with QueryParser. The problem is that the query parser does not know if a field is encoded as trie or is just a normal text token. Furthermore, the new trie API does not differe

Re: Using Lucene for user query parsing

2009-03-09 Thread Shashi Kant
The BoW approach is simple and highly effective IMO. If you want to get a bit fancy, you could also use a MultiField query in the combined index. Another brute-force approach would be to hit all 3 indexes and see which ones come back with the highest score(s). On Mon, Mar 9, 2009 at 8:43 AM, Er

RE: Lucene 2.9

2009-03-09 Thread Uwe Schindler
> -Original Message- > From: Michael McCandless [mailto:luc...@mikemccandless.com] > Sent: Monday, March 09, 2009 12:51 PM > To: java-user@lucene.apache.org > Subject: Re: Lucene 2.9 > > > Uwe Schindler wrote: > > >>> Is there any plans to have simpler queries for Numbers and Data? > >>

Re: IndexSearcher

2009-03-09 Thread liat oren
Thanks! 2009/3/9 Andrzej Bialecki > liat oren wrote: > >> Yes, I changed it to TOKENIZED and it's working now, Thanks! >> >> About Luke, what do you mean by saying that the analyzer is in the >> classpath? >> It exists in a package in my computer - it also has its filter and other >> classes. How

RE: Lucene 2.9

2009-03-09 Thread Allahbaksh Mohammedali Asadullah
Hi, I was not aware of TrieRangeQuery, but will the syntax for numerical queries change? For example I want to search amount >= 15 rather than doing it amount:[ 15] or something? Is there any open source queryparser which converts something like amount >=15 into lucene number format query.

Re: Using Lucene for user query parsing

2009-03-09 Thread Erick Erickson
Sure, Lucene is suited. The central problem here isn't the search engine, IMO, it's figuring out what bits of the query are relevant to what parts of the data. That is, in some random string, what is the street, business name, address, etc. Lucene has nothing built in that I know of that'l

Re: Lucene 2.9

2009-03-09 Thread Michael McCandless
Uwe Schindler wrote: Is there any plans to have simpler queries for Numbers and Data? With the recent addition of TrieRangeQuery (in 2.9), I think Lucene's range querying is actually very strong, though you'd have to subclass QueryParser and override getRangeQuery to have it create TrieRang
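
Editor's sketch of the pattern discussed in this thread: a parser subclass overrides its range-query factory method so that fields known to be numeric produce a numeric range. To stay runnable without any library, the base class is a stand-in rather than Lucene's QueryParser, and the methods return description strings instead of query objects. All class names are hypothetical; the method shape is only assumed to resemble QueryParser.getRangeQuery.

```java
import java.util.HashSet;
import java.util.Set;

// Stand-in base class; NOT Lucene's QueryParser. It returns a description
// string instead of a query object so the sketch has no dependencies.
class RangeParserSketch {
    protected String getRangeQuery(String field, String lower, String upper, boolean inclusive) {
        String open = inclusive ? "[" : "{";
        String close = inclusive ? "]" : "}";
        return "TextRange(" + field + ":" + open + lower + " TO " + upper + close + ")";
    }
}

// Hypothetical subclass: fields registered as numeric get a numeric range.
class NumericAwareParserSketch extends RangeParserSketch {
    private final Set<String> numericFields = new HashSet<String>();

    NumericAwareParserSketch(String... fields) {
        for (String f : fields) {
            numericFields.add(f);
        }
    }

    @Override
    protected String getRangeQuery(String field, String lower, String upper, boolean inclusive) {
        if (!numericFields.contains(field)) {
            // Unknown fields keep the default text-range behaviour.
            return super.getRangeQuery(field, lower, upper, inclusive);
        }
        long lo = Long.parseLong(lower);
        long hi = Long.parseLong(upper);
        return "NumericRange(" + field + ", " + lo + ", " + hi + ", inclusive=" + inclusive + ")";
    }
}
```

The parser needs the application to tell it which fields are numeric, which is the gap the "write once" schema idea elsewhere in this thread is meant to close.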

Re: Instantiating a RAMDirectory from a mutating directory

2009-03-09 Thread Kieran Topping
Mike, many thanks for this most comprehensive reply. Actually, I believe that NOTE only applies to the two addIndexes methods that take Directory. So I think this approach will work fine in general. Have you hit any problems in testing it? I'll update the javadocs. I have not attempted this

RE: Lucene 2.9

2009-03-09 Thread Uwe Schindler
> > Is there any plans to have simpler queries for Numbers and Data? > > With the recent addition of TrieRangeQuery (in 2.9), I think Lucene's > range querying is actually very strong, though you'd have to subclass > QueryParser and override getRangeQuery to have it create TrieRangeQuery. The add

A model for predicting indexing memory costs?

2009-03-09 Thread mark harwood
I've been building a large index (hundreds of millions) with mainly structured data which consists of several fields with mostly unique values. I've been hitting out of memory issues when doing periodic commits/closes which I suspect is down to the sheer number of terms. I set the IndexWriter..

Re: Lucene 2.9

2009-03-09 Thread Michael McCandless
Allahbaksh Mohammedali Asadullah wrote: When is Lucene 2.9 due? I am eagerly waiting for the new lucene to come. There have been some discussions on java-dev, but there's no clear consensus/date yet. We do have quite a few Jira issues marked as 2.9 at this point, which we need to make p

Re: Lucene: MultiSearcher

2009-03-09 Thread Michael McCandless
Excellent, that's the way to go. I hadn't realized that method was public. Mike Daniel Noll wrote: Michael McCandless wrote: You could look at the docID of each hit, and compare to the .maxDoc() of each underlying reader. There is also MultiSearcher#subSearcher(int) which also works a
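
Editor's sketch of the first approach mentioned above (comparing a hit's docID against the maxDoc of each underlying reader), written as plain arithmetic with no Lucene dependency. It assumes the combined doc IDs are assigned by concatenating the sub-indexes in order; the class and method names are hypothetical.

```java
// Maps a combined doc ID back to the sub-index it came from.
final class SubIndexSketch {
    // starts[i] is the first combined doc ID of sub-index i;
    // the last entry is the total number of documents.
    static int[] computeStarts(int[] maxDocs) {
        int[] starts = new int[maxDocs.length + 1];
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i + 1] = starts[i] + maxDocs[i];
        }
        return starts;
    }

    // Index of the sub-index that contains the combined doc ID.
    static int subIndex(int globalDoc, int[] starts) {
        if (globalDoc < 0) {
            throw new IllegalArgumentException("negative doc ID: " + globalDoc);
        }
        for (int i = 0; i < starts.length - 1; i++) {
            if (globalDoc < starts[i + 1]) {
                return i;
            }
        }
        throw new IllegalArgumentException("doc ID out of range: " + globalDoc);
    }

    // Doc ID relative to its own sub-index.
    static int localDoc(int globalDoc, int[] starts) {
        return globalDoc - starts[subIndex(globalDoc, starts)];
    }
}
```

MultiSearcher#subSearcher(int), mentioned in the reply, gives the same answer without hand-rolling the arithmetic, so the sketch is mainly useful for seeing what that lookup does.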

Lucene 2.9

2009-03-09 Thread Allahbaksh Mohammedali Asadullah
Hi, When is Lucene 2.9 due? I am eagerly waiting for the new lucene to come. As I compared Lucene with Minion I think Minion offers very rich capabilities like easier range query etc. Is there any plans to have simpler queries for Numbers and Data? No doubt lucene is the best Open Source Search

Re: IndexSearcher

2009-03-09 Thread Andrzej Bialecki
liat oren wrote: Yes, I changed it to TOKENIZED and it's working now, Thanks! About Luke, what do you mean by saying that the analyzer is in the classpath? It exists in a package in my computer - it also has its filter and other classes. How can it be used in Luke? You need to add this jar to t

Re: Using Lucene for user query parsing

2009-03-09 Thread Srinivas Bharghav
Thanks for all the inputs guys. As Erick said let me elaborate the problem a bit. We are trying to develop a local search application. The user will be able to locate businesses, localities and roads. We have data for all the 3 with us. We do not want to provide separate boxes for the user to ent

Re: Lucene Highlighting and Dynamic Summaries

2009-03-09 Thread Amin Mohammed-Coleman
Hi I am seeing some strange behaviour with the highlighter and I'm wondering if anyone else is experiencing this. In certain instances I don't get a summary being generated. I perform the search and the search returns the correct document. I can see that the lucene document contains the text in