> First, I rewrote the Similarity (including lengthNorm), but it did not
> work... so I modified the Lucene source by setting every entry of the
> norm table (NORM_TABLE) to 1.0. That works.
If you override lengthNorm(), reindexing is needed for the change to take effect.
Koji
: Yes, it works... but the NORM_TABLE (normalization) in Similarity cannot
: be eliminated...
The norm table is used to encode the values generated by your
lengthNorm function and document/field boosts -- so make lengthNorm return
"1", and don't use index-time boosts (or better yet: use
mark harwood wrote:
I've been building a large index (hundreds of millions) with mainly
structured data which consists of several fields with mostly unique
values.
I've been hitting out of memory issues when doing periodic commits/
closes, which I suspect is down to the sheer number of terms.
You're welcome, and let us know how it goes!
Mike
Kieran Topping wrote:
Mike, many thanks for this most comprehensive reply.
Actually, I believe that NOTE only applies to the two addIndexes
methods that take Directory. So I think this approach will work
fine in general. Have you hit any problems in testing it? I'll update
the javadocs.
On Mon, Mar 9, 2009 at 2:02 PM, Michael McCandless
wrote:
> Once added, something inside the index (a "write once" schema) records
> that this field is an IntField, and then it's an error to ever use a
> field of a different type with that same name.
I dunno... coupling functionality to restrictions seem
markharw00d wrote:
>>(a "write once" schema)
I like this idea. Enforcing consistent field-typing on instances of
fields with the same name does not seem like an unreasonable
restriction - especially given the upsides to this.
And also when it's "opt-in", ie, you can continue to use untyp
I have an index that stores a score for every pair of words.
I would like to use my analyzer, which is a combination of a whitespace
tokenizer, a stop-words filter, and stemming.
Lucene's regular score takes into account the position of the words.
I would like to add another factor to that score, which is t
>>(a "write once" schema)
I like this idea. Enforcing consistent field-typing on instances of
fields with the same name does not seem like an unreasonable restriction
- especially given the upsides to this.
It doesn't dispense with all the full schema logic in Solr but seems
like a useful ba
mark harwood wrote:
Time for some standardised index metadata?
OK, thinking out loud...
What if we created IntField, subclassing Field. It holds a single
int, and you can add it to Document just like any other field.
Once added, something inside the index (a "write once" schema) records
that this field is an IntField, and then it's an error to ever use a
field of a different type with that same name.
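Purely as an illustration of the idea being floated here (no such class exists in Lucene at this point, and the encoding is an assumption):

import org.apache.lucene.document.Field;

// Hypothetical sketch only -- the encoding and the "write once"
// schema enforcement are exactly the open design questions.
public class IntField extends Field {
    public IntField(String name, int value) {
        // A real implementation would pick a sort-friendly encoding
        // (e.g. trie) and record the field's type in the index, so
        // that reusing the name with another type becomes an error.
        super(name, String.valueOf(value), Field.Store.YES,
              Field.Index.NOT_ANALYZED);
    }
}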
Release 2.4.1 of Lucene is now available.
This release fixes bugs from 2.4.0, including one data loss bug where
in certain situations binary fields would be truncated to 0 bytes.
2.4.1 has no new features, API changes, or file-format changes,
so it's fully compatible with 2.4.0.
See chan
mark harwood wrote:
This trie/parser issue is an example of a broader issue for me.
Yeah I agree.
There was also a new Document impl attached in Jira somewhere to more
strongly type fields (can't find it now), ie IntField, DateField, etc.
And it also ties into refactoring AbstractField/Field
tsuraan wrote:
Sounds interesting. Can you tell us a bit more about the use case
for it?
Is it basically you are in a situation where you can't unzip the
index?
Indices compress pretty nicely: 30% to 50% in my experience. So, if your
indices are read-only anyhow (mine aren't live; we
> Also, have you looked at how it performs?
Just making a directory of 1,000,000 documents and reading from it, it
looks like this implementation is probably unbearably slow, unless
Lucene has some really good caching. ZipFile gives InputStreams for
the zip contents, and InputStreams don't suppor
PerFieldAnalyzerWrapper is your friend, assuming that you have separate
fields, some tokenized and some not. If you *don't* have separate
fields, then we need more details of what you hope to accomplish...
something like
(+tokenized:value1 +tokenized:value2) (+untokenized:value3 +
untokenized:valu
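A minimal sketch of that setup, assuming (purely for illustration) field names "tokenized" and "untokenized":

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class MixedFieldQueryExample {
    public static Query build(String queryText) throws ParseException {
        // Tokenized fields go through the default analyzer...
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        // ...while the untokenized field is kept as a single token.
        analyzer.addAnalyzer("untokenized", new KeywordAnalyzer());
        return new QueryParser("tokenized", analyzer).parse(queryText);
    }
}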
Hi,
I've been trying to find a way which allows executing a query that contains
both Tokenized and Untokenized fields on Lucene's index, without having to
parse the query. I've been able to execute a query which only uses Tokenized
fields as follows:
QueryParser queryParser = new QueryParser(
> Sounds interesting. Can you tell us a bit more about the use case for it?
Is it basically you are in a situation where you can't unzip the index?
Indices compress pretty nicely: 30% to 50% in my experience. So, if your
indices are read-only anyhow (mine aren't live; we do batch jobs to modif
Hmmm, I have some inklings of an idea, but can we take a step back?
Can you explain the problem you are trying to solve at a higher level
(instead of the current solution)? I imagine it is something related
to co-occurrence analysis.
On Mar 8, 2009, at 8:05 AM, liat oren wrote:
Hi Gran
On Mon, Mar 9, 2009 at 8:10 AM, Michael McCandless
wrote:
> Could we add APIs to QueryParser so the application can state the
> disposition
> toward certain fields?
overriding QueryParser.getRangeQuery() seems the most powerful and
flexible (and it's already there).
-Yonik
http://www.lucidimagin
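A sketch of that override, using ConstantScoreRangeQuery as a stand-in for whatever query type you actually want (a trie-based query would slot in the same way); the "amount" field check is an assumption:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.ConstantScoreRangeQuery;
import org.apache.lucene.search.Query;

public class FieldAwareQueryParser extends QueryParser {
    public FieldAwareQueryParser(String defaultField, Analyzer analyzer) {
        super(defaultField, analyzer);
    }

    // Hook called for every field:[a TO b] expression the parser sees.
    protected Query getRangeQuery(String field, String part1, String part2,
                                  boolean inclusive) throws ParseException {
        if ("amount".equals(field)) {  // hypothetical numeric field
            // Filter-backed range instead of the default RangeQuery.
            return new ConstantScoreRangeQuery(field, part1, part2,
                                               inclusive, inclusive);
        }
        return super.getRangeQuery(field, part1, part2, inclusive);
    }
}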
The reason I asked about Span scoring is that the behavior changed when I
switched from TermQuery to BoostingTermQuery to take advantage of payloads.
It seems to me that a SpanTermQuery and BoostingTermQuery should behave the
same as TermQuery with respect to term frequency. The 'edit distance' is
>>Maybe we could do something similar to declare that a given field uses Trie*,
>>and with what datatype.
With the current implementation you can at least test for the presence of a
field called:
[fieldName]#trie
..which tells you some form of trie is used but could be extended to include
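That check could look something like this sketch (the "#trie" suffix follows the convention described above):

import org.apache.lucene.index.IndexReader;

public class TrieFieldCheck {
    // Sketch: look for the shadow "[fieldName]#trie" entry among the
    // index's known field names.
    public static boolean usesTrie(IndexReader reader, String fieldName) {
        return reader.getFieldNames(IndexReader.FieldOption.ALL)
                     .contains(fieldName + "#trie");
    }
}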
Hi,
It would be really nice to have something that, for a query like amount
>15, parses the query depending upon the field. Basically, we need a
pluggable query parser which can convert queries like amount >15
into Lucene's query format.
That is what I think. I
Uwe Schindler wrote:
Or perhaps we should move Trie* into core Lucene, and then build a
real (ootb) integration with QueryParser.
The problem is that the query parser does not know if a field is
encoded as
trie or is just a normal text token. Furthermore, the new trie API
does not
differe
The BoW approach is simple and highly effective IMO. If you want to get a
bit fancy, you could also use a MultiField query in the combined index.
Another brute-force approach would be to hit all 3 indexes and see which
ones come back with the highest score(s).
On Mon, Mar 9, 2009 at 8:43 AM, Er
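A sketch of the MultiField variant in the combined index, with hypothetical field names:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;

public class CombinedIndexQueryExample {
    public static Query build(String userInput) throws ParseException {
        // Field names are assumptions; the query matches any of them.
        String[] fields = {"business", "locality", "road"};
        return new MultiFieldQueryParser(fields, new StandardAnalyzer())
                   .parse(userInput);
    }
}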
> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Monday, March 09, 2009 12:51 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 2.9
>
>
> Uwe Schindler wrote:
>
> >>> Are there any plans to have simpler queries for Numbers and Dates?
> >>
Thanks!
2009/3/9 Andrzej Bialecki
> liat oren wrote:
>
>> Yes, I changed it to TOKENIZED and it's working now, thanks!
>>
>> About Luke, what do you mean by saying that the analyzer is in the
>> classpath?
>> It exists in a package on my computer - it also has its filter and other
>> classes. How
Hi,
I was not aware of TrieRangeQuery, but will the syntax for numerical queries
change? For example, I want to search
amount >= 15 rather than doing it as amount:[ 15] or something.
Is there any open source query parser which converts something like amount >= 15
into Lucene's number-format query?
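For illustration, a minimal sketch of a pre-pass that rewrites "amount >= 15" into an open-ended range query; the pattern and field handling are assumptions, and the terms must sort correctly as strings (zero-padded or trie-encoded numbers) for the range to be right:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.search.ConstantScoreRangeQuery;
import org.apache.lucene.search.Query;

public class GteRewriteExample {
    private static final Pattern GTE =
        Pattern.compile("(\\w+)\\s*>=\\s*(\\S+)");

    // Sketch: turn "amount >= 15" into an inclusive, unbounded-above
    // range. Returns null if the input isn't a ">=" expression.
    public static Query rewrite(String input) {
        Matcher m = GTE.matcher(input.trim());
        if (!m.matches()) {
            return null;  // fall back to the normal QueryParser
        }
        // null upper bound = open-ended above (must not be inclusive)
        return new ConstantScoreRangeQuery(m.group(1), m.group(2), null,
                                           true, false);
    }
}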
Sure, Lucene is suited. If
The central problem here isn't the search engine, IMO, it's
figuring out what bits of the query are relevant to what
parts of the data. That is, in some random string, what is
the street, business name, address, etc.
Lucene has nothing built in that I know of that'l
Uwe Schindler wrote:
Are there any plans to have simpler queries for Numbers and Dates?
With the recent addition of TrieRangeQuery (in 2.9), I think Lucene's
range querying is actually very strong, though you'd have to subclass
QueryParser and override getRangeQuery to have it create
TrieRang
Mike, many thanks for this most comprehensive reply.
Actually, I believe that NOTE only applies to the two addIndexes
methods that take Directory. So I think this approach will work fine
in general. Have you hit any problems in testing it? I'll update the
javadocs.
I have not attempted this
> > Are there any plans to have simpler queries for Numbers and Dates?
>
> With the recent addition of TrieRangeQuery (in 2.9), I think Lucene's
> range querying is actually very strong, though you'd have to subclass
> QueryParser and override getRangeQuery to have it create TrieRangeQuery.
The add
I've been building a large index (hundreds of millions) with mainly structured
data which consists of several fields with mostly unique values.
I've been hitting out of memory issues when doing periodic commits/closes which
I suspect is down to the sheer number of terms.
I set the IndexWriter..
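For what it's worth, a sketch of the knobs usually involved here (the values are illustrative, not recommendations):

import org.apache.lucene.index.IndexWriter;

public class WriterTuningExample {
    // Sketch: bound indexing RAM explicitly rather than flushing by
    // document count, and thin the in-memory term index.
    static void tune(IndexWriter writer) {
        writer.setRAMBufferSizeMB(64.0);   // flush segments by RAM used
        writer.setTermIndexInterval(256);  // default is 128; larger means
                                           // less RAM when readers load terms
    }
}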
Allahbaksh Mohammedali Asadullah wrote:
When is Lucene 2.9 due? I am eagerly waiting for the new Lucene to
come.
There have been some discussions on java-dev, but there's no clear
consensus/date yet. We do have quite a few Jira issues marked as 2.9
at this point, which we need to make p
Excellent, that's the way to go. I hadn't realized that method was
public.
Mike
Daniel Noll wrote:
Michael McCandless wrote:
You could look at the docID of each hit, and compare to
the .maxDoc() of each underlying reader.
There is also MultiSearcher#subSearcher(int) which also works a
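In sketch form (names assumed), mapping a merged hit back to its source:

import org.apache.lucene.search.MultiSearcher;

public class HitSourceExample {
    // Sketch: given a docID from a MultiSearcher hit, find which
    // underlying searcher it came from and its local docID there.
    static void locate(MultiSearcher searcher, int mergedDocId) {
        int which = searcher.subSearcher(mergedDocId);  // index into the
                                                        // searchables array
        int localDoc = searcher.subDoc(mergedDocId);    // docID within it
        System.out.println("hit from searcher " + which
                           + ", local doc " + localDoc);
    }
}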
Hi,
When is Lucene 2.9 due? I am eagerly waiting for the new Lucene to come.
Comparing Lucene with Minion, I think Minion offers very rich capabilities,
like easier range queries, etc.
Are there any plans to have simpler queries for Numbers and Dates?
No doubt Lucene is the best Open Source Search
liat oren wrote:
Yes, I changed it to TOKENIZED and it's working now, thanks!
About Luke, what do you mean by saying that the analyzer is in the
classpath?
It exists in a package on my computer - it also has its filter and other
classes. How can it be used in Luke?
You need to add this jar to t
Thanks for all the input, guys.
As Erick suggested, let me elaborate on the problem a bit.
We are trying to develop a local search application. The user will be able
to locate businesses, localities and roads. We have data for all three with
us. We do not want to provide separate boxes for the user to ent
Hi
I am seeing some strange behaviour with the highlighter and I'm wondering if
anyone else is experiencing this. In certain instances, no summary is
generated. I perform the search, and it returns the
correct document. I can see that the Lucene document contains the text in