I have a requirement where reads and writes are quite high (~100-500
per second). A document has the following fields: timestamp,
unique-docid, content-text, keyword. Average content-text length is ~
20 bytes, and there is only one keyword for a given docid.
At runtime, given a query-term (which coul
On Tue, Jan 3, 2012 at 7:04 PM, Ryan McKinley wrote:
> On Tue, Jan 3, 2012 at 1:44 PM, Robert Muir wrote:
>> On Tue, Jan 3, 2012 at 4:30 PM, Ryan McKinley wrote:
>>>
>>> Just brainstorming, it seems like an FST could be a good/efficient way
>>> to match documents. My plan would be to:
>>>
>>> 1
On Tue, Jan 3, 2012 at 1:44 PM, Robert Muir wrote:
> On Tue, Jan 3, 2012 at 4:30 PM, Ryan McKinley wrote:
>>
>> Just brainstorming, it seems like an FST could be a good/efficient way
>> to match documents. My plan would be to:
>>
>> 1. Use an Analyzer to create a TokenStream for each place name.
Thanks Simon for your answer!
> as far as I can see you are comparing apples and pears.
When excluding the waiting time I also get the slight but reproducible
difference**. The times for waitForGeneration are nearly the same
(~2 sec). Also when I commit instead of calling waitForGeneration there
is no difference
On Tue, Jan 3, 2012 at 4:30 PM, Ryan McKinley wrote:
>
> Just brainstorming, it seems like an FST could be a good/efficient way
> to match documents. My plan would be to:
>
> 1. Use an Analyzer to create a TokenStream for each place name. From
> the TokenStream create an FST -- this would have t
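For what it's worth, a minimal sketch of that plan against the trunk/4.0
FST API (Builder, PositiveIntOutputs, Util); the place names and ids are
made up, the analysis step is elided, and exact signatures may differ
between trunk revisions:

  import org.apache.lucene.util.BytesRef;
  import org.apache.lucene.util.IntsRef;
  import org.apache.lucene.util.fst.Builder;
  import org.apache.lucene.util.fst.FST;
  import org.apache.lucene.util.fst.PositiveIntOutputs;
  import org.apache.lucene.util.fst.Util;

  // Map each (already analyzed/normalized) place name to a payload id.
  PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton(true);
  Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
  IntsRef scratch = new IntsRef();
  // Inputs must be added in sorted order.
  builder.add(Util.toIntsRef(new BytesRef("new york"), scratch), 1L);
  builder.add(Util.toIntsRef(new BytesRef("paris"), scratch), 2L);
  FST<Long> fst = builder.finish();
  Long id = Util.get(fst, new BytesRef("paris")); // -> 2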
Happy new year!
I'm working on a simple way to geocode documents as they are indexed.
I'm hoping to use existing Lucene infrastructure to do this as much as
possible. My plan is to build an index of known place names, then look
for matches in incoming text. When there is a match, some extra
field
Hi Uwe,
Uwe Schindler wrote:
Hi Mike,
if you want to mix AND/OR in one query, always use parentheses. The
operator precedence is strange with the default query parser. In
contrib there is another one (called PrecedenceQueryParser) that can
handle this but is incompatible with existing queries
Chris Hostetter wrote:
: if you want to mix AND/OR in one query, always use parentheses. The
or better yet, train yourself not to use AND, OR and NOT...
http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/
Thanks for the blog entry. I will read through that!
---
Charlie,
the code you provided will double the size of the index on the
filesystem every time saving occurs.
Can we avoid duplicating the index and instead synchronize only the
changed records?
-- Original --
From: "dyzc2010 "<1393975...@qq.com>;
Date: Mon, Jan 2, 2012 01:30
Anna Hunecke wrote:
Hi Mike,
I think the problem is grouping. If you have a query A AND B OR C, it will be
grouped as A AND (B OR C) and not, as you expected, as (A AND B) OR C.
Just put parentheses in your query and you get the result that you want.
Hi Anna,
This is it but the documentation
hey Peter,
as far as I can see you are comparing apples and pears. Your
comparison is waiting for merges to finish, and if you are using
multiple threads Lucene 4.0 will flush more segments to disk than 3.5,
so what you are seeing is likely a merge that is still trying to merge
small segments. can y
Hi Hoss,
Hey, nice blog article - it makes my previous mail obsolete :-) Explained
perfectly!
Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -----Original Message-----
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> Sen
: if you want to mix AND/OR in one query, always use parentheses. The
or better yet, train yourself not to use AND, OR and NOT...
http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/
-Hoss
Hi folks,
I have a query result problem I do not understand. The documentation for Lucene
3.2 query syntax says the following about boolean OR queries: "The OR operator
links two terms and finds a matching document if either of the terms exist in a
document. This is equivalent to a union using
Hi,
I recently switched an experimental project from Lucene 3.5 to a 4.0
snapshot from 6th Dec 2011, and my indexing time increased by nearly
20% on my local machine*.
It seems to me that two simple StringFields could cause this slowdown:
Field uIdField = new Field("_uid", "" + id, StringField.TYPE_STORED);
Hi Mike,
if you want to mix AND/OR in one query, always use parentheses. The operator
precedence is strange with the default query parser. In contrib there is
another one (called PrecedenceQueryParser) that can handle this but is
incompatible with existing queries. The parser in contrib on the
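For reference, a sketch of using it, assuming the 3.x contrib API where
PrecedenceQueryParser is built on the flexible parser framework and
parse() takes the default field as a second argument (the field name
"text" is made up):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.precedence.PrecedenceQueryParser;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.util.Version;

  PrecedenceQueryParser parser =
      new PrecedenceQueryParser(new StandardAnalyzer(Version.LUCENE_35));
  // Here AND binds tighter than OR, so this parses as (A AND B) OR C:
  Query q = parser.parse("A AND B OR C", "text");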
Hi Mike,
I think the problem is grouping. If you have a query A AND B OR C, it will be
grouped as A AND (B OR C) and not, as you expected, as (A AND B) OR C.
Just put parentheses in your query and you get the result that you want.
Best,
Anna
-----Original Message-----
From: 1983-01...@gmx.n
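A quick way to see the grouping Anna describes is to print the parsed
query and compare it with the parenthesized form; a minimal sketch
against the 3.x classic QueryParser (the field name "text" is made up):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.util.Version;

  public class GroupingDemo {
    public static void main(String[] args) throws Exception {
      QueryParser qp = new QueryParser(Version.LUCENE_35, "text",
          new StandardAnalyzer(Version.LUCENE_35));
      // Compare which clauses end up required (+) vs optional:
      System.out.println(qp.parse("A AND B OR C"));
      System.out.println(qp.parse("(A AND B) OR C"));
    }
  }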
I have a couple of comments on your code after briefly looking through it.
* If you want to work with miles, initialize SimpleSpatialContext with
DistanceUnits.MILES. I see you chose KM but your leading statistics are in
miles. Perhaps that is the reason for the discrepancy in your numbers.
* You a
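For what it's worth, the fix suggested above would look something like
this; the class names come from the mail, but the package locations and
constructor signature are assumptions from that era of the spatial module:

  // Work in miles so the context matches the reported statistics:
  SpatialContext ctx = new SimpleSpatialContext(DistanceUnits.MILES);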
On Tue, Jan 3, 2012 at 10:10 AM, Paul Libbrecht wrote:
> I think the idf is also about terms and not about tokens.
> Maybe an expert can confirm my belief or we have to invent a test.
>
idf is docFreq and maxDoc.
docFreq is per-field, maxDoc is not. This might not even matter though.
if you are
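For reference, a sketch of the stock formula as I recall it from 3.x
DefaultSimilarity (docFreq is the per-field document frequency of the
term, maxDoc the total document count):

  // org.apache.lucene.search.DefaultSimilarity-style idf:
  float idf = (float) (Math.log(maxDoc / (double) (docFreq + 1)) + 1.0);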
I think the idf is also about terms and not about tokens.
Maybe an expert can confirm my belief or we have to invent a test.
paul
On Jan 3, 2012, at 15:43, heikki wrote:
> hi Paul,
>
> yes, but my concern isn't about the term-frequency, but rather the
> inverted-document-frequency, which als
Unfortunately, we don't have a designated field for product identifiers,
and the product identifiers are from various manufacturers. So it is
hard to normalize product keys, as we can't distinguish them from other
parts of the document.
Examples are xbox 360 (which might be searched as xbox360)
My suggestion wasn't to store/index the triplets, just a normalized
version of the product key. So if you had
id: CRXUSB2.0-16GB
desc: some 16GB USB thing
you'd index, in your searchable words field, "CRXUSB2.016GB some 16GB USB thing"
And then at search time you'd take "CRX USB2.0-16G" and nor
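A sketch of that normalization, applied identically at index and search
time (illustrative only; the regex is mine, not from the thread):

  // Strip whitespace and hyphens so "CRX USB2.0-16GB" and
  // "CRXUSB2.0-16GB" both index/search as "CRXUSB2.016GB".
  static String normalizeKey(String key) {
    return key.replaceAll("[\\s\\-]+", "");
  }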
hi Paul,
yes, but my concern isn't about the term frequency, but rather the
inverse document frequency, which is also used in the relevance score and
which takes into account all documents in the index. In this way the
relevance score of one document is influenced by the contents of all other
do
Heikki,
it does solve your main concern: a term in Lucene is a pair of a token and
a field name.
The term frequency is, thus, the frequency of a token in a field.
So the term-frequency of text-stemmed-de:firewall is independent of the
term-frequency of text-stemmed-en:firewall (for example).
But
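Concretely (standard Lucene API; the field names are from Heikki's setup):

  import org.apache.lucene.index.Term;

  // The same token in two language fields is two distinct terms,
  // each with its own document frequency:
  Term de = new Term("text-stemmed-de", "firewall");
  Term en = new Term("text-stemmed-en", "firewall");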
Hi,
> Has somebody ever tried something like this? Is there a way to do this
without
> increasing the index to about 15 times (1+2+3+4+5) its original size?
The index will not be 15 times the size, as it is an inverted index and only
indexes the unique parts of your tokens. In most cases it will ha
hi,
thanks for your response:
> On the web it is often hard to trust such (e.g. because of people working
in multiple languages, internet cafés...) but... it is your choice.
our web app has a language selector for the user to choose the GUI language
>After?
>Would "shallow matches" in the righ
Hi,
In Solr there is WordDelimiterFilter (it's also available in Lucene
trunk/4.0, in the analyzers-common module), which can handle these product
keys (split them up, keep them together, merge them). You can extract the
source code in 3.x and use it as your own TokenFilter! But if the product keys
are
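A sketch of wiring it up against the trunk/4.0 analyzers-common API
(constructor and flag names assumed from that module; "input" is some
upstream TokenStream, and the flag choice here is just one plausible
combination):

  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;

  int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
            | WordDelimiterFilter.GENERATE_NUMBER_PARTS
            | WordDelimiterFilter.CATENATE_ALL
            | WordDelimiterFilter.PRESERVE_ORIGINAL;
  // null = no protected-words set:
  TokenStream ts = new WordDelimiterFilter(input, flags, null);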
Hi Ian,
thank you for your reply.
Unfortunately this will be hard, as we have no way of knowing at which
position the user might enter spaces, so we cannot expand the product
keys at indexing time.
The other way round (triplets without spaces or hyphens) might work;
however, we have no real
On Jan 3, 2012, at 13:56, heikki wrote:
> In our case, it is "known" in which language the user is searching (because
> he tells us, and if he doesn't, we use the current GUI language).
On the web it is often hard to trust such (e.g. because of people working in
multiple languages, internet c
Hi, I'm using Lucene 2.0 and was wondering how to flush/commit index data to
disk. It doesn't look like there is a flush() or commit() method in the 2.0
IndexWriter. Is there a way to flush the data without calling close()? Thank
you.
Hi Aditya,
Thank you for your suggestion!
Unfortunately, this is not possible, as there is no common format for
all product keys. The products are not ours, nor are they all from the
same manufacturer, so we don't have any influence on what the product
keys look like.
Regards,
Christoph
On 03
Hi Christoph
My opinion is that you should not normalize or modify the product
keys. These should be unique and used as they are. Instead of spaces
you should have used only "-", but since the products are already out
in the market, that cannot be helped.
In your UI, you could provide multiple
hello,
I would like to have your opinions on the impact on relevance scoring in the
scenario where multiple languages are indexed in a single index.
>> Besides, IMO, scoring / ordering documents in different
>> languages is a bit like comparing apples and oranges.
> Not too sure about that. I
When indexing you could normalise them down to a standard format
without spaces or hyphens, but searching is much harder if you really
can't identify possible product ids within user queries. Make
triplets without spaces or hyphens? "CRX USB-2.0 16GB" ==>
CRXUSB2.016GB but also "some random words
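One way to get such concatenated groups at index time is a shingle
filter with an empty token separator; a sketch only, not necessarily
what Ian had in mind ("input" is some upstream TokenStream that has
already split on spaces/hyphens):

  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.shingle.ShingleFilter;

  // Emit 2- and 3-token shingles glued together without a separator,
  // so "CRX" "USB2.0" "16GB" also produces "CRXUSB2.016GB".
  ShingleFilter shingles = new ShingleFilter(input, 2, 3);
  shingles.setTokenSeparator("");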
Hello,
we use Lucene as the search engine in an online shop. The products in this
shop often contain product keys like CRXUSB2.0-16GB.
We would like our customers to be able to find products by entering
their key. The problem is that product keys sometimes contain spaces or
dashes and customers so