frequent keyword computation within a search ( and timeinterval )

2012-01-03 Thread prasenjit mukherjee
I have a requirement where reads and writes are quite high ( @ 100-500 per-sec ). A document has the following fields : timestamp, unique-docid, content-text, keyword. Average content-text length is ~ 20 bytes, there is only 1 keyword for a given docid. At runtime, given a query-term ( which coul

Re: Tagging documents as they are indexed -- Is FST a reasonable approach?

2012-01-03 Thread Robert Muir
On Tue, Jan 3, 2012 at 7:04 PM, Ryan McKinley wrote: > On Tue, Jan 3, 2012 at 1:44 PM, Robert Muir wrote: >> On Tue, Jan 3, 2012 at 4:30 PM, Ryan McKinley wrote: >>> >>> Just brainstorming, it seems like an FST could be a good/efficient way >>> to match documents.  My plan would be to: >>> >>> 1

Re: Tagging documents as they are indexed -- Is FST a reasonable approach?

2012-01-03 Thread Ryan McKinley
On Tue, Jan 3, 2012 at 1:44 PM, Robert Muir wrote: > On Tue, Jan 3, 2012 at 4:30 PM, Ryan McKinley wrote: >> >> Just brainstorming, it seems like an FST could be a good/efficient way >> to match documents.  My plan would be to: >> >> 1. Use an Analyzer to create a TokenStream for each place name.

Re: Comparing Indexing Speed of Lucene 3.5 and 4.0

2012-01-03 Thread Peter K
Thanks, Simon, for your answer! > as far as I can see you are comparing apples and pears. When excluding the waiting time I also get the slight but reproducible difference**. The times for waitForGeneration are nearly the same (~2 sec). Also, when I commit instead of calling waitForGeneration, there is no difference

Re: Tagging documents as they are indexed -- Is FST a reasonable approach?

2012-01-03 Thread Robert Muir
On Tue, Jan 3, 2012 at 4:30 PM, Ryan McKinley wrote: > > Just brainstorming, it seems like an FST could be a good/efficient way > to match documents.  My plan would be to: > > 1. Use an Analyzer to create a TokenStream for each place name.  From > the TokenStream create an FST -- this would have t

Tagging documents as they are indexed -- Is FST a reasonable approach?

2012-01-03 Thread Ryan McKinley
Happy new year! I'm working on a way to simply geocode documents as they are indexed. I'm hoping to use existing Lucene infrastructure to do this as much as possible. My plan is to build an index of known place names, then look for matches in incoming text. When there is a match, some extra field

Re: Boolean OR does not work as described

2012-01-03 Thread Michael-O
Hi Uwe, Uwe Schindler wrote: Hi Mike, if you want to mix and/or in one query, always use parentheses. The operator precedence is strange with the default query parser. In contrib there is another one (called PrecedenceQueryParser) that can handle this but is incompatible with existing queries

Re: Boolean OR does not work as described

2012-01-03 Thread Michael-O
Chris Hostetter wrote: : if you want to mix and/or in one query, always use parentheses. The or better yet, train yourself not to use AND, OR and NOT... http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/ Thanks for the blog entry. I will read through that! ---

Re: How to use RAMDirectory more efficiently

2012-01-03 Thread dyzc2010
Charlie, the code you provided will double the size of the index on FS every time saving occurs. Can we avoid duplicating the index and synchronize only the changed records instead? -- Original -- From: "dyzc2010 "<1393975...@qq.com>; Date: Mon, Jan 2, 2012 01:30

Re: AW: Boolean OR does not work as described

2012-01-03 Thread Michael-O
Anna Hunecke wrote: Hi Mike, I think the problem is grouping. If you have a query A AND B OR C, it will be grouped as A AND (B OR C) and not, as you expected, as (A AND B) OR C. Just put parentheses in your query and you get the result that you want. Hi Anna, This is it but the documentation

Re: Comparing Indexing Speed of Lucene 3.5 and 4.0

2012-01-03 Thread Simon Willnauer
hey Peter, as far as I can see you are comparing apples and pears. Your comparison is waiting for merges to finish and if you are using multiple threads lucene 4.0 will flush more segments to disk than 3.5 so what you are seeing is likely a merge that is still trying to merge small segments. can y

RE: Boolean OR does not work as described

2012-01-03 Thread Uwe Schindler
Hi Hoss, Hey, nice blog article - it makes my previous mail obsolete :-) Explained perfectly! Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Chris Hostetter [mailto:hossman_luc...@fucit.org] > Sen

RE: Boolean OR does not work as described

2012-01-03 Thread Chris Hostetter
: if you want to mix and/or in one query, always use parentheses. The or better yet, train yourself not to use AND, OR and NOT... http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/ -Hoss - To unsubscribe

Boolean OR does not work as described

2012-01-03 Thread 1983-01-06
Hi folks, I have a query result problem I do not understand. The documentation for Lucene 3.2 query syntax says the following about boolean OR queries: "The OR operator links two terms and finds a matching document if either of the terms exist in a document. This is equivalent to a union using

Comparing Indexing Speed of Lucene 3.5 and 4.0

2012-01-03 Thread Peter K
Hi, I recently switched an experimental project from Lucene 3.5 to 4.0 (snapshot from 6th Dec 2011) and my indexing time increased by nearly 20% on my local machine*. It seems to me that two simple StringFields could cause this slowdown: Field uIdField = new Field("_uid", "" + id, StringField.TYPE_STORED);

RE: Boolean OR does not work as described

2012-01-03 Thread Uwe Schindler
Hi Mike, if you want to mix and/or in one query, always use parentheses. The operator precedence is strange with the default query parser. In contrib there is another one (called PrecedenceQueryParser) that can handle this but is incompatible with existing queries. The parser in contrib on the

AW: Boolean OR does not work as described

2012-01-03 Thread Anna Hunecke
Hi Mike, I think the problem is grouping. If you have a query A AND B OR C, it will be grouped as A AND (B OR C) and not, as you expected, as (A AND B) OR C. Just put parentheses in your query and you get the result that you want. Best, Anna -Original Message- From: 1983-01...@gmx.n

Re: Spatial Search

2012-01-03 Thread David Smiley (@MITRE.org)
I have a couple of comments on your code after briefly looking through it. * If you want to work with miles, initialize SimpleSpatialContext with DistanceUnits.MILES. I see you chose KM but your leading statistics are in miles. Perhaps that is the reason for the discrepancy in your numbers. * You a

Re: Designing a multilingual index

2012-01-03 Thread Robert Muir
On Tue, Jan 3, 2012 at 10:10 AM, Paul Libbrecht wrote: > I think the idf is also about terms and not about tokens. > Maybe an expert can confirm my belief or we have to invent a test. > idf is docFreq and maxDoc. docFreq is per-field, maxDoc is not. This might not even matter though. if you are
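
To make the docFreq/maxDoc point concrete, here is a minimal sketch of the classic idf formula as used, to my knowledge, by DefaultSimilarity in Lucene 3.x. The class name IdfSketch and the sample numbers are made up for illustration; only the formula itself is meant to mirror Lucene.

```java
// Hypothetical helper, not part of Lucene: it mirrors the classic
// idf formula 1 + ln(maxDoc / (docFreq + 1)).
final class IdfSketch {
    // docFreq: number of documents containing the term in ONE field.
    // maxDoc:  number of documents in the whole index (not per field).
    static double idf(int docFreq, int maxDoc) {
        return 1.0 + Math.log(maxDoc / (double) (docFreq + 1));
    }
}
```

With one shared index (same maxDoc), the same token can still score differently per language field, because its docFreq is counted per field: for example, IdfSketch.idf(9, 1000) is larger than IdfSketch.idf(99, 1000).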

Re: Designing a multilingual index

2012-01-03 Thread Paul Libbrecht
I think the idf is also about terms and not about tokens. Maybe an expert can confirm my belief or we have to invent a test. paul On 3 Jan 2012, at 15:43, heikki wrote: > hi Paul, > > yes, but my concern isn't about the term-frequency, but rather the > inverse document frequency, which als

Re: Indexing product keys with and without spaces in them

2012-01-03 Thread Christoph Kaser
Unfortunately, we don't have a designated field for product identifiers, and the product identifiers are from various manufacturers. So it is hard to normalize product keys, as we can't distinguish them from other parts of the document. Examples are xbox 360 (which might be searched as xbox360)

Re: Indexing product keys with and without spaces in them

2012-01-03 Thread Ian Lea
My suggestion wasn't to store/index the triplets, just a normalized version of the product key. So if you had id: CRXUSB2.0-16GB desc: some 16GB USB thing you'd index, in your searchable words field, "CRXUSB2.016GB some 16GB USB thing" And then at search time you'd take "CRX USB2.0-16G" and nor
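
A minimal sketch of the normalization described above, assuming that "normalized" simply means dropping whitespace and hyphens. The class name ProductKeyNormalizer is hypothetical, and a real setup would more likely do this inside an Analyzer or TokenFilter.

```java
// Hypothetical helper for illustration; not a Lucene class.
final class ProductKeyNormalizer {
    // Removes every run of whitespace and hyphens so that differently
    // spaced spellings of a product key collapse to the same string.
    static String normalize(String key) {
        return key.replaceAll("[\\s-]+", "");
    }
}
```

Applied at index time to "CRXUSB2.0-16GB" and at search time to a spaced variant such as "CRX USB2.0-16GB" (an example query chosen here, not taken from the thread), both sides produce "CRXUSB2.016GB" and therefore match.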

Re: Designing a multilingual index

2012-01-03 Thread heikki
hi Paul, yes, but my concern isn't about the term-frequency, but rather the inverse document frequency, which also is used in the relevance score and which takes into account all documents in the index. In this way the relevance score of one document is influenced by the contents of all other do

Re: Designing a multilingual index

2012-01-03 Thread Paul Libbrecht
Heikki, it does solve your main concern: a term in lucene is a pair of a token and field name. The term frequency is, thus, the frequency of a token in a field. So the term-frequency of text-stemmed-de:firewall is independent of the term-frequency of text-stemmed-en:firewall (for example). But

RE: Indexing product keys with and without spaces in them

2012-01-03 Thread Uwe Schindler
Hi, > Has somebody ever tried something like this? Is there a way to do this without > increasing the index to about 15 times (1+2+3+4+5) its original size? The index will not have 15 times the size, as it is an inverted index and only indexes the unique parts of your tokens. In most cases it will ha

Re: Designing a multilingual index

2012-01-03 Thread heikki
hi, thanks for your response : > On the web it is often hard to trust such (e.g. because of people working in multiple languages, internet cafés...) but... it is your choice. our web app has a language selector for the user to choose the GUI language >After? >Would "shallow matches" in the righ

RE: Indexing product keys with and without spaces in them

2012-01-03 Thread Uwe Schindler
Hi, In Solr there is WordDelimiterFilter (it's also available in Lucene trunk/4.0, in the analyzers-common module), which can handle these product keys (split them up, keep them together, merge them). You can extract the source code in 3.x and use it as your own TokenFilter! But if the product keys are

Re: Indexing product keys with and without spaces in them

2012-01-03 Thread Christoph Kaser
Hi Ian, thank you for your reply. Unfortunately this will be hard, as we have no way of knowing at which position the user might enter spaces, so we cannot expand the product keys at indexing time. The other way round (triplets without spaces or hyphens) might work, however we have no real

Re: Designing a multilingual index

2012-01-03 Thread Paul Libbrecht
On 3 Jan 2012, at 13:56, heikki wrote: > In our case, it is "known" in which language the user is searching (because > he tells us, and if he doesn't, we use the current GUI language). On the web it is often hard to trust such (e.g. because of people working in multiple languages, internet c

Commit data to disk ...

2012-01-03 Thread Dragon Fly
Hi, I'm using Lucene 2.0 and was wondering how to flush/commit index data to disk. It doesn't look like there is a flush() or commit() method in the 2.0 IndexWriter. Is there a way to flush the data without calling close()? Thank you.

Re: Indexing product keys with and without spaces in them

2012-01-03 Thread Christoph Kaser
Hi Aditya, Thank you for your suggestion! Unfortunately, this is not possible, as there is no common format for all product keys. The products are not ours nor are they all from the same manufacturer, so we don't have any influence on what the product keys look like. Regards, Christoph On 03

Re: Indexing product keys with and without spaces in them

2012-01-03 Thread findbestopensource
Hi Christoph, in my opinion you should not normalize or otherwise modify the product keys. They should be unique and used as they are. Ideally only "-" would have been used instead of spaces, but since the products are already on the market, that cannot be helped. In your UI, you could provide multiple

Re: Designing a multilingual index

2012-01-03 Thread heikki
hello, I would like to have your opinions on the impact on relevance scoring in the scenario where multiple languages are indexed in a single index. >> Besides, IMO, scoring / ordering documents in different >> languages is a bit like comparing apples and oranges. > Not too sure about that. I

Re: Indexing product keys with and without spaces in them

2012-01-03 Thread Ian Lea
When indexing you could normalise them down to a standard format without spaces or hyphens, but searching is much harder if you really can't identify possible product ids within user queries. Make triplets without spaces or hyphens? "CRX USB-2.0 16GB" ==> CRXUSB2.016GB but also "some random words
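
The triplet idea can be sketched as follows, assuming a query is split on whitespace and every run of three consecutive words is joined with hyphens removed. The class name QueryTriplets and the window size as a fixed constant are assumptions for illustration, not something the thread specifies beyond "triplets".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not a Lucene class.
final class QueryTriplets {
    // Joins each window of three consecutive query words and strips
    // hyphens, yielding candidate normalized product keys.
    static List<String> triplets(String query) {
        String[] words = query.trim().split("\\s+");
        List<String> candidates = new ArrayList<>();
        for (int i = 0; i + 3 <= words.length; i++) {
            String joined = words[i] + words[i + 1] + words[i + 2];
            candidates.add(joined.replace("-", ""));
        }
        return candidates;
    }
}
```

Each candidate would then be searched against the normalized product keys in addition to the ordinary word query; queries with fewer than three words yield no candidates.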

Indexing product keys with and without spaces in them

2012-01-03 Thread Christoph Kaser
Hello, we use lucene as search engine in an online shop. The products in this shop often contain product keys like CRXUSB2.0-16GB. We would like our customers to be able to find products by entering their key. The problem is that product keys sometimes contain spaces or dashes and customers so