Searching for user agents

2010-07-23 Thread Maciej Bednarz
Hi, I am using Apache Lucene 3.0.2 and searching for an optimal analyzer to find the best-matching HTTP user agents. Imagine that we store the following HTTP user agents in a field: Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6c Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) Mozi

Re: on-the-fly "filters" from docID lists

2010-07-23 Thread Mark Harwood
> What is the best way to efficiently convert that list of primary keys to > Lucene docIds. Avoid disk seeks. Lucene is fast but still beholden to the laws of physics. Random disk seeks will cost you e.g. 50,000 * 5 ms = 250 seconds (minus any effects of OS disk caching). Best way to handle t

RE: on-the-fly "filters" from docID lists

2010-07-23 Thread Burton-West, Tom
Hi all, >>Re scalability of filter construction - the database is likely to hold stable >>primary keys not lucene doc ids >>which are unstable in the face of updates. This is the scalability issue I was concerned about. Assume the database call efficiently retrieves a sorted array of 50,000 s

LUCENE-2456 (A Column-Oriented Cassandra-Based Lucene Directory)

2010-07-23 Thread Utku Can Topçu
Hi All, I'm trying to test the patch provided in the issue. I downloaded the patch and its dependency, LUCENE-2453. I tested this contribution against revision r942817, which I assume the contributor was using during the tim

Re: Using lucene for substring matching

2010-07-23 Thread Ian Lea
So, if I've understood this correctly, you've got some text and want to loop through a list of words and/or phrases, and see which of those match the text. e.g. text "some random article about something or other of some random length" words some - matches many - no match article - matches word
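The matching described in this snippet can be sketched in plain Python. This is only an illustration of the idea using the snippet's own example text and words; it is not Lucene code and not necessarily what the thread went on to recommend. The function name `find_matches` and the whole-word, case-insensitive matching rule are assumptions made for the sketch.

```python
import re


def find_matches(text, candidates):
    """Map each candidate word or phrase to True if it occurs in text.

    Assumption for this sketch: a candidate "matches" when it appears as a
    whole word or phrase (delimited by word boundaries), ignoring case.
    """
    results = {}
    for candidate in candidates:
        # re.escape keeps punctuation in the candidate from acting as regex syntax.
        pattern = r"\b" + re.escape(candidate) + r"\b"
        results[candidate] = re.search(pattern, text, flags=re.IGNORECASE) is not None
    return results


text = "some random article about something or other of some random length"
print(find_matches(text, ["some", "many", "article"]))
```

Looping over every candidate is linear in the size of the list for each text, which is fine for a short list but may not scale to a very large one.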

Re: Reverse Lucene queries

2010-07-23 Thread Ivan Provalov
You can also look at carrot2 open source project, which does search results clustering. Cluster labels which carrot2 generates can be used as query terms "fitting" the documents in these clusters. Keep in mind that carrot2 is designed for a small set of documents (1000). http://project.carrot

Re: Hot to get word importance in lucene index

2010-07-23 Thread Grant Ingersoll
Couple of thoughts inline... On Jul 22, 2010, at 10:44 PM, Xaida wrote: > > Hi all! > > hmmm, i need to get how important is the word in entire document collection > that is indexed in the lucene index. I need to extract some "representable > words", lets say concepts that are common and can be

Re: Hot to get word importance in lucene index

2010-07-23 Thread Xaida
Thanks! I am not sure; I have to study this class more deeply today. It is a bit complex, and I am not an advanced enough user to understand it all. But this part of the description is important to me: "An efficient, effective "more-like-this" query generator would be a great contribution, if anyone'

Re: Reverse Lucene queries

2010-07-23 Thread Grant Ingersoll
On Jul 23, 2010, at 5:06 AM, Karl Wettin wrote: > > On 23 Jul 2010 at 08:30, sk...@sloan.mit.edu wrote: > >> Hi all, I have an interesting problem...instead of going from a query >> to a document collection, is it possible to come up with the best fit >> query for a given document collection (resu

Re: Hot to get word importance in lucene index

2010-07-23 Thread Karl Wettin
Are you perhaps looking for this: http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/similar/MoreLikeThis.html ? karl On 23 Jul 2010 at 10:54, Xaida wrote: Hi! thanks for reply! I will try to explain better, sorry if it was unclear. I have user text document colle

Re: Reverse Lucene queries

2010-07-23 Thread Karl Wettin
On 23 Jul 2010 at 08:30, sk...@sloan.mit.edu wrote: Hi all, I have an interesting problem...instead of going from a query to a document collection, is it possible to come up with the best fit query for a given document collection (results)? "Best fit" being a query which maximizes the hit scores o

Re: Hot to get word importance in lucene index

2010-07-23 Thread Xaida
Hi! thanks for reply! I will try to explain better, sorry if it was unclear. I have user text document collection. Not too big. Goal is to get the most "important" concepts which would in a way represent user interests. That is what i mean when i say important :) So lets say, in my collection

Re: Hot to get word importance in lucene index

2010-07-23 Thread Karl Wettin
Hi, Please define "important". Important to do what? It would probably be helpful if you explained what it is you attempt to achieve by doing this. Perhaps there is something in MoreLikeThis that will help you? karl On 23 Jul 2010 at 04:44, Xaida wrote: Hi all! hmmm, i need t

Re: Databases

2010-07-23 Thread tarun sapra
You can use Hibernate Search to keep the Lucene index synchronized with the MySQL RDBMS. On Fri, Jul 23, 2010 at 11:16 AM, manjula wijewickrema wrote: > Hi, > > Normally, when I am building my index directory for indexed documents, I > used to keep my indexed files simply in a directory

Re: Databases

2010-07-23 Thread Chris Lu
3) Sounds like you want to use Lucene for storage, without a database like MySQL. It may work, but it makes later data management hard. 1) and 2) You can use MySQL as the main storage, and pull data out to create Lucene indexes. Pay attention to incremental changes. It's a continuous process, not one-time data