date:20100503

Indexing only newly created files

2010-05-03 Thread Vijay Veeraraghavan

Dear all, I am using Lucene 3.0 to index the PDF reports that I generate dynamically. I index the PDF file name (without extension), file path and its absolute path as fields. I search with the file name without extension; it retrieves a list, as there are usually 2 or more files with the same name

Re: Indexing only newly created files

2010-05-03 Thread Simon Willnauer

Hey there, you might have to implement some kind of unique identifier using an indexed Lucene field. When you are indexing, you should fire a query with the UUID of your document (maybe the path to your PDF document) and check whether the document is already in the index. You could also do a boolean qu
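A minimal sketch of the check-before-index idea from the reply above, assuming the file path serves as the unique identifier. The class `UniqueIdIndexSketch` and its in-memory `Set` are hypothetical stand-ins for the index; against a real Lucene index the membership test would be a term query on the identifier field, and `IndexWriter.updateDocument(Term, Document)` can replace any existing document that carries the same identifier term.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an index keyed by a unique id (here, the file path).
// A real index would answer the membership check with a query on the id field.
class UniqueIdIndexSketch {
    private final Set<String> indexedIds = new HashSet<>();

    /** Records the id only if it is absent; returns true when it was newly added. */
    boolean indexIfAbsent(String id) {
        return indexedIds.add(id);
    }

    /** Returns, in input order, the candidate paths that were not indexed before. */
    List<String> indexNewOnly(List<String> candidatePaths) {
        List<String> added = new ArrayList<>();
        for (String path : candidatePaths) {
            if (indexIfAbsent(path)) {
                added.add(path);
            }
        }
        return added;
    }
}
```

Calling `indexNewOnly` twice with overlapping path lists returns only the paths not seen in the first call, which is the behaviour the original question asks for.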

Re: Indexing only newly created files

2010-05-03 Thread Vijay Veeraraghavan

Dear Simon, thanks for your reply; I found it very useful. I have another question: I create the index in a clustered environment (2 physical systems and 2 virtual). The index will be created on a system shared among the nodes. The scheduler runs in another remote system which will create an

Re: Indexing only newly created files

2010-05-03 Thread Vijay Veeraraghavan

Dear all, as suggested in the reply below: if I search the index for each document again and skip indexing when it is found, is that not similar to indexing all the PDF documents once again? Is this not overhead? As I am not going to index the details of the pdf (so if an indexed pdf was recreated i n

Using IndexReader in the web environment

2010-05-03 Thread Vijay Veeraraghavan

Hi all, in a clustered environment I search the index from the web application. In the web application I am creating an IndexReader on each request. Is it expensive to do it like this? I read somewhere on the web that one should reuse the same reader as much as possible. Can I keep the initially created IndexR

AW: Relevancy Practices

2010-05-03 Thread Uwe Goetzke

Regarding Part 3: Data quality. For our search domain (catalog products) we very often face the problem that the search data is full of acronyms and abbreviations like: cable,nym-j,pvc,3x2.5mm² or dvd-/cd-/usb-carradio,4x50W,divx,bl We solved this by a combination of normalization for better data

Re: Using IndexReader in the web environment

2010-05-03 Thread Erick Erickson

The quick answer is that the session is probably the wrong place to keep an IndexReader, since that's per-user. I'd define a new server/servlet that did my searching and have my webapps use that. Makes it really simple to re-use index readers. And reopening the IndexReader for each request will p
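A minimal sketch of the "one shared reader" idea from the reply above, assuming a single search component owns the reader and every request borrows it. `SharedReaderHolder` and its `opener` supplier are hypothetical names, and the type parameter `R` stands in for the reader type; closing a replaced reader once in-flight searches finish is deliberately left out.

```java
import java.util.function.Supplier;

// Hypothetical holder that opens a reader lazily, once, and shares it
// across requests instead of storing one per user session.
class SharedReaderHolder<R> {
    private final Supplier<R> opener; // assumption: knows how to open a reader
    private volatile R current;

    SharedReaderHolder(Supplier<R> opener) {
        this.opener = opener;
    }

    /** Returns the shared reader, opening it on first use only. */
    R get() {
        R reader = current;
        if (reader == null) {
            synchronized (this) {
                if (current == null) {
                    current = opener.get();
                }
                reader = current;
            }
        }
        return reader;
    }

    /** Swaps in a freshly opened reader, e.g. after the index has changed. */
    synchronized void refresh() {
        current = opener.get();
    }
}
```

With this shape, request handlers call `get()` and never open a reader themselves; only the component that knows the index has changed calls `refresh()`.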

Re: Relevancy Practices

2010-05-03 Thread Peter Keegan

We discovered very soon after going to production that Lucene's scores were often 'too precise'. For example, a page of 25 results may have several different score values, and all within 15% of each other, but to the end user all 25 results were equally relevant. Thus we wanted the secondary sort f
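One way to make near-equal scores tie so that a secondary sort can take effect is to quantize the score before comparing. The sketch below is an assumption about how that could be done, not the approach the poster describes (the snippet is cut off before the details); the class `BucketedRanking`, the `Hit` fields, and the bucket width are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: hits whose scores fall into the same bucket are
// treated as equally relevant and ordered by a secondary key instead.
class BucketedRanking {
    static final class Hit {
        final String id;
        final double score;
        final long secondaryKey; // e.g. a timestamp or popularity count (assumption)

        Hit(String id, double score, long secondaryKey) {
            this.id = id;
            this.score = score;
            this.secondaryKey = secondaryKey;
        }
    }

    /** Maps a score to a coarse bucket index of the given width. */
    static long bucket(double score, double width) {
        return (long) Math.floor(score / width);
    }

    /** Sorts by bucket (highest first), then by secondary key (highest first). */
    static List<Hit> rank(List<Hit> hits, double width) {
        List<Hit> sorted = new ArrayList<>(hits);
        Comparator<Hit> byBucket =
                Comparator.comparingLong((Hit h) -> bucket(h.score, width)).reversed();
        Comparator<Hit> bySecondary =
                Comparator.comparingLong((Hit h) -> h.secondaryKey).reversed();
        sorted.sort(byBucket.thenComparing(bySecondary));
        return sorted;
    }
}
```

A fixed bucket width is the simplest choice; a width relative to the top score would be closer to the "within 15% of each other" observation, at the cost of a slightly more involved comparator.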

Which way would you recommend for successive-words similarity and scoring?

2010-05-03 Thread Pablo

Hello, Lucene core doesn't seem to use relative word positioning (?) for scoring. For example, indexing the phrase "a b c d e f g h i j k l m n o p q r s t u v w x y z", these queries give the same results (0.19308087) : - 1 : phrase:'e f g' - 2 : phrase:'o k z' I'm a bit familiar with lucen

Re: Relevancy Practices

2010-05-03 Thread Ivan Provalov

Grant, we are currently working on a relevancy improvement project. We took IBM's paper from TREC 2007 and followed the approaches they described to improve Lucene's relevance. It also gave us some idea of Lucene’s out-of-the-box precision performance (MAP). In addition to it we used som

Re: Questions about the new query parser framework

2010-05-03 Thread Daniel Noll

On Mon, May 3, 2010 at 15:11, Adriano Crestani wrote: > I actually never liked how QueryNode -> query string is done today, using > QueryNode.toQueryString(...) method. A QueryNode shouldn't be responsible > for converting itself back to the string format, because different > SyntaxParser(s) may c

11 matches