Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski
On 25/07/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 7/25/07, Stanislaw Osinski <[EMAIL PROTECTED]> wrote: > JavaCC is slow indeed. JavaCC is a very fast parser for a large document... the issue is small fields and JavaCC's use of an exception for flow control at the end of a value. As JVMs

Strange Error while deleting Documents from index while indexing.

2007-07-25 Thread miztaken
Hi, I am dumping database tables into Lucene documents. I am doing it like this: 1. Get the rowset from the database to be stored as a Lucene Document. 2. Open IndexReader and check if they are already indexed. If indexed, delete them and add the new rowset. Continue this till the end 3. Close I

Linear Hashing in Lucene?

2007-07-25 Thread Dmitry
Hey, Some common questions about Lucene. 1. Does an Ontology Wrapper exist in the Lucene implementation? 2. Does Lucene use Linear Hashing? thanks, DT, www.ejinz.com Search news - To unsubscribe, e-mail: [EMAIL PROTECTED] For addition

Highlighter strategy in Lucene

2007-07-25 Thread Dmitry
What kind of Highlighter strategy is Lucene using? thanks, Dt www.ejinz.com Search Engine for News

Displaying results in the order

2007-07-25 Thread Dmitry
Is there a way to update a document in the Index without causing any change to the order in which it comes up in searches? thanks, DT, www.ejinz.com Search everything news, tech, movies, music

Re: Search for null

2007-07-25 Thread Daniel Noll
On Thursday 26 July 2007 03:12:20 daniel rosher wrote: > In this case you should look at the source for RangeFilter.java. > > Using this you could create your own filter using TermEnum and TermDocs > to find all documents that had some value for the field. That's certainly the way to do it for spe

Assembling a query from multiple fields

2007-07-25 Thread Joe Attardi
Hi all, Apologies for the cryptic subject line, but I couldn't think of a more descriptive one-liner to describe my problem/question to you all. Still fairly new to Lucene here, although I'm hoping to have more of a clue once I get a chance to read "Lucene In Action". I am implementing a search

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Yonik Seeley
On 7/25/07, Stanislaw Osinski <[EMAIL PROTECTED]> wrote: JavaCC is slow indeed. JavaCC is a very fast parser for a large document... the issue is small fields and JavaCC's use of an exception for flow control at the end of a value. As JVMs have advanced, exception-as-control-flow has gotten com

MoreLikeThis for multiple documents

2007-07-25 Thread Jens Grivolla
Hello, I'm looking to extract significant terms characterizing a set of documents (which in turn relate to a topic). This basically comes down to functionality similar to determining the terms with the greatest offer weight (as used for blind relevance feedback), or maximizing tf.idf (as is
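The tf.idf weighting mentioned above can be computed directly. A minimal plain-Java sketch (not Lucene's MoreLikeThis; the class name, collection size, and term counts are hypothetical) of ranking candidate terms by tf.idf:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TfIdfSketch {
    // tf.idf for a term: frequency in the document set times the log of
    // (total docs in the collection / docs containing the term).
    static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        int numDocs = 1000; // hypothetical collection size
        // hypothetical {term frequency, document frequency} pairs
        Map<String, int[]> stats = new LinkedHashMap<>();
        stats.put("lucene", new int[]{12, 50});
        stats.put("the", new int[]{40, 990});
        // rank by descending tf.idf; rare-but-frequent-here terms win
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, int[]> e : stats.entrySet()) {
            double s = tfIdf(e.getValue()[0], e.getValue()[1], numDocs);
            if (s > bestScore) { bestScore = s; best = e.getKey(); }
        }
        System.out.println(best); // "lucene": 12*ln(20) ≈ 35.9 beats 40*ln(1000/990) ≈ 0.4
    }
}
```

Terms that are common in the document set but rare in the collection come out on top, which is what blind relevance feedback needs.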

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Hey Guys, Thanks for all the responses. I finally got it working with some query modification. The idea was to pick an itemID from the database and for that itemID in the Index, get the scores across 4 fields; add them up and ta-da ! I still have to verify my scores. Thanks a ton, I'll be activ

Delete corrupted doc

2007-07-25 Thread Rafael Rossini
Hi guys, Is there a way of deleting a document that, because of some corruption, got a docID larger than maxDoc()? I'm trying to do this but I get this Exception: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 106577 at org.apache.lucen

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Doron Cohen
"Askar Zaidi" wrote: > ... Heres what I am trying to accomplish: > > 1. Iterate over itemID (unique) in the database using one SQL query. > 2. For every itemID found, run 4 searches on Lucene Index. > 3. doTagSearch(itemID) ; collect score > 4. doTitleSearch(itemID...) ; collect score > 5. doS

java gc with a frequently changing index?

2007-07-25 Thread Tim Sturge
Hi, I am indexing a set of constantly changing documents. The change rate is moderate (about 10 docs/sec over a 10M document collection with a 6G total size) but I want to be right up to date (ideally within a second but within 5 seconds is acceptable) with the index. Right now I have code

Re: Query parsing?

2007-07-25 Thread Daniel Naber
On Wednesday 25 July 2007 00:44, Lindsey Hess wrote: > Now, I do not need Lucene to index anything, but I'm wondering if Lucene > has query parsing classes that will allow me to transform the queries. The Lucene QueryParser class can parse the format described at http://lucene.apache.org/java/d

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Grant Ingersoll
On Jul 25, 2007, at 1:26 PM, Askar Zaidi wrote: Hey guys, One last question and I think I'll have an optimized algorithm. How can I build a query in my program ? This is what I am doing: QueryParser queryParser = new QueryParser("contents", new StandardAnalyzer()); queryParser.setDefaultO

Re: What replaced org.apache.lucene.document.Field.Text?

2007-07-25 Thread Lindsey Hess
Andy, Patrick, Thank you. I replaced Field.Text with new Field("name", "value", Field.Store.YES, Field.Index.TOKENIZED); and it works just fine. Cheers, Lindsey Patrick Kimber <[EMAIL PROTECTED]> wrote: Hi Andy I think: Field.Text("name", "value"); has been replaced

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Hey guys, One last question and I think I'll have an optimized algorithm. How can I build a query in my program ? This is what I am doing: QueryParser queryParser = new QueryParser("contents", new StandardAnalyzer()); queryParser.setDefaultOperator(QueryParser.Operator.AND); Query q = query

Re: Search for null

2007-07-25 Thread daniel rosher
In this case you should look at the source for RangeFilter.java. Using this you could create your own filter using TermEnum and TermDocs to find all documents that had some value for the field. You would then flip this filter (perhaps write a FlipFilter.java, that takes an existing filter in it
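In the Lucene 2.x era a Filter produced a java.util.BitSet over docIDs, so the suggested "flip" amounts to inverting those bits up to maxDoc(). A minimal sketch of just that step, using a plain BitSet with a hypothetical maxDoc (not a full Lucene FlipFilter implementation):

```java
import java.util.BitSet;

public class FlipFilterSketch {
    // Invert a filter's bit set: docs that had some value for the field
    // become excluded, and docs with no value become the match set.
    static BitSet flip(BitSet bits, int maxDoc) {
        BitSet flipped = (BitSet) bits.clone();
        flipped.flip(0, maxDoc); // invert every docID in [0, maxDoc)
        return flipped;
    }

    public static void main(String[] args) {
        int maxDoc = 5; // hypothetical index size
        BitSet hasValue = new BitSet(maxDoc);
        hasValue.set(1);
        hasValue.set(3); // docs 1 and 3 have a value for the field
        BitSet noValue = flip(hasValue, maxDoc);
        System.out.println(noValue); // {0, 2, 4}
    }
}
```

Wrapped in a Filter subclass whose bits() method applies this to an inner filter, this gives the "search for null" behavior the thread is after.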

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Grant Ingersoll
Yes, you can do that. On Jul 25, 2007, at 12:31 PM, Askar Zaidi wrote: Heres what I mean: http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields title:"The Right Way" AND text:go Although, I am not searching for the title "the right way" , I am looking for the score by specify

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Heres what I mean: http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields title:"The Right Way" AND text:go Although, I am not searching for the title "the right way" , I am looking for the score by specifying a unique field (itemID). when I do System.out.println(query); I get: +co

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Instead of refactoring the code, would there be a way to just modify the query in each search routine? Such as "search contents: and item:"? This means it would just collect the score of that one document whose itemID field = itemID passed from while(rs.next()). I just need to collect the score
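The per-item query described here (the keyword clause plus a mandatory itemID clause) is just string assembly before it reaches QueryParser. A sketch of that assembly; the field names come from the thread, but the helper itself is hypothetical:

```java
public class ItemQuerySketch {
    // Combine the keyword clause with a required itemID clause so the
    // search scores exactly one document per query.
    static String itemQuery(String keywords, int itemID) {
        return "+contents:(" + keywords + ") +item:" + itemID;
    }

    public static void main(String[] args) {
        System.out.println(itemQuery("lucene tuning", 42));
        // +contents:(lucene tuning) +item:42
    }
}
```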

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Grant Ingersoll
So, you really want a single Lucene score (based on the scores of your 4 fields) for every itemID, correct? And this score consists of scoring the title, tag, summary and body against some keywords correct? Here's what I would do: while (rs.next()) { doc = getDocument(itemId); // Get y
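The accumulation Grant sketches (one running total per itemID, summed from the four field scores) comes down to a map of running totals. A plain-Java sketch, with hypothetical scores standing in for the title/tag/summary/body searches:

```java
import java.util.HashMap;
import java.util.Map;

public class ScoreAccumulator {
    private final Map<Integer, Float> totals = new HashMap<>();

    // Add one field's score (title, tag, summary or body) to the item's total.
    void add(int itemID, float fieldScore) {
        totals.merge(itemID, fieldScore, Float::sum);
    }

    float total(int itemID) {
        return totals.getOrDefault(itemID, 0f);
    }

    public static void main(String[] args) {
        ScoreAccumulator acc = new ScoreAccumulator();
        // hypothetical scores for itemID 7 from the four searches
        acc.add(7, 0.5f);   // title
        acc.add(7, 0.25f);  // tag
        acc.add(7, 0.125f); // summary
        acc.add(7, 0.375f); // body
        System.out.println(acc.total(7)); // 1.25
    }
}
```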

Re: Search for null

2007-07-25 Thread Jay Yu
What if I do not know all possible values of that field, which is the typical case in free-text search? daniel rosher wrote: You will be unable to search for fields that do not exist which is what you originally wanted to do, instead you can do something like: -Establish the query that will sel

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Hi Grant, Thanks for the response. Heres what I am trying to accomplish: 1. Iterate over itemID (unique) in the database using one SQL query. 2. For every itemID found, run 4 searches on Lucene Index. 3. doTagSearch(itemID) ; collect score 4. doTitleSearch(itemID...) ; collect score 5. doSumm

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Grant Ingersoll
Hi Askar, I suggest we take a step back, and ask the question, what are you trying to accomplish? That is, what is your application trying to do? Forget the code, etc. just explain what you want the end result to be and we can work from there. Based on what you have described, I am no

Re: Fine Tuning Lucene implementation

2007-07-25 Thread Askar Zaidi
Hey Guys, I need to know how I can use the HitCollector class? I am using Hits and looping over all the possible document hits (turns out I am looping 92 times; for 300 searches, that's 300*92!!). Can I avoid this using HitCollector? I can't seem to understand how it's used. thanks a lot, Ask
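HitCollector avoids the loop over Hits by being a callback the searcher invokes once per matching document with (docID, score). The shape of that pattern, sketched in plain Java without the Lucene classes (the interface mirrors HitCollector.collect(int, float); the searcher loop here is a stand-in):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HitCollectorSketch {
    // Mirrors the shape of Lucene's HitCollector.collect(int doc, float score).
    interface Collector {
        void collect(int doc, float score);
    }

    // Stand-in for Searcher.search(query, collector): one callback per match.
    static void search(Map<Integer, Float> matches, Collector c) {
        for (Map.Entry<Integer, Float> m : matches.entrySet()) {
            c.collect(m.getKey(), m.getValue());
        }
    }

    public static void main(String[] args) {
        Map<Integer, Float> matches = new LinkedHashMap<>();
        matches.put(3, 0.5f);  // hypothetical matching docs and scores
        matches.put(8, 0.25f);
        float[] total = {0f};
        // No Hits object and no post-hoc loop: scores arrive via the callback.
        search(matches, (doc, score) -> total[0] += score);
        System.out.println(total[0]); // 0.75
    }
}
```

With the real API you would pass your collector to Searcher.search and accumulate whatever you need inside collect, instead of iterating 92 Hits per query afterwards.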

Lucene Highlighter linkage Error

2007-07-25 Thread ki
Hello! I am working with Tomcat. I have put the Lucene highlighter.jar in the lib folder. And I have created an extra CSS, where I say that the background color has to be yellow. The search word now has to be highlighted. I have got a dataTable in which the result of the following Lucene method i

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Grant Ingersoll
On Jul 25, 2007, at 7:19 AM, Stanislaw Osinski wrote: Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar as it is already about as simple as it gets. JavaCC is slow indeed. We used it for

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski
I am sure a faster StandardAnalyzer would be greatly appreciated. I'm increasing the priority of that task then :) StandardAnalyzer appears widely used and horrendously slow. Even better would be a StandardAnalyzer that could have different recognizers enabled/disabled. For example, dropping

Re: Which field matched ?

2007-07-25 Thread makkhar
Currently, we use regular expression pattern matching to get hold of which field matched. Again, a pathetic solution, since we have to agree upon the common subset of Lucene search and pattern matching. We cannot use Boolean queries etc. in this case. makkhar wrote: > > This problem has been baffli
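The regex workaround makkhar describes (testing each stored paramName/paramValue pair against the search term to recover which field matched) can be sketched with java.util.regex; the field names and values here are hypothetical:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class FieldMatchSketch {
    // Return the names of fields whose stored value contains the term,
    // the way the regex workaround recovers "which field matched".
    static List<String> matchingFields(Map<String, String> fields, String term) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> f : fields.entrySet()) {
            if (p.matcher(f.getValue()).find()) {
                hits.add(f.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("param1", "quarterly report");
        doc.put("param2", "annual summary");
        System.out.println(matchingFields(doc, "report")); // [param1]
    }
}
```

As the thread notes, this only works for the subset of queries a regex can mimic (simple terms); Boolean queries, analysis, and stemming fall outside it.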

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Mark Miller
I would be very interested. I have been playing around with Antlr to see if it is any faster than JavaCC, but haven't seen great gains in my simple tests. I had not considered trying JFlex. I am sure a faster StandardAnalyzer would be greatly appreciated. StandardAnalyzer appears widely used a

Which field matched ?

2007-07-25 Thread makkhar
This problem has been baffling me for quite some time now and has no perfect solution in the forum! I have 10 documents, each with 10 fields with "parameterName and parameterValue". Now, when I search for some term and I get 5 hits, how do I find out which paramName-Value pair matched? I am

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski
Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar as it is already about as simple as it gets. JavaCC is slow indeed. We used it for a while for Carrot2, but then (3 years ago :) switched to JF

Re: Recovering from a Crash

2007-07-25 Thread Michael McCandless
"Simon Wistow" <[EMAIL PROTECTED]> wrote: > On Wed, Jul 25, 2007 at 05:49:41AM -0400, Michael McCandless said: > > Ahhh, OK. But do you have a segments_N file? > > Yup. OK, though I still don't understand why the existence of "write.lock" caused you to lose most of your index on creating a new

Re: Recovering from a Crash

2007-07-25 Thread Simon Wistow
On Wed, Jul 25, 2007 at 05:49:41AM -0400, Michael McCandless said: > Ahhh, OK. But do you have a segments_N file? Yup. > Yes, this is perfect. This is the "simple" option I described. The > more complex option is to use a custom deletion policy which enables > you to safely do backups (even i

Re: Recovering from a Crash

2007-07-25 Thread Michael McCandless
> > The data appears to be there - please tell me that I'm doing something > > stupid and I can recover from this. > > It appears by deleting the write.lock files everything has recovered. Hmmm -- it's odd that the existence of the write.lock caused you to lose most of your index. All that should

Re: Recovering from a Crash

2007-07-25 Thread Michael McCandless
"Simon Wistow" <[EMAIL PROTECTED]> wrote: > On Wed, Jul 25, 2007 at 05:19:31AM -0400, Michael McCandless said: > > It's somewhat spooky that you have a write.lock present because that > > means you backed up while a writer was actively writing to the index > > which is a bit dangerous because if th

Re: Recovering from a Crash

2007-07-25 Thread Simon Wistow
On Wed, Jul 25, 2007 at 05:19:31AM -0400, Michael McCandless said: > It's somewhat spooky that you have a write.lock present because that > means you backed up while a writer was actively writing to the index > which is a bit dangerous because if the timing is unlucky (backup does > an "ls" but bef

Re: Lucene and Eastern languages (Japanese, Korean and Chinese)

2007-07-25 Thread Maximilian Hütter
Mathieu Lecarme wrote: > On Tuesday 24 July 2007 at 13:01 -0700, Shaw, James wrote: >> Hi, guys, >> I found Analyzers for Japanese, Korean and Chinese, but not stemmers; >> the Snowball stemmers only include European languages. Does stemming >> not make sense for ideograph-based languages (i.

Re: Recovering from a Crash

2007-07-25 Thread Simon Wistow
On Wed, Jul 25, 2007 at 10:08:56AM +0100, me said: > The data appears to be there - please tell me that I'm doing something > stupid and I can recover from this. It appears by deleting the write.lock files everything has recovered. Is this best practice? Have I just done something so terribly wr

Re: Recovering from a Crash

2007-07-25 Thread Michael McCandless
"Simon Wistow" <[EMAIL PROTECTED]> wrote: > We were affected by the great SF outage yesterday and apparently the > indexing machine crashed without being shutdown properly. Eek, sorry! We are so reliant on electricity these days > I've taken a backup of the indexes which has the usual smat

Recovering from a Crash

2007-07-25 Thread Simon Wistow
We were affected by the great SF outage yesterday and apparently the indexing machine crashed without being shutdown properly. I've taken a backup of the indexes which has the usual smattering of write.lock segments.gen, .cfs, .fdt, .fnm and .fdx etc files and looks to be about the right size.

Re: Search for null

2007-07-25 Thread daniel rosher
You will be unable to search for fields that do not exist which is what you originally wanted to do, instead you can do something like: -Establish the query that will select all non-null values TermQuery tq1 = new TermQuery(new Term("field","value1")); TermQuery tq2 = new TermQuery(new Term("fiel

Re: What replaced org.apache.lucene.document.Field.Text?

2007-07-25 Thread Patrick Kimber
Hi Andy I think: Field.Text("name", "value"); has been replaced with: new Field("name", "value", Field.Store.YES, Field.Index.TOKENIZED); Patrick On 25/07/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: Please reference How do I get code written for Lucene 1.4.x to work with Lucene 2.x? http

Re: Lucene and Eastern languages (Japanese, Korean and Chinese)

2007-07-25 Thread Mathieu Lecarme
On Tuesday 24 July 2007 at 13:01 -0700, Shaw, James wrote: > Hi, guys, > I found Analyzers for Japanese, Korean and Chinese, but not stemmers; > the Snowball stemmers only include European languages. Does stemming > not make sense for ideograph-based languages (i.e., no stemming is > needed for