Re: Indexing in pieces?

2007-08-31 Thread Chris Lu
Generally you are right, except that you do not exactly copy the old index: first index the new content in a new directory, then merge the old index with the new one. There is a quicker merge method, indexWriter.addIndexesNoOptimize(), but in general merging indexes is slow, although quicker than re-indexing.
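To make the "index new content separately, then merge" idea concrete, here is a toy sketch in plain Java. It does not use Lucene: the class name `ToyIndexMerge`, the map-based "index", and the renumbering by `oldDocCount` are all assumptions made for illustration, and a real merge with IndexWriter works on index directories rather than in-memory maps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: an "index" is a map from term to the list of document
// numbers containing it. This is not Lucene's file format or API.
class ToyIndexMerge {

    /**
     * Returns a new index holding the postings of oldIndex followed by the
     * postings of newIndex. Document numbers from newIndex are shifted by
     * oldDocCount so the two ranges do not collide.
     */
    static Map<String, List<Integer>> merge(Map<String, List<Integer>> oldIndex,
                                            int oldDocCount,
                                            Map<String, List<Integer>> newIndex) {
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : oldIndex.entrySet()) {
            merged.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        for (Map.Entry<String, List<Integer>> e : newIndex.entrySet()) {
            List<Integer> postings =
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
            for (int docId : e.getValue()) {
                postings.add(docId + oldDocCount);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> oldIndex = new TreeMap<>();
        oldIndex.put("lucene", Arrays.asList(0, 2));
        oldIndex.put("merge", Arrays.asList(1));

        Map<String, List<Integer>> newIndex = new TreeMap<>();
        newIndex.put("lucene", Arrays.asList(0));
        newIndex.put("reopen", Arrays.asList(1));

        // The old index holds 3 documents (numbers 0..2).
        System.out.println(merge(oldIndex, 3, newIndex));
    }
}
```

Even in this toy form the merge has to visit every posting of both inputs, which is consistent with the remark that merging is slow but still cheaper than re-reading and re-analyzing all of the original content.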

Re: Indexing in pieces?

2007-08-31 Thread Berlin Brown
So I am assuming that it is not just a matter of "indexing" to the same directory that you "indexed" to before. So, based on what you are saying, you would have to reload the previous index (e.g., INDEX_DIR_OLD) and then index the new content. When I say "index", I am talking about actually invoking lucen

Is there a Term ID for each distinctive term indexed in Lucene?

2007-08-31 Thread Tao Cheng
Hi all, I found that instead of storing a term ID for a term in the index, Lucene stores the actual term string value. I am wondering whether there is ever such a "term ID" for each distinct term indexed in Lucene, similar to the "doc ID" for each distinct document indexed in Lucene. In other words

Re: Indexing in pieces?

2007-08-31 Thread Chris Lu
I think you can simply change your SQL to select only the recently updated messages, and add them to your existing index. Although adding to an existing large index also takes a long time, it should be quicker than re-building the whole index. If your index continues to grow, you may need to have a dedi
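The "select only the recently updated messages" step can be sketched in plain Java as a filter on a last-indexed timestamp. Everything named here is hypothetical: the `Message` class, its `updatedAtMillis` field, and the idea of remembering a `lastIndexedMillis` value between runs are assumptions for illustration, not part of the poster's schema. In SQL the same filter would be a WHERE clause on whatever update-time column the forum table has.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a forum message row; field names are assumptions.
class IncrementalSelect {

    static final class Message {
        final int id;
        final long updatedAtMillis;
        final String text;

        Message(int id, long updatedAtMillis, String text) {
            this.id = id;
            this.updatedAtMillis = updatedAtMillis;
            this.text = text;
        }
    }

    /** Returns only the messages changed strictly after the last indexing run. */
    static List<Message> updatedSince(List<Message> all, long lastIndexedMillis) {
        List<Message> changed = new ArrayList<>();
        for (Message m : all) {
            if (m.updatedAtMillis > lastIndexedMillis) {
                changed.add(m);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Message> all = new ArrayList<>();
        all.add(new Message(1, 1_000L, "old post"));
        all.add(new Message(2, 2_000L, "edited post"));
        all.add(new Message(3, 3_000L, "new post"));

        // Only messages 2 and 3 changed after the previous run at t=1500.
        for (Message m : updatedSince(all, 1_500L)) {
            System.out.println(m.id + ": " + m.text);
        }
    }
}
```

Only the returned messages would then be added to the existing index; the timestamp of the current run is stored for next time.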

Indexing in pieces?

2007-08-31 Thread bbrown
I have been fine with indexing my database (discussion forum) into Lucene. I am taking the simplest approach, e.g.: I have a discussion forum whose content is just text messages; I take those out of the database and then index the content. I am having trouble because I have hundreds of thousands of messages and i

Re: Query Analyzer Issue

2007-08-31 Thread sandeep chawla
Which analyzer are you using in your query parser? Can you share the one line of code in which you construct the QueryParser object? As you might be parsing a query string made of different fields, I suggest you use a PerFieldAnalyzerWrapper, which lets you do unique analysis for different fie
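The idea behind per-field analysis is a lookup from field name to analyzer with a fallback default. The sketch below shows only that dispatch in plain Java; it is not Lucene's PerFieldAnalyzerWrapper. The class `PerFieldAnalysis`, the two sample analyzers, and the field names "title" and "id" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Plain-Java stand-in: an "analyzer" is just a function from text to tokens.
class PerFieldAnalysis {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldAnalysis(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    /** Registers a field-specific analyzer that overrides the default. */
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    /** Analyzes text with the field's analyzer, or the default if none is set. */
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    /** Sample analyzer: lowercases and splits on whitespace. */
    static List<String> lowercaseWords(String text) {
        List<String> tokens = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return tokens;
        }
        for (String word : trimmed.toLowerCase(Locale.ROOT).split("\\s+")) {
            tokens.add(word);
        }
        return tokens;
    }

    /** Sample analyzer: keeps the whole value as a single token. */
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        PerFieldAnalysis analysis = new PerFieldAnalysis(PerFieldAnalysis::lowercaseWords);
        analysis.addAnalyzer("id", PerFieldAnalysis::keyword);

        System.out.println(analysis.analyze("title", "Hello World"));
        System.out.println(analysis.analyze("id", "AB-12 x"));
    }
}
```

The point of the design is that the same wrapper is used at index time and at query time, so each field's query terms are tokenized the same way as that field's indexed text.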

Re: Query Analyzer Issue

2007-08-31 Thread Kalvir Sandhu
You should test your Analyzer to confirm what tokens are being produced. You can do this by using a helper class. To save time, there is one written in the Lucene in Action book called AnalyzerUtils; you should be able to get it out of the download of source code from the book here: http://www.lucene
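As an illustration of what "confirming the tokens" can reveal, the plain-Java sketch below prints the tokens produced by two simple rules: one that keeps only letters and one that keeps letters and digits. It does not use Lucene or the book's AnalyzerUtils; the class `TokenInspector`, the `keepDigits` flag, and the sample text are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in tokenizer for inspecting output; not a Lucene Analyzer.
class TokenInspector {

    /**
     * Splits text into lowercase tokens. Letters always belong to a token;
     * digits belong to a token only when keepDigits is true. Any other
     * character ends the current token.
     */
    static List<String> tokenize(String text, boolean keepDigits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep = Character.isLetter(c) || (keepDigits && Character.isDigit(c));
            if (keep) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "3M Company 24x7";
        System.out.println("letters only:       " + tokenize(text, false));
        System.out.println("letters and digits: " + tokenize(text, true));
    }
}
```

Printing both token lists side by side makes a mismatch between index-time and query-time analysis easy to spot, for example when digits survive in one and are dropped in the other.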

Re: Weighting Issue

2007-08-31 Thread Chris Hostetter
Have you tried giving the name field a boost? E.g. name:(John Smith)^10 alias:(John Smith) I'm also guessing you'd be much happier with a sloppy phrase query than with the boolean queries you are currently using: name:"John Smith"~3^10 alias:"John Smith"~3 -Hoss
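The two query shapes suggested above can be assembled from user input with simple string building. This is a minimal sketch: the class `BoostedQueries` and its method names are hypothetical, the field names "name" and "alias" come from the thread, and no escaping of query-syntax special characters is attempted.

```java
// Builds query strings in the syntax quoted in the thread; it does not parse
// or run them, and it does not escape special characters in the input.
class BoostedQueries {

    /** Boolean form: name:(terms)^boost alias:(terms) */
    static String booleanQuery(String terms, int nameBoost) {
        return "name:(" + terms + ")^" + nameBoost + " alias:(" + terms + ")";
    }

    /** Sloppy phrase form: name:"phrase"~slop^boost alias:"phrase"~slop */
    static String sloppyPhraseQuery(String phrase, int slop, int nameBoost) {
        return "name:\"" + phrase + "\"~" + slop + "^" + nameBoost
            + " alias:\"" + phrase + "\"~" + slop;
    }

    public static void main(String[] args) {
        System.out.println(booleanQuery("John Smith", 10));
        System.out.println(sloppyPhraseQuery("John Smith", 3, 10));
    }
}
```

The resulting strings would then be handed to a query parser; the slop value 3 and boost 10 are simply the numbers used in the reply.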

Re: Weighting Issue

2007-08-31 Thread Kalvir Sandhu
Thanks for the reply. I have tried boosting, but not the way you stated. I tried to boost the alias field so that it would score as high as a match on the name field, but it didn't increase enough. Like this: name:(John Smith) alias:(John Smith)^10 I think it has something to do with the fact that

Query Analyzer Issue

2007-08-31 Thread Harini Raghavan
Hi Everyone, I am facing some strange behaviour with Analyzers. I am using SimpleAnalyzer for some fields in my Compass entity, but I also wrote a custom Analyzer that is slightly different from the SimpleAnalyzer, as I wanted to allow both letters and digits in the company name column. So custom analy

Re: Weighting Issue

2007-08-31 Thread Michael Stoppelman
Kalvir, Have you tried giving the name field a boost? E.g. name:(John Smith)^10 alias:(John Smith) -M On 8/31/07, Kalvir Sandhu <[EMAIL PROTECTED]> wrote: > Hi all. > I am working on building a lucene index to search names of people. I want to be able to score things differently. Here i

Weighting Issue

2007-08-31 Thread Kalvir Sandhu
Hi all. I am working on building a Lucene index to search names of people. I want to be able to score things differently. Here is an example of the behaviour I need. Doc 1 with aliases name: Bob Jones alias: John Smith Andrew Jones Doc 2 without aliases name: John Andrew Smith alias: none When

Re: Lucene indexing for pdf files

2007-08-31 Thread Steven Rowe
Hi Madhu, Madhu wrote: > I am indexing PDF documents using pdfbox 7.4; it works fine for some PDF files. For Japanese PDF files it gives the exception below. > caught a class java.io.IOException with message: Unknown encoding for 'UniJIS-UCS2-H' > Can anyone help me with how to set th

OutOfMemoryError tokenizing a boring text file

2007-08-31 Thread Per Lindberg
I'm creating a tokenized "content" Field from a plain text file using an InputStreamReader and new Field("content", in); The text file is large, 20 MB, and contains zillions of lines, each with the same 100-character token. That causes an OutOfMemoryError. Given that all tokens are the *same*,

Re: How to speed-up index opening

2007-08-31 Thread Antoine Baudoux
Great! Thanks! -- Antoine Baudoux Development Manager [EMAIL PROTECTED] Tel.: +32 2 333 58 44 GSM: +32 499 534 538 Fax: +32 2 648 16 53 On 31 Aug 2007 at 09:45, Michael Busch wrote: Antoine Baudoux wrote: From what I have seen in the patch, it re-opens the segments th

Re: How to speed-up index opening

2007-08-31 Thread Michael Busch
Antoine Baudoux wrote: > From what I have seen in the patch, it re-opens the segments that have changed. > So imagine I always change the biggest segment (because that's where most docs are and I need to update them frequently). Will there still be a benefit of IndexReader.reopen()?

Re: How to speed-up index opening

2007-08-31 Thread Antoine Baudoux
From what I have seen in the patch, it re-opens the segments that have changed. So imagine I always change the biggest segment (because that's where most docs are and I need to update them frequently). Will there still be a benefit of IndexReader.reopen()? -- Antoine Baudoux Development