Re: Indexing HTML pages and phrases

2007-03-14 Thread Bhavin Pandya
Hi Maryam, You can index the content of a specific field as UN_TOKENIZED and then do a phrase search on that field. It will search only for phrases, not tokens. To index HTML pages you can use any HTML parser. This may be useful to you: http://lucene.apache.org/java/docs/api/org/apache
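
To make the tokenized/untokenized distinction in the reply above concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class `FieldIndexSketch` and its methods are hypothetical names invented for illustration, and the toy "index" is only a map from term to document ids. It shows the general idea that an untokenized field is stored as one single term, so a lookup matches the whole value and nothing else.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index, NOT a Lucene API: maps each term to the ids of documents containing it.
class FieldIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenized: the value is split on whitespace and each word becomes its own term.
    void addTokenized(int docId, String value) {
        for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Untokenized: the entire value is stored as one single term.
    void addUntokenized(int docId, String value) {
        postings.computeIfAbsent(value, k -> new TreeSet<>()).add(docId);
    }

    // Exact term lookup; returns an empty set when the term is unknown.
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        FieldIndexSketch tokenized = new FieldIndexSketch();
        tokenized.addTokenized(1, "goat cheese pizza");
        System.out.println(tokenized.lookup("cheese"));              // [1]
        System.out.println(tokenized.lookup("goat cheese pizza"));   // []

        FieldIndexSketch untokenized = new FieldIndexSketch();
        untokenized.addUntokenized(1, "goat cheese pizza");
        System.out.println(untokenized.lookup("cheese"));            // []
        System.out.println(untokenized.lookup("goat cheese pizza")); // [1]
    }
}
```

In this toy model the untokenized field behaves like an exact-match key: a query for only part of the value finds nothing.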

Re: Indexing HTML pages and phrases

2007-03-14 Thread Bhavin Pandya
- Original Message - From: "Maryam" <[EMAIL PROTECTED]> To: Sent: Thursday, March 15, 2007 7:55 AM Subject: Indexing HTML pages and phrases Hi, I am wondering if we can index a phrase (not term) in Lucene? Also, I am not sure if it can index HTML pages? I need to have access to the

Re: Performance between Filter and HitCollector?

2007-03-14 Thread karl wettin
On 15 Mar 2007, at 04:09, Otis Gospodnetic wrote: eks dev and others - have you tried using the code from LUCENE-584? Noticed any performance increase when you disabled scoring? I'd like to look at that patch soon and commit it if everything is in place and makes sense, so I'm curious if you

Re: SpellChecker and Lucene 2.1

2007-03-14 Thread karl wettin
On 14 Mar 2007, at 21:47, Ryan O'Hara wrote: Is there a SpellChecker.jar compatible with Lucene 2.1? After updating to Lucene 2.1, I seem to have lost the ability to create a spell index using spellchecker-2.0-rc1-dev.jar. Any help would be greatly appreciated. Can you explain the problem

Re: Performance between Filter and HitCollector?

2007-03-14 Thread Antony Bowesman
Thanks for the detailed response Hoss. That's the sort of in-depth golden nugget I'd like to see in a copy of LIA 2 when it becomes available... I've wanted to use Filter to cache certain of my Term Queries, as it looked faster for straight Term Query searches, but Solr's DocSet interface abstr

Re: Performance between Filter and HitCollector?

2007-03-14 Thread Otis Gospodnetic
eks dev and others - have you tried using the code from LUCENE-584? Noticed any performance increase when you disabled scoring? I'd like to look at that patch soon and commit it if everything is in place and makes sense, so I'm curious if you or anyone else already tried this patch... Thanks,

Is Lucene Java trunk still stable for production code?

2007-03-14 Thread Jean-Philippe Robichaud
Hello Dear Lucene Users! Back in the old days (well, last year) the lucene/java/trunk subversion path was always stable enough for everyone to use in production code. Now, with the 2.0/2.1/2.2 branches, is it still the case? In December, I 'ported' my app to use the lucene 2.0 release.

Indexing HTML pages and phrases

2007-03-14 Thread Maryam
Hi, I am wondering if we can index a phrase (not term) in Lucene? Also, I am not sure if it can index HTML pages? I need to have access to the text of some of the tags, and I am not sure if this can be done in Lucene. I would be so glad if you could help me with this. Thanks

Re: how to get approximate total matching

2007-03-14 Thread Xiaocheng Luan
If I remember correctly, I once searched over 40G of indexes using multi-searcher with 512M max heap size, how much memory did you give the JVM? Thanks, Xiaocheng senthil kumaran <[EMAIL PROTECTED]> wrote: Hi. I have more index directories (>6) all in GB, and searching my query with single Ind

Re: Fast index traversal and update for stored field?

2007-03-14 Thread Thomas K. Burkholder
Hey, thanks for the quick reply. I've considered using a secondary index just for this data but thought I would look at storing the data in lucene first, since ultimately this data gets transported to an outside system, and it's a lot easier if there's only one "thing" to transfer. The d

Re: Fast index traversal and update for stored field?

2007-03-14 Thread Erick Erickson
If you search the mail archive for "update in place" (no quotes), you'll find extensive discussions of this idea. Although you're raising an interesting variant because you're talking about a non- indexed field, so now I'm not sure those discussions are relevant. I don't know of anyone who has do

Fast index traversal and update for stored field?

2007-03-14 Thread Thomas K. Burkholder
Hi there, I'm using lucene to index and store entries from a database table for ultimate retrieval as search results. This works fine. But I find myself in the position of wanting to occasionally (daily-ish) bulk- update a single, stored, non-indexed field in every document in the index,

Re: [Urgent] deleteDocuments fails after merging ...

2007-03-14 Thread Chris Hostetter
: > the only real reason you should really need 2 searchers at a time is if : > you are searching other queries in parallel threads at the same time ... : > or if you are warming up one new searcher that's "ondeck" while still : > serving queries with an older searcher. : : Hoss, I hope I misunder

SpellChecker and Lucene 2.1

2007-03-14 Thread Ryan O'Hara
Is there a SpellChecker.jar compatible with Lucene 2.1? After updating to Lucene 2.1, I seem to have lost the ability to create a spell index using spellchecker-2.0-rc1-dev.jar. Any help would be greatly appreciated. Thanks, Ryan

Re: [Urgent] deleteDocuments fails after merging ...

2007-03-14 Thread Antony Bowesman
Chris Hostetter wrote: the only real reason you should really need 2 searchers at a time is if you are searching other queries in parallel threads at the same time ... or if you are warming up one new searcher that's "ondeck" while still serving queries with an older searcher. Hoss, I hope I mi

Re: Performance between Filter and HitCollector?

2007-03-14 Thread eks dev
just to complete this fine answer, there is also Matcher patch (https://issues.apache.org/jira/browse/LUCENE-584) that could bring the best of both worlds via e.g. ConstantScoringQuery or another abstraction that enables disabling Scoring (where appropriate) - Original Message From: Ch

Re: Search vs. Rank

2007-03-14 Thread Chris Hostetter
: I'm thinking something like +pizza^0 garlic^1 "goat cheese"^-1 that does in fact work. : 2) Once I have this list of results, can I change their rank order without : having to do a full scale search again? the frequency of "pizza" won't affect the score at all, so you should need to do much

Search vs. Rank

2007-03-14 Thread Walt Stoneburner
Most search engine technologies return result sets based on some weighted frequency of the search terms found. I've got a new problem: I want to rank by different criteria than I searched for. For example, I might want to return as my result set all documents that contain the word pizza, but rank t

Re: Performance between Filter and HitCollector?

2007-03-14 Thread Chris Hostetter
it's kind of an Apples/Oranges comparison .. in the examples you gave below, one is executing an arbitrary query (which could be anything) the other is doing a simple TermEnumeration. Assuming that Query is a TermQuery, the Filter is theoretically going to be faster because it doesn't have to comput

RE: [Urgent] deleteDocuments fails after merging ...

2007-03-14 Thread Chris Hostetter
: I just have two IndexSearchers opened now most of the time, which is : deprecated, : But I think that's my only choice ! 2 searchers is fine ... it's "N" where N is not bound that you want to avoid. from what i understand of your requirements, you don't *really* need two searchers open ... ope

Re: ways to minimize index size?

2007-03-14 Thread Erick Erickson
OK, I caused more confusion than rendered help by my stemming statement. The only reason I mentioned it was to illustrate that performance is not linearly related to size. It took some effort to put stemming into the index, see PorterStemmer etc. This is NOT the default. So I took it out to see w

Re: Wildcard searches with * or ? as the first character - Thanks

2007-03-14 Thread Oystein Reigem
Thanks Steven and Antony. I read the FAQ not very long ago, but that slipped my attention. Or perhaps it's a recent change. - Øystein - -- Øystein Reigem, The department of culture, language and information technology (Aksis), Allegt 27, N-5007 Bergen, Norway. Tel: +47 55 58 32 42. Fax: +47

Re: memory consumption on large indices

2007-03-14 Thread Tim Patton
I'm searching a 20GB index and my searching JVM is allocated 1Gig. However, my indexing app only had 384mb available to it, which means you can get away with far less. I believe certain index tables will need to be swapped in and out of memory though so it may not search as quickly. With a 1.

Re: Can we extract phrase from lucene index

2007-03-14 Thread karl wettin
On 14 Mar 2007, at 14:51, Bhavin Pandya wrote: what i am looking for is a dictionary for the spell checker. I am trying to customise the lucene spell checker for phrases. so thinking if anyhow i am able to fetch phrases from the index itself then i can train my spellchecker. I tried with query logs but

Re: memory consumption on large indices

2007-03-14 Thread Ian Lea
When your app gets a java.lang.OutOfMemoryError. -- Ian. On 3/14/07, Dennis Berger <[EMAIL PROTECTED]> wrote: Ian Lea wrote: > No, you don't need 1.8Gb of memory. Start with default and raise if > you need to? how do I know when I need it? > Or jump straight in at about 512Mb. > > > -
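
Besides waiting for an out-of-memory error, a running application can ask the JVM how much heap it may use and how much is currently in use. The following is a minimal sketch using the standard `java.lang.Runtime` API; the class name `HeapCheck` and its helper methods are made up for illustration. The 512 MB starting point suggested in this thread would typically be set with the `-Xmx512m` option when launching the JVM.

```java
// Illustrative helper (hypothetical name) for inspecting JVM heap limits at runtime.
class HeapCheck {
    private static final long MB = 1024L * 1024L;

    // Maximum heap the JVM will attempt to use (governed by -Xmx), in megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    // Heap currently occupied by live and not-yet-collected objects, in megabytes.
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    public static void main(String[] args) {
        System.out.println("max heap (MB):  " + maxHeapMb());
        System.out.println("used heap (MB): " + usedHeapMb());
    }
}
```

Logging these two numbers while searching gives a rough picture of how much headroom is left, which can help decide whether the heap setting needs raising.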

Re: memory consumption on large indices

2007-03-14 Thread Dennis Berger
Ian Lea wrote: No, you don't need 1.8Gb of memory. Start with default and raise if you need to? how do I know when I need it? Or jump straight in at about 512Mb. -- Ian. On 3/14/07, Dennis Berger <[EMAIL PROTECTED]> wrote: Do I have to keep something in mind to do searching on large ind

Re: ways to minimize index size?

2007-03-14 Thread jm
hi Erick, Well, typically my application will start with some hundreds of indexes...and then grow at a rate of several per day, forever. At some point I know I can do some merging etc if needed. Size is dependent on the customer, could be up to 1G per index. That is why I would like to minim

Re: memory consumption on large indices

2007-03-14 Thread Ian Lea
No, you don't need 1.8Gb of memory. Start with default and raise if you need to? Or jump straight in at about 512Mb. -- Ian. On 3/14/07, Dennis Berger <[EMAIL PROTECTED]> wrote: Do I have to keep something in mind to do searching on large indices? I actually have an index with a size of 1.8g

memory consumption on large indices

2007-03-14 Thread Dennis Berger
Do I have to keep something in mind to do searching on large indices? I actually have an index with a size of 1.8gb. I have indexed 1.5 million items from Amazon. How much memory do I have to give to the jvm? As a sidenote I have to tell you that I optimized the index so it's one segment file.

Re: Can we extract phrase from lucene index

2007-03-14 Thread Bhavin Pandya
Hi erick, what i am looking for is a dictionary for the spell checker. I am trying to customise the lucene spell checker for phrases. so thinking if anyhow i am able to fetch phrases from the index itself then i can train my spellchecker. I tried with query logs but they have a lot of spelling mistakes... Any

RE: ways to minimize index size?

2007-03-14 Thread Jeff
I found that reducing my index from 8G to 4G (through not stemming) gave me about a 10% performance improvement. How did you do this? I don't see this as an option. Jeff

Re: how to get approximate total matching

2007-03-14 Thread Erick Erickson
How much memory are you allocating for your JVM? Because you're paying a huge search time penalty by opening and closing your searcher sequentially, it would be a good thing to not do this. But, as you say, if you're getting OOM errors, that's a problem. What is the total size of all your indexes

Re: ways to minimize index size?

2007-03-14 Thread Erick Erickson
Store as little as possible, index as little as possible. How big is your index, and how much do you expect it to grow? I ask this because it's probably not worth your time to try to reduce the index size below some threshold... I found that reducing my index from 8G to 4G (through not stemm

how to get approximate total matching

2007-03-14 Thread senthil kumaran
Hi. I have several index directories (>6), all gigabytes in size, and I run my query against all of them one after another with a single IndexSearcher, i.e. I create one IndexSearcher for index1 and search over that. Finally I close that and create a new IndexSearcher for index2 and so on. If i get 200 total results

Re: Can we extract phrase from lucene index

2007-03-14 Thread Erick Erickson
Your problem statement lends itself to flippant answers like "just use a PhraseQuery". So I clearly don't understand what you're trying to accomplish. Are you trying to find all of the occurrences of a particular phrase? All the phrases (however that's defined) for all the documents? What problem

Re: IndexReader.GetTermFreqVectors

2007-03-14 Thread Ian Lea
From http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.TermVector.html: "A term vector is a list of the document's terms and their number of occurences in that document." -- Ian. On 3/14/07, Kainth, Sachin <[EMAIL PROTECTED]> wrote: Yes but what is a term vector? ---
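
The quoted definition can be illustrated with a small sketch in plain Java. This is not how Lucene builds or stores term vectors, and `TermVectorSketch` and `termFrequencies` are hypothetical names; the sketch assumes a naive analyzer that lowercases the text and splits on anything that is not a letter or digit. It only shows the shape of the data: each term in one document mapped to its number of occurrences.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, NOT a Lucene API: term -> occurrence count for one document's text.
class TermVectorSketch {
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new TreeMap<>(); // sorted by term for stable output
        // Naive analysis (an assumption): lowercase, split on runs of non-letter/non-digit characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        // prints {cheese=1, garlic=1, goat=1, pizza=2, with=2}
        System.out.println(termFrequencies("Pizza with garlic, pizza with goat cheese"));
    }
}
```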

RE: IndexReader.GetTermFreqVectors

2007-03-14 Thread Kainth, Sachin
Yes but what is a term vector? -Original Message- From: Grant Ingersoll [mailto:[EMAIL PROTECTED] Sent: 13 March 2007 19:28 To: java-user@lucene.apache.org Subject: Re: IndexReader.GetTermFreqVectors It means it returns the term vectors for all the fields on that document where you have

ways to minimize index size?

2007-03-14 Thread jm
Hi, I want to make my index as small as possible. I noticed about field.setOmitNorms(true), I read in the list the diff is 1 byte per field per doc, not huge but hey...is the only effect the score being different? I hardly mind about the score so that would be ok. And can I add to an index witho

Can we extract phrase from lucene index

2007-03-14 Thread Bhavin Pandya
Hello guys, I am using lucene 1.9 and i have 3GB of index. I know we can extract tokens from the index easily but can we extract phrases? Regards. Bhavin pandya

RE: [Urgent] deleteDocuments fails after merging ...

2007-03-14 Thread DECAFFMEYER MATHIEU
>> Is that the same reader that is used in IndexSearcher? I opened an IndexSearcher on the path (String) to the index. Now I tried to open on the clone IndexReader and use the constructor that has an IndexReader as param, and I got everything working now I just have two IndexSearchers opene