Re: Best practice - preparing search term for Lucene

2022-09-24 Thread Hrvoje Lončar
Oh yes, I also use Spring Cache, which works fine, and I don't have to store products in Lucene, making the index smaller and faster. On Fri, 23 Sept 2022, 19:26 Stephane Passignat, wrote: > Hi > > I wouldn't store the original value. That's "just" an index. But store > the value of your db identifi

Re: Best practice - preparing search term for Lucene

2022-09-24 Thread Hrvoje Lončar
Well, my bad is that I used the wrong word. I'm not storing but just giving keywords to the analyzer. That was my mistake in writing. So far I don't index exotic letters, just normalized ones. Additionally I put in the index something like "Prod_3443", which is a product ID for situations when a specific product is

Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Hrvoje Lončar
Good point! For now I'll leave it normalized. Every search term coming from the frontend is stored and its counter updated, which will help me after some time to see trends and decide whether to change the logic or not. P.S. Here is the funny part: in Croatian "pišanje" means peeing while "pisanje" mea

Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Stephane Passignat
Hi I wouldn't store the original value. That's "just" an index. But store the value of your db identifiers, because I think you'll want it at some point. (I made the same kind of feature on top of DataNucleus.) I used to have tech ids in my db. Even more since I started to use jdo jpa some 20

Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Michael Sokolov
I think it depends how precise you want to make the search. If you want to enable diacritic-sensitive search in order to avoid confusions when users actually are able to enter the diacritics, you can index both ways (ascii-folded and not folded) and not normalize the query terms. Or you can just fo

Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Hrvoje Lončar
Hi Stephane! Actually, I have exactly that kind of conversion, but I didn't mention it as my mail was long enough without it :) My main concern is whether I should let Lucene index the original keywords or not. Considering what you wrote, I guess your answer would be to store only converted values without exot

Re: Best practice - preparing search term for Lucene

2022-09-22 Thread Stephane Passignat
Hello, The way I did it took me some time and I'm almost sure it's applicable to all languages. I normalized the words, replacing letters or groups of letters with another similar-sounding one. In French, e é è ê ai ei all sound a bit the same, and for someone who makes writing mistakes, having to use the right lett

Best practice - preparing search term for Lucene

2022-09-22 Thread Hrvoje Lončar
Hi! I'm using Hibernate Search / Lucene to index my entities in a Spring Boot application. One thing I'm not sure about is how to handle Croatian-specific letters. The Croatian language has a few additional letters "*č* *Č* *ć* *Ć* *đ* *Đ* *š* *Š* *ž* *Ž*". The letters "*đ* *Đ*" are commonly replaced with "*dj* *DJ
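The normalization discussed in this thread (folding č/ć/š/ž to their ASCII base letters and mapping đ/Đ to dj/DJ) can be sketched with the JDK alone. This is an illustrative standalone sketch, not the poster's Hibernate Search configuration; a real setup would more likely do the folding inside the Lucene analyzer chain (e.g. with ASCIIFoldingFilter or a custom char filter). Note that plain Unicode NFD decomposition leaves đ untouched because it has no canonical decomposition, so it is mapped explicitly first:

```java
import java.text.Normalizer;

public class CroatianFolder {
    // Fold Croatian diacritics to ASCII. đ/Đ have no Unicode decomposition,
    // so they are mapped to dj/DJ explicitly before NFD folding.
    public static String fold(String s) {
        String mapped = s.replace("đ", "dj").replace("Đ", "DJ");
        // NFD splits e.g. č into c + combining caron (U+030C) ...
        String decomposed = Normalizer.normalize(mapped, Normalizer.Form.NFD);
        // ... then the combining marks are stripped, leaving plain ASCII.
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("pišanje"));   // pisanje
        System.out.println(fold("đak Đuro"));  // djak DJuro
    }
}
```

If both exact and folded search are wanted (as suggested later in the thread), the same text can be indexed twice, once verbatim and once through this folding.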

Re: best practice for NRT. Is it through ControlledRealTimeReopenThread ?

2014-05-30 Thread Michael McCandless
I put a comment on the StackOverflow question. Mike McCandless http://blog.mikemccandless.com On Fri, May 30, 2014 at 2:51 AM, Gaurav gupta wrote: > Hi, > > I am implementing NRT and looking for best practice to implement it. I > found that 4.4.0 release onwards the Near Real

best practice for NRT. Is it through ControlledRealTimeReopenThread ?

2014-05-29 Thread Gaurav gupta
Hi, I am implementing NRT and looking for the best practice to implement it. I found that from the 4.4.0 release onwards the Near Real Time Manager (org.apache.lucene.search.NRTManager) has been replaced by ControlledRealTimeReopenThread. But as per the Javadoc it appears "experimental". Please advis

Re: Best practice to map Lucene docids to real ids

2014-05-18 Thread Sven Teichmann
Thank you, that helped me a lot. Sven Teichmann __ Software for Intellectual Property GmbH Gewerbering 14a 83607 Holzkirchen (Germany) Phone: +49 (0)8024 46699-00 Fax:+49 (0)8024 46699-02 E-Mail: s.teichm...@s4ip.de Local Court of Munich

Re: Best practice to map Lucene docids to real ids

2014-05-16 Thread Michael McCandless
On Tue, May 13, 2014 at 1:34 AM, Sven Teichmann wrote: > Hi, > > I also found this response very useful and right now I am playing around > with DocValues. > >> If the default DocValuesFormat isn't fast enough, you can always >> switch to e.g. DirectDocValuesFormat (uses lots of RAM but it just an

Re: Best practice to map Lucene docids to real ids

2014-05-12 Thread Sven Teichmann
Hi, I also found this response very useful and right now I am playing around with DocValues. If the default DocValuesFormat isn't fast enough, you can always switch to e.g. DirectDocValuesFormat (uses lots of RAM but it's just an array lookup). How do I switch to DirectDocValuesFormat? And how do

Re: Best practice to map Lucene docids to real ids

2014-05-12 Thread Wouter Heijke
Hey Mike, That was a very useful response, also for long time Lucene users like myself who were stuck in legacy ways of doing things! I managed to easily change indexing of keys to DocValues and found myself wondering why I did not get anything returned, it appears indexing works transparent to an

Re: Best practice to map Lucene docids to real ids

2014-05-06 Thread Michael McCandless
Doc values are far faster than a stored field. If the default DocValuesFormat isn't fast enough, you can always switch to e.g. DirectDocValuesFormat (uses lots of RAM but it's just an array lookup). Mike McCandless http://blog.mikemccandless.com On Tue, May 6, 2014 at 4:33 AM, Sven Teichmann wro
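The "just an array lookup" point can be illustrated without Lucene: once the per-document values are materialized in a dense in-memory array (which is essentially what DirectDocValuesFormat does), resolving a hit's docid to the external database id costs one array read instead of a per-hit stored-field fetch. The class below is a hypothetical stand-in for illustration, not Lucene API:

```java
public class DocIdMapper {
    // Dense mapping from docid (0..maxDoc-1) to database id, the shape
    // an all-in-RAM doc-values representation takes.
    private final long[] docToDbId;

    public DocIdMapper(long[] docToDbId) {
        this.docToDbId = docToDbId;
    }

    public long dbId(int docid) {
        return docToDbId[docid]; // O(1), no I/O per hit
    }

    public static void main(String[] args) {
        DocIdMapper m = new DocIdMapper(new long[] {1001L, 1007L, 1042L});
        System.out.println(m.dbId(2)); // 1042
    }
}
```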

Re: Best practice to map Lucene docids to real ids

2014-05-06 Thread Wouter Heijke
Hi, I would index it in a field; you can use the database id and even add additional information to compose your own key and retrieve that (only) when you collect search results. Wouter > Hi, > > what is the best way to retrieve our "real" ids (as they are in our > database) while searching? > >

Best practice to map Lucene docids to real ids

2014-05-06 Thread Sven Teichmann
Hi, what is the best way to retrieve our "real" ids (as they are in our database) while searching? Right now we generate a file after indexing which contains all Lucene docids and the matching id in our database. Our own Collector converts the docids to our ids while collecting. This works a

Re: best practice for reusing documents with multi-valued fields

2011-04-18 Thread Anshum
; ScoreDoc[] sd = is.search(query, 10).scoreDocs; for(ScoreDoc scoreDoc:sd){ System.out.println(ir.document(scoreDoc.doc)); } is.close(); ir.close(); iw.close(); *--Snip--* -- Anshum Gupta http://ai-cafe.blogspot.com On Fri, Apr 15,

best practice for reusing documents with multi-valued fields

2011-04-14 Thread Christopher Condit
I know that it's best practice to reuse the Document object when indexing, but I'm curious how multi-valued fields affect this. I tried this before indexing each document: doc.removeFields(myMultiValuedField); for (String fieldName : fieldNames) { Field field = doc.getField(fieldName);

Re: Best practice for stemming and exact matching

2011-04-01 Thread Christopher Condit
>> Ideally I'd like to have the parser use the >> custom analyzer for everything unless it's going to parse a clause into >> a PhraseQuery or a MultiPhraseQuery, in which case it uses the >> SimpleAnalyzer and looks in the _exact field - but I can't figure out >> the best way to accomplish this. >

Re: Best practice for stemming and exact matching

2011-03-29 Thread Robert Muir
On Tue, Mar 29, 2011 at 6:56 PM, Christopher Condit wrote: > Ideally I'd like to have the parser use the > custom analyzer for everything unless it's going to parse a clause into > a PhraseQuery or a MultiPhraseQuery, in which case it uses the > SimpleAnalyzer and looks in the _exact field - but

Best practice for stemming and exact matching

2011-03-29 Thread Christopher Condit
MultiPhraseQuery, in which case it uses the SimpleAnalyzer and looks in the _exact field - but I can't figure out the best way to accomplish this. Has anyone else encountered the same problem? Is there a best practice for doing this - or something much

Re: best practice: 1.4 billions documents

2010-11-29 Thread Ian Lea
use IndexSearcher with MultiReader? > > Regards > Ganesh > > - Original Message - > From: "Robert Muir" > To: > Sent: Saturday, November 27, 2010 1:28 AM > Subject: Re: best practice: 1.4 billions documents > > >> On Fri, Nov 26, 2010 at 12:4

Re: best practice: 1.4 billions documents

2010-11-29 Thread Ganesh
- Original Message - From: "Robert Muir" To: Sent: Saturday, November 27, 2010 1:28 AM Subject: Re: best practice: 1.4 billions documents > On Fri, Nov 26, 2010 at 12:49 PM, Uwe Schindler wrote: >> This is the problem for Fuzzy: each searcher expands the fuzzy quer

Re: best practice: 1.4 billions documents

2010-11-26 Thread Robert Muir
On Fri, Nov 26, 2010 at 12:49 PM, Uwe Schindler wrote: > This is the problem for Fuzzy: each searcher expands the fuzzy query to a > different Boolean Query and so the scores are not comparable - MultiSearcher > (but not Solr) tries to combine the resulting rewritten queries into one > query, so e

RE: best practice: 1.4 billions documents

2010-11-26 Thread Uwe Schindler
er@lucene.apache.org; Uwe Schindler > Subject: Re: best practice: 1.4 billions documents > > On Mon, Nov 22, 2010 at 12:49 PM, Uwe Schindler wrote: > > (Fuzzy scores on > > MultiSearcher and Solr are totally wrong because each shard uses > > another rewritten query). &

Re: best practice: 1.4 billions documents

2010-11-26 Thread Yonik Seeley
On Mon, Nov 22, 2010 at 12:49 PM, Uwe Schindler wrote: > (Fuzzy scores on > MultiSearcher and Solr are totally wrong because each shard uses another > rewritten query). Hmmm, really? I thought that fuzzy scoring should just rely on edit distance? Oh wait, I think I see - it's because we can use

RE: best practice: 1.4 billions documents

2010-11-25 Thread Uwe Schindler
e eMail: u...@thetaphi.de > -Original Message- > From: Ganesh [mailto:emailg...@yahoo.co.in] > Sent: Thursday, November 25, 2010 9:55 AM > To: java-user@lucene.apache.org > Subject: Re: best practice: 1.4 billions documents > > Thanks for the input. > > My results

Re: best practice: 1.4 billions documents

2010-11-25 Thread Ganesh
Thanks for the input. My results are sorted by date and I am not much bothered about score. Will I still be in trouble? Regards Ganesh - Original Message - From: "Robert Muir" To: Sent: Thursday, November 25, 2010 1:45 PM Subject: Re: best practice: 1.4 billions document

Re: best practice: 1.4 billions documents

2010-11-25 Thread Robert Muir
On Thu, Nov 25, 2010 at 2:58 AM, Uwe Schindler wrote: > ParallelMultiSearcher as subclass of MultiSearcher has the same problems. > These are not crashes, but more that some queries do not return correct > scored results for some queries. This effects especially all MultiTermQueries > (TermRang

RE: best practice: 1.4 billions documents

2010-11-24 Thread Uwe Schindler
ilto:emailg...@yahoo.co.in] > Sent: Thursday, November 25, 2010 6:44 AM > To: java-user@lucene.apache.org > Subject: Re: best practice: 1.4 billions documents > > Since there was a debate about using multisearcher, what about using > ParallelMultiSearcher? > > I am having indexe

Re: best practice: 1.4 billions documents

2010-11-24 Thread Ganesh
now I didn't face any issue. I used Lucene 2.9 and recently upgraded to 3.0.2. Do I need to switch to MultiReader? Regards Ganesh - Original Message - From: "Luca Rondanini" To: Sent: Monday, November 22, 2010 11:29 PM Subject: Re: best practice: 1.4 billions docu

Re: best practice: 1.4 billions documents

2010-11-22 Thread Luca Rondanini
eheheheh, 1.4 billion documents = 1,400,000,000 documents, for almost 2T = 2 terabytes = 2000 gigabytes on HD! On Mon, Nov 22, 2010 at 10:16 AM, wrote: > > of course I will distribute my index over many machines: > > store everything on > > one computer is just crazy, 1.4B docs is going to b

RE: best practice: 1.4 billions documents

2010-11-22 Thread spring
> of course I will distribute my index over many machines: > store everything on > one computer is just crazy, 1.4B docs is going to be an index > of almost 2T > (in my case) billion = giga (10^9) in English, billion = tera (10^12) in some other languages 2T docs = 2,000,000,000,000 docs... ;) AFAIK 2^31 - 1 docs is

Re: best practice: 1.4 billions documents

2010-11-22 Thread Luca Rondanini
earchers, indexing additional documents, or filling FieldCache in > parallel. > > > > Uwe > > > > - > > Uwe Schindler > > H.-H.-Meier-Allee 63, D-28213 Bremen > > http://www.thetaphi.de > > eMail: u...@thetaphi.de > > > > > > > -Original Messa

RE: best practice: 1.4 billions documents

2010-11-22 Thread Uwe Schindler
il.com [mailto:ysee...@gmail.com] On Behalf Of Yonik > Seeley > Sent: Monday, November 22, 2010 6:29 PM > To: java-user@lucene.apache.org > Subject: Re: best practice: 1.4 billions documents > > On Mon, Nov 22, 2010 at 12:17 PM, Uwe Schindler wrote: > > The latest discussion

Re: best practice: 1.4 billions documents

2010-11-22 Thread Yonik Seeley
On Mon, Nov 22, 2010 at 12:17 PM, Uwe Schindler wrote: > The latest discussion was more about MultiReader vs. MultiSearcher. > > But you are right, 1.4 B documents is not easy to go, especially when you > index grows and you get to the 2.1 B marker, then no MultiSearcher or > whatever helps. > > O
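The "2.1 B marker" referred to in this thread is the signed 32-bit integer ceiling: Lucene addresses documents by Java int docids, so a single index tops out at Integer.MAX_VALUE documents. A one-liner confirms the number, which is why 1.4 B documents in one index leaves uncomfortably little headroom for growth:

```java
public class DocLimit {
    public static void main(String[] args) {
        // Lucene docids are Java ints: one index can hold at most
        // Integer.MAX_VALUE = 2^31 - 1 = 2,147,483,647 documents.
        System.out.println(Integer.MAX_VALUE); // 2147483647
        System.out.println(Integer.MAX_VALUE - 1_400_000_000); // headroom left
    }
}
```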

RE: best practice: 1.4 billions documents

2010-11-22 Thread Uwe Schindler
er@lucene.apache.org > Subject: Re: best practice: 1.4 billions documents > > Am I the only one who thinks this is not the way to go, MultiReader (or > MulitiSearcher) is not going to fix your problems. Having 1.4B Documents on > one machine is a big number, does not matter how you

Re: best practice: 1.4 billions documents

2010-11-22 Thread eks dev
ling FieldCache in parallel. > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > > -Original Message- > > From: David Fertig [mailto:dfer...@cymfony.com] > > Sent: Monday,

RE: best practice: 1.4 billions documents

2010-11-22 Thread Uwe Schindler
-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: David Fertig [mailto:dfer...@cymfony.com] > Sent: Monday, November 22, 2010 5:57 PM > To: java-user@lucene.apache.org > Subject: RE: best practice: 1.4 billions documents >

RE: best practice: 1.4 billions documents

2010-11-22 Thread David Fertig
--- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Monday, November 22, 2010 11:19 AM To: java-user@lucene.apache.org Subject: RE: best practice: 1.4 billions documents There is no reason to use MultiSearcher instead the much more consistent and effective MultiReader! We (Robert and me) are

RE: best practice: 1.4 billions documents

2010-11-22 Thread Uwe Schindler
remen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: David Fertig [mailto:dfer...@cymfony.com] > Sent: Monday, November 22, 2010 4:54 PM > To: java-user@lucene.apache.org > Subject: RE: best practice: 1.4 billions documents > > >> We have a couple

RE: best practice: 1.4 billions documents

2010-11-22 Thread David Fertig
ni [mailto:luca.rondan...@gmail.com] Sent: Monday, November 22, 2010 1:47 AM To: java-user@lucene.apache.org Subject: Re: best practice: 1.4 billions documents Hi David, thanks for your answer. it really helped a lot! so, you have an index with more than 2 billions segments. this is pretty

Re: best practice: 1.4 billions documents

2010-11-22 Thread Erick Erickson
iginal Message- > > From: Luca Rondanini [mailto:luca.rondan...@gmail.com] > > Sent: Sunday, November 21, 2010 8:13 PM > > To: java-user@lucene.apache.org; yo...@lucidimagination.com > > Subject: Re: best practice: 1.4 billions documents > > > > thank you bot

Re: best practice: 1.4 billions documents

2010-11-21 Thread Luca Rondanini
M > To: java-user@lucene.apache.org; yo...@lucidimagination.com > Subject: Re: best practice: 1.4 billions documents > > thank you both! > > Johannes, katta seems interesting but I will need to solve the problems of > "hot" updates to the index > > Yonik, I see

RE: best practice: 1.4 billions documents

2010-11-21 Thread David Fertig
From: Luca Rondanini [mailto:luca.rondan...@gmail.com] Sent: Sunday, November 21, 2010 8:13 PM To: java-user@lucene.apache.org; yo...@lucidimagination.com Subject: Re: best practice: 1.4 billions documents thank you both! Johannes, katta seems interesting but I will need to solve the problems of &qu

Re: best practice: 1.4 billions documents

2010-11-21 Thread Luca Rondanini
thank you both! Johannes, katta seems interesting but I will need to solve the problems of "hot" updates to the index Yonik, I see your point - so your suggestion would be to build an architecture based on ParallelMultiSearcher? On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley wrote: > On Sun, No

Re: best practice: 1.4 billions documents

2010-11-21 Thread Yonik Seeley
On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini wrote: > Hi everybody, > > I really need some good advice! I need to index in lucene something like 1.4 > billions documents. I had experience in lucene but I've never worked with > such a big number of documents. Also this is just the number of docs

Re: best practice: 1.4 billions documents

2010-11-21 Thread Johannes Goll
Hi Luca, Katta is an open-source project that integrates Lucene with Hadoop http://katta.sourceforge.net Johannes 2010/11/21 Luca Rondanini > Hi everybody, > > I really need some good advice! I need to index in lucene something like > 1.4 > billions documents. I had experience in lucene but I'

best practice: 1.4 billions documents

2010-11-21 Thread Luca Rondanini
Hi everybody, I really need some good advice! I need to index in lucene something like 1.4 billions documents. I had experience in lucene but I've never worked with such a big number of documents. Also this is just the number of docs at "start-up": they are going to grow and fast. I don't have to

Re: Best practice for embedding extra information in an index

2010-09-21 Thread Erick Erickson
Off the top of my head... 1) is certainly easiest. This looks suspiciously like synonyms. That is, at index time you inject the ID as a synonym in the text and it gets indexed at the same position as the token. Why this helps is that then phrase queries continue to work. Lucene in Actio

Best practice for embedding extra information in an index

2010-09-21 Thread Christopher Condit
I'm curious about embedding extra information in an index (and being able to search the extra information as well). In this case certain tokens correspond to recognized entities with ids. I'd like to get the ids into the index so that searching for the id of the entity will also return that docu

Re: What is the best practice of using synonymy ?

2010-03-23 Thread Jeff Zhang
Ahmet, Thanks for your suggestion; could you explain more about this or refer me to an article that explains the reason in detail? Thanks On Tue, Mar 23, 2010 at 6:33 PM, Ahmet Arslan wrote: > > > > I'd like to use the synonymy in my project. And I think > > there's two > > candidates s

Re: What is the best practice of using synonymy ?

2010-03-23 Thread Anshum
Index time is a much better approach. The only negative about it is the increase in index size. I've used it for a considerably sized dataset and even the indexing time doesn't seem to go up much. Searching on multiple terms is generally unoptimized when you could do it with one. -- Anshum Gupta Nau

Re: What is the best practice of using synonymy ?

2010-03-23 Thread Ahmet Arslan
> I'd like to use the synonymy in my project. And I think > there's two > candidates solution : > 1. using the synonymy in the indexing stage, enhance the > index by using > synonymy > 2. using the synonymy in the search stage, enhance the > search query by > synonymy . > > I'd like to know whic

What is the best practice of using synonymy ?

2010-03-22 Thread Jeff Zhang
Hi all, I'd like to use synonyms in my project, and I think there are two candidate solutions: 1. use synonyms at the indexing stage, enhancing the index with synonyms 2. use synonyms at the search stage, enhancing the search query with synonyms. I'd like to know which one is better
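The two options can be sketched side by side in plain Java. The synonym map and method names below are invented for illustration only; in a real Lucene setup the index-time variant would live inside the analyzer (e.g. a SynonymGraphFilter), and the trade-off is the one the thread describes: index-time expansion grows the index but keeps queries simple, query-time expansion keeps the index small but turns each term into an OR over its synonyms:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SynonymDemo {
    // Toy synonym map, invented for this sketch.
    static final Map<String, List<String>> SYNONYMS =
            Map.of("car", List.of("automobile", "vehicle"));

    // Option 1: index-time expansion. Every token plus its synonyms
    // goes into the index (bigger index, cheap queries).
    static List<String> expandForIndex(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t);
            out.addAll(SYNONYMS.getOrDefault(t, List.of()));
        }
        return out;
    }

    // Option 2: query-time expansion. The index stays small and the
    // query becomes an OR over the term and its synonyms.
    static String expandForQuery(String term) {
        List<String> alts = new ArrayList<>(List.of(term));
        alts.addAll(SYNONYMS.getOrDefault(term, List.of()));
        return String.join(" OR ", alts);
    }

    public static void main(String[] args) {
        System.out.println(expandForIndex(List.of("red", "car")));
        System.out.println(expandForQuery("car"));
    }
}
```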

RE: Batch Indexing - best practice?

2010-03-17 Thread Murdoch, Paul
> > -Original Message- > From: java-user-return-45433-paul.b.murdoch=saic@lucene.apache.org > [mailto:java-user-return-45433-paul.b.murdoch=saic@lucene.apache.org > ] On Behalf Of Mark Miller > Sent: Monday, March 15, 2010 10:48 AM > To: java-user@lucene.apache.org

Re: Batch Indexing - best practice?

2010-03-15 Thread Erick Erickson
-Original Message- > From: java-user-return-45433-paul.b.murdoch=saic@lucene.apache.org > [mailto:java-user-return-45433-paul.b.murdoch=saic@lucene.apache.org > ] On Behalf Of Mark Miller > Sent: Monday, March 15, 2010 10:48 AM > To: java-user@lucene.apache.org > Subject:

Re: Batch Indexing - best practice?

2010-03-15 Thread Mark Miller
:48 AM To: java-user@lucene.apache.org Subject: Re: Batch Indexing - best practice? On 03/15/2010 10:41 AM, Murdoch, Paul wrote: Hi, I'm using Lucene 2.9.2. Currently, when creating my index, I'm calling indexWriter.addDocument(doc) for each Document I want to ind

RE: Batch Indexing - best practice?

2010-03-15 Thread Murdoch, Paul
:java-user-return-45433-paul.b.murdoch=saic@lucene.apache.org ] On Behalf Of Mark Miller Sent: Monday, March 15, 2010 10:48 AM To: java-user@lucene.apache.org Subject: Re: Batch Indexing - best practice? On 03/15/2010 10:41 AM, Murdoch, Paul wrote: > Hi, > > > > I'm using

Re: Batch Indexing - best practice?

2010-03-15 Thread Ian Lea
See http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for plenty of tips. Suggested by Mike just a few hours ago in another thread ... -- Ian. On Mon, Mar 15, 2010 at 2:41 PM, Murdoch, Paul wrote: > Hi, > > > > I'm using Lucene 2.9.2.  Currently, when creating my index, I'm calling > ind

Re: Batch Indexing - best practice?

2010-03-15 Thread Mark Miller
On 03/15/2010 10:41 AM, Murdoch, Paul wrote: Hi, I'm using Lucene 2.9.2. Currently, when creating my index, I'm calling indexWriter.addDocument(doc) for each Document I want to index. The Documents aren't large and I'm averaging indexing about 500 documents every 90 seconds. I'd like to try

Batch Indexing - best practice?

2010-03-15 Thread Murdoch, Paul
Hi, I'm using Lucene 2.9.2. Currently, when creating my index, I'm calling indexWriter.addDocument(doc) for each Document I want to index. The Documents aren't large and I'm averaging indexing about 500 documents every 90 seconds. I'd like to try and speed this up... unless 90 seconds for 50

Re: Best Practice 3.0.0

2010-02-08 Thread Michael McCandless
Use IndexWriter.getReader to get a near real-time reader, after making changes... Mike On Mon, Feb 8, 2010 at 3:45 AM, NanoE wrote: > > Hello, > > I am writing small library search and want to know what are the best > practice for lucene 3.0.0 for almost real time index update?

Best Practice 3.0.0

2010-02-08 Thread NanoE
Hello, I am writing small library search and want to know what are the best practice for lucene 3.0.0 for almost real time index update? Thanks Nano -- View this message in context: http://old.nabble.com/Best-Practice-3.0.0-tp27496796p27496796.html Sent from the Lucene - Java Users mailing

Re: best practice on too many files vs IO overhead

2009-11-27 Thread Michael McCandless
Phew :) Thanks for bringing closure! Mike On Fri, Nov 27, 2009 at 6:02 AM, Michael McCandless wrote: > If in fact you are using CFS (it is the default), and your OS is > letting you use 10240 descriptors, and you haven't changed the > mergeFactor, then something is seriously wrong.  I would tri

Re: best practice on too many files vs IO overhead

2009-11-27 Thread Istvan Soos
You were right, my bad... I have an async reader closing on a scheduled basis (after the writer refreshes the index, so as not to interrupt ongoing searches), but while I had set up the scheduling for my first two indexes, I forgot it in my third... oh dear... Thanks anyway for the info, it was usefu

Re: best practice on too many files vs IO overhead

2009-11-27 Thread Michael McCandless
If in fact you are using CFS (it is the default), and your OS is letting you use 10240 descriptors, and you haven't changed the mergeFactor, then something is seriously wrong. I would triple check that all readers are being closed. Or... if you list the index directory, how many files do you see?

Re: best practice on too many files vs IO overhead

2009-11-27 Thread Istvan Soos
On Fri, Nov 27, 2009 at 11:37 AM, Michael McCandless wrote: > Are you sure you're closing all readers that you're opening? Absolutely. :) (okay, never say this, but I had bugz because of this previously so I'm pretty sure that one is ok). > It's surprising with normal usage of Lucene that you'd

Re: best practice on too many files vs IO overhead

2009-11-27 Thread Michael McCandless
Are you sure you're closing all readers that you're opening? It's surprising with normal usage of Lucene that you'd run out of descriptors, with its default mergeFactor (have you increased the mergeFactor)? You can also enable compound file, which uses far fewer file descriptors, at some cost to

best practice on too many files vs IO overhead

2009-11-27 Thread Istvan Soos
Hi, I've a requirement that involves frequent, batched updates of my Lucene index. This is done by a memory queue and a process that periodically wakes up and processes that queue into the Lucene index. If I do not optimize my index, I'll receive a "too many open files" exception (yeah, right, I can get th

Re: what's the best practice for getting "next page" of hits?

2009-02-19 Thread Erick Erickson
The best practice is, well, "It Depends" (tm). First off, I wouldn't do any caching of results unless and until you had a reasonable certainty that you had performance issues, so would by my first choice. And if you *did* start to see performance issues, I'd look first at

Re: what's the best practice for getting "next page" of hits?

2009-02-19 Thread Joel Halbert
...@earthlink.net Subject: Re: what's the best practice for getting "next page" of hits? Date: Thu, 19 Feb 2009 10:48:02 +0530 Your solution (b) is better rather than using your own way of paging. Do search for every page and collect the (pageno * count) results, discard (pageno
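The offset-paging approach suggested here (re-run the search asking for pageNo * pageSize hits, then discard everything before the requested page) reduces to simple slicing arithmetic. The stdlib sketch below shows only that slicing, with 0-based page numbers; with Lucene itself you would pass the larger count to IndexSearcher.search and skip the leading ScoreDocs:

```java
import java.util.List;

public class Paging {
    // Offset paging: given the top hits re-collected for this request,
    // keep only the slice belonging to the requested (0-based) page.
    static <T> List<T> page(List<T> topHits, int pageNo, int pageSize) {
        int from = Math.min(pageNo * pageSize, topHits.size());
        int to = Math.min(from + pageSize, topHits.size());
        return topHits.subList(from, to);
    }

    public static void main(String[] args) {
        List<Integer> hits = List.of(10, 11, 12, 13, 14, 15, 16);
        System.out.println(page(hits, 1, 3)); // [13, 14, 15]
        System.out.println(page(hits, 2, 3)); // [16]
    }
}
```

The cost is that page n re-collects all n*pageSize earlier hits, which is why caching or cursor-style continuation comes up elsewhere in the thread.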

Re: what's the best practice for getting "next page" of hits?

2009-02-18 Thread Ganesh
: To: Sent: Thursday, February 19, 2009 8:59 AM Subject: what's the best practice for getting "next page" of hits? R2.4 So, I may well be missing something here, but: I use IndexSearcher.search(someQuery, null, count, new Sort()); to get an instance of TopFieldDocs (the "

what's the best practice for getting "next page" of hits?

2009-02-18 Thread rolarenfan
R2.4 So, I may well be missing something here, but: I use IndexSearcher.search(someQuery, null, count, new Sort()); to get an instance of TopFieldDocs (the "Hits" is deprecated). So far, all fine; I get a bunch of documents. Now, what is the Lucene-best-practice for getting the *n

Best Practice for Lucene Search

2009-02-11 Thread Konstantyn Smirnov
ew this message in context: http://www.nabble.com/Best-Practice-for-Lucene-Search-tp21748839p21955474.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene

Re: Best Practice for Lucene Search

2009-02-02 Thread Karsten F.
ed, searched this Forum and read the manual, but I'm not sure what > would be the best practice for Lucene search. > > I have an e-Commerce application with about 10 mySQL tables for my > products. And I have an Index (which is working fine), with about 10 > fields for every pr

Re: Best Practice for Lucene Search

2009-02-01 Thread ilwes
I like the point about doing things the easiest way possible until it starts to become a problem. Thank you very much for your answers and for the insight how you handle this issue. You helped me a lot. Ilwes -- View this message in context: http://www.nabble.com/Best-Practice-for-Lucene

Re: Best Practice for Lucene Search

2009-01-30 Thread Erick Erickson
tails you have/ expect to have, because the answer varies depending upon what you need/expect. Best Erick On Fri, Jan 30, 2009 at 10:08 AM, ilwes wrote: > > Hello, > > I googled, searched this Forum and read the manual, but I'm not sure what > would be the best practic

RE: Best Practice for Lucene Search

2009-01-30 Thread Uwe Schindler
...@thetaphi.de > -Original Message- > From: Ian Lea [mailto:ian@gmail.com] > Sent: Friday, January 30, 2009 4:57 PM > To: java-user@lucene.apache.org > Subject: Re: Best Practice for Lucene Search > > That answer is fine, but there are others. We store denormalized

Re: Best Practice for Lucene Search

2009-01-30 Thread Ian Lea
- just what is returned for product searches. Overall I don't think there is a single best practice recommendation. As so often, it depends on your setup, requirements and preferences. -- Ian. On Fri, Jan 30, 2009 at 3:13 PM, Nilesh Thatte wrote: > Hello > > I would store normalised

Re: Best Practice for Lucene Search

2009-01-30 Thread Nilesh Thatte
Hello I would store normalised data in MySQL and index only searchable content in Lucene. Regards Nilesh   From: ilwes To: java-user@lucene.apache.org Sent: Friday, 30 January, 2009 15:08:10 Subject: Best Practice for Lucene Search Hello, I googled

Best Practice for Lucene Search

2009-01-30 Thread ilwes
Hello, I googled, searched this Forum and read the manual, but I'm not sure what would be the best practice for Lucene search. I have an e-Commerce application with about 10 mySQL tables for my products. And I have an Index (which is working fine), with about 10 fields for every product.

Re: Best practice for updating an index when reindexing is not an option

2008-07-11 Thread Michael McCandless
OK, sounds good. Fall will be here before you know it! Mike Christopher Kolstad wrote: The only way to make this work with svn is if you can have svn perform a switch without doing any removal, then restart your IndexSearcher, then do a normal svn switch to remove the now unused files.

Re: Best practice for updating an index when reindexing is not an option

2008-07-11 Thread Christopher Kolstad
> > The only way to make this work with svn is if you can have svn perform a > switch without doing any removal, then restart your IndexSearcher, then do a > normal svn switch to remove the now unused files. Does svn have an option > to "switch but don't remove any removed files"? Because IndexSe

Re: Best practice for updating an index when reindexing is not an option

2008-07-11 Thread Michael McCandless
OK, got it. The only way to make this work with svn is if you can have svn perform a switch without doing any removal, then restart your IndexSearcher, then do a normal svn switch to remove the now unused files. Does svn have an option to "switch but don't remove any removed files"? Bec

Re: Best practice for updating an index when reindexing is not an option

2008-07-11 Thread Christopher Kolstad
Hi. First, thanks for the reply. Why does SubversionUpdate require shutting down the IndexSearcher? What > goes wrong? > SubversionUpdate requires shutting down the IndexSearcher in our current implementation because the old index files are deleted in the tag we're switching to. Sorry, just rea

Re: Best practice for updating an index when reindexing is not an option

2008-07-10 Thread Michael McCandless
Why does SubversionUpdate require shutting down the IndexSearcher? What goes wrong? You might want to switch instead to rsync. A Lucene index is fundamentally write once, so, syncing changes over should simply be copying over new files and removing now-deleted files. You won't be able

Best practice for updating an index when reindexing is not an option

2008-07-10 Thread Christopher Kolstad
Hi. Currently using Lucene 2.3.2 in a tomcat webapp. We have an action configured that performs reindexing on our staging server. However, our live server cannot reindex since it does not have the necessary DTD files to process the XML. To update the index on the live server we perform a subvers

Re: Search by KeyWord, the best practice

2007-12-27 Thread Erick Erickson
webspeak <[EMAIL PROTECTED]> wrote: > Hello, > I would like to search documents by "CUSTOMER". > So I search on the field "CUSTOMER" using a KeywordAnalyzer. > The CUST

Re: Search by KeyWord, the best practice

2007-12-27 Thread webspeak
ike to search documents by "CUSTOMER". >> So I search on the field "CUSTOMER" using a KeywordAnalyzer. >> >> The CUSTOMER field is indexed with those params: >> Field.Index.UN_TOKENIZED >> Field.Index.Store >> >> Is it the Best Practice ?

Re: Search by KeyWord, the best practice

2007-12-27 Thread Erick Erickson
search on the field "CUSTOMER" using a KeywordAnalyzer. > > The CUSTOMER field is indexed with those params: > Field.Index.UN_TOKENIZED > Field.Index.Store > > Is it the Best Practice ? > > -- > View this message in context: > http://www.nabble.com/Search-by-Key

Re: Search by KeyWord, the best practice

2007-12-27 Thread Grant Ingersoll
eywordAnalyzer. The CUSTOMER field is indexed with those params: Field.Index.UN_TOKENIZED Field.Index.Store Is it the Best Practice ? -- View this message in context: http://www.nabble.com/Search-by-KeyWord%2C-the-best-practice-tp14513720p14513720.html Sent from the Lucene - Java Users mailing list a

Search by KeyWord, the best practice

2007-12-27 Thread webspeak
Hello, I would like to search documents by "CUSTOMER". So I search on the field "CUSTOMER" using a KeywordAnalyzer. The CUSTOMER field is indexed with those params: Field.Index.UN_TOKENIZED Field.Index.Store Is it the Best Practice ? -- View this message in context: h
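The combination the thread settles on (an untokenized, stored field matched exactly) looks like this in a minimal sketch. Note that `Field.Index.UN_TOKENIZED` is the old 2.x API; in modern Lucene the equivalent is `StringField`, which indexes the whole value as a single term with no analysis, and the `TermQuery` must then match the value byte-for-byte, including case.

```java
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class CustomerSearch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        // StringField = modern equivalent of Field.Index.UN_TOKENIZED + Store:
        // the whole value becomes one indexed term, and the value is stored.
        try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("CUSTOMER", "ACME Corp", Field.Store.YES));
            w.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // TermQuery bypasses analysis entirely, so only the exact
            // value "ACME Corp" (same case, same spacing) will match.
            TopDocs hits = searcher.search(new TermQuery(new Term("CUSTOMER", "ACME Corp")), 10);
            System.out.println(hits.totalHits.value); // 1
        }
    }
}
```

KeywordAnalyzer only matters if the query is built through a QueryParser; with a hand-built `TermQuery`, as here, no analyzer touches the query term at all.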

Re: Best Practice: emails and file-attachments

2006-08-16 Thread John Haxby
Oh rats. Thunderbird ate the indenting. The two examples should be: multipart/alternative text/plain multipart/related text/html image/gif image/gif application/msword and multipart/related text/html image/

Re: Best Practice: emails and file-attachments

2006-08-16 Thread John Haxby
lude wrote: You also mentioned indexing each bodypart ("attachment") separately. Why? To my mind, there is no use case where it makes sense to search a particular bodypart I will give you the use case: [snip] 3.) The result list would show this: 1. mail-1 'subject' 'Abstract of the messa

Re: Best Practice: emails and file-attachments

2006-08-16 Thread lude
Hi Johan, thanks again for the many words and explanations! You also mentioned indexing each bodypart ("attachment") separately. Why? To my mind, there is no use case where it makes sense to search a particular bodypart I will give you the use case: 1.) User searches for "abcd" 2.) Luc

Re: Best Practice: emails and file-attachments

2006-08-16 Thread John Haxby
lude wrote: Hi John, thanks for the detailed answer. You wrote: If you're indexing a multipart/alternative bodypart then index all the MIME headers, but only index the content of the *first* bodypart. Does this mean you index just the first file-attachment? What do you advise, if you have to

Re: Best Practice: emails and file-attachments

2006-08-16 Thread lude
essage- From: lude [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 15, 2006 10:29 AM To: java-user@lucene.apache.org Subject: Best Practice: emails and file-attachments Hello, does anybody has an idea what is the best design approch for realizing the following: The goal is to index

Re: Best Practice: emails and file-attachments

2006-08-16 Thread lude
Hi John, thanks for the detailed answer. You wrote: If you're indexing a multipart/alternative bodypart then index all the MIME headers, but only index the content of the *first* bodypart. Does this mean you index just the first file-attachment? What do you advise, if you have to index multip
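lude's use case (a hit inside an attachment should surface the message it belongs to, plus which part matched) is commonly handled by indexing one Lucene Document per bodypart, all sharing the message's id. The sketch below uses plain strings in place of real MIME parsing; the field names `messageId`, `part`, and `content` and the `bodypartDoc` helper are hypothetical, not from the thread.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class BodypartIndexing {
    // Hypothetical helper: one Lucene Document per bodypart, tied back to
    // its message by a shared "messageId" field.
    static Document bodypartDoc(String messageId, String partName, String text) {
        Document doc = new Document();
        doc.add(new StringField("messageId", messageId, Field.Store.YES));
        doc.add(new StringField("part", partName, Field.Store.YES));
        doc.add(new TextField("content", text, Field.Store.NO));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            w.addDocument(bodypartDoc("mail-1", "body", "abstract of the message"));
            w.addDocument(bodypartDoc("mail-1", "report.doc", "quarterly abcd figures"));
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("content", "abcd")), 10);
            // The hit is the attachment's document; "messageId" maps it back
            // to the mail, and "part" names the attachment that matched.
            Document hit = searcher.doc(hits.scoreDocs[0].doc);
            System.out.println(hit.get("messageId") + " / " + hit.get("part"));
        }
    }
}
```

This is what makes lude's result list possible: the search for "abcd" hits the attachment document, and the stored `messageId`/`part` fields identify both the mail and the specific attachment to display.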
