Re: How to setup a scalable deployment?

2009-10-08 Thread Chris Were
Hi Jake, Thanks for the great insight and suggestions. I will investigate different optimize() levels and see if that helps my particular use case -- if not I'll be considering the Zoie route and let you know how I get on. Cheers, Chris On Fri, Oct 9, 2009 at 3:40 PM, Jake Mannix wrote: > > >

Re: Realtime & distributed

2009-10-08 Thread John Wang
Jason: I would really appreciate it if you would stop making false statements and spreading misinformation. Everyone is entitled to his/her opinions on technologies, but deliberately posting misleading and false information on such a distribution is just unethical, and you'll end up just discrediting

Re: How to setup a scalable deployment?

2009-10-08 Thread Jake Mannix
On Thu, Oct 8, 2009 at 9:32 PM, Chris Were wrote: > Zoie looks very close to what I'm after, however my whole app is written in > Python and uses PyLucene, so there is a non-trivial amount of work to make > things work with Zoie. > I've never used PyLucene before, but since it's a wrapper, plugg

Re: FileNotFoundException on index

2009-10-08 Thread Max Lynch
Missed your response, thanks Bernd. I don't think that's it, since I haven't been executing any commands like that. The only thing I could think of is corruption. I've got the index backed up in case there is a way to fix it (it won't matter in a week or so since I cull any documents older than

Re: How to setup a scalable deployment?

2009-10-08 Thread Chris Were
> > In this case, I'd say that if you have a reliable, scalable queueing system > for > getting indexing events distributed to all of your servers, then indexing > on > all replicas simultaneously can be the best way to have maximally realtime > search, either using the very new feature of "near re

Re: Realtime & distributed

2009-10-08 Thread Jake Mannix
On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen wrote: > There is the Zoie system which uses the RAMDir > solution, > Also, to clarify: zoie does not index into a RAMDir and then periodically merge that down to disk, as for one thing, this has a bad failure mode when the system crashes, as you

Re: Realtime & distributed

2009-10-08 Thread Jake Mannix
On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric wrote: > > Does anyone have any recommendations? I've looked at Katta, but it doesn't > seem to support realtime searching. It also uses hdfs, which I've heard can > be slow. I'm looking to serve 40gb of indexes and support about 1 million > updates

Re: Realtime & distributed

2009-10-08 Thread Jake Mannix
Jason, On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen wrote: > Today near realtime search (with or without SSDs) comes at a > price, that is reduced indexing speed due to continued in RAM > merging. People typically hack something together where indexes > are held in a RAMDir until being flush

Re: Reverse stemmer?

2009-10-08 Thread Karl Wettin
For the case where the text contains mixed languages, there are solutions that simultaneously use the morphological rules of two or more languages. Coveo search does this, but I don't know what their solution looks like. I suppose one way to do it would be to stem all tokens with all algorithms an

Re: Realtime & distributed

2009-10-08 Thread Jason Rutherglen
Eric, Katta doesn't require HDFS which would be slow to search on, though Katta can be used to copy indexes out of HDFS onto local servers. The best bet is hardware that uses SSDs because merges and update latency will greatly decrease and there won't be a synchronous IO issue as there is with har

Realtime & distributed

2009-10-08 Thread Angel, Eric
Does anyone have any recommendations? I've looked at Katta, but it doesn't seem to support realtime searching. It also uses hdfs, which I've heard can be slow. I'm looking to serve 40gb of indexes and support about 1 million updates per day. Thx ---

RE: 2.9: TopScoreDocCollector

2009-10-08 Thread Angel, Eric
Thanks. Makes sense. -Original Message- From: Jake Mannix [mailto:jake.man...@gmail.com] Sent: Wednesday, October 07, 2009 10:15 PM To: java-user@lucene.apache.org Subject: Re: 2.9: TopScoreDocCollector Hi Eric, Different Query classes have different options on whether they can score

Re: Index.close() infinite TIME_WAITING

2009-10-08 Thread Michael McCandless
Is it possible a large merge is running? By default IW.close waits for outstanding merges to complete. Can you post the stacktrace? Mike On Thu, Oct 8, 2009 at 5:22 PM, Jamie Band wrote: > Hi All > > I have a long running situation where our indexing thread is getting stuck > indefinitely in I

RE: Index.close() infinite TIME_WAITING

2009-10-08 Thread Uwe Schindler
Did you do some extra locking around IndexWriter, using the IndexWriter itself as the mutex (e.g. synchronized(writer) {...})? This is not supported and hangs. IndexWriter itself is thread-safe. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.d

Index.close() infinite TIME_WAITING

2009-10-08 Thread Jamie Band
Hi All, I have a long-running situation where our indexing thread is getting stuck indefinitely in IndexWriter's close method. YourKit shows the thread to be stuck in TIME_WAITING. Any ideas on what could be causing this? Could it be one of the streams or readers we passed to the document? I

Re: Reverse stemmer?

2009-10-08 Thread Jason Rutherglen
Out of curiosity, and perhaps for practical purposes, how does one handle mixed-language documents? I suppose one could extract the words of a particular language and place them in a language-specific field? Are there libraries to perform this (yet)? On Thu, Oct 8, 2009 at 6:32 AM, Christian Reuschling

Re: Efficiently reopening remotely-distributed indexes in 2.9?

2009-10-08 Thread Mark Miller
Nigel wrote: > Thanks, Mark. That makes sense. I guess if you do it in the right order, > you're guaranteed to have the files in a consistent state, since the only > thing that's actually overwritten is the segments.gen file at the end. > The main thing to do is to copy the segments_N files la

Re: Efficiently reopening remotely-distributed indexes in 2.9?

2009-10-08 Thread Nigel
Thanks, Mark. That makes sense. I guess if you do it in the right order, you're guaranteed to have the files in a consistent state, since the only thing that's actually overwritten is the segments.gen file at the end. What about the technique of creating a copy of the directory with hard links a
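
The ordering idea in this thread can be sketched in plain Java. This is only an illustration under an assumption: that the order being discussed is data files first, then the segments_N files, then segments.gen at the very end. The class and method names (IndexCopyOrderSketch, rank, copyOrder) are hypothetical and are not part of Lucene; the sketch only sorts file names and does not copy anything.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, not a Lucene class: orders index file names for copying. */
class IndexCopyOrderSketch {

    /** Assumed priority: 0 = ordinary index files, 1 = segments_N, 2 = segments.gen. */
    static int rank(String name) {
        if (name.equals("segments.gen")) {
            return 2;
        }
        if (name.startsWith("segments_")) {
            return 1;
        }
        return 0;
    }

    /** Returns a new list sorted by rank, then by name, leaving the input untouched. */
    static List<String> copyOrder(List<String> names) {
        List<String> ordered = new ArrayList<String>(names);
        Collections.sort(ordered, new Comparator<String>() {
            public int compare(String a, String b) {
                int byRank = Integer.compare(rank(a), rank(b));
                return byRank != 0 ? byRank : a.compareTo(b);
            }
        });
        return ordered;
    }
}
```

A copy job would then walk the returned list from first to last, so that the file that gets overwritten is only touched once everything it refers to is already in place.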

Re: Question about how to speed up custom scoring

2009-10-08 Thread Andrzej Bialecki
Erick Erickson wrote: I suspect your problem here is the line: document = indexReader.document( doc ); See the caution in the docs You could try using lazy loading (so you don't load all the terms of the document, just those you're interested in). And I *think* (but it's been a while) that if t

Re: Question about how to speed up custom scoring

2009-10-08 Thread Erick Erickson
I suspect your problem here is the line: document = indexReader.document( doc ); See the caution in the docs You could try using lazy loading (so you don't load all the terms of the document, just those you're interested in). And I *think* (but it's been a while) that if the terms you load are in

Re: Question about how to speed up custom scoring

2009-10-08 Thread scott w
Oops, forgot to include the class I mentioned. Here it is: public class QueryTermBoostingQuery extends CustomScoreQuery { private Map queryTermWeights; private float bias; private IndexReader indexReader; public QueryTermBoostingQuery( Query q, Map termWeights, IndexReader indexReader, fl

Question about how to speed up custom scoring

2009-10-08 Thread scott w
I am trying to come up with a performant query that will allow me to use a custom score where the custom score is a sum-product over a set of query time weights where each weight gets applied only if the query time term exists in the document . So for example if I have a doc with three fields: comp
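
The scoring rule described here (each query-time weight counts only if its term occurs in the document) can be written down as a small plain-Java sketch. It is not the CustomScoreQuery API and not the poster's class; TermWeightScoreSketch and its score method are hypothetical names, and the document is assumed to be available as a simple set of terms.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not a Lucene class: sums the weights of query terms found in a document. */
class TermWeightScoreSketch {

    /**
     * Returns the sum of queryTermWeights[t] over every query term t
     * that is present in docTerms; absent terms contribute nothing.
     */
    static float score(Map<String, Float> queryTermWeights, Set<String> docTerms) {
        float sum = 0f;
        for (Map.Entry<String, Float> entry : queryTermWeights.entrySet()) {
            if (docTerms.contains(entry.getKey())) {
                sum += entry.getValue();
            }
        }
        return sum;
    }
}
```

For example, with weights lucene=2.0, solr=0.5, zoie=1.0 and a document containing only "lucene" and "zoie", the sketch adds 2.0 and 1.0 and skips the weight for "solr".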

Re: Search By Phrase Not Working

2009-10-08 Thread sadronmeldir
Hello, apologies for the typo before. I meant "Rain of Fire" but sleep deprivation had gotten the better of me. I've used Luke to get more details about the problem. Below, I've listed one of the docs that I would expect to return a hit on a query of (text:"rain fire"). stored/uncompressed,indexe

RE: Lucene 2.9.0 [PROBLEM] : TokenStream API (incrementToken / captureState / restoreState), cannot implement a "stop phrases filter"

2009-10-08 Thread Uwe Schindler
restoreState only restores the token contents, not the complete stream. So you cannot roll back the token stream (and this was also not possible with the old API). The while loop at the end of your code is not working as you expect because of this. You may use CachingTokenFilter, which can be reset

Lucene 2.9.0 [PROBLEM] : TokenStream API (incrementToken / captureState / restoreState), cannot implement a "stop phrases filter"

2009-10-08 Thread Enrico Detoma
Hi all, I'm trying to implement a "stop phrases filter" with the new TokenStream API. I would like to be able to peek N tokens ahead, see if the current token + N subsequent tokens match a "stop phrase" (the set of stop phrases is saved in a HashSet), then discard all these tokens when they
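
The look-ahead idea in this question can be sketched in plain Java on an already-tokenized list, without the TokenStream API. This is only an illustration of the matching logic under stated assumptions: stop phrases are stored as space-joined strings, the longest match wins, and the caller knows the maximum phrase length. StopPhraseSketch and filter are hypothetical names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not a Lucene class: drops stop phrases from a token list. */
class StopPhraseSketch {

    /**
     * Returns the tokens with every stop phrase removed.
     * stopPhrases holds space-joined phrases; maxLen is the longest phrase length in tokens.
     */
    static List<String> filter(List<String> tokens, Set<String> stopPhrases, int maxLen) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 0;
            // Look ahead from position i, trying the longest candidate phrase first.
            for (int len = Math.min(maxLen, tokens.size() - i); len >= 1; len--) {
                String candidate = String.join(" ", tokens.subList(i, i + len));
                if (stopPhrases.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                i += matched;            // discard the whole matched phrase
            } else {
                out.add(tokens.get(i));  // no phrase starts here, keep the token
                i++;
            }
        }
        return out;
    }
}
```

Because the tokens are buffered in a list, peeking ahead never needs to rewind anything; a streaming filter would have to keep a comparable buffer of pending tokens.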

Re: Reverse stemmer?

2009-10-08 Thread Nuno Seco
Hi. You may want to take a look at: http://wordlist.sourceforge.net/ -- Nuno Seco Christian Reuschling wrote: Hi, looking up the different terms with a common stem can be useful in different scenarios - so I don't want to judge whether someone needs it or not. E.g., in the case you have

Re: Reverse stemmer?

2009-10-08 Thread Christian Reuschling
Hi, looking up the different terms with a common stem can be useful in different scenarios - so I don't want to judge whether someone needs it or not. E.g., in the case you have multilingual documents in your index, it is straightforward to determine the language of the documents in order to

Re: InstantiatedIndex questions

2009-10-08 Thread David Causse
On Tue, Oct 06, 2009 at 07:51:44PM +0200, Karl Wettin wrote: > > On 6 Oct 2009 at 18:54, David Causse wrote: > > David, your timing couldn't be better. Just the other day I proposed > that we deprecate InstantiatedIndexWriter. The sum of the reasons for > this is that I'm a bit lazy. Your mail make

Re: Reverse stemmer?

2009-10-08 Thread Dawid Weiss
Stemmers are heuristic transformations aiming at reducing the vocabulary's dimensionality (and for other purposes I don't want to discuss here). For accurate transformations one would use a lemmatization engine (typically dictionary-driven) combined with morphological analysis for ambiguity resolu

Re: Search By Phrase Not Working

2009-10-08 Thread Christian Reuschling
Hi, I saw similar behaviour. On a self-built index of the German Wikipedia I searched for the phrase "blaue blume" and got 2 results. When I searched for +"blaue blume" "vogel" I got 59 results...strange. I found out that when I create a plain BooleanQuery with just the phrase "blaue blume" give

Re: Search By Phrase Not Working

2009-10-08 Thread Ian Lea
Could it be as simple as the fact that "Heart of Fire" != "Rain of Fire"? Have you checked, with Luke for example, that the phrases really are in the index? Can't spot anything obviously wrong with the code. You could cut down your example code to a minimal self contained program that demonstrat

Re: Review and questions about Lucene Java 2.9.0

2009-10-08 Thread Paul Libbrecht
Mehdi, it sounds like your requirements are mostly fulfilled by Apache Solr, which is a web-based packaging of Lucene. paul. On 08 Oct 09 at 10:11, Mehdi Ben Hamida wrote: Hello, I'm reviewing and doing some research on Lucene Java 2.9.0, to check if it meets our needs. Unfortunat

Review and questions about Lucene Java 2.9.0

2009-10-08 Thread Mehdi Ben Hamida
Hello, I'm reviewing and doing some research on Lucene Java 2.9.0, to check if it meets our needs. Unfortunately I can't find answers to some of my questions, and I hope you can answer them and provide any references that support your answers. - Do you confirm that Lucene enables load t

Re: FileNotFoundException on index

2009-10-08 Thread Bernd Fondermann
Hi Max just a guess: maybe you deleted all *.c source files in that area and unintentionally deleted this index file, too. Bernd On Fri, Oct 2, 2009 at 17:10, Max Lynch wrote: > I'm getting this error when I try to run my searcher and my indexer: > > Traceback (most recent call last): > self.

Re: Best strategy for reindexing large amount of data

2009-10-08 Thread Maarten_D
Yes it does. Thanks for the tips. I'm going to do some experimenting, and see if I can post some results here. Regards, Maarten Jake Mannix wrote: > > Hi Maarten, > > Five minutes is not tremendously frequently, and I imagine should be > pretty > fine, but again: it depends on how big your