Hi Jake,
Thanks for the great insight and suggestions.
I will investigate the different optimize() levels and see if that helps my
particular use case -- if not, I'll consider the Zoie route and let you
know how I get on.
Cheers,
Chris
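(For reference, a minimal sketch of the optimize() variants discussed above, assuming an already-open IndexWriter named writer; the segment count of 5 is only an illustration:)

// full optimize: merge the whole index down to one segment (expensive)
writer.optimize();
// partial optimize: merge down to at most 5 segments, which is much
// cheaper and often good enough for search performance
writer.optimize(5);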
On Fri, Oct 9, 2009 at 3:40 PM, Jake Mannix wrote:
Jason:
I would really appreciate it if you would stop making false
statements and spreading misinformation. Everyone is entitled to his/her opinions on
technologies, but deliberately spreading misleading and false information on
such a distribution list is just unethical, and you'll end up just discrediting
yourself.
On Thu, Oct 8, 2009 at 9:32 PM, Chris Were wrote:
> Zoie looks very close to what I'm after, however my whole app is written in
> Python and uses PyLucene, so there is a non-trivial amount of work to make
> things work with Zoie.
>
I've never used PyLucene before, but since it's a wrapper, plugg
Missed your response, thanks Bernd.
I don't think that's it, since I haven't been executing any commands like
that. The only thing I could think of is corruption. I've got the index
backed up in case there is a way to fix it (it won't matter in a week or so
since I cull any documents older than
>
> In this case, I'd say that if you have a reliable, scalable queueing
> system for getting indexing events distributed to all of your servers,
> then indexing on all replicas simultaneously can be the best way to have
> maximally realtime search, either using the very new feature of "near
> real-time search"
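(A minimal sketch of that very new near real-time feature in Lucene 2.9, assuming dir, analyzer, and doc already exist:)

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.LIMITED);
writer.addDocument(doc);
// getReader() hands back a near real-time reader that already sees the
// just-added document, without waiting for a commit to disk
IndexReader reader = writer.getReader();
IndexSearcher searcher = new IndexSearcher(reader);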
On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen wrote:
> There is the Zoie system which uses the RAMDir solution,
>
Also, to clarify: Zoie does not index into a RAMDir and then periodically
merge that down to disk; for one thing, this has a bad failure mode when
the system crashes, as you
On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric wrote:
>
> Does anyone have any recommendations? I've looked at Katta, but it doesn't
> seem to support realtime searching. It also uses hdfs, which I've heard can
> be slow. I'm looking to serve 40gb of indexes and support about 1 million
> updates
Jason,
On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen wrote:
> Today near realtime search (with or without SSDs) comes at a
> price: reduced indexing speed due to continual in-RAM
> merging. People typically hack something together where indexes
> are held in a RAMDir until being flushed to disk.
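(For concreteness, the kind of hack Jason describes looks roughly like this; a sketch assuming analyzer and an on-disk Directory fsDir already exist:)

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

RAMDirectory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, IndexWriter.MaxFieldLength.LIMITED);
// ... addDocument() calls accumulate in RAM here ...
ramWriter.close();

// periodically merge the RAM segments down into the on-disk index
IndexWriter diskWriter = new IndexWriter(fsDir, analyzer, IndexWriter.MaxFieldLength.LIMITED);
diskWriter.addIndexesNoOptimize(new Directory[] { ramDir });
diskWriter.close();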
For the case where the text contains mixed languages there are
solutions that simultaneously use the morphological rules of two or more
languages. Coveo search does this, but I don't know what their solution
looks like. I suppose one way to do it would be to stem all tokens
with all algorithms and
Eric,
Katta doesn't require HDFS, which would be slow to search on,
though Katta can be used to copy indexes out of HDFS onto local
servers. The best bet is hardware that uses SSDs, because merge
and update latency will greatly decrease and there won't be a
synchronous IO issue as there is with hard disks.
Does anyone have any recommendations? I've looked at Katta, but it
doesn't seem to support realtime searching. It also uses hdfs, which
I've heard can be slow. I'm looking to serve 40gb of indexes and
support about 1 million updates per day.
Thx
---
Thanks. Makes sense.
-----Original Message-----
From: Jake Mannix [mailto:jake.man...@gmail.com]
Sent: Wednesday, October 07, 2009 10:15 PM
To: java-user@lucene.apache.org
Subject: Re: 2.9: TopScoreDocCollector
Hi Eric,
Different Query classes have different options on whether they can
score documents out of order
Is it possible a large merge is running? By default IW.close waits
for outstanding merges to complete. Can you post the stacktrace?
Mike
On Thu, Oct 8, 2009 at 5:22 PM, Jamie Band wrote:
> Hi All
>
> I have a long running situation where our indexing thread is getting stuck
> indefinitely in IndexWriter's close method.
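(If a long merge turns out to be the cause, a sketch of the escape hatch, assuming an IndexWriter named writer; aborting merges loses only the merge work, not the index:)

// close() is equivalent to close(true) and blocks until merges finish;
// close(false) aborts any running background merges instead of waiting
writer.close(false);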
Did you do some extra locking around IndexWriter, using the IndexWriter
itself as a mutex (e.g. synchronized(writer) {...})? This is not supported
and hangs. IndexWriter itself is thread-safe.
Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
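(To illustrate the unsupported pattern Uwe describes, next to the supported one; a sketch assuming writer and doc exist:)

// unsupported -- external locking on the writer itself can deadlock
// against IndexWriter's internal synchronization:
synchronized (writer) {
  writer.addDocument(doc);
}

// supported -- IndexWriter is already thread-safe, so just call it
// directly from any number of threads:
writer.addDocument(doc);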
Hi All
I have a long running situation where our indexing thread is getting
stuck indefinitely in IndexWriter's close method. Yourkit shows the
thread to be stuck in TIME_WAITING. Any ideas on what could be causing
this?
Could it be one of the streams or readers we passed to the document?
I
Out of curiosity, and perhaps for practical purposes, how does one
handle mixed-language documents? I suppose one could extract the
words of a particular language and place them in a language-specific field?
Are there libraries to perform this (yet)?
On Thu, Oct 8, 2009 at 6:32 AM, Christian Reuschling wrote:
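(One way to wire up the language-specific-field idea with stock Lucene; a sketch where the field names are made up and GermanAnalyzer/FrenchAnalyzer come from the contrib analyzers:)

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_29));
// route each language's text into its own field, with its own stemmer
analyzer.addAnalyzer("body_de", new GermanAnalyzer());
analyzer.addAnalyzer("body_fr", new FrenchAnalyzer());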
Nigel wrote:
> Thanks, Mark. That makes sense. I guess if you do it in the right order,
> you're guaranteed to have the files in a consistent state, since the only
> thing that's actually overwritten is the segments.gen file at the end.
>
The main thing to do is to copy the segments_N files last.
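(In code, the copy order Mark describes might look like this; a sketch where copyFile is a hypothetical helper and dir is the live index's Directory:)

import java.util.ArrayList;
import java.util.List;

String[] files = dir.listAll(); // Directory.listAll() is new in 2.9
List<String> segmentFiles = new ArrayList<String>();
for (String name : files) {
  if (name.startsWith("segments")) {
    segmentFiles.add(name); // defer segments_N and segments.gen
  } else {
    copyFile(name); // hypothetical helper copying one file to the backup dir
  }
}
// only after everything else is safely copied, copy the segments files
for (String name : segmentFiles) {
  copyFile(name);
}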
Thanks, Mark. That makes sense. I guess if you do it in the right order,
you're guaranteed to have the files in a consistent state, since the only
thing that's actually overwritten is the segments.gen file at the end.
What about the technique of creating a copy of the directory with hard links
a
Erick Erickson wrote:
I suspect your problem here is the line:
document = indexReader.document( doc );
See the caution in the docs
You could try using lazy loading (so you don't load all
the terms of the document, just those you're interested
in). And I *think* (but it's been a while) that if t
I suspect your problem here is the line:
document = indexReader.document( doc );
See the caution in the docs
You could try using lazy loading (so you don't load all
the terms of the document, just those you're interested
in). And I *think* (but it's been a while) that if the terms
you load are in
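(The lazy loading Erick mentions goes through a FieldSelector; a sketch where the "title" field name is made up:)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;

// only the listed fields are loaded eagerly; the rest are skipped
FieldSelector selector = new MapFieldSelector(new String[] { "title" });
Document document = indexReader.document(doc, selector);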
Oops, forgot to include the class I mentioned. Here it is:
public class QueryTermBoostingQuery extends CustomScoreQuery {
private Map queryTermWeights;
private float bias;
private IndexReader indexReader;
public QueryTermBoostingQuery( Query q, Map termWeights,
IndexReader indexReader, float bias ) {
I am trying to come up with a performant query that will allow me to use a
custom score, where the custom score is a sum-product over a set of query-time
weights, and each weight gets applied only if the query-time term
exists in the document. So, for example, if I have a doc with three fields:
comp
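(A hedged sketch of that sum-product idea with 2.9's CustomScoreQuery; the class name is made up, and with 2.9's per-segment searching you must make sure the doc id and the reader you consult line up:)

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.function.CustomScoreQuery;

public class TermWeightedQuery extends CustomScoreQuery {
  private final Map<Term, Float> termWeights;
  private final IndexReader reader;

  public TermWeightedQuery(Query subQuery, Map<Term, Float> termWeights,
      IndexReader reader) {
    super(subQuery);
    this.termWeights = termWeights;
    this.reader = reader;
  }

  @Override
  public float customScore(int doc, float subQueryScore, float valSrcScore) {
    float sum = 0f;
    try {
      for (Map.Entry<Term, Float> e : termWeights.entrySet()) {
        // add the weight only if this document actually contains the term
        TermDocs td = reader.termDocs(e.getKey());
        if (td.skipTo(doc) && td.doc() == doc) {
          sum += e.getValue();
        }
        td.close();
      }
    } catch (IOException ioe) {
      throw new RuntimeException(ioe);
    }
    return subQueryScore * sum;
  }
}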
Hello, apologies for the typo before. I meant "Rain of Fire", but sleep
deprivation had gotten the better of me.
I've used Luke to get more details about the problem. Below, I've listed one
of the docs that I would expect to return a hit on a query of (text:"rain
fire").
stored/uncompressed,indexed
restoreState only restores the token contents, not the complete stream, so
you cannot roll back the token stream (and this was also not possible with
the old API). The while loop at the end of your code is not working as you
expect because of this. You may use CachingTokenFilter, which can be reset
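(A small sketch of the CachingTokenFilter route suggested above, assuming an analyzer and some text; TermAttribute is the 2.9-era attribute for the token text:)

import java.io.StringReader;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
CachingTokenFilter cached = new CachingTokenFilter(ts);
TermAttribute term = cached.addAttribute(TermAttribute.class);

// first pass: consuming the stream fills the cache
while (cached.incrementToken()) {
  System.out.println(term.term());
}

// reset() rewinds over the cached tokens, so a second pass works
cached.reset();
while (cached.incrementToken()) {
  // second pass: feed the same tokens downstream
}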
Hi all,
I'm trying to implement a "stop phrases filter" with the new TokenStream
API.
I would like to be able to peek N tokens ahead, see if the current
token plus the N subsequent tokens match a "stop phrase" (the set of stop
phrases is saved in a HashSet), then discard all these tokens when they match.
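(Under those constraints, a lookahead filter might be sketched like this with the new API's captureState/restoreState; the class is hypothetical, ignores position increments, and represents stop phrases as space-joined strings:)

import java.io.IOException;
import java.util.LinkedList;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.AttributeSource;

public final class StopPhraseFilter extends TokenFilter {
  private final Set<String> stopPhrases;   // phrases as space-joined strings
  private final int maxPhraseLength;       // longest stop phrase, in tokens
  private final LinkedList<AttributeSource.State> pending =
      new LinkedList<AttributeSource.State>();
  private final LinkedList<String> pendingTerms = new LinkedList<String>();
  private final TermAttribute termAtt;

  public StopPhraseFilter(TokenStream input, Set<String> stopPhrases,
      int maxPhraseLength) {
    super(input);
    this.stopPhrases = stopPhrases;
    this.maxPhraseLength = maxPhraseLength;
    this.termAtt = addAttribute(TermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    fill();
    while (!pending.isEmpty()) {
      // does any prefix of the lookahead window match a stop phrase?
      StringBuilder sb = new StringBuilder();
      int matchEnd = -1;
      for (int i = 0; i < pendingTerms.size(); i++) {
        if (i > 0) sb.append(' ');
        sb.append(pendingTerms.get(i));
        if (stopPhrases.contains(sb.toString())) {
          matchEnd = i;
        }
      }
      if (matchEnd >= 0) {
        // discard the whole matched phrase, refill, and look again
        for (int i = 0; i <= matchEnd; i++) {
          pending.removeFirst();
          pendingTerms.removeFirst();
        }
        fill();
      } else {
        // no stop phrase starts here: emit the first buffered token
        restoreState(pending.removeFirst());
        pendingTerms.removeFirst();
        return true;
      }
    }
    return false;
  }

  private void fill() throws IOException {
    // top up the lookahead window to maxPhraseLength buffered tokens
    while (pending.size() < maxPhraseLength && input.incrementToken()) {
      pending.add(captureState());
      pendingTerms.add(termAtt.term());
    }
  }
}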
Hi.
You may want to take a look at:
http://wordlist.sourceforge.net/
--
Nuno Seco
Christian Reuschling wrote:
Hi,
looking up the different terms with a common stem can be useful in different
scenarios - so I don't want to judge whether someone needs it or not.
E.g., in the case you have
Hi,
looking up the different terms with a common stem can be useful in different
scenarios - so I don't want to judge whether someone needs it or not.
E.g., in the case you have multilingual documents in your index, it is
straightforward to determine the language of the documents in order to
On Tue, Oct 06, 2009 at 07:51:44PM +0200, Karl Wettin wrote:
>
> 6 okt 2009 kl. 18.54 skrev David Causse:
>
> David, your timing couldn't be better. Just the other day I proposed
> that we deprecate InstantiatedIndexWriter. The sum of the reasons for
> this is that I'm a bit lazy. Your mail makes
Stemmers are heuristic transformations aimed at reducing the
vocabulary's dimensionality (and at other purposes I don't want to
discuss here). For accurate transformations one would use a
lemmatization engine (typically dictionary-driven) combined with
morphological analysis for ambiguity resolution.
Hi,
I had similar behaviour. On a self-built index of the German Wikipedia I
searched for the phrase "blaue blume" and got 2 results. When I searched
for +"blaue blume" "vogel" I got 59 results... strange.
I found out that when I create a plain BooleanQuery with just the phrase
"blaue blume" it gives
Could it be as simple as the fact that "Heart of Fire" != "Rain of
Fire"? Have you checked, with Luke for example, that the phrases
really are in the index?
Can't spot anything obviously wrong with the code. You could cut your
example code down to a minimal self-contained program that
demonstrates the problem.
Mehdi,
your requirements sound like they would mostly be fulfilled by Apache Solr,
which is a web-based packaging of Lucene.
paul.
On 08-Oct-09, at 10:11, Mehdi Ben Hamida wrote:
Hello,
I'm reviewing and doing some research on Lucene Java 2.9.0, to check if it
meets our needs.
Unfortunat
Hello,
I'm reviewing and doing some research on Lucene Java 2.9.0, to check if it
meets our needs.
Unfortunately I can't find answers to some of my questions, and I hope you
can answer them and provide any references that support your answers.
- Do you confirm that Lucene enables load t
Hi Max
just a guess: maybe you deleted all *.c source files in that area and
unintentionally deleted this index file, too.
Bernd
On Fri, Oct 2, 2009 at 17:10, Max Lynch wrote:
> I'm getting this error when I try to run my searcher and my indexer:
>
> Traceback (most recent call last):
> self.
Yes it does. Thanks for the tips. I'm going to do some experimenting, and see
if I can post some results here.
Regards,
Maarten
Jake Mannix wrote:
>
> Hi Maarten,
>
> Five minutes is not tremendously frequent, and I imagine it should be
> pretty fine, but again: it depends on how big your