Hi Karl,
I guess I must keep individual terms in my query, alongside the SHOULD
phrases with slops, since I don't want to miss results, even if the
distance between the terms is huge.
Slop - I'll enrich the phrases with it.
Shingles - good idea. I'll index bi-grams if performance becomes an issue.
I just realized this mail contained several incomplete sentences. I blame
the Norwegian beers. Please allow me to try once again:
The simplest solution is to make use of slop in PhraseQuery, SpanNearQuery,
etc. Also consider permutations of #isInOrder() with alternative query
boosts.
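The slop and in-order options mentioned above can be sketched against the Lucene 3.x-era API (the field and term names here are made up for illustration):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SlopExample {
    public static void main(String[] args) {
        // PhraseQuery with slop: "quick" and "fox" within 3 positions.
        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term("body", "quick"));
        pq.add(new Term("body", "fox"));
        pq.setSlop(3);

        // SpanNearQuery: same terms, slop 3, not required to be in order.
        SpanNearQuery snq = new SpanNearQuery(
            new SpanQuery[] {
                new SpanTermQuery(new Term("body", "quick")),
                new SpanTermQuery(new Term("body", "fox"))
            },
            3,      // slop
            false   // inOrder -- flip to true to require term order
        );
        // Boost ordered vs. unordered variants differently when combining them.
        snq.setBoost(2.0f);
    }
}
```

Combining an in-order variant with a higher boost and an unordered variant with a lower one is one way to realize the "permutations of #isInOrder()" idea.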
Even though slop will create a greater score the closer the terms are, it might
still in some cases (usually when combined with other subq
Isn't this approach somewhat bad for term frequency?
Words that appear in several languages would be a lot more frequent
(hence less significant).
I still prefer the split-field method with a proper query expansion.
This way, the term frequency is evaluated on the corpus of one lan
Because it does not find "junks" when you search for "junk".
Or... "chevaux" when you search for "cheval".
paul
On 19 Jan 2011, at 18:59, Luca Rondanini wrote:
> why not just using the StandardAnalyzer? it works pretty well even with
> Asian languages!
why not just use the StandardAnalyzer? It works pretty well even with
Asian languages!
On Wed, Jan 19, 2011 at 12:23 AM, Shai Erera wrote:
> If you index documents, each in a different language, but all its fields
> are
> of the same language, then what you can do is the following:
If you index documents, each in a different language, but all its fields are
of the same language, then what you can do is the following:
Create separate indexes per language
---
This will work and is not too hard to set up. Requires some mainten
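The separate-indexes-per-language setup can be sketched like this (3.x-era API; the directory paths and choice of analyzers are assumptions, not part of the original advice):

```java
import java.io.File;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PerLanguageIndexes {
    public static void main(String[] args) throws Exception {
        // One index directory and one analyzer per language.
        IndexWriter en = new IndexWriter(FSDirectory.open(new File("idx/en")),
            new StandardAnalyzer(Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);
        IndexWriter de = new IndexWriter(FSDirectory.open(new File("idx/de")),
            new GermanAnalyzer(Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);
        // ... route each document to the writer matching its language ...
        en.close();
        de.close();

        // To search across all languages, combine the per-language readers.
        IndexReader combined = new MultiReader(
            IndexReader.open(FSDirectory.open(new File("idx/en"))),
            IndexReader.open(FSDirectory.open(new File("idx/de"))));
        IndexSearcher searcher = new IndexSearcher(combined);
    }
}
```

A side benefit of this layout is the one raised later in the thread: term statistics are computed per language, so a word shared across languages is not artificially devalued.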
But for this, you need a skillfully designed:
- set of fields
- multiplexing analyzer
- query expansion
In one of my projects we do not split languages by field, and it's a pain...
I keep running into issues, in one direction or the other.
- the "die" example that Otis mentioned is a good one: stop-
I think we should be using Lucene with the Snowball jars, which means one
index for all languages (of course, the size of the index is always a
matter of concern).
Hope this helps.
-vinaya
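The Snowball setup vinaya mentions can be sketched like this (contrib analyzers, 3.x-era API; the stemmer name string must be one of the Snowball language names):

```java
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.util.Version;

public class SnowballExample {
    public static void main(String[] args) {
        // One stemming analyzer for the whole index;
        // "English" selects the Snowball English stemmer.
        SnowballAnalyzer analyzer =
            new SnowballAnalyzer(Version.LUCENE_30, "English");
    }
}
```

Note the trade-off raised elsewhere in the thread: a single stemmer cannot be right for every language in a mixed-language index.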
On Tuesday 18 January 2011 11:23 PM, Clemens Wyss wrote:
What is the "best practice" to support multiple languages, i
Hi
There are two types of multi-language docs:
1) Docs in different languages -- every document is one language
2) Each document has fields in different languages
I've dealt with both, and there are different solutions to each. Which of
them is yours?
Shai
On Tue, Jan 18, 2011 at 7:53 PM, Cleme
Hi Clemens,
If you will be searching individual languages, go with language-specific
indices. Wunder likes to give an example of "die" in German vs. English. :)
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
On Thu, 2010-07-15 at 20:53 +0200, Christopher Condit wrote:
[Toke: 140GB single segment is huge]
> Sorry - I wasn't clear here. The total index size ends up being 140GB
> but to try to help improve performance we build 50 separate indexes
> (which end up being a bit under 3GB each) and then ope
> [Toke: No frequent updates]
>
> So everything is rebuilt from scratch each time? Or do you mean that you're
> only adding new documents, not changing old ones?
Everything is reindexed from scratch - indexing speed is not essential to us...
> Either way, optimizing to a single 140GB segment is
On Wed, 2010-07-14 at 20:28 +0200, Christopher Condit wrote:
[Toke: No frequent updates]
> Correct - in fact there are no updates and no deletions. We index
> everything offline when necessary and just swap the new index in...
So everything is rebuilt from scratch each time? Or do you mean that
Glen, thank you for this very thorough and informative post.
Lance Norskog
There are a number of strategies, on the Java or OS side of things:
- Use huge pages[1]. Especially on 64-bit with lots of RAM. For
long-running, large-memory (and GC-busy) applications, this has achieved
significant improvements - like 300% on EJBs. See [2],[3],[4]. For a
great article introducing and benc
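A hedged sketch of the huge-pages setup referenced above (Linux; the page count and heap size are placeholders that must be sized for the actual machine):

```shell
# Reserve huge pages (2 MB pages here; 2048 pages = 4 GB). Requires root.
echo 2048 > /proc/sys/vm/nr_hugepages

# Then ask the JVM to back the heap with them.
java -XX:+UseLargePages -Xms4g -Xmx4g -jar indexer.jar
```

If the JVM cannot get the reserved pages (fragmentation, permissions), it typically falls back to normal pages, so verify with `/proc/meminfo` after startup.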
Hi Toke-
> > * 20 million documents [...]
> > * 140GB total index size
> > * Optimized into a single segment
>
> I take it that you do not have frequent updates? Have you tried to see if you
> can get by with more segments without significant slowdown?
Correct - in fact there are no updates and n
You can also set the termsIndexDivisor when opening the IndexReader.
The terms index is an in-memory data structure and it can consume a LOT
of RAM when your index has many unique terms.
Flex (only on Lucene's trunk / next major release (4.0)) has reduced
this RAM usage (as well as the RAM required
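For reference, a sketch of setting the divisor at open time (3.x-era API; a divisor of 4 loads only every 4th indexed term, cutting terms-index RAM roughly 4x at the cost of slower term lookups):

```java
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class DivisorExample {
    public static void main(String[] args) throws Exception {
        // Args: directory, deletionPolicy (null = default),
        //       readOnly=true, termInfosIndexDivisor=4.
        IndexReader reader = IndexReader.open(
            FSDirectory.open(new File("idx")), null, true, 4);
    }
}
```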
On Tue, 2010-07-13 at 23:49 +0200, Christopher Condit wrote:
> * 20 million documents [...]
> * 140GB total index size
> * Optimized into a single segment
I take it that you do not have frequent updates? Have you tried to see
if you can get by with more segments without significant slowdown?
> Th
On 13 July 2010, at 23:49, Christopher Condit wrote:
* are there performance optimizations that I haven't thought of?
The first and most important one I'd think of is to get rid of NFS.
You can happily do a local copy which might, even for 10 GB, take less
than 30 seconds at server start.
pa
Sent: Thursday, December 06, 2007 12:10 PM
To: java-user@lucene.apache.org
Subject: Re: best practices for reloading an index for a searcher
If by reload you mean closing and opening the reader, then yes. You need
to do this in order to see the changes since the *last* time you opened
the reader.
Think of it as the reader taking a snapshot of the index and using that
for its lifetime.
Be aware that opening a reader (and running the fi
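The close-and-reopen cycle described here is what the 2.9/3.x-era API wraps in IndexReader.reopen(); a sketch of the usual pattern:

```java
import org.apache.lucene.index.IndexReader;

public class ReloadExample {
    // Swap in a fresh reader only if the index actually changed.
    static IndexReader reload(IndexReader reader) throws Exception {
        IndexReader newReader = reader.reopen();
        if (newReader != reader) {
            reader.close();       // release the old snapshot
            reader = newReader;   // searches now see the latest commit
        }
        return reader;
    }
}
```

reopen() shares unchanged segments with the old reader, so it is much cheaper than a full close-and-open when only a few segments changed.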
: To: java-user@lucene.apache.org
: Subject: RE: best practices
:
: If that's it, that's fine. I guess I had in mind something else? For
: example, one of mine uses categories (something mentioned quite a bit),
: but it has some slight differences from what I've seen before. Items
Like I said, if this wiki is it, perfect! Maybe it is what I was
thinking of.
-Original Message-
From: Pasha Bizhan [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 17, 2006 11:38 AM
To: java-user@lucene.apache.org
Subject: RE: best practices
Hi,
> -Original Message-
> From: John Powers [mailto:[EMAIL PROTECTED]
>
> Is there any repository of best practices? Does LIA represent that?
> I was thinking about a blog or something that everyone could
> post their solutions into.
I think http://wiki.apache.org/jakarta-lucene/Ho
See the paper at: http://labs.google.com/papers/mapreduce.html
"MapReduce is a programming model and an associated implementation for
processing and generating large data sets. Users specify a map
function that processes a key/value pair to generate a set of
intermediate key/value pairs, and a re
I am thinking of having a cluster of one indexer and a few searchers (1
to n).
The indexer will consist of a number of stages as defined in SEDA. I
must still do this decomposition. The resulting index will be published
via message queue to the searchers, which will stop doing searches long
enough to upda
Paul Smith wrote:
I'm not sure how generic or Nutch-specific Doug and Mike's MapReduce
code is in Nutch, I haven't been paying close enough attention.
Me too.. :) I didn't even know Nutch was now fully in the ASF, and I'm
a Member... :-$
Let me pipe in on behalf of the Nutch project... T
On 15/07/2005, at 3:57 PM, Otis Gospodnetic wrote:
The problem that I saw (from your email only) with the "ship the full
little index to the Queen" approach is that, from what I understand,
you eventually do addIndexes(Directory[]) in there, and as this
optimizes things in the end, this means y
> an insignificant time. You also have to use bookkeeping to work out
> if a 'job' has not been completed in time (maybe failure by the
> worker) and decide whether the job should be resubmi
On Jul 14, 2005, at 9:45 PM, Paul Smith wrote:
Cool, I should go have a look at that... That begs another
question though: where does Nutch stand in terms of the ASF? Did I
read (or dream) that Nutch may be coming in under the ASF? I guess I
should get myself subscribed to the Nutch mailing
edu/~mdw/proj/seda/
I am just reading up on it now. Does anyone have experience building a
lucene system based on this architecture? Any advice would be greatly
appreciated.
Peter Gelderbloem
Registered in England 3186704
-----Original Message-----
From: Luke Francl [mailto:[EMAIL PROTECTED]
Sent: 13 May 2005 22:04
To: java-user@lucene.apache.org
Subject: Re: Best Practices for Distributing Lucene Indexing and Searching
On Tue, 2005-03-01 at 19:23, Chris Hostetter wrote:
> I don't really consider reading/writing to an NFS mounted FSDirectory to
> be viable for the very reasons you listed; but I haven't really found any
> evidence of problems if you take the approach that a single "writer"
> node indexes to local
Yonik Seeley wrote:
I'm trying to support an interface where documents can be added one at
a time at a high rate (via HTTP POST). You don't know all of the
documents ahead of time, so you can't delete them all ahead of time.
A simple solution is to queue documents as they're posted. When either
I'm trying to support an interface where documents can be added one at
a time at a high rate (via HTTP POST). You don't know all of the
documents ahead of time, so you can't delete them all ahead of time.
Given this constraint, it seems like you can do one of two things:
1) collect all the docume
Yonik Seeley wrote:
This strategy looks very promising.
One drawback is that documents must be added directly to the main
index for this to be efficient. This is a bit of a problem if there
is a document uniqueness requirement (a unique id field).
This is easy to do with a single index. Here's th
This strategy looks very promising.
One drawback is that documents must be added directly to the main
index for this to be efficient. This is a bit of a problem if there
is a document uniqueness requirement (a unique id field).
If one takes the approach of adding docs to a separate lucene index
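The unique-id requirement discussed above is what later Lucene releases (2.1+) addressed with IndexWriter.updateDocument, which deletes any document matching the id term and adds the new one in a single call; a sketch (the field name "id" is an assumption):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UniqueIdAdd {
    // Delete-then-add keyed on the "id" field, in one writer call.
    static void addOrReplace(IndexWriter writer, Document doc, String id)
            throws Exception {
        doc.add(new Field("id", id,
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.updateDocument(new Term("id", id), doc);
    }
}
```

The id field is indexed un-analyzed so the delete term matches exactly what was indexed.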