-Original Message-
From: Anshum [mailto:ansh...@gmail.com]
Sent: Wednesday, August 11, 2010 10:38 AM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs
So, you didn't really use the setRamBuffer.. ?
Any reasons for that?
--
Anshum Gupta
http://ai-cafe.blogspot.com
On Wed, Aug 11, 2010 at 10:28 AM, Shelly_Singh wrote:
> My final settings are:
> 1. 1.5 gig RAM to the JVM out of 2GB available for my...
> -Original Message-
> From: Pablo Mendes [mailto:pablomen...@gmail.com]
> Sent: Tuesday, August 10, 2010 7:22 PM
> To: java-user@lucene.apache.org
> Subject: Re: Scaling Lucene to 1bln docs
>
> Shelly,
> Do you mind sharing with the list the final settings you used for your best
> results?
> ...2nd Ed. It'll help you get a hang of a lot of things! :)
>
> --
> Anshum
> http://blog.anshumgupta.net
>
> Sent from BlackBerry®
>
> -Original Message-
> From: Shelly_Singh
> Date: Tue, 10 Aug 2010 19:11:11
> To: java-user@lucene.apache.org
> Reply-To: java-user@lucene.apache.org
> Subject: RE: Scaling Lucene to 1bln docs

-Original Message-
From: Shelly_Singh [mailto:shelly_si...@infosys.com]
Date: Tue, 10 Aug 2010 19:11:11
To: java-user@lucene.apache.org
Reply-To: java-user@lucene.apache.org
Subject: RE: Scaling Lucene to 1bln docs
Hi folks,
Thanks for the excellent support and guidance on my very first day on this
mailing list...
At the end of the day, I have very optimistic results. 100bln search in less tha...
-Original Message-
From: Danil ŢORIN [mailto:torin...@gmail.com]
Sent: Tuesday, August 10, 2010 6:52 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs

That won't work... if you'll have something like "A Basic Crazy
Document E-something F-something G-something...you get the point" it
will go to all shards, so the whole point of sharding will be
compromised.
...index the document into multiple indices (one for each token). So,
the same document may be indexed into Shard "A", "M", "N" and "D".
I am not able to think of another option.

Comments welcome.

-Original Message-
From: Danil ŢORIN [mailto:torin...@gmail.com]
Sent: Tuesday, August 10, 2010 6:11 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs

I'd second t...
-Original Message-
From: Dan OConnor [mailto:docon...@acquiremedia.com]
Sent: Tuesday, August 10, 2010 6:02 PM
To: java-user@lucene.apache.org
Subject: RE: Scaling Lucene to 1bln docs

Shelly:

You wouldn't necessarily have to use a multisearcher. A suggested alternative
is:
- shard into 10 indices, search each shard separately, and merge the
per-shard results with an efficient merging algorithm.
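[A rough sketch of the shard-and-merge alternative described above, assuming
Lucene 3.x-era classes; the shard paths, the query, and the naive
merge-by-raw-score policy are illustrative, not code from the thread:]

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    import java.io.File;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    public class ShardedSearch {
        public static void main(String[] args) throws Exception {
            // Hypothetical shard locations; the thread suggests ~10 shards.
            String[] shardPaths = { "/idx/shard0", "/idx/shard1" };
            Query query = new TermQuery(new Term("text", "lucene"));

            // Search each shard independently; this loop is easy to run
            // with one thread per shard instead.
            List<ScoreDoc> all = new ArrayList<ScoreDoc>();
            for (String path : shardPaths) {
                IndexReader reader =
                        IndexReader.open(FSDirectory.open(new File(path)));
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs top = searcher.search(query, 10);
                all.addAll(Arrays.asList(top.scoreDocs));
                searcher.close();
                reader.close();
            }

            // Merge step: keep the globally best 10 by raw score. Scores
            // from different shards are only roughly comparable (per-shard
            // idf skew), the caveat raised elsewhere in this thread.
            Collections.sort(all, new Comparator<ScoreDoc>() {
                public int compare(ScoreDoc a, ScoreDoc b) {
                    return Float.compare(b.score, a.score);
                }
            });
            List<ScoreDoc> best = all.subList(0, Math.min(10, all.size()));
            System.out.println("merged hits kept: " + best.size());
        }
    }

(A real merger would also track which shard each ScoreDoc came from, since
doc ids are shard-local.)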
-Original Message-
From: Anshum [mailto:ansh...@gmail.com]
Sent: Tuesday, August 10, 2010 5:59 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs

Searching on all indices shouldn't be that bad an idea instead of searching
a single huge index, especially considering you have a constraint on the
usable memory.
-Original Message-
From: Shelly_Singh [mailto:shelly_si...@infosys.com]
Sent: Tuesday, August 10, 2010 8:20 AM
To: java-user@lucene.apache.org
Subject: RE: Scaling Lucene to 1bln docs

No sort. I will need relevance based on TF. If I shard, I will have to search
in all indices.
-Original Message-
From: anshum.gu...@naukri.com [mailto:ansh...@gmail.com]
Sent: Tuesday, August 10, 2010 1:54 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs

Would like to know, are you using a particular type of sort? Do you need to...
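[Since the requirement above is relevance based on TF only, one way to
express that is a minimal sketch against the 3.x Similarity API; flattening
idf like this is an assumption, not something prescribed in the thread:]

    import org.apache.lucene.search.DefaultSimilarity;

    // Rank purely by term frequency: flatten idf so rare terms get no extra
    // weight. Install via writer.setSimilarity(...) and
    // searcher.setSimilarity(...) so index-time norms and query-time
    // scoring agree.
    public class TfOnlySimilarity extends DefaultSimilarity {
        @Override
        public float idf(int docFreq, int numDocs) {
            return 1.0f;
        }
    }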
> ...me, but the search
> time is highly unacceptable.
>
> Help again.
>
> -Original Message-
> From: Anshum [mailto:ansh...@gmail.com]
> Sent: Tuesday, August 10, 2010 12:55 PM
> To: java-user@lucene.apache.org
> Subject: Re: Scaling Lucene to 1bln docs
-Original Message-
From: Shelly_Singh [mailto:shelly_si...@infosys.com]
Date: Tue, 10 Aug 2010 13:31:38
To: java-user@lucene.apache.org
Reply-To: java-user@lucene.apache.org
Subject: RE: Scaling Lucene to 1bln docs

Hi Anshum,

I am already running with the 'setCompoundFile' option off.
And thanks for pointing out mergeFactor. I had tried a higher mergeFactor
coup...
...and use a multisearcher for
searching. Will that help?

-Original Message-
From: Danil ŢORIN [mailto:torin...@gmail.com]
Sent: Tuesday, August 10, 2010 1:06 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs

The problem actually won't be the indexing part.
Searching such a large dataset will require a LOT of memory.
If you'll need sorting or faceting on one of the fields, the JVM will explode ;)
Also GC times on a large JVM heap are pretty disturbing (if you care
about your search performance).
So I'd advise...

-Original Message-
From: Anshum [mailto:ansh...@gmail.com]
Sent: Tuesday, August 10, 2010 12:55 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs

Hi Shelly,
That seems like a reasonable data set size. I'd suggest you increase your
mergeFactor, as a mergeFactor of 10 says you are only buffering 10 docs in
memory before writing it to a file (and incurring I/O). You could actually
flush by RAM usage instead of a doc count. Turn off using the compound file...
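[A minimal sketch of the settings being discussed, against the Lucene
3.0-era IndexWriter API; the analyzer, path, and the 48 MB / mergeFactor 30
figures are illustrative choices, not values from the thread:]

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    import java.io.File;

    public class WriterTuning {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/path/to/index")),
                    new StandardAnalyzer(Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            // Flush by RAM usage instead of a fixed document count.
            writer.setRAMBufferSizeMB(48);
            writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);

            // Strictly speaking, mergeFactor controls how many segments
            // accumulate before a merge (the in-memory doc buffer is
            // maxBufferedDocs / the RAM buffer); raising it speeds
            // indexing at some cost to search.
            writer.setMergeFactor(30);

            // Non-compound files: more file handles, less indexing overhead.
            writer.setUseCompoundFile(false);

            writer.close();
        }
    }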
Hi,
I am developing an application which uses Lucene for indexing and searching 1
bln documents. (the document size is very small though. Each document has a
single field of 5-10 words; so I believe that my data size is within the tested
limits).
I am using the following configuration:
1. ...
Subject: Re: Lucene's scaling performance and relevance tuning
To: java-user@lucene.apache.org
Date: Tuesday, July 20, 2010, 2:16 PM

are there such events in Russia?

Best Regards
Alexander Aristov

On 20 July 2010 17:59, Ivan Provalov wrote:
We are organizing a meetup in michigan on IR. The first meeting is on august
19. We will be talking about lucene's scalability and relevance tuning
followed by a discussion. Feel free to sign up:
http://www.meetup.com/Michigan-Information-Retrieval-Enthusiasts-Group
Thanks,
Ivan Provalov
...cores.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
> From: Marcus Herou
> To: solr-u...@lucene.apache.org; java-user@lucene.apache.org
> Sent: Wednesday, July 1, 2009 10:31:28 AM
> Subject: Re: Scaling out/up or a mix
Hi, agree that faceting might be the thing that defines this app. The app is
mostly snappy during daytime since we optimize the index around 7.00 GMT.
However, faceting is never snappy.
We sped things up a whole bunch by creating various "less cardinal"
fields from the originating publishedDate...
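[A sketch of that "less cardinal" trick: derive a few coarse,
low-cardinality fields from the full timestamp at index time so facet and
range queries touch far fewer unique terms. Field names and granularities
here are made up for illustration (Lucene 3.x-era API):]

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class CoarseDateFields {
        static void addDateFields(Document doc, Date publishedDate) {
            // One term per year/month/day instead of one per millisecond.
            doc.add(new Field("publishedYear",
                    new SimpleDateFormat("yyyy").format(publishedDate),
                    Field.Store.NO, Field.Index.NOT_ANALYZED));
            doc.add(new Field("publishedMonth",
                    new SimpleDateFormat("yyyyMM").format(publishedDate),
                    Field.Store.NO, Field.Index.NOT_ANALYZED));
            doc.add(new Field("publishedDay",
                    new SimpleDateFormat("yyyyMMdd").format(publishedDate),
                    Field.Store.NO, Field.Index.NOT_ANALYZED));
        }
    }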
Hi, like the sound of this.
What I am not familiar with in terms of Lucene is how the index gets
swapped in and out of memory. When it comes to database tables (non-
partitionable tables at least) I know that one should have enough memory to
fit the entire index into memory to avoid file-sorts for...
Hi.
The number of concurrent users today is insignificant but once we push for
the service we will get into trouble... I know that since even one simple
faceting query (which we will use to display trend graphs) can take forever
(talking about Solr, btw). "Normal" Lucene queries (title:blah OR
desc...
I have improved date-sorted searching performance pretty dramatically by
replacing the two step "search then sort" operation with a one step "use the
date as the score" algorithm. The main gotcha was making sure to not affect
which results get counted as hits in boolean searches, but overall I only...
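[The poster doesn't show code, but one way to get a one-pass "date as
score" query in the Lucene of that era is to let the user query act only as
a filter (so it cannot change scores or hit membership) and take the score
from an indexed integer date field; the field name is an assumption:]

    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.function.FieldScoreQuery;

    import java.io.IOException;

    public class DateAsScore {
        // "dateScore" is assumed to be an un-analyzed int field (e.g. days
        // since epoch) present on every document.
        static TopDocs newestFirst(Searcher searcher, Query userQuery, int n)
                throws IOException {
            // The user query decides membership only; a filter never
            // contributes to the score.
            Filter matches = new QueryWrapperFilter(userQuery);
            Query dateAsScore =
                    new FieldScoreQuery("dateScore", FieldScoreQuery.Type.INT);
            return searcher.search(dateAsScore, matches, n);
        }
    }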
On Tue, 2009-06-30 at 11:29 +0200, Uwe Schindler wrote:
> So the simple answer is always:
> If 64 bit platform with lots of RAM, use MMapDirectory.
Fair enough. That makes the RAM-focused solution much more scalable.
My point still stands though, as Marcus is currently examining his
hardware options...
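[For reference, a minimal sketch of the two options (3.x-era API; the path
is illustrative):]

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.MMapDirectory;

    import java.io.File;

    public class DirectoryChoice {
        public static void main(String[] args) throws Exception {
            File path = new File("/path/to/index");

            // Explicit memory-mapping: best on 64-bit JVMs with plenty of
            // virtual address space.
            Directory mmapped = new MMapDirectory(path);

            // Or let Lucene pick a platform-appropriate implementation
            // (newer 3.x releases choose MMapDirectory on 64-bit JVMs).
            Directory auto = FSDirectory.open(path);

            mmapped.close();
            auto.close();
        }
    }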
Question:
Based on your findings, what is the most challenging part to tune? Sorting
or querying or what else?

//Marcus

> - Original Message
> > From: Marcus Herou
> > To: java-user@lucene.apache.org
> > Sent: Monday, 29 June, 2009 9:47:13
On Sat, 2009-06-27 at 00:00 +0200, Marcus Herou wrote:
> We currently have about 90M documents and it is increasing rapidly so
> getting into the G+ document range is not going to be too far away.
We've performed fairly extensive tests regarding hardware for searches
and some minor tests on hardware...
From: Marcus Herou
To: java-user@lucene.apache.org
Sent: Monday, 29 June, 2009 9:47:13
Subject: Re: Scaling out/up or a mix

Thanks for the answer.

Don't you think that part 1 of the email would give you a hint of the nature of
the index?
Index size (and growing): 16Gx8 = 128G
Doc size (data): 20k
Num docs: 90M
Num users: Few hundred, but most critical is the admin staff, which is
using the index all day long.
Query types: Example: title:"Iphone" OR description:"I...
There is no single answer -- this is always application specific.
Without knowing anything about what you are doing:
1. disk i/o is probably the most critical. Go SSD or even RAM disk if
you can, if performance is absolutely critical
2. Sometimes CPU can become an issue, but 8 cores is probably...
Hi. I think I need to be more specific.
What I am trying to find out is if I should aim for:
CPU (2x4 cores, 2.0-3.0GHz)? Or perhaps just 4 cores is enough.
Fast disk IO: 8 disks, RAID1+0? Or perhaps 2 disks is enough...
RAM - if the index does not fit into RAM, how much RAM should I then buy?
Hi.
I currently have an index which is 16GB per machine (8 machines = 128GB)
(data is stored externally, not in index) and is growing like crazy (we are
indexing blogs which is crazy by nature) and have only allocated 2GB per
machine to the Lucene app since we are running some other stuff there in...
...identical top results to
the global idf policy for the vast majority of searches.

Cheers
Mark

- Original Message
From: Karl Wettin <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 18 July, 2008 2:33:29 PM
Subject: Re: Scaling
On 18 Jul 2008, at 09:49, Eric Bowman wrote:
One thing I have trouble understanding is how scoring works in this
case. Does Lucene really "just work", or are there special things
we have to do to make sure that the scores are coherent so we can
actually decide which was the best match? What...
The scaling per machine should be linear. The overhead from the network is
minimal because the Lucene object sizes are not impacting. Google mentions
in one of their early white papers on scaling
http://labs.google.com/papers/googlecluster-ieee.pdf that they have sub
indexes which are now...
Is there some sort of a scaling strategies listing available? I think
there is a Wiki page missing.
What are the typical problems I'll encounter when distributing the
search over multiple machines?
Do people split up their index per node or do they use the complete
index and res...
Subject: Stable score scaling; LSI again
Hi, I have a couple of questions about how to alter the similarity scores.
I need scores that can be thresholded, and whose thresholds remain stable
even when I add documents to the IndexWriter; i.e., identity should be a
fixed value such as 1.0. I know that for efficiency reasons, Lucene
doesn't do...
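[A commonly suggested, admittedly rough workaround is to rescale each
result list by its own top score so the best hit is pinned at 1.0; a sketch
follows. This keeps a threshold usable as documents are added, though
Lucene itself makes no guarantee that such scores are comparable across
queries:]

    import org.apache.lucene.search.TopDocs;

    public class NormalizedScores {
        static float[] normalize(TopDocs top) {
            float max = top.getMaxScore();
            float[] out = new float[top.scoreDocs.length];
            for (int i = 0; i < out.length; i++) {
                // Guard against empty or zero-score result sets.
                out[i] = max > 0 ? top.scoreDocs[i].score / max : 0f;
            }
            return out;
        }
    }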
I'm new to Lucene and am interested in learning how enterprises deploy
multi-server installations of Lucene for large 24x7 operations.
The first question that comes to mind is: are most of the design decisions made
during development time, or can a simple server be 'grown into' something
as your...

...8 CPU/64GB ones) and well over 1B docs.
Otis
--
Lucene Consulting -- http://lucene-consulting.com/
- Original Message
From: muraalee <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Saturday, July 7, 2007 2:25:06 AM
Subject: Scaling Lucene to 500 million+ documents - preferred architecture
>> ...currently running a Tomcat web application serving searches
>> over our Lucene index (10GB) on a single server machine (Dual 3GHz
>> CPU, 4GB RAM). Due to performance issues and to scale up to handle
>> more traffic/search requests, we are getting another server machine.
...separate brokers to merge these results.
Did anybody do something like this in the past? Appreciate it if you guys can
share your experiences.
thanks
Murali V
--
View this message in context:
http://www.nabble.com/Scaling-Lucene-to-500-million%2B-documents---preferred-architecture-tf4040120.html#a1
Basically you need to separate your web app from your searching, for a
scalable solution. Searching is a different concern. You can develop more
kinds of search when new requirements come in.
Technorati's way is very similar to one of DBSight's configurations. One
machine is dedicated for indexing...
We are looking at two ways of scaling:
(1) duplicating the web application and index on the second machine
and load-balancing incoming users between the two servers.
(2) modifying our web application so that one machine will host our
web application (and associated MySQL database), while the other one
will host the...
Samuel LEMOINE wrote:
> I'm acutely interested in this issue too, as I'm working on
> distributed architecture of Lucene. I'm only at the very beginning of
> my study so that I can't help you much, but Hadoop could maybe fit
> your requirements. It's a sub-project of Lucene aiming to paralle...
I'm hoping I'm doing something wrong, because I've been impressed with
Lucene so far. The basic problem I'm seeing is that when I run the same
search several times against box A (with 1 RemoteSearchable), I see X
for an average search response time. When I run the same search several
times against...