Re: Scaling Lucene to 1bln docs

2010-08-16 Thread Danil ŢORIN
m [mailto:ansh...@gmail.com] > Sent: Wednesday, August 11, 2010 10:38 AM > To: java-user@lucene.apache.org > Subject: Re: Scaling Lucene to 1bln docs > > So, you didn't really use the setRamBuffer.. ? > Any reasons for that? > > -- > Anshum Gupta > http://ai-cafe.blogspot.c

RE: Scaling Lucene to 1bln docs

2010-08-16 Thread Shelly_Singh
nfosys.com Phone: (M) 91 992 369 7200, (VoIP)2022978622 -Original Message- From: Anshum [mailto:ansh...@gmail.com] Sent: Wednesday, August 11, 2010 10:38 AM To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs So, you didn't really use the setRamBuffer.. ? Any r

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
: Scaling Lucene to 1bln docs So, you didn't really use the setRamBuffer.. ? Any reasons for that? -- Anshum Gupta http://ai-cafe.blogspot.com On Wed, Aug 11, 2010 at 10:28 AM, Shelly_Singh wrote: > My final settings are: > 1. 1.5 gig RAM to the jvm out of 2GB available for my

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Anshum
- > From: Pablo Mendes [mailto:pablomen...@gmail.com] > Sent: Tuesday, August 10, 2010 7:22 PM > To: java-user@lucene.apache.org > Subject: Re: Scaling Lucene to 1bln docs > > Shelly, > Do you mind sharing with the list the final settings you used for your best > results?

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
compare with regular docs. -Original Message- From: Pablo Mendes [mailto:pablomen...@gmail.com] Sent: Tuesday, August 10, 2010 7:22 PM To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs Shelly, Do you mind sharing with the list the final settings you used for your best

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Pablo Mendes
2nd Ed. It'll help you get a hang of a lot of things! :) > > -- > Anshum > http://blog.anshumgupta.net > > Sent from BlackBerry® > > -Original Message- > From: Shelly_Singh > Date: Tue, 10 Aug 2010 19:11:11 > To: java-user@lucene.apache.org > Reply

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread anshum.gu...@naukri.com
2010 19:11:11 To: java-user@lucene.apache.org Reply-To: java-user@lucene.apache.org Subject: RE: Scaling Lucene to 1bln docs Hi folks, Thanks for the excellent support and guidance on my very first day on this mailing list... At the end of the day, I have very optimistic results. 100bln search in less tha

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs That won't work...if you'll have something like "A Basic Crazy Document E-something F-something G-somethingyou get the point" it will go to all shards so the whole point of shards will be compromi

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
anil ŢORIN [mailto:torin...@gmail.com] Sent: Tuesday, August 10, 2010 6:52 PM To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs That won't work...if you'll have something like "A Basic Crazy Document E-something F-something G-somethingyou get the point" i

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Danil ŢORIN
of another option. > > Comments welcome. > > > -Original Message- > From: Danil ŢORIN [mailto:torin...@gmail.com] > Sent: Tuesday, August 10, 2010 6:11 PM > To: java-user@lucene.apache.org > Subject: Re: Scaling Lucene to 1bln docs > > I'd second t

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
le indices (one for each token). So, the same document may be indexed into Shard "A", "M", "N" and "D". I am not able to think of another option. Comments welcome. -Original Message- From: Danil ŢORIN [mailto:torin...@gmail.com] Sent: Tuesday, August
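
A rough sketch of the fan-out that the one-index-per-token idea above implies. It assumes, purely for illustration, that shards are keyed by the uppercase first letter of each token; `shards_for` is a made-up helper, not something from Lucene or from the thread.

```python
def shards_for(document: str) -> set:
    """Return the shard keys a document would be indexed into.

    Assumption for illustration only: one shard per initial letter, and a
    document is written to the shard of every token it contains.
    """
    return {token[0].upper() for token in document.split() if token[0].isalnum()}


doc = "A Basic Crazy Document"
print(sorted(shards_for(doc)))  # ['A', 'B', 'C', 'D']
```

With 5-10 words per document, most documents would be stored in several shards at once, which is the concern raised in the replies.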

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread prashant ullegaddi
efficient merging algorithm. > > -Original Message- > From: Dan OConnor [mailto:docon...@acquiremedia.com] > Sent: Tuesday, August 10, 2010 6:02 PM > To: java-user@lucene.apache.org > Subject: RE: Scaling Lucene to 1bln docs > > Shelly: > > You wouldn't

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Danil ŢORIN
sage- > From: Shelly_Singh [mailto:shelly_si...@infosys.com] > Sent: Tuesday, August 10, 2010 8:20 AM > To: java-user@lucene.apache.org > Subject: RE: Scaling Lucene to 1bln docs > > No sort. I will need relevance based on TF. If I shard, I will have to search > in all indi

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
. -Original Message- From: Dan OConnor [mailto:docon...@acquiremedia.com] Sent: Tuesday, August 10, 2010 6:02 PM To: java-user@lucene.apache.org Subject: RE: Scaling Lucene to 1bln docs Shelly: You wouldn't necessarily have to use a multisearcher. A suggested alternative is: - shard into 10 in

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
...@gmail.com] Sent: Tuesday, August 10, 2010 5:59 PM To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs Searching on all indices shouldn't be that bad an idea instead of searching a single huge index, especially considering you have a constraint on the usable memory

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Dan OConnor
ly_si...@infosys.com] Sent: Tuesday, August 10, 2010 8:20 AM To: java-user@lucene.apache.org Subject: RE: Scaling Lucene to 1bln docs No sort. I will need relevance based on TF. If I shard, I will have to search in all indices. -Original Message- From: anshum.gu...@naukri.com [mailto:ansh...@gmai

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Anshum
> -Original Message- > From: anshum.gu...@naukri.com [mailto:ansh...@gmail.com] > Sent: Tuesday, August 10, 2010 1:54 PM > To: java-user@lucene.apache.org > Subject: Re: Scaling Lucene to 1bln docs > > Would like to know, are you using a particular type of sort? Do you need to

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
No sort. I will need relevance based on TF. If I shard, I will have to search in all indices. -Original Message- From: anshum.gu...@naukri.com [mailto:ansh...@gmail.com] Sent: Tuesday, August 10, 2010 1:54 PM To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs Would

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread findbestopensource
me but the search > time is highly unacceptable. > > Help again. > > -Original Message- > From: Anshum [mailto:ansh...@gmail.com] > Sent: Tuesday, August 10, 2010 12:55 PM > To: java-user@lucene.apache.org > Subject: Re: Scaling Lucene to 1bln docs > > Hi Shelly, > That se

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Michael McCandless
o: java-user@lucene.apache.org > Reply-To: java-user@lucene.apache.org > Subject: RE: Scaling Lucene to 1bln docs > > Hi Anshum, > > I am already running with the 'setCompoundFile' option off. > And thanks for pointing out mergeFactor. I had tried a higher mergeFa

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread anshum.gu...@naukri.com
, 10 Aug 2010 13:31:38 To: java-user@lucene.apache.org Reply-To: java-user@lucene.apache.org Subject: RE: Scaling Lucene to 1bln docs Hi Anshum, I am already running with the 'setCompoundFile' option off. And thanks for pointing out mergeFactor. I had tried a higher mergeFactor coup

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
multisearcher for searching. Will that help? -Original Message- From: Danil ŢORIN [mailto:torin...@gmail.com] Sent: Tuesday, August 10, 2010 1:06 PM To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs The problem actually won't be the indexing part. Searching such

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs Hi Shelly, That seems like a reasonable data set size. I'd suggest you increase your mergeFactor as a mergeFactor of 10 says, you are only buffering 10 docs in memory before writing it to a file (and incurring I/O). You could actual

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Danil ŢORIN
The problem actually won't be the indexing part. Searching such a large dataset will require a LOT of memory. If you need sorting or faceting on one of the fields, the JVM will explode ;) Also, GC times on a large JVM heap are pretty disturbing (if you care about your search performance). So I'd advise

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Anshum
Hi Shelly, That seems like a reasonable data set size. I'd suggest you increase your mergeFactor as a mergeFactor of 10 says, you are only buffering 10 docs in memory before writing it to a file (and incurring I/O). You could actually flush by RAM usage instead of a Doc count. Turn off using the Co
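
To keep the knobs in this advice apart: as far as I know, in the Lucene 2.x/3.x `IndexWriter` the flush trigger is set with `setMaxBufferedDocs` or `setRAMBufferSizeMB`, while `setMergeFactor` controls how many similarly sized segments accumulate before they are merged. The sketch below is a deliberately simplified model of that log-style merging, not Lucene's actual merge policy code; `segments_after` is a hypothetical helper.

```python
def segments_after(flushes: int, merge_factor: int) -> int:
    """Count segments on disk after a number of flushes.

    Simplified model (an assumption, not Lucene's real policy): every flush
    writes one level-0 segment, and whenever `merge_factor` segments exist
    at one level they are merged into a single segment one level up.
    """
    levels = []  # levels[i] = number of segments currently at level i
    for _ in range(flushes):
        if not levels:
            levels.append(0)
        levels[0] += 1
        level = 0
        while levels[level] == merge_factor:
            levels[level] = 0
            if level + 1 == len(levels):
                levels.append(0)
            levels[level + 1] += 1
            level += 1
    return sum(levels)


for mf in (10, 100):
    print(mf, segments_after(999, mf))
```

In this model a higher merge factor means fewer merges while indexing but more segments left for a search to visit, which is the trade-off the thread is circling around.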

Scaling Lucene to 1bln docs

2010-08-09 Thread Shelly_Singh
Hi, I am developing an application which uses Lucene for indexing and searching 1 bln documents. (the document size is very small though. Each document has a single field of 5-10 words; so I believe that my data size is within the tested limits). I am using the following configuration: 1.

Re: IR meetup in Michigan - lucene's scaling performance and relevance tuning

2010-07-21 Thread Ivan Provalov
cene's scaling performance and > relevance tuning > To: java-user@lucene.apache.org > Date: Tuesday, July 20, 2010, 2:16 PM > Are there such events in Russia? > > Best Regards > Alexander Aristov > > > On 20 July 2010 17:59, Ivan Provalov > wrote: >

Re: IR meetup in Michigan - lucene's scaling performance and relevance tuning

2010-07-20 Thread Alexander Aristov
Are there such events in Russia? Best Regards Alexander Aristov On 20 July 2010 17:59, Ivan Provalov wrote: > We are organizing a meetup in Michigan on IR. The first meeting is on > August 19. We will be talking about Lucene's scalability and relevance > tuning followed by a discussion. Feel

IR meetup in Michigan - lucene's scaling performance and relevance tuning

2010-07-20 Thread Ivan Provalov
We are organizing a meetup in Michigan on IR. The first meeting is on August 19. We will be talking about Lucene's scalability and relevance tuning followed by a discussion. Feel free to sign up: http://www.meetup.com/Michigan-Information-Retrieval-Enthusiasts-Group Thanks, Ivan Provalov

Re: Scaling out/up or a mix

2009-07-02 Thread Otis Gospodnetic
cores. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Marcus Herou > To: solr-u...@lucene.apache.org; java-user@lucene.apache.org > Sent: Wednesday, July 1, 2009 10:31:28 AM > Subject: Re: Scaling out/up or a mix > > Hi ag

Re: Scaling out/up or a mix

2009-07-01 Thread Marcus Herou
Hi, I agree that faceting might be the thing that defines this app. The app is mostly snappy during daytime since we optimize the index around 7.00 GMT. However, faceting is never snappy. We sped things up a whole bunch by creating various "less cardinal" fields from the originating publishedDate w

Re: Scaling out/up or a mix

2009-07-01 Thread Toke Eskildsen
On Tue, 2009-06-30 at 22:59 +0200, Marcus Herou wrote: > The number of concurrent users today is insignificant but once we push > for the service we will get into trouble... I know that since even one > simple faceting query (which we will use to display trend graphs) can > take forever (talking abo

Re: Scaling out/up or a mix

2009-06-30 Thread Marcus Herou
Hi, like the sound of this. What I am not familiar with in terms of Lucene is how the index gets swapped in and out of memory. When it comes to database tables (non partitionable tables at least) I know that one should have enough memory to fit the entire index into memory to avoid file-sorts for

Re: Scaling out/up or a mix

2009-06-30 Thread Marcus Herou
Hi. The number of concurrent users today is insignificant but once we push for the service we will get into trouble... I know that since even one simple faceting query (which we will use to display trend graphs) can take forever (talking about SOLR btw). "Normal" Lucene queries (title:blah OR desc

Re: Scaling out/up or a mix

2009-06-30 Thread Andy Goodell
I have improved date-sorted searching performance pretty dramatically by replacing the two step "search then sort" operation with a one step "use the date as the score" algorithm. The main gotcha was making sure to not affect which results get counted as hits in boolean searches, but overall I onl
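
A minimal sketch of the difference being described, in plain Python rather than Lucene: instead of collecting every hit and sorting by date afterwards, each hit's date is treated as its score and only the top k are kept. The `hits` tuples and both function names are made up for illustration.

```python
import heapq


def search_then_sort(hits, k):
    """Two steps: gather all hits, then sort the whole list by date."""
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[:k]


def date_as_score(hits, k):
    """One step: rank by date directly and keep only the k newest hits."""
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])


# (doc id, date as a yyyymmdd integer); hypothetical sample data
hits = [("doc1", 20090612), ("doc2", 20090630),
        ("doc3", 20090101), ("doc4", 20090625)]

print(date_as_score(hits, 2))
```

Both functions return the same top k; the second avoids sorting every matching document, which matters when the hit list is large.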

RE: Scaling out/up or a mix

2009-06-30 Thread Uwe Schindler
> On Mon, 2009-06-29 at 09:47 +0200, Marcus Herou wrote: > > Index size(and growing): 16Gx8 = 128G > > Doc size (data): 20k > > Num docs: 90M > > Num users: Few hundred but most critical is that the admin staff which > is > > using the index all day long. > > Query types: Example: title:"Iphone" OR

RE: Scaling out/up or a mix

2009-06-30 Thread Toke Eskildsen
On Tue, 2009-06-30 at 11:29 +0200, Uwe Schindler wrote: > So the simple answer is always: > If 64 bit platform with lots of RAM, use MMapDirectory. Fair enough. That makes the RAM-focused solution much more scalable. My point still stands though, as Marcus is currently examining his hardware optio

RE: Scaling out/up or a mix

2009-06-30 Thread Uwe Schindler
> On Mon, 2009-06-29 at 09:47 +0200, Marcus Herou wrote: > > Index size(and growing): 16Gx8 = 128G > > Doc size (data): 20k > > Num docs: 90M > > Num users: Few hundred but most critical is that the admin staff which > is > > using the index all day long. > > Query types: Example: title:"Iphone" OR

Re: Scaling out/up or a mix

2009-06-30 Thread Toke Eskildsen
On Mon, 2009-06-29 at 09:47 +0200, Marcus Herou wrote: > Index size(and growing): 16Gx8 = 128G > Doc size (data): 20k > Num docs: 90M > Num users: Few hundred but most critical is that the admin staff which is > using the index all day long. > Query types: Example: title:"Iphone" OR description:"I

Re: Scaling out/up or a mix

2009-06-29 Thread Marcus Herou
uestion: Based on your findings what is the most challenging part to tune ? Sorting or querying or what else? //Marcus > > > > > > > - Original Message > > From: Marcus Herou > > To: java-user@lucene.apache.org > > Sent: Monday, 29 June, 2009 9:47:

Re: Scaling out/up or a mix

2009-06-29 Thread Toke Eskildsen
On Sat, 2009-06-27 at 00:00 +0200, Marcus Herou wrote: > We currently have about 90M documents and it is increasing rapidly so > getting into the G+ document range is not going to be too far away. We've performed fairly extensive tests regarding hardware for searches and some minor tests on hardwa

Re: Scaling out/up or a mix

2009-06-29 Thread eks dev
> From: Marcus Herou > To: java-user@lucene.apache.org > Sent: Monday, 29 June, 2009 9:47:13 > Subject: Re: Scaling out/up or a mix > > Thanks for the answer. > > Don't you think that part 1 of the email would give you a hint of nature of > the index ? > >

Re: Scaling out/up or a mix

2009-06-29 Thread Marcus Herou
Thanks for the answer. Don't you think that part 1 of the email would give you a hint of nature of the index ? Index size(and growing): 16Gx8 = 128G Doc size (data): 20k Num docs: 90M Num users: Few hundred but most critical is that the admin staff which is using the index all day long. Query typ

Re: Scaling out/up or a mix

2009-06-28 Thread Eric Bowman
There is no single answer -- this is always application specific. Without knowing anything about what you are doing: 1. disk i/o is probably the most critical. Go SSD or even RAM disk if you can, if performance is absolutely critical 2. Sometimes CPU can become an issue, but 8 cores is probably

Re: Scaling out/up or a mix

2009-06-28 Thread Marcus Herou
Hi. I think I need to be more specific. What I am trying to find out is if I should aim for: CPU (2x4 cores, 2.0-3.0 GHz)? Or perhaps just 4 cores are enough. Fast disk IO: 8 disks, RAID 1+0? Or perhaps 2 disks are enough... RAM: if the index does not fit into RAM, how much RAM should I then buy?

Scaling out/up or a mix

2009-06-26 Thread Marcus Herou
Hi. I currently have an index which is 16GB per machine (8 machines = 128GB) (data is stored externally, not in index) and is growing like crazy (we are indexing blogs which is crazy by nature) and have only allocated 2GB per machine to the Lucene app since we are running some other stuff there in

Re: Scaling

2008-07-18 Thread mark harwood
ntical top results to the global idf policy for the vast majority of searches. Cheers Mark - Original Message From: Karl Wettin <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 18 July, 2008 2:33:29 PM Subject: Re: Scaling On 18 Jul 2008 at 09:49, Eric Bo

Re: Scaling

2008-07-18 Thread Jason Rutherglen
> >> The scaling per machine should be linear. The overhead from the network >> is >> minimal because the Lucene object sizes are not impacting. Google >> mentions >> in one of their early white papers on scaling >> http://labs.google.com/papers/googlecluster-

Re: Scaling

2008-07-18 Thread Karl Wettin
On 18 Jul 2008 at 09:49, Eric Bowman wrote: One thing I have trouble understanding is how scoring works in this case. Does Lucene really "just work", or are there special things we have to do to make sure that the scores are coherent so we can actually decide which was the best match? What
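
One reason scores from separate indexes may not be directly comparable is that the idf part of the score depends on each index's own document counts. A small sketch, assuming the classic Lucene `DefaultSimilarity` formula idf = 1 + ln(numDocs / (docFreq + 1)) and made-up shard statistics:

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# (docFreq of the term, numDocs) per shard; hypothetical numbers
shards = [(10, 1000), (500, 1000)]

local_idfs = [idf(df, n) for df, n in shards]
global_idf = idf(sum(df for df, _ in shards), sum(n for _, n in shards))

print([round(x, 2) for x in local_idfs], round(global_idf, 2))
```

The same term gets a very different weight in each shard here because the shards are deliberately skewed; when documents are spread evenly, the local values end up close to the global one.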

Re: Scaling

2008-07-18 Thread Eric Bowman
Jason Rutherglen wrote: The scaling per machine should be linear. The overhead from the network is minimal because the Lucene object sizes are not impacting. Google mentions in one of their early white papers on scaling http://labs.google.com/papers/googlecluster-ieee.pdf that they have sub

Re: Scaling

2008-07-17 Thread Jason Rutherglen
The scaling per machine should be linear. The overhead from the network is minimal because the Lucene object sizes are not impacting. Google mentions in one of their early white papers on scaling http://labs.google.com/papers/googlecluster-ieee.pdf that they have sub indexes which are now

Re: Scaling

2008-07-16 Thread Glen Newton
2008/7/16 Karl Wettin <[EMAIL PROTECTED]>: > Is there some sort of a scaling strategies listing available? I think there > is a Wiki page missing. > > What are the typical problems I'll encounter when distributing the search > over multiple machines? > > Do people spl

Scaling

2008-07-16 Thread Karl Wettin
Is there some sort of a scaling strategies listing available? I think there is a Wiki page missing. What are the typical problems I'll encounter when distributing the search over multiple machines? Do people split up their index per node or do they use the complete index and res

Re: Stable score scaling; LSI again

2008-07-15 Thread Asad Sayeed
cc PM Subject Stable score scaling; LSI again Please respond to

Stable score scaling; LSI again

2008-07-14 Thread Asad Sayeed
Hi, I have a couple of questions about how to alter the similarity scores. I need scores that can be thresholded, and whose thresholds remain stable even when I add documents to the IndexWriter, i.e., identity should be a fixed value such as 1.0. I know that for efficiency reasons, Lucene doesn't do

Scaling out Lucene /general architecture Q

2007-10-26 Thread Mankowski, Chris
I'm new to Lucene and am interested in learning how enterprises deploy multi-server installations of Lucene for large 24x7 operations. The first question that comes to mind is: are most of the design decisions made during development time, or can a simple server be 'grown into' something

Re: Scaling Lucene to 500 million documents - preferred architecture

2007-07-16 Thread Otis Gospodnetic
as your 8 CPU/64GB ones) and well over 1B docs. Otis -- Lucene Consulting -- http://lucene-consulting.com/ - Original Message From: muraalee <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Saturday, July 7, 2007 2:25:06 AM Subject: Scaling Lucene to 500 million documents - prefer

Re: Scaling up to several machines with Lucene

2007-07-07 Thread Chun Wei Ho
currently running a Tomcat web application serving searches >> over our Lucene index (10GB) on a single server machine (Dual 3GHz >> CPU, 4GB RAM). Due to performance issues and to scale up to handle >> more traffic/search requests, we are getting another server machine. >>

Scaling Lucene to 500 million+ documents - preferred architecture

2007-07-07 Thread muraalee
rate brokers to merge these results. Did anybody do something like this in the past? I'd appreciate it if you guys can share your experiences. Thanks, Murali V -- View this message in context: http://www.nabble.com/Scaling-Lucene-to-500-million%2B-documents---preferred-architecture-tf4040120.html#a1

Re: Scaling up to several machines with Lucene

2007-06-28 Thread Chris Lu
Basically, you need to separate your web app from your searching for a scalable solution. Searching is a different concern. You can develop more kinds of search when new requirements come in. Technorati's way is very similar to one of DBSight's configurations. One machine is dedicated to indexing,

Re: Scaling up to several machines with Lucene

2007-06-28 Thread Grant Ingersoll
t two ways of scaling: (1) duplicating the web application and index on the second machine and load-balancing incoming users between the two servers. (2) modifying our web application so that one machine will host our web application (and associated MySQL database), while the other one will host

Re: Scaling up to several machines with Lucene

2007-06-28 Thread Mathieu Lecarme
Samuel LEMOINE wrote: > I'm acutely interested in this issue too, as I'm working on > a distributed architecture for Lucene. I'm only at the very beginning of > my study so I can't help you much, but Hadoop could maybe fit > your requirements. It's a sub-project of Lucene aiming to paralle

Re: Scaling up to several machines with Lucene

2007-06-28 Thread Samuel LEMOINE
. We are looking at two ways of scaling: (1) duplicating the web application and index on the second machine and load-balancing incoming users between the two servers. (2) modifying our web application so that one machine will host our web application (and associated MySQL database), while the

Re: Scaling up to several machines with Lucene

2007-06-28 Thread Mathieu Lecarme
h requests, we are getting another server machine. > > We are looking at two ways of scaling: > (1) duplicating the web application and index on the second machine > and load-balancing incoming users between the two servers. > > (2) modifying our web application so that one machine

Scaling up to several machines with Lucene

2007-06-28 Thread Chun Wei Ho
ways of scaling: (1) duplicating the web application and index on the second machine and load-balancing incoming users between the two servers. (2) modifying our web application so that one machine will host our web application (and associated MySQL database), while the other one will host the

remote multiSearching not scaling well

2006-08-10 Thread Haines, Ronald C. (LNG-DAY)
I'm hoping I'm doing something wrong, because I've been impressed with Lucene so far. The basic problem I'm seeing is that when I run the same search several times against box A (with 1 RemoteSearchable), I see X for an average search response time. When I run the same search several times agains