Thanks for the answer. Don't you think that part 1 of the email gives you a hint of the nature of the index?
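To make the bang-for-the-buck comparison further down easy to check, here is a quick, purely illustrative Python sketch that re-derives the cluster totals from the per-machine specs (the machine counts are my rounded figures, not exact budget divisions):

```python
# Sanity check of the cluster totals for the two hardware options.
# Machine counts and per-machine specs are the ones quoted below.
options = {
    "quad-core 3.0GHz": {"machines": 40, "disks": 4, "cores": 4,
                         "ram_gb": 8, "price_usd": 1200},
    "dual quad-core 2.0GHz": {"machines": 15, "disks": 4, "cores": 8,
                              "ram_gb": 36, "price_usd": 3400},
}

for name, o in options.items():
    n = o["machines"]
    # Totals scale linearly with the machine count.
    print(f"{name}: {n * o['disks']} disks, {n * o['cores']} cores, "
          f"{n * o['ram_gb']}G RAM, ~{n * o['price_usd']} USD total")
```

The derived totals match the figures I quote below; note that 15 x 3400 USD = 51,000 USD, so the dual-CPU option actually overshoots the 50,000 USD budget slightly.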
Index size (and growing): 16G x 8 = 128G
Doc size (data): 20k
Num docs: 90M
Num users: Few hundred, but most critical is the admin staff, which uses the index all day long.
Query types: Example: title:"Iphone" OR description:"Iphone" sorted by publishedDate... Very simple, no fuzzy searches etc. However, since the dataset is large, I guess sorting will consume memory.

Could one not draw any conclusions about best practice in terms of hardware, given the above "specs"?

Basically I would like to know if I really need 8 cores, since machines with dual-CPU support are the most expensive, and I would rather not throw money away, so getting it right is a matter of economy.

I mean, it is very simple: let's say someone gives me a budget of 50,000 USD, and I want to get the most bang for the buck for my workload. Should I go for:

X machines with quad-core 3.0GHz, 4 disks RAID1+0, 8G RAM, costing 1200 USD a piece (giving me 40 machines: 160 disks, 160 cores, 320G RAM), or
X machines with dual quad-core 2.0GHz, 4 disks RAID1+0, 36G RAM, costing 3400 USD a piece (giving me 15 machines: 60 disks, 120 cores, 540G RAM)?

Basically, I would like to know what factors make the workload IO-bound vs CPU-bound.

//Marcus

On Mon, Jun 29, 2009 at 8:53 AM, Eric Bowman <ebow...@boboco.ie> wrote:

> There is no single answer -- this is always application specific.
>
> Without knowing anything about what you are doing:
>
> 1. Disk I/O is probably the most critical. Go SSD or even RAM disk if
> you can, if performance is absolutely critical.
> 2. Sometimes CPU can become an issue, but 8 cores is probably enough
> unless you are doing especially CPU-bound searches.
>
> Unless you are doing something with hard performance requirements, or
> really quite unusual, buying "good" kit is probably good enough, and you
> won't really know for sure until you measure. Lucene is a general
> enough tool that there isn't a terribly universal answer to this.
> We were a bit surprised to end up CPU-bound instead of disk I/O-bound, for
> instance, but we ended up taking an unusual path. YMMV.
>
> Marcus Herou wrote:
> > Hi. I think I need to be more specific.
> >
> > What I am trying to find out is if I should aim for:
> >
> > CPU (2x4 cores, 2.0-3.0GHz)? Or perhaps just 4 cores is enough.
> > Fast disk IO: 8 disks, RAID1+0? Or perhaps 2 disks is enough...
> > RAM: if the index does not fit into RAM, how much RAM should I then buy?
> >
> > Please, any hints would be appreciated, since I am going to invest soon.
> >
> > //Marcus
> >
> > On Sat, Jun 27, 2009 at 12:00 AM, Marcus Herou
> > <marcus.he...@tailsweep.com> wrote:
> >
> >> Hi.
> >>
> >> I currently have an index which is 16GB per machine (8 machines = 128GB)
> >> (data is stored externally, not in the index) and is growing like crazy (we
> >> are indexing blogs, which are crazy by nature), and I have only allocated
> >> 2GB per machine to the Lucene app, since we are running some other stuff
> >> there in parallel.
> >>
> >> Each doc should be roughly the size of a blog post, no more than 20k.
> >>
> >> We currently have about 90M documents, and the count is increasing rapidly,
> >> so getting into the G+ document range is not going to be too far away.
> >>
> >> Now, due to search performance, I think I need to move these instances to
> >> dedicated index/search machines (or index on some machines and search on
> >> others). Anyway, I would like to get some feedback about two things:
> >>
> >> 1. What is the most important hardware aspect when it comes to adding
> >> documents to the index and optimizing it?
> >> 1.1 Is it disk I/O write throughput? (sequential or random IO?)
> >> 1.2 Is it RAM?
> >> 1.3 Is it CPU?
> >>
> >> My guess would be disk IO. Right, wrong?
> >>
> >> 2. What is the most important hardware aspect when it comes to searching
> >> documents in my setup?
> >> (The result set is limited to return only the top 10
> >> matches, with page handling.)
> >> 2.1 Is it disk read throughput? (sequential or random IO?)
> >> 2.2 Is it RAM?
> >> 2.3 Is it CPU?
> >>
> >> I have no clue, since the data might not fit into memory. What is then the
> >> most important factor? Read performance while scanning the index? CPU
> >> while comparing fields and collecting results?
> >>
> >> What I'm trying to find out is what I can do to get the most bang for the
> >> buck with a limited (aren't we all limited?) budget.
> >>
> >> Kindly
> >>
> >> //Marcus
> >>
> >> --
> >> Marcus Herou CTO and co-founder Tailsweep AB
> >> +46702561312
> >> marcus.he...@tailsweep.com
> >> http://www.tailsweep.com/
> >
>
> --
> Eric Bowman
> Boboco Ltd
> ebow...@boboco.ie
> http://www.boboco.ie/ebowman/pubkey.pgp
> +35318394189 / +353872801532
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/