Thanks for the answer. Don't you think that part 1 of the email gives you a hint of the nature of the index?
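To make the bang-for-the-buck comparison further down easy to check, here is a quick, purely illustrative Python sketch that re-derives the cluster totals from the per-machine specs (the machine counts are my rounded figures, not exact budget divisions):

```python
# Sanity check of the cluster totals for the two hardware options.
# Machine counts and per-machine specs are the ones quoted below.
options = {
    "quad-core 3.0GHz": {"machines": 40, "disks": 4, "cores": 4,
                         "ram_gb": 8, "price_usd": 1200},
    "dual quad-core 2.0GHz": {"machines": 15, "disks": 4, "cores": 8,
                              "ram_gb": 36, "price_usd": 3400},
}

for name, o in options.items():
    n = o["machines"]
    # Totals scale linearly with the machine count.
    print(f"{name}: {n * o['disks']} disks, {n * o['cores']} cores, "
          f"{n * o['ram_gb']}G RAM, ~{n * o['price_usd']} USD total")
```

The derived totals match the figures I quote below; note that 15 x 3400 USD = 51,000 USD, so the dual-CPU option actually overshoots the 50,000 USD budget slightly.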
Index size (and growing): 16G x 8 = 128G
Doc size (data): 20k
Num docs: 90M
Num users: Few hundred, but most critical is the admin staff, which uses the index all day long.
Query types: Example: title:"Iphone" OR description:"Iphone" sorted by publishedDate... Very simple, no fuzzy searches etc. However, since the dataset is large, I guess sorting will consume memory.

Could one not draw any conclusions about best practice in terms of hardware, given the above "specs"?

Basically I would like to know if I really need 8 cores, since machines with dual-CPU support are the most expensive, and I would rather not throw money away, so getting it right is a matter of economy.

I mean, it is very simple: let's say someone gives me a budget of 50,000 USD, and I want to get the most bang for the buck for my workload. Should I go for:

X machines with quad-core 3.0GHz, 4 disks RAID1+0, 8G RAM, costing 1200 USD a piece (giving me 40 machines: 160 disks, 160 cores, 320G RAM), or
X machines with dual quad-core 2.0GHz, 4 disks RAID1+0, 36G RAM, costing 3400 USD a piece (giving me 15 machines: 60 disks, 120 cores, 540G RAM)?

Basically, I would like to know what factors make the workload IO-bound vs CPU-bound.

//Marcus

On Mon, Jun 29, 2009 at 8:53 AM, Eric Bowman <ebow...@boboco.ie> wrote:

> There is no single answer -- this is always application specific.
>
> Without knowing anything about what you are doing:
>
> 1. Disk I/O is probably the most critical. Go SSD or even RAM disk if
> you can, if performance is absolutely critical.
> 2. Sometimes CPU can become an issue, but 8 cores is probably enough
> unless you are doing especially CPU-bound searches.
>
> Unless you are doing something with hard performance requirements, or
> really quite unusual, buying "good" kit is probably good enough, and you
> won't really know for sure until you measure. Lucene is a general
> enough tool that there isn't a terribly universal answer to this.
> We were a bit surprised to end up CPU-bound instead of disk I/O-bound, for
> instance, but we ended up taking an unusual path. YMMV.
>
> Marcus Herou wrote:
> > Hi. I think I need to be more specific.
> >
> > What I am trying to find out is if I should aim for:
> >
> > CPU (2x4 cores, 2.0-3.0GHz)? Or perhaps just 4 cores is enough.
> > Fast disk IO: 8 disks, RAID1+0? Or perhaps 2 disks is enough...
> > RAM: if the index does not fit into RAM, how much RAM should I then buy?
> >
> > Please, any hints would be appreciated, since I am going to invest soon.
> >
> > //Marcus
> >
> > On Sat, Jun 27, 2009 at 12:00 AM, Marcus Herou
> > <marcus.he...@tailsweep.com> wrote:
> >
> >> Hi.
> >>
> >> I currently have an index which is 16GB per machine (8 machines = 128GB)
> >> (data is stored externally, not in the index) and is growing like crazy (we
> >> are indexing blogs, which are crazy by nature), and I have only allocated
> >> 2GB per machine to the Lucene app, since we are running some other stuff
> >> there in parallel.
> >>
> >> Each doc should be roughly the size of a blog post, no more than 20k.
> >>
> >> We currently have about 90M documents, and the count is increasing rapidly,
> >> so getting into the G+ document range is not going to be too far away.
> >>
> >> Now, due to search performance, I think I need to move these instances to
> >> dedicated index/search machines (or index on some machines and search on
> >> others). Anyway, I would like to get some feedback about two things:
> >>
> >> 1. What is the most important hardware aspect when it comes to adding
> >> documents to the index and optimizing it?
> >> 1.1 Is it disk I/O write throughput? (sequential or random IO?)
> >> 1.2 Is it RAM?
> >> 1.3 Is it CPU?
> >>
> >> My guess would be disk IO. Right, wrong?
> >>
> >> 2. What is the most important hardware aspect when it comes to searching
> >> documents in my setup?
> >> (The result set is limited to return only the top 10
> >> matches, with page handling.)
> >> 2.1 Is it disk read throughput? (sequential or random IO?)
> >> 2.2 Is it RAM?
> >> 2.3 Is it CPU?
> >>
> >> I have no clue, since the data might not fit into memory. What is then the
> >> most important factor? Read performance while scanning the index? CPU
> >> while comparing fields and collecting results?
> >>
> >> What I'm trying to find out is what I can do to get the most bang for the
> >> buck with a limited (aren't we all limited?) budget.
> >>
> >> Kindly
> >>
> >> //Marcus
> >>
> >> --
> >> Marcus Herou CTO and co-founder Tailsweep AB
> >> +46702561312
> >> marcus.he...@tailsweep.com
> >> http://www.tailsweep.com/
> >
>
> --
> Eric Bowman
> Boboco Ltd
> ebow...@boboco.ie
> http://www.boboco.ie/ebowman/pubkey.pgp
> +35318394189 / +353872801532
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/