2011/8/23 Hetz Ben Hamo <het...@gmail.com>
> Hi,
>
> I'm thinking about some project, and I was wondering about something: let's
> say I'm setting up a really big web site, something on the scale of Walla or
> Ynet.
> My question is about the DB: where exactly are the limits that force me to
> go to master/slave, replication, etc.?
> Is it related to memory? Processor? I know that many use replication etc.
> due to a shortage of RAM and/or CPU power.
>
> Let's say I have a machine with 512GB RAM, 20 cores, 20TB disks. Such a
> machine should handle a huge load without any issue. With such a machine,
> theoretically, can a single MySQL / PostgreSQL instance handle such a load
> (a few hundred thousand connected users at once)?
I can't answer these specific questions directly. We have a chubby (2TB right now, and growing) PostgreSQL database on a Hitachi FC SAN with 15k RPM RAID 1+0 SAS disks, 84GB RAM and 8 CPU cores, set up to fail over to an identical standby secondary using Red Hat Cluster Suite (running on top of CentOS and Xen), and it can hardly handle the load of some queries. It's not a web site; it's mostly a data warehouse loading thousands of new items per minute and allowing customers to query and configure stuff through a web "portal". Our developers are looking at solutions to cluster Postgres, and at using SSDs for disks. I'm not sure how much a larger single PostgreSQL instance would help. There are quite a few anecdotes and howtos about large PostgreSQL databases on the web (blogs, wiki.postgresql.org, etc.).

I'm also helping a friend with an idea for a web site which might also grow a lot, and I am fiddling with the idea of using a NoSQL database for it. Right now I'm looking at Cassandra (e.g. see http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/). It's a completely new way of thinking about database development, but the benefits from the operations and scalability perspective could be huge: high availability (replication means that if a node fails, then another node with identical data will serve queries), hardware cost (start with small cheap boxes and add as you go) and ease of management (which I expect to have once I get the hang of it) look very attractive, and that's even before I mention the options for Map-Reduce and distributed "data mining".

See also the SNA Project: http://sna-projects.com/ (actually a cluster of related projects from LinkedIn).

You might also want to look at combining technologies, e.g. stick to a large PostgreSQL database to manipulate data, but export the information into smaller data stores which the web front ends will access through a simpler interface.
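To make the "combining technologies" idea concrete, here is a minimal sketch of a periodic export job. It assumes a relational source and a simple key-value store for the front ends; sqlite3 stands in for the big PostgreSQL database and Python's dbm module for the small store, just so the example is self-contained (the table name and keys are made up for illustration):

```python
import dbm
import json
import sqlite3
import tempfile
import os

# Stand-in for the large relational database (PostgreSQL in this
# discussion); sqlite3 is used only to keep the sketch self-contained.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
src.executemany("INSERT INTO items VALUES (?, ?, ?)",
                [(1, "widget", 9.99), (2, "gadget", 19.99)])
src.commit()

kv_path = os.path.join(tempfile.mkdtemp(), "items_kv")

# Periodic batch export: flatten each row into a key-value record that
# a web front end can fetch by key without touching the big database.
with dbm.open(kv_path, "c") as kv:
    for item_id, name, price in src.execute("SELECT id, name, price FROM items"):
        kv[f"item:{item_id}"] = json.dumps({"name": name, "price": price})

# Front-end side: one cheap key lookup instead of a SQL query.
with dbm.open(kv_path, "r") as kv:
    record = json.loads(kv[b"item:2"])
print(record["name"])  # gadget
```

In a real deployment the export would run on a schedule (or off a replication feed), and the key-value store could be memcached, Cassandra or similar rather than a local dbm file.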
For example, I heard that Google's famous Map-Reduce search intelligence algorithms run as batch jobs on the main data gathered by the crawlers and produce simpler key-value stores, which are what your web searches actually query.

This is just the tip of the iceberg from someone who's still just watching and reading; I haven't even gotten around to dipping my toes in the water (except for the HBase quick start tutorial: http://hbase.apache.org/book/quickstart.html, which succeeded on one desktop and failed on another). There is plenty of material on the web about "large web sites".

Cheers,
--Amos
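The batch pattern described above (Map-Reduce jobs over raw crawl data producing a key-value store that serves queries) can be shown in miniature. This is a toy single-machine sketch of the idea, not how Google actually implements it; the sample documents are invented:

```python
from collections import defaultdict

# Toy "crawl" output: document id -> page text.
documents = {
    "doc1": "big web site scaling",
    "doc2": "web database scaling tips",
}

def map_phase(docs):
    """Map: emit (word, doc_id) pairs from each crawled document."""
    for doc_id, text in docs.items():
        for word in text.split():
            yield word, doc_id

def reduce_phase(pairs):
    """Reduce: group pairs into an inverted index -- the simple
    key-value store that the search front end would query."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return index

index = reduce_phase(map_phase(documents))
print(sorted(index["scaling"]))  # ['doc1', 'doc2']
```

The point is the division of labour: the expensive computation happens offline in batch, and the serving path is reduced to a cheap key lookup.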
_______________________________________________
Linux-il mailing list
Linux-il@cs.huji.ac.il
http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il