On Thu, Jan 7, 2010 at 8:23 AM, Gary Warner <g...@cis.uab.edu> wrote:
> Hello,
>
> I've been lurking on this list a couple weeks now, and have asked some
> "side questions" to some of the list members, who have been gracious,
> and helpful, and encouraged me to just dive in and participate on the
> list.
>
> I'll not tell you the whole story right off the bat, but let me give
> you a basic outline.
>
> I dual report into two academic departments at the University of
> Alabama at Birmingham - Computer & Information Sciences and Justice
> Sciences. Although my background is on the CS side, I specialize in
> cybercrime investigations and our research focuses on things that help
> to train or equip cybercrime investigators.
>
> One of our research projects is called the "UAB Spam Data Mine".
> Basically, we collect spam, use it to detect emerging malware threats,
> phishing sites, or spam campaigns, and share our findings with law
> enforcement and our corporate partners.
>
> We started off small, with only around 10,000 to 20,000 emails per day
> running on a smallish server. Once we had our basic workflow down, we
> moved to nicer hardware, and opened the floodgates a bit. We're
> currently receiving about 1.2 million emails per day, and hope to very
> quickly grow that to more than 5 million emails per day.
>
> I've got very nice hardware - many TB of very fast disks, and several
> servers with 12GB of RAM and 8 pentium cores each.
Are you running 8.3.x? I'd go with that as a minimum for now. You'll
want to make sure those fast disks are under a fast RAID setup like
RAID-10, perhaps with a high quality RAID controller with battery
backed cache as well. I/O is going to be your real issue here, not CPU,
most likely. Also look at increasing RAM to 48 or 96GB if you can
afford it. I assume your pentium cores are Nehalem since you've got
12GB of RAM (a multiple of 3). Those are a good choice here: they're
fast and have good memory bandwidth.

> For the types of investigative support we do, some of our queries are
> of the 'can you tell me what this botnet was spamming for the past six
> months', but most of them are more "real time", of the "what is the
> top spam campaign today?" or "what domains are being spammed by this
> botnet RIGHT NOW".

Then you'll probably want to look at partitioning your data.

> We currently use 15 minute batch queues, where we parse between 10,000
> to 20,000 emails every 15 minutes. Each message is assigned a unique
> message_id, which is a combination of what date and time "batch" it is
> in, followed by a sequential number, so the most recent batch
> processed this morning starts with "10Jan07.0" and goes through
> "10Jan07.13800".
>
> I don't know what this list considers "large databases", but I'm going
> to go ahead and call 170 million records a "large" table.
>
> - if you have 8 pentium cores, 12GB of RAM and "infinite" diskspace,
> what sorts of memory settings would you have in your start up tables?

Crank up shared_buffers and effective_cache_size. work_mem could
probably use a bump too, something in the 16 to 32MB range, especially
since you're only looking at a dozen or so, not hundreds, of concurrent
users. work_mem is allocated per sort (and a single query can run more
than one), so it can get out of hand fast if you crank it up too high,
and for most users higher settings won't help anyway.

> My biggest question mark there really has to do with how many users I
> have and how that might alter the results. My research team has about
> 12 folks who might be using the UAB Spam Data Mine at any given time,
> plus we have the "parser" running pretty much constantly, and routines
> that are fetching IP addresses for all spammed URLs and nameservers
> for all spammed domains and constantly updating the databases with
> that information. In the very near future, we'll be accepting queries
> directly from law enforcement through a web interface and may have as
> many as another 12 simultaneous users, so maybe 25 max users. We plan
> to limit "web users" to a "recent" subset of the data, probably
> allowing "today" "previous 24 hour" and "previous 7 days" as query
> options within the web interface. The onsite researchers will bang the
> heck out of much larger datasets.

You might want to look at pre-rolling common requests if that isn't
what you're already doing with the routines you're mentioning up there.

> (I'll let this thread run a bit, and then come back to ask questions
> about "vacuum analyze" and "partitioned tables" as a second and third
> round of questions.)

You shouldn't have a lot of vacuum problems since you won't be deleting
anything. You might wanna crank up autovacuum aggressiveness as regards
analyze though. Partitioning is in your future.
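Since you asked about concrete memory settings: here's a rough sketch
of a postgresql.conf starting point for a dedicated 12GB box with a
couple dozen users. These are ballpark figures, not tuned values;
adjust from there based on what you actually see:

    shared_buffers = 2GB               # ~25% of RAM is a common starting point
    effective_cache_size = 8GB         # roughly what the OS will cache on a dedicated box
    work_mem = 16MB                    # per sort/hash; keep it modest with ~25 users
    maintenance_work_mem = 512MB       # speeds up index builds and vacuum
    checkpoint_segments = 32           # spreads out checkpoints under heavy insert load
    autovacuum_analyze_scale_factor = 0.02   # analyze fast-growing tables more often

And since partitioning is in your future anyway, on 8.3/8.4 that means
inheritance: a parent table, date-ranged children with CHECK
constraints, and constraint_exclusion turned on so the planner only
scans the partitions a query can touch. The table and column names
below are made up for illustration; substitute your own schema:

    -- parent table (hypothetical names)
    CREATE TABLE spam_messages (
        message_id   text NOT NULL,
        received_at  timestamptz NOT NULL,
        subject      text,
        raw_message  text
    );

    -- one child per day; indexes and constraints go on each child
    CREATE TABLE spam_messages_2010_01_07 (
        CHECK (received_at >= DATE '2010-01-07'
           AND received_at <  DATE '2010-01-08')
    ) INHERITS (spam_messages);

    CREATE INDEX spam_messages_2010_01_07_received_idx
        ON spam_messages_2010_01_07 (received_at);

    -- postgresql.conf
    constraint_exclusion = on

With that in place, a query with literal date bounds, e.g.

    SELECT count(*) FROM spam_messages
    WHERE received_at >= '2010-01-07' AND received_at < '2010-01-08';

only touches that day's partition instead of all 170 million rows.
Note that exclusion happens at plan time, so the bounds need to be
literal values rather than expressions like now(). You'll also need to
route inserts to the right child, either with a trigger on the parent
or by having your 15-minute batch loader insert directly into the
current day's partition.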