On Thu, Jan 7, 2010 at 8:23 AM, Gary Warner <g...@cis.uab.edu> wrote:
> Hello,
>
> I've been lurking on this list a couple weeks now, and have asked some "side 
> questions" to some of the list members, who have been gracious, and helpful, 
> and encouraged me to just dive in and participate on the list.
>
> I'll not tell you the whole story right off the bat, but let me give you a 
> basic outline.
>
> I dual report into two academic departments at the University of Alabama at 
> Birmingham - Computer & Information Sciences and Justice Sciences.  Although 
> my background is on the CS side, I specialize in cybercrime investigations 
> and our research focuses on things that help to train or equip cybercrime 
> investigators.
>
> One of our research projects is called the "UAB Spam Data Mine".  Basically, 
> we collect spam, use it to detect emerging malware threats, phishing sites, 
> or spam campaigns, and share our findings with law enforcement and our 
> corporate partners.
>
> We started off small, with only around 10,000 to 20,000 emails per day 
> running on a smallish server.  Once we had our basic workflow down, we moved 
> to nicer hardware, and opened the floodgates a bit.  We're currently 
> receiving about 1.2 million emails per day, and hope to very quickly grow 
> that to more than 5 million emails per day.
>
> I've got very nice hardware - many TB of very fast disks, and several servers 
> with 12GB of RAM and 8 pentium cores each.

Are you running 8.3.x?  I'd go with that as a minimum for now.
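
If you're not sure exactly which version you're on, it's a quick
check from psql:

    SELECT version();
    -- or just the number:
    SHOW server_version;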

You'll want to make sure those fast disks are under a fast RAID setup
like RAID-10, perhaps with a high-quality RAID controller with
battery-backed cache as well.  I/O, not CPU, is most likely going to
be your real issue here.  Also look at increasing RAM to 48 or 96GB
if you can afford it.  I assume your pentium cores are Nehalem-based,
since you've got 12GB of RAM (a multiple of 3, which matches
Nehalem's triple-channel memory).  Those are a good choice here:
they're fast and have excellent memory bandwidth.

> For the types of investigative support we do, some of our queries are of the 
> 'can you tell me what this botnet was spamming for the past six months', but 
> most of them are more "real time", of the "what is the top spam campaign 
> today?" or "what domains are being spammed by this botnet RIGHT NOW".

Then you'll probably want to look at partitioning your data.
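
As a rough sketch of what that can look like on 8.3 (the table and
column names here are made up for illustration), you partition the
message table by day using inheritance plus CHECK constraints, so the
"RIGHT NOW" queries only ever touch one or two small child tables:

    -- hypothetical parent table
    CREATE TABLE spam (
        message_id  text        NOT NULL,
        received_at timestamptz NOT NULL,
        campaign_id integer,
        subject     text,
        body        text
    );

    -- one child table per day, with a CHECK constraint the planner
    -- can use to skip partitions that can't match the WHERE clause
    CREATE TABLE spam_2010_01_07 (
        CHECK (received_at >= '2010-01-07'
           AND received_at <  '2010-01-08')
    ) INHERITS (spam);

    CREATE INDEX spam_2010_01_07_received_idx
        ON spam_2010_01_07 (received_at);

    -- and in postgresql.conf (off by default in 8.3):
    -- constraint_exclusion = on

Your 15-minute parser can insert straight into the current day's
child table, so you don't even need an insert trigger on the parent.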

> We currently use 15 minute batch queues, where we parse between 10,000 to 
> 20,000 emails every 15 minutes.  Each message is assigned a unique 
> message_id, which is a combination of what date and time "batch" it is in, 
> followed by a sequential number, so the most recent batch processed this 
> morning starts with "10Jan07.0" and goes through "10Jan07.13800".

> I don't know what this list considers "large databases", but I'm going to go 
> ahead and call 170 million records a "large" table.


>  - if you have 8 pentium cores, 12GB of RAM and "infinite" diskspace, what 
> sorts of memory settings would you have in your start up tables?

Crank up shared_buffers and effective_cache_size.  work_mem could
probably use a bump too, something in the 16 to 32MB range,
especially since you're only looking at a dozen or so concurrent
users, not hundreds.  Keep in mind that work_mem is allocated per
sort (and a single query can run several sorts), so it can get out of
hand fast if you crank it up too high, and for most workloads higher
settings won't help anyway.
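
Just as a starting point on a 12GB box that's mostly dedicated to
Postgres (these numbers are assumptions to tune from, not gospel),
something like this in postgresql.conf:

    shared_buffers = 3GB            # ~25% of RAM is a common starting point
    effective_cache_size = 8GB      # roughly what the OS will cache for you
    work_mem = 32MB                 # per sort, so keep it modest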

>  My biggest question mark there really has to do with how many users I have 
> and how that might alter the results.  My research team has about 12 folks 
> who might be using the UAB Spam Data Mine at any given time, plus we have the 
> "parser" running pretty much constantly, and routines that are fetching IP 
> addresses for all spammed URLs and nameservers for all spammed domains and 
> constantly updating the databases with that information. In the very near 
> future, we'll be accepting queries directly from law enforcement through a 
> web interface and may have as many as another 12 simultaneous users, so maybe 
> 25 max users.  We plan to limit "web users" to a "recent" subset of the data, 
> probably allowing "today" "previous 24 hour" and "previous 7 days" as query 
> options within the web interface.  The onsite researchers will bang the heck 
> out of much larger datasets.

You might want to look at pre-rolling (pre-aggregating) the common
requests, if that isn't already what the routines you mention are
doing.
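
For instance (again using the hypothetical spam table from the
partitioning sketch above), the 15-minute parser run could maintain a
small rollup table, so the "top campaign today" style queries read a
few thousand pre-aggregated rows instead of the full message table:

    -- hypothetical rollup table, refreshed at the end of each batch
    CREATE TABLE campaign_counts_15min (
        batch_start timestamptz NOT NULL,
        campaign_id integer     NOT NULL,
        msg_count   bigint      NOT NULL,
        PRIMARY KEY (batch_start, campaign_id)
    );

    -- run once per 15-minute batch
    INSERT INTO campaign_counts_15min (batch_start, campaign_id, msg_count)
    SELECT '2010-01-07 08:00'::timestamptz, campaign_id, count(*)
    FROM   spam
    WHERE  received_at >= '2010-01-07 08:00'
    AND    received_at <  '2010-01-07 08:15'
    AND    campaign_id IS NOT NULL
    GROUP  BY campaign_id;

    -- "top spam campaign today" is then cheap:
    SELECT   campaign_id, sum(msg_count) AS total
    FROM     campaign_counts_15min
    WHERE    batch_start >= date_trunc('day', now())
    GROUP BY campaign_id
    ORDER BY total DESC
    LIMIT    10;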

> (I'll let this thread run a bit, and then come back to ask questions about 
> "vacuum analyze" and "partitioned tables" as a second and third round of 
> questions.)

You shouldn't have a lot of vacuum problems since you won't be
deleting anything.  You might want to make autovacuum more aggressive
about running ANALYZE, though, so the planner keeps up with the
constant inserts.  Partitioning is in your future.
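
On the ANALYZE side, the knobs for that (assuming 8.3's autovacuum)
live in postgresql.conf; lowering the scale factor makes it
re-analyze the big, insert-only tables much more often than the
default 10% threshold:

    autovacuum = on                          # already the default in 8.3
    autovacuum_analyze_scale_factor = 0.02   # re-analyze after ~2% new rows
    autovacuum_analyze_threshold = 5000      # base threshold added on top (default 50)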
