You can also tell splunk to save the results of a search for X minutes, so
that if someone else runs the same search it just hands back the same
answer. That way, if a dozen people are viewing a dashboard, it isn't
calculating the results a dozen times.
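As a rough sketch of what that looks like (this is from memory, so treat
the attribute names as approximate and check the savedsearches.conf docs;
the search itself is just illustrative), a saved search that runs every
ten minutes and keeps its results around for reuse:

   [errors_by_host]
   search = sourcetype=syslog log_level=ERROR | stats count by host
   enableSched = 1
   cron_schedule = */10 * * * *
   dispatch.earliest_time = -1h
   dispatch.ttl = 600

a dashboard panel pointed at the saved search then reuses the cached
results instead of kicking off a new search for every viewer.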
Splunk is a _very_ powerful tool, but it can't add to the capabilities of
the hardware; if you use all of its power for lots of people on lots of
logs, you can end up spending quite a bit on the hardware. But you can
also install splunk on a run-of-the-mill box and try it, only upgrading
the hardware if you feel that you need higher performance. You don't have
to make the decision up front.

David Lang

On Mon, 1 Mar 2010, Rob Das wrote:

> By the way, real-time searching in Splunk (4.1) utilizes the same
> map-reduce-style parallelized architecture as historical search.
>
> You can use "summary indexing" to make aggregated dashboards much
> faster. Basically the aggregation is done in the background, as opposed
> to each time the user displays the dashboard.
>
> Rob
>
> On Mon, Mar 1, 2010 at 2:19 PM, <da...@lang.hm> wrote:
>
>> As for the hardware requirements for splunk, it depends on a couple of
>> things.
>>
>> Is the data you are searching for 'rare', 'common', or 'extremely
>> common' in your log data? ('common' being something along the lines of
>> more than 0.1% or so of your logs in the date range you are searching,
>> and 'extremely common' being when you get up above 10% or so of the
>> logs)
>>
>> If you are searching for 'rare' data, then your primary bottleneck is
>> going to be seek time on your disks. splunk is very happy with raid5/6
>> disks, getting the same performance from a raid6 set of disks as from
>> the same number of disks in raid 10 (it's very close to a read-only
>> app).
>>
>> If you are searching for 'common' data, then your bottleneck can shift
>> to cpu cycles while still saturating your disks (the raw data is
>> gzipped; if you end up needing to read a lot of data, you end up
>> gunzipping a lot of logs to extract the line that you are looking for).
>>
>> If you are searching for 'extremely common' data, then the bottleneck
>> will be the CPU as you are gunzipping and processing the data.
>>
>> If you have very common searches (especially on subsets of the data),
>> you can teach splunk about these searches and it can optimize the data
>> as it arrives to make it faster to search (see the sketch at the end
>> of this message). By default splunk does all its analysis at search
>> time, which can cause it to pull up a lot of log messages only to
>> discover that they aren't what you are looking for and discard them.
>>
>> Splunk also parallelizes very nicely; if you split your logs across N
>> machines, you get very close to N times the performance of a single
>> machine.
>>
>> Like any database, if you have lots of people contending for the
>> capacity, they will have to share it and performance will drop. How
>> big a problem this is depends on what you are doing. The dashboards
>> are very nice, but generating them (and updating them) requires a lot
>> of queries, so one person using a dashboard can be the same as a dozen
>> or more people doing individual queries.
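>>
>> To give you an idea of what 'teaching splunk' can look like (a sketch
>> from memory; check props.conf/transforms.conf in the docs for the
>> exact attribute names, and the field name here is just an example),
>> an index-time field extraction is a couple of small stanzas:
>>
>>    # props.conf (illustrative sourcetype)
>>    [syslog]
>>    TRANSFORMS-srcip = extract_srcip
>>
>>    # transforms.conf
>>    [extract_srcip]
>>    REGEX = from\s+(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
>>    FORMAT = src_ip::$1
>>    WRITE_META = true
>>
>>    # fields.conf
>>    [src_ip]
>>    INDEXED = true
>>
>> after that, a search on src_ip=10.0.0.5 can be answered from the index
>> instead of gunzipping and regexing the raw events.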
>>
>> David Lang
>>
>> On Mon, 1 Mar 2010, Rob Das wrote:
>>
>>> Splunk free can handle (index) 500Meg/day. You can keep as much
>>> historical data around as you wish, and therefore your search may be
>>> across many many Gigs.
>>>
>>> Rob Das
>>>
>>> On Mon, Mar 1, 2010 at 1:42 PM, Dustin Puryear
>>> <dpury...@puryear-it.com> wrote:
>>>
>>>> Maybe I'm missing something, but isn't Splunk free for smaller
>>>> needs? Also, despite what the hardware requirements say, it really
>>>> doesn't take much to run Splunk unless you are really pumping out
>>>> the logs. Maybe I missed it, but did you state how many logs you are
>>>> producing, as a log/sec or kb/sec estimate?
>>>>
>>>> ---
>>>> Puryear IT, LLC - Baton Rouge, LA - http://www.puryear-it.com/
>>>> Active Directory Integration : Web & Enterprise Single Sign-On
>>>> Identity and Access Management : Linux/UNIX technologies
>>>>
>>>> Download our free ebook "Best Practices for Linux and UNIX Servers"
>>>> http://www.puryear-it.com/pubs/linux-unix-best-practices/
>>>>
>>>> -----Original Message-----
>>>> From: discuss-boun...@lopsa.org [mailto:discuss-boun...@lopsa.org]
>>>> On Behalf Of da...@lang.hm
>>>> Sent: Monday, March 01, 2010 1:00 PM
>>>> To: Rob Das
>>>> Cc: discuss@lopsa.org
>>>> Subject: Re: [lopsa-discuss] splunk alternatives
>>>>
>>>> Rob,
>>>> one comment on the real-time search capability of splunk 4: per
>>>> recent conversations with splunk, the real-time search is not going
>>>> to be able to be integrated with past data.
>>>>
>>>> in other words, you can say 'do this search on data that arrives
>>>> after now', but you cannot say 'do this search on data that
>>>> arrived/arrives after 5 min ago'.
>>>>
>>>> David Lang
>>>>
>>>> On Mon, 1 Mar 2010, Rob Das wrote:
>>>>
>>>>> Date: Mon, 1 Mar 2010 10:26:38 -0800
>>>>> From: Rob Das <rob...@gmail.com>
>>>>> To: discuss@lopsa.org
>>>>> Subject: [lopsa-discuss] splunk alternatives
>>>>>
>>>>> First, please forgive me if this email is overly long.
>>>>>
>>>>> Yes, SEC and Splunk are different in many ways - both useful in the
>>>>> right context. I have a few questions. How much data per day are
>>>>> you talking about? Are you interested in looking at historical data
>>>>> and comparing it against current data? Do you need any sort of
>>>>> roll-ups or more advanced aggregations/analytics on your data? You
>>>>> may not now, but will you ever be interested in gathering events
>>>>> that cannot be captured via syslog (extra-large, application, or
>>>>> multi-line events, for example)? Do you want different people to
>>>>> have access to different types of data? Do you want different roles
>>>>> of users to see different views? Do you foresee that the data
>>>>> volumes will grow over time? Are your 20 users really concurrent,
>>>>> or will they be searching randomly throughout the day?
>>>>>
>>>>> First of all, the new version of Splunk (version 4.1), which will
>>>>> be out very soon, includes real-time support. What this means is
>>>>> that searches can optionally be executed at data input time, as the
>>>>> data is acquired. If events match a search as they come in, alerts
>>>>> can be triggered. Furthermore, Splunk's dashboards, graphs, and
>>>>> tables will update in real time as the data comes in, effectively
>>>>> providing a "heartbeat".
>>>>>
>>>>> If you need to "find the needle in the haystack", you can't find a
>>>>> better tool.
>>>>>
>>>>> Simple stuff like "tell me the top ten logins by IP address over
>>>>> the last 24 hours or month" can't be done with SEC without writing
>>>>> code. Splunk handles this via its GUI, and graphs like this can be
>>>>> placed on dashboards which update in real time. Splunk can easily
>>>>> filter out data that you are not interested in, or keep it for as
>>>>> long as you like - your choice.
>>>>>
>>>>> Splunk provides role-based access controls that can optionally
>>>>> filter data at search time, depending on who is allowed to see
>>>>> what.
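>>>>>
>>>>> To sketch both of those (the field values are illustrative, and you
>>>>> should check the docs for the exact attribute names): the "top ten
>>>>> logins" report is a one-line search,
>>>>>
>>>>>    sourcetype=syslog action=login earliest=-24h | top limit=10 src_ip
>>>>>
>>>>> and a role that can only search its own hosts might look like this
>>>>> in authorize.conf:
>>>>>
>>>>>    [role_webdev]
>>>>>    importRoles = user
>>>>>    srchIndexesAllowed = weblogs
>>>>>    srchFilter = host=web*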
>>>>>
>>>>> One of the most important concepts is that Splunk doesn't require
>>>>> or impose any structure on the incoming data. You can apply
>>>>> structure at search time, which means that as data changes in your
>>>>> data center (because new versions of software/hardware are
>>>>> installed, etc.), you will not need to re-do any regular
>>>>> expressions.
>>>>>
>>>>> Depending on daily data volumes, Splunk will run very well on
>>>>> commodity-type hardware. As your business grows, it can scale to
>>>>> handle it (to terabytes/day). If your daily volume doesn't exceed
>>>>> 500M/day, you can use the free version of the software.
>>>>>
>>>>> SEC is a low-level tool written in Perl that requires you to create
>>>>> regular expressions that match patterns in your data. It also
>>>>> requires quite a bit of scripting to make it work in many
>>>>> environments. As things change, you will need to update your
>>>>> regular expressions or things will break.
>>>>>
>>>>> SEC implements a state machine that operates over incoming data.
>>>>> There are many cool things you can do with it, but like David L
>>>>> says, it keeps all of its state in memory. Splunk does not
>>>>> currently implement a state machine in the same way as SEC.
>>>>> However, Splunk's search language, which is extremely robust, can
>>>>> handle many of the same use cases - especially with the
>>>>> introduction of real-time searching in version 4.1.
>>>>>
>>>>> I have not taken a look at logsurfer, so I can't comment on it.
>>>>> I'll check it out.
>>>>>
>>>>> I am more than happy to field questions directly if you wish.
>>>>>
>>>>> Rob Das
>>>>> r...@splunk.com
>>>>> Co-founder / Chief Architect
>>>>> Splunk, Inc.
>>>>>
>>>>> Paul DiSciascio wrote:
>>>>>>
>>>>>>> I'm looking for a good way to share log files on a centralized
>>>>>>> syslog server with about 10-20 people/developers who are familiar
>>>>>>> with the log formats but not very much with unix tools. They want
>>>>>>> an easy way to dig thru the logs and filter out junk they're not
>>>>>>> interested in, but still have near-realtime visibility.
>>>>>>> Obviously, splunk can do this, but it's pricey and their
>>>>>>> documentation seems to indicate that 20 concurrent users would be
>>>>>>> a lot to ask for without a lot of hardware.
>>>>>>> I really only need an interface capable of some rudimentary
>>>>>>> filtering, and if possible the ability to save those searches or
>>>>>>> filters. Does anyone have any suggestions short of writing this
>>>>>>> myself?
>>>>>>
>>>>>> You might be interested in SEC (simple event correlator) for this
>>>>>> purpose. But if you just want a presentation interface, logsurfer
>>>>>> might be more what you are looking for. SEC is much more like
>>>>>> splunk, while logsurfer is more of a realtime filtering monitor.
>>>>>
>>>>> I'm not sure what you have seen of splunk, but it and SEC have very
>>>>> little in common.
>>>>>
>>>>> splunk allows for arbitrary search queries against your past log
>>>>> data (and indexes it like crazy to make the search fairly
>>>>> efficient).
>>>>>
>>>>> SEC watches for patterns (or combinations of patterns) to appear in
>>>>> the logs and generates alerts.
>>>>>
>>>>> splunk can simulate SEC's functionality by doing repeated queries
>>>>> against the logs, but that's fairly inefficient.
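>>>>>
>>>>> (to make the 'simulate SEC' point concrete: the kind of search you
>>>>> would schedule every few minutes looks something like this, with
>>>>> illustrative field names,
>>>>>
>>>>>    sourcetype=syslog "Failed password" earliest=-10m
>>>>>    | stats count by src_ip
>>>>>    | where count > 5
>>>>>
>>>>> and you hang an alert off of any results it returns)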
>>>>>
>>>>> The answer to the original question: it depends a lot on the amount
>>>>> of data that you are working with.
>>>>>
>>>>> If you can fit it all in ram on a machine, then there are a lot of
>>>>> things that you can use to query it. The problem comes when you can
>>>>> no longer fit it in ram and have to go to disk; at that point you
>>>>> need an application that does a lot of indexing (and/or spreads the
>>>>> load across multiple machines, depending on how much data you have
>>>>> and how fast you want your answers).
>>>>>
>>>>> you say that your users are not familiar with unix tools; are they
>>>>> familiar with using SQL for queries?
>>>>>
>>>>> David Lang