It seems to me that you need to put weights on your requirements, because I think it's going to be pretty tough to meet all of them with a single solution. For example, Redis can handle fast writes, but it doesn't have Map-Reduce queries. So you could write the data to Redis and have a separate program move it (look into Redis's awesome Pub/Sub features) from Redis to Riak or Hadoop, where you can then run your Map-Reduce queries. Just my two cents.
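To make the write-then-move idea concrete, here is a rough Python sketch, not a tested design. It uses a Redis list as a queue (RPUSH to write, LPOP to drain) rather than Pub/Sub, since Pub/Sub messages are not retained for a subscriber that is offline. The key name `events:pending`, the `record_event` and `drain` helpers, and the `sink` callable standing in for the Riak or Hadoop loader are all made up for illustration. `InMemoryQueue` is only a stand-in so the sketch runs without a server; I'm assuming a redis-py client, which has `rpush` and `lpop` methods, could be passed in its place.

```python
import json
from collections import deque


class InMemoryQueue:
    """Stand-in with the two list commands used below (rpush, lpop),
    so the sketch runs without a Redis server. Hypothetical helper."""

    def __init__(self):
        self._lists = {}

    def rpush(self, key, *values):
        queue = self._lists.setdefault(key, deque())
        queue.extend(values)
        return len(queue)

    def lpop(self, key):
        queue = self._lists.get(key)
        return queue.popleft() if queue else None


QUEUE_KEY = "events:pending"  # hypothetical key name


def record_event(client, event):
    """Fast path: append one event to the tail of the queue."""
    client.rpush(QUEUE_KEY, json.dumps(event))


def drain(client, sink, batch_size=1000):
    """Slow path: pop queued events and pass them to `sink` in batches.

    `sink` is any callable taking a list of events; in practice it would
    write to the store that runs the Map-Reduce queries.
    Returns the number of events moved.
    """
    moved = 0
    batch = []
    while True:
        raw = client.lpop(QUEUE_KEY)
        if raw is None:  # queue is empty
            break
        batch.append(json.loads(raw))
        if len(batch) >= batch_size:
            sink(batch)
            moved += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        sink(batch)
        moved += len(batch)
    return moved


if __name__ == "__main__":
    client = InMemoryQueue()
    record_event(client, {"method": "GET", "response_code": 200})
    stored = []
    drain(client, stored.extend)
    print(stored)
```

Note that an event popped from the queue is lost if the mover crashes before `sink` finishes, so a real mover would need some form of acknowledgement or retry.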
--Andrew

On Tue, Jun 28, 2011 at 8:17 AM, Evans, Matthew <mev...@verivue.com> wrote:
> Hi,
>
> I've been looking at a number of technologies for a simple application.
>
> We are saving large amounts of data to disc; this data is event-log/sensor
> data which may look something like:
>
> Version, Account, RequestID, Timestamp, Duration, IPAddr, Method, URL, HTTP
> Version, Response_Code, Size, Hit_Rate, Range_From, Range_To, Referrer,
> Agent, Content_Type, Accept_Encoding, Redirect_Code, Progress
>
>
> For Example:
>
> 1 agora 27050938271286652285000000000368375 1289589216.893 1989.938
> 79.7.41.29 GET http://bi.sciagnij.pl/0/4/TWEE_Upgrade.exe HTTP/1.1 200
> 953772216 725098308 713834308 -1 -1 -
> Mozilla/4.0(compatible;MSIE6.0;WindowsNT5.1) application/octet-stream gzip -
> 0 progress
>
> The data has no specific key to index off (we will be doing some parsing of
> the data on ingest to get basic information allowing for fast queries, but
> this is outside of Riak).
>
> Really the issue is that we need to be able to apply "analytic" (map-reduce)
> type queries on the data. These queries do not need to be real-time, but
> should not take days to run.
>
> For example: All GET requests for a specific URL within a specific time range.
>
> The amount of data saved could be quite large (forcing us to use InnoDB
> instead of BitCask). One estimate is ~1 billion records. Architecturally this
> data could be split over multiple nodes.
>
> The choice of client-side language is still open, with Erlang as the current
> favorite. As I see it the advantages of Riak are:
>
> 1) HTTP based API as well as Erlang and other client APIs (the system has a
> mix of programming languages including Python and C/C++).
>
> 2) More flexible/extensible data model (Cassandra requires you to predefine
> the key spaces, columns etc ahead of time)
>
> 3) Easier to install/setup without the apparent bloat and complexity of
> Cassandra (which also includes Java setup)
>
> 4) Map-reduce queries
>
> The disadvantages of Riak are:
>
> 1) Write performance. We need to handle ~50,000 writes per second.
>
> I would recommend running our client app from within the same Erlang VM as
> Riak so hopefully we can gain something here. Alternatively use innostore
> Erlang API directly for writes.
>
> Questions:
>
> 1) Is Riak a good database for this application?
>
> 2) Can we write to InnoDB directly and still leverage the map-reduce queries
> on the data?
>
> Regards
>
> Matt
>
>
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com