This project may be what you are looking for, or at least provide some inspiration: https://github.com/jbohman/logsandra
Cloud Kick has an example of rolling up time series data:
https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

The schema below sounds reasonable. If you will always bring back the entire log record, consider using a Standard CF rather than a Super CF. Then pack the log message using your favourite serialisation, e.g. JSON.

Hope that helps.
Aaron

On 27 Jan 2011, at 16:26, Wangpei (Peter) wrote:

> I am also working on a system that stores logs from hundreds of systems.
> In my scenario, most queries will look like this: "let's look at login logs
> (category EQ) of that proxy (host EQ) between this Monday and Wednesday
> (time range)."
> My data model is like this:
> . Only 1 CF. That's enough for this scenario.
> . Group logs from each host and day into one row. Key format is
>   "hostname.category.date".
> . Store each log entry as a super column; the super column name is the
>   TimeUUID of the log. Each attribute is a column.
>
> Then this query can be done as 3 GETs, with no need to do a key range scan.
> Then I can use RP instead of OPP. If I use OPP, I have to worry about load
> balance myself. I hate that.
> However, if I need to do a time range access, I can still use a column slice.
>
> An additional benefit is that I can clean old logs very easily. We only store
> logs for 1 year. Just deleting by keys can do this job well.
>
> I think storing all logs for a host in a single row is not a good choice, for
> 2 reasons:
> 1. Too few keys, so your data will not distribute well.
> 2. Data under a key will always increase, so Cassandra has to do more
>    SSTable compaction.
>
> -----Original Message-----
> From: William R Speirs [mailto:bill.spe...@gmail.com]
> Sent: 27 January 2011 9:15
> To: user@cassandra.apache.org
> Subject: Re: Schema Design
>
> It makes sense that the single row for a system (with a growing number of
> columns) will reside on a single machine.
>
> With that in mind, here is my updated schema:
>
> - A single column family for all the messages.
>   The row keys will be the TimeUUID of the message, with the following
>   columns: date/time (in UTC POSIX), system name/id (with an index for
>   fast/easy gets), and the actual message payload.
>
> - A column family for each system. The row keys will be UTC POSIX time with 1
>   second (maybe 1 minute) bucketing, and the column names will be the TimeUUID
>   of any messages that were logged during that time bucket.
>
> My only hesitation with this design is that buddhasystem warned that each
> column family "is allocated a piece of memory on the server." I'm not sure
> what the implications of this are and/or if this would be a problem if I had
> a number of systems on the order of hundreds.
>
> Thanks...
>
> Bill-
>
> On 01/26/2011 06:51 PM, Shu Zhang wrote:
>> Each row can have a maximum of 2 billion columns, which a logging system
>> will probably hit eventually.
>>
>> More importantly, you'll only have 1 row per set of system logs. Every row
>> is stored on the same machine(s), which means you'll definitely not be
>> able to distribute your load very well.
>> ________________________________________
>> From: Bill Speirs [bill.spe...@gmail.com]
>> Sent: Wednesday, January 26, 2011 1:23 PM
>> To: user@cassandra.apache.org
>> Subject: Re: Schema Design
>>
>> I like this approach, but I have 2 questions:
>>
>> 1) What are the implications of continually adding columns to a single
>> row? I'm unsure how Cassandra is able to grow. I realize you can have
>> a virtually infinite number of columns, but what are the implications
>> of growing the number of columns over time?
>>
>> 2) Maybe it's just a restriction of the CLI, but how do I issue a
>> slice request? Also, what if start (or end) columns don't exist? I'm
>> guessing it's smart enough to get the columns in that range.
>>
>> Thanks!
>>
>> Bill-
>>
>> On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
>> <dmcne...@agentisenergy.com> wrote:
>>> I would say in that case you might want to try a single column family
>>> where the key to the column is the system name.
>>> Then, you could name your columns as the timestamp. Then when retrieving
>>> information from the data store you can, in your slice request, specify
>>> your start column as X and end column as Y.
>>> Then you can use the stored column name to know when an event occurred.
>>>
>>> On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs <bill.spe...@gmail.com> wrote:
>>>>
>>>> I'm looking to use Cassandra to store log messages from various
>>>> systems. A log message only has a message (UTF8Type) and a date/time.
>>>> My thought is to create a column family for each system. The row key
>>>> will be a TimeUUIDType. Each row will have 7 columns: year, month,
>>>> day, hour, minute, second, and message. I then have indexes set up for
>>>> each of the date/time columns.
>>>>
>>>> I was hoping this would allow me to answer queries like: "What are all
>>>> the log messages that were generated between X & Y?" The problem is
>>>> that I can ONLY use the equals operator on these column values. For
>>>> example, issuing: get system_x where month > 1; gives me this
>>>> error: "No indexed columns present in index clause with operator EQ."
>>>> The equals operator works as expected though: get system_x where month
>>>> = 1;
>>>>
>>>> What schema would allow me to get date ranges?
>>>>
>>>> Thanks in advance...
>>>>
>>>> Bill-
>>>>
>>>> * ColumnFamily description *
>>>> ColumnFamily: system_x_msg
>>>>   Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>>>>   Row cache size / save period: 0.0/0
>>>>   Key cache size / save period: 200000.0/3600
>>>>   Memtable thresholds: 1.1671875/249/60
>>>>   GC grace seconds: 864000
>>>>   Compaction min/max thresholds: 4/32
>>>>   Read repair chance: 1.0
>>>>   Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
>>>>   proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
>>>>   proj_1_msg.7365636f6e64, proj_1_msg.79656172]
>>>>   Column Metadata:
>>>>     Column Name: year (year)
>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>       Index Type: KEYS
>>>>     Column Name: month (month)
>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>       Index Type: KEYS
>>>>     Column Name: second (second)
>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>       Index Type: KEYS
>>>>     Column Name: minute (minute)
>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>       Index Type: KEYS
>>>>     Column Name: hour (hour)
>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>       Index Type: KEYS
>>>>     Column Name: day (day)
>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>       Index Type: KEYS
>>>
>>> --
>>> David McNelis
>>> Lead Software Engineer
>>> Agentis Energy
>>> www.agentisenergy.com
>>> o: 630.359.6395
>>> c: 219.384.5143
>>> A Smart Grid technology company focused on helping consumers of energy
>>> control an often under-managed resource.
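
For anyone following the thread, here is a rough sketch of the "hostname.category.date" row-key scheme Peter describes, together with the JSON packing suggested for a Standard CF. It only builds keys and values and does not talk to Cassandra. The function names, the ISO date format in the key, and the sample host and attribute names are assumptions for illustration; the thread does not specify them.

```python
import json
from datetime import date, timedelta


def row_keys(host, category, start, end):
    """Return one "hostname.category.date" key per day, start to end inclusive.

    The ISO date format (YYYY-MM-DD) is an assumption; the thread only
    gives the general shape of the key.
    """
    days = (end - start).days
    return [
        f"{host}.{category}.{(start + timedelta(days=n)).isoformat()}"
        for n in range(days + 1)
    ]


def pack_entry(attributes):
    """Serialise a log entry's attributes into a single column value (JSON)."""
    return json.dumps(attributes, sort_keys=True)


# Monday 24 Jan 2011 to Wednesday 26 Jan 2011: three rows, hence three GETs.
keys = row_keys("proxy1", "login", date(2011, 1, 24), date(2011, 1, 26))

# One column value per log entry; the column name would be the entry's TimeUUID.
value = pack_entry({"user": "alice", "result": "ok"})
```

Because each day is its own row, the keys for a query are computed on the client rather than found by a key range scan, which is why the RandomPartitioner remains usable. Expiring a day of logs is a delete of that day's keys.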