fwiw, https://github.com/thobbs/logsandra is more recent.
2011/1/30 aaron morton <aa...@thelastpickle.com>:
> This project may be what you are looking for, or provide some inspiration:
> https://github.com/jbohman/logsandra
>
> Cloud Kick has an example of rolling up time series data:
> https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
>
> The schema below sounds reasonable. If you will always bring back the entire
> log record, consider using a Standard CF rather than a Super CF. Then pack
> the log message using your favourite serialisation, e.g. JSON.
>
> Hope that helps.
> Aaron
>
> On 27 Jan 2011, at 16:26, Wangpei (Peter) wrote:
>
>> I am also working on a system that stores logs from hundreds of systems.
>> In my scenario, most queries look like this: "let's look at the login logs
>> (category EQ) of that proxy (host EQ) between this Monday and Wednesday
>> (time range)."
>> My data model looks like this:
>> - Only 1 CF; that's enough for this scenario.
>> - Group the logs from each host and day into one row. The key format is
>>   "hostname.category.date".
>> - Store each log entry as a super column whose name is the TimeUUID of
>>   the log, with each attribute as a column.
>>
>> That query can then be done as 3 GETs, with no need for a key range scan,
>> so I can use RP instead of OPP. If I used OPP, I would have to worry about
>> load balancing myself, and I hate that. If I still need time-range access
>> within a day, I can use a column slice.
>>
>> An additional benefit is that I can clean out old logs very easily. We only
>> keep logs for 1 year, so deleting by key does the job well.
>>
>> I think storing all the logs for a host in a single row is not a good
>> choice, for 2 reasons:
>> 1. Too few keys, so your data will not distribute well.
>> 2. The data under a key will always grow, so Cassandra has to do more
>>    SSTable compaction.
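Peter's "hostname.category.date" key scheme can be sketched in a few lines of plain Python (the host/category names below are made up, and a real read would go through a Cassandra client; this only shows how the Monday-to-Wednesday query collapses into 3 direct GET keys under the RandomPartitioner):

```python
from datetime import date, timedelta

def row_keys(host, category, start, end):
    """One 'hostname.category.date' row key per day in [start, end].
    A time-range query becomes a handful of direct GETs, so no
    OPP-style key range scan is needed."""
    days = (end - start).days
    return ["%s.%s.%s" % (host, category, (start + timedelta(days=d)).isoformat())
            for d in range(days + 1)]

# "login logs of that proxy between this Monday and Wednesday" -> 3 GETs
keys = row_keys("proxy01", "login", date(2011, 1, 24), date(2011, 1, 26))
# keys == ['proxy01.login.2011-01-24',
#          'proxy01.login.2011-01-25',
#          'proxy01.login.2011-01-26']
```

Expiring old data is the same calculation in reverse: enumerate the keys older than the retention window and delete them by key.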
>>
>> -----Original Message-----
>> From: William R Speirs [mailto:bill.spe...@gmail.com]
>> Sent: 27 January 2011 9:15
>> To: user@cassandra.apache.org
>> Subject: Re: Schema Design
>>
>> It makes sense that the single row for a system (with a growing number of
>> columns) will reside on a single machine.
>>
>> With that in mind, here is my updated schema:
>>
>> - A single column family for all the messages. The row key will be the
>>   TimeUUID of the message, with the following columns: date/time (in UTC
>>   POSIX), system name/id (with an index for fast/easy gets), and the
>>   actual message payload.
>>
>> - A column family for each system. The row keys will be UTC POSIX time
>>   with 1-second (maybe 1-minute) bucketing, and the column names will be
>>   the TimeUUIDs of any messages that were logged during that time bucket.
>>
>> My only hesitation with this design is that buddhasystem warned that each
>> column family "is allocated a piece of memory on the server." I'm not sure
>> what the implications of this are, and/or whether this would be a problem
>> if I had a number of systems on the order of hundreds.
>>
>> Thanks...
>>
>> Bill-
>>
>> On 01/26/2011 06:51 PM, Shu Zhang wrote:
>>> Each row can have a maximum of 2 billion columns, which a logging system
>>> will probably hit eventually.
>>>
>>> More importantly, you'll only have 1 row per set of system logs. Every
>>> row is stored on the same machine(s), which means you'll definitely not
>>> be able to distribute your load very well.
>>> ________________________________________
>>> From: Bill Speirs [bill.spe...@gmail.com]
>>> Sent: Wednesday, January 26, 2011 1:23 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Schema Design
>>>
>>> I like this approach, but I have 2 questions:
>>>
>>> 1) What are the implications of continually adding columns to a single
>>> row? I'm unsure how Cassandra is able to grow.
>>> I realize you can have a virtually infinite number of columns, but what
>>> are the implications of growing the number of columns over time?
>>>
>>> 2) Maybe it's just a restriction of the CLI, but how do I issue a slice
>>> request? Also, what if the start (or end) columns don't exist? I'm
>>> guessing it's smart enough to get the columns in that range.
>>>
>>> Thanks!
>>>
>>> Bill-
>>>
>>> On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
>>> <dmcne...@agentisenergy.com> wrote:
>>>> I would say in that case you might want to try a single column family
>>>> where the row key is the system name.
>>>> Then you could use the timestamp as the column name. When retrieving
>>>> information from the data store you can, in your slice request, specify
>>>> your start column as X and your end column as Y.
>>>> Then you can use the stored column name to know when an event occurred.
>>>>
>>>> On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs <bill.spe...@gmail.com> wrote:
>>>>>
>>>>> I'm looking to use Cassandra to store log messages from various
>>>>> systems. A log message only has a message (UTF8Type) and a date/time.
>>>>> My thought is to create a column family for each system. The row key
>>>>> will be a TimeUUIDType. Each row will have 7 columns: year, month,
>>>>> day, hour, minute, second, and message. I then have indexes set up for
>>>>> each of the date/time columns.
>>>>>
>>>>> I was hoping this would allow me to answer queries like: "What are all
>>>>> the log messages that were generated between X & Y?" The problem is
>>>>> that I can ONLY use the equals operator on these column values. For
>>>>> example, I cannot issue "get system_x where month > 1"; it gives me
>>>>> this error: "No indexed columns present in index clause with operator
>>>>> EQ." The equals operator works as expected, though: get system_x where
>>>>> month = 1;
>>>>>
>>>>> What schema would allow me to get date ranges?
>>>>>
>>>>> Thanks in advance...
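David's suggestion above (timestamp column names, slice from start column X to end column Y) and Bill's question about nonexistent start/end columns can be illustrated with a toy in-memory model. This is not a Cassandra client, just a sketch of how a slice behaves over sorted column names: the bounds need not be existing columns, and whatever falls inside the (inclusive) range comes back.

```python
import bisect

class Row:
    """Toy model of one Cassandra row: column names kept sorted,
    slice(start, finish) returns everything in the inclusive range."""
    def __init__(self):
        self.names = []    # sorted column names (here: timestamps)
        self.values = {}

    def insert(self, name, value):
        if name not in self.values:
            bisect.insort(self.names, name)
        self.values[name] = value

    def slice(self, start, finish):
        lo = bisect.bisect_left(self.names, start)
        hi = bisect.bisect_right(self.names, finish)
        return [(n, self.values[n]) for n in self.names[lo:hi]]

row = Row()
for ts, msg in [(100, "boot"), (150, "login"), (220, "error")]:
    row.insert(ts, msg)

hits = row.slice(90, 160)  # 90 and 160 are not columns; the slice still works
# hits == [(100, 'boot'), (150, 'login')]
```

So yes, the "start (or end) columns don't exist" case is fine: the comparator only uses the bounds to locate positions in the sorted column names.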
>>>>>
>>>>> Bill-
>>>>>
>>>>> * ColumnFamily description *
>>>>> ColumnFamily: system_x_msg
>>>>>   Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>>>>>   Row cache size / save period: 0.0/0
>>>>>   Key cache size / save period: 200000.0/3600
>>>>>   Memtable thresholds: 1.1671875/249/60
>>>>>   GC grace seconds: 864000
>>>>>   Compaction min/max thresholds: 4/32
>>>>>   Read repair chance: 1.0
>>>>>   Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
>>>>>     proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
>>>>>     proj_1_msg.7365636f6e64, proj_1_msg.79656172]
>>>>>   Column Metadata:
>>>>>     Column Name: year (year)
>>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>       Index Type: KEYS
>>>>>     Column Name: month (month)
>>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>       Index Type: KEYS
>>>>>     Column Name: second (second)
>>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>       Index Type: KEYS
>>>>>     Column Name: minute (minute)
>>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>       Index Type: KEYS
>>>>>     Column Name: hour (hour)
>>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>       Index Type: KEYS
>>>>>     Column Name: day (day)
>>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>       Index Type: KEYS
>>>>
>>>> --
>>>> David McNelis
>>>> Lead Software Engineer
>>>> Agentis Energy
>>>> www.agentisenergy.com
>>>> o: 630.359.6395
>>>> c: 219.384.5143
>>>> A Smart Grid technology company focused on helping consumers of energy
>>>> control an often under-managed resource.
>>>>
>
--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
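As a footnote to Bill's updated schema earlier in the thread (a per-system CF keyed by time bucket, with TimeUUID column names pointing into the messages CF), the bucketing logic is small. A minimal sketch, assuming a 60-second bucket and a plain dict standing in for the column family (both assumptions, not anything from the thread):

```python
import uuid

BUCKET_SECONDS = 60  # 1-minute bucketing; 1-second works the same way

def bucket_key(system, posix_ts):
    """Row key for the per-system index CF: system id plus the bucket start."""
    return "%s:%d" % (system, posix_ts - (posix_ts % BUCKET_SECONDS))

def index_message(index, system, posix_ts):
    """Record a message's TimeUUID as a column under its time bucket.
    `index` is a dict standing in for the per-system column family."""
    msg_id = uuid.uuid1()  # TimeUUID: column names sort by creation time
    index.setdefault(bucket_key(system, posix_ts), []).append(msg_id)
    return msg_id

index = {}
msg_id = index_message(index, "sysA", 1296000125)
# lands in row "sysA:1296000120" (the enclosing minute)
```

A date-range query then enumerates the bucket keys between the start and end times, multigets those rows, and fetches the full messages from the messages CF by TimeUUID; no secondary-index range operators are needed.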