Re: Schema Design

aaron morton Sat, 29 Jan 2011 23:56:28 -0800

This project may be what you are looking for, or provide some inspiration 
https://github.com/jbohman/logsandra


Cloud Kick has an example or rolling up time series data 
https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

The schema below sounds reasonable. If you will always bring back the entire 
log record, consider using a Standard CF rather than a Super CF. Then pack the 
log message using your favourite serialisation, e.g. JSON. 

Hope that helps.
Aaron

On 27 Jan 2011, at 16:26, Wangpei (Peter) wrote:

> I am also working on a system store logs from hundreds system.
> In my scenario, most query will like this: "let's look at login logs 
> (category EQ) of that proxy (host EQ) between this Monday and Wednesday(time 
> range)."
> My data model like this:
> . only 1 CF. that's enough for this scenario.
> . group logs from each host and day to one row. Key format is 
> "hostname.category.date"
> . store each log entry as a super column, super olumn name is TimeUUID of the 
> log. each attribute as a column.
> 
> Then this query can be done as 3 GET, no need to do key range scan.
> Then I can use RP instead of OPP. If I use OPP, I have to worry about load 
> balance myself. I hate that.
> However, if I need to do a time range access, I can still use column slice.
> 
> An additional benefit is, I can clean old logs very easily. We only store 
> logs in 1 year. Just deleting by keys can do this job well.
> 
> I think storing all logs for a host in a single row is not a good choice. 2 
> reason:
> 1, too few keys, so your data will not distributing well.
> 2, data under a key will always increase. So Cassandra have to do more 
> SSTable compaction.
> 
> -----邮件原件-----
> 发件人: William R Speirs [mailto:bill.spe...@gmail.com] 
> 发送时间: 2011年1月27日 9:15
> 收件人: user@cassandra.apache.org
> 主题: Re: Schema Design
> 
> It makes sense that the single row for a system (with a growing number of 
> columns) will reside on a single machine.
> 
> With that in mind, here is my updated schema:
> 
> - A single column family for all the messages. The row keys will be the 
> TimeUUID 
> of the message with the following columns: date/time (in UTC POSIX), system 
> name/id (with an index for fast/easy gets), the actual message payload.
> 
> - A column family for each system. The row keys will be UTC POSIX time with 1 
> second (maybe 1 minute) bucketing, and the column names will be the TimeUUID 
> of 
> any messages that were logged during that time bucket.
> 
> My only hesitation with this design is that buddhasystem warned that each 
> column 
> family, "is allocated a piece of memory on the server." I'm not sure what the 
> implications of this are and/or if this would be a problem if a I had a 
> number 
> of systems on the order of hundreds.
> 
> Thanks...
> 
> Bill-
> 
> On 01/26/2011 06:51 PM, Shu Zhang wrote:
>> Each row can have a maximum of 2 billion columns, which a logging system 
>> will probably hit eventually.
>> 
>> More importantly, you'll only have 1 row per set of system logs. Every row 
>> is stored on the same machine(s), which you means you'll definitely not be 
>> able to distribute your load very well.
>> ________________________________________
>> From: Bill Speirs [bill.spe...@gmail.com]
>> Sent: Wednesday, January 26, 2011 1:23 PM
>> To: user@cassandra.apache.org
>> Subject: Re: Schema Design
>> 
>> I like this approach, but I have 2 questions:
>> 
>> 1) what is the implications of continually adding columns to a single
>> row? I'm unsure how Cassandra is able to grow. I realize you can have
>> a virtually infinite number of columns, but what are the implications
>> of growing the number of columns over time?
>> 
>> 2) maybe it's just a restriction of the CLI, but how do I do issue a
>> slice request? Also, what if start (or end) columns don't exist? I'm
>> guessing it's smart enough to get the columns in that range.
>> 
>> Thanks!
>> 
>> Bill-
>> 
>> On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
>> <dmcne...@agentisenergy.com>  wrote:
>>> I would say in that case you might want  to try a  single column family
>>> where the key to the column is the system name.
>>> Then, you could name your columns as the timestamp.  Then when retrieving
>>> information from the data store you can can, in your slice request, specify
>>> your start column as  X and end  column as Y.
>>> Then you can use the stored column name to know when an event  occurred.
>>> 
>>> On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs<bill.spe...@gmail.com>  wrote:
>>>> 
>>>> I'm looking to use Cassandra to store log messages from various
>>>> systems. A log message only has a message (UTF8Type) and a data/time.
>>>> My thought is to create a column family for each system. The row key
>>>> will be a TimeUUIDType. Each row will have 7 columns: year, month,
>>>> day, hour, minute, second, and message. I then have indexes setup for
>>>> each of the date/time columns.
>>>> 
>>>> I was hoping this would allow me to answer queries like: "What are all
>>>> the log messages that were generated between X&  Y?" The problem is
>>>> that I can ONLY use the equals operator on these column values. For
>>>> example, I cannot issuing: get system_x where month>  1; gives me this
>>>> error: "No indexed columns present in index clause with operator EQ."
>>>> The equals operator works as expected though: get system_x where month
>>>> = 1;
>>>> 
>>>> What schema would allow me to get date ranges?
>>>> 
>>>> Thanks in advance...
>>>> 
>>>> Bill-
>>>> 
>>>> * ColumnFamily description *
>>>>    ColumnFamily: system_x_msg
>>>>      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>>>>      Row cache size / save period: 0.0/0
>>>>      Key cache size / save period: 200000.0/3600
>>>>      Memtable thresholds: 1.1671875/249/60
>>>>      GC grace seconds: 864000
>>>>      Compaction min/max thresholds: 4/32
>>>>      Read repair chance: 1.0
>>>>      Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
>>>> proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
>>>> proj_1_msg.7365636f6e64, proj_1_msg.79656172]
>>>>      Column Metadata:
>>>>        Column Name: year (year)
>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>          Index Type: KEYS
>>>>        Column Name: month (month)
>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>          Index Type: KEYS
>>>>        Column Name: second (second)
>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>          Index Type: KEYS
>>>>        Column Name: minute (minute)
>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>          Index Type: KEYS
>>>>        Column Name: hour (hour)
>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>          Index Type: KEYS
>>>>        Column Name: day (day)
>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>          Index Type: KEYS
>>> 
>>> 
>>> 
>>> --
>>> David McNelis
>>> Lead Software Engineer
>>> Agentis Energy
>>> www.agentisenergy.com
>>> o: 630.359.6395
>>> c: 219.384.5143
>>> A Smart Grid technology company focused on helping consumers of energy
>>> control an often under-managed resource.
>>> 
>>>

Re: Schema Design

Reply via email to