fwiw, https://github.com/thobbs/logsandra is more recent.
2011/1/30 aaron morton <aa...@thelastpickle.com>:
> This project may be what you are looking for, or provide some inspiration:
> https://github.com/jbohman/logsandra
>
> Cloud Kick has an example of rolling up time series data:
> https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
>
> The schema below sounds reasonable. If you will always bring back the entire
> log record, consider using a Standard CF rather than a Super CF. Then pack
> the log message using your favourite serialisation, e.g. JSON.
>
> Hope that helps.
> Aaron
>
> On 27 Jan 2011, at 16:26, Wangpei (Peter) wrote:
>
>> I am also working on a system that stores logs from hundreds of systems.
>> In my scenario, most queries look like this: "let's look at the login logs
>> (category EQ) of that proxy (host EQ) between this Monday and Wednesday
>> (time range)."
>> My data model looks like this:
>> - Only 1 CF; that's enough for this scenario.
>> - Group the logs from each host and day into one row. The key format is
>>   "hostname.category.date".
>> - Store each log entry as a super column whose name is the TimeUUID of
>>   the log, with each attribute as a column.
>>
>> That query can then be done as 3 GETs, with no need for a key range scan,
>> so I can use RP instead of OPP. If I used OPP, I would have to worry about
>> load balancing myself, and I hate that. If I still need time-range access
>> within a day, I can use a column slice.
>>
>> An additional benefit is that I can clean out old logs very easily. We only
>> keep logs for 1 year, so deleting by key does the job well.
>>
>> I think storing all the logs for a host in a single row is not a good
>> choice, for 2 reasons:
>> 1. Too few keys, so your data will not distribute well.
>> 2. The data under a key will always grow, so Cassandra has to do more
>>    SSTable compaction.
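Peter's "hostname.category.date" key scheme can be sketched in a few lines of plain Python (the host/category names below are made up, and a real read would go through a Cassandra client; this only shows how the Monday-to-Wednesday query collapses into 3 direct GET keys under the RandomPartitioner):

```python
from datetime import date, timedelta

def row_keys(host, category, start, end):
    """One 'hostname.category.date' row key per day in [start, end].
    A time-range query becomes a handful of direct GETs, so no
    OPP-style key range scan is needed."""
    days = (end - start).days
    return ["%s.%s.%s" % (host, category, (start + timedelta(days=d)).isoformat())
            for d in range(days + 1)]

# "login logs of that proxy between this Monday and Wednesday" -> 3 GETs
keys = row_keys("proxy01", "login", date(2011, 1, 24), date(2011, 1, 26))
# keys == ['proxy01.login.2011-01-24',
#          'proxy01.login.2011-01-25',
#          'proxy01.login.2011-01-26']
```

Expiring old data is the same calculation in reverse: enumerate the keys older than the retention window and delete them by key.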
>>
>> -----Original Message-----
>> From: William R Speirs [mailto:bill.spe...@gmail.com]
>> Sent: 27 January 2011 9:15
>> To: user@cassandra.apache.org
>> Subject: Re: Schema Design
>>
>> It makes sense that the single row for a system (with a growing number of
>> columns) will reside on a single machine.
>>
>> With that in mind, here is my updated schema:
>>
>> - A single column family for all the messages. The row key will be the
>>   TimeUUID of the message, with the following columns: date/time (in UTC
>>   POSIX), system name/id (with an index for fast/easy gets), and the
>>   actual message payload.
>>
>> - A column family for each system. The row keys will be UTC POSIX time
>>   with 1-second (maybe 1-minute) bucketing, and the column names will be
>>   the TimeUUIDs of any messages that were logged during that time bucket.
>>
>> My only hesitation with this design is that buddhasystem warned that each
>> column family "is allocated a piece of memory on the server." I'm not sure
>> what the implications of this are, and/or whether this would be a problem
>> if I had a number of systems on the order of hundreds.
>>
>> Thanks...
>>
>> Bill-
>>
>> On 01/26/2011 06:51 PM, Shu Zhang wrote:
>>> Each row can have a maximum of 2 billion columns, which a logging system
>>> will probably hit eventually.
>>>
>>> More importantly, you'll only have 1 row per set of system logs. Every
>>> row is stored on the same machine(s), which means you'll definitely not
>>> be able to distribute your load very well.
>>> ________________________________________
>>> From: Bill Speirs [bill.spe...@gmail.com]
>>> Sent: Wednesday, January 26, 2011 1:23 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Schema Design
>>>
>>> I like this approach, but I have 2 questions:
>>>
>>> 1) What are the implications of continually adding columns to a single
>>> row? I'm unsure how Cassandra is able to grow.
>>> I realize you can have a virtually infinite number of columns, but what
>>> are the implications of growing the number of columns over time?
>>>
>>> 2) Maybe it's just a restriction of the CLI, but how do I issue a slice
>>> request? Also, what if the start (or end) columns don't exist? I'm
>>> guessing it's smart enough to get the columns in that range.
>>>
>>> Thanks!
>>>
>>> Bill-
>>>
>>> On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
>>> <dmcne...@agentisenergy.com> wrote:
>>>> I would say in that case you might want to try a single column family
>>>> where the row key is the system name.
>>>> Then you could use the timestamp as the column name. When retrieving
>>>> information from the data store you can, in your slice request, specify
>>>> your start column as X and your end column as Y.
>>>> Then you can use the stored column name to know when an event occurred.
>>>>
>>>> On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs <bill.spe...@gmail.com> wrote:
>>>>>
>>>>> I'm looking to use Cassandra to store log messages from various
>>>>> systems. A log message only has a message (UTF8Type) and a date/time.
>>>>> My thought is to create a column family for each system. The row key
>>>>> will be a TimeUUIDType. Each row will have 7 columns: year, month,
>>>>> day, hour, minute, second, and message. I then have indexes set up for
>>>>> each of the date/time columns.
>>>>>
>>>>> I was hoping this would allow me to answer queries like: "What are all
>>>>> the log messages that were generated between X & Y?" The problem is
>>>>> that I can ONLY use the equals operator on these column values. For
>>>>> example, I cannot issue "get system_x where month > 1"; it gives me
>>>>> this error: "No indexed columns present in index clause with operator
>>>>> EQ." The equals operator works as expected, though: get system_x where
>>>>> month = 1;
>>>>>
>>>>> What schema would allow me to get date ranges?
>>>>>
>>>>> Thanks in advance...
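David's suggestion above (timestamp column names, slice from start column X to end column Y) and Bill's question about nonexistent start/end columns can be illustrated with a toy in-memory model. This is not a Cassandra client, just a sketch of how a slice behaves over sorted column names: the bounds need not be existing columns, and whatever falls inside the (inclusive) range comes back.

```python
import bisect

class Row:
    """Toy model of one Cassandra row: column names kept sorted,
    slice(start, finish) returns everything in the inclusive range."""
    def __init__(self):
        self.names = []    # sorted column names (here: timestamps)
        self.values = {}

    def insert(self, name, value):
        if name not in self.values:
            bisect.insort(self.names, name)
        self.values[name] = value

    def slice(self, start, finish):
        lo = bisect.bisect_left(self.names, start)
        hi = bisect.bisect_right(self.names, finish)
        return [(n, self.values[n]) for n in self.names[lo:hi]]

row = Row()
for ts, msg in [(100, "boot"), (150, "login"), (220, "error")]:
    row.insert(ts, msg)

hits = row.slice(90, 160)  # 90 and 160 are not columns; the slice still works
# hits == [(100, 'boot'), (150, 'login')]
```

So yes, the "start (or end) columns don't exist" case is fine: the comparator only uses the bounds to locate positions in the sorted column names.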
>>>>>
>>>>> Bill-
>>>>>
>>>>> * ColumnFamily description *
>>>>> ColumnFamily: system_x_msg
>>>>>   Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>>>>>   Row cache size / save period: 0.0/0
>>>>>   Key cache size / save period: 200000.0/3600
>>>>>   Memtable thresholds: 1.1671875/249/60
>>>>>   GC grace seconds: 864000
>>>>>   Compaction min/max thresholds: 4/32
>>>>>   Read repair chance: 1.0
>>>>>   Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
>>>>>     proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
>>>>>     proj_1_msg.7365636f6e64, proj_1_msg.79656172]
>>>>>   Column Metadata:
>>>>>     Column Name: year (year)
>>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>       Index Type: KEYS
>>>>>     Column Name: month (month)
>>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>       Index Type: KEYS
>>>>>     Column Name: second (second)
>>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>       Index Type: KEYS
>>>>>     Column Name: minute (minute)
>>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>       Index Type: KEYS
>>>>>     Column Name: hour (hour)
>>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>       Index Type: KEYS
>>>>>     Column Name: day (day)
>>>>>       Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>       Index Type: KEYS
>>>>
>>>> --
>>>> David McNelis
>>>> Lead Software Engineer
>>>> Agentis Energy
>>>> www.agentisenergy.com
>>>> o: 630.359.6395
>>>> c: 219.384.5143
>>>> A Smart Grid technology company focused on helping consumers of energy
>>>> control an often under-managed resource.
>>>>
>
--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
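As a footnote to Bill's updated schema earlier in the thread (a per-system CF keyed by time bucket, with TimeUUID column names pointing into the messages CF), the bucketing logic is small. A minimal sketch, assuming a 60-second bucket and a plain dict standing in for the column family (both assumptions, not anything from the thread):

```python
import uuid

BUCKET_SECONDS = 60  # 1-minute bucketing; 1-second works the same way

def bucket_key(system, posix_ts):
    """Row key for the per-system index CF: system id plus the bucket start."""
    return "%s:%d" % (system, posix_ts - (posix_ts % BUCKET_SECONDS))

def index_message(index, system, posix_ts):
    """Record a message's TimeUUID as a column under its time bucket.
    `index` is a dict standing in for the per-system column family."""
    msg_id = uuid.uuid1()  # TimeUUID: column names sort by creation time
    index.setdefault(bucket_key(system, posix_ts), []).append(msg_id)
    return msg_id

index = {}
msg_id = index_message(index, "sysA", 1296000125)
# lands in row "sysA:1296000120" (the enclosing minute)
```

A date-range query then enumerates the bucket keys between the start and end times, multigets those rows, and fetches the full messages from the messages CF by TimeUUID; no secondary-index range operators are needed.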