Re: Schema Design

William R Speirs Wed, 26 Jan 2011 17:15:11 -0800

It makes sense that the single row for a system (with a growing number of columns) will reside on a single machine.


With that in mind, here is my updated schema:

- A single column family for all the messages. The row keys will be the TimeUUID of the message with the following columns: date/time (in UTC POSIX), system name/id (with an index for fast/easy gets), the actual message payload.

- A column family for each system. The row keys will be UTC POSIX time with 1 second (maybe 1 minute) bucketing, and the column names will be the TimeUUID of any messages that were logged during that time bucket.
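To make the bucketing concrete, here is a rough Python sketch of the two-column-family layout described above. It is only an in-memory model for a single system, not Cassandra client code: the dictionaries `messages` and `system_index`, the helper names, and the 60-second bucket width are all assumptions made for illustration.

```python
import uuid
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed width; the schema above considers 1 second or 1 minute


def bucket_for(posix_time, width=BUCKET_SECONDS):
    """Round a UTC POSIX timestamp down to the start of its bucket."""
    return int(posix_time) - int(posix_time) % width


def buckets_between(start, end, width=BUCKET_SECONDS):
    """List every bucket row key that overlaps the closed interval [start, end]."""
    return list(range(bucket_for(start, width), bucket_for(end, width) + 1, width))


# In-memory stand-ins for the two column families (hypothetical names).
messages = {}                     # message id -> {"time", "system", "payload"}
system_index = defaultdict(dict)  # bucket row key -> {message id: ""}


def log_message(posix_time, system, payload):
    """Store the message once, then record its id under the matching time bucket."""
    # uuid1() is time-based but uses the current clock, not posix_time;
    # a real TimeUUID would be derived from the message's own timestamp.
    msg_id = uuid.uuid1()
    messages[msg_id] = {"time": posix_time, "system": system, "payload": payload}
    system_index[bucket_for(posix_time)][msg_id] = ""
    return msg_id


def messages_between(start, end):
    """Read every candidate bucket, then filter down to the exact time range."""
    found = []
    for bucket in buckets_between(start, end):
        for msg_id in system_index.get(bucket, {}):
            msg = messages[msg_id]
            if start <= msg["time"] <= end:
                found.append(msg)
    return sorted(found, key=lambda m: m["time"])
```

A range query then costs one row read per bucket in the range, so the bucket width trades row size against the number of rows touched per query.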

My only hesitation with this design is that buddhasystem warned that each column family, "is allocated a piece of memory on the server." I'm not sure what the implications of this are and/or if this would be a problem if I had a number of systems on the order of hundreds.


Thanks...

Bill-

On 01/26/2011 06:51 PM, Shu Zhang wrote:

Each row can have a maximum of 2 billion columns, which a logging system will 
probably hit eventually.

More importantly, you'll only have 1 row per set of system logs. Every row is 
stored on the same machine(s), which means you'll definitely not be able to 
distribute your load very well.
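As a back-of-the-envelope check on the 2 billion column limit, the following Python sketch estimates how long a single row would last if every log message adds one column. The message rates are made-up examples, not measurements from any real system.

```python
MAX_COLUMNS_PER_ROW = 2_000_000_000  # per-row limit quoted above
SECONDS_PER_DAY = 86_400


def days_until_row_full(messages_per_second):
    """Days until one row reaches the limit at a steady message rate."""
    return MAX_COLUMNS_PER_ROW / messages_per_second / SECONDS_PER_DAY


# Hypothetical rates, chosen only for illustration.
for rate in (1, 100, 1000):
    print(f"{rate} msg/s: about {days_until_row_full(rate):.0f} days")
```

At 100 messages per second the row would fill in roughly 231 days, while at 1 message per second it would take decades, so how soon the limit matters depends heavily on log volume.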
________________________________________
From: Bill Speirs [bill.spe...@gmail.com]
Sent: Wednesday, January 26, 2011 1:23 PM
To: user@cassandra.apache.org
Subject: Re: Schema Design

I like this approach, but I have 2 questions:

1) what are the implications of continually adding columns to a single
row? I'm unsure how Cassandra is able to grow. I realize you can have
a virtually infinite number of columns, but what are the implications
of growing the number of columns over time?

2) maybe it's just a restriction of the CLI, but how do I issue a
slice request? Also, what if start (or end) columns don't exist? I'm
guessing it's smart enough to get the columns in that range.

Thanks!

Bill-

On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
<dmcne...@agentisenergy.com> wrote:

I would say in that case you might want to try a single column family
where the key to the column is the system name.
Then, you could name your columns as the timestamp. Then when retrieving
information from the data store you can, in your slice request, specify
your start column as X and end column as Y.
Then you can use the stored column name to know when an event occurred.
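The slice idea can be modelled in a few lines of Python. This is not Cassandra client code; it only imitates a row whose column names are kept sorted, using the standard `bisect` module. It assumes, as is my understanding of slice ranges, that both bounds are inclusive and that neither bound has to exist as a column. The function name and the sample timestamps are hypothetical.

```python
import bisect


def slice_columns(sorted_names, start, end):
    """Return the column names in [start, end] from an already sorted list."""
    lo = bisect.bisect_left(sorted_names, start)   # first name >= start
    hi = bisect.bisect_right(sorted_names, end)    # one past the last name <= end
    return sorted_names[lo:hi]


# Timestamps used as column names; 104 and 125 are not present in the row.
column_names = [100, 105, 110, 120, 135]
print(slice_columns(column_names, 104, 125))  # [105, 110, 120]
```

Because the names are ordered, the start and end values act only as boundaries, which is why a slice over timestamps that were never written still returns the columns between them.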

On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs <bill.spe...@gmail.com> wrote:


I'm looking to use Cassandra to store log messages from various
systems. A log message only has a message (UTF8Type) and a date/time.
My thought is to create a column family for each system. The row key
will be a TimeUUIDType. Each row will have 7 columns: year, month,
day, hour, minute, second, and message. I then have indexes set up for
each of the date/time columns.

I was hoping this would allow me to answer queries like: "What are all
the log messages that were generated between X & Y?" The problem is
that I can ONLY use the equals operator on these column values. For
example, issuing: get system_x where month > 1; gives me this
error: "No indexed columns present in index clause with operator EQ."
The equals operator works as expected though: get system_x where month
= 1;

What schema would allow me to get date ranges?

Thanks in advance...

Bill-

* ColumnFamily description *
    ColumnFamily: system_x_msg
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period: 0.0/0
      Key cache size / save period: 200000.0/3600
      Memtable thresholds: 1.1671875/249/60
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
proj_1_msg.7365636f6e64, proj_1_msg.79656172]
      Column Metadata:
        Column Name: year (year)
          Validation Class: org.apache.cassandra.db.marshal.IntegerType
          Index Type: KEYS
        Column Name: month (month)
          Validation Class: org.apache.cassandra.db.marshal.IntegerType
          Index Type: KEYS
        Column Name: second (second)
          Validation Class: org.apache.cassandra.db.marshal.IntegerType
          Index Type: KEYS
        Column Name: minute (minute)
          Validation Class: org.apache.cassandra.db.marshal.IntegerType
          Index Type: KEYS
        Column Name: hour (hour)
          Validation Class: org.apache.cassandra.db.marshal.IntegerType
          Index Type: KEYS
        Column Name: day (day)
          Validation Class: org.apache.cassandra.db.marshal.IntegerType
          Index Type: KEYS




--
David McNelis
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
o: 630.359.6395
c: 219.384.5143
A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.
