It makes sense that the single row for a system (with a growing number of
columns) will reside on a single machine.
With that in mind, here is my updated schema:
- A single column family for all the messages. The row keys will be the TimeUUID
of the message with the following columns: date/time (in UTC POSIX), system
name/id (with an index for fast/easy gets), the actual message payload.
- A column family for each system. The row keys will be UTC POSIX time with 1
second (maybe 1 minute) bucketing, and the column names will be the TimeUUID of
any messages that were logged during that time bucket.
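
To make the bucketing idea concrete, here is a minimal in-memory sketch of the two-part schema above. It is not Cassandra client code: plain dicts stand in for the column families, and the names (`bucket_key`, `log_message`, `messages_between`) and the 60-second bucket width are assumptions chosen for illustration.

```python
import uuid

BUCKET_SECONDS = 60  # assumption: 1-minute buckets; 1-second buckets work the same way


def bucket_key(posix_seconds, bucket_seconds=BUCKET_SECONDS):
    """Round a UTC POSIX timestamp down to the start of its bucket."""
    return int(posix_seconds) - int(posix_seconds) % bucket_seconds


def bucket_keys_for_range(start, end, bucket_seconds=BUCKET_SECONDS):
    """Row keys of every bucket that overlaps the closed range [start, end]."""
    return list(range(bucket_key(start, bucket_seconds),
                      bucket_key(end, bucket_seconds) + 1,
                      bucket_seconds))


# In-memory stand-ins for the two kinds of column family (hypothetical names).
messages = {}      # TimeUUID -> {"time": ..., "system": ..., "payload": ...}
system_index = {}  # system -> {bucket row key -> [TimeUUID, ...]}


def log_message(system, posix_seconds, payload):
    """Store the message once, then record its id in the system's time bucket."""
    msg_id = uuid.uuid1()  # version 1 UUIDs are time-based (taken from the local clock)
    messages[msg_id] = {"time": posix_seconds, "system": system, "payload": payload}
    buckets = system_index.setdefault(system, {})
    buckets.setdefault(bucket_key(posix_seconds), []).append(msg_id)
    return msg_id


def messages_between(system, start, end):
    """Read each candidate bucket, then drop messages outside [start, end]."""
    found = []
    for key in bucket_keys_for_range(start, end):
        for msg_id in system_index.get(system, {}).get(key, []):
            msg = messages[msg_id]
            if start <= msg["time"] <= end:
                found.append(msg["payload"])
    return found
```

A range query then costs one row read per bucket in the range plus one message lookup per hit, so the bucket width trades row count against row size.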
My only hesitation with this design is that buddhasystem warned that each column
family "is allocated a piece of memory on the server." I'm not sure what the
implications of this are, or whether it would be a problem if I had a number
of systems on the order of hundreds.
Thanks...
Bill-
On 01/26/2011 06:51 PM, Shu Zhang wrote:
Each row can have a maximum of 2 billion columns, which a logging system will
probably hit eventually.
More importantly, you'll only have 1 row per set of system logs. Each row is
stored in its entirety on the same machine(s), which means you'll definitely
not be able to distribute your load very well.
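
A rough illustration of the load-distribution point, assuming rows are placed by a hash of the row key (as with a hash-based partitioner). This is a deliberate simplification: it uses a plain MD5-modulo rule instead of a token ring, ignores replication, and the node names and the `system_x:<bucket>` key format are hypothetical.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical 4-node cluster


def node_for(row_key):
    """Pick a node from the MD5 hash of the row key (simplified placement rule)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest, "big") % len(NODES)]


# One row per system: every write for system_x lands on the same node.
single_row_nodes = {node_for("system_x") for _ in range(1000)}

# One row per (system, minute bucket): the target node changes with the bucket.
bucketed_nodes = {node_for("system_x:%d" % (minute * 60)) for minute in range(1000)}

print(len(single_row_nodes), len(bucketed_nodes))
```

The single row key always maps to one node, while the bucketed keys spread over the cluster as time advances.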
________________________________________
From: Bill Speirs [bill.spe...@gmail.com]
Sent: Wednesday, January 26, 2011 1:23 PM
To: user@cassandra.apache.org
Subject: Re: Schema Design
I like this approach, but I have 2 questions:
1) what are the implications of continually adding columns to a single
row? I'm unsure how Cassandra is able to grow. I realize you can have
a virtually infinite number of columns, but what are the implications
of growing the number of columns over time?
2) maybe it's just a restriction of the CLI, but how do I issue a
slice request? Also, what if start (or end) columns don't exist? I'm
guessing it's smart enough to get the columns in that range.
Thanks!
Bill-
On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
<dmcne...@agentisenergy.com> wrote:
I would say in that case you might want to try a single column family
where the row key is the system name.
Then, you could name your columns by timestamp. Then when retrieving
information from the data store you can, in your slice request, specify
your start column as X and end column as Y.
Then you can use the stored column name to know when an event occurred.
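
As a sketch of what such a slice does, the snippet below models one row as a sorted list of timestamp column names and selects a range with binary search. It is an in-memory illustration, not a Cassandra API call; `slice_columns` and the sample timestamps are made up for the example. It assumes the slice bounds are compared against the sorted column names, so neither bound has to be an existing column.

```python
from bisect import bisect_left, bisect_right


def slice_columns(column_names, start, end):
    """Return the sorted column names that fall within [start, end], inclusive."""
    lo = bisect_left(column_names, start)   # first column >= start
    hi = bisect_right(column_names, end)    # one past the last column <= end
    return column_names[lo:hi]


# One row for a hypothetical system; column names are POSIX timestamps, kept sorted.
row = [1296070000, 1296070060, 1296070125, 1296070300]

# Neither bound is an existing column name; the slice still returns what lies between.
in_range = slice_columns(row, 1296070050, 1296070200)
```

Because the column name is the timestamp, each returned name also says when that event occurred.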
On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs <bill.spe...@gmail.com> wrote:
I'm looking to use Cassandra to store log messages from various
systems. A log message only has a message (UTF8Type) and a date/time.
My thought is to create a column family for each system. The row key
will be a TimeUUIDType. Each row will have 7 columns: year, month,
day, hour, minute, second, and message. I then have indexes set up for
each of the date/time columns.
I was hoping this would allow me to answer queries like: "What are all
the log messages that were generated between X and Y?" The problem is
that I can ONLY use the equals operator on these column values. For
example, issuing: get system_x where month > 1; gives me this
error: "No indexed columns present in index clause with operator EQ."
The equals operator works as expected though: get system_x where month
= 1;
What schema would allow me to get date ranges?
Thanks in advance...
Bill-
* ColumnFamily description *
ColumnFamily: system_x_msg
Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
Row cache size / save period: 0.0/0
Key cache size / save period: 200000.0/3600
Memtable thresholds: 1.1671875/249/60
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 1.0
Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
proj_1_msg.7365636f6e64, proj_1_msg.79656172]
Column Metadata:
Column Name: year (year)
Validation Class: org.apache.cassandra.db.marshal.IntegerType
Index Type: KEYS
Column Name: month (month)
Validation Class: org.apache.cassandra.db.marshal.IntegerType
Index Type: KEYS
Column Name: second (second)
Validation Class: org.apache.cassandra.db.marshal.IntegerType
Index Type: KEYS
Column Name: minute (minute)
Validation Class: org.apache.cassandra.db.marshal.IntegerType
Index Type: KEYS
Column Name: hour (hour)
Validation Class: org.apache.cassandra.db.marshal.IntegerType
Index Type: KEYS
Column Name: day (day)
Validation Class: org.apache.cassandra.db.marshal.IntegerType
Index Type: KEYS
--
David McNelis
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
o: 630.359.6395
c: 219.384.5143
A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.