I’m fairly new to Cassandra, but here’s my input.

Think of your column families as projections of the data, shaped the way the
application needs to read it. Thinking with CQRS in mind helps. More CFs may
require more space, since the same data may be written differently to
different column families for different uses. For that reason you have to
think about disk usage: the growth of the data, the space Cassandra needs to
perform compaction, and so on.

Also on the modeling front, pay attention to growing wide rows. Updating or
deleting columns in such a row may add too many tombstones
(tombstone_failure_threshold defaults to 100 000), which may cause Cassandra
to abort queries on that row (before compaction), because it has to load
the partition in memory to work out the actual live data.
This is especially important for time series. We had to rework our model to
bucket by period to avoid such cases. However, this requires some work
in the business code to query such a column family.
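
To illustrate the extra query work that bucketing implies, here is a minimal
sketch in Python. It assumes a daily bucket and a partition key of the form
(device_id, bucket); the names day_bucket and buckets_between, the device_id
column, and the daily period are all hypothetical choices for illustration,
not taken from any actual schema.

```python
from datetime import datetime, timedelta, timezone


def day_bucket(ts):
    """Return the bucket label (UTC day as YYYYMMDD) for an aware datetime."""
    return ts.astimezone(timezone.utc).strftime("%Y%m%d")


def buckets_between(start, end):
    """List every daily bucket touched by the inclusive range [start, end].

    Both arguments must be timezone-aware datetimes. The caller would issue
    one query per returned bucket and merge the results.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    buckets = []
    while day <= last:
        buckets.append(day.strftime("%Y%m%d"))
        day += timedelta(days=1)
    return buckets
```

On write, the row goes to the partition (device_id, day_bucket(ts)). On read,
the business code has to expand the requested time range with
buckets_between and query each partition in turn, which is the extra work
mentioned above. The bucket period should be chosen so that a single
partition stays reasonably small.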

Avoid secondary indexes. This relates to modeling per usage: when each column
family is modeled for a specific query, the need for them goes away.

Cheers,
— Brice

On Sat, Sep 20, 2014 at 6:55 AM, Jack Krupansky <j...@basetechnology.com>
wrote:

> Start by asking how you intend to query the data. That should drive the
> data model.
>
> Is there existing app client code or an app layer that is already using
> the current schema, or are you intending to rewrite that as well?
>
> FWIW, you could place the numeric columns in a numeric map collection, and
> the string columns in a string map collection, but... it’s best to first
> step back and look at the big picture of what the data actually looks like
> as well as how you want to query it.
>
> -- Jack Krupansky
>
>  *From:* Les Hartzman <lhartz...@gmail.com>
> *Sent:* Friday, September 19, 2014 5:46 PM
> *To:* user@cassandra.apache.org
> *Subject:* Help with approach to remove RDBMS schema from code to move to
> C*?
>
>  My company is using an RDBMS for storing time-series data. This
> application was developed before Cassandra and NoSQL. I'd like to move to
> C*, but ...
>
> The application supports data coming from multiple models of devices.
> Because there is enough variability in the data, the main table to hold the
> device data only has some core columns defined. The other columns are
> non-specific; a set of columns for numeric and a set for character. So for
> these non-specific columns, their use is defined in the code. The use of
> column 'numeric_1' might hold a millisecond time for one device and a fault
> code for another device. This appears to have been done to keep from
> modifying the schema whenever a new device was introduced. And they rolled
> their own db interface to support this mess.
>
> Now, we could just use C* like an RDBMS - defining CFs to mimic the
> tables. But this just pushes a bad design from one platform to another.
>
> Clearly there needs to be a code re-write. But what suggestions does
> anyone have on how to make this shift to C*?
>
> Would you just lay out all of the columns represented by the different
> devices, naming them as they are used, and having jagged rows? Or is there
> some other way to approach this?
>
> Of course, the data miners already have scripts/methods for accessing the
> data from the RDBMS now in the user-unfriendly form it's in now. This would
> have to be addressed as well, but until I know how to store it, mining it
> gets ahead of things.
>
> Thanks.
>
> Les
>
>
