Statically defining columns using an EAV table approach is simply the wrong
fit for Cassandra.

Taking a step back, EAV tables generally don't scale, no matter the
database. I've done this on SqlServer, Oracle and DB2. Many products that
use the EAV approach, like master data management products, suffer from
this issue. On an RDBMS, people get around this by creating indexed (aka
materialized) views to make queries perform "mostly" acceptably when
there are millions of rows.

Ignoring all of that, the biggest issue with the EAV approach is long-term
maintenance and performance. If the data model or object model is
moderately complex, you end up with an explosion of queries to rebuild each
object instance. Something as simple as reading 100 objects could balloon
into 1,000 queries. Having seen this firsthand over the years, don't go
there unless your object model is trivially simple: no lists/sets/maps,
fewer than 10 fields, and a database under 100K records.

The proper way to use Cassandra for a dynamic object model use case is
dynamic columns. The downside is that CQL is not well suited to this use
case, which means using Thrift. All "SQL inspired" languages suffer from
this design limitation. To do this type of thing, you want a strongly
typed API that lets you control exactly which type of column/value goes in
and comes out. I've mentioned this in the past on the mailing list. It's
one of the biggest advantages of Cassandra, which other NoSql databases
can't do or do poorly. Don't take my word for it: go look at MDM products
from IBM, Oracle and Tibco to see how slow those systems run with millions
of records using an EAV design.
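The "strongly typed" point can be sketched in a few lines of plain Python (this is an illustration of the idea, not the Thrift API; the type tags and encoding are assumptions): serialize a type tag alongside each dynamic column value so the reader always knows exactly what comes back.

```python
import struct

# Hypothetical wire encoding for dynamic-column values: a 1-byte type
# tag followed by the payload, so every value round-trips with its type.
TYPE_LONG, TYPE_DOUBLE, TYPE_UTF8 = 1, 2, 3

def pack_value(value):
    """Serialize a value with an explicit type tag."""
    if isinstance(value, bool):
        raise TypeError("define an explicit tag before storing booleans")
    if isinstance(value, int):
        return struct.pack(">Bq", TYPE_LONG, value)
    if isinstance(value, float):
        return struct.pack(">Bd", TYPE_DOUBLE, value)
    if isinstance(value, str):
        return struct.pack(">B", TYPE_UTF8) + value.encode("utf-8")
    raise TypeError(f"unsupported type: {type(value)!r}")

def unpack_value(blob):
    """Deserialize a tagged value back to its exact original type."""
    tag = blob[0]
    if tag == TYPE_LONG:
        return struct.unpack(">q", blob[1:])[0]
    if tag == TYPE_DOUBLE:
        return struct.unpack(">d", blob[1:])[0]
    if tag == TYPE_UTF8:
        return blob[1:].decode("utf-8")
    raise ValueError(f"unknown type tag: {tag}")

# Round trip: each dynamic column carries its own type metadata.
row = {"age": pack_value(42), "score": pack_value(9.5), "name": pack_value("todd")}
print({k: unpack_value(v) for k, v in row.items()})
```

With a scheme like this, a wide row can hold thousands of heterogeneous columns without the schema having to declare any of them statically.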

When you add bi-temporal support on top, the queries get even slower and
management gets uglier over time. It's one of the reasons I built a
temporal database on top of Cassandra instead of MySql, Postgresql, Sql
Server, Oracle, Sybase or DB2. If you want to talk about the specifics of
exactly how the EAV approach doesn't scale, email me directly, since it's
probably off topic: woolfel AT gmail DOT com





On Mon, Oct 6, 2014 at 11:56 PM, Todd Fast <t...@toddfast.com> wrote:

> There is a team at my work building an entity-attribute-value (EAV) store
> using Cassandra. There is a column family, called Entity, where the
> partition key is the UUID of the entity, and the columns are the attributes
> names with their values. Each entity will contain hundreds to thousands of
> attributes, out of a list of up to potentially ten thousand known attribute
> names.
>
> However, instead of using wide rows with dynamic columns (and serializing
> type info with the value), they are trying to use a static column family
> and modifying the schema dynamically as new named attributes are created.
>
> (I believe one of the main drivers of this approach is to use collection
> columns for certain attributes, and perhaps to preserve type metadata for a
> given attribute.)
>
> This approach goes against everything I've seen and done in Cassandra, and
> is generally an anti-pattern for most persistence stores, but I want to
> gather feedback before taking the next step with the team.
>
> Do others consider this approach an anti-pattern, and if so, what are the
> practical downsides?
>
> For one, this means that the Entity schema would contain the superset of
> all columns for all rows. What is the impact of having thousands of columns
> names in the schema? And what are the implications of modifying the schema
> dynamically on a decent sized cluster (5 nodes now, growing to 10s later)
> under load?
>
> Thanks,
> Todd
>
