Re: Cassandra data model right definition

Russell Bradberry Mon, 03 Oct 2016 07:35:22 -0700

A couple things I would like to note:

1. Cassandra does not determine how data is stored on disk; the compaction
strategy does.  One could, in theory (and I believe some are trying),
create a column-store compaction strategy.  There is a large effort in the
database community overall to separate query execution from the storage
engine, so it is becoming increasingly incorrect to say a database is an
"X store" database.


2. "X-store" is not used, and never has been, to describe how data is
represented or queried.  When most database storage engines describe their
storage as "X-store" they are referring to contiguous bytes on disk.  In
traditional row-store engines, on a single node, the definition is as
follows: "All data for a row is stored as a single block of contiguous
bytes on disk".  Traditional column-stores are similarly defined as "All
data for a column is stored contiguously on disk".  Old-style Cassandra was
a key-value column-family store in that "all data for a family of columns
belonging to a given key were stored contiguously on disk".
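
To make the two definitions concrete, here is a minimal Python sketch that
lays out the same small, made-up table in row-store order and in
column-store order.  The function names and the comma/semicolon text
encoding are assumptions for illustration only; real engines write binary
pages, and none of this is Cassandra code.

```python
# Illustrative only: a toy text encoding, not any real engine's on-disk format.
ROWS = [
    (1, 10, "Smith", "Joe", 40000),
    (2, 12, "Jones", "Mary", 50000),
    (3, 11, "Johnson", "Cathy", 44000),
    (4, 22, "Jones", "Bob", 55000),
]


def row_store_layout(rows):
    """All values of one row are adjacent; rows follow one another."""
    return ";".join(",".join(str(v) for v in row) for row in rows)


def column_store_layout(rows):
    """All values of one column are adjacent; columns follow one another."""
    columns = zip(*rows)  # transpose rows into columns
    return ";".join(",".join(str(v) for v in col) for col in columns)


if __name__ == "__main__":
    print(row_store_layout(ROWS))
    print(column_store_layout(ROWS))
```

Reading one whole record touches a single contiguous run in the first
layout, while scanning one attribute across all records touches a single
contiguous run in the second.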

So when talking about Cassandra and all currently merged compaction
strategies, yes, it fits the definition of a row-store in that "All data
for a row is stored as contiguous bytes on disk".  However, it goes further
because "All data for all rows in a given partition are stored as
contiguous bytes on disk".  So at the highest level one could say it is a
"Partition-store", but that is pretty vague.  I think it deserves a
different name, which is why I like the term "Partitioned-row-store": it
gives insight into the fact that it is rows being stored on disk, in a
partitioned format.
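
As a rough sketch of what "partitioned" adds, the Python below groups rows
by a partition key, orders them by a clustering key within each partition,
and writes each partition as one contiguous run.  The function name, the
tuple shape, and the text encoding are all hypothetical; this is not how
SSTables are actually encoded, and it assumes each partition lives in a
single file.

```python
from collections import defaultdict


def partitioned_row_layout(rows):
    """rows: iterable of (partition_key, clustering_key, value) tuples.

    Returns a list of (partition_key, encoded_partition) pairs in which every
    row of a partition is adjacent and ordered by clustering key.
    """
    partitions = defaultdict(list)
    for pk, ck, value in rows:
        partitions[pk].append((ck, value))

    layout = []
    # Simplification: sorted by key here; real Cassandra orders by token.
    for pk in sorted(partitions):
        ordered = sorted(partitions[pk])  # clustering order inside the partition
        encoded = ";".join(f"{ck}={value}" for ck, value in ordered)
        layout.append((pk, encoded))
    return layout


if __name__ == "__main__":
    writes = [
        ("sensor-b", 2, "21.5"),
        ("sensor-a", 1, "19.0"),
        ("sensor-b", 1, "21.0"),
        ("sensor-a", 2, "19.4"),
    ]
    for pk, encoded in partitioned_row_layout(writes):
        print(pk, "->", encoded)
```

Writes arrive interleaved across partitions, but in the resulting layout
every row of a partition sits next to its siblings, which is the property
the name is meant to convey.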

PS.
To address the pedants: yes, by these definitions you would have to assume
that a partition resides in a single SSTable.  While most compaction
strategies try hard to achieve this, it currently only holds in one that I
know of.  You could call it a
"Partitioned-row-dependent-upon-compaction-strategy-store", but that is
just terrible.



On Mon, Oct 3, 2016 at 10:02 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> Sorry Ed, but you're really stretching here. A table in Cassandra is
> structured by a schema with the data for each row stored together in each
> data file. Just because it uses log structured storage, sparse fields, and
> semi-flexible collections doesn't disqualify it from being called a "row
> store".
>
> Postgres added flexible storage through hstore, I don't hear anyone
> arguing that it needs to be renamed.
>
> Any relational db could (and I'm sure one does!) allow for sparse fields
> as well. MySQL can be backed by rocksdb now, does that make it not a row
> store?
>
> You're arguing that everything is wrong but you're not proposing an
> alternative, which is not productive.
>
> On Mon, Oct 3, 2016 at 9:40 AM Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>> Also, every piece of technical information that describes a row store,
>> such as
>>
>> http://cs-www.cs.yale.edu/homes/dna/talks/abadi-sigmod08-slides.pdf
>> https://en.wikipedia.org/wiki/Column-oriented_DBMS#Row-oriented_systems
>>
>> depicts it like this:
>>
>> 001:10,Smith,Joe,40000;
>> 002:12,Jones,Mary,50000;
>> 003:11,Johnson,Cathy,44000;
>> 004:22,Jones,Bob,55000;
>>
>>
>>
>> They never depict a scenario where the data looks like this on disk:
>>
>> 001:10,Smith
>>
>> 001:10,40000;
>>
>> Which is much closer to how Cassandra *stores* its data.
>>
>>
>>
>> On Fri, Sep 30, 2016 at 5:12 PM, Benedict Elliott Smith
>> <bened...@apache.org> wrote:
>>
>> Absolutely.  A "partitioned row store" is exactly what I would call it.
>> As it happens, our README thinks the same, which is fantastic.
>>
>> I thought I'd take a look at the rest of our cohort, and didn't get far
>> before disappointment.  HBase literally calls itself a
>> "*column-oriented* store"
>> - which is so totally wrong it's simultaneously hilarious and tragic.
>>
>> I guess we can't blame the wider internet for misunderstanding/misnaming
>> us poor "wide column stores" if even one of the major examples doesn't know
>> what it, itself, is!
>>
>>
>>
>>
>> On 30 September 2016 at 21:47, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>
>> +1000 to what Benedict says. I usually call it a "partitioned row store"
>> which usually needs some extra explanation but is more accurate than
>> "column family" or whatever other thrift era terminology people still use.
>>
>> On Fri, Sep 30, 2016 at 1:53 PM DuyHai Doan <doanduy...@gmail.com> wrote:
>>
>> I used to present Cassandra as a NoSQL datastore with "distributed"
>> table. This definition is closer to CQL and has some academic background
>> (distributed hash table).
>>
>>
>> On Fri, Sep 30, 2016 at 7:43 PM, Benedict Elliott Smith
>> <bened...@apache.org> wrote:
>>
>> Cassandra is not a "wide column store" anymore.  It has a schema.  Only
>> thrift users still think they have no schema (though they do), and
>> thrift is being deprecated.
>>
>> I really wish everyone would kill the term "wide column store" with
>> fire.  It seems to have never meant anything beyond "schema-less,
>> row-oriented", and a "column store" means literally the opposite of this.
>>
>> Not only that, but people don't even seem to realise the term "column
>> store" existed long before "wide column store" and the latter is often
>> abbreviated to the former, as here:
>> http://www.planetcassandra.org/what-is-nosql/
>>
>> Since it no longer applies, let's all agree as a community to forget this
>> awful nomenclature ever existed.
>>
>>
>>
>> On 30 September 2016 at 18:09, Joaquin Casares
>> <joaq...@thelastpickle.com> wrote:
>>
>> Hi Mehdi,
>>
>> I can help clarify a few things.
>>
>> As Carlos said, Cassandra is a Wide Column Store. Theoretically a row can
>> have 2 billion columns, but in practice it shouldn't have more than 100
>> million columns.
>>
>> Cassandra partitions data to certain nodes based on the partition key(s),
>> but does provide the option of setting zero or more clustering keys.
>> Together, the partition key(s) and clustering key(s) form the primary key.
>>
>> When writing to Cassandra, you will need to provide the full primary key;
>> when reading from Cassandra, however, you only need to provide the full
>> partition key.
>>
>> When you only provide the partition key for a read operation, you're able
>> to return all columns that exist on that partition with low latency. These
>> columns are displayed as "CQL rows" to make it easier to reason about.
>>
>> Consider the schema:
>>
>> CREATE TABLE foo (
>>   bar uuid,
>>   boz uuid,
>>   baz timeuuid,
>>   data1 text,
>>   data2 text,
>>   PRIMARY KEY ((bar, boz), baz)
>> );
>>
>>
>> When you write to Cassandra you will need to send bar, boz, and baz and
>> optionally data*, if it's relevant for that CQL row. If you choose not to
>> define a data* field for a particular CQL row, then nothing is stored nor
>> allocated on disk. But I wouldn't consider that caveat to be "schema-less".
>>
>> However, all writes to the same bar/boz will end up on the same Cassandra
>> replica set (a configurable number of nodes) and be stored on the same
>> place(s) on disk within the SSTable(s). And on disk, each field that's not
>> a partition key is stored as a column, including clustering keys (this is
>> optimized in Cassandra 3+, but now we're getting deep into internals).
>>
>> In this way you can get fast responses for all activity for bar/boz
>> either over time, or for a specific time, with roughly the same number of
>> disk seeks, with varying lengths on the disk scans.
>>
>> Hope that helps!
>>
>> Joaquin Casares
>> Consultant
>> Austin, TX
>>
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>> On Fri, Sep 30, 2016 at 11:40 AM, Carlos Alonso <i...@mrcalonso.com>
>> wrote:
>>
>> Cassandra is a Wide Column Store:
>> http://db-engines.com/en/system/Cassandra
>>
>> Carlos Alonso | Software Engineer | @calonso
>> <https://twitter.com/calonso>
>>
>> On 30 September 2016 at 18:24, Mehdi Bada <mehdi.b...@dbi-services.com>
>> wrote:
>>
>> Hi all,
>>
>> I have a theoretical question:
>> - Is Apache Cassandra really a column store?
>> A column store means storing the data as columns rather than as rows.
>>
>> In fact C* stores the data as rows, and the data is partitioned by row key.
>>
>> Finally, for me, Cassandra is a row-oriented, schema-less DBMS.  Is that
>> true for you also?
>>
>> Many thanks in advance for your reply
>>
>> Best Regards
>> Mehdi Bada
>> ----
>>
>> *Mehdi Bada* | Consultant
>> Phone: +41 32 422 96 00 | Mobile: +41 79 928 75 48 |
>> Fax: +41 32 422 96 15
>> dbi services, Rue de la Jeunesse 2, CH-2800 Delémont
>> mehdi.b...@dbi-services.com
>> www.dbi-services.com
