Jirka,

> But I am really interested how it can work well with Spark/Hadoop where
> you basically needs to read all the data as well (as far as I understand
> that).
I can't give you any benchmarking between technologies (nor am I particularly interested in getting involved in such a discussion), but I can share our experiences with Cassandra, Hadoop, and Spark over the past 4+ years, and hopefully assure you that Cassandra+Spark is a smart choice.

On a four-node cluster we were running 5000+ small Hadoop jobs each day, each finishing within two minutes, often within one, resulting in (give or take) a billion records read from and 150 million records written to C* daily. These small jobs process incrementally, over a limited set of partition keys each time. They primarily read from a "raw events store" that has a TTL of 3 months and generates 22+ GB of tombstones a day (reads over old partition keys are rare).

We also run full-table-scan jobs and have never come across any issues particular to them. There are Hadoop map/reduce settings to increase durability if you have tables with troublesome partition keys. This is also a cluster that serves requests to web applications that need low latency.

We recently wrote a Spark job that does full table scans over 100 million+ rows, involves a handful of stages (two tables, 9 maps, 4 reduces, and 2 joins), and writes 5 million rows back to a new table. This job runs in ~260 seconds. (A rough sketch of what such a job can look like is below, after the sign-off.)

Spark is becoming a natural complement to schema evolution for Cassandra, something you'll want to do to keep your schema optimised for your read request patterns, even for little things like switching clustering keys around.

With any new technology, hitting some hurdles (especially if you go wandering outside recommended practices) will of course be part of the game, but that said, I've only had positive experiences with this community's ability to help out (and to do so quickly).

Starting from scratch I'd use Spark (on Scala) over Hadoop, no questions asked. Otherwise, Cassandra has always been our 'big data' platform; Hadoop/Spark is just an extra tool on top. We've never kept data in HDFS and are very grateful for having made that choice.

~mck

ref https://prezi.com/vt98oob9fvo4/cassandra-summit-cassandra-and-hadoop-at-finnno/
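
P.S. For anyone curious, here's a minimal sketch of what such a
full-table-scan Spark job (scan two tables, join, aggregate, write the
much smaller result back to a new table) can look like with the
spark-cassandra-connector. The keyspace, table, and column names here
are made up for illustration; our real job is more involved.

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._  // cassandraTable / saveToCassandra

    val conf = new SparkConf()
      .setAppName("full-scan-sketch")
      .set("spark.cassandra.connection.host", "cassandra-host")  // your contact point
    val sc = new SparkContext(conf)

    // full scan of a (hypothetical) raw events table
    val events = sc.cassandraTable("ks", "raw_events")
      .map(row => (row.getString("user_id"), row.getLong("value")))

    // full scan of a second (hypothetical) table to join against
    val users = sc.cassandraTable("ks", "users")
      .map(row => (row.getString("user_id"), row.getString("segment")))

    // join on user_id, then aggregate per segment
    val totals = events.join(users)
      .map { case (_, (value, segment)) => (segment, value) }
      .reduceByKey(_ + _)

    // write the aggregated result back to a new table
    totals.saveToCassandra("ks", "segment_totals", SomeColumns("segment", "total"))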
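
And the schema evolution point, e.g. switching a clustering key around,
can be as simple as creating the new table and piping a scan into a save
(continuing with the same sc as above; again, names are illustrative):

    // given a new table created beforehand with the desired clustering order, e.g.
    //   CREATE TABLE ks.events_by_time (
    //     user_id text, ts timestamp, event text,
    //     PRIMARY KEY ((user_id), ts)
    //   ) WITH CLUSTERING ORDER BY (ts DESC);
    // the connector maps columns by name, so a full copy is just:
    sc.cassandraTable("ks", "events")
      .saveToCassandra("ks", "events_by_time")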