HDFS or S3 is a good place to archive this data. You could also consider a different compaction strategy for “COLD DATA” on a separate, less powerful C* cluster that exists mainly to take writes. In my opinion C* is still better than S3/HDFS for data management. Which way to go depends on how easy you want retrieval and analysis to be.
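
To put rough numbers on the node-count concern raised in the question, here is a back-of-the-envelope sketch in Python. The function name, the replication factor of 3, and the 4 TB "dense cold node" figure are my own assumptions for illustration. Only the 1 TB per node guideline comes from the original mail. It ignores compression and compaction headroom, so treat the output as an order-of-magnitude estimate, not a sizing recommendation.

```python
import math


def nodes_needed(raw_tb: float, replication_factor: int = 3,
                 tb_per_node: float = 1.0) -> int:
    """Estimate how many nodes are needed to hold `raw_tb` of unique data.

    Assumptions (hypothetical, for illustration only):
      - every row is stored `replication_factor` times
      - each node holds at most `tb_per_node` TB of data
      - compression and free space for compaction are ignored
    """
    if raw_tb < 0 or replication_factor < 1 or tb_per_node <= 0:
        raise ValueError("sizes must be positive and replication_factor >= 1")
    total_tb = raw_tb * replication_factor
    # A cluster needs at least `replication_factor` nodes to place all replicas.
    return max(replication_factor, math.ceil(total_tb / tb_per_node))


if __name__ == "__main__":
    # 50 TB of unique data at the 1 TB per node guideline from the question.
    print(nodes_needed(50))                 # 150
    # Same data on a hypothetical cold cluster with denser 4 TB nodes.
    print(nodes_needed(50, tb_per_node=4))  # 38
```

The gap between the two results is the argument for a separate cold cluster with denser nodes and a compaction strategy tuned for write-mostly data, if you want to stay on C* instead of moving to HDFS/S3.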

--
Rahul Singh
rahul.si...@anant.us
Anant Corporation

On Mar 12, 2018, 8:30 AM -0400, Javier Pareja <pareja.jav...@gmail.com>, wrote:
> Hi,
>
> I understand that a well-designed Cassandra system will allow querying ANY
> data within it at an incredible speed as well as ingesting data at a very
> fast pace.
>
> However, this data is going to grow until it is archived. As I see it, data
> has two stages: HOT DATA, when data can be queried with very low latency,
> and COLD DATA, when data can still be queried and processed but we can allow
> a (relatively long) delay. Cassandra is VERY good with the HOT DATA but it is
> not very cost effective when the COLD DATA starts to grow, because each node
> only stores a tiny amount (1TB recommended). The number of nodes needed starts
> to grow even if this data is rarely queried!!
>
> Has anyone implemented a solution that "archives" data into a cold(er)
> storage outside of Cassandra, while still being available for (offline)
> processing with Spark? For example, into a separate cluster with Hadoop/Hive?
> What is the standard in these cases?
>
> F Javier Pareja