Hi everyone,

I want to highlight to the dev community CASSANDRA-16222 <https://issues.apache.org/jira/browse/CASSANDRA-16222>, a Spark library we have been working on that can compact and read raw Cassandra SSTables into SparkSQL.
By reading the SSTables directly from a snapshot directory, we are able to achieve high performance with minimal impact on a production cluster. As an example, we successfully exported a ~32TB Cassandra table (~46bn CQL rows) to HDFS in Parquet format in around 1h10m, a 20x improvement on previous solutions.

You can find the code on GitHub: https://github.com/jberragan/spark-cassandra-bulkreader

We would like to contribute the code to the project and open it up to more Cassandra users.

James