Spark-Cassandra Bulk Reader: CASSANDRA-16222

James Berragan Fri, 23 Oct 2020 09:17:07 -0700

Hi everyone,

I would like to bring CASSANDRA-16222
<https://issues.apache.org/jira/browse/CASSANDRA-16222> to the attention of
the dev community. It is a Spark library we have been working on that can
compact and read raw Cassandra SSTables into SparkSQL.


By reading the SSTables directly from a snapshot directory, we are able to
achieve high performance with minimal impact on a production cluster. As an
example, we successfully exported a ~32TB Cassandra table (~46bn CQL rows)
to HDFS in Parquet format in around 1h10m, a 20x improvement on previous
solutions.
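
For a rough sense of scale, the figures quoted above can be turned into
throughput numbers. This is only a back-of-the-envelope sketch: it assumes
decimal terabytes (1 TB = 10**12 bytes), treats the approximate sizes and
times as exact, and assumes the 20x figure refers to wall-clock time. The
variable names are purely illustrative.

```python
# Back-of-the-envelope throughput from the approximate figures in the post.
# Assumptions: 1 TB = 10**12 bytes; "20x" is taken to mean wall-clock time.
table_bytes = 32 * 10**12            # ~32TB table
cql_rows = 46 * 10**9                # ~46bn CQL rows
export_seconds = 1 * 3600 + 10 * 60  # ~1h10m

bytes_per_second = table_bytes / export_seconds
rows_per_second = cql_rows / export_seconds
# Runtime a solution 20x slower would need, in hours.
implied_previous_hours = 20 * export_seconds / 3600

print(f"{bytes_per_second / 1e9:.1f} GB/s")
print(f"{rows_per_second / 1e6:.1f} million rows/s")
print(f"{implied_previous_hours:.1f} h for a 20x slower export")
```

Under those assumptions the export ran at roughly 7.6 GB/s, or about 11
million CQL rows per second, and a 20x slower approach would have taken
around 23 hours.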

You can find the code on GitHub:
https://github.com/jberragan/spark-cassandra-bulkreader.

We would like to contribute the code to the project and open it up to more
Cassandra users.

James.
