Hi all,
I have been struggling with Cassandra’s lack of ad hoc query support. I know
ad hoc querying is a Cassandra anti-pattern, but sometimes management come
over and ask me to run something, and it’s hard to explain that it will take
me a while when it would take about 10 seconds in MySQL. So I have put
together the following code snippet, which bundles DataStax’s Spark Cassandra
Connector and lets you submit Spark SQL to it, writing the results out as
text files.
Does anyone spot any obvious flaws in this plan? (I have a lot more error
handling etc. in my code, but removed it here for brevity.)
private void run(String sqlQuery) {
    // conf: the SparkConf handed to the constructor (field not shown)
    SparkContext scc = new SparkContext(conf);
    CassandraSQLContext csql = new CassandraSQLContext(scc);
    DataFrame sql = csql.sql(sqlQuery);
    String folderName = "/tmp/output_" + System.currentTimeMillis();
    LOG.info("Attempting to save SQL results in folder: " + folderName);
    // saveAsTextFile writes a folder of part-* files, one per partition
    sql.rdd().saveAsTextFile(folderName);
    LOG.info("SQL results saved");
}
public static void main(String[] args) {
    String sparkMasterUrl = args[0];
    String sparkHost = args[1]; // used as the Cassandra connection host below
    String sqlQuery = args[2];
    SparkConf conf = new SparkConf();
    conf.setAppName("Java Spark SQL");
    conf.setMaster(sparkMasterUrl);
    conf.set("spark.cassandra.connection.host", sparkHost);
    JavaSparkSQL app = new JavaSparkSQL(conf);
    // run() takes only the query
    app.run(sqlQuery);
}
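The main() above indexes straight into args, so a missing argument surfaces as an ArrayIndexOutOfBoundsException. Below is a minimal sketch of a fail-fast check, assuming exactly three positional arguments in the order shown. The class JobArgs and its parse method are hypothetical names for illustration and are not part of the original snippet.

```java
/** Hypothetical helper: validates the three positional arguments read by main(). */
final class JobArgs {
    final String sparkMasterUrl;
    final String cassandraHost;
    final String sqlQuery;

    private JobArgs(String sparkMasterUrl, String cassandraHost, String sqlQuery) {
        this.sparkMasterUrl = sparkMasterUrl;
        this.cassandraHost = cassandraHost;
        this.sqlQuery = sqlQuery;
    }

    /** Returns the parsed arguments, or throws IllegalArgumentException with a usage hint. */
    static JobArgs parse(String[] args) {
        if (args == null || args.length != 3) {
            throw new IllegalArgumentException(
                "Usage: JavaSparkSQL <sparkMasterUrl> <cassandraHost> <sqlQuery>");
        }
        for (String arg : args) {
            if (arg == null || arg.trim().isEmpty()) {
                throw new IllegalArgumentException("Arguments must not be blank");
            }
        }
        return new JobArgs(args[0], args[1], args[2]);
    }
}
```

With this in place, main() could call JobArgs.parse(args) first and read the three values from the returned object.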
I can then submit this to Spark with ‘spark-submit’:
./spark-submit --class com.algomi.spark.JavaSparkSQL --master spark://sales3:7077 \
    spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    spark://sales3:7077 sales3 "select * from mykeyspace.operationlog"
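Note that saveAsTextFile writes a folder of part-NNNNN files (plus a _SUCCESS marker) rather than one file, so producing a single text file takes an extra step. Below is a minimal sketch of that step, assuming the output folder is readable on the local filesystem of the machine where it runs. PartFileMerger and merge are hypothetical names for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: concatenates the part-* files of an output folder into one file. */
final class PartFileMerger {
    private PartFileMerger() {}

    /** Copies every part-* file, in name order, into target; returns the bytes written. */
    static long merge(Path outputFolder, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        // The glob skips _SUCCESS and hidden .crc files, which do not start with "part-".
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputFolder, "part-*")) {
            for (Path part : stream) {
                parts.add(part);
            }
        }
        // Zero-padded names (part-00000, part-00001, ...) sort correctly as plain paths.
        Collections.sort(parts);
        long bytes = 0;
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                bytes += Files.copy(part, out);
            }
        }
        return bytes;
    }
}
```

The target should live outside the output folder (or at least not be named part-*), otherwise a later run of the merge would pick it up as input.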
It seems to work well, so I’m pretty happy, but I’m wondering why this
isn’t common practice (at least I haven’t been able to find much about it
on Google). Is there something terrible that I’m missing?
Thanks!
Matthew