Hi, I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat. I'm not sure whether I should be asking on the Spark list or the Accumulo list, but I'll try here. The problem is that the work of processing SQL queries doesn't seem to be distributed across my cluster very well.
My Spark SQL app is running in yarn-client mode. The query I'm running is "select count(*) from audit_log" (or a similarly simple query), where my audit_log table has 14.3M rows (504M key/value pairs) spread fairly evenly across 8 tablet servers. Looking at the Accumulo monitor, I only ever see a maximum of 2 tablet servers with active scans. Since the data is spread across all the tablet servers, I hoped to see 8! I realize there are a lot of moving parts here, but I'd appreciate any advice about where to start looking. I'm using Spark 1.0.1 with Accumulo 1.6; a rough sketch of the kind of wiring I mean is below. Thanks! -Russ
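For context, here is a minimal sketch of the kind of setup I'm describing, going through newAPIHadoopRDD and the Accumulo 1.6 mapreduce API. The instance name, zookeepers, user/password, and the AuditRow mapping are placeholders rather than my exact code:

import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Placeholder row type -- the real schema would map the actual audit_log columns
case class AuditRow(rowId: String, value: String)

object AuditLogCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("audit-log-count"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD in Spark 1.0.x

    // Configure AccumuloInputFormat; connection details here are placeholders.
    // In Accumulo 1.6 the static config methods live on AbstractInputFormat / InputFormatBase.
    val job = Job.getInstance()
    AbstractInputFormat.setZooKeeperInstance(job,
      ClientConfiguration.loadDefault().withInstance("myInstance").withZkHosts("zk1:2181"))
    AbstractInputFormat.setConnectorInfo(job, "scanUser", new PasswordToken("secret"))
    InputFormatBase.setInputTableName(job, "audit_log")

    // One Spark partition per input split; AccumuloInputFormat should produce roughly one split per tablet.
    val kvRdd = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[AccumuloInputFormat], classOf[Key], classOf[Value])

    // Map Key/Value pairs into a case class so Spark SQL can infer a schema.
    val rows = kvRdd.map { case (k, v) => AuditRow(k.getRow.toString, v.toString) }
    rows.registerAsTable("audit_log")

    println(sqlContext.sql("select count(*) from audit_log").collect().mkString)
    sc.stop()
  }
}

If that wiring is right, I'd expect kvRdd to end up with roughly one partition per tablet, so part of what I'm unsure about is whether the input splits or the executors are the limiting factor here.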