Hi, I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat. I'm not sure whether I should be asking on the Spark list or the Accumulo list, but I'll try here. The problem is that the work of processing SQL queries doesn't seem to be distributed across my cluster very well.
My Spark SQL app is running in yarn-client mode. The query I'm running is "select count(*) from audit_log" (or a similarly simple query), where my audit_log table has 14.3M rows (504M key/value pairs) spread fairly evenly across 8 tablet servers. Looking at the Accumulo monitor, I only ever see a maximum of 2 tablet servers with active scans. Since the data is spread across all the tablet servers, I hoped to see 8! I realize there are a lot of moving parts here, but I'd appreciate any advice about where to start looking. I'm using Spark 1.0.1 with Accumulo 1.6; a rough sketch of the kind of wiring I mean is below. Thanks! -Russ
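For context, here is a minimal sketch of the kind of setup I'm describing, going through newAPIHadoopRDD and the Accumulo 1.6 mapreduce API. The instance name, zookeepers, user/password, and the AuditRow mapping are placeholders rather than my exact code:

import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Placeholder row type -- the real schema would map the actual audit_log columns
case class AuditRow(rowId: String, value: String)

object AuditLogCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("audit-log-count"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD in Spark 1.0.x

    // Configure AccumuloInputFormat; connection details here are placeholders.
    // In Accumulo 1.6 the static config methods live on AbstractInputFormat / InputFormatBase.
    val job = Job.getInstance()
    AbstractInputFormat.setZooKeeperInstance(job,
      ClientConfiguration.loadDefault().withInstance("myInstance").withZkHosts("zk1:2181"))
    AbstractInputFormat.setConnectorInfo(job, "scanUser", new PasswordToken("secret"))
    InputFormatBase.setInputTableName(job, "audit_log")

    // One Spark partition per input split; AccumuloInputFormat should produce roughly one split per tablet.
    val kvRdd = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[AccumuloInputFormat], classOf[Key], classOf[Value])

    // Map Key/Value pairs into a case class so Spark SQL can infer a schema.
    val rows = kvRdd.map { case (k, v) => AuditRow(k.getRow.toString, v.toString) }
    rows.registerAsTable("audit_log")

    println(sqlContext.sql("select count(*) from audit_log").collect().mkString)
    sc.stop()
  }
}

If that wiring is right, I'd expect kvRdd to end up with roughly one partition per tablet, so part of what I'm unsure about is whether the input splits or the executors are the limiting factor here.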