Hi, I am currently using Spark 1.5.2 and have been able to run benchmarks in Spark (SQL specifically) in single-user mode. For benchmarking with multiple users, I have tried the following approaches, but each has its own disadvantage:
1. Start the Thrift server that ships with Spark, and execute queries via JDBC from JMeter.
   - Disadvantage: it is not possible to execute custom code to load tables as DataFrames.

2. Start a custom Thrift server in Spark. The custom server creates a HiveContext, loads all relevant tables as temp tables (as DataFrames), and then starts the Thrift server via "HiveThriftServer2.startWithContext(hiveContext);". Queries are then executed from JMeter via JDBC.
   - Disadvantage: it can only simulate a single user. When multiple threads submit queries, they are executed serially.
   - Increasing the number of executors does not solve this. With more executors, the response times of small queries tend to be higher across multiple runs (perhaps because consecutive executions land on different executors where the data was not cached).

3. Create a pool of SparkContexts when JMeter initializes the benchmark, so that each user can use a different SparkContext.
   - This runs into SPARK-2243 <https://issues.apache.org/jira/browse/SPARK-2243>, and "spark.driver.allowMultipleContexts=true" does not help in this case.

4. Launch multiple spark-shells with dynamic resource allocation enabled to simulate multiple users. I haven't tried this yet.

Are there any standard approaches for benchmarking with multiple users in Spark? Any pointers would be helpful.

~Rajesh.B
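P.S. For reference, approach 2 can be sketched roughly as below. This assumes Spark 1.5.x APIs; the table name "events" and the path "/data/events.parquet" are placeholders for whatever tables the benchmark actually loads.

```scala
// Sketch of approach 2: load tables as DataFrames, register them as
// temp tables, then expose the same HiveContext over JDBC so that
// JMeter can run queries against it. Requires a running Spark cluster.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object BenchmarkThriftServer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("benchmark-thrift-server")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // Hypothetical table: load as a DataFrame, cache it, and register
    // it under a name that JMeter's SQL can reference.
    val events = hiveContext.read.parquet("/data/events.parquet")
    events.cache()
    events.registerTempTable("events")

    // Start the Thrift JDBC server with this context, so all JDBC
    // sessions share the registered temp tables.
    HiveThriftServer2.startWithContext(hiveContext)
  }
}
```

JMeter would then connect with a JDBC sampler to the Thrift server's endpoint (by default port 10000) and issue queries against "events".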