Hi,

I am currently using Spark 1.5.2 and have been able to run benchmarks in
Spark (SQL specifically) in single-user mode.  For benchmarking with
multiple users, I have tried the following approaches, but each has its
own disadvantage:

   1. Start the thrift server in Spark.
      - Execute queries via JDBC from JMeter. (Disadvantage: it is not
      possible to execute custom code to load tables as DataFrames.)
   2. Start a custom thrift server in Spark. The custom thrift server would
   create a HiveContext and load all relevant tables as temp tables (as
   DataFrames), then start the thrift server via
   "HiveThriftServer2.startWithContext(hiveContext);"
      - Execute queries in JMeter via JDBC. (Disadvantage: this can only
      simulate a single user. When multiple threads submit queries, they
      are executed serially.)
      - Increasing the number of executors does not solve this problem.
      With more executors, the response times of small queries tend to be
      higher across multiple runs (perhaps because consecutive executions
      happen on different executors where the data wasn't cached).
   3. Create multiple SparkContexts when JMeter initializes the benchmark.
   This amounts to a pool of SparkContexts, with each user using a
   different SparkContext.
      - This runs into SPARK-2243
      <https://issues.apache.org/jira/browse/SPARK-2243>, and
      "spark.driver.allowMultipleContexts=true" is not helpful in this case.
   4. Another option could be to launch multiple spark-shells to simulate
   multiple users, with dynamic resource allocation enabled.  I haven't
   tried this yet.
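For reference, approach 2 can be sketched roughly as below, using the Spark
1.5.x HiveContext and HiveThriftServer2.startWithContext APIs (the table name
and data path here are hypothetical placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object BenchmarkThriftServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("benchmark-thrift"))
    val hiveContext = new HiveContext(sc)

    // Load benchmark tables as DataFrames and register them as temp tables
    // so JDBC clients (e.g. JMeter) can query them by name.
    val lineitem = hiveContext.read.parquet("/data/lineitem") // hypothetical path
    lineitem.registerTempTable("lineitem")
    lineitem.cache()

    // Expose this same HiveContext (and its temp tables) over JDBC.
    HiveThriftServer2.startWithContext(hiveContext)
  }
}
```

Queries submitted through JDBC then run against the registered temp tables in
this one shared context, which is why concurrent JMeter threads end up
serialized as described above.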
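Approach 4 might look something like the following sketch: launching several
concurrent spark-shell sessions with dynamic allocation enabled. The config
keys are standard Spark 1.5 settings; the per-user query script names are
hypothetical:

```shell
# Launch 4 concurrent spark-shell sessions, one per simulated user.
# Dynamic allocation requires the external shuffle service on the workers.
for i in $(seq 1 4); do
  spark-shell \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=8 \
    -i "user${i}_queries.scala" &   # hypothetical per-user query script
done
wait
```

Each shell gets its own driver and SparkContext, which sidesteps SPARK-2243
at the cost of not sharing cached data between users.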

Are there any standard approaches for benchmarking with multiple users in
Spark? Any pointers on this would be helpful.

~Rajesh.B
