Hello,

I am doing some performance testing on Spark and have a few questions.
1. I would like to verify Spark's parallelism empirically. For example, I would
like to confirm that the number of "performance-heavy" threads at a given moment
equals SPARK_WORKER_CORES in standalone mode. Essentially, I want to make sure
the intended degree of parallelism is actually achieved when I fiddle with
certain parameters.

   a. I've skimmed what Ganglia provides, but it doesn't seem to expose
      thread-level information.

   b. Running jconsole against the worker (executor) process on a worker node
      does show the number of active threads, but I found it difficult to work
      with. I've been using a MacGyver-ish technique: I watch for the process ID
      spawned when an application starts and then run "jconsole <pid>" to get a
      visualization of the thread count over the life of the application. The
      problem is that attaching takes a few seconds, so by the time jconsole is
      running I already see 40 active threads. Here is an image of the sort of
      thing I'm getting:
      http://apache-spark-developers-list.1001551.n3.nabble.com/file/n12898/proc.png
      In addition, jconsole doesn't show the "performance" associated with
      specific threads. Ideally I would like to monitor the process from before
      it is spawned (starting at 0 active threads) and observe thread growth and
      utilization over time. I'm not sure whether that is possible with
      Spark/jconsole.

   c. Maybe someone familiar with JVMs / OS internals has a suggestion about how
      to go about this, or whether it is feasible at all. I can verify core
      utilization by checking "top" on Linux for small task counts, but my
      machines only have 4 physical cores -- if only 1 task is running, top
      shows one core being utilized, 2 tasks show two cores being utilized, and
      so on. I want to do some testing with larger numbers of threads, for
      example 8 or 16, which is greater than 4. (A rough sketch of the
      in-process probe I have in mind is appended at the end of this message.)

2. There is an "Active Tasks" column in the Executors tab, which gives an
indication of parallelism at a given moment. Is it safe to assume that one task
is assigned to one (separate) thread on a worker node? For example, if there are
3 tasks running on a node, are there three separate threads running? (The second
sketch appended at the end of this message is how I was planning to check this.)

3. In addition, could you point me to where cluster-wide / per-slave-node
"parallelism" is actually implemented in the Spark source code, or is threading
simply deferred to details handled by the JVM or YARN? For example, in a comment
by Sean Owen
<http://stackoverflow.com/questions/27606541/why-does-spark-streaming-need-a-certain-number-of-cpu-cores-to-run-correctly>,
it is stated that YARN needs to reserve entire physical cores, so it seems you
cannot set SPARK_WORKER_CORES to a number greater than the number of physical
cores. Furthermore, "cores" does not seem like a very good name for these
parameters (I've mostly been playing around with local / standalone mode);
perhaps it should be changed to just "threads" in the future?

Thanks,
Jonathon
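
P.S. Here are the two rough, untested sketches I referred to above. The first is
the in-process probe from 1c: a daemon thread started lazily on each executor
that periodically counts the task-launch worker threads using the JVM's own
thread list, so there is no race to attach jconsole. It assumes the executor's
task threads are named with the prefix "Executor task launch worker" (please
correct me if that naming is version-specific), and the object and method names
are just placeholders I made up:

    import org.apache.spark.{SparkConf, SparkContext}

    // Started at most once per executor JVM; prints the number of live
    // task-launch worker threads every 500 ms so thread growth can be
    // watched from the executor's stdout instead of jconsole.
    object ThreadSampler {
      @volatile private var started = false

      def start(): Unit = synchronized {
        if (!started) {
          started = true
          val sampler = new Thread(new Runnable {
            override def run(): Unit = {
              import scala.collection.JavaConverters._
              while (true) {
                val threads = Thread.getAllStackTraces.keySet.asScala
                // Assumption: executor task threads carry this name prefix.
                val workers =
                  threads.count(_.getName.startsWith("Executor task launch worker"))
                println(s"[sampler] t=${System.currentTimeMillis()} " +
                  s"task-worker threads=$workers total live=${threads.size}")
                Thread.sleep(500)
              }
            }
          })
          sampler.setDaemon(true)
          sampler.start()
        }
      }
    }

    object VerifyParallelism {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("verify-parallelism"))
        // More partitions than cores, each doing CPU-heavy work, so the executor
        // should keep SPARK_WORKER_CORES task threads busy at any given time.
        sc.parallelize(1 to 64, 64).mapPartitions { iter =>
          ThreadSampler.start() // no-op after the first call in each executor JVM
          iter.map { i =>
            var acc = 0L
            var j = 0
            while (j < 50000000) { acc += j * i; j += 1 }
            acc
          }
        }.count()
        sc.stop()
      }
    }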
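
The second sketch is for question 2: each task records the host it ran on plus
the name and id of its thread, and the driver collects the report. If one task
really maps to one thread, then partitions that run concurrently on the same
host should show distinct thread ids, and the number of distinct ids per host
should not exceed the configured cores. Again, only a sketch with placeholder
names, not something I've run:

    import org.apache.spark.{SparkConf, SparkContext, TaskContext}

    object TaskThreadCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("task-thread-check"))
        val report = sc.parallelize(1 to 16, 16).mapPartitions { iter =>
          // Burn some CPU so several tasks overlap in time on each executor.
          var acc = 0L
          var j = 0
          while (j < 50000000) { acc += j; j += 1 }
          val t = Thread.currentThread()
          Iterator((java.net.InetAddress.getLocalHost.getHostName,
                    TaskContext.get().partitionId(),
                    t.getName,
                    t.getId,
                    acc))
        }.collect()
        // One line per task: distinct thread names/ids per host indicate how
        // many separate task threads were actually used.
        report.foreach(println)
        sc.stop()
      }
    }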