One Cassandra node.
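Since everything is read from that single Cassandra node, every Spark
executor pulls its input over the network from one machine, so extra
workers mostly add contention rather than parallel I/O. A minimal sketch of
how this could be checked and tuned, assuming the same "temp" table and
"mykeyspace" keyspace as in the notebook below; the split size and
shuffle-partition values are illustrative guesses, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cassandra-read-check")
      // Larger input splits mean fewer round trips to the single Cassandra
      // node (the connector's default split size is 64 MB).
      .config("spark.cassandra.input.split.size_in_mb", "256")
      // The SQL shuffle default of 200 partitions is likely too many for an
      // 18M-row table; fewer partitions reduce shuffle overhead.
      .config("spark.sql.shuffle.partitions", "24")
      .getOrCreate()

    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "temp", "keyspace" -> "mykeyspace"))
      .load()
      .cache()

    // A very large partition count here means many small reads against
    // the one Cassandra node.
    println(s"input partitions: ${df.rdd.getNumPartitions}")

    // Materialize the cache once, so that later timings measure Spark
    // processing rather than the Cassandra read.
    df.count()

In Zeppelin these settings would normally go into the Spark interpreter
configuration rather than a new session; the block above is only written as
self-contained for clarity.

Best Regards,
Junaid Nasir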
> On May 18, 2017 at 3:56 AM, ayan guha <guha.a...@gmail.com> wrote:
>
> How many nodes do you have in the Cassandra cluster?
>
> > On Thu, 18 May 2017 at 1:33 am, Jörn Franke <jornfra...@gmail.com> wrote:
> >
> > The issue might be the group by, which under certain circumstances can
> > cause a lot of traffic to one node. Of course, the fewer nodes you have,
> > the less of this transfer is needed.
> >
> > Have you checked in the UI what it reports?
> >
> > > On 17. May 2017, at 17:13, Junaid Nasir <jna...@an10.io> wrote:
> > >
> > > I have a large data set of 1B records and want to run analytics using
> > > Apache Spark because of the scaling it provides, but I am seeing an
> > > anti-pattern here: the more nodes I add to the Spark cluster, the more
> > > the completion time increases. The data store is Cassandra, and queries
> > > are run through Zeppelin. I have tried many different queries, but even
> > > a simple `dataframe.count()` behaves like this.
> > >
> > > Here is the Zeppelin notebook; the temp table has 18M records:
> > >
> > >     val df = sqlContext
> > >       .read
> > >       .format("org.apache.spark.sql.cassandra")
> > >       .options(Map("table" -> "temp", "keyspace" -> "mykeyspace"))
> > >       .load().cache()
> > >     df.registerTempTable("table")
> > >
> > >     %sql
> > >     SELECT first(devid), date, count(1) FROM table GROUP BY date, rtu ORDER BY date
> > >
> > > When tested against different numbers of Spark worker nodes, these were
> > > the results:
> > >
> > >     Spark nodes    Time
> > >     4 nodes        22 min 58 sec
> > >     3 nodes        15 min 49 sec
> > >     2 nodes        12 min 51 sec
> > >     1 node         17 min 59 sec
> > >
> > > Increasing the number of nodes decreases performance, which should not
> > > happen, as it defeats the purpose of using Spark.
> > >
> > > If you want me to run any query or need further info about the setup,
> > > please ask. Any clues as to why this is happening would be very
> > > helpful; I have been stuck on this for two days now. Thank you for
> > > your time.
> > >
> > > Versions:
> > > Zeppelin: 0.7.1
> > > Spark: 2.1.0
> > > Cassandra: 2.2.9
> > > Connector: datastax:spark-cassandra-connector:2.0.1-s_2.11
> > >
> > > Spark cluster specs: 6 vCPUs, 32 GB memory = 1 node
> > > Cassandra + Zeppelin server specs: 8 vCPUs, 52 GB memory
>
> --
> Best Regards,
> Ayan Guha
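On Jörn's group-by point above: a quick way to test whether one hot
(date, rtu) key is funneling most of the 18M rows to a single reducer is to
count rows per grouping key. A minimal sketch, reusing the cached df from
the notebook in the thread:

    import org.apache.spark.sql.functions.desc

    // Rows per grouping key, heaviest first; if the top few keys hold most
    // of the data, the GROUP BY shuffle concentrates on a few tasks.
    val keyCounts = df.groupBy("date", "rtu").count().orderBy(desc("count"))
    keyCounts.show(10)

If the counts are roughly even, key skew is not the problem, and the read
bandwidth of the single Cassandra node is the more likely bottleneck.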