One Cassandra node.
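Since everything is read from that single Cassandra node, every Spark
executor pulls its input over the network from one machine, so extra
workers mostly add contention rather than parallel I/O. A minimal sketch of
how this could be checked and tuned, assuming the same "temp" table and
"mykeyspace" keyspace as in the notebook below; the split size and
shuffle-partition values are illustrative guesses, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cassandra-read-check")
      // Larger input splits mean fewer round trips to the single Cassandra
      // node (the connector's default split size is 64 MB).
      .config("spark.cassandra.input.split.size_in_mb", "256")
      // The SQL shuffle default of 200 partitions is likely too many for an
      // 18M-row table; fewer partitions reduce shuffle overhead.
      .config("spark.sql.shuffle.partitions", "24")
      .getOrCreate()

    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "temp", "keyspace" -> "mykeyspace"))
      .load()
      .cache()

    // A very large partition count here means many small reads against
    // the one Cassandra node.
    println(s"input partitions: ${df.rdd.getNumPartitions}")

    // Materialize the cache once, so that later timings measure Spark
    // processing rather than the Cassandra read.
    df.count()

In Zeppelin these settings would normally go into the Spark interpreter
configuration rather than a new session; the block above is only written as
self-contained for clarity.

Best Regards,
Junaid Nasir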
> On May 18, 2017 at 3:56 AM, ayan guha <guha.a...@gmail.com> wrote:
>
> How many nodes do you have in the Cassandra cluster?
>
> > On Thu, 18 May 2017 at 1:33 am, Jörn Franke <jornfra...@gmail.com> wrote:
> >
> > The issue might be the group by, which under certain circumstances can
> > cause a lot of traffic to one node. Of course, the fewer nodes you have,
> > the less of this transfer is needed.
> >
> > Have you checked in the UI what it reports?
> >
> > > On 17. May 2017, at 17:13, Junaid Nasir <jna...@an10.io> wrote:
> > >
> > > I have a large data set of 1B records and want to run analytics using
> > > Apache Spark because of the scaling it provides, but I am seeing an
> > > anti-pattern here: the more nodes I add to the Spark cluster, the more
> > > the completion time increases. The data store is Cassandra, and queries
> > > are run through Zeppelin. I have tried many different queries, but even
> > > a simple `dataframe.count()` behaves like this.
> > >
> > > Here is the Zeppelin notebook; the temp table has 18M records:
> > >
> > >     val df = sqlContext
> > >       .read
> > >       .format("org.apache.spark.sql.cassandra")
> > >       .options(Map("table" -> "temp", "keyspace" -> "mykeyspace"))
> > >       .load().cache()
> > >     df.registerTempTable("table")
> > >
> > >     %sql
> > >     SELECT first(devid), date, count(1) FROM table GROUP BY date, rtu ORDER BY date
> > >
> > > When tested against different numbers of Spark worker nodes, these were
> > > the results:
> > >
> > >     Spark nodes    Time
> > >     4 nodes        22 min 58 sec
> > >     3 nodes        15 min 49 sec
> > >     2 nodes        12 min 51 sec
> > >     1 node         17 min 59 sec
> > >
> > > Increasing the number of nodes decreases performance, which should not
> > > happen, as it defeats the purpose of using Spark.
> > >
> > > If you want me to run any query or need further info about the setup,
> > > please ask. Any clues as to why this is happening would be very
> > > helpful; I have been stuck on this for two days now. Thank you for
> > > your time.
> > >
> > > Versions:
> > > Zeppelin: 0.7.1
> > > Spark: 2.1.0
> > > Cassandra: 2.2.9
> > > Connector: datastax:spark-cassandra-connector:2.0.1-s_2.11
> > >
> > > Spark cluster specs: 6 vCPUs, 32 GB memory = 1 node
> > > Cassandra + Zeppelin server specs: 8 vCPUs, 52 GB memory
>
> --
> Best Regards,
> Ayan Guha
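On Jörn's group-by point above: a quick way to test whether one hot
(date, rtu) key is funneling most of the 18M rows to a single reducer is to
count rows per grouping key. A minimal sketch, reusing the cached df from
the notebook in the thread:

    import org.apache.spark.sql.functions.desc

    // Rows per grouping key, heaviest first; if the top few keys hold most
    // of the data, the GROUP BY shuffle concentrates on a few tasks.
    val keyCounts = df.groupBy("date", "rtu").count().orderBy(desc("count"))
    keyCounts.show(10)

If the counts are roughly even, key skew is not the problem, and the read
bandwidth of the single Cassandra node is the more likely bottleneck.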