So, there is some new input: the problem could be in spark-sql-thriftserver. When I submit the SQL query from the Spark console, it takes 10 seconds and a reasonable number of tasks:
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)
cc.sql("select su.user_id from appdata.site_users su join appdata.user_orders uo on uo.user_id = su.user_id").count()
res8: Long = 2

If the same query is submitted through beeline, it takes minutes and Spark creates up to 2000 tasks to read 3 rows of data. We think spark-sql-thriftserver has a bug in it.

2015-06-17 14:14 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>:

> >version
> We are on DSE 4.7 (Cassandra 2.1) and Spark 1.2.1.
>
> >cqlsh
> select * from site_users
> returns fast, subsecond, only 3 rows.
>
> >Can you show some code how you're doing the reads?
> dse beeline
> !connect ...
> select * from site_users
> -- the table has 3 rows, several columns in each row. Spark runs 769 tasks
> and estimates the input as 800000 TB.
>
> 0: jdbc:hive2://dsenode01:10000> select count(*) from site_users;
> +------+
> | _c0  |
> +------+
> | 3    |
> +------+
> 1 row selected (41.635 seconds)
>
> >Spark and Cassandra-connector
> /usr/share/dse/spark/lib/spark-cassandra-connector-java_2.10-1.2.1.jar
> /usr/share/dse/spark/lib/spark-cassandra-connector_2.10-1.2.1.jar
>
> 2015-06-17 13:52 GMT+02:00 Yana Kadiyska <yana.kadiy...@gmail.com>:
>
>> Can you show some code for how you're doing the reads? Have you
>> successfully read other data from Cassandra before (i.e. do you have a
>> lot of experience with this path and this particular table is causing
>> issues, or are you trying to figure out the right way to do a read)?
>>
>> What versions of Spark and the Cassandra connector are you using?
>> Also, what do you get for "select count(*) from foo" -- is that just as
>> bad?
>>
>> On Wed, Jun 17, 2015 at 4:37 AM, Serega Sheypak <serega.shey...@gmail.com>
>> wrote:
>>
>>> Hi, can somebody suggest a way to reduce the number of tasks?
>>>
>>> 2015-06-15 18:26 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>:
>>>
>>>> Hi, I'm running Spark SQL against a Cassandra table. I have 3 C*
>>>> nodes, each of them running a Spark worker.
>>>> The problem is that Spark runs 869 tasks to read 3 rows: select bar
>>>> from foo.
>>>> I've tried these properties:
>>>>
>>>> # try to avoid 769 tasks per dummy "select bar from foo" query
>>>> spark.cassandra.input.split.size_in_mb=32mb
>>>> spark.cassandra.input.fetch.size_in_rows=1000
>>>> spark.cassandra.input.split.size=10000
>>>>
>>>> but it doesn't help.
>>>>
>>>> Here are the mean metrics for the job:
>>>> input1 = 8388608.0 TB
>>>> input2 = -320 B
>>>> input3 = -400 B
>>>>
>>>> I'm confused by the input metrics: there are only 3 rows in the C*
>>>> table. I definitely don't have 8388608.0 TB of data :)
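For anyone hitting the same symptom: the scan stage launches one task per Spark partition, so a quick way to see where the 769/869 tasks come from is to count the partitions the connector creates for the table. A minimal spark-shell sketch, assuming the same keyspace and table as above and the shell's implicit sc:

import com.datastax.spark.connector._

// One scan task is launched per Spark partition, so this count
// should match the task count shown in the UI for a plain table scan.
val rdd = sc.cassandraTable("appdata", "site_users")
println(rdd.partitions.length)

If this prints hundreds of partitions for a 3-row table, the problem is in how the connector splits the table, not in the SQL layer.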
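On the settings themselves, a hedged sketch of how they could be applied when the context is built. The property names are copied verbatim from the message above; whether each one exists under that name in connector 1.2.1 should be checked against its docs. One assumption worth flagging: a key ending in _in_mb presumably expects a bare number, so a value like "32mb" may fail to parse rather than take effect.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext

// Sketch only: property names taken from the thread, values illustrative.
// Master URL and Cassandra connection host are assumed to be supplied by
// the environment (e.g. dse spark-submit).
val conf = new SparkConf()
  .setAppName("site-users-scan")
  .set("spark.cassandra.input.split.size", "10000")          // input split size
  .set("spark.cassandra.input.split.size_in_mb", "32")       // bare number, not "32mb"
  .set("spark.cassandra.input.fetch.size_in_rows", "1000")   // rows per fetch round trip

val sc = new SparkContext(conf)
val cc = new CassandraSQLContext(sc)
cc.sql("select user_id from appdata.site_users").count()

Note also that beeline talks to the thrift server's own SparkContext, so settings applied in a spark-shell session won't affect queries submitted through beeline; they would have to be passed to the thrift server at startup.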