Here is some more input:

The problem could be in spark-sql-thriftserver. When I use the Spark console to submit the SQL query, it takes 10 seconds and a reasonable number of tasks.

import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)

cc.sql("select su.user_id from appdata.site_users su join appdata.user_orders uo on uo.user_id=su.user_id").count()

res8: Long = 2

If the same query is submitted through beeline, it takes minutes and Spark
creates up to 2000 tasks to read 3 rows of data.

We think spark-sql-thriftserver has bugs in it.
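For what it's worth, the task count tracks the connector's size estimate: the number of input partitions is roughly the estimated table size divided by the split size, so a wildly inflated size estimate produces thousands of tasks regardless of how few rows the table holds. A minimal sketch of that arithmetic (the formula is an assumption about the connector's behavior, not its actual code):

```scala
// Illustrative assumption: the connector creates roughly
// estimatedBytes / splitBytes input splits, so a bogus size
// estimate inflates the task count even for a 3-row table.
object SplitMath {
  def numSplits(estimatedBytes: Long, splitBytes: Long): Long =
    math.max(1L, math.ceil(estimatedBytes.toDouble / splitBytes).toLong)

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    // 3 tiny rows wrongly estimated at ~48 GB with a 64 MB split size
    // already yields 768 splits, close to the 769 tasks observed.
    println(SplitMath.numSplits(48L * 1024 * mb, 64 * mb))
  }
}
```

If that assumption holds, the absurd input figures in the metrics (8388608.0 TB for 3 rows) point at size estimation in the read path rather than at the query itself.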

2015-06-17 14:14 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>:

> >version
> We are on DSE 4.7. (Cassandra 2.1) and spark 1.2.1
>
> >cqlsh
> select * from site_users
> returns fast, subsecond, only 3 rows
>
> >Can you show some code how you're doing the reads?
> dse beeline
> !connect ...
> select * from site_users
> --table has 3 rows, several columns in each row. Spark runs 769 tasks and
> estimates the input as 800000 TB
>
> 0: jdbc:hive2://dsenode01:10000> select count(*) from site_users;
>
> +------+
>
> | _c0  |
>
> +------+
>
> | 3    |
>
> +------+
>
> 1 row selected (41.635 seconds)
>
>
> >Spark and Cassandra-connector
>
> /usr/share/dse/spark/lib/spark-cassandra-connector-java_2.10-1.2.1.jar
>
> /usr/share/dse/spark/lib/spark-cassandra-connector_2.10-1.2.1.jar
>
> 2015-06-17 13:52 GMT+02:00 Yana Kadiyska <yana.kadiy...@gmail.com>:
>
>> Can you show some code how you're doing the reads? Have you successfully
>> read other data from Cassandra before (i.e. do you have a lot of experience
>> with this path and this particular table is causing issues, or are you
>> trying to figure out the right way to do a read)?
>>
>> What version of Spark and Cassandra-connector are you using?
>> Also, what do you get for "select count(*) from foo" -- is that just as
>> bad?
>>
>> On Wed, Jun 17, 2015 at 4:37 AM, Serega Sheypak <serega.shey...@gmail.com
>> > wrote:
>>
>>> Hi, can somebody suggest a way to reduce the number of tasks?
>>>
>>> 2015-06-15 18:26 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>:
>>>
>>>> Hi, I'm running Spark SQL against a Cassandra table. I have 3 C* nodes,
>>>> each of them with a Spark worker.
>>>> The problem is that Spark runs 869 tasks to read 3 rows: select bar
>>>> from foo.
>>>> I've tried these properties:
>>>>
>>>> #try to avoid 769 tasks per dummy select foo from bar query
>>>> spark.cassandra.input.split.size_in_mb=32mb
>>>> spark.cassandra.input.fetch.size_in_rows=1000
>>>> spark.cassandra.input.split.size=10000
>>>>
>>>> but it doesn't help.
>>>>
>>>> Here are the mean metrics for the job:
>>>> input1= 8388608.0 TB
>>>> input2 = -320 B
>>>> input3 = -400 B
>>>>
>>>> I'm confused by the input metrics; there are only 3 rows in the C* table.
>>>> I definitely don't have 8388608.0 TB of data :)
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
