>version
We are on DSE 4.7 (Cassandra 2.1) and Spark 1.2.1.

>cqlsh
select * from site_users;
returns fast (sub-second); only 3 rows

>Can you show some code for how you're doing the reads?
dse beeline
!connect ...
select * from site_users
-- the table has 3 rows, with several columns in each row. Spark runs 769 tasks
and estimates the input as 800000 TB

0: jdbc:hive2://dsenode01:10000> select count(*) from site_users;
+------+
| _c0  |
+------+
| 3    |
+------+
1 row selected (41.635 seconds)
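
For reference, here is a minimal sketch of the same read done directly through
the connector in the DSE spark shell; it prints how many partitions (i.e. tasks)
a full scan of the table produces. "my_ks" is a placeholder for the real
keyspace, and it assumes the connector jars listed below are on the classpath:

// Sketch only -- "my_ks" is a placeholder keyspace name.
import com.datastax.spark.connector._

val users = sc.cassandraTable("my_ks", "site_users")

// One Spark task per RDD partition, so this is the task count for a full scan.
println("partitions: " + users.partitions.length)

// Sanity check -- should print 3 for this table.
println("rows: " + users.count())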


>Spark and Cassandra-connector

/usr/share/dse/spark/lib/spark-cassandra-connector-java_2.10-1.2.1.jar
/usr/share/dse/spark/lib/spark-cassandra-connector_2.10-1.2.1.jar
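
The same query can also be run from the spark shell through the connector's SQL
context, to see whether the task count differs from the thrift server path.
Again just a sketch, with "my_ks" as a placeholder keyspace:

// Sketch -- uses the CassandraSQLContext shipped with connector 1.2.x;
// "my_ks" is a placeholder for the real keyspace.
import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)
val rows = cc.sql("SELECT * FROM my_ks.site_users")
rows.collect().foreach(println)   // expect 3 rows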

2015-06-17 13:52 GMT+02:00 Yana Kadiyska <yana.kadiy...@gmail.com>:

> Can you show some code for how you're doing the reads? Have you successfully
> read other stuff from Cassandra? (I.e., do you have a lot of experience with
> this path and this particular table is causing issues, or are you trying to
> figure out the right way to do a read?)
>
> What version of Spark and Cassandra-connector are you using?
> Also, what do you get for "select count(*) from foo" -- is that just as
> bad?
>
> On Wed, Jun 17, 2015 at 4:37 AM, Serega Sheypak <serega.shey...@gmail.com>
> wrote:
>
>> Hi, can somebody suggest a way to reduce the number of tasks?
>>
>> 2015-06-15 18:26 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>:
>>
>>> Hi, I'm running Spark SQL against a Cassandra table. I have 3 C* nodes,
>>> each of them running a Spark worker.
>>> The problem is that Spark runs 869 tasks to read 3 rows: select bar from
>>> foo.
>>> I've tried these properties:
>>>
>>> # try to avoid 769 tasks per dummy "select foo from bar" query
>>> spark.cassandra.input.split.size_in_mb=32mb
>>> spark.cassandra.input.fetch.size_in_rows=1000
>>> spark.cassandra.input.split.size=10000
>>>
>>> but they don't help.
>>>
>>> Here are the mean metrics for the job:
>>> input1 = 8388608.0 TB
>>> input2 = -320 B
>>> input3 = -400 B
>>>
>>> I'm confused by the input figures; there are only 3 rows in the C* table.
>>> Definitely, I don't have 8388608.0 TB of data :)
>>>
>>
>
