According to this link, https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md,
I tried the following, but it still looks like it is taking forever:

    sc.cassandraTable(keyspace, table).cassandraCount

On Thu, Nov 24, 2016 at 12:56 AM, kant kodali <kanth...@gmail.com> wrote:

> I would be glad if SELECT COUNT(*) FROM hello can return any value for
> that size :) I can say for sure it didn't return anything for 30 mins, and
> I probably need to build more patience to sit for a few more hours after
> that! Cassandra recommends using nodetool cfstats (ColumnFamilyStats),
> which gives a pretty good estimate but not an accurate value.
>
> On Thu, Nov 24, 2016 at 12:48 AM, Anastasios Zouzias <zouz...@gmail.com>
> wrote:
>
>> How fast is Cassandra without Spark on the count operation?
>>
>>     cqlsh> SELECT COUNT(*) FROM hello;
>>
>> (This is not equivalent to what you are doing, but it might help you
>> find the root cause.)
>>
>> On Thu, Nov 24, 2016 at 9:03 AM, kant kodali <kanth...@gmail.com> wrote:
>>
>>> I have the following code.
>>>
>>> I invoke spark-shell as follows:
>>>
>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 \
>>>       --executor-memory 15G --executor-cores 12 \
>>>       --conf spark.cassandra.input.split.size_in_mb=67108864
>>>
>>> Code:
>>>
>>>     scala> val df = spark.sql("SELECT test from hello") // billion rows
>>>                          // in hello, and the test column is 1KB
>>>     df: org.apache.spark.sql.DataFrame = [test: binary]
>>>
>>>     scala> df.count
>>>     [Stage 0:> (0 + 2) / 13] // I don't know what these numbers mean
>>>                              // precisely.
>>>
>>> If I invoke spark-shell as follows:
>>>
>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>>>
>>> Code:
>>>
>>>     scala> val df = spark.sql("SELECT test from hello") // about a
>>>                          // billion rows
>>>
>>>     scala> df.count
>>>     [Stage 0:=> (686 + 2) / 24686] // What are these numbers precisely?
>>>
>>> Neither version worked; Spark keeps running forever, and I have been
>>> waiting for more than 15 minutes with no response. Any ideas on what
>>> could be wrong and how to fix this?
>>>
>>> I am using Spark 2.0.2
>>> and spark-cassandra-connector_2.11-2.0.0-M3.jar
>>
>> --
>> -- Anastasios Zouzias
>> <a...@zurich.ibm.com>
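A possibly relevant detail from the connector's configuration reference: `spark.cassandra.input.split.size_in_mb` is measured in megabytes (default 64), not bytes. A sketch of the first spark-shell invocation with the setting read that way (the host and resource flags are taken from the thread; whether this resolves the hang is an assumption, not something the thread confirms):

```shell
# spark.cassandra.input.split.size_in_mb is in MB, default 64.
# The value 67108864 used above equals 64 * 1024 * 1024, i.e. it asks for
# roughly 64 TB of data per Spark partition, which would collapse the scan
# into a handful of enormous tasks -- consistent with the "(0 + 2) / 13"
# progress line (completed + running tasks out of 13 total).
./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 \
  --executor-memory 15G --executor-cores 12 \
  --conf spark.cassandra.input.split.size_in_mb=64
```

With the default-sized splits, the second run's "(686 + 2) / 24686" reads the same way: 686 tasks done, 2 running, 24686 total.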