Some accurate numbers here: it took me 1 hr 30 min to count 698,705,723 rows (~700 million), and my code is just this:

sc.cassandraTable("cuneiform", "blocks").cassandraCount
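For completeness, the full session looks roughly like this (the split-size override here is my assumption of a sane value; as far as I can tell from the connector docs, spark.cassandra.input.split.size_in_mb is interpreted in megabytes and defaults to 64):

./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 \
              --conf spark.cassandra.input.split.size_in_mb=64

scala> import com.datastax.spark.connector._  // brings cassandraTable / cassandraCount into scope

scala> val rows = sc.cassandraTable("cuneiform", "blocks").cassandraCount()
// cassandraCount runs the count on the Cassandra side, one token range per
// Spark partition, instead of pulling every 1 KB value into Spark just to count it

scala> println(rows)  // 698705723, after ~1 hr 30 min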
On Thu, Nov 24, 2016 at 10:48 AM, kant kodali <kanth...@gmail.com> wrote:

> Take a look at this: https://github.com/brianmhess/cassandra-count
>
> Now it is just a matter of incorporating it into the spark-cassandra-connector, I guess.
>
> On Thu, Nov 24, 2016 at 1:01 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> According to this link:
>> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md
>>
>> I tried the following, but it still looks like it is taking forever:
>>
>> sc.cassandraTable(keyspace, table).cassandraCount
>>
>> On Thu, Nov 24, 2016 at 12:56 AM, kant kodali <kanth...@gmail.com> wrote:
>>
>>> I would be glad if SELECT COUNT(*) FROM hello returned any value at all
>>> for that size :) I can say for sure it didn't return anything for 30
>>> minutes, and I would probably need to build more patience to sit for a few
>>> more hours after that! The usual Cassandra recommendation is nodetool
>>> cfstats (ColumnFamilyStats), which gives a pretty good estimate but not an
>>> accurate value.
>>>
>>> On Thu, Nov 24, 2016 at 12:48 AM, Anastasios Zouzias <zouz...@gmail.com>
>>> wrote:
>>>
>>>> How fast is Cassandra without Spark on the count operation?
>>>>
>>>> cqlsh> SELECT COUNT(*) FROM hello
>>>>
>>>> (This is not equivalent to what you are doing, but it might help you
>>>> find the root cause.)
>>>>
>>>> On Thu, Nov 24, 2016 at 9:03 AM, kant kodali <kanth...@gmail.com>
>>>> wrote:
>>>>
>>>>> I have the following code.
>>>>>
>>>>> I invoke spark-shell as follows:
>>>>>
>>>>> ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>>>>> --executor-memory 15G --executor-cores 12 --conf
>>>>> spark.cassandra.input.split.size_in_mb=67108864
>>>>>
>>>>> code:
>>>>>
>>>>> scala> val df = spark.sql("SELECT test from hello") // a billion
>>>>> rows in hello, and the test column is 1KB
>>>>>
>>>>> df: org.apache.spark.sql.DataFrame = [test: binary]
>>>>>
>>>>> scala> df.count
>>>>>
>>>>> [Stage 0:> (0 + 2) / 13] // I don't know what these numbers mean,
>>>>> precisely.
>>>>>
>>>>> If I invoke spark-shell as follows:
>>>>>
>>>>> ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>>>>>
>>>>> code:
>>>>>
>>>>> val df = spark.sql("SELECT test from hello") // This has about a
>>>>> billion rows
>>>>>
>>>>> scala> df.count
>>>>>
>>>>> [Stage 0:=> (686 + 2) / 24686] // What are these numbers, precisely?
>>>>>
>>>>> Neither of these versions worked: Spark keeps running forever, and I
>>>>> have been waiting for more than 15 minutes with no response. Any ideas on
>>>>> what could be wrong and how to fix this?
>>>>>
>>>>> I am using Spark 2.0.2
>>>>> and spark-cassandra-connector_2.11-2.0.0-M3.jar
>>>>
>>>> --
>>>> -- Anastasios Zouzias
>>>> <a...@zurich.ibm.com>
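P.S. Since the question came up in the thread: Spark's console progress bar prints (completed tasks + running tasks) / total tasks, so [Stage 0:> (0 + 2) / 13] means a 13-task stage with 2 tasks running and none finished yet. My hedged guess about the 13: spark.cassandra.input.split.size_in_mb is taken in megabytes, so 67108864 asks for ~64 TB splits, which would collapse the scan into a handful of huge partitions. A quick way to sanity-check the partitioning (keyspace/table names are mine, swap in your own):

scala> import com.datastax.spark.connector._

scala> val rdd = sc.cassandraTable("cuneiform", "blocks")

scala> rdd.getNumPartitions  // a tiny number suggests the splits are far too
// big; with the default 64 MB splits this should be in the thousands for
// ~700 million rows of 1 KB values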