Hi, how do we deal with headers in a CSV file? For example:

    id,counts
    1,2
    1,5
    2,20
    2,25
    ...
I want to sum the counts for each id, so the result would be:

    1,7
    2,45
    ...

My code:

    counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)

But I see this error:

    ValueError: invalid literal for int() with base 10: 'counts'
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:694)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:679)

I guess this is because of the header row: int('counts') fails on the first line.

Q1) How do I exclude the header from this?
Q2) Rather than using the pyspark shell, how do I run Python programs on Spark?

Thanks
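For reference, this is a minimal plain-Python sketch of the aggregation I am trying to express in Spark. The `lines` list is a hypothetical stand-in for the raw lines an RDD would read from the file; the header is skipped by comparing each line against the first one, which is the same idea I would hope to use in a Spark filter.

    # Plain-Python sketch of the intended sum-by-id aggregation.
    # 'lines' stands in for the raw CSV lines read from the file.
    lines = ["id,counts", "1,2", "1,5", "2,20", "2,25"]

    header = lines[0]  # "id,counts"
    # Skip the header row, then split each remaining line into (id, count).
    rows = [line.split(",") for line in lines if line != header]

    # Sum the counts per id, mirroring what reduceByKey should produce.
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + int(value)

    print(totals)  # {'1': 7, '2': 45}

This produces 1 -> 7 and 2 -> 45, the result I am after; the question is how to do the header-skipping part idiomatically on an RDD.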