Hi,
How do we deal with headers in a CSV file?
For example:
id, counts
1,2
1,5
2,20
2,25
... and so on
I want to sum the counts for each id, so the result will be:
1,7
2,45
and so on..
My code:
counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)
But I see this error:
ValueError: invalid literal for int() with base 10: 'counts'
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:694)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:679)
I guess this is because of the header: int() is being called on the string 'counts'.
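One workaround I can imagine, sketched here in plain Python so it runs without a cluster, is to treat the first row as the header and filter it out before converting to int. The function name sum_counts_skip_header and the sample rows are made up for illustration. On an RDD the analogous calls would presumably be data.first() and data.filter(...), but I have not confirmed that this is the recommended way.

```python
from collections import defaultdict


def sum_counts_skip_header(rows):
    """Sum the second column per id, ignoring the header row.

    `rows` is a list of [id, count] string pairs whose first element is the
    header (a hypothetical stand-in for the parsed CSV data).
    """
    header = rows[0]  # RDD analogue (assumption): header = data.first()
    # RDD analogue (assumption): data.filter(lambda row: row != header)
    body = [row for row in rows if row != header]

    totals = defaultdict(int)
    for key, value in body:
        # Same role as reduceByKey(lambda a, b: a + b) in the snippet above
        totals[key] += int(value)
    return dict(totals)


# Sample rows taken from the example at the top of this message
rows = [["id", "counts"], ["1", "2"], ["1", "5"], ["2", "20"], ["2", "25"]]
result = sum_counts_skip_header(rows)
print(result)
```

Filtering on equality with the header row would also drop any data row that happens to be identical to the header, which should not matter for numeric counts.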
Q1) How do I exclude the header from this?
Q2) Rather than using the pyspark shell, how do I run Python programs on Spark?
Thanks