Hi,
  How do we deal with headers in a CSV file?
For example:
id,counts
1,2
1,5
2,20
2,25
... and so on


And I want to sum the counts for each id, so the result will be:

1,7
2,45

and so on..
My code:
counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)

But I see this error:
ValueError: invalid literal for int() with base 10: 'counts'

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:694)
    at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:679)


I guess this is because of the header row ('id,counts' is the first line, and int('counts') fails).
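To check that dropping the header fixes it, here is a local sketch of the same pipeline using plain Python lists in place of the RDD (the `lines` data below is made up from the example above; in PySpark the equivalent would be a `filter` that drops the header line before the `map`/`reduceByKey`):

```python
from collections import defaultdict

# Stand-in for sc.textFile(...): raw lines, header first.
lines = ["id,counts", "1,2", "1,5", "2,20", "2,25"]

# Drop the header line, then split each remaining row on the comma.
header = lines[0]
rows = [line.split(",") for line in lines if line != header]

# Sum counts per id, mimicking reduceByKey(lambda a, b: a + b).
totals = defaultdict(int)
for key, value in rows:
    totals[key] += int(value)

# totals now maps each id to its summed counts: {"1": 7, "2": 45}
```

The same filter-out-the-header step in the actual RDD chain would look like `data.filter(lambda line: line != header)` before the map, where `header` is the first line of the file.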

Q1) How do I exclude the header from this?
Q2) Rather than using the pyspark shell, how do I run Python programs on Spark?

Thanks
