A bad solution is to run a mapper over the data and null out the header values; a good solution is to trim the header beforehand, without Spark.

On Feb 26, 2014 9:28 AM, "Chengi Liu" <chengi.liu...@gmail.com> wrote:
> Hi,
> How do we deal with headers in a csv file?
> For example:
> id,counts
> 1,2
> 1,5
> 2,20
> 2,25
> ... and so on
>
> And I want to do a frequency count of counts for each id. So the result will be:
>
> 1,7
> 2,45
>
> and so on.
> My code:
>
> counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)
>
> But I see this error:
>
> ValueError: invalid literal for int() with base 10: 'counts'
>     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
>     at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>     at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:694)
>     at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:679)
>
> I guess because of the header...
>
> Q1) How do I exclude the header from this?
> Q2) Rather than using pyspark, how do I run Python programs on Spark?
>
> Thanks
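For illustration, here is a minimal local sketch of the header-filtering approach, in plain Python standing in for the RDD operations (the sample data is taken from the question; no Spark setup is assumed). The filter drops the header line before parsing, so `int()` never sees the string `'counts'`, and the per-key sum mirrors what `reduceByKey(lambda a, b: a + b)` does on an RDD:

```python
# Sample lines as they would come out of sc.textFile() on the CSV above.
lines = ["id,counts", "1,2", "1,5", "2,20", "2,25"]

# Drop the header before parsing -- the equivalent of
# rdd.filter(lambda line: not line.startswith("id,")).
rows = [line.split(",") for line in lines if not line.startswith("id,")]

# Sum counts per id, mirroring
# .map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b).
totals = {}
for key, value in rows:
    totals[key] = totals.get(key, 0) + int(value)

print(totals)  # {'1': 7, '2': 45}
```

In actual PySpark the same idea is a `filter` before the `map`; another common RDD idiom is to skip only the first line of the first partition with `mapPartitionsWithIndex`, which avoids matching on the header's contents.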