I am not sure... the suggestion is to open a terabyte-scale file and remove a line? That doesn't sound that good. I am hacking my way around it by using a filter. Can I put a try/except clause in my lambda function? Maybe I should just try that out. But thanks for the suggestion.
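Here is a minimal sketch of the filter hack I mean (the file name "data.csv" is just a placeholder). One thing I do know: a lambda can't contain try/except, since Python lambdas are limited to a single expression, so the parsing has to go in a named function instead, and the filter then drops whatever failed to parse:

    from pyspark import SparkContext

    sc = SparkContext("local", "csv-counts")
    data = sc.textFile("data.csv").map(lambda line: line.split(","))

    def safe_parse(row):
        # (id, count) for data rows; None for the header or any malformed row.
        try:
            return (row[0], int(row[1]))
        except (ValueError, IndexError):
            return None

    counts = (data.map(safe_parse)
                  .filter(lambda pair: pair is not None)
                  .reduceByKey(lambda a, b: a + b))

The header line "id, counts" fails the int() conversion, comes back as None, and gets filtered out before the reduceByKey.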
Also, can I run scripts against Spark rather than using the PySpark shell? (A sketch of what I have in mind is below, after the quoted thread.)

On Wed, Feb 26, 2014 at 9:34 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> The bad solution is to run a mapper through the data and null out the
> counts; the good solution is to trim the header beforehand, without Spark.
>
> On Feb 26, 2014 9:28 AM, "Chengi Liu" <chengi.liu...@gmail.com> wrote:
>
>> Hi,
>>   How do we deal with headers in a CSV file?
>> For example:
>>
>> id, counts
>> 1,2
>> 1,5
>> 2,20
>> 2,25
>> ... and so on
>>
>> I want to sum the counts for each id, so the result will be:
>>
>> 1,7
>> 2,45
>>
>> and so on.
>> My code:
>>
>> counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)
>>
>> But I see this error:
>>
>> ValueError: invalid literal for int() with base 10: 'counts'
>>
>>   at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
>>   at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
>>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>>   at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:694)
>>   at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:679)
>>
>> I guess this is because of the header.
>>
>> Q1) How do I exclude the header from this?
>> Q2) Rather than using PySpark, how do I run Python programs on Spark?
>>
>> Thanks
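A minimal sketch of the standalone-script idea (the script name, master URL, and input path are all placeholders, and the launch command depends on the Spark version, e.g. bin/pyspark my_job.py on 0.9-era releases, or bin/spark-submit my_job.py on 1.0 and later):

    # my_job.py -- hypothetical standalone PySpark script
    from pyspark import SparkContext

    if __name__ == "__main__":
        sc = SparkContext("local", "my-job")   # master URL is a placeholder
        data = sc.textFile("data.csv")         # input path is a placeholder
        header = data.first()                  # grab the header line
        rows = data.filter(lambda line: line != header)
        print(rows.count())
        sc.stop()

The point is just that the SparkContext is created inside the script itself instead of being handed to you by the shell.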