A bad solution is to run a mapper over the whole data set just to filter out
the header row; a good solution is to trim the header beforehand, without
Spark (for example, tail -n +2 input.csv).
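
If you do want to handle it inside Spark, here is a minimal sketch of that
approach, assuming sc is an existing SparkContext and "input.csv" is a
hypothetical input path; it fetches the first line and filters out every line
equal to it, which costs a comparison on each record:

data = sc.textFile("input.csv")

# The first line of the file is the header ("id, counts").
header = data.first()

# Drop every line matching the header, split the remaining lines into
# fields, and sum the counts per id.
counts = (data.filter(lambda line: line != header)
              .map(lambda line: line.split(","))
              .map(lambda x: (x[0], int(x[1].strip())))
              .reduceByKey(lambda a, b: a + b))

print(counts.collect())  # e.g. [('1', 7), ('2', 45)]

The .strip() just guards against stray whitespace around the count field.
Trimming the header before the job starts avoids the per-record comparison
entirely, which is why it is the better option for large files.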
On Feb 26, 2014 9:28 AM, "Chengi Liu" <chengi.liu...@gmail.com> wrote:

> Hi,
>   How do we deal with headers in a CSV file?
> For example:
> id, counts
> 1,2
> 1,5
> 2,20
> 2,25
> ... and so on
>
>
> And I want to sum the counts for each id, so the result will be:
>
> 1,7
> 2,45
>
> ... and so on.
> My code:
> counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)
>
> But I see this error:
> ValueError: invalid literal for int() with base 10: 'counts'
>
>     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
>     at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>     at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:694)
>     at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:679)
>
>
> I guess this is because of the header...
>
> Q1) How do I exclude the header from this?
> Q2) Rather than using the pyspark shell, how do I run Python programs on Spark?
>
> Thanks
>
