I am not sure... the suggestion is to open a terabyte-sized file and remove a line?
That doesn't sound that good.
I am hacking my way around it by using a filter for now.
Can I put a try/except clause in my lambda function? Maybe I should just
try that out.
But thanks for the suggestion.
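
To be concrete, here is the kind of thing I mean (a rough, untested
sketch; assuming sc is the SparkContext the pyspark shell provides and
that the header row starts with "id"). One caveat I realize: a Python
lambda can only hold a single expression, so a try/except cannot go
inside the lambda itself, but a named helper function works:

    # assumes sc is the SparkContext provided by the pyspark shell
    lines = sc.textFile("data.csv").map(lambda line: line.split(","))

    # Option 1: filter the header row out explicitly.
    rows = lines.filter(lambda x: x[0] != "id")
    counts = rows.map(lambda x: (x[0], int(x[1]))) \
                 .reduceByKey(lambda a, b: a + b)

    # Option 2: skip unparseable rows with a named helper, since
    # try/except cannot live inside a lambda.
    def parse(x):
        try:
            return [(x[0], int(x[1]))]
        except ValueError:
            return []  # drops the header (and any other bad row)

    counts = lines.flatMap(parse).reduceByKey(lambda a, b: a + b)

And if trimming the header beforehand without Spark is the way to go,
something like tail -n +2 data.csv > trimmed.csv should do it before
loading the file.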

Also, can I run standalone scripts against Spark rather than using the pyspark shell?
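I am imagining something along these lines (just a guess at the shape;
the file names are made up):

    # my_job.py -- standalone script instead of the interactive shell
    from pyspark import SparkContext

    sc = SparkContext("local", "MyJob")  # or a cluster master URL
    lines = sc.textFile("data.csv").map(lambda line: line.split(","))
    print(lines.take(5))
    sc.stop()

launched with something like bin/pyspark my_job.py?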

On Wed, Feb 26, 2014 at 9:34 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> A bad solution is to run a mapper through the data and null out the
> counts; a good solution is to trim the header beforehand, without Spark.
> On Feb 26, 2014 9:28 AM, "Chengi Liu" <chengi.liu...@gmail.com> wrote:
>
>> Hi,
>>   How do we deal with headers in a CSV file?
>> For example:
>> id, counts
>> 1,2
>> 1,5
>> 2,20
>> 2,25
>> ... and so on
>>
>>
>> And I want to total the counts for each id, so the result will be:
>>
>> 1,7
>> 2,45
>>
>> and so on..
>> My code:
>> counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)
>>
>> But I see this error:
>> ValueError: invalid literal for int() with base 10: 'counts'
>>
>>     at
>> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
>>     at
>> org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
>>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>>     at
>> org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:694)
>>     at
>> org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:679)
>>
>>
>> I guess that is because of the header...
>>
>> Q1) How do I exclude the header from this?
>> Q2) Rather than using the pyspark shell, how do I run Python programs on Spark?
>>
>> Thanks
>>
