You must be parsing each line of the file at some point anyway, so adding a step to filter out the header should work fine. It'll get executed at the same time as your parsing/conversion to ints, so there's no significant overhead aside from the check itself.
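For example, a minimal sketch of that idea (the file path and app name are placeholders, and I'm assuming the column names from your sample):

from pyspark import SparkContext

sc = SparkContext("local", "HeaderFilterExample")

# Load the CSV, split each line into fields, and drop the header row
# in the same pass that converts the counts to ints.
lines = sc.textFile("counts.csv")                     # hypothetical path
fields = lines.map(lambda line: [f.strip() for f in line.split(",")])
parsed = fields.filter(lambda x: x[0] != "id").map(lambda x: (x[0], int(x[1])))
totals = parsed.reduceByKey(lambda a, b: a + b)
print(totals.collect())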

For standalone programs, there's a section in the PySpark programming guide, along with a link to a complete example: http://spark.incubator.apache.org/docs/latest/python-programming-guide.html#standalone-programs
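As a rough illustration (the script name, local master, and input path are assumptions, not taken from the guide), a standalone program just creates its own SparkContext:

# standalone_app.py -- a hypothetical standalone PySpark script
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "StandaloneExample")   # assumed local master
    data = sc.textFile("counts.csv")                  # hypothetical input path
    print("lines:", data.count())
    sc.stop()

On the 0.9.x releases you can launch it with bin/pyspark standalone_app.py; the guide linked above covers the details.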

February 26, 2014 at 9:38 AM
I am not sure.. the suggestion is to open a TB-sized file and remove a line?
That doesn't sound that good.
I am working around it by using a filter for now.
Can I put a try/except clause in my lambda function? Maybe I should just try that out.
But thanks for the suggestion.
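Something like this is what I have in mind (the helper name and file path are just placeholders; since a lambda can't hold a try/except statement directly, it would need a named function):

from pyspark import SparkContext

def safe_int(value):
    # Return None for non-numeric fields such as the header's "counts" label.
    try:
        return int(value)
    except ValueError:
        return None

sc = SparkContext("local", "TryExceptExample")
fields = sc.textFile("counts.csv").map(lambda line: line.split(","))   # hypothetical path
parsed = fields.map(lambda x: (x[0], safe_int(x[1]))).filter(lambda kv: kv[1] is not None)
totals = parsed.reduceByKey(lambda a, b: a + b)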

Also, can I run scripts against Spark rather than using the pyspark shell?





February 26, 2014 at 9:34 AM

A bad solution is to run a mapper through the data and null out the header's counts; a good solution is to trim the header beforehand, without Spark.
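A rough sketch of the pre-trim step in plain Python (the file names are hypothetical):

# Rewrite the file without its first (header) line before loading it into Spark.
with open("counts.csv") as src, open("counts_noheader.csv", "w") as dst:
    next(src)          # skip the header row
    for line in src:
        dst.write(line)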

February 26, 2014 at 9:28 AM
Hi,
  How do we deal with headers in a CSV file?
For example:
id, counts
1,2
1,5
2,20
2,25
... and so on


And I want to total the counts for each id, so the result will be:

1,7
2,45

and so on..
My code:
counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)

But I see this error:
ValueError: invalid literal for int() with base 10: 'counts'

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:694)
    at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:679)


I guess it's because of the header...

Q1) How do I exclude the header from this?
Q2) Rather than using the pyspark shell, how do I run Python programs on Spark?

Thanks

