Hi, I have a CSV file (say n columns), and I am trying to do a filter operation like:

    query = rdd.filter(lambda x: x[1] == "1234")
    query.take(20)

Basically, this should return the rows whose second column equals that value, right? The operation is taking quite some time to execute (if I had to compare, it is maybe even slower than a comparable Hadoop job). This is what I am seeing on my console:

14/03/03 16:13:03 INFO PythonRDD: Times: total = 5245, boot = 3, init = 8, finish = 5234
14/03/03 16:13:03 INFO SparkContext: Job finished: take at <stdin>:1, took 5.249082169 s
14/03/03 16:13:03 INFO SparkContext: Starting job: take at <stdin>:1
14/03/03 16:13:03 INFO DAGScheduler: Got job 715 (take at <stdin>:1) with 1 output partitions (allowLocal=true)
14/03/03 16:13:03 INFO DAGScheduler: Final stage: Stage 720 (take at <stdin>:1)
14/03/03 16:13:03 INFO DAGScheduler: Parents of final stage: List()
14/03/03 16:13:03 INFO DAGScheduler: Missing parents: List()
14/03/03 16:13:03 INFO DAGScheduler: Computing the requested partition locally
14/03/03 16:13:03 INFO HadoopRDD: Input split: hdfs://master:9000/user/hadoop/data/input.csv:5100273664+134217728

Am I not doing this correctly?

--
Mohit

"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
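P.S. In case it helps, here is a minimal sketch of the whole job I am running. The SparkContext master URL and app name are placeholders, and the plain split(",") parsing is just the simplistic way I build the rows (it does not handle quoted fields); the HDFS path is the one from the HadoopRDD log line above.

    from pyspark import SparkContext

    # Master URL and app name are placeholders for my standalone cluster.
    sc = SparkContext("spark://master:7077", "csv-filter")

    # Input path taken from the HadoopRDD log line above.
    lines = sc.textFile("hdfs://master:9000/user/hadoop/data/input.csv")

    # Naive CSV parsing: split each line on commas (no quoted-field handling).
    rdd = lines.map(lambda line: line.split(","))

    # Keep rows whose second column equals "1234", then pull 20 of them back.
    query = rdd.filter(lambda x: x[1] == "1234")
    print(query.take(20))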