Hi, I have a CSV file (say n columns), and I am trying to do a filter operation like:

    query = rdd.filter(lambda x: x[1] == "1234")
    query.take(20)

Basically, this should return the rows whose second column equals that value, right? The operation is taking quite some time to execute (if I had to compare, it is maybe even slower than a comparable Hadoop job). This is what I am seeing on my console:

14/03/03 16:13:03 INFO PythonRDD: Times: total = 5245, boot = 3, init = 8, finish = 5234
14/03/03 16:13:03 INFO SparkContext: Job finished: take at <stdin>:1, took 5.249082169 s
14/03/03 16:13:03 INFO SparkContext: Starting job: take at <stdin>:1
14/03/03 16:13:03 INFO DAGScheduler: Got job 715 (take at <stdin>:1) with 1 output partitions (allowLocal=true)
14/03/03 16:13:03 INFO DAGScheduler: Final stage: Stage 720 (take at <stdin>:1)
14/03/03 16:13:03 INFO DAGScheduler: Parents of final stage: List()
14/03/03 16:13:03 INFO DAGScheduler: Missing parents: List()
14/03/03 16:13:03 INFO DAGScheduler: Computing the requested partition locally
14/03/03 16:13:03 INFO HadoopRDD: Input split: hdfs://master:9000/user/hadoop/data/input.csv:5100273664+134217728

Am I not doing this correctly?

--
Mohit

"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
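P.S. In case it helps, here is a minimal sketch of the whole job I am running. The SparkContext master URL and app name are placeholders, and the plain split(",") parsing is just the simplistic way I build the rows (it does not handle quoted fields); the HDFS path is the one from the HadoopRDD log line above.

    from pyspark import SparkContext

    # Master URL and app name are placeholders for my standalone cluster.
    sc = SparkContext("spark://master:7077", "csv-filter")

    # Input path taken from the HadoopRDD log line above.
    lines = sc.textFile("hdfs://master:9000/user/hadoop/data/input.csv")

    # Naive CSV parsing: split each line on commas (no quoted-field handling).
    rdd = lines.map(lambda line: line.split(","))

    # Keep rows whose second column equals "1234", then pull 20 of them back.
    query = rdd.filter(lambda x: x[1] == "1234")
    print(query.take(20))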