Could be a number of issues.. maybe your CSV is not being split into multiple map tasks, or the file is not local to the node processing it.. How many tasks are you seeing in the Spark web UI for the map and store stages? Are all the nodes being used when you look at the task level? Is the time taken by each task roughly equal, or very skewed?

Regards,
Mayur
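As a rough illustration of the checks above, here is a minimal PySpark sketch (the minPartitions value is just a placeholder, the path is taken from your log below, and rdd.getNumPartitions() assumes a recent enough PySpark):

    from pyspark import SparkContext

    sc = SparkContext(appName="csv-filter-check")

    # Path from the log in your mail; adjust for your cluster.
    path = "hdfs://master:9000/user/hadoop/data/input.csv"

    # Ask for more input splits up front; minPartitions is only a hint to the
    # Hadoop input format, but it usually raises the number of map tasks you
    # see in the web UI.
    rdd = sc.textFile(path, minPartitions=64).map(lambda line: line.split(","))

    # Number of partitions == number of map tasks for this stage.
    print(rdd.getNumPartitions())

    # Cache if you will run several queries, so each filter/take does not
    # re-read the file from HDFS.
    rdd.cache()

    query = rdd.filter(lambda x: x[1] == "1234")
    print(query.take(20))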
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Mon, Mar 3, 2014 at 4:13 PM, Mohit Singh <mohit1...@gmail.com> wrote:

> Hi,
> I have a csv file... (say "n" columns)
>
> I am trying to do a filter operation like:
>
> query = rdd.filter(lambda x: x[1] == "1234")
> query.take(20)
>
> Basically this would return me rows with that specific value?
> This manipulation is taking quite some time to execute.. (if i can
> compare.. maybe slower than hadoop operation..)
>
> I am seeing this on my console:
> 14/03/03 16:13:03 INFO PythonRDD: Times: total = 5245, boot = 3, init = 8, finish = 5234
> 14/03/03 16:13:03 INFO SparkContext: Job finished: take at <stdin>:1, took 5.249082169 s
> 14/03/03 16:13:03 INFO SparkContext: Starting job: take at <stdin>:1
> 14/03/03 16:13:03 INFO DAGScheduler: Got job 715 (take at <stdin>:1) with 1 output partitions (allowLocal=true)
> 14/03/03 16:13:03 INFO DAGScheduler: Final stage: Stage 720 (take at <stdin>:1)
> 14/03/03 16:13:03 INFO DAGScheduler: Parents of final stage: List()
> 14/03/03 16:13:03 INFO DAGScheduler: Missing parents: List()
> 14/03/03 16:13:03 INFO DAGScheduler: Computing the requested partition locally
> 14/03/03 16:13:03 INFO HadoopRDD: Input split: hdfs://master:9000/user/hadoop/data/input.csv:5100273664+134217728
>
> Am I not doing this correctly?
>
>
> --
> Mohit
>
> "When you want success as badly as you want the air, then you will get it.
> There is no other secret of success."
> -Socrates