Could be a number of issues.. maybe your CSV is not being split into multiple map tasks, or the file is not local to the node processing it.. How many tasks are you seeing in the Spark web UI for the map and store stages? Are all the nodes being used when you look at the task level? Is the time taken by each task roughly equal, or very skewed?

Regards,
Mayur
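As a rough illustration of the checks above, here is a minimal PySpark sketch (the minPartitions value is just a placeholder, the path is taken from your log below, and rdd.getNumPartitions() assumes a recent enough PySpark):

    from pyspark import SparkContext

    sc = SparkContext(appName="csv-filter-check")

    # Path from the log in your mail; adjust for your cluster.
    path = "hdfs://master:9000/user/hadoop/data/input.csv"

    # Ask for more input splits up front; minPartitions is only a hint to the
    # Hadoop input format, but it usually raises the number of map tasks you
    # see in the web UI.
    rdd = sc.textFile(path, minPartitions=64).map(lambda line: line.split(","))

    # Number of partitions == number of map tasks for this stage.
    print(rdd.getNumPartitions())

    # Cache if you will run several queries, so each filter/take does not
    # re-read the file from HDFS.
    rdd.cache()

    query = rdd.filter(lambda x: x[1] == "1234")
    print(query.take(20))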
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Mon, Mar 3, 2014 at 4:13 PM, Mohit Singh <mohit1...@gmail.com> wrote:

> Hi,
> I have a csv file... (say "n" columns)
>
> I am trying to do a filter operation like:
>
> query = rdd.filter(lambda x: x[1] == "1234")
> query.take(20)
>
> Basically this would return me rows with that specific value?
> This manipulation is taking quite some time to execute.. (if i can
> compare.. maybe slower than hadoop operation..)
>
> I am seeing this on my console:
> 14/03/03 16:13:03 INFO PythonRDD: Times: total = 5245, boot = 3, init = 8, finish = 5234
> 14/03/03 16:13:03 INFO SparkContext: Job finished: take at <stdin>:1, took 5.249082169 s
> 14/03/03 16:13:03 INFO SparkContext: Starting job: take at <stdin>:1
> 14/03/03 16:13:03 INFO DAGScheduler: Got job 715 (take at <stdin>:1) with 1 output partitions (allowLocal=true)
> 14/03/03 16:13:03 INFO DAGScheduler: Final stage: Stage 720 (take at <stdin>:1)
> 14/03/03 16:13:03 INFO DAGScheduler: Parents of final stage: List()
> 14/03/03 16:13:03 INFO DAGScheduler: Missing parents: List()
> 14/03/03 16:13:03 INFO DAGScheduler: Computing the requested partition locally
> 14/03/03 16:13:03 INFO HadoopRDD: Input split: hdfs://master:9000/user/hadoop/data/input.csv:5100273664+134217728
>
> Am I not doing this correctly?
>
>
> --
> Mohit
>
> "When you want success as badly as you want the air, then you will get it.
> There is no other secret of success."
> -Socrates