Re: RDD filter in for loop gave strange results

2021-01-20 Thread Sean Owen
Heh, that could make sense, but that definitely was not my mental model of how Python binds variables! It is definitely not how Scala works.

On Wed, Jan 20, 2021 at 10:00 AM Marco Wong wrote:
> Hmm, I think I got what Jingnan means. The lambda function is x != i and i is not evaluated when the lam
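A minimal plain-Python sketch (not from the thread) of the binding behavior Sean is reacting to; the name fns and the probe value 2 are illustrative:

```
# All three lambdas close over the same variable i; none of them
# captures the value i had when it was created.
fns = [lambda x: x != i for i in range(3)]
# By the time we call them, i's last value in the comprehension was 2,
# so every lambda tests x != 2:
print([f(2) for f in fns])  # [False, False, False]
```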

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Marco Wong
Hmm, I think I got what Jingnan means. The lambda function is x != i, and i is not evaluated when the lambda function is defined. So the pipelined rdd is rdd.filter(lambda x: x != i).filter(lambda x: x != i), rather than having the values of i substituted. Does that make sense to you, Sean?

On Wed
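A commonly cited Python idiom, not mentioned in the quoted messages, that forces early binding is a default argument, which is evaluated once, at definition time. A sketch assuming a running SparkSession named spark, as in the thread:

```
rdd = spark.sparkContext.parallelize([0, 1, 2])
for i in range(3):
    # i=i evaluates i right now and stores it as the default,
    # so each filter keeps its own copy of the loop variable.
    rdd = rdd.filter(lambda x, i=i: x != i)
print(rdd.collect())  # []: 0, 1 and 2 were each filtered out in turn
```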

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Sean Owen
No, because the final rdd is really the result of chaining 3 filter operations. They should all execute. It _should_ work like "rdd.filter(...).filter(...).filter(...)".

On Wed, Jan 20, 2021 at 9:46 AM Zhu Jingnan wrote:
> I thought that was the right result.
>
> As rdd runs on a lazy basis, so every
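A sketch of the chaining Sean describes (again assuming a running SparkSession named spark): all three filters do execute, but they share one late-bound i:

```
rdd = spark.sparkContext.parallelize([0, 1, 2])
for i in range(3):
    rdd = rdd.filter(lambda x: x != i)
# By the time collect() runs, i == 2, so the chain behaves like
#   rdd.filter(lambda x: x != 2).filter(lambda x: x != 2).filter(lambda x: x != 2)
print(rdd.collect())  # [0, 1]
```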

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Zhu Jingnan
I thought that was the right result.

As RDDs run on a lazy basis, every time rdd.collect() is executed, i will have been updated to its latest value, so only one element will be filtered out.

Regards
Jingnan

On Wed, Jan 20, 2021 at 9:01 AM Sean Owen wrote:
> That looks very odd indeed. Things like this wor
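A plain-Python analogy (no Spark involved) for the laziness Jingnan describes; filter() here is Python's built-in lazy iterator, standing in for the lazy RDD transformation:

```
data = [0, 1, 2]
i = 0
lazy = filter(lambda x: x != i, data)  # nothing is evaluated yet
i = 2                                  # rebind i before materializing
print(list(lazy))  # [0, 1]: the predicate saw i == 2, not 0
```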

Re: RDD filter in for loop gave strange results

2021-01-20 Thread יורי אולייניקוב
A. Global scope and global variables are bad habits in Python (this is about the 'rdd' and 'i' variables used in the lambda).

B. Lambdas are usually misused and abused in Python, especially when they are used in a global context: ideally you'd like to use pure functions, with something like:

```
def my_rdd_f
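The code in that message is cut off at my_rdd_f, so the following is a hypothetical sketch of the pure-function style it advocates, not the author's actual code. A factory function gives each predicate its own binding of i:

```
def not_equal_to(i):
    # Each call creates a fresh closure cell for i, so the returned
    # predicate is insulated from the loop variable.
    def pred(x):
        return x != i
    return pred

rdd = spark.sparkContext.parallelize([0, 1, 2])
for i in range(3):
    rdd = rdd.filter(not_equal_to(i))
print(rdd.collect())  # []
```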

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Sean Owen
That looks very odd indeed. Things like this work as expected:

```
rdd = spark.sparkContext.parallelize([0,1,2])

def my_filter(data, i):
    return data.filter(lambda x: x != i)

for i in range(3):
    rdd = my_filter(rdd, i)
rdd.collect()
```

... as does unrolling the loop. But your example behaves as if
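A sketch of the unrolled version Sean alludes to; with literal values in each lambda there is no shared loop variable to late-bind:

```
rdd = spark.sparkContext.parallelize([0, 1, 2])
rdd = rdd.filter(lambda x: x != 0)
rdd = rdd.filter(lambda x: x != 1)
rdd = rdd.filter(lambda x: x != 2)
print(rdd.collect())  # []
```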

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Jacek Laskowski
Hi Marco,

A Scala dev here. In short: yet another reason against Python :) Honestly, I've got no idea why the code gives this output. I ran it with 3.1.1-rc1 and got the very same results. Hoping pyspark/python devs will chime in and shed more light on this.

Pozdrawiam,
Jacek Laskowski
https:

RDD filter in for loop gave strange results

2021-01-20 Thread Marco Wong
Dear Spark users,

I ran the Python code below on a simple RDD, but it gave strange results. The filtered RDD contains elements which had already been filtered away earlier. Any idea why this happened?

```
rdd = spark.sparkContext.parallelize([0,1,2])
for i in range(3):
    print("RDD is ", rdd.co
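The code above is cut off by the archive. A minimal sketch reconstructing a loop that reproduces the behavior described in the thread (the final print and the filter line are assumptions; spark is assumed to be an existing SparkSession):

```
rdd = spark.sparkContext.parallelize([0, 1, 2])
for i in range(3):
    print("RDD is ", rdd.collect())
    # The lambda closes over the variable i itself, not the value
    # i had on this pass of the loop.
    rdd = rdd.filter(lambda x: x != i)
print("Final RDD is ", rdd.collect())  # [0, 1]: all chained filters saw i == 2
```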