Heh, that could make sense, but that definitely was not my mental model of
how Python binds variables! It's definitely not how Scala works.
On Wed, Jan 20, 2021 at 10:00 AM Marco Wong wrote:
Hmm, I think I got what Jingnan means. The lambda function is x != i, and i
is not evaluated when the lambda function is defined. So the pipelined rdd
is rdd.filter(lambda x: x != i).filter(lambda x: x != i), rather than
having the values of i substituted. Does that make sense to you, Sean?
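(A minimal, Spark-free sketch of the late binding described above; the list name filters is made up for the illustration:)
```
# Each lambda captures the *name* i, not its value at definition time.
filters = []
for i in range(3):
    filters.append(lambda x: x != i)

# After the loop i == 2, so every predicate compares against 2.
print([f(2) for f in filters])  # [False, False, False]
print([f(0) for f in filters])  # [True, True, True]
```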
No, because the final rdd is really the result of chaining 3 filter
operations. They should all execute. It _should_ work like
"rdd.filter(...).filter(...).filter(...)"
On Wed, Jan 20, 2021 at 9:46 AM Zhu Jingnan wrote:
I thought that was the right result.
Since the RDD is evaluated lazily, every time rdd.collect() is executed, i
has been updated to its latest value, so only one element gets filtered out.
Regards
Jingnan
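(A common Python workaround, not spelled out in this thread: bind the current value of i at definition time by making it a default argument of the lambda. A sketch against the same rdd as in the original question:)
```
rdd = spark.sparkContext.parallelize([0, 1, 2])
for i in range(3):
    # i=i freezes the loop variable's current value into this lambda.
    rdd = rdd.filter(lambda x, i=i: x != i)
# Each filter now keeps its own i, so 0, 1 and 2 are all removed.
print(rdd.collect())  # expected: []
```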
On Wed, Jan 20, 2021 at 9:01 AM Sean Owen wrote:
A. Global scope and global variables are bad habits in Python (this is
about the 'rdd' and 'i' variables used in the lambda).
B. Lambdas are usually misused and abused in Python, especially when they
are used in a global context: ideally you'd use pure functions, with
something like:
```
def my_rdd_f
```
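(The message above is cut off in the archive. A plausible completion, assuming the author meant a pure, named filter function; the name my_rdd_filter and its body are guesses:)
```
# Hypothetical completion: a pure function that takes everything it needs
# as parameters instead of closing over globals.
def my_rdd_filter(rdd, i):
    return rdd.filter(lambda x: x != i)
```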
That looks very odd indeed. Things like this work as expected:
rdd = spark.sparkContext.parallelize([0,1,2])
def my_filter(data, i):
    return data.filter(lambda x: x != i)
for i in range(3):
    rdd = my_filter(rdd, i)
rdd.collect()
... as does unrolling the loop.
But your example behaves as if only the final value of i is applied by all
of the filters.
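(An illustrative, Spark-free sketch of why the wrapped version behaves differently: i becomes a parameter of the helper, so each returned predicate closes over that call's own binding rather than the shared loop variable. The helper name make_pred is made up for the example.)
```
def make_pred(i):
    # i is a local parameter; each call creates a fresh binding,
    # so the returned lambda remembers the value it was called with.
    return lambda x: x != i

preds = [make_pred(i) for i in range(3)]
print([p(0) for p in preds])  # [False, True, True]
```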
Hi Marco,
A Scala dev here.
In short: yet another reason against Python :)
Honestly, I've got no idea why the code gives this output. Ran it with
3.1.1-rc1 and got the very same results. Hoping the pyspark/python devs will
chime in and shed more light on this.
Regards,
Jacek Laskowski
https:
Dear Spark users,
I ran the Python code below on a simple RDD, but it gave strange results.
The filtered RDD contains elements that had been filtered away
earlier. Any idea why this happened?
```
rdd = spark.sparkContext.parallelize([0,1,2])
for i in range(3):
print("RDD is ", rdd.co