Hi Marco,

A Scala dev here.
In short: yet another reason against Python :) Honestly, I've got no idea
why the code gives this output. I ran it with 3.1.1-rc1 and got exactly
the same results. Hoping the PySpark/Python devs will chime in and shed
more light on this.

Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Wed, Jan 20, 2021 at 2:07 PM Marco Wong <mck...@gmail.com> wrote:

> Dear Spark users,
>
> I ran the Python code below on a simple RDD, but it gave strange results.
> The filtered RDD contains elements that had been filtered away earlier.
> Any idea why this happened?
>
> ```
> rdd = spark.sparkContext.parallelize([0, 1, 2])
> for i in range(3):
>     print("RDD is ", rdd.collect())
>     print("Filtered RDD is ", rdd.filter(lambda x: x != i).collect())
>     rdd = rdd.filter(lambda x: x != i)
>     print("Result is ", rdd.collect())
>     print()
> ```
>
> which gave
>
> ```
> RDD is [0, 1, 2]
> Filtered RDD is [1, 2]
> Result is [1, 2]
>
> RDD is [1, 2]
> Filtered RDD is [0, 2]
> Result is [0, 2]
>
> RDD is [0, 2]
> Filtered RDD is [0, 1]
> Result is [0, 1]
> ```
>
> Thanks,
>
> Marco
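
P.S. A guess at what might be going on, since the symptom looks like
Python's late binding of closures: each `lambda x: x != i` closes over the
variable `i`, not its value, and because RDD transformations are lazy, the
lambdas are only evaluated when `collect()` triggers a job, by which time
`i` may have moved on. I'm not certain this is exactly how PySpark
serializes the closures, so treat the sketch below as an assumption. It
shows the late-binding effect in plain Python, with the usual workaround of
binding `i` as a default argument:

```
# Late binding: all three lambdas share the same variable `i`,
# so after the loop they all see its final value, 2.
fns = [lambda x: x != i for i in range(3)]
print([f(2) for f in fns])   # [False, False, False]

# Workaround: bind the *current* value of `i` as a default argument,
# which is evaluated once, when each lambda is created.
fns = [lambda x, i=i: x != i for i in range(3)]
print([f(2) for f in fns])   # [True, True, False]
```

Applied to the original loop (assuming `spark` is a live SparkSession), the
same trick should make each filter keep the value of `i` from its own
iteration rather than whatever `i` happens to be at collect time:

```
rdd = spark.sparkContext.parallelize([0, 1, 2])
for i in range(3):
    # i=i freezes the current loop value inside this lambda
    rdd = rdd.filter(lambda x, i=i: x != i)
print(rdd.collect())  # expected: [] -- each filter removes a different element
```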