I think that is actually the expected result. Since an RDD is evaluated lazily, the lambda is not applied until rdd.collect() runs; by that point i has been updated to its latest value, so only that one value gets filtered out.
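For illustration only, here is a minimal sketch (assuming the same `spark` session used in the thread; this is not code from the original messages) of one common workaround: bind `i` with a default argument so each lambda keeps the value `i` had when the filter was created, rather than the value `i` has when `collect()` finally runs.

```
# Hypothetical sketch: freeze the loop variable with a default argument.
rdd = spark.sparkContext.parallelize([0, 1, 2])
for i in range(3):
    # i=i captures the current value of i at definition time,
    # so the filter no longer depends on the global i at collect() time.
    rdd = rdd.filter(lambda x, i=i: x != i)
print(rdd.collect())  # expected: [] -- each value removed by its own filter
```

Sean's `my_filter(data, i)` helper in the quoted message below works for the same reason: passing `i` as a function argument binds it early.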
Regards,

Jingnan

On Wed, Jan 20, 2021 at 9:01 AM Sean Owen <sro...@gmail.com> wrote:

> That looks very odd indeed. Things like this work as expected:
>
> rdd = spark.sparkContext.parallelize([0,1,2])
>
> def my_filter(data, i):
>     return data.filter(lambda x: x != i)
>
> for i in range(3):
>     rdd = my_filter(rdd, i)
> rdd.collect()
>
> ... as does unrolling the loop.
>
> But your example behaves as if only the final filter is applied. Is this
> some really obscure Python scoping thing with lambdas that I don't
> understand, like the lambda only binds i once? But then you'd expect it to
> only filter the first number.
>
> I also keep looking in the code to figure out if these are somehow being
> erroneously 'collapsed' as the same function, but the RDD APIs don't do
> that kind of thing. They get put into a chain of pipeline_funcs, but that
> still shouldn't be an issue. I wonder if this is some strange interaction
> with serialization of the lambda and/or scoping?
>
> Really strange! Python people?
>
> On Wed, Jan 20, 2021 at 7:14 AM Marco Wong <mck...@gmail.com> wrote:
>
>> Dear Spark users,
>>
>> I ran the Python code below on a simple RDD, but it gave strange results.
>> The filtered RDD contains non-existent elements which were filtered away
>> earlier. Any idea why this happened?
>>
>> ```
>> rdd = spark.sparkContext.parallelize([0,1,2])
>> for i in range(3):
>>     print("RDD is ", rdd.collect())
>>     print("Filtered RDD is ", rdd.filter(lambda x: x != i).collect())
>>     rdd = rdd.filter(lambda x: x != i)
>>     print("Result is ", rdd.collect())
>>     print()
>> ```
>>
>> which gave
>>
>> ```
>> RDD is [0, 1, 2]
>> Filtered RDD is [1, 2]
>> Result is [1, 2]
>>
>> RDD is [1, 2]
>> Filtered RDD is [0, 2]
>> Result is [0, 2]
>>
>> RDD is [0, 2]
>> Filtered RDD is [0, 1]
>> Result is [0, 1]
>> ```
>>
>> Thanks,
>>
>> Marco
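For reference, a short plain-Python sketch (no Spark involved, and not from the thread) of the late-binding behaviour Sean asks about: a lambda closes over the variable `i` itself, not the value `i` had when the lambda was defined.

```
# All three lambdas share the same i; by the time they are called, i == 2.
funcs = [lambda x: x != i for i in range(3)]
print([f(2) for f in funcs])  # [False, False, False]
print([f(0) for f in funcs])  # [True, True, True]
```

Combined with lazy evaluation, this is why only the final filter appears to be applied once `collect()` runs.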