I think that is actually the expected result. Since an RDD is evaluated lazily, the lambda is not applied until rdd.collect() runs; by that point i has been updated to its latest value, so only that one value gets filtered out.
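For illustration only, here is a minimal sketch (assuming the same `spark` session used in the thread; this is not code from the original messages) of one common workaround: bind `i` with a default argument so each lambda keeps the value `i` had when the filter was created, rather than the value `i` has when `collect()` finally runs.

```
# Hypothetical sketch: freeze the loop variable with a default argument.
rdd = spark.sparkContext.parallelize([0, 1, 2])
for i in range(3):
    # i=i captures the current value of i at definition time,
    # so the filter no longer depends on the global i at collect() time.
    rdd = rdd.filter(lambda x, i=i: x != i)
print(rdd.collect())  # expected: [] -- each value removed by its own filter
```

Sean's `my_filter(data, i)` helper in the quoted message below works for the same reason: passing `i` as a function argument binds it early.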
Regards,

Jingnan

On Wed, Jan 20, 2021 at 9:01 AM Sean Owen <sro...@gmail.com> wrote:

> That looks very odd indeed. Things like this work as expected:
>
> rdd = spark.sparkContext.parallelize([0,1,2])
>
> def my_filter(data, i):
>     return data.filter(lambda x: x != i)
>
> for i in range(3):
>     rdd = my_filter(rdd, i)
> rdd.collect()
>
> ... as does unrolling the loop.
>
> But your example behaves as if only the final filter is applied. Is this
> some really obscure Python scoping thing with lambdas that I don't
> understand, like the lambda only binds i once? But then you'd expect it to
> only filter the first number.
>
> I also keep looking in the code to figure out if these are somehow being
> erroneously 'collapsed' as the same function, but the RDD APIs don't do
> that kind of thing. They get put into a chain of pipeline_funcs, but that
> still shouldn't be an issue. I wonder if this is some strange interaction
> with serialization of the lambda and/or scoping?
>
> Really strange! Python people?
>
> On Wed, Jan 20, 2021 at 7:14 AM Marco Wong <mck...@gmail.com> wrote:
>
>> Dear Spark users,
>>
>> I ran the Python code below on a simple RDD, but it gave strange results.
>> The filtered RDD contains non-existent elements which were filtered away
>> earlier. Any idea why this happened?
>>
>> ```
>> rdd = spark.sparkContext.parallelize([0,1,2])
>> for i in range(3):
>>     print("RDD is ", rdd.collect())
>>     print("Filtered RDD is ", rdd.filter(lambda x: x != i).collect())
>>     rdd = rdd.filter(lambda x: x != i)
>>     print("Result is ", rdd.collect())
>>     print()
>> ```
>>
>> which gave
>>
>> ```
>> RDD is [0, 1, 2]
>> Filtered RDD is [1, 2]
>> Result is [1, 2]
>>
>> RDD is [1, 2]
>> Filtered RDD is [0, 2]
>> Result is [0, 2]
>>
>> RDD is [0, 2]
>> Filtered RDD is [0, 1]
>> Result is [0, 1]
>> ```
>>
>> Thanks,
>>
>> Marco
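For reference, a short plain-Python sketch (no Spark involved, and not from the thread) of the late-binding behaviour Sean asks about: a lambda closes over the variable `i` itself, not the value `i` had when the lambda was defined.

```
# All three lambdas share the same i; by the time they are called, i == 2.
funcs = [lambda x: x != i for i in range(3)]
print([f(2) for f in funcs])  # [False, False, False]
print([f(0) for f in funcs])  # [True, True, True]
```

Combined with lazy evaluation, this is why only the final filter appears to be applied once `collect()` runs.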