li-boxuan opened a new pull request, #50949:
URL: https://github.com/apache/spark/pull/50949

   ### What changes were proposed in this pull request?
   
   This PR proposed to make equality checks stricter in BatchScanExec, 
ContinuousScanExec, and MicroBatchScanExec, by making two objects non-equal if 
they are of different concrete classes.
   
   ### Why are the changes needed?
   
   
https://github.com/apache/spark/commit/e97ab1d9807134bb557ae73920af61e8534b2b08#diff-f82edfef27867e1285af13f3603efbc5e77d81d715d427db4b51f0c3e3a0df14R35-R38
 introduced `equals` functions to a few v2 data source operators, while none of 
the other operators has `equals` override.
   
   This means equivalence checks of BatchScanExec, ContinuousScanExec, and 
MicroBatchScanExec are much looser than all other operators. It doesn't seem to 
be intentional; it looks like an overlook to me - different operators should 
follow the same set of basic contracts if possible, if not, they shall not be 
too different from each other. Notably, the original author also left a TODO to 
"unify" them.
   
   Now we live in a world where most operators have strictest equivalence 
checks, while a few operators have loose equivalence checks. What could go 
wrong? Well, since Spark is extensible, it is possible to inherit Spark's 
operators with modified runtime implementation while delivering same results. 
In fact, that's what https://github.com/apache/incubator-gluten project does, 
whereas (most) Spark operators are inherited by Gluten operators. Given the 
loose equivalence checks of Spark operators, we could end up declaring 
equivalence between a Spark operator and a Gluten operator.
   
   If Spark starts with a clear contract that operators are "equal" as long as 
they deliver same results, it would be probably fine. Now we live in a world 
where most operators don't do this except for the 3 operators I mentioned 
above. This is very easy to miss, and has caused unexpected behavior/bugs in 
downstream applications.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, unless "user" means downstream applications that integrates with Spark 
(like Gluten).
   
   
   ### How was this patch tested?
   
   No. Unittests, if any, would be to test that two instances of subclasses of 
`BatchScanExec` (let's say `BatchScanExec` and `MockBatchScanExec`) are not 
equivalent, but that is kinda obvious just from the code itself.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to