sahnib commented on code in PR #44323:
URL: https://github.com/apache/spark/pull/44323#discussion_r1600366874
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:
##########
@@ -219,10 +222,41 @@ object StreamingSymmetricHashJoinHelper extends Logging {
attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
condition,
eventTimeWatermarkForEviction)
- val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
- val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
- expr.map(JoinStateValueWatermarkPredicate.apply _)
+ // If the condition itself is empty (for example, left_time < left_time + INTERVAL ...),
+ // then we will not have generated a stateValueWatermark.
+ if (stateValueWatermark.isEmpty) {
+ None
+ } else {
+ // For example, if the condition is of the form:
+ // left_time > right_time + INTERVAL 30 MINUTES
+ // Then this extracts left_time and right_time.
+ val attributesInCondition = AttributeSet(
+ condition.get.collect { case a: AttributeReference => a }
+ )
+
+ // Construct an AttributeSet so that we can perform equality between attributes,
+ // which we do in the filter below.
+ val oneSideInputAttributeSet = AttributeSet(oneSideInputAttributes)
+
+ // oneSideInputAttributes could be [left_value, left_time], and we just
+ // want the attribute _in_ the time-interval condition.
+ val oneSideStateWatermarkAttributes = attributesInCondition.filter { a =>
+ oneSideInputAttributeSet.contains(a)
+ }
+
+ // There should be a single attribute per side in the time-interval condition, so,
+ // filtering for oneSideInputAttributes as done above should lead us with 1 attribute.
+ if (oneSideStateWatermarkAttributes.size == 1) {
+ val expr =
Review Comment:
As discussed offline, this assumption does not seem to be correct. We
actually need to find the partial join condition in which the other side has
an event-time attribute, and use that attribute to calculate the watermark
predicate.
As an aside, it might be beneficial to combine this function with
`getStateWatermark`, since the two have similar logic.
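To illustrate one possible reading of the suggestion above, here is a minimal
sketch. It uses a toy expression model, not Catalyst: `Expr`, `Attr`,
`isEventTime`, `splitConjuncts`, and `stateWatermarkAttr` are all hypothetical
names invented for this example, and `isEventTime` only stands in for the
`metadata.contains(delayKey)` check. The idea is to split the join condition
into its conjuncts, keep only the conjuncts that reference an event-time
attribute from the other side, and take this side's attribute from those.

```scala
// Toy stand-in for Catalyst expressions. None of these types are Spark's.
sealed trait Expr { def refs: Set[Attr] }

// isEventTime is an assumption that mimics "metadata.contains(delayKey)".
final case class Attr(name: String, isEventTime: Boolean = false) extends Expr {
  def refs: Set[Attr] = Set(this)
}
final case class Lit(value: Long) extends Expr {
  def refs: Set[Attr] = Set.empty
}
final case class Plus(left: Expr, right: Expr) extends Expr {
  def refs: Set[Attr] = left.refs ++ right.refs
}
final case class Gt(left: Expr, right: Expr) extends Expr {
  def refs: Set[Attr] = left.refs ++ right.refs
}
final case class EqualTo(left: Expr, right: Expr) extends Expr {
  def refs: Set[Attr] = left.refs ++ right.refs
}
final case class And(left: Expr, right: Expr) extends Expr {
  def refs: Set[Attr] = left.refs ++ right.refs
}

object WatermarkAttrSketch {
  // Flatten nested ANDs into the list of partial join conditions.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // Look only at conjuncts that mention an event-time attribute of the
  // other side, and return this side's attribute if exactly one is found.
  def stateWatermarkAttr(
      condition: Expr,
      oneSide: Set[Attr],
      otherSide: Set[Attr]): Option[Attr] = {
    val candidates: Set[Attr] = splitConjuncts(condition)
      .filter(_.refs.exists(a => a.isEventTime && otherSide.contains(a)))
      .flatMap(_.refs.filter(oneSide.contains))
      .toSet
    if (candidates.size == 1) candidates.headOption else None
  }

  def main(args: Array[String]): Unit = {
    val leftId    = Attr("left_id")
    val leftTime  = Attr("left_time", isEventTime = true)
    val rightId   = Attr("right_id")
    val rightTime = Attr("right_time", isEventTime = true)

    // left_id = right_id AND left_time > right_time + 30
    val condition = And(
      EqualTo(leftId, rightId),
      Gt(leftTime, Plus(rightTime, Lit(30L))))

    // Collecting over the whole condition yields two left-side attributes,
    // so a "size == 1" check on that set would not match.
    println(condition.refs.filter(Set(leftId, leftTime).contains).size)

    // Restricting to the conjunct with the other side's event time does.
    println(stateWatermarkAttr(
      condition, Set(leftId, leftTime), Set(rightId, rightTime)))
  }
}
```

In this toy model, the equality conjunct on `id` is ignored because it does
not reference the other side's event-time attribute, so only the
time-interval conjunct contributes a candidate.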
##########
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala:
##########
@@ -257,6 +257,75 @@ class StreamingInnerJoinSuite extends StreamingJoinSuite {
)
}
+
Review Comment:
Let's also add a test case for a join condition that compares eventTime with
some other attribute (for example, `id`).
See https://github.com/apache/spark/pull/44323/files#r1582300122 for
context.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]