I am looking at the physical plan for the following query:

SELECT f1,f2,f3,...
FROM T1
LEFT ANTI JOIN T2 ON T1.id = T2.id
WHERE  f1 = 'bla'
       AND f2 = 'bla2'
       AND some_date >= date_sub(current_date(), 1)
LIMIT 100

An important detail: table T1 can be very large (hundreds of thousands of
rows), but table T2 is rather small, a few thousand rows at most.
In this particular case, T2 has 2 rows.

In the physical plan, I see that a SortMergeJoin is performed, even though
this looks like a perfect candidate for a broadcast join.
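
For context, here is a minimal sketch of how such a plan can be printed from
PySpark. It assumes `spark` is an existing SparkSession and that T1 and T2 are
already registered as tables or temp views. `show_plan` is a hypothetical
helper name, and the column list is shortened to the three names shown in the
query above. It also prints spark.sql.autoBroadcastJoinThreshold, the size
limit below which Spark may choose a broadcast join on its own, since that
setting may be relevant here.

```python
# The query from this post; the column list is shortened to f1, f2, f3.
QUERY = """
SELECT f1, f2, f3
FROM T1
LEFT ANTI JOIN T2 ON T1.id = T2.id
WHERE f1 = 'bla'
  AND f2 = 'bla2'
  AND some_date >= date_sub(current_date(), 1)
LIMIT 100
"""


def show_plan(spark, query=QUERY):
    """Print the auto-broadcast threshold and the physical plan for `query`.

    `spark` is assumed to be an existing SparkSession (hypothetical helper).
    """
    # Size limit below which Spark may broadcast one side of a join
    # automatically; a value of -1 disables automatic broadcasting.
    threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
    print("autoBroadcastJoinThreshold:", threshold)

    df = spark.sql(query)
    # Prints the physical plan, where the join strategy
    # (e.g. SortMergeJoin or BroadcastHashJoin) is visible.
    df.explain()
    return df
```

Calling show_plan(spark) prints the threshold followed by the physical plan.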

What could be the reason for this?
Is there a way to hint the optimizer to perform a broadcast join in the SQL
syntax?

I am writing this in PySpark, and the query itself runs over Parquet files
stored in Azure Blob Storage.
