zhangfengcdt commented on issue #1768:
URL: https://github.com/apache/sedona/issues/1768#issuecomment-2634719231
@mfilan I tried to reproduce the issue, and it looks like the code you are
using needs some changes because the query and object sides are mixed up.
```
from pyspark.sql import functions as f

# Assumes `sedona` (the Sedona session) and `path_to_data_directory` are already defined.
df_demographics_score = (
    sedona.read.format('geoparquet')
    .load(f'{path_to_data_directory}/score')
    .alias('score')
)
df_demographics_reference = (
    sedona.read.format('geoparquet')
    .load(f'{path_to_data_directory}/reference')
    .alias('reference')
)
print(f"score: {df_demographics_score.count()}")
print(f"reference: {df_demographics_reference.count()}")

# The first ST_KNN argument is the query (left) side; the second is the object side.
join_condition = f.expr("ST_KNN(reference.GEOMETRY, score.GEOMETRY, 1, FALSE)")
df_joined = df_demographics_reference.join(df_demographics_score, on=join_condition)
df_joined.explain()
print(f"join: {df_joined.count()}")
```
Note that:
1. You don't need to explicitly call broadcast on the loaded dataframe; KNN
will automatically use BroadcastQuerySideKNNJoin if the query side is small.
2. The expression needs to switch the first two parameters to
(reference.GEOMETRY, score.GEOMETRY), since the KNN algorithm treats the first
parameter as the left (query) side.
3. The join statement also needs to switch to use df_demographics_reference
as the query side.
4. If you intend to use df_demographics_score as the query side, make sure
the logic is switched everywhere (both the ST_KNN arguments and the join order).
5. The resulting joined dataframe will have the same count as the query (left)
side dataframe, since k = 1 here. In your case, that is df_demographics_reference.
Here is my notebook screenshot:
<img width="1160" alt="Image"
src="https://github.com/user-attachments/assets/9c5700c8-a9e2-4874-a1b7-7a2349e469e2"
/>
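
To make note 5 concrete, here is a minimal brute-force sketch in plain Python. It is not how Sedona implements the join; `knn_join` and the sample points are made up for illustration. It only shows why the output row count follows the query (left) side: each query row yields exactly k matches.

```python
import math


def knn_join(query_points, object_points, k=1):
    """Brute-force KNN join: pair each query point with its k nearest object points."""
    pairs = []
    for q in query_points:
        # Sort the object side by Euclidean distance to this query point.
        nearest = sorted(object_points, key=lambda o: math.dist(q, o))[:k]
        pairs.extend((q, o) for o in nearest)
    return pairs


# Hypothetical stand-ins for the two dataframes.
reference = [(0.0, 0.0), (5.0, 5.0)]
score = [(0.1, 0.0), (4.0, 4.0), (9.0, 9.0)]

ref_as_query = knn_join(reference, score, k=1)    # reference is the query side
score_as_query = knn_join(score, reference, k=1)  # score is the query side

print(len(ref_as_query))    # 2, one row per reference point
print(len(score_as_query))  # 3, one row per score point
```

Swapping the two sides changes the result size, which is why the ST_KNN arguments and the join order have to agree on which dataframe is the query side.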