Re: [I] ST_KNN results in missing rows [sedona]

via GitHub Tue, 04 Feb 2025 10:14:56 -0800


zhangfengcdt commented on issue #1768:
URL: https://github.com/apache/sedona/issues/1768#issuecomment-2634719231


   @mfilan I tried to reproduce the issue, and it looks like your code needs
   some changes because the query and object sides are mixed up.
   
   ```
   from pyspark.sql import functions as f

   # `sedona` (the Sedona session) and `path_to_data_directory` are assumed to be defined already.
   df_demographics_score = sedona.read.format('geoparquet').load(f'{path_to_data_directory}/score').alias('score')
   df_demographics_reference = sedona.read.format('geoparquet').load(f'{path_to_data_directory}/reference').alias('reference')

   print(f"score: {df_demographics_score.count()}")
   print(f"reference: {df_demographics_reference.count()}")

   # The first ST_KNN argument is the query (left) side, so reference comes first.
   join_condition = f.expr("ST_KNN(reference.GEOMETRY, score.GEOMETRY, 1, FALSE)")
   df_joined = df_demographics_reference.join(df_demographics_score, on=join_condition)

   df_joined.explain()
   print(f"join: {df_joined.count()}")
   ```
   
   Note that:
   1. You don't need to call broadcast explicitly on the loaded dataframe; KNN
      automatically uses BroadcastQuerySideKNNJoin if the query side is small.
   2. The expression needs its first two parameters switched to
      (reference.GEOMETRY, score.GEOMETRY), since the KNN algorithm treats the
      first parameter as the left (query) side.
   3. The join statement also needs to switch so that df_demographics_reference
      is the query side.
   4. If you intend to use df_demographics_score as the query side, make sure
      the logic is switched everywhere: both the ST_KNN arguments and the join
      order.
   5. The joined dataframe will have the same row count as the query (left)
      side dataframe, in your case df_demographics_reference.
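
To make points 4 and 5 concrete, here is a minimal pure-Python sketch of what a KNN join does to row counts. It is only a brute-force illustration, not how Sedona implements the join; the `knn_join` helper and the sample points are hypothetical and exist only for this example.

```python
from math import hypot


def knn_join(query, objects, k=1):
    """Brute-force KNN join: pair each query point with its k nearest objects."""
    pairs = []
    for q in query:
        # Sort the object side by Euclidean distance to this query point.
        nearest = sorted(objects, key=lambda o: hypot(o[0] - q[0], o[1] - q[1]))[:k]
        pairs.extend((q, o) for o in nearest)
    return pairs


# Hypothetical sample data standing in for the two dataframes.
reference = [(0.0, 0.0), (5.0, 5.0)]                        # 2 rows
score = [(0.1, 0.0), (0.2, 0.0), (4.9, 5.0), (9.0, 9.0)]    # 4 rows

ref_as_query = knn_join(reference, score, k=1)    # reference is the query side
score_as_query = knn_join(score, reference, k=1)  # score is the query side

print(len(ref_as_query))    # 2, one row per reference point
print(len(score_as_query))  # 4, one row per score point
```

With k = 1 the output has one row per query-side row, so swapping which side is the query changes the result count. That is why the ST_KNN arguments and the join order have to be switched together.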
   
   Here is my notebook screenshot:
   
   <img width="1160" alt="Image" src="https://github.com/user-attachments/assets/9c5700c8-a9e2-4874-a1b7-7a2349e469e2" />
   
   


