zhangfengcdt commented on issue #1768: URL: https://github.com/apache/sedona/issues/1768#issuecomment-2616828343
> Sure, here is the link to Gdrive with anonymised data: [link](https://drive.google.com/file/d/1DoEV3RPZGELMBgaMjL5W0ZZTHpVp7_-Q/view?usp=sharing). There are no null or empty geometries. Furthermore, I played with the code a bit and saw that after performing `coalesce(1)` this problem does not occur. However, it does not seem to be an optimal solution.
>
> Here is the code snippet:
>
> ```python
> from pyspark.sql import functions as f
> from pyspark.sql.functions import broadcast
> from sedona.register.geo_registrator import SedonaRegistrator
>
> SedonaRegistrator.registerAll(spark)
> path_to_data_directory = "/FileStore/maciej_filanowicz/whs_data/troublesome_data/shared_files_with_apache"
>
> df_demographics_score = broadcast(spark.read.format('geoparquet').load(f'{path_to_data_directory}/score')).alias('score')
> df_demographics_reference = spark.read.format('geoparquet').load(f'{path_to_data_directory}/reference').alias('reference')
>
> join_condition = f.expr("ST_KNN(score.geometry, reference.geometry, 1, True)")
> df_joined = df_demographics_score.join(df_demographics_reference, on=join_condition).cache()
> assert df_joined.count() == df_demographics_score.count(), "Some rows are missing!"
> ```

Thanks @mfilan! I will take a look soon and let you know my findings. In the meantime, would you help by quickly printing the query plans both with and without the `cache()` call, e.g. via `df_joined.explain()`? Thanks!
