zhangfengcdt commented on issue #1768: URL: https://github.com/apache/sedona/issues/1768#issuecomment-2616828343
> Sure, here is the link to Gdrive with anonymised data: [link](https://drive.google.com/file/d/1DoEV3RPZGELMBgaMjL5W0ZZTHpVp7_-Q/view?usp=sharing). There are no null or empty geometries. Furthermore, I played with the code a bit and saw that after performing `coalesce(1)` this problem does not occur. However, it does not seem to be an optimal solution.
>
> Here is the code snippet:
>
> ```python
> from pyspark.sql import functions as f
> from pyspark.sql.functions import broadcast
> from sedona.register.geo_registrator import SedonaRegistrator
>
> SedonaRegistrator.registerAll(spark)
> path_to_data_directory = "/FileStore/maciej_filanowicz/whs_data/troublesome_data/shared_files_with_apache"
>
> df_demographics_score = broadcast(spark.read.format('geoparquet').load(f'{path_to_data_directory}/score')).alias('score')
> df_demographics_reference = spark.read.format('geoparquet').load(f'{path_to_data_directory}/reference').alias('reference')
>
> join_condition = f.expr("ST_KNN(score.geometry, reference.geometry, 1, True)")
> df_joined = df_demographics_score.join(df_demographics_reference, on=join_condition).cache()
> assert df_joined.count() == df_demographics_score.count(), "Some rows are missing!"
> ```

Thanks @mfilan! I will take a look soon and let you know my findings. In the meantime, would you help by quickly printing the query plans both with and without the `cache()` call, e.g. via `df_joined.explain()`? Thanks!
