nullbutt opened a new issue, #2589:
URL: https://github.com/apache/sedona/issues/2589

   Hi, flagging that I've noticed a significant performance drop in DBSCAN 
after upgrading from 1.8.0 to 1.8.1. My jobs now take ~4x as long as they did 
on 1.8.0, and the number of Spark tasks has roughly doubled.
   
   This may be related to the GraphFrames upgrade (0.9.2 to 0.10.0) and its 
changes to connected components:
   https://github.com/graphframes/graphframes/issues/758
   
   ### Perf Comparison
   
   | Metric | Before (GF 0.9.2) | After (GF 0.10.0) |
   |--------|-------------------|-------------------|
   | Duration | 4m 1s | 15m 53s |
   | Spark tasks | 25,310 | 53,188 |
   | Shuffle write | 12.3MB | 20.2MB |
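
As a quick check on the "~4x" and "doubled" figures, the ratios implied by the table can be computed directly. This is only arithmetic on the numbers reported above; the dictionary keys are names chosen for illustration.

```python
# Figures copied from the Perf Comparison table above.
before = {"duration_s": 4 * 60 + 1, "tasks": 25_310, "shuffle_write_mb": 12.3}
after = {"duration_s": 15 * 60 + 53, "tasks": 53_188, "shuffle_write_mb": 20.2}

# Ratio of the "After" value to the "Before" value for each metric.
ratios = {name: after[name] / before[name] for name in before}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

This gives roughly 3.95x for duration, 2.10x for task count, and 1.64x for shuffle write.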
   
   ### What I've Tried
   
   None of these resolved the issue:
   
   1. **Disabling AQE**
   ```python
      spark.conf.set("spark.sql.adaptive.enabled", "false")
   ```
      Result: Duration improved slightly to ~14 min, but the task count increased to ~75K
   
   2. **Setting broadcastThreshold to -1 with AQE enabled** (as recommended in 
the GraphFrames issue)
   ```python
      spark.conf.set("spark.sql.adaptive.enabled", "true")
      spark.conf.set("graphframes.connected.components.broadcastThreshold", "-1")
   ```
      Result: No significant improvement
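
For anyone trying to reproduce the comparison, a minimal timing harness along these lines could be used to apply a set of options and time one run. It is a sketch, not the code behind the numbers above: `time_with_settings` is a hypothetical helper, and the `spark` session and `run_dbscan` callable in the usage comment are assumed to exist in the caller's environment.

```python
import time
from typing import Callable, Dict


def time_with_settings(
    apply_setting: Callable[[str, str], None],
    run_job: Callable[[], object],
    settings: Dict[str, str],
) -> float:
    """Apply each key/value pair, run the job once, and return elapsed seconds."""
    for key, value in settings.items():
        apply_setting(key, value)
    start = time.perf_counter()
    run_job()
    return time.perf_counter() - start


# Example usage (assumes an active SparkSession named `spark` and a
# hypothetical zero-argument `run_dbscan` that triggers the clustering job):
#
#   elapsed = time_with_settings(
#       spark.conf.set,
#       run_dbscan,
#       {"spark.sql.adaptive.enabled": "false"},
#   )
#   print(f"AQE disabled: {elapsed:.1f}s")
```

Wall-clock time from a single run is noisy, so the Spark UI duration and task counts remain the more reliable comparison.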


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

