timsaucer opened a new pull request, #1184:
URL: https://github.com/apache/datafusion-python/pull/1184

   # Which issue does this PR close?
   
   Closes #1173
   
   # Rationale for this change
   
   Currently, when you do a join on a column name that both dataframes share 
(a common `on` column), the output dataframe ends up with two columns that have 
the same, ambiguous name. Users have to work around this by renaming the column 
to join on. This change drops the duplicate column by default, which makes the 
interface more user friendly.
   
   <img width="583" alt="Screenshot 2025-07-07 at 7 25 54 PM" 
src="https://github.com/user-attachments/assets/ced7a1d8-7ee8-4448-a658-2c9872375c8a"
 />
   
   # What changes are included in this PR?
   
   - Adds a `keep_duplicate_keys` option for users who do not want duplicate 
column names dropped
   - By default, adds a select on the join that keeps only the first (left) 
dataframe's column
   - Small change to a unit test to fix an error where the user would get a 
deprecation warning when passing `on` rather than `join_on`
   - Small formatter change to fix very large rendering of narrow dataframes
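   
   The column selection described in the first two bullets can be sketched 
in plain Python. This is only an illustration of the intended behavior, not the 
code in this PR: `output_columns` is a hypothetical helper, and representing 
columns as `(side, name)` pairs is an assumption made for the example.

```python
def output_columns(left_cols, right_cols, on, keep_duplicate_keys=False):
    """Hypothetical helper: list the (side, name) columns a join would expose.

    Assumes `on` names columns that exist in both inputs.
    """
    # Accept a single key name or a list of key names.
    keys = {on} if isinstance(on, str) else set(on)

    # Every left-hand column is always kept.
    selected = [("left", name) for name in left_cols]

    for name in right_cols:
        # Skip the right-hand copy of a join key unless the caller opts out.
        if name in keys and not keep_duplicate_keys:
            continue
        selected.append(("right", name))
    return selected


default_cols = output_columns(["id", "a"], ["id", "b"], on="id")
legacy_cols = output_columns(["id", "a"], ["id", "b"], on="id",
                             keep_duplicate_keys=True)
```

   With the default, only the left `id` survives; with 
`keep_duplicate_keys=True`, both `id` columns are returned, as before this 
change.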
   
   # Are there any user-facing changes?
   
   Yes.
   
   By default, `DataFrame.join()` now returns only a single column for 
duplicate `on` keys. Users can restore the previous behavior by setting 
`keep_duplicate_keys` to `True`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

