GitHub user logan-keede edited a discussion: More thorough contribution 
guideline

I am opening this discussion to talk about how to approach refactoring, and 
perhaps changes in general, in a way that makes things easier for downstream 
repos and makes the review process more efficient. 

This came up while discussing my GSoC 2025 proposal for "Optimizing compile 
time and binary size" with @ozankabak. That project is expected to involve a 
large amount of refactoring. 

After some research, I found that almost no open source repository has 
something like a refactoring guideline. That is reasonable: it is generally 
not needed, and a general contribution guideline is enough. However, 
DataFusion is perhaps a bit too refactoring happy/needy.
DataFusion:
![image](https://github.com/user-attachments/assets/158150ed-4683-42db-ba10-7bfcdb0e580d)
A repo with 17 times more commits than DataFusion:
![image](https://github.com/user-attachments/assets/253878bd-e7a7-4845-bcb6-e75717100da1)

Perhaps a direct comparison is not fair, because we do need refactoring. So the 
best we can do is to make it easier for everyone. 

## Proposed Solution 
1. Create a feature branch and do all the major refactoring there. Publish a 
roadmap explaining why the refactoring/change is necessary and what it 
changes. This is perhaps most useful for refactoring epics like #14444.
_Suggested by @ozankabak over Discord._

2. Use `cargo-semver-checks` to detect unintentional API breakages. Even the 
smallest changes can break APIs in ways we cannot predict. 
[Here](https://predr.ag/blog/semver-in-rust-tooling-breakage-and-edge-cases/) 
is an article about this.

3. Add do's and don'ts to the guideline. Start with a tentative version and 
refine it over time.
DataFusion already has a Contribution Guideline, which explains the general 
style in which we handle PRs and issues, but it does not go into great detail 
about what to do and what not to do. While this is not a big problem (if a 
problem at all) for more experienced members of the community, it is still 
good to highlight good and bad practices for newer members. 

This also ensures that we have a DataFusion way of dealing with problems and 
that there are, as far as possible, no unexpected or uninformed API changes 
or breakages. It will also save some reviewing bandwidth, as reviewers will 
not have to explain the same common reasons for rejection again and again.

It would be valuable to collect the community's ideas on this in this 
discussion, along with feedback from downstream maintainers on the kinds of 
DataFusion issues they face that could be avoided through better policy. 


GitHub link: https://github.com/apache/datafusion/discussions/15365

----
This is an automatically sent email for github@datafusion.apache.org.
To unsubscribe, please send an email to: 
github-unsubscr...@datafusion.apache.org

