Personal Information

  * Name: Siddharth Shehria
  * GitHub ID: sidshehria <https://github.com/sidshehria>
  * Email: sidsheh...@gmail.com
  * LinkedIn: linkedin.com/in/sidshehria <https://linkedin.com/in/sidshehria>
  * Time Zone & Available Hours Per Week: IST (Indian Standard Time),
    25-30 hours per week

Project Proposal
Title
Improving Python Bindings in Apache DataFusion
Synopsis
Apache DataFusion offers Python bindings that let users build data systems
in Python. However, these bindings are relatively low-level and do not expose
the full range of end-user-focused APIs that libraries such as Pandas and
Polars provide. This project aims to enhance DataFusion’s Python bindings with
high-level abstractions and broader API coverage, improving usability and
performance and making DataFusion more accessible to the broader data science
and analytics community.
Benefits to the Community

  * Improves the Python API usability of Apache DataFusion, making it more
    accessible for data engineers and analysts.
  * Bridges the gap between low-level bindings and the high-level usability
    found in Pandas and Polars.
  * Expands DataFusion's reach by making it easier to integrate with data
    science workflows.
  * Enhances performance by optimizing APIs and query execution, making
    DataFusion a competitive choice for analytics applications.
  * Aligns DataFusion with modern data processing libraries, encouraging
    adoption within the open-source and industry ecosystem.

Deliverables & Milestones
Timeline                            | Deliverable
------------------------------------+----------------------------------------
Community Bonding (May-June)        | Engage with mentors, understand the
                                    | existing Python bindings, and finalize
                                    | the project roadmap.
Phase 1 (June-July)                 | Implement missing high-level APIs,
                                    | improve type annotations, and ensure
                                    | feature parity with Pandas and Polars
                                    | where applicable.
Phase 2 (July-August)               | Optimize performance, improve
                                    | documentation, and write comprehensive
                                    | unit tests.
Final Evaluation (August-September) | Deliver production-ready bindings,
                                    | complete tutorials, and submit final
                                    | reports.
Technical Details

  * Programming Languages: Python, Rust
  * Libraries & Tools: DataFusion, PyO3, Pandas, Polars
  * Key Focus Areas:
     * Exposing additional APIs for data manipulation and transformation.
     * Improving dataframe interoperability with Pandas and Polars.
     * Optimizing the FFI (Foreign Function Interface) layer for better
       performance.
     * Enhancing documentation and examples for Python users.
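
As a rough illustration of the first focus area, the sketch below shows how a
thin high-level layer could sit on top of a more verbose low-level binding.
It is plain Python and does not use or describe DataFusion's actual API:
LowLevelFrame, Frame, and all of their methods are hypothetical names chosen
only for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple, Union

Row = Dict[str, Any]


@dataclass(frozen=True)
class LowLevelFrame:
    """Hypothetical stand-in for a low-level binding (not DataFusion's API)."""

    rows: Tuple[Row, ...]

    def filter_rows(self, predicate: Callable[[Row], bool]) -> "LowLevelFrame":
        # Keep only the rows for which the predicate returns True.
        return LowLevelFrame(tuple(r for r in self.rows if predicate(r)))

    def project(self, columns: List[str]) -> "LowLevelFrame":
        # Keep only the requested columns, in the requested order.
        return LowLevelFrame(tuple({c: r[c] for c in columns} for r in self.rows))


class Frame:
    """Hypothetical high-level wrapper with pandas-style conveniences."""

    def __init__(self, inner: LowLevelFrame) -> None:
        self._inner = inner

    @classmethod
    def from_pydict(cls, data: Dict[str, List[Any]]) -> "Frame":
        # Convert column-oriented input into the row tuples the low level expects.
        names = list(data)
        rows = tuple(dict(zip(names, values)) for values in zip(*data.values()))
        return cls(LowLevelFrame(rows))

    def __getitem__(self, columns: Union[str, List[str]]) -> "Frame":
        # frame["a"] and frame[["a", "b"]] both map onto a projection.
        if isinstance(columns, str):
            columns = [columns]
        return Frame(self._inner.project(list(columns)))

    def where(self, **equals: Any) -> "Frame":
        # Keyword arguments become equality filters, e.g. where(city="Pune").
        def matches(row: Row) -> bool:
            return all(row[name] == value for name, value in equals.items())

        return Frame(self._inner.filter_rows(matches))

    def to_pydict(self) -> Dict[str, List[Any]]:
        # Convert back to column-oriented output; an empty frame yields {}.
        rows = self._inner.rows
        names = list(rows[0]) if rows else []
        return {name: [row[name] for row in rows] for name in names}


if __name__ == "__main__":
    sales = Frame.from_pydict(
        {"city": ["Pune", "Delhi", "Pune"], "amount": [10, 20, 30]}
    )
    print(sales.where(city="Pune")[["amount"]].to_pydict())
```

The wrapper only translates familiar calls into the lower-level operations,
so the underlying engine stays unchanged. The real design would be worked out
with the mentors during the community bonding period.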

Related Work & References

  * Apache DataFusion <https://arrow.apache.org/datafusion/>
  * PyO3 - Rust bindings for Python <https://pyo3.rs/>
  * Pandas API Reference <https://pandas.pydata.org/docs/reference/>
  * Polars API Reference <https://pola.rs/docs/reference/>

Personal Experience
Relevant Skills & Background

  * Languages: Python, Rust, SQL, JavaScript, C++
  * Data Analysis: Pandas, NumPy, Scikit-Learn, Power BI, Tableau
  * Backend Development: FastAPI, Flask, Node.js, PostgreSQL
  * Cloud & DevOps: Docker, AWS, Google Cloud Platform
  * Open Source Experience:
     * Contributed to TwitterOSS by optimizing key data processing metrics.
     * Developed REST APIs and data pipelines for Unified Mentor.
     * Built data analysis dashboards at Vizipa using Power BI and SQL.

Past Open-Source Contributions

  * TwitterOSS: Implemented Python-based data pipelines, reducing execution
    time by 30%.
    Link: https://github.com/twitter/cloudhopper-commons/pull/42
Learning Plan

  * Deepen understanding of DataFusion’s Rust internals.
  * Study PyO3 for improving Python-Rust interoperability.
  * Collaborate with the community to identify pain points and improvements.
  * Regularly test implementations against real-world datasets.

Mentor & Communication

  * Preferred Communication Channels: Slack, GitHub Discussions, Email
  * Weekly Progress Updates Plan:
     * Submit weekly reports on GitHub.
     * Engage in mentor check-ins for feedback and improvements.
     * Share learnings and challenges with the open-source community.

Additional Information
I am deeply passionate about data engineering and analytics, and I believe this 
project will allow me to contribute meaningfully to Apache DataFusion while 
honing my expertise in Python-Rust interoperability. My previous experience in 
API development, data pipelines, and open-source contributions equips me to 
tackle this project successfully. I look forward to working with the community 
to make DataFusion’s Python bindings more powerful and user-friendly!

