Personal Information

* Name: Siddharth Shehria
* GitHub ID: sidshehria <https://github.com/sidshehria>
* Email: sidsheh...@gmail.com
* LinkedIn: linkedin.com/in/sidshehria <https://linkedin.com/in/sidshehria>
* Time Zone & Available Hours Per Week: IST (Indian Standard Time), 25-30 hours per week

Project Proposal

Title

Improving Python Bindings in Apache DataFusion

Synopsis

Apache DataFusion offers Python bindings that enable users to build data systems using Python. However, these bindings are relatively low-level and do not expose the full range of end-user-focused APIs that libraries like Pandas and Polars provide. This project aims to enhance DataFusion’s Python bindings by adding high-level abstractions and better API support to improve usability and performance, making DataFusion more accessible to the broader data science and analytics community.

Benefits to the Community

* Improves the Python API usability of Apache DataFusion, making it more accessible for data engineers and analysts.
* Bridges the gap between low-level bindings and the high-level usability found in Pandas and Polars.
* Expands DataFusion's reach by making it easier to integrate with data science workflows.
* Enhances performance by optimizing APIs and query execution, making DataFusion a competitive choice for analytics applications.
* Aligns DataFusion with modern data processing libraries, encouraging adoption within the open-source and industry ecosystem.

Deliverables & Milestones

Timeline | Deliverable
Community Bonding (May-June) | Engage with mentors, understand the existing Python bindings, and finalize the project roadmap.
Phase 1 (June-July) | Implement missing high-level APIs, improve type annotations, and ensure feature parity with Pandas and Polars where applicable.
Phase 2 (July-August) | Optimize performance, improve documentation, and write comprehensive unit tests.
Final Evaluation (August-September) | Deliver production-ready bindings, complete tutorials, and submit final reports.

Technical Details

* Programming Languages: Python, Rust
* Libraries & Tools: DataFusion, PyO3, Pandas, Polars
* Key Focus Areas:
  * Exposing additional APIs for data manipulation and transformation.
  * Improving dataframe interoperability with Pandas and Polars.
  * Optimizing the FFI (Foreign Function Interface) layer for better performance.
  * Enhancing documentation and examples for Python users.

Related Work & References

* Apache DataFusion <https://arrow.apache.org/datafusion/>
* PyO3 - Rust bindings for Python <https://pyo3.rs/>
* Pandas API Reference <https://pandas.pydata.org/docs/reference/>
* Polars API Reference <https://pola.rs/docs/reference/>

Personal Experience

Relevant Skills & Background

* Languages: Python, Rust, SQL, JavaScript, C++
* Data Analysis: Pandas, NumPy, Scikit-Learn, Power BI, Tableau
* Backend Development: FastAPI, Flask, Node.js, PostgreSQL
* Cloud & DevOps: Docker, AWS, Google Cloud Platform
* Open Source Experience:
  * Contributed to TwitterOSS by optimizing key data processing metrics.
  * Developed REST APIs and data pipelines for Unified Mentor.
  * Built data analysis dashboards at Vizipa using Power BI and SQL.

Past Open-Source Contributions

* TwitterOSS: Implemented Python-based data pipelines, reducing execution time by 30%. Link: https://github.com/twitter/cloudhopper-commons/pull/42

Learning Plan

* Deepen understanding of DataFusion’s Rust internals.
* Study PyO3 for improving Python-Rust interoperability.
* Collaborate with the community to identify pain points and improvements.
* Regularly test implementations against real-world datasets.

Mentor & Communication

* Preferred Communication Channels: Slack, GitHub Discussions, Email
* Weekly Progress Updates Plan:
  * Submit weekly reports on GitHub.
  * Engage in mentor check-ins for feedback and improvements.
  * Share learnings and challenges with the open-source community.

Additional Information

I am deeply passionate about data engineering and analytics, and I believe this project will allow me to contribute meaningfully to Apache DataFusion while honing my expertise in Python-Rust interoperability. My previous experience in API development, data pipelines, and open-source contributions equips me to tackle this project successfully.
I look forward to working with the community to make DataFusion’s Python bindings more powerful and user-friendly!
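
As a rough illustration of the "high-level abstractions" mentioned in the Synopsis, the sketch below shows one possible shape for a Pandas/Polars-style chaining layer over a lower-level SQL interface. It is not DataFusion's actual API: the `LazyFrame` class, its methods, and the `trips` table are hypothetical names chosen for this example, and it uses only the Python standard library.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical high-level wrapper; NOT part of DataFusion's API."""

    table: str
    columns: Tuple[str, ...] = ("*",)
    predicates: Tuple[str, ...] = ()
    row_limit: Optional[int] = None

    def select(self, *columns: str) -> "LazyFrame":
        # Return a new frame with the chosen projection (immutable chaining).
        return replace(self, columns=columns)

    def filter(self, predicate: str) -> "LazyFrame":
        # Accumulate predicates; they are AND-ed together in to_sql().
        return replace(self, predicates=self.predicates + (predicate,))

    def head(self, n: int) -> "LazyFrame":
        # Record a row limit without executing anything yet.
        return replace(self, row_limit=n)

    def to_sql(self) -> str:
        # Compile the recorded operations into one SQL string that a
        # lower-level engine interface could then execute.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        if self.row_limit is not None:
            sql += f" LIMIT {self.row_limit}"
        return sql


query = (
    LazyFrame("trips")
    .filter("distance > 10")
    .select("vendor", "fare")
    .head(5)
)
print(query.to_sql())
# prints: SELECT vendor, fare FROM trips WHERE (distance > 10) LIMIT 5
```

The frozen dataclass keeps each step side-effect free, so intermediate frames can be reused safely. A real implementation would build on the engine's own expression types rather than raw strings, which this sketch uses only for brevity.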