hi folks,

We've been struggling for quite some time with the development workflow between the Arrow and Parquet C++ (and Python) codebases.

To explain the root issues:

* parquet-cpp depends on "platform code" in Apache Arrow; this includes file interfaces, memory management, miscellaneous algorithms (e.g. dictionary encoding), etc. Note that before this "platform" dependency was introduced, there was significant duplicated code between these codebases and incompatible abstract interfaces for things like files

* we maintain Arrow conversion code in parquet-cpp for converting between the Arrow columnar memory format and Parquet

* we maintain Python bindings for parquet-cpp + Arrow interop in Apache Arrow. This introduces a circular dependency into our CI.

* Substantial portions of our CMake build system and related tooling are duplicated between the Arrow and Parquet repos

* API changes cause awkward release coordination issues between Arrow and Parquet

I believe the best way to remedy the situation is to adopt a "Community over Code" approach and find a way for the Parquet and Arrow C++ development communities to operate out of the same code repository, i.e. the apache/arrow git repository.

This would bring major benefits:

* Shared CMake build infrastructure, developer tools, and CI infrastructure (Parquet is already being built as a dependency in Arrow's CI systems)

* Shared packaging and release management infrastructure

* Reduce / eliminate problems due to API changes (where we currently introduce breakage into our CI workflow when there is a breaking / incompatible change)

* Arrow releases would include a coordinated snapshot of the Parquet implementation as it stands

Continuing with the status quo has become unsatisfactory to me and as a result I've become less motivated to work on the parquet-cpp codebase.

The only Parquet C++ committer who is not an Arrow committer is Deepak Majeti. I think the issue of commit privileges could be resolved without too much difficulty or time.
I also think that, if it is deemed truly necessary, the Apache Parquet community could create release scripts to cut a minimal versioned Apache Parquet C++ release.

I know that some people are wary of monorepos and megaprojects, but as an example TensorFlow is at least 10 times as large a project in terms of LOCs and number of different platform components, and it seems to be getting along just fine. I think we should be able to work together as a community to function just as well.

Interested in the opinions of others, and any other ideas for practical solutions to the above problems.

Thanks,
Wes