Hi folks,

As some of you may have noticed, we are accumulating a mountain of
Parquet-related JIRA issues, many of them resulting from people using
Apache Arrow to do data engineering in Python and running into
problems.

To give us better visibility into all the relevant Parquet issues, and
with the monorepo merge behind us, I created a couple of wiki pages
linked from the main
https://cwiki.apache.org/confluence/display/ARROW page:

* C++ issue dashboard: https://cwiki.apache.org/confluence/x/fpWzBQ
* Python issue dashboard:
https://cwiki.apache.org/confluence/display/ARROW/Python+Parquet+Development

Many Parquet issues in the ARROW project do not show up in these
dashboards because they lack the "parquet" label. Please help with
project organization by remembering to apply the "parquet" label to
any Parquet-related issue.

Since Ruby also supports Parquet now via GLib, and R support for
Parquet is coming soon, we need to do what we can to grow the
community of people working on the core Parquet libraries and the
things they depend on, like the IO and memory management subsystems of
the Arrow C++ libraries.

In general, I think it is very important for us to have fast and
reliable C++ support (and language bindings) for the 5 major file
formats in use in data warehousing:

* CSV
* JSON
* Parquet
* Avro
* ORC

Antoine has been leading efforts on reading CSV files, and we will
need to make a push into JSON and Avro at some point.

Thanks
Wes
