I believe we would benefit from modified language to describe the
nature and scope of the Arrow project.

Currently, our GitHub project description (and what we use in release
announcements) states:

"Apache Arrow is a columnar in-memory analytics layer designed to
accelerate big data. It houses a set of canonical in-memory
representations of flat and hierarchical data along with multiple
language-bindings for structure manipulation. It also provides IPC and
common algorithm implementations."

I think this could perhaps be restated as follows:

"Apache Arrow is a cross-language development platform for in-memory
structured data access and analytics. It specifies a standardized
language-independent columnar memory format for flat and hierarchical
data, with support for zero-copy streaming messaging and interprocess
communication. It also provides computational libraries for efficient
in-memory analytics on modern hardware."

It is true that we have been mostly focused on hardening the details
of the Arrow format and related issues around messaging and IPC, which
are necessary for everything else we may contemplate building in the
future. Since I plan to build a library of computational tools
in C++ for the native code community (Python, Ruby, R, etc.), I think
it would be a good idea to state clearly that building general-purpose
analytics implementations (i.e., the sorts of things you find in "data
frame libraries" like pandas) is part of the mission of the project.

Feedback on the above would be appreciated, as would thoughts on how we
could do a better job of representing our past, present, and future
community goals.

Thanks
Wes
